Circular String Linearization

Sedefcho · Post by **Sedefcho** » Sun Feb 08, 2009 2:56 pm

This is a problem mentioned in Dan Gusfied 'Algorithms on Strings Tress and Sequences'
(page 148 - problem is named APL12). There the author says it could be easily solved
if we build a suffix tree.

Glass Beads - UVA Problem 719
http://icpcres.ecs.baylor.edu/onlinejud ... 7/719.html
is exactly the same problem.

Still I don't think suffix tree is possible for this problem just because the
input string can be up to 10,000 chars. So it would be too much memory
to build a suffix tree for such a string (I think).

My first question is: what do you think about that?
Would you agree that a suffix tree there might take
too much memory?

Of course the same question applies if we decide to use Trie - which
is basically a less optimized version of the Suffix Tree
(can be regarded as such, at least, in a sense).

Second thing I want to ask is the following.

There are almost no hints on problem 719 in this forum.
The only three notes which I see are here on the following location.
http://acm.uva.es/board/viewtopic.php?f ... 122ce2a20f

They are basically
* a suggestion to use suffix tree (which I think will not
even work due to memory constraints)
* a suggestion (from Adrian Kuegel) which says
"You don't need any advanced data structure for this
problem. Hint: think of something similar to BFS."
* a statement: this problem can be solved in O(n)

I thought for quite some time but can't find a way yet to
transform this problem into something similar to BFS. I mean
I see a direct transformation but then the worst-case complexity
will still be O(N^2) /in the way I think about it/.

So my second question is: can somebody provide
a slightly greater hint on this problem (I don't
want a spoiler just a slightly greater hint)?

You may send me a personal
message if you prefer so.

Many thanks in advance.

mf · Post by mf » Sun Feb 08, 2009 3:28 pm

Sedefcho wrote:Still I don't think suffix tree is possible for this problem just because the
input string can be up to 10,000 chars. So it would be too much memory
to build a suffix tree for such a string (I think).

Suffix trees take about 5-10 words per each character in input, so a suffix tree for 10000 characters would need less than half a megabyte - a very modest amount of memory today.

And for this problem, I think, suffix arrays are the way to go in contests -- I have a very simple O(n log^2 n) implementation in only 20 lines of C++, and it needs only 16 bytes per char.

There's also O(n) algorithm (I think it's called Duval's algorithm), based on theory of Lyndon words and prime strings, but I didn't look deeply into it.

maxdiver · Post by **maxdiver** » Mon Feb 09, 2009 11:03 pm

There are lots of approaches to solve this problem.

First of all, hashes. Indeed, we can compute LCP (longest common prefix) of any two suffixes S[i..] and S[j..] of given string (we can do this be binary search in O(log N), N = S.length). After that we can compare lexicographically any two suffixes S[i..] and S[j..] in O(log N) (we should find LCP and compare s[i+LCP] and s[j+LCP]). Finally, for given string S we can find it's minimal cyclic shift by taking string T=S+S and comparing all suffixes 0..N-1, so O(N logN) with quite easy algorithm.
Sample implementation:

Code: Select all

long long get_hash (const vector<long long> & h, const vector<long long> & pw, int i, int j) {
	return ((j>=0 ? h[j] : 0) - (i ? h[i-1] : 0)) * pw[h.length()-i];
}

bool smaller (const string & s, int n, const vector<long long> & h, const vector<long long> & pw, int i, int j) {
	int l = 0,  r = n;
	while (l < r) {
		int m = (l + r + 1) >> 1;
		if (get_hash (h, pw, i, i+m-1) == get_hash (h, pw, j, j+m-1))
			l = m;
		else
			r = m-1;
	}
	return s[i+l] < s[j+l];
}

Second, Duval. For given string S of length N it finds in O(N) time and O(1) memory its Lyndon decomposition, i.e. decomposition into substrings S1,S2,...,Sk such that S1+...+Sk = S and S1>=S2>=...>=Sk and each Si is a simple string (every its suffix is lexicographically strictly smaller than Si itself). Sounds very hard, but the algorithms is very easy. It's almost a greed algorithm, but with some back-tracking, and nevertheless it works in linear time (more precisely, no more than 4N-3 char comparisons).
Implementation of Duval:

Code: Select all

string s;
int n = (int) s.length();
int i=0;
while (i < n) {
	int j=i+1, k=i;
	while (j < n && s[k] <= s[j]) {
		if (s[k] < s[j])
			k = i;
		else
			++k;
		++j;
	}
	while (i <= k) {
		cout << s.substr (i, j-k) << ' ';
		i += j - k;
	}
}

If you are interested, I can describe here Duval's algorithm and how it works.
Then, if we know Lyndon decomposition for a string S+S, then we can find minimal cyclic shift as a start position of last block of same simple substrings, placed before N:

Code: Select all

string min_cyclic_shift (string s) {
	s += s;
	int n = (int) s.length();
	int i=0, ans=0;
	while (i < n/2) {
		ans = i;
		int j=i+1, k=i;
		while (j < n && s[k] <= s[j]) {
			if (s[k] < s[j])
				k = i;
			else
				++k;
			++j;
		}
		while (i <= k)  i += j - k;
	}
	return s.substr (ans, n/2);
}

Third, there are some approaches to minimal cyclic shift, connected with prefix-function or z-function. They should be also very simple, but unfortunately I don't know them

Sedefcho · Post by **Sedefcho** » Tue Feb 10, 2009 10:35 am

Thanks for these detailed explanations & references, I will try to
have a look and hopefully understand them (or at least some of them).

Thanks a lot.

tryit1 · Post by **tryit1** » Tue Feb 10, 2009 9:29 pm

And for this problem, I think, suffix arrays are the way to go in contests -- I have a very simple O(n log^2 n) implementation in only 20 lines of C++, and it needs only 16 bytes per char.

Can you please post it here ?

mf · Post by mf » Tue Feb 10, 2009 10:06 pm

Yes, here it is:

Code: Select all

// Suffix array in $O(n \log^2 n)$ time, 16 bytes/char overhead.
// Input: text, N
// Output:
//   sa[] is a sorted list of offsets to all non-empty suffixes,
//   lcp[i] = length of longest common prefix of text+sa[i] and text+sa[i+1]
char text[MAX]; long long key[MAX]; int N, sa[MAX], rank[MAX], *lcp=(int*)key;
struct Cmp { bool operator()(int i, int j) const { return key[i]<key[j]; } };

void build() {
    for (int i = 0; i < N; i++) { sa[i] = i; key[i] = text[i]; }
    sort(sa, sa+N, Cmp());
    for (int K = 1; ; K *= 2) {
        // invariant: suffixes are now sorted by their first K characters
        for (int i = 0; i < N; i++)
            rank[sa[i]] = i>0 && key[sa[i-1]]==key[sa[i]] ? rank[sa[i-1]] : i;
        if (K >= N) break;

        // sort suffixes by key (rank of this suffix, rank of suffix K chars ahead)
        for (int i = 0; i < N; i++)
            key[i] = rank[i] * (N+1LL) + (i+K < N ? rank[i+K]+1 : 0);
        sort(sa, sa+N, Cmp());
        // now they're sorted by their first 2K characters
    }

    // Kasai's algorithm for LCP array:
    for (int i = 0, k = 0; i < N; i++) {
        if (k > 0) k--;
        if (rank[i] == N-1) { lcp[N-1] = -1; k = 0; continue; }
        int j = sa[rank[i]+1];
        while (text[i+k] == text[j+k]) k++;
        lcp[rank[i]] = k;
    }
}

I wrote it for my ICPC team's reference notebook, not specifically for this UVa's problem. And Sedefcho already told me that, unfortunately, this code slightly timeouts on it :(

Also, check this thread on TC forums for another implementation of this same algorithm -
http://forums.topcoder.com/?module=Thre ... 79&start=0

tryit1 · Post by **tryit1** » Tue Feb 10, 2009 10:47 pm

mf wrote:Y
I wrote it for my ICPC team's reference notebook

thanks,
If they are commented like above, do you have them online somewhere or can you send the rar to tryitn1@gmail.com. Would certainly be useful.

mf · Post by mf » Tue Feb 10, 2009 10:53 pm

I've put my old ACM ICPC notebook online here.
But the codes there have a lot less comments, though.

Edited on 2010/10/02: updated link.

tryit1 · Post by **tryit1** » Tue Jul 28, 2009 12:56 pm

[quote="mf"]Yes, here it is:

Code: Select all

// Suffix array in $O(n \log^2 n)$ time, 16 bytes/char overhead.
// Input: text, N
// Output:
//   sa[] is a sorted list of offsets to all non-empty suffixes,
//   lcp[i] = length of longest common prefix of text+sa[i] and text+sa[i+1]
char text[MAX]; long long key[MAX]; int N, sa[MAX], rank[MAX], *lcp=(int*)key;
struct Cmp { bool operator()(int i, int j) const { return key[i]<key[j]; } };

void build() {
    for (int i = 0; i < N; i++) { sa[i] = i; key[i] = text[i]; }
    sort(sa, sa+N, Cmp());
    for (int K = 1; ; K *= 2) {
        // invariant: suffixes are now sorted by their first K characters
        for (int i = 0; i < N; i++)
            rank[sa[i]] = i>0 && key[sa[i-1]]==key[sa[i]] ? rank[sa[i-1]] : i;
        if (K >= N) break;

        // sort suffixes by key (rank of this suffix, rank of suffix K chars ahead)
        for (int i = 0; i < N; i++)
            key[i] = rank[i] * (N+1LL) + (i+K < N ? rank[i+K]+1 : 0);
        sort(sa, sa+N, Cmp());
        // now they're sorted by their first 2K characters
    }

    // Kasai's algorithm for LCP array:
    for (int i = 0, k = 0; i < N; i++) {
        if (k > 0) k--;
        if (rank[i] == N-1) { lcp[N-1] = -1; k = 0; continue; }
        int j = sa[rank[i]+1];
        while (text[i+k] == text[j+k]) k++;
        lcp[rank[i]] = k;
    }
}

The above algorithm assumes that LCP of 2 consecutive suffixes j,j+1 (none of them rank ==N-1) has atleast LCP of ( j-1,j ) -1 in common . Why is this true ?

mf · Post by mf » Tue Jul 28, 2009 2:20 pm

The above algorithm assumes that LCP of 2 consecutive suffixes j,j+1 (none of them rank ==N-1) has atleast LCP of ( j-1,j ) -1 in common . Why is this true ?

That's not true.

What's a true fact here is a bit more complex to formulate. But just for example, take the string "abcdabcz". The first suffix is "abcdabcz", its LCP is 3, because the next lexicographically larger suffix (next suffix in suffix array) is "abcz". And so we can deduce that LCP of 2nd suffix - "bcdabcz" is at least 2, because there's the suffix "bcz" (just drop the first letter of "abcz").

And I already told you who the author of the algorithm is, why didn't you just google for the paper?

mogers · Post by **mogers** » Mon Feb 01, 2010 3:26 am

mf wrote:I've put my old ACM ICPC notebook online here.
But the codes there have a lot less comments, though.

The link doesn't seem to work anymore. Could you please share it again? Or send me by e-mail to mr_mogers@sapo.pt

Thank you

mf · Post by mf » Sat Oct 02, 2010 5:39 pm

Sorry for delay.
I've put most of my ACM ICPC-related stuff here - http://github.com/infnty/acm/
My old team notebook is at http://github.com/infnty/acm/raw/master ... tebook.pdf

zobayer · Post by **zobayer** » Wed Feb 16, 2011 11:04 am

probably this is the simplest algorithm:

Code: Select all

/*
Finds alphabetically first representation of a cyclic string in O(length)
*/

minimumExpression(s) {
	s = s + s;
	len = length(s), i = 0, j = 1, k = 0;
	while(i + k < len && j + k < len) {
		if(s[i+k] == s[j+k]) k++;
		else if(s[i+k] > s[j+k]) { i = i+k+1; if(i <= j) i = j+1; k = 0; }
		else if(s[i+k] < s[j+k]) { j = j+k+1; if(j <= i) j = i+1; k = 0; }
	}
	return min(i, j);
}

This is called Minimum Expression introduced by Zhou Yuan.

OJ Board

Circular String Linearization

Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization

Re: Circular String Linearization