Circular String Linearization

Let's talk about algorithms!

Moderator: Board moderators

Post Reply
Sedefcho
A great helper
Posts: 374
Joined: Sun Jan 16, 2005 10:18 pm
Location: Bulgaria

Circular String Linearization

Post by Sedefcho »

This is a problem mentioned in Dan Gusfied 'Algorithms on Strings Tress and Sequences'
(page 148 - problem is named APL12). There the author says it could be easily solved
if we build a suffix tree.

Glass Beads - UVA Problem 719
http://icpcres.ecs.baylor.edu/onlinejud ... 7/719.html
is exactly the same problem.

Still I don't think suffix tree is possible for this problem just because the
input string can be up to 10,000 chars. So it would be too much memory
to build a suffix tree for such a string (I think).

My first question is: what do you think about that?
Would you agree that a suffix tree there might take
too much memory?

Of course the same question applies if we decide to use Trie - which
is basically a less optimized version of the Suffix Tree
(can be regarded as such, at least, in a sense).

Second thing I want to ask is the following.

There are almost no hints on problem 719 in this forum.
The only three notes which I see are here on the following location.
http://acm.uva.es/board/viewtopic.php?f ... 122ce2a20f

They are basically
* a suggestion to use suffix tree (which I think will not
even work due to memory constraints)
* a suggestion (from Adrian Kuegel) which says
"You don't need any advanced data structure for this
problem. Hint: think of something similar to BFS."
* a statement: this problem can be solved in O(n)

I thought for quite some time but can't find a way yet to
transform this problem into something similar to BFS. I mean
I see a direct transformation but then the worst-case complexity
will still be O(N^2) /in the way I think about it/.

So my second question is: can somebody provide
a slightly greater hint on this problem (I don't
want a spoiler just a slightly greater hint)?

You may send me a personal
message if you prefer so.

Many thanks in advance.
Last edited by Sedefcho on Tue Feb 10, 2009 2:18 pm, edited 1 time in total.

mf
Guru
Posts: 1244
Joined: Mon Feb 28, 2005 4:51 am
Location: Zürich, Switzerland
Contact:

Re: Circular String Linearization

Post by mf »

Sedefcho wrote:Still I don't think suffix tree is possible for this problem just because the
input string can be up to 10,000 chars. So it would be too much memory
to build a suffix tree for such a string (I think).
Suffix trees take about 5-10 words per each character in input, so a suffix tree for 10000 characters would need less than half a megabyte - a very modest amount of memory today.

And for this problem, I think, suffix arrays are the way to go in contests -- I have a very simple O(n log^2 n) implementation in only 20 lines of C++, and it needs only 16 bytes per char.

There's also O(n) algorithm (I think it's called Duval's algorithm), based on theory of Lyndon words and prime strings, but I didn't look deeply into it.

maxdiver
Learning poster
Posts: 51
Joined: Tue Sep 04, 2007 2:12 pm
Location: Russia, Saratov
Contact:

Re: Circular String Linearization

Post by maxdiver »

There are lots of approaches to solve this problem.

First of all, hashes. Indeed, we can compute LCP (longest common prefix) of any two suffixes S[i..] and S[j..] of given string (we can do this be binary search in O(log N), N = S.length). After that we can compare lexicographically any two suffixes S[i..] and S[j..] in O(log N) (we should find LCP and compare s[i+LCP] and s[j+LCP]). Finally, for given string S we can find it's minimal cyclic shift by taking string T=S+S and comparing all suffixes 0..N-1, so O(N logN) with quite easy algorithm.
Sample implementation:

Code: Select all

long long get_hash (const vector<long long> & h, const vector<long long> & pw, int i, int j) {
	return ((j>=0 ? h[j] : 0) - (i ? h[i-1] : 0)) * pw[h.length()-i];
}

bool smaller (const string & s, int n, const vector<long long> & h, const vector<long long> & pw, int i, int j) {
	int l = 0,  r = n;
	while (l < r) {
		int m = (l + r + 1) >> 1;
		if (get_hash (h, pw, i, i+m-1) == get_hash (h, pw, j, j+m-1))
			l = m;
		else
			r = m-1;
	}
	return s[i+l] < s[j+l];
}
Second, Duval. For given string S of length N it finds in O(N) time and O(1) memory its Lyndon decomposition, i.e. decomposition into substrings S1,S2,...,Sk such that S1+...+Sk = S and S1>=S2>=...>=Sk and each Si is a simple string (every its suffix is lexicographically strictly smaller than Si itself). Sounds very hard, but the algorithms is very easy. It's almost a greed algorithm, but with some back-tracking, and nevertheless it works in linear time (more precisely, no more than 4N-3 char comparisons).
Implementation of Duval:

Code: Select all

string s;
int n = (int) s.length();
int i=0;
while (i < n) {
	int j=i+1, k=i;
	while (j < n && s[k] <= s[j]) {
		if (s[k] < s[j])
			k = i;
		else
			++k;
		++j;
	}
	while (i <= k) {
		cout << s.substr (i, j-k) << ' ';
		i += j - k;
	}
}
If you are interested, I can describe here Duval's algorithm and how it works.
Then, if we know Lyndon decomposition for a string S+S, then we can find minimal cyclic shift as a start position of last block of same simple substrings, placed before N:

Code: Select all

string min_cyclic_shift (string s) {
	s += s;
	int n = (int) s.length();
	int i=0, ans=0;
	while (i < n/2) {
		ans = i;
		int j=i+1, k=i;
		while (j < n && s[k] <= s[j]) {
			if (s[k] < s[j])
				k = i;
			else
				++k;
			++j;
		}
		while (i <= k)  i += j - k;
	}
	return s.substr (ans, n/2);
}
Third, there are some approaches to minimal cyclic shift, connected with prefix-function or z-function. They should be also very simple, but unfortunately I don't know them :)

Sedefcho
A great helper
Posts: 374
Joined: Sun Jan 16, 2005 10:18 pm
Location: Bulgaria

Re: Circular String Linearization

Post by Sedefcho »

Thanks for these detailed explanations & references, I will try to
have a look and hopefully understand them (or at least some of them).

Thanks a lot.

tryit1
Experienced poster
Posts: 119
Joined: Sun Aug 24, 2008 9:09 pm

Re: Circular String Linearization

Post by tryit1 »

And for this problem, I think, suffix arrays are the way to go in contests -- I have a very simple O(n log^2 n) implementation in only 20 lines of C++, and it needs only 16 bytes per char.
Can you please post it here ?

mf
Guru
Posts: 1244
Joined: Mon Feb 28, 2005 4:51 am
Location: Zürich, Switzerland
Contact:

Re: Circular String Linearization

Post by mf »

Yes, here it is:

Code: Select all

// Suffix array in $O(n \log^2 n)$ time, 16 bytes/char overhead.
// Input: text, N
// Output:
//   sa[] is a sorted list of offsets to all non-empty suffixes,
//   lcp[i] = length of longest common prefix of text+sa[i] and text+sa[i+1]
char text[MAX]; long long key[MAX]; int N, sa[MAX], rank[MAX], *lcp=(int*)key;
struct Cmp { bool operator()(int i, int j) const { return key[i]<key[j]; } };

void build() {
    for (int i = 0; i < N; i++) { sa[i] = i; key[i] = text[i]; }
    sort(sa, sa+N, Cmp());
    for (int K = 1; ; K *= 2) {
        // invariant: suffixes are now sorted by their first K characters
        for (int i = 0; i < N; i++)
            rank[sa[i]] = i>0 && key[sa[i-1]]==key[sa[i]] ? rank[sa[i-1]] : i;
        if (K >= N) break;

        // sort suffixes by key (rank of this suffix, rank of suffix K chars ahead)
        for (int i = 0; i < N; i++)
            key[i] = rank[i] * (N+1LL) + (i+K < N ? rank[i+K]+1 : 0);
        sort(sa, sa+N, Cmp());
        // now they're sorted by their first 2K characters
    }

    // Kasai's algorithm for LCP array:
    for (int i = 0, k = 0; i < N; i++) {
        if (k > 0) k--;
        if (rank[i] == N-1) { lcp[N-1] = -1; k = 0; continue; }
        int j = sa[rank[i]+1];
        while (text[i+k] == text[j+k]) k++;
        lcp[rank[i]] = k;
    }
}
I wrote it for my ICPC team's reference notebook, not specifically for this UVa's problem. And Sedefcho already told me that, unfortunately, this code slightly timeouts on it :(

Also, check this thread on TC forums for another implementation of this same algorithm -
http://forums.topcoder.com/?module=Thre ... 79&start=0

tryit1
Experienced poster
Posts: 119
Joined: Sun Aug 24, 2008 9:09 pm

Re: Circular String Linearization

Post by tryit1 »

mf wrote:Y
I wrote it for my ICPC team's reference notebook
thanks,
If they are commented like above, do you have them online somewhere or can you send the rar to tryitn1@gmail.com. Would certainly be useful.

mf
Guru
Posts: 1244
Joined: Mon Feb 28, 2005 4:51 am
Location: Zürich, Switzerland
Contact:

Re: Circular String Linearization

Post by mf »

I've put my old ACM ICPC notebook online here.
But the codes there have a lot less comments, though.

Edited on 2010/10/02: updated link.

tryit1
Experienced poster
Posts: 119
Joined: Sun Aug 24, 2008 9:09 pm

Re: Circular String Linearization

Post by tryit1 »

[quote="mf"]Yes, here it is:

Code: Select all

// Suffix array in $O(n \log^2 n)$ time, 16 bytes/char overhead.
// Input: text, N
// Output:
//   sa[] is a sorted list of offsets to all non-empty suffixes,
//   lcp[i] = length of longest common prefix of text+sa[i] and text+sa[i+1]
char text[MAX]; long long key[MAX]; int N, sa[MAX], rank[MAX], *lcp=(int*)key;
struct Cmp { bool operator()(int i, int j) const { return key[i]<key[j]; } };

void build() {
    for (int i = 0; i < N; i++) { sa[i] = i; key[i] = text[i]; }
    sort(sa, sa+N, Cmp());
    for (int K = 1; ; K *= 2) {
        // invariant: suffixes are now sorted by their first K characters
        for (int i = 0; i < N; i++)
            rank[sa[i]] = i>0 && key[sa[i-1]]==key[sa[i]] ? rank[sa[i-1]] : i;
        if (K >= N) break;

        // sort suffixes by key (rank of this suffix, rank of suffix K chars ahead)
        for (int i = 0; i < N; i++)
            key[i] = rank[i] * (N+1LL) + (i+K < N ? rank[i+K]+1 : 0);
        sort(sa, sa+N, Cmp());
        // now they're sorted by their first 2K characters
    }

    // Kasai's algorithm for LCP array:
    for (int i = 0, k = 0; i < N; i++) {
        if (k > 0) k--;
        if (rank[i] == N-1) { lcp[N-1] = -1; k = 0; continue; }
        int j = sa[rank[i]+1];
        while (text[i+k] == text[j+k]) k++;
        lcp[rank[i]] = k;
    }
}
The above algorithm assumes that LCP of 2 consecutive suffixes j,j+1 (none of them rank ==N-1) has atleast LCP of ( j-1,j ) -1 in common . Why is this true ?

mf
Guru
Posts: 1244
Joined: Mon Feb 28, 2005 4:51 am
Location: Zürich, Switzerland
Contact:

Re: Circular String Linearization

Post by mf »

The above algorithm assumes that LCP of 2 consecutive suffixes j,j+1 (none of them rank ==N-1) has atleast LCP of ( j-1,j ) -1 in common . Why is this true ?
That's not true.

What's a true fact here is a bit more complex to formulate. But just for example, take the string "abcdabcz". The first suffix is "abcdabcz", its LCP is 3, because the next lexicographically larger suffix (next suffix in suffix array) is "abcz". And so we can deduce that LCP of 2nd suffix - "bcdabcz" is at least 2, because there's the suffix "bcz" (just drop the first letter of "abcz").

And I already told you who the author of the algorithm is, why didn't you just google for the paper?

mogers
New poster
Posts: 29
Joined: Tue May 30, 2006 5:09 pm
Location: Porto, Portugal

Re: Circular String Linearization

Post by mogers »

mf wrote:I've put my old ACM ICPC notebook online here.
But the codes there have a lot less comments, though.
The link doesn't seem to work anymore. Could you please share it again? Or send me by e-mail to mr_mogers@sapo.pt

Thank you

mf
Guru
Posts: 1244
Joined: Mon Feb 28, 2005 4:51 am
Location: Zürich, Switzerland
Contact:

Re: Circular String Linearization

Post by mf »

Sorry for delay.
I've put most of my ACM ICPC-related stuff here - http://github.com/infnty/acm/
My old team notebook is at http://github.com/infnty/acm/raw/master ... tebook.pdf

zobayer
Experienced poster
Posts: 110
Joined: Tue May 06, 2008 2:18 pm
Location: CSE-DU, Bangladesh
Contact:

Re: Circular String Linearization

Post by zobayer »

probably this is the simplest algorithm:

Code: Select all

/*
Finds alphabetically first representation of a cyclic string in O(length)
*/

minimumExpression(s) {
	s = s + s;
	len = length(s), i = 0, j = 1, k = 0;
	while(i + k < len && j + k < len) {
		if(s[i+k] == s[j+k]) k++;
		else if(s[i+k] > s[j+k]) { i = i+k+1; if(i <= j) i = j+1; k = 0; }
		else if(s[i+k] < s[j+k]) { j = j+k+1; if(j <= i) j = i+1; k = 0; }
	}
	return min(i, j);
}
This is called Minimum Expression introduced by Zhou Yuan.
You should not always say what you know, but you should always know what you say.

Post Reply

Return to “Algorithms”