p, and p in a similar way. Note that, for
any choice of p, the Pos array is also ordered according to p, because u < v implies u p v. All suffixes
that have equal p-prefixes, for some p < N, must appear in consecutive positions in the Pos array, because
the Pos array is sorted lexicographically. These facts are central to our search algorithm.
Suppose that we wish to find all instances of a string W = w . . .
0 w 1 wP 1 of length P N in A. Let
LW = min ( k : W P APos[k] or k = N ) and RW = max ( k : APos[k] P W or k = 1 ). Since Pos is in
. . .
P-order, it follows that W matches a i a i + 1 ai +P 1 if and only if i = Pos[k] for some k [LW , RW ].
Thus, if LW and RW can be found quickly, then the number of matches is RW LW + 1 and their left end-
points are given by Pos[LW ] , Pos[LW + 1] , ... Pos[RW ]. But Pos is in P-order, hence a simple binary
search can find LW and RW using O(log N) comparisons of strings of size at most P; each such comparison
requires O(P) single-symbol comparisons. Thus, the Pos array allows us to find all instances of a string in
A in time O(P log N). The algorithm is given in Fig. 1.
The algorithm in Fig. 1 is very simple, but its running time can be improved. We show next that the
p-comparisons involved in the binary search need not be started from scratch in each iteration of the
while loop. We can use information obtained from one comparison to speedup the ensuing comparisons.
When this strategy is coupled with some additional precomputed information, the search is improved to
P + log2 (N 1) single-symbol comparisons in the worst case, which is a substantial improvement.
3
if W P APos[0] then
LW 0
else if W >P APos[N 1] then
LW N
else
{ (L, R) (0, N 1)
while R L > 1 do
{ M (L + R)/2
if W P APos[M] then
R M
else L M
}
LW R
}
Figure 1: An O(Plog N) search for LW.
Let lcp(v, w) be the length of the longest common prefix of v and w. When we lexicographically
compare v and w in a left-to-right scan that ends at the first unequal symbol we obtain lcp(v, w) as a bypro-
duct. We can modify the binary search in Fig. 1 by maintaining two variables, l and r, such that
l = lcp(APos[L] , W), and r = lcp(W, APos[R] ). Initially, l is set by the comparison of W and APos[0] in line
1, and r is set in the comparison against APos[N 1] in line 3. Thereafter, each comparison of W against
APos[M] in line 9, permits l or r to be appropriately updated in line 10 or 12, respectively. By so maintain-
ing l and r, h = min(l, r) single-symbol comparisons can be saved when comparing APos[M] to W, because
APos[L] =l W =r APos[R] implies APos[k] =h W for all k in [L, R] including M. While this reduces the
number of single-symbol comparisons needed to determine the P-order of a midpoint with respect to W, it
turns out that the worst case running time is still O(P log N) (e.g., searching ac P 1
N 2 b for c b).
To reduce the number of single-symbol comparisons to P + log2 (N 1) in the worst case, we
use precomputed information about the lcps of APos[M] with each of APos[L] and APos[R]. Consider the set
of all triples (L, M, R) that can arise in the inner loop of the binary search of Fig. 1. There are exactly
N 2 such triples, each with a unique midpoint M [1, N 2], and for each triple
0 L < M < R N 1. Suppose that (LM , M, RM ) is the unique triple containing midpoint M. Let Llcp
be an array of size N 2 such that Llcp[M] = lcp(APos[L , A ), and Let Rlcp be another array of
M ] Pos[M]
size N 2 such that Rlcp[M] = lcp(APos[M] , APos[R ). The construction of the two (N 2)-element
M ]
arrays, Llcp and Rlcp, can be interwoven with the sort producing Pos and will be shown in Section 4. For
now, we assume that the Llcp and Rlcp arrays have been precomputed.
Consider an iteration of the search loop for triple (L, M, R). Let h = max(l, r) and let h be the
difference between the value of h at the beginning and at the end of the iteration. Assuming, without loss
of generality, that r l = h, there are three cases to consider4, based on whether Llcp[M] is greater than,
equal to, or less than h. The cases are illustrated in Fig. 2(a), 2(b), and 2(c), respectively. The vertical bars
denote the lcps between W and the suffixes in the Pos array (except for l and r, these lcps are not known at
4 The first two cases can be combined in the program. We use three cases only for description purposes.
4
the time we consider M). The shaded areas illustrate Llcp[M]. For each case, we must determine whether
LW is in the right half or the left half (the binary search step) and we must update the value of either l or r.
It turns out that both these steps are easy to make:
Case 1: Llcp[M] > l (Fig. 2(a))
in this case, APos[M] =l +1 APos[L] l +1 W, and so W must be in the right half and l is unchanged.
Case 2: Llcp[M] = l (Fig. 2(b))
in this case, we know that the first l symbols of Pos[M] and W are equal; thus, we need to compare
only the l + 1st symbol, l + 2nd symbol, and so on, until we find one, say l + j, such that
W l +j Pos[M]. The l + jth symbol determines whether LW is in the right or left side. In either
case, we also know the new value of r or l - it is l + j. Since l = h at the beginning of the loop, this
step takes h + 1 single-symbol comparisons.
Case 3: Llcp[M] < l (Fig. 2(c))
in this case, since W matched l symbols of L and < l symbols of M, it is clear that LW is in the left
side and that the new value of r is Llcp[M].
Hence, the use of the arrays Llcp and Rlcp (the Rlcp array is used when l < r) reduces the number of
single-symbol comparisons to no more than h + 1 for each iteration. Summing over all iterations and
observing that h P, the total number of single-symbol comparisons made in an on-line string search
is at most P + log2 (N 1) , and O(P + log N) time is taken in the worst-case. The precise search
algorithm is given in Fig. 3.
3. Sorting
The sorting is done in log2 (N + 1) stages. In the first stage, the suffixes are put in buckets according to
their first symbol. Then, inductively, each stage further partitions the buckets by sorting according to twice
the number of symbols. For simplicity of notation, we number the stages 1, 2, 4, 8, etc., to indicate the
number of affected symbols. Thus, in the Hth stage, the suffixes are sorted according to the H-order. For
simplicity, we pad the suffixes by adding blank symbols, such that the lengths of all of them become
N + 1. This padding is not necessary, but it simplifies the discussion; the version of the algorithm detailed
in Fig. 4 does not make this assumption. The first stage consists of a bucket sort according to the first
l r l r l r
L M R L M R L M R
(a) (b) (c)
Figure 2: The three cases of the O(P + log N) search.
5
l lcp(APos[0] , W)
r lcp(APos[N 1] , W)
if l = P or wl aPos[0] +l then
LW 0
else if r < P or wr aPos[N 1] +r then
LW N
else
{ (L, R) (0, N 1)
while R L > 1 do
{ M (L + R)/2
if l r then
if Lcp[M] l then
m l + lcp(APos[M]+l , Wl )
else m Lcp[M]
else if Rcp[M] r then
m r + lcp(APos[M]+r , Wr )
else m Rcp[M]
if m = P or wm aPos[M]+m then
(R, r) (M, m)
else (L, l) (M, m)
}
LW R
}
Figure 3: An O(P + log N) search for LW.
symbol of each suffix. The result of this sort is stored in the Pos array and in another array BH of Boolean
values which demarcates the partitioning of the suffixes into m1 buckets (m1 | |); each bucket holds
the suffixes with the same first symbol. The array Pos will become progressively sorted as the algorithm
proceeds. Assume that after the Hth stage the suffixes are partitioned into mH buckets, each holding
suffixes with the same H first symbols, and that these buckets are sorted according to the H-relation. We
will show how to sort the elements in each H-bucket to produce the 2H-order in O(N) time. Our sorting
algorithm uses similar ideas to those in [KMR72].
Let A th
i and A j be two suffixes belonging to the same bucket after the H step; that is, Ai =H Aj. We
need to compare Ai and Aj according to the next H symbols. But, the next H symbols of Ai (Aj) are
exactly the first H symbols of Ai +H (Aj+H). By the assumption, we already know the relative order,
according to the H-relation, of Ai +H and Aj+H. It remains to see how we can use that knowledge to com-
plete the stage efficiently. We first describe the main idea, and then show how to implement it efficiently.
We start with the first bucket, which must contain the smallest suffixes according to the H-relation.
Let Ai be the first suffix in the first bucket (i.e., Pos[0] = i), and consider Ai H (if i H < 0, then we
ignore Ai and take the suffix of Pos[1], and so on). Since Ai starts with the smallest H-symbol string,
Ai H should be the first in its 2H-bucket. Thus, we move Ai H to the beginning of its bucket and mark
this fact. For every bucket, we need to know the number of suffixes in that bucket that have already been
moved and thus placed in 2H-order. The algorithm basically scans the suffixes as they appear in the H-
6
order, and for each Ai it moves Ai H (if it exists) to the next available place in its H-bucket. While this
basic idea is simple, its efficient implementation (in terms of both space and time) is not trivial. We
describe it below.
We maintain three integers arrays, Pos, Prm, and Count, and two boolean arrays, BH and B2H, all
with N elements5. At the start of stage H, Pos[i] contains the start position of the ith smallest suffix
(according to the first H symbols), Prm[i] is the inverse of Pos, namely, Prm[Pos[i]] = i, and BH[i] is 1
iff Pos[i] contains the leftmost suffix of an H-bucket (i.e., APos[i] H APos[i 1]). Count and B2H are tem-
porary arrays; their use will become apparent in the description of a stage of the sort. A radix sort on the
first symbol of each suffix is easily tailored to produce Pos, Prm, and BH for stage 1 in O(N) time.
Assume that Pos, Prm, and BH have the correct values after stage H, and consider stage 2H.
We first reset Prm[i] to point to the leftmost cell of the H-bucket containing the ith suffix rather than
to the suffix's precise place in the bucket. We also initialize Count[i] to 0 for all i. All operations above
can be done in O(N) time. We then scan the Pos array in increasing order, one bucket at a time. Let l and
r (l r) mark the left and right boundary of the H-bucket currently being scanned. Let Ti (the H left exten-
sion of i) denote Pos[i] H. For every i, l i r, we increment Count[Prm[Ti ]], set Prm[Ti ] =
Prm[T st
i ] + Count[Prm[T i ]] 1, and set B2H[Prm[Ti ]] to 1. In effect, all the suffixes whose H + 1
through 2Hth symbols equal the unique H-prefix of the current H-bucket are moved to the top of their H-
buckets with respect to the Prm array (Pos is updated momentarily). The B2H field is used to mark those
prefixes that were moved. Before the next H-bucket is considered, we make another pass through this one,
find all the moved suffixes, and reset the B2H fields such that only the leftmost of them in each 2H-bucket
is set to 1, and the rest are reset to 0. This way, the B2H fields correctly mark the beginning of the 2H-
buckets. Thus the scan updates Prm and sets B2H so that they are consistent with the 2H-order of the
suffixes. In the final step, we update the Pos array (which is the inverse of Prm), and set BH to B2H. All
the steps above can clearly be done in O(N) time, and, since there are at most log2 (N + 1) stages, the
sorting requires O(N log N) time in the worst case. A pseudo-code implementation is given in Fig. 4.
Average-case analysis is presented in Section 5.
5 We present a conceptually simpler but more space expensive algorithm above in order to clearly expound the idea behind the sort. In
fact, two N-element integer arrays are sufficient, and since the integers are always positive we can use their sign bit for the boolean
values. Thus, the space requirement is only two integers per symbol. The trick is to remove the Count array by temporarily using the
Prm value of the leftmost suffix in a bucket to hold the count for the bucket. Instead of initializing Count to 0 in the first step, we turn
off the BH field of every bucket so that we can tell when a Prm value is the first reference to a particular bucket. The second step again
consists of a scan of the Pos array. If BH[Prm[Ti ]] is off then Ti is to be the first suffix in the 2H order of its H-bucket. We search
the bucket for it and actually place it at the head of its bucket (as opposed to just modifying Prm to reflect where it will go in the
simpler algorithm). This allows us to then use Prm[Ti ] as the counter for the bucket because we can restore it later knowing that the
Prm-value of the first suffix in each bucket is being so used. We thus set the BH field back on and set Prm[Ti ] to 1. For later refer-
ences to this bucket, which we know because BH[Prm[Ti ]] is now on, we simply adjust Prm[Ti ] with the count in
Prm[Pos[Prm[Ti ]]] and bump the count. At the end of this step the Prm fields used as counters are reset to the position of their
suffix. The B2H fields could not be set in the preceding steps because the BH values were being used as counter flags. In a separate
pass, the B2H values are updated to identify 2H buckets as in the simple algorithm.
7
# Sorting with 3N positive integers and
2N booleans. `Count' can be eliminated # Inductive sorting stage (with lcp info calls) #
and booleans folded into sign bits, to
produce a 2N integer sort. # for H 1, 2, 4, 8, . . . while H < N do
{ for each =H bucket [l, r] do
var Pos, Prm, Count: array [0..N 1] of int; { Count[l] 0
BH, B2H: array [0..N] of boolean; for c [l, r] do
Prm[Pos[c]] l
}
# First phase bucket sort, Bucket and d N H
Link overlay Pos and Prm, respectively # e Prm[d]
Prm[d] e + Count[e]
for a do Count[e] Count[e] + 1
Bucket[a] 1 B2H[Prm[d]] true
for i 0, 1, 2, . . . , N 1 do for each =H bucket [l, r] in H-order do
( Bucket[ai ] , Link[i] ) ( i, Bucket[ai ] ) { for d { Pos[c] H : c [l, r] } [0, N 1] do
c 0 { e Prm[d]
for a in order do Prm[d] e + Count[e]
{ i Bucket[a] Count[e] Count[e] + 1
while i 1 do B2H[Prm[d]] true
{ j Link[i] }
Prm[i] c for d { Pos[c] H : c [l, r] } [0, N 1] do
if i = Bucket[a] then if B2H[Prm[d]] then
{ BH[c] true { e min ( j : j > Prm[d] and
Set(c,0) # lcp info call # (BH[ j] or not B2H[ j]) )
} for f [Prm[d] + 1, e 1] do
else B2H[ f ] false
BH[c] false }
c c + 1 }
i j for i [0, N 1] do
} Pos[Prm[i]] i
} for i [0, N 1] do
BH[N] true if B2H[i] and not BH[i] then
for i [0, N 1] do { Set( i, H + Min_Height( Prm[Pos[i 1] + H] ,
Pos[Prm[i]] i Prm[Pos[i] + H]) )
BH[i] B2H[i]
}
}
Figure 4: The O(Nlog N) suffix sorting algorithm.
8
4. Finding Longest Common Prefixes
The O(P + log N) search algorithm requires precomputed information about the lcps between the suffixes
starting at each midpoint M and its left and right boundaries LM and RM. Computing a suffix array requires
2N integers and we will see here that computing and recording the associated lcp information requires an
extra N integers. We first show how to compute the lcps between suffixes that are consecutive in the sorted
Pos array during the sort. We will see later how to compute all the necessary lcps. The key idea is the fol-
lowing. Assume that after stage H of the sort we know the lcps between suffixes in adjacent buckets (after
the first stage, the lcps between suffixes in adjacent buckets are 0). At stage 2H the buckets are partitioned
according to 2H symbols. Thus, the lcps between suffixes in newly adjacent buckets must be at least H and
at most 2H 1. Furthermore, if Ap and Aq are in the same H-bucket but are in distinct 2H-buckets, then
lcp(Ap , Aq ) = H + lcp(Ap +H , Aq +H ). (4.1)
Moreover, we know that lcp(Ap +H , Aq +H ) < H. The problem is that we only have the lcps between
suffixes in adjacent buckets, and Ap +H and Aq +H may not be in adjacent buckets. However, if APos[i] and
APos[ j] where i < j have an lcp less than H and Pos is in H order, then their lcp is the minimum of the
lcp's of every adjacent pair of suffixes between Pos[i] and Pos[ j]. That is,
lcp(APos[i] , APos[ j] ) = min ( lcp(APos[k] , APos[k +1] )) (4.2)
k [i, j 1]
Using (4.2) directly would require too much time, and maintaining the lcp of every pair of suffixes too
much space. By using an O(N)-space height balanced tree structure that records the minimum pairwise lcp
over a collection of intervals of the suffix array, we will be able to determine the lcp between any two
suffixes in O(log N) time. We will describe this data structure, which we call an interval tree, after we
firmly establish our basic approach (interval trees are similar to the Cartesian trees of Vuillemin [Vui80]).
We define height(i) = lcp(APos[i 1] , APos[i] ), 1 i N 1, where Pos is the final sorted order of
the suffixes. These N 1 height values are computed in an array Hgt[i]. The computation is performed
inductively, together with the sort, such that Hgt[i] achieves its correct value at stage H iff height(i) < H,
and it is undefined (specifically, N + 1) otherwise. Formally, if height(i) < H then Hgt[i] = height(i);
otherwise Hgt[i] = N + 1. Notice that, if height(i) < H, then APos[i 1] and APos[i] must be in different H-
buckets since H-buckets contain suffixes with the same H-symbol prefix. Further observe that a Hgt value
is initially N + 1, it is set to its height value during the appropriate stage of the sort, and it retains this value
thereafter.
Let PosH, HgtH, and PrmH be the values of the given arrays at the end of stage H. In stage 2H of
the sort, the 2H
2H-ordered list Pos is produced by sorting the suffixes in each H-bucket of the H-ordered
list PosH. The following lemma captures the essence of how we compute Hgt2H from HgtH given Pos2H
and Prm2H.
Lemma 1: If H height(i) < 2H then
height(i) = H + min ( HgtH [k] : k [min(a, b) + 1, max(a, b)] ),
where a = Prm2H [ Pos2H [i 1] + H ], and b = Prm2H [ Pos2H [i] + H ].
Proof: Let p = Pos2H [i 1] and q = Pos2H [i]. As we have observed, height(i) < 2H implies height(i) =
H + lcp(A 2H 2H
p + H , A q + H ). Next observe that Pos [a] = p + H and Pos [b] = q + H by the choice of a
and b. Without loss of generality, assume that a < b. We now know that height(i) = H + lcp(u, v) where
9
u = APos2H[a], v = APos2H[b], lcp(u, v) < H, and u