Lucian Ilie

The "runs" conjecture
by M. Crochemore, L. Ilie, and L. Tinta

The conjecture

A run is a maximal (non-extendable) repetition of exponent at least two. Here is an example: the string w=abbababbaba has a run [3..7] with period 2 and exponent 2.5, that is, w [3..7] = babab = (ba) ^2.5.Other runs are [2..3], [7..8],[8..11], [5..10] and [1..11].The result of a two-decade effort in the stringology community to find an algorithm to compute all runs in a string in linear time resulted in the paper of [Kolpakov, Kucherov, FOCS'99] that

(i) used previous techniques of Crochemore, Main, and Lorentz to construct an algorithm that computes all runs in time proportional to the size of the output and
(ii) proved that the maximum number of runs in a string of length n, RUNS(n), is linear, i.e., RUNS(n) <= cn , where c is a constant.

Therefore, the crucial contribution was (ii). However, they could not provide any bound on the constant c.

The conjecture of Kolpakov and Kucherov, supported by computations, is that, for binary alphabets,

RUNS(n) <= n.

Known upper bounds

Rytter, STACS'06: RUNS(n) <= 5 n
Rytter, Inf. and Comput., 2007: RUNS(n) <= 3.44 n
Puglisi, Simpson, and Smyth, submitted 2007: RUNS(n) <= 3.48 n
Crochemore and Ilie, JCSS, 2007: RUNS(n) <= 1.6 n

Current state of art

The idea used by [Crochemore and Ilie, JCSS, 2007] is to count the runs by their centers (the starting position of the second period) as well as count separately microruns (runs with short period).

Macroruns. The following theorem is used to bound the number of runs with period p or larger. A d -run is a run with period between 2 d and 3 d.

Theorem. There is at most 1 center of a d-run on average in each interval of length d.

The Theorem is used as follows. If we put d_i = (p/ 2 )( 3 / 2 )ⁱ, i>=0, then the number of runs with period p or larger is at most n/ d ₀ +n/ d ₁ +n/ d ₂ + .... Therefore

RUNS _>=p (n) <= ( 6 /p)n.

Microruns. The runs with short periods are counted as follows. For a given bound b and maximum period p, the runs are non-uniformly distributed and we try to amortize them. That is, given a position i, we try all possible combinations of periods for the runs with centers at the left of i, until a position i-j is found such that the ratio between the number of centers inside the interval [ i-j..i ]and its length, j+1 , falls below b. Also, at any moment, the number of centers of runs that have both the center and beginning inside the interval should satisfy the corresponding amortizing condition. If successful, this procedure proves that

RUNS _<=p (n) <= bn.

Putting these two bounds together for p = 9 and b = 1, we obtained in [Crochemore and Ilie, JCSS, 2007] that RUNS(n) <= ( 6/10 )n + n = 1.6 n. For larger values of p , and/or smaller values of b , better bounds can be obtained but the computation can become very demanding.

In [Crochemore, Ilie, Tinta, CPM'08] we gave an algorithm that reduces very much the complexity of this computation and then implemented it on the SHARCNET high performance computer ( www.sharcnet.ca ). The results obtained are given below. (Note that they hold for arbitrary alphabets.)

Current best bound:

p= 10 , b= 0.85 : RUNS(n) <= 1.396 n
p= 15 , b= 0.89 : RUNS(n) <= 1.265 n
p= 20 , b= 0.89 : RUNS(n) <= 1.176 n
p= 25 , b= 0.91 : RUNS(n) <= 1.141 n
p= 30 , b= 0.91 : RUNS(n) <= 1.104n
p= 35 , b= 0.93 : RUNS(n) <= 1.097 n
p= 40 , b= 0.93 : RUNS(n) <= 1.077 n
p= 50 , b= 0.93 : RUNS(n) <= 1.048 n
p= 60 , b= 0.93 : RUNS(n) <= 1.029 n