
Algorithms for Jumbled Pattern Matching in Strings

Péter Burcsi (1), Ferdinando Cicalese (2), Gabriele Fici (3), Zsuzsanna Lipták (4)

(1) Department of Computer Algebra, Eötvös Loránd University, Hungary. bupe@compalg.inf.elte.hu
(2) Dipartimento di Informatica ed Applicazioni, University of Salerno, Italy. cicalese@dia.unisa.it
(3) I3S, UMR6070, CNRS et Université de Nice-Sophia Antipolis, France. fici@i3s.unice.fr
(4) AG Genominformatik, Technische Fakultät, Bielefeld University, Germany. zsuzsa@cebitec.uni-bielefeld.de
Abstract

The Parikh vector $p(s)$ of a string $s$ over a finite ordered alphabet $\Sigma=\{a_1,\ldots,a_\sigma\}$ is defined as the vector of multiplicities of the characters, $p(s)=(p_1,\ldots,p_\sigma)$, where $p_i=|\{j\mid s_j=a_i\}|$. A Parikh vector $q$ occurs in $s$ if $s$ has a substring $t$ with $p(t)=q$. The problem of searching for a query $q$ in a text $s$ of length $n$ can be solved simply and worst-case optimally with a sliding window approach in $O(n)$ time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time.

The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in $O(1)$ time. The preprocessing can be done trivially in $\Theta(n^2)$ time.

The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size $\sigma\geq 2$ and has sub-linear expected time complexity. More precisely, we present two variants of the algorithm, both using an $O(n)$ size data structure, each of which can be constructed in $O(n)$ time. The first solution is very simple and easy to implement and leads to an expected query time of $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{\log m}{\sqrt{m}})$, where $m=\sum_i q_i$ is the length of a string with Parikh vector $q$. The second uses wavelet trees and improves the expected runtime to $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{1}{\sqrt{m}})$, i.e., by a factor of $\log m$.

Electronic version of an article accepted for publication in the International Journal of Foundations of Computer Science (IJFCS), to appear in 2011. © World Scientific Publishing Company, www.worldscinet.com/ijfcs/

Keywords: Parikh vectors, permuted strings, pattern matching, string algorithms, average case analysis, text indexing, non-standard string matching

1 Introduction

Parikh vectors of strings count the multiplicities of the characters. They have been reintroduced many times under different names (compomer [5], composition [3], Parikh vector [22], permuted string [7], permuted pattern [10], and others). They are natural objects to study, due to their numerous applications; for instance, in computational biology, they have been applied in alignment algorithms [3], SNP discovery [5], repeated pattern discovery [10], and, most naturally, in the interpretation of mass spectrometry data [4]. Parikh vectors can be seen as a generalization of strings, where we view two strings as equivalent if one can be turned into the other by permuting its characters; in other words, if the two strings have the same Parikh vector.

The problem we are interested in here is answering the question whether a query Parikh vector $q$ appears in a given text $s$ (decision version), or where it occurs (occurrence version). An occurrence of $q$ is defined as an occurrence of a substring $t$ of $s$ with Parikh vector $q$. The problem can be viewed as an approximate pattern matching problem: we are looking for an occurrence of a jumbled version of a query string $t$, i.e. for the occurrence of a substring $t'$ which has the same Parikh vector. In the following, let $n$ be the length of the text $s$, $m$ the length of the query $q$ (defined as the length of a string $t$ with Parikh vector $q$), and $\sigma$ the size of the alphabet.

The above problem (both decision and occurrence versions) can be solved with a simple sliding window based algorithm in $O(n)$ time and $O(\sigma)$ additional storage space. This is worst-case optimal for a single query. However, when we expect to search for many queries in the same string, the above approach leads to $O(Kn)$ runtime for $K$ queries. To the best of our knowledge, no faster approach is known. This is in stark contrast to the classical exact pattern matching problem, where all exact occurrences of a query pattern of length $m$ are sought in a text of length $n$. In that case, for one query, a naive approach leads to $O(nm)$ runtime, while quite involved preprocessing and searching ideas are necessary to achieve an improved runtime of $O(n+m)$, as in the Knuth-Morris-Pratt [17], Boyer-Moore [6] and Boyer-Moore-type algorithms (see, e.g., [2, 14]). However, when many queries are expected, the text can be preprocessed into a data structure of size linear in $n$, such as a suffix tree, suffix array, or suffix automaton, which then allows individual queries to be answered in time linear in the length of the pattern (see any textbook on string algorithms, e.g. [23, 18]).

1.1 Related work

Jumbled pattern matching is a special case of approximate pattern matching. It has been used as a filtering step in approximate pattern matching algorithms [15], but rarely considered in its own right.

The authors of [7] present an algorithm for finding all occurrences of a Parikh vector in a run-length encoded text. The algorithm's time complexity is $O(n'+\sigma)$, where $n'$ is the length of the run-length encoding of $s$. Obviously, if the string is not run-length encoded, a preprocessing phase of time $O(n)$ has to be added. However, this may still be worthwhile if many queries are expected. To the best of our knowledge, this is the only algorithm that has been presented for the problem we treat here.

An efficient algorithm for computing all Parikh fingerprints of substrings of a given string was developed in [1]. Parikh fingerprints are Boolean vectors where the $k$'th entry is 1 if and only if $a_k$ appears in the string. The algorithm involves storing a data point for each Parikh fingerprint, of which there are at most $O(n\sigma)$ many. This approach was adapted in [10] for Parikh vectors and applied to identifying all repeated Parikh vectors within a given length range; using it to search for queries of arbitrary length would imply using $\Omega(P(s))$ space, where $P(s)$ denotes the number of different Parikh vectors of substrings of $s$. This is not desirable, since, for arbitrary alphabets, there are non-trivial strings of any length with quadratic $P(s)$ [8].

1.2 Results

In this paper, we present two novel algorithms which perform significantly better than the simple window algorithm, in the case where many queries arrive.

For the binary case, we present an algorithm which answers decision queries in $O(1)$ time, using a data structure of size $O(n)$ (Interval Algorithm, Sect. 3). The data structure is constructed in $\Theta(n^2)$ time.

For general alphabets, we present an algorithm with expected sublinear runtime which uses $O(n)$ space to answer occurrence queries (Jumping Algorithm, Sect. 4). We present two different variants of the algorithm. The first one uses a very simple data structure (an inverted table) and answers queries in time $O(\sigma J\log(\frac{n}{J}+m))$, where $J$ denotes the number of iterations of the main loop of the algorithm. We then show that the expected value of $J$ for the case of random strings and patterns is $O(\frac{n}{\sqrt{m}\sqrt{\sigma\log\sigma}})$, yielding an expected runtime of $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{\log m}{\sqrt{m}})$ per query.

The second variant of the algorithm uses wavelet trees [13] and has query time $O(\sigma J)$, yielding an overall expected runtime of $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{1}{\sqrt{m}})$ per query. (Here and in the following, $\log$ stands for the logarithm base 2.)

Our simulations on random strings and real biological strings confirm the sublinear behavior of the algorithms in practice. This is a significant improvement over the simple window algorithm w.r.t. expected runtime, both for a single query and repeated queries over one string.

The Jumping Algorithm is reminiscent of the Boyer-Moore-like approaches to the classical exact string matching problem [6, 2, 14]. This analogy is used both in its presentation and in the analysis of the number of iterations performed by the algorithm.

2 Definitions and problem statement

Given a finite ordered alphabet $\Sigma=\{a_1,\ldots,a_\sigma\}$, $a_1\leq\ldots\leq a_\sigma$. For a string $s\in\Sigma^*$, $s=s_1\ldots s_n$, the Parikh vector $p(s)=(p_1,\ldots,p_\sigma)$ of $s$ gives the multiplicities of the characters in $s$, i.e. $p_i=|\{j\mid s_j=a_i\}|$, for $i=1,\ldots,\sigma$. For a Parikh vector $p$, the length $|p|$ denotes the length of a string with Parikh vector $p$, i.e. $|p|=\sum_i p_i$. An occurrence of a Parikh vector $p$ in $s$ is an occurrence of a substring $t$ with $p(t)=p$. (An occurrence of $t$ is a pair of positions $1\leq i\leq j\leq n$ such that $s_i\ldots s_j=t$.) A Parikh vector that occurs in $s$ is called a sub-Parikh vector of $s$. The prefix of length $i$ is denoted by $pr(i)=pr(i,s)=s_1\ldots s_i$, and the Parikh vector of $pr(i)$ by $prv(i)=prv(i,s)=p(pr(i))$.

For two Parikh vectors $p,q\in\mathbb{N}^\sigma$, we define $p\leq q$ and $p+q$ component-wise: $p\leq q$ if and only if $p_i\leq q_i$ for all $i=1,\ldots,\sigma$, and $p+q=u$ where $u_i=p_i+q_i$ for $i=1,\ldots,\sigma$. Similarly, for $p\leq q$, we set $q-p=v$ where $v_i=q_i-p_i$ for $i=1,\ldots,\sigma$.
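To fix the notation used in later sketches, the following minimal Python fragment (our own illustrative code, not part of the algorithms of this paper) implements Parikh vectors and their component-wise operations:

    def parikh(s, alphabet):
        # Parikh vector of s: one multiplicity per character of the ordered alphabet.
        return tuple(s.count(a) for a in alphabet)

    def pv_leq(p, q):
        # Component-wise comparison p <= q.
        return all(pi <= qi for pi, qi in zip(p, q))

    def pv_sub(q, p):
        # Component-wise difference q - p (assumes pv_leq(p, q)).
        return tuple(qi - pi for qi, pi in zip(q, p))

    # Example: p("abaca") = (3, 1, 1) over {a, b, c}, and (1, 1, 0) <= (3, 1, 1).
    assert parikh("abaca", "abc") == (3, 1, 1)
    assert pv_leq((1, 1, 0), parikh("abaca", "abc"))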

Jumbled Pattern Matching (JPM). Let $s\in\Sigma^*$ be given, $|s|=n$. For a Parikh vector $q\in\mathbb{N}^\sigma$ (the query), $|q|=m$, find all occurrences of $q$ in $s$. The decision version of the problem is the one where we only want to know whether $q$ occurs in $s$.

We assume that $K$ queries arrive over time, so some preprocessing may be worthwhile.

Note that for $K=1$, both the decision version and the occurrence version can be solved worst-case optimally with a simple window algorithm, which moves a fixed-size window of size $m$ along the string $s$: maintain the Parikh vector $c$ of the current window and a counter $r$ of the indices $i$ such that $c_i\neq q_i$. Each sliding step costs either 0 or 2 update operations on $c$, and possibly one increment or decrement of $r$. This algorithm solves both the decision and the occurrence problem, runs in time $\Theta(n)$, and uses additional storage space $\Theta(\sigma)$.
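A minimal Python sketch of this window algorithm (the code and its names are ours; occurrences are reported as 1-based start positions):

    def window_occurrences(s, q, alphabet):
        # All start positions of substrings with Parikh vector q,
        # in O(n) time and O(sigma) additional space.
        idx = {a: k for k, a in enumerate(alphabet)}
        sigma, m, n = len(alphabet), sum(q), len(s)
        if m == 0 or m > n:
            return []
        c = [0] * sigma                       # Parikh vector of current window
        for ch in s[:m]:
            c[idx[ch]] += 1
        r = sum(1 for k in range(sigma) if c[k] != q[k])   # mismatch counter
        occ = [1] if r == 0 else []
        for i in range(m, n):                 # slide: s[i] enters, s[i-m] leaves
            for k, delta in ((idx[s[i]], 1), (idx[s[i - m]], -1)):
                if c[k] == q[k]:
                    r += 1
                c[k] += delta
                if c[k] == q[k]:
                    r -= 1
            if r == 0:
                occ.append(i - m + 2)
        return occ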

Precomputing, sorting, and storing all sub-Parikh vectors of $s$ would lead to $\Theta(n^2)$ storage space, since there are non-trivial strings with a quadratic number of Parikh vectors over arbitrary alphabets [8]. Such space usage is unacceptable in many applications.

For small queries, the problem can be solved exhaustively with a linear size indexing structure such as a suffix tree, which can be searched down to depth $m=|q|$ (the length of the sought substrings), yielding a solution to the decision problem in time $O(\sigma^m)$. For finding occurrences, report all leaves in the subtrees below each match; this costs $O(M)$ time, where $M$ is the number of occurrences of $q$ in $s$. Constructing the suffix tree takes $O(n)$ time, so for $m=o(\log n)$, we get a total runtime of $O(n)$, since $M\leq n$ for any query $q$.

3 Decision problem in the binary case

In this section, we present an algorithm for strings over a binary alphabet which, once a data structure of size $O(n)$ has been constructed, answers decision queries in constant time. It makes use of the following nice property of binary strings.

Lemma 1

Let $s\in\{a,b\}^*$ with $|s|=n$. Fix $1\leq m\leq n$. If the Parikh vectors $(x_1,m-x_1)$ and $(x_2,m-x_2)$ both occur in $s$, then so does $(y,m-y)$ for any $x_1\leq y\leq x_2$.

Proof

Consider a sliding window of fixed size $m$ moving along the string, and let $(p_1,p_2)$ be the Parikh vector of the current substring. When the window is shifted by one, the Parikh vector either remains unchanged (if the character falling out is the same as the character coming in), or it becomes $(p_1+1,p_2-1)$ resp. $(p_1-1,p_2+1)$ (if they differ). Thus the Parikh vectors of substrings of $s$ of length $m$ form a set of the form $\{(x,m-x)\mid x=\mathrm{pmin}(m),\mathrm{pmin}(m)+1,\ldots,\mathrm{pmax}(m)\}$ for appropriate values $\mathrm{pmin}(m)$ and $\mathrm{pmax}(m)$. ∎

Assume that the algorithm has access to the values $\mathrm{pmin}(m)$ and $\mathrm{pmax}(m)$ for $m=1,\ldots,n$; then, when a query $q=(x,y)$ arrives, it answers yes if and only if $x\in[\mathrm{pmin}(x+y),\mathrm{pmax}(x+y)]$. The query time is $O(1)$.

The table of the values $\mathrm{pmin}(m)$ and $\mathrm{pmax}(m)$ can easily be computed in a preprocessing step in time $\Theta(n^2)$ by scanning the string with a window of size $m$, for each $m$. Alternatively, lazy computation of the table is feasible, since for any query $q$, only the entry $m=|q|$ is needed; it can therefore be computed on the fly as queries arrive. Then, any query takes time $O(1)$ (if the appropriate entry has already been computed) or $O(n)$ (if it has not). After $n$ queries of the latter kind, the table is complete, and all subsequent queries can be answered in $O(1)$ time. If we assume that the query lengths are uniformly distributed, then this can be viewed as a coupon collector problem where the collector has to obtain one copy of each length $m$; the expected number of queries before having seen all $n$ coupons is $nH_n\approx n\ln n$ (see e.g. [11]). The algorithm will have taken $O(n^2)$ time to answer these $n\ln n$ queries.
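A sketch of the eager variant in Python (our own code; pmin and pmax are indexed by the window length m, and queries are answered exactly as described above):

    def build_pmin_pmax(s):
        # For a binary string s over {'a','b'}: for every window length m,
        # the min and max number of a's over all windows of size m.
        # Theta(n^2) time, 2n integers of space.
        n = len(s)
        pmin, pmax = [0] * (n + 1), [0] * (n + 1)
        for m in range(1, n + 1):
            x = s[:m].count('a')              # a's in the first window
            lo = hi = x
            for i in range(m, n):             # slide: s[i] enters, s[i-m] leaves
                x += (s[i] == 'a') - (s[i - m] == 'a')
                lo, hi = min(lo, x), max(hi, x)
            pmin[m], pmax[m] = lo, hi
        return pmin, pmax

    def occurs(pmin, pmax, q):
        # O(1) decision for a query Parikh vector q = (x, y).
        x, y = q
        m = x + y
        return 0 < m < len(pmin) and pmin[m] <= x <= pmax[m]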

The assumption of uniformly distributed query lengths may not be very realistic; however, even if it does not hold, we never take more than $O(n^2+K)$ time for $K$ queries. Since any one query takes at most $O(n)$ time, our algorithm never performs worse than the simple window algorithm. Moreover, for those queries where the table entries have to be computed, we can even run the simple window algorithm itself and report all occurrences as well. For all other queries, we only give decision answers, but in constant time.

The size of the data structure is $2n=O(n)$. The overall running time for either variant is $\Theta(K+n^2)$. As soon as the number of queries is $K=\omega(n)$, both variants outperform the simple window algorithm, whose running time is $\Theta(Kn)$.

Example 1

Let $s=ababbaabaabbbaaabbab$. In Table 1, we give the values of pmin and pmax for $s$. This example shows that the locality of pmin and pmax is preserved only in adjacent levels. For example, the value $\mathrm{pmax}(3)=3$ corresponds to the substring $aaa$, appearing only at position 14, while $\mathrm{pmax}(5)=4$ corresponds to the substring $aabaa$, appearing only at position 6.

Table 1: An example of the linear data structure for answering queries in constant time.

m    |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
pmin |  0  0  0  1  2  2  3  3  4  4  5  5  6  7  7  8  8  9  9 10
pmax |  1  2  3  3  4  4  4  5  5  6  7  7  7  8  8  9  9  9 10 10

4 The Jumping Algorithm

In this section, we introduce our algorithm for general alphabets. We first give the main algorithm and then present two different implementations of it. The first one, an inverted prefix table, is very easy to understand and to implement, takes $O(n)$ space and $O(n)$ time to construct (both with constant 1), and can replace the string. Then we show how to use a wavelet tree of $s$ to implement our algorithm; this variant has the same space requirements as the inverted table, can be constructed in $O(n)$ time, and improves the query time by a factor of $\log m$.

4.1 Main algorithm

Let $s=s_1\ldots s_n\in\Sigma^*$ be given. Recall that $prv(i)$ denotes the Parikh vector of the prefix of $s$ of length $i$, for $i=0,\ldots,n$, where $prv(0)=p(\epsilon)=(0,\ldots,0)$. Consider a Parikh vector $p\in\mathbb{N}^\sigma$, $p\neq(0,\ldots,0)$. We make the following simple observations:

Observation 1
  1. For any $0\leq i\leq j\leq n$, $p=prv(j)-prv(i)$ if and only if $p$ occurs in $s$ at position $(i+1,j)$.

  2. If an occurrence of $p$ ends in position $j$, then $prv(j)\geq p$.

The algorithm moves two pointers $L$ and $R$ along the text, pointing at these potential positions $i$ and $j$. Instead of moving linearly, however, the pointers are updated in jumps, alternating between updates of $R$ and $L$, in such a manner that many positions are skipped. Moreover, because of the way we update the pointers, after any update it suffices to check whether $R-L=|q|$ to confirm that an occurrence has been found (cf. Lemma 2 below).

We first need to define a function firstfit, which returns the smallest potential position where an occurrence of a Parikh vector can end. Let $p\in\mathbb{N}^\sigma$; then

$\textsc{firstfit}(p):=\min\{j\mid prv(j)\geq p\},$

and set $\textsc{firstfit}(p)=\infty$ if no such $j$ exists. We use the following rules for updating the two pointers, illustrated in Fig. 1.

Figure 1: The situation after the update of $R$ (above) and after the update of $L$ (below). $R$ is placed at the first fit of $prv(L)+q$, thus $q'$ is a super-Parikh vector of $q$. Then $L$ is placed at the beginning of the longest good suffix ending in $R$, so $q''$ is a sub-Parikh vector of $q$.

Updating $R$: Assume that the left pointer is pointing at position $L$, i.e. no unreported occurrence starts before $L+1$. Notice that, if there is an occurrence of $q$ ending at any position $j>L$, it must hold that $prv(L)+q\leq prv(j)$. In other words, we must fit both $prv(L)$ and $q$ at position $j$, so we update $R$ to

$R\leftarrow\textsc{firstfit}(prv(L)+q).$

Updating $L$: Assume that $R$ has just been updated. Thus, $prv(R)-prv(L)\geq q$ by the definition of firstfit. If equality holds, then we have found an occurrence of $q$ at position $(L+1,R)$, and $L$ can be incremented by 1. Otherwise $prv(R)-prv(L)>q$, which implies that, interspersed between the characters that belong to $q$, there are some "superfluous" characters. Now the first position where an occurrence of $q$ can start is at the beginning of a contiguous sequence of characters ending in $R$ which all belong to $q$. In other words, we need the beginning of the longest suffix of $s[L+1,R]$ with Parikh vector $\leq q$, i.e. the smallest position $i$ such that $prv(R)-prv(i)\leq q$, or, equivalently, $prv(i)\geq prv(R)-q$. Thus we update $L$ to

$L\leftarrow\textsc{firstfit}(prv(R)-q).$

Finally, in order to check whether we have found an occurrence of query $q$, after each update of $R$ or $L$, we check whether $R-L=|q|$. In Figure 2, we give the pseudocode of the algorithm.

Algorithm Jumping Algorithm
Input: query Parikh vector $q$
Output: a set $Occ$ containing all beginning positions of occurrences of $q$ in $s$

1.  $m \leftarrow |q|$; $Occ \leftarrow \emptyset$; $L \leftarrow 0$;
2.  while $L \leq n-m$
3.      do $R \leftarrow \textsc{firstfit}(prv(L)+q)$;
4.          if $R-L=m$
5.              then add $L+1$ to $Occ$;
6.                  $L \leftarrow L+1$;
7.              else $L \leftarrow \textsc{firstfit}(prv(R)-q)$;
8.                  if $R-L=m$
9.                      then add $L+1$ to $Occ$;
10.                         $L \leftarrow L+1$;
11. return $Occ$;

Figure 2: Pseudocode of Jumping Algorithm
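As an illustration, here is a Python rendering of the pseudocode (our own code; firstfit and prv are passed in as callables, so that either of the two implementations described in Sections 4.2 and 4.3 can be plugged in, and firstfit(p) is expected to return a value larger than n when p does not fit in s at all):

    def jumping_algorithm(n, q, prv, firstfit):
        # All beginning positions (1-based) of occurrences of the Parikh
        # vector q in a text of length n.
        # prv(i): Parikh vector of the length-i prefix, as a tuple;
        # firstfit(p): smallest j with prv(j) >= p component-wise.
        m = sum(q)
        occ, L = [], 0
        while L <= n - m:
            R = firstfit(tuple(x + y for x, y in zip(prv(L), q)))      # line 3
            if R > n:
                break                        # q does not fit in the rest of s
            if R - L == m:                   # occurrence found after R-update
                occ.append(L + 1)
                L += 1
            else:
                L = firstfit(tuple(x - y for x, y in zip(prv(R), q)))  # line 7
                if R - L == m:               # occurrence found after L-update
                    occ.append(L + 1)
                    L += 1
        return occ

With the inverted-table helpers sketched in Section 4.2 below, this reproduces the occurrences 5, 6, 7, 13 of Example 2.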

It remains to show how to compute the firstfit and $prv$ functions. We first prove that the algorithm is correct. For this, we will need the following lemma.

Lemma 2

The following algorithm invariants hold:

  1. After each update of $R$, we have $prv(R)-prv(L)\geq q$.

  2. After each update of $L$, we have $prv(R)-prv(L)\leq q$.

  3. $L\leq R$.

Proof

1. follows directly from the definition of firstfit and the update rule for $R$. For 2., if an occurrence was found at $(i,j)$, then before the update we have $L=i-1$ and $R=j$. Now $L$ is incremented by 1, so $L=i$ and $prv(R)-prv(L)=q-e_{s_i}<q$, where $e_k$ is the $k$'th unit vector. Otherwise, $L\leftarrow\textsc{firstfit}(prv(R)-q)$, and again the claim follows directly from the definition of firstfit. For 3., if an occurrence was found, then $L$ is incremented by 1, and $R-L=m-1\geq 0$. Otherwise, $L=\textsc{firstfit}(prv(R)-q)=\min\{\ell\mid prv(\ell)\geq prv(R)-q\}\leq R$. ∎

Theorem 4.1

The Jumping Algorithm is correct.

Proof

We have to show that (1) if the algorithm reports an occurrence, then it is correct, and (2) if there is an occurrence, then the algorithm will find it.

(1) If the algorithm reports an index $i$, then $(i,i+m-1)$ is an occurrence of $q$: An index $i$ is added to $Occ$ whenever $R-L=m$. If the last update was that of $R$, then we have $prv(R)-prv(L)\geq q$ by Lemma 2, and together with $R-L=m=|q|$, this implies $prv(R)-prv(L)=q$; thus $(L+1,R)=(i,i+m-1)$ is an occurrence of $q$. If the last update was that of $L$, then $prv(R)-prv(L)\leq q$, and it follows analogously that $prv(R)-prv(L)=q$.

(2) All occurrences of $q$ are reported: Assume otherwise. Then there is a minimal $i$ and $j=i+m-1$ such that $p(s[i,j])=q$ but $i$ is not reported by the algorithm. By Observation 1, we have $prv(j)-prv(i-1)=q$.

Let us refer to the successive values of $L$ and $R$ as two sequences $(L_k)_{k=1,2,\ldots}$ and $(R_k)_{k=1,2,\ldots}$. So we have $L_1=0$, and for all $k\geq 1$, $R_k=\textsc{firstfit}(prv(L_k)+q)$, with $L_{k+1}=L_k+1$ if $R_k-L_k=m$, and $L_{k+1}=\textsc{firstfit}(prv(R_k)-q)$ otherwise. In particular, $L_{k+1}>L_k$ for all $k$.

First observe that if $L_k=i-1$ for some $k$, then $R$ will be updated to $j$ in the next step, and we are done. This is because $R_k=\textsc{firstfit}(prv(L_k)+q)=\textsc{firstfit}(prv(i-1)+q)=\textsc{firstfit}(prv(j))=j$. Similarly, if $R_k=j$ for some $k$, then we have $L_{k+1}=i-1$.

So there must be a $k$ such that $L_k<i-1<L_{k+1}$. Now look at $R_k$. Since there is an occurrence of $q$ after $L_k$ ending in $j$, we have $R_k=\textsc{firstfit}(prv(L_k)+q)\leq j$. However, we cannot have $R_k=j$, so it follows that $R_k<j$. On the other hand, $i-1<L_{k+1}\leq R_k$ by our assumption and by Lemma 2. So $R_k$ points to a position somewhere between $i-1$ and $j$, i.e. to a position within our occurrence of $q$. Denote the remaining part of $q$ to the right of $R_k$ by $q'$: $q'=prv(j)-prv(R_k)$. Since $R_k=\textsc{firstfit}(prv(L_k)+q)$, all characters of $q$ must fit between $L_k$ and $R_k$, so the Parikh vector $p=prv(i-1)-prv(L_k)$ of the part preceding the occurrence is a super-Parikh vector of $q'$. If $p=q'$, then there is an occurrence of $q$ at $(L_k+1,R_k)$, and by minimality of $(i,j)$, this occurrence was correctly identified by the algorithm. Thus, $L_{k+1}=L_k+1\leq i-1$, contradicting our choice of $k$. It follows that $p>q'$, and we have to find the longest good suffix of the substring ending in $R_k$ for the next update $L_{k+1}$ of $L$. But $s[i,R_k]$ is a good suffix, because its Parikh vector is a sub-Parikh vector of $q$, so $L_{k+1}=\textsc{firstfit}(prv(R_k)-q)\leq i-1$, again in contradiction to $L_{k+1}>i-1$. ∎

We illustrate the proof in Fig. 3.

Figure 3: Illustration for proof of correctness.

4.2 Variant using an inverted table

Storing all prefix vectors of $s$ would require $O(\sigma n)$ storage space, which may be too much. Instead, we construct an “inverted prefix vector table” $I$ containing the increment positions of the prefix vectors: for each character $a_k\in\Sigma$ and each value $j$ up to $p(s)_k$, the position in $s$ of the $j$'th occurrence of character $a_k$. Formally, $I[k][j]=\min\{i\mid prv(i)_k\geq j\}$ for $j\geq 1$, and $I[k][0]=0$. Then we have

$\textsc{firstfit}(p)=\max_{k=1,\ldots,\sigma}\{I[k][p_k]\}.$

We can also compute the prefix vectors $prv(i)$ from the table $I$: for $k=1,\ldots,\sigma$,

$prv(j)_k=\max\{i\mid I[k][i]\leq j\}.$

The obvious way to find these values is to do binary search for $j$ in each row of $I$. However, this would take time $\Theta(\sigma\log n)$; a better way is to use information already acquired during the run of the algorithm. By Lemma 2, it always holds that $L\leq R$. Thus, for computing $prv(R)_k$, it suffices to search for $R$ between $prv(L)_k$ and $prv(L)_k+(R-L)$. This search takes time proportional to $\log(R-L)$. Moreover, after each update of $L$, we have $L\geq R-m$, so when computing $prv(L)_k$, we can restrict the search for $L$ to between $prv(R)_k-m$ and $prv(R)_k$, in time $O(\log m)$. For more details, see Section 4.4.

Table $I$ can be computed in one pass over $s$ (where we take the liberty of identifying character $a_k\in\Sigma$ with its index $k$). The variables $c_k$ count the number of occurrences of character $a_k$ seen so far, and are initialized to 0.

Algorithm Construct $I$
1.  for $i=1$ to $n$
2.      $c_{s_i} \leftarrow c_{s_i}+1$;
3.      $I[s_i][c_{s_i}] \leftarrow i$;

Table $I$ requires $O(n)$ storage space (with constant 1). Moreover, the string $s$ can be discarded, so we have zero additional storage. (Access to $s_i$, $1\leq i\leq n$, is still possible, at cost $O(\sigma\log n)$.)
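In Python, the construction and the two functions can be sketched as follows (our own code; for simplicity, prv_table uses an unbounded binary search per row, standing in for the bounded search described above):

    from bisect import bisect_right

    def build_inverted_table(s, alphabet):
        # I[k][j] = position (1-based) of the j'th occurrence of the k'th
        # character in s, with I[k][0] = 0; one pass, O(n) time and space.
        idx = {a: k for k, a in enumerate(alphabet)}
        I = [[0] for _ in alphabet]
        for i, ch in enumerate(s, start=1):
            I[idx[ch]].append(i)
        return I

    def firstfit_table(I, p):
        # firstfit(p) = max_k I[k][p_k]; infinite if some p_k exceeds the
        # total number of occurrences of the k'th character in s.
        if any(pk >= len(I[k]) for k, pk in enumerate(p)):
            return float('inf')
        return max(I[k][pk] for k, pk in enumerate(p))

    def prv_table(I, j):
        # prv(j)_k = max{ i : I[k][i] <= j }, found by binary search per row.
        return tuple(bisect_right(row, j) - 1 for row in I)

For the string of Example 2 below, firstfit_table(I, (3, 1, 2)) returns max(8, 3, 4) = 8, which is the value R = 8 in the first column of the table at the end of the example.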

Example 2

Let $\Sigma=\{a,b,c\}$ and $s=cabcccaaabccbaacca$. The prefix vectors of $s$ are given below. Note that the algorithm does not actually compute these.


pos.  |  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
s     |     c  a  b  c  c  c  a  a  a  b  c  c  b  a  a  c  c  a
# a's |  0  0  1  1  1  1  1  2  3  4  4  4  4  4  5  6  6  6  7
# b's |  0  0  0  1  1  1  1  1  1  1  2  2  2  3  3  3  3  3  3
# c's |  0  1  1  1  2  3  4  4  4  4  4  5  6  6  6  6  7  8  8


The inverted prefix table $I$:

    |  0   1   2   3   4   5   6   7   8
  a |  0   2   7   8   9  14  15  18
  b |  0   3  10  13
  c |  0   1   4   5   6  11  12  16  17


Query $q=(3,1,2)$ has 4 occurrences, beginning at positions 5, 6, 7, and 13, since $(3,1,2)=prv(10)-prv(4)=prv(11)-prv(5)=prv(12)-prv(6)=prv(18)-prv(12)$. The values of $L$ and $R$ are given below:


k (see proof of Thm. 4.1) |   1    2    3    4    5    6    7
L                         |   0    4    5    6    7   10   12
R                         |   8   10   11   12   14   18   18
occurrence found?         |       yes  yes  yes            yes

4.3 Variant using a wavelet tree

A wavelet tree on $s\in\Sigma^*$ allows rank, select, and access queries in time $O(\log\sigma)$. For $a_k\in\Sigma$, $\textit{rank}_k(s,i)=|\{j\mid s_j=a_k,\ j\leq i\}|$ is the number of occurrences of character $a_k$ up to and including position $i$, while $\textit{select}_k(s,i)=\min\{j\mid \textit{rank}_k(s,j)\geq i\}$ is the position of the $i$'th occurrence of character $a_k$. When the string is clear, we just write $\textit{rank}_k(i)$ and $\textit{select}_k(i)$. Notice that

  • $prv(j)=(\textit{rank}_1(j),\ldots,\textit{rank}_\sigma(j))$, and

  • for a Parikh vector $p=(p_1,\ldots,p_\sigma)$, $\textsc{firstfit}(p)=\max_{k=1,\ldots,\sigma}\{\textit{select}_k(p_k)\}$.

So we can use a wavelet tree of the string $s$ to implement those two functions. We give a brief recap of wavelet trees, and then explain how to implement the two functions above in $O(\sigma)$ time each.

A wavelet tree is a complete binary tree with $\sigma=|\Sigma|$ many leaves. To each inner node, a bitstring is associated which is defined recursively, starting from the root, in the following way. If $|\Sigma|=1$, then there is nothing to do (in this case, we have reached a leaf). Else split the alphabet into two roughly equal parts, $\Sigma_{\rm left}$ and $\Sigma_{\rm right}$. Now construct a bitstring of length $n$ from $s$ by replacing each occurrence of a character $a$ by 0 if $a\in\Sigma_{\rm left}$, and by 1 if $a\in\Sigma_{\rm right}$. Let $s_{\rm left}$ be the subsequence of $s$ consisting only of characters from $\Sigma_{\rm left}$, and $s_{\rm right}$ the subsequence consisting only of characters from $\Sigma_{\rm right}$. Now recurse on the left child with string $s_{\rm left}$ and alphabet $\Sigma_{\rm left}$, and on the right child with $s_{\rm right}$ and $\Sigma_{\rm right}$. An illustration is given in Fig. 4. At each inner node, in addition to the bitstring $B$, we store a data structure of size $o(|B|)$ which allows rank and select queries on bit vectors to be answered in constant time ([20, 9, 21]).

Now, using the wavelet tree of $s$, any rank or select operation on $s$ takes time $O(\log\sigma)$, which would yield $O(\sigma\log\sigma)$ time for both $prv(j)$ and $\textsc{firstfit}(p)$. However, we can implement both in such a way that they need only $O(\sigma)$ time: in order to compute $\textit{rank}_k(j)$, the wavelet tree, which has $\log\sigma$ levels, has to be descended from the root to leaf $k$. Since for $prv(j)$ we need all values $\textit{rank}_1(j),\ldots,\textit{rank}_\sigma(j)$ simultaneously, we traverse the complete tree in $O(\sigma)$ time.

For computing $\textsc{firstfit}(p)$, we need $\max_k\{\textit{select}_k(p_k)\}$, which can be computed bottom-up in the following way. We define a value $x_u$ for each node $u$. If $u$ is a leaf, then $u$ corresponds to some character $a_k\in\Sigma$; set $x_u=p_k$. For an inner node $u$, let $B_u$ be the bitstring at $u$; we define $x_u=\max\{\textit{select}_0(B_u,x_{\rm left}),\textit{select}_1(B_u,x_{\rm right})\}$, where $x_{\rm left}$ and $x_{\rm right}$ are the values already computed for the left resp. right child of $u$. The desired value equals $x_{\rm root}$.
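The following Python sketch (our own simplified code) makes the two traversals concrete. Plain prefix-count and position lists stand in for the succinct constant-time rank/select structures of [20, 9, 21], so it illustrates the traversal orders rather than the bit-level space bounds:

    class WaveletNode:
        def __init__(self, s, alphabet):
            self.alphabet = alphabet
            if len(alphabet) == 1:
                self.left = self.right = None     # leaf
                return
            half = len(alphabet) // 2
            left_set = set(alphabet[:half])
            bits = [0 if ch in left_set else 1 for ch in s]
            self.rank1 = [0]                      # rank1[i] = number of 1s in bits[:i]
            for b in bits:
                self.rank1.append(self.rank1[-1] + b)
            self.pos = ([0], [0])                 # 1-based positions of the 0s / 1s
            for i, b in enumerate(bits, start=1):
                self.pos[b].append(i)
            self.left = WaveletNode([c for c in s if c in left_set], alphabet[:half])
            self.right = WaveletNode([c for c in s if c not in left_set], alphabet[half:])

    def prv_wavelet(node, j):
        # prv(j): all sigma rank values at once, by one full traversal.
        if node.left is None:
            return {node.alphabet[0]: j}
        ones = node.rank1[j]                      # right-half characters in prefix j
        out = prv_wavelet(node.left, j - ones)
        out.update(prv_wavelet(node.right, ones))
        return out

    def firstfit_wavelet(node, p):
        # Bottom-up computation of x_root = firstfit(p); p maps characters to counts.
        if node.left is None:
            return p.get(node.alphabet[0], 0)
        xl = firstfit_wavelet(node.left, p)       # x_left
        xr = firstfit_wavelet(node.right, p)      # x_right
        zeros, ones = node.pos
        if xl >= len(zeros) or xr >= len(ones):
            return float('inf')                   # p does not fit in s at all
        return max(zeros[xl], ones[xr])           # max{select_0(B, x_left), select_1(B, x_right)}

    # Reproduces Example 3 below:
    root = WaveletNode("bbacaccabaddabccaaac", "abcd")
    assert firstfit_wavelet(root, {'a': 2, 'b': 3, 'c': 2, 'd': 1}) == 11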

Example 3

Let $s=bbacaccabaddabccaaac$ (cf. Fig. 4). We demonstrate the computation of $\textsc{firstfit}(2,3,2,1)$ using the wavelet tree. We have $\textsc{firstfit}(2,3,2,1)=\max\{\textit{select}_a(s,2),\textit{select}_b(s,3),\textit{select}_c(s,2),\textit{select}_d(s,1)\}$, where in slight abuse of notation we put the character in the subscript instead of its number. Denote the bottom left bitstring by $B_{a,b}$, the bottom right one by $B_{c,d}$, and the top bitstring by $B_{a,b,c,d}$. Then we get $\max\{\textit{select}_0(B_{a,b},2),\textit{select}_1(B_{a,b},3)\}=\max\{4,6\}=6$, and $\max\{\textit{select}_0(B_{c,d},2),\textit{select}_1(B_{c,d},1)\}=\max\{2,4\}=4$. So at the next level, we compute $\max\{\textit{select}_0(B_{a,b,c,d},6),\textit{select}_1(B_{a,b,c,d},4)\}=\max\{9,11\}=11$.

Figure 4: The wavelet tree for the string $bbacaccabaddabccaaac$. For clarity, the leaves have been omitted. Note also that the third line at each inner node (the strings over the alphabet $\{a,b,c,d\}$) is only included for illustration.

4.4 Algorithm Analysis

Let $\mathbb{A}_1(s,q)$ denote the running time of the Jumping Algorithm using inverted tables over a text $s$ and a Parikh vector $q$, and $\mathbb{A}_2(s,q)$ that of the Jumping Algorithm using a wavelet tree. Further, let $J=J(s,q)$ be the number of iterations performed in the while loop in line 2, i.e., the number of jumps performed by the algorithm on the input $q$.

The time spent in each iteration depends on how the functions firstfit and $prv$ are implemented (lines 3 and 7). In the wavelet tree implementation, as we saw before, both take time $O(\sigma)$, so the overall runtime of the algorithm is

$\mathbb{A}_2(s,q)=O(\sigma J).$ (1)

For the inverted table implementation, it is easy to see that computing firstfit takes $O(\sigma)$ time. Now denote, for each $i=1,\dots,J$, by $\hat{L}_i$ and $\hat{R}_i$ the values of $L$ and $R$ after the $i$'th execution of line 3 of the algorithm. (The $\hat{L}_i$ and $\hat{R}_i$ coincide with the $L_k$ and $R_k$ from the proof of Theorem 4.1 almost but not completely: when an occurrence is found after the update of $L$, the corresponding pair $L_k, R_k$ is skipped here. The reason is that now we are only considering those updates that carry a computational cost.) The computation of $prv(\hat{L}_i)$ in line 3 takes $O(\sigma\log m)$ time: for each $k=1,\dots,\sigma$, the component $prv(\hat{L}_i)_k$ can be determined by binary search over the list $I[k][prv(\hat{R}_{i-1})_k-m], I[k][prv(\hat{R}_{i-1})_k-m+1], \dots, I[k][prv(\hat{R}_{i-1})_k]$. By $\hat{L}_i\geq\hat{R}_{i-1}-m$, the claim follows.

The computation of $prv(\hat{R}_i)$ in line 7 takes $O(\sigma\log(\hat{R}_i-\hat{R}_{i-1}+m))$ time. Simply observe that the prefix ending at position $\hat{R}_i$ can contain at most $\hat{R}_i-\hat{L}_i$ more occurrences of the $k$'th character than the prefix ending at position $\hat{L}_i$. Therefore, as before, we can determine $prv(\hat{R}_i)_k$ by binary search over the list $I[k][prv(\hat{L}_i)_k], I[k][prv(\hat{L}_i)_k+1], \dots, I[k][prv(\hat{L}_i)_k+\hat{R}_i-\hat{L}_i]$. Using the fact that $\hat{L}_i\geq\hat{R}_{i-1}-m$, the desired bound follows.

The last three observations imply

$\mathbb{A}_1(s,q)=O\left(\sigma J\log m+\sigma\sum_{i=1}^{J}\log(\hat{R}_i-\hat{R}_{i-1}+m)\right).$

Notice that this is an overestimate, since line 7 is only executed if no occurrence was found after the current update of $R$ (line 4). Since $\log$ is concave and $\sum_{i=1}^{J}(\hat{R}_i-\hat{R}_{i-1})$ telescopes to at most $n$, Jensen's inequality (see, e.g., [16]) yields $\sum_{i=1}^{J}\log(\hat{R}_i-\hat{R}_{i-1}+m)\leq J\log\left(\frac{1}{J}\sum_{i=1}^{J}(\hat{R}_i-\hat{R}_{i-1}+m)\right)\leq J\log\left(\frac{n}{J}+m\right)$. Therefore we obtain

$\mathbb{A}_1(s,q)=O\left(\sigma J\log\left(\frac{n}{J}+m\right)\right).$ (2)

4.4.1 Average case analysis of $J$

The worst case running time of the Jumping Algorithm, in either implementation, is superlinear, since there exist strings $s$ of any length $n$ and Parikh vectors $q$ such that $J=\Theta(n)$: for instance, on the string $s=ababab\ldots ab$ and $q=(2,0)$, the algorithm executes $n/2$ jumps.

This sharply contrasts with the experimental evaluation we present later: in practice, the Jumping Algorithm appears to exhibit sublinear behavior. In the rest of this section we provide an average case analysis of the running time of the Jumping Algorithm, leading to the conclusion that its expected running time is sublinear.

We assume that the string $s$ is given as a sequence of i.i.d. random variables uniformly distributed over the alphabet $\Sigma$. According to Knuth et al. [17], “It might be argued that the average case taken over random strings is of little interest, since a user rarely searches for a random string. However, this model is a reasonable approximation when we consider those pieces of text that do not contain the pattern […]”. The experimental results we provide will show that this is indeed the case.

Let us concentrate on the behaviour of the algorithm when scanning a (piece of the) string which does not contain a match. According to the above observation we can reasonably take this as a measure of the performance of the algorithm, considering that for any match found there is an additional step of size 1, which we can charge as the cost of the output.

Let $E_{m,\sigma}$ denote the expected value of the distance between $R$ and $L$ following an update of $R$; i.e., if $L$ is at position $i$, then we are interested in the value $\ell$ such that $\textsc{firstfit}(prv(i)+q)=i+\ell$. Notice that the probabilistic assumptions made on the string, together with the assumption of absence of matches, allow us to treat this value as independent of the position $i$. We will show the following result about $E_{m,\sigma}$. For the sake of clarity, we defer the proof of this technical fact to the next section.

Lemma 3

$E_{m,\sigma}=\Omega\left(m+\sqrt{m\sigma\ln\sigma}\right).$

At each iteration (when there is no match), the $L$ pointer is moved forward to the farthest position from $R$ such that the Parikh vector of the substring between $L$ and $R$ is a sub-Parikh vector of $q$. In particular, we can upper bound the distance between the new positions of $L$ and $R$ by $m$. Thus for the expected number of jumps performed by the algorithm, measured as the average number of times we move $L$, we have

$\mathbb{E}[J]=\frac{n}{E_{m,\sigma}-m}=O\left(\frac{n}{\sqrt{m\sigma\ln\sigma}}\right).$ (3)

Recalling (1) and  (2), and using (3) for a random instance we have the following result concerning the average case complexity of the Jumping Algorithm.

Theorem 4.2

Let $s\in\Sigma^*$ be fixed. The Jumping Algorithm finds all occurrences of a query $q$

  1. in expected time $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{\log m}{\sqrt{m}})$ using an inverted prefix table of size $O(n)$, which can be constructed in a preprocessing step in time $O(n)$;

  2. in expected time $O(n(\frac{\sigma}{\log\sigma})^{1/2}\frac{1}{\sqrt{m}})$ using a wavelet tree of $s$ of size $O(n)$, which can be computed in a preprocessing step in time $O(n)$.

We conclude this section by remarking once more that the above estimate, obtained under the approximating probabilistic assumptions, appears to be confirmed by the experiments.

4.4.2 The proof of Lemma 3

We shall argue asymptotically in $m$, and according to whether or not the Parikh vector $q$ is balanced; in the latter case, we argue according to its degree of unbalancedness, measured as the magnitude of its largest and smallest components.

Case 1. $q$ is balanced, i.e., $q=(\frac{m}{\sigma},\dots,\frac{m}{\sigma})$. Then, from equations (7) and (12) of [19], it follows that

$E_{m,\sigma}\approx m+\begin{cases}m\,2^{-m}\binom{m}{m/2}&\text{if }\sigma=2,\\ \sqrt{2m\sigma\ln\frac{\sigma}{\sqrt{2\pi}}}&\text{otherwise.}\end{cases}$ (4)

The author of [19] studied a variant of the well-known coupon collector problem in which the collector has to accumulate a certain number of copies of each coupon. Identify the characters with the coupon types, the random string with the sequence of coupons obtained, and the query Parikh vector with the number of copies required of each coupon type; then the expected time at which the collection is completed is exactly our $E_{m,\sigma}$. It is easy to see that (4) provides the claimed bound of Lemma 3.
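The correspondence can be checked empirically with a small Monte Carlo sketch (our own illustrative code), which estimates $E_{m,\sigma}$ as the waiting time of this quota coupon collector:

    import random

    def waiting_time(q, rng=random):
        # Number of uniform draws from {0, ..., sigma-1} until, for every k,
        # character k has been drawn at least q[k] times.
        sigma = len(q)
        counts = [0] * sigma
        unmet = sum(1 for x in q if x > 0)    # quotas not yet reached
        t = 0
        while unmet:
            t += 1
            k = rng.randrange(sigma)
            counts[k] += 1
            if counts[k] == q[k]:
                unmet -= 1
        return t

    # Averaging waiting_time((100,) * 4) over many trials estimates
    # E_{m,sigma} for the balanced query with sigma = 4 and m = 400.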

Case 2. $q=(q_1,\dots,q_\sigma)\neq(\frac{m}{\sigma},\dots,\frac{m}{\sigma})$. Assume, w.l.o.g., that $q_1\geq q_2\geq\dots\geq q_\sigma$. We shall argue by cases according to the magnitude of $q_1$.

Subcase 2.1. Suppose $q_1=\frac{m}{\sigma}+\Omega\left(\sqrt{\frac{m\ln\sigma}{\sigma}}\right)$. Let us consider again the analogy with the coupon collector, who has to collect $q_i$ copies of coupons of type $i$, for $i=1,\dots,\sigma$. Clearly the collection is not complete until the $q_1$'th copy of the coupon of type 1 has been collected. We can model the collection of these type-1 coupons as a sequence of Bernoulli trials with success probability $1/\sigma$. The expected waiting time until the $q_1$'th success is $\sigma q_1$, and by the previous observation this is also a lower bound on $E_{m,\sigma}$. Thus,

$E_{m,\sigma}\geq\sigma q_1=\sigma\left(\frac{m}{\sigma}+\Omega\left(\sqrt{\frac{m\ln\sigma}{\sigma}}\right)\right)=\Omega\left(m+\sqrt{m\sigma\ln\sigma}\right),$

which confirms the claimed bound in this case as well.

Subcase 2.2. Finally, assume that $q_1=\frac{m}{\sigma}+o\left(\sqrt{\frac{m\ln\sigma}{\sigma}}\right)$. Then, for the smallest component $q_\sigma$ of $q$, we have $q_\sigma\geq m-(\sigma-1)q_1=\frac{m}{\sigma}-o\left(\sqrt{m\sigma\ln\sigma}\right)$. Consider now the balanced Parikh vector $q'=(q_\sigma,\dots,q_\sigma)$. We have $q'\leq q$ and $|q'|=\sigma q_\sigma$. By the analysis of Case 1 above on balanced Parikh vectors, and observing that collecting $q$ implies also collecting $q'$, it follows that

$E_{m,\sigma} \geq E_{\sigma q_\sigma,\sigma}
= \Omega\left(\sigma q_\sigma+\sqrt{\sigma^2 q_\sigma\ln\sigma}\right)
= \Omega\left(\sigma\left(\frac{m}{\sigma}-o\left(\sqrt{m\sigma\ln\sigma}\right)\right)+\sqrt{\sigma^2\left(\frac{m}{\sigma}-o\left(\sqrt{m\sigma\ln\sigma}\right)\right)\ln\sigma}\right)
= \Omega\left(m-o\left(\sigma\sqrt{m\sigma\ln\sigma}\right)+\sqrt{m\sigma\ln\sigma-o\left(\sigma^2\ln\sigma\sqrt{m\sigma\ln\sigma}\right)}\right),$

in agreement with the bound claimed. This completes the proof.

4.5 Simulations

We implemented the Jumping Algorithm in C++ in order to study the number of jumps $J$. We ran it on random strings of different lengths and over different alphabet sizes. The underlying probability model is an i.i.d. model with uniform distribution. We sampled random query vectors with length between $\log n$ ($=\log_2 n$) and $\sqrt{n}$, where $n$ is the length of the string. Our queries were of one of two types:

  1. Quasi-balanced Parikh vectors: of the form $(q_1,\ldots,q_\sigma)$ with $q_i\in(x-\epsilon,x+\epsilon)$, and $x$ running from $\log n/\sigma$ to $\sqrt{n}/\sigma$. For simplicity, we fixed $\epsilon=10$ in all our experiments, and sampled uniformly at random from all quasi-balanced vectors around each $x$.

  2. Random Parikh vectors with fixed length $m$: these were sampled uniformly at random from the space of all Parikh vectors of length $m$.

The rationale for using quasi-balanced queries is that they are clearly the worst case for the number of jumps $J$, since $J$ depends on the shift length, which in turn depends on $\textsc{firstfit}(prv(L)+q)$. Since we are searching in a random string with uniform character distribution, we can expect $\textsc{firstfit}(prv(L)+q)$ to be minimal if $q$ is close to balanced, i.e. if all entries $q_i$ are roughly the same. This is confirmed by our experimental results, which show that $J$ decreases dramatically if the queries are not balanced (Fig. 7, right).

We ran experiments on random strings over different alphabet sizes and observed that our average case analysis agrees well with the simulation results for random strings and random quasi-balanced query vectors. Plots for $n=10^5$ and $n=10^6$ with alphabet sizes $\sigma=2,4,16$ resp. $\sigma=4,16$ are shown in Fig. 6.

In Fig. 5 we show comparisons between the running time of the Jumping Algorithm and that of the simple window algorithm. The simulations over random strings and Parikh vectors of different sizes appear to agree perfectly with the guarantees provided by our asymptotic analyses. This is of particular importance from the point of view of applications, as it shows that the complexity analysis does not hide big constants.

Figure 5: Running time comparisons between the Jumping Algorithm and the window algorithm. The text is a random string (uniform i.i.d.) of size 9 000 000 over a four letter alphabet. Parikh vectors of different sizes between 10 and 2000 were randomly generated and the results averaged over all queries of the same size. On the left are the results for quasi-balanced Parikh vectors (cf. text). On the right are the results for random Parikh vectors.

To see how our algorithm behaves on non-random strings, we downloaded human DNA sequences from GenBank [12] and ran the Jumping Algorithm with random quasi-balanced queries on them. We found that the algorithm performs 2 to 10 times fewer jumps on these DNA strings than on random strings of the same length, with the gain increasing as $n$ increases. We show the results on a DNA sequence of 1 million bp (from Chromosome 1) in comparison with the average over 10 random strings of the same length (Fig. 7, left).

Figure 6: Number of jumps for different alphabet sizes for random strings of size 100 000 (left) and 1 000 000 (right). All queries are randomly generated quasi-balanced Parikh vectors (cf. text). Data averaged over 10 strings and all random queries of the same length.
Figure 7: Number of jumps in random vs. non-random strings: random strings over an alphabet of size 4 vs. a DNA sequence, all of length 1 000 000, with random quasi-balanced query vectors; data averaged over 10 random strings and all queries with the same length (left). Comparison of quasi-balanced vs. arbitrary query vectors over random strings, alphabet size 4, length 1 000 000, 10 strings; the data shown are averaged over all queries with the same length $m$ (right).

5 Conclusion

Our simulations appear to confirm that in practice the performance of the Jumping Algorithm is well predicted by the average case analysis we proposed. A more precise analysis is needed, however. Our approach seems unlikely to lead to any refined average case analysis since that would imply improved results for the intricate variant of the coupon collector problem of [19].

Moreover, in order to better simulate DNA or other biological data, random string models other than uniform i.i.d. should also be analysed, such as first or higher order Markov chains.

We remark that our wavelet tree variant of the Jumping Algorithm, which uses rank/select operations only, opens a new perspective on the study of Parikh vector matching. We have made another family of approximate pattern matching problems accessible to the use of self-indexing data structures [21]. We are particularly interested in compressed data structures which allow fast execution of rank and select operations, while at the same time using reduced storage space for the text. Thus, every step forward in this very active area can provide improvements for our problem.

Acknowledgements

We thank Gonzalo Navarro for fruitful discussions.

References

  • [1] A. Amir, A. Apostolico, G. M. Landau, and G. Satta. Efficient text fingerprinting via Parikh mapping. J. Discrete Algorithms, 1(5-6):409–421, 2003.
  • [2] A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.
  • [3] G. Benson. Composition alignment. In Proc. of the 3rd International Workshop on Algorithms in Bioinformatics (WABI’03), pages 447–461, 2003.
  • [4] S. Böcker. Sequencing from compomers: Using mass spectrometry for DNA de novo sequencing of 200+ nt. Journal of Computational Biology, 11(6):1110–1134, 2004.
  • [5] S. Böcker. Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry. Bioinformatics, 23(2):5–12, 2007.
  • [6] R. S. Boyer and J. S. Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.
  • [7] A. Butman, R. Eres, and G. M. Landau. Scaled and permuted string matching. Inf. Process. Lett., 92(6):293–297, 2004.
  • [8] M. Cieliebak, T. Erlebach, Zs. Lipták, J. Stoye, and E. Welzl. Algorithmic complexity of protein identification: combinatorics of weighted strings. Discrete Applied Mathematics, 137(1):27–46, 2004.
  • [9] D. Clark. Compact pat trees. PhD thesis, University of Waterloo, Canada, 1996.
  • [10] R. Eres, G. M. Landau, and L. Parida. Permutation pattern discovery in biosequences. Journal of Computational Biology, 11(6):1050–1060, 2004.
  • [11] W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1968.
  • [12] Website. http://www.ncbi.nlm.nih.gov/Genbank/.
  • [13] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In SODA, pages 841–850, 2003.
  • [14] R. N. Horspool. Practical fast searching in strings. Softw., Pract. Exper., 10(6):501–506, 1980.
  • [15] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software Practice and Experience, 26(12):1439–1458, 1996.
  • [16] S. Jukna. Extremal Combinatorics. Springer, 1998.
  • [17] D. E. Knuth, J. H. Morris Jr., and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
  • [18] M. Lothaire. Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications). Cambridge University Press, New York, NY, USA, 2005.
  • [19] R. May. Coupon collecting with quotas. Electr. J. Comb., 15, 2008.
  • [20] J. I. Munro. Tables. In Proc. of Foundations of Software Technology and Theoretical Computer Science (FSTTCS’96), pages 37–42, 1996.
  • [21] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1), 2007.
  • [22] A. Salomaa. Counting (scattered) subwords. Bulletin of the European Association for Theoretical Computer Science (EATCS), 81:165–179, 2003.
  • [23] W. F. Smyth. Computing Patterns in Strings. Pearson Addison Wesley (UK), 2003.