
Online and Offline Algorithms for Counting Distinct Closed Factors via Sliding Suffix Trees

Takuya Mieno The University of Electro-Communications, Japan Shun Takahashi Hokkaido University, Japan Kazuhisa Seto Hokkaido University, Japan Takashi Horiyama Hokkaido University, Japan
Abstract

A string is said to be closed if its length is one, or if it has a non-empty factor that occurs both as a prefix and as a suffix of the string, but does not occur elsewhere. The notion of closed words was introduced by [Fici, WORDS 2011]. Recently, the maximum number of distinct closed factors occurring in a string was investigated by [Parshina and Puzynina, Theor. Comput. Sci. 2024], and an asymptotically tight bound was proved. In this paper, we propose two algorithms to count the distinct closed factors in a string $T$ of length $n$ over an alphabet of size $\sigma$. The first algorithm runs in $O(n\log\sigma)$ time using $O(n)$ space for string $T$ given in an online manner. The second algorithm runs in $O(n)$ time using $O(n)$ space for string $T$ given in an offline manner. Both algorithms utilize suffix trees for sliding windows.

1 Introduction

String processing is a fundamental area in computer science, with significant importance ranging from theoretical foundations to practical applications. One of the most active areas of this field is the study of repetitive structures within strings, which has driven advances in areas such as pattern matching algorithms [23, 20] and compressed string indices [9, 22, 24]. Understanding repetitive structures in strings is important for the advancement of information processing technology. For surveys on these topics, see [14, 34] and [30, 29]. The concept of closed words [18] is one such repetitive structure of strings. A string is said to be closed if its length is one, or if it has a non-empty factor that occurs both as a prefix and as a suffix of the string, but does not occur elsewhere (the notion of closed words is equivalent to those of return words [15, 21] and periodic-like words [13]). For example, the string $\mathtt{abaab}$ is closed because $\mathtt{ab}$ occurs both as a prefix and as a suffix, but does not occur elsewhere in $\mathtt{abaab}$. Closed words have been studied primarily in the field of combinatorics on finite and infinite words [37, 12, 18, 27, 5, 19, 32, 6, 31]. Regarding the number of closed factors (i.e., substrings) appearing in a string, an asymptotically tight bound $\Theta(n^2)$ on the maximum number of distinct closed factors of a string is known [5]. More recently, Parshina and Puzynina refined this bound to $\sim\frac{n^2}{6}$ in 2024 [31]. Despite this progress on the number of closed factors, to our knowledge there is no non-trivial algorithm for computing the exact number of distinct closed factors of a given string.

In this paper, we present both online and offline algorithms for counting the number of distinct closed factors of a given string $T$ of length $n$. The first counting algorithm is an online approach running in $O(n\log\sigma)$ time and $O(n)$ space, where $\sigma$ is the number of distinct characters in the string. The second counting algorithm is an offline approach running in $O(n)$ time and $O(n)$ space, assuming $T$ is drawn from an integer alphabet of size $n^{O(1)}$. We begin by characterizing the number of distinct closed factors of $T$ through the repeating suffixes of some prefixes and factors of $T$. Based on this characterization, we design an online algorithm that utilizes Ukkonen's online suffix tree construction [36], as well as suffix trees for a sliding window [17, 25, 33, 26]. We then design a linear-time offline algorithm by simulating sliding-window suffix trees within the static suffix tree of the entire string $T$. This simulation is of independent interest, as it has the potential to speed up sliding-window algorithms for strings in an offline setting. Furthermore, we explore the enumeration of (distinct) closed factors in a string and propose an algorithm that combines our counting method with a geometric data structure for handling points in the two-dimensional plane [11], resulting in a somewhat faster solution.

Related work.

Recent work has highlighted algorithmic advances in the study of closed factors and related problems over the past decade [8, 4, 28, 2, 1, 6, 7, 35]. The line of algorithmic research on closed factors was initiated by Badkobeh et al. [3, 4], who addressed various problems related to factorizing a string into a sequence of closed factors and proposed efficient algorithms for these tasks. In the domain of online string algorithms, Alzamel et al. [2] proposed an algorithm that computes closed factorizations in an online setting, and more recently, Sumiyoshi et al. [35] achieved a $\log\log n$-factor speedup in total execution time. Additionally, a new concept, maximal closed substrings (MCS), was introduced in [6], along with an $O(n\log n)$-time enumeration algorithm [7].

2 Preliminaries

2.1 Basic notations

Let $\Sigma$ be an ordered alphabet. An element of $\Sigma$ is called a character. An element of $\Sigma^{\star}$ is called a string. The empty string, denoted by $\varepsilon$, is the string of length 0. For a string $T\in\Sigma^{\star}$, the length of $T$ is denoted by $|T|$. If $T=xyz$ for some strings $x,y,z\in\Sigma^{\star}$, we call $x$, $y$, and $z$ a prefix, a factor, and a suffix of $T$, respectively. A string $b\neq T$ is said to be a border of $T$ if $b$ is both a prefix of $T$ and a suffix of $T$. We denote by $\mathit{bord}(T)$ the longest border of $T$. Note that the longest border always exists since $\varepsilon$ is a border of any string. For each $i$ with $1\leq i\leq|T|$, we denote by $T[i]$ the $i$th character of $T$. For each $i,j$ with $1\leq i\leq j\leq|T|$, we denote by $T[i..j]$ the factor of $T$ that starts at position $i$ and ends at position $j$. For convenience, we define $T[i..j]=\varepsilon$ for any $i>j$. For strings $T$ and $S$, the set $\mathit{occ}_{T}(S)=\{i\mid T[i..i+|S|-1]=S\}$ of integers is said to be the occurrences of $S$ in $T$. If $\mathit{occ}_{T}(S)\neq\emptyset$, we say that $S$ occurs in $T$ as a factor. A string $T$ is said to be closed if there is a border of $T$ that occurs exactly twice in $T$. If $|\mathit{occ}_{T}(S)|=1$, we say that $S$ is a unique factor of $T$. Also, if $|\mathit{occ}_{T}(S)|\geq 2$, we say that $S$ is a repeating factor of $T$. We denote by $\mathit{lrs}(T)$ the longest repeating suffix of string $T$. Further, we denote $\mathit{lrs}^{2}(T)=\mathit{lrs}(\mathit{lrs}(T))$.
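To keep these definitions concrete, the following is a minimal Python sketch of them. The function names mirror the notation above ($\mathit{occ}$, $\mathit{bord}$, $\mathit{lrs}$), but the code itself is ours and runs in naive polynomial time; it is purely illustrative. The closedness test treats a length-one string as closed, following the definition in the introduction, and otherwise checks only the longest border, since a border occurring exactly twice is necessarily the longest one.

```python
def occ(T, S):
    # occ_T(S): all starting positions (1-indexed) of S in T, overlaps included.
    return [i for i in range(1, len(T) - len(S) + 2)
            if T[i - 1:i - 1 + len(S)] == S]

def bord(T):
    # bord(T): the longest border of T, i.e., the longest proper prefix of T
    # that is also a suffix of T (possibly the empty string).
    for l in range(len(T) - 1, 0, -1):
        if T[:l] == T[len(T) - l:]:
            return T[:l]
    return ""

def is_closed(T):
    # T is closed iff |T| = 1, or some border of T occurs exactly twice in T
    # (such a border, if one exists, must be the longest border).
    if len(T) <= 1:
        return len(T) == 1
    b = bord(T)
    return b != "" and len(occ(T, b)) == 2

def lrs(T):
    # lrs(T): the longest repeating suffix of T (possibly empty).
    for l in range(len(T) - 1, 0, -1):
        if len(occ(T, T[len(T) - l:])) >= 2:
            return T[len(T) - l:]
    return ""

assert bord("abaab") == "ab" and lrs("abaab") == "ab"
assert is_closed("abaab") and not is_closed("abaa")
```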

In what follows, we fix a non-empty string $T$ of arbitrary length $n$ over an alphabet $\Sigma$ of size $\sigma$. This paper assumes the standard word RAM model with word size $\Omega(\log n)$.

2.2 Suffix trees

The most important tool of this paper is a suffix tree of string $T$ [38]. A suffix tree of $T$, denoted by $\mathsf{STree}(T)$, is a compact trie for the set of suffixes of $T$. Below, we summarize some known properties of suffix trees and define related notations:

  • Each edge of $\mathsf{STree}(T)$ is labeled with a factor of $T$ of length one or more.

  • Each internal node $u$, including the root, has at least two children, and the first characters of the labels of the outgoing edges of $u$ are mutually distinct (unless $T$ is a unary string).

  • For each node $v$ of $\mathsf{STree}(T)$, we denote by $\mathsf{str}_{T}(v)$ the string spelled out from the root to $v$. We also define $\mathsf{strlen}_{T}(u)=|\mathsf{str}_{T}(u)|$ for a node $u$.

  • There is a one-to-one correspondence between the leaves of $\mathsf{STree}(T)$ and the unique suffixes of $T$. More precisely, for each leaf $\ell$, the string $\mathsf{str}_{T}(\ell)$ equals some unique suffix of $T$, and vice versa. Note that repeating suffixes of $T$ are not always represented by a node in $\mathsf{STree}(T)$.

Throughout this paper, we assume for convenience that the last character of $T$ is unique. For each $1\leq i\leq n$, we denote by $\mathsf{leaf}_{T}(i)$ the leaf $\ell$ of $\mathsf{STree}(T)$ such that $\mathsf{str}_{T}(\ell)=T[i..n]$.

It is known that $\mathsf{STree}(T)$ for a given string $T$ can be constructed in $O(n)$ time [16] if $\Sigma$ is linearly sortable, i.e., any $n$ characters from $\Sigma$ can be sorted in $O(n)$ time. (A typical example of a linearly sortable alphabet is an integer alphabet $\{1,2,\ldots,n^{c}\}$ for some constant $c$: any $n$ characters from such an alphabet can be sorted in $O(n)$ time using radix sort.) Additionally, when the input string $T$ is given in an online manner, $\mathsf{STree}(T)$ can be constructed in $O(n\log\sigma)$ time using Ukkonen's algorithm [36]. In both algorithms, the resulting suffix trees are edge-sorted, meaning that the first characters of the labels of the outgoing edges of each node are (lexicographically) sorted. Thus, we assume that all suffix trees in this paper are edge-sorted.

Theorem 1 ([36]).

For incremental $i=1,2,\ldots,n$, we can maintain the suffix tree of $T[1..i]$ and the length of $\mathit{lrs}(T[1..i])$ in a total of $O(n\log\sigma)$ time.

In other words, for every $j=1,2,\ldots,n-1$, we can update $\mathsf{STree}(T[1..j])$ to $\mathsf{STree}(T[1..j+1])$ in amortized $O(\log\sigma)$ time. Ukkonen's algorithm maintains the active point in the suffix tree, which is the locus of the longest repeating suffix of the string. To maintain the active point efficiently, Ukkonen's algorithm uses auxiliary data structures called suffix links: each internal node $u$ of the suffix tree has a suffix link that points to the node $v$ such that $\mathsf{str}_{T}(v)$ is the suffix of $\mathsf{str}_{T}(u)$ of length $|\mathsf{str}_{T}(u)|-1$. See Fig. 1 for examples of a suffix tree and the related notations.

Figure 1: The suffix tree of string $T=\mathtt{babcab}$. Each leaf of the tree represents a suffix of $T$, with the integer inside each leaf indicating the starting position of that suffix; namely, the leaf labeled with number $i$ corresponds to $\mathsf{leaf}_{T}(i)$. In this suffix tree, $\mathsf{str}_{T}(u)=\mathtt{b}$, $\mathsf{strlen}_{T}(u)=1$, $\mathsf{str}_{T}(\mathsf{leaf}_{T}(3))=\mathtt{bcab}$, and $\mathsf{strlen}_{T}(\mathsf{leaf}_{T}(3))=4$. The star symbol indicates the locus of the active point, which represents the longest repeating suffix $\mathtt{ab}$ of $T$. The dotted arrows represent suffix links.

Furthermore, based on Ukkonen's algorithm, we can maintain suffix trees for a sliding window over $T$ in a total of $O(n\log\sigma)$ time, using space proportional to the window size:

Theorem 2 ([17, 25, 33, 26]).

Using $O(W)$ space, a suffix tree for a variable-width sliding window can be maintained in amortized $O(\log\sigma)$ time per operation, where an operation is either (1) appending a character to the right end of the window or (2) deleting the leftmost character from the window, and $W$ is the maximum width of the window. Additionally, the length of the longest repeating suffix of the window can be maintained within the same time complexity.

In other words, for every incremental pair of indices $i,j$ with $i\leq j$, we can update $\mathsf{STree}(T[i..j])$ to either $\mathsf{STree}(T[i..j+1])$ or $\mathsf{STree}(T[i+1..j])$ in amortized $O(\log\sigma)$ time. Later, we use the above algorithmic results as black boxes.

2.3 Weighted ancestor queries

For a node-weighted tree $\mathcal{T}$, where the weight of each non-root node is greater than the weight of its parent, a weighted ancestor query (WAQ) is defined as follows: given a node $u$ of the tree and an integer $t$, the query returns the farthest (highest) ancestor of $u$ whose weight is at least $t$. We denote by $\mathsf{WAQ}_{\mathcal{T}}(u,t)$ the node returned for weighted ancestor query $(u,t)$ on tree $\mathcal{T}$. The subscript $\mathcal{T}$ will be omitted when clear from the context. In general, a WAQ on a tree of size $N$ can be answered in $O(\log\log N)$ time, which is worst-case optimal with linear space. However, if the input is the suffix tree of some string of length $n$, and the weight function is $\mathsf{strlen}_{T}$ as defined in the previous subsection, any WAQ can be answered in $O(1)$ time after $O(n)$-time preprocessing [10].
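To make the query semantics concrete, the following is a naive Python sketch that answers a WAQ by simply walking towards the root; the parent/weight encoding, the convention that $u$ counts as its own ancestor, and the function name are ours, and the cited $O(\log\log N)$- and $O(1)$-time solutions are of course far more involved.

```python
def waq(parent, weight, u, t):
    # Return the farthest (closest to the root) ancestor of u, u itself included,
    # whose weight is at least t; return None if even u's weight is below t.
    answer = None
    v = u
    while v is not None:
        if weight[v] >= t:
            answer = v      # still qualifies; a higher ancestor may qualify too
            v = parent[v]   # weights decrease towards the root, so keep climbing
        else:
            break           # first ancestor below t: nothing higher can qualify
    return answer

# Toy example: a path root - a - b - c with strlen-like weights 0, 1, 3, 6.
parent = {"root": None, "a": "root", "b": "a", "c": "b"}
weight = {"root": 0, "a": 1, "b": 3, "c": 6}
assert waq(parent, weight, "c", 2) == "b"   # highest ancestor with weight >= 2
assert waq(parent, weight, "c", 4) == "c"
assert waq(parent, weight, "c", 7) is None
```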

3 Counting closed factors online

In this section, we propose an algorithm to count the distinct closed factors of a string given in an online manner. Since there may be $\Omega(n^2)$ distinct closed factors in a string of length $n$ [5, 31], enumerating them requires quadratic time in the worst case. However, for counting the closed factors, we can achieve subquadratic time even when the string is given in an online manner, as described below.

3.1 Changes in number of closed factors

First, we consider the changes in the number of distinct closed factors when a character is appended. Let $\mathcal{C}(T)$ be the set of distinct closed factors occurring in $T$. Let $d_j=|\mathcal{C}(T[1..j])|-|\mathcal{C}(T[1..j-1])|$ for $2\leq j\leq n$. For convenience, let $d_1=1$.

Observation 1.

For any closed suffixes $u$ and $v$ of the same string, $|\mathit{bord}(u)|\neq|\mathit{bord}(v)|$ holds if $u\neq v$.

Based on this observation, we count the distinct borders of closed factors instead of the closed factors themselves. We show the following:

Lemma 1.

For each $j$ with $1\leq j\leq n$, $d_j=|\mathit{lrs}(T[1..j])|-|\mathit{lrs}^{2}(T[1..j])|$ holds if $\mathit{lrs}(T[1..j])\neq\varepsilon$. Also, $d_j=1$ holds if $\mathit{lrs}(T[1..j])=\varepsilon$.

Proof.

The latter case is obvious since $\mathit{lrs}(T[1..j])=\varepsilon$ implies that $T[j]$ is a unique character in $T[1..j]$, making it the only new closed factor. In the following, we assume $\mathit{lrs}(T[1..j])\neq\varepsilon$ and denote $t_j=\mathit{lrs}(T[1..j])$ and $z_j=\mathit{lrs}^{2}(T[1..j])$. It can be observed that, for any new closed factor $u\in\mathcal{C}(T[1..j])\setminus\mathcal{C}(T[1..j-1])$, $u$ is a unique suffix of $T[1..j]$ and the longest border of $u$ repeats in $T[1..j]$. Equivalently, $|u|>|t_j|$ and $|\mathit{bord}(u)|\leq|t_j|$ hold. Additionally, $|u|>|t_j|$ implies $|\mathit{bord}(u)|>|z_j|$. Thus $|z_j|<|\mathit{bord}(u)|\leq|t_j|$, and hence $d_j\leq|t_j|-|z_j|$ by Observation 1. Conversely, for any suffix $b$ of $T[1..j]$ satisfying $|z_j|<|b|\leq|t_j|$, there exists exactly one closed suffix whose longest border is $b$, since $b$ is repeating in $T[1..j]$. Further, since $|b|>|z_j|$, this closed suffix is longer than $t_j$, and thus it is unique in $T[1..j]$. Therefore, $d_j\geq|t_j|-|z_j|$ holds. ∎

We note that the upper bound $d_j\leq|\mathit{lrs}(T[1..j])|-|\mathit{lrs}^{2}(T[1..j])|$ on $d_j$ was already claimed in [31]; however, the equality was not shown there. By the definition of $d_j$ and Lemma 1, the next corollary immediately follows:

Corollary 1.

The total number of distinct closed factors of $T$ is
$$\sum_{j\in J}|\mathit{lrs}(T[1..j])|\;-\;\sum_{j\in J}|\mathit{lrs}^{2}(T[1..j])|\;+\;n-|J|,$$
where $J=\{j\in[1,n]\mid\mathit{lrs}(T[1..j])\neq\varepsilon\}$.
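The following naive Python sketch (quadratic time, with its own compact copies of the helper functions so that it is self-contained) checks Corollary 1 against brute force on small strings. It is a reference implementation of the characterization only, not of the efficient algorithms developed below; all function names are ours.

```python
def occurrences(S, P):
    # All starting indices (0-based) of P in S, overlapping occurrences included.
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def lrs(S):
    # Longest repeating suffix of S (naive).
    for l in range(len(S) - 1, 0, -1):
        if len(occurrences(S, S[len(S) - l:])) >= 2:
            return S[len(S) - l:]
    return ""

def is_closed(S):
    # Closed: |S| = 1, or the longest border of S occurs exactly twice in S.
    if len(S) <= 1:
        return len(S) == 1
    for l in range(len(S) - 1, 0, -1):
        if S[:l] == S[-l:]:
            return len(occurrences(S, S[:l])) == 2
    return False

def count_closed_by_corollary_1(T):
    total = 0
    for j in range(1, len(T) + 1):
        t = lrs(T[:j])                                  # t_j = lrs(T[1..j])
        total += (len(t) - len(lrs(t))) if t else 1     # d_j, by Lemma 1
    return total

def count_closed_brute_force(T):
    return len({T[i:j] for i in range(len(T))
                       for j in range(i + 1, len(T) + 1) if is_closed(T[i:j])})

for T in ["aa", "abaab", "banana", "abcabc"]:
    assert count_closed_by_corollary_1(T) == count_closed_brute_force(T)
```

For instance, for $T=\mathtt{abaab}$ both functions return 6, corresponding to the distinct closed factors $\mathtt{a}$, $\mathtt{b}$, $\mathtt{aa}$, $\mathtt{aba}$, $\mathtt{baab}$, and $\mathtt{abaab}$.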

3.2 Online algorithm

By Theorem 1, we can compute $\sum_{j\in J}|\mathit{lrs}(T[1..j])|$ in $O(n\log\sigma)$ time online. Thus, by Corollary 1, our remaining task is to compute $\sum_{j\in J}|\mathit{lrs}^{2}(T[1..j])|$ online, or equivalently, to compute the sequence $\mathcal{Z}=(|z_1|,|z_2|,\ldots,|z_n|)$ where $z_j=\mathit{lrs}^{2}(T[1..j])$ for each $j$. To compute this sequence, we use the sliding suffix trees of Theorem 2. It is known that the starting position of $\mathit{lrs}(T[1..j])$ is not smaller than that of $\mathit{lrs}(T[1..j-1])$. Namely, starting from $t_1=\mathit{lrs}(T[1..1])=\varepsilon$, the sequence $\langle t_1,t_2,\ldots,t_n\rangle$ of factors of $T$, where $t_j=\mathit{lrs}(T[1..j])$ for each $j$, can be obtained by $O(n)$ sliding operations on $T$, each of which either (1) appends a character to the right or (2) deletes the leftmost character of the current factor. Thus, by Theorem 2, we can obtain every $|z_j|\in\mathcal{Z}$ in amortized $O(\log\sigma)$ time by appropriately applying sliding operations so as to reach the windows $t_j$ for all $j\in J$ in order. Finally, we obtain the following:

Theorem 3.

For a string $T$ of length $n$ given in an online manner, we can count the distinct closed factors of $T$ in $O(n\log\sigma)$ time using $O(n)$ space, where $\sigma$ is the alphabet size.
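The following Python sketch illustrates the window schedule behind Theorem 3. Here the window is driven by plain index arithmetic, and a naive $\mathit{lrs}$-length computation stands in for the suffix trees of Theorems 1 and 2, which would report the same lengths in amortized $O(\log\sigma)$ time per operation; the function names are ours and the code is purely illustrative.

```python
def occurrences(S, P):
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def lrs_len(S):
    # Length of the longest repeating suffix of S (naive stand-in).
    for l in range(len(S) - 1, 0, -1):
        if len(occurrences(S, S[len(S) - l:])) >= 2:
            return l
    return 0

def z_lengths(T):
    n = len(T)
    z = []              # z[j-1] will hold |lrs^2(T[1..j])| = |lrs(t_j)|
    i = 1               # the window is T[i..j] (1-indexed); initially empty
    ops = 0
    for j in range(1, n + 1):
        ops += 1                        # operation (1): append T[j] to the window
        t_len = lrs_len(T[:j])          # |t_j|; maintained online by Theorem 1
        while j - i + 1 > t_len:        # |t_j| <= |t_{j-1}| + 1, so we only shrink
            i += 1                      # operation (2): delete the leftmost character
            ops += 1
        # now the window T[i..j] equals t_j = lrs(T[1..j])
        z.append(lrs_len(T[i - 1:j]))   # |z_j|; reported online by Theorem 2
    assert ops <= 2 * n                 # O(n) sliding operations in total
    return z

print(z_lengths("abaab"))    # [0, 0, 0, 0, 0]
print(z_lengths("banana"))   # [0, 0, 0, 0, 0, 1]
```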

4 Counting closed factors offline: shaving log factor

In this section, we assume that the alphabet is linearly sortable, allowing us to construct $\mathsf{STree}(T)$ in $O(n)$ time offline [16]. Our goal is to remove the $\log\sigma$ factor from the complexity in Theorem 3, which arises from using dynamic predecessor dictionaries (e.g., AVL trees) at non-leaf nodes. To shave this logarithmic factor, we do not rely on such dynamic dictionaries. Instead, we use a WAQ data structure constructed over $\mathsf{STree}(T)$.

Simulating a sliding suffix tree in linear time.

The idea is to simulate the online algorithm from Section 3 using the suffix tree of the entire string $T$. In the algorithms underlying Theorems 1 and 2, the topology of the suffix tree and the locus of the active point change step by step. The changes include (i) moving the active point, (ii) adding a node, an edge, or a suffix link, and (iii) removing a node, an edge, or a suffix link.

Our data structure consists of two suffix trees:

  • (1) The suffix tree $\mathsf{STree}(T)$ of $T$, enhanced with a WAQ data structure.

  • (2) The suffix tree $\mathsf{STree}(w)$ of the factor $w=T[i..j]$, which represents a sliding window.

Note that $\mathsf{STree}(w)$ includes the active point. All nodes and edges of $\mathsf{STree}(w)$ are connected to their corresponding nodes and edges in $\mathsf{STree}(T)$. Specifically, an internal node $u$ of $\mathsf{STree}(w)$ is connected to the node $\tilde{u}$ of $\mathsf{STree}(T)$ such that $\mathsf{str}_{w}(u)=\mathsf{str}_{T}(\tilde{u})$. Similarly, an edge $(u,v)$ of $\mathsf{STree}(w)$ is connected to the edge $(\tilde{u},v')$ of $\mathsf{STree}(T)$ whose label starts with the same first character. Such a node $\tilde{u}$ and edge $(\tilde{u},v')$ always exist in $\mathsf{STree}(T)$: an internal node $u$ implies that, for some distinct characters $c_1$ and $c_2$, both $\mathsf{str}_{w}(u)c_1$ and $\mathsf{str}_{w}(u)c_2$ occur in $w$, and thus also in $T$. Furthermore, a leaf $\ell$ of $\mathsf{STree}(w)$ representing a suffix $T[k..j]$ of $w$ is connected to the leaf of $\mathsf{STree}(T)$ representing the suffix $T[k..n]$ of $T$. Note that the suffix links of $\mathsf{STree}(w)$ do not connect to $\mathsf{STree}(T)$ and are maintained within $\mathsf{STree}(w)$. See Fig. 2 for an illustration.

Figure 2: The suffix tree of string $T=\mathtt{ababcabcac\$}$ is shown on the left. The suffix tree of $T[2..7]=\mathtt{babcab}$, which is the same as the one in Fig. 1, is shown on the right. Suffix links are omitted in this figure. In the tree on the left, each node and edge enclosed in a bold line is connected to a corresponding one in the tree on the right. For clarity, those connections are not drawn.

We maintain the active point in $\mathsf{STree}(w)$ as follows:

  • When the active point is on an edge and is to move down: if the character on the edge following the active point matches the next character $T[j+1]$, the active point simply moves down. Otherwise, the active point cannot move down, so we create a new branching node $u$ at its locus and add a new leaf $\ell$ and a new edge $e=(u,\ell)$. The new leaf and edge are connected to their corresponding leaf and edge $\tilde{e}$ of $\mathsf{STree}(T)$ as in the discussion above. The edge $\tilde{e}$ of $\mathsf{STree}(T)$ can be found in constant time using a weighted ancestor query on $\mathsf{STree}(T)$.

  • Similarly, when the active point is at a node $u$ and is to move down: if $u$ represents $T[i'..j]$, we query $\mathsf{WAQ}_{\mathsf{STree}(T)}(\mathsf{leaf}_{T}(i'),j-i'+2)$ and check the edge of $\mathsf{STree}(T)$ entering the returned node. This edge of $\mathsf{STree}(T)$ is already connected to an edge of the smaller suffix tree if and only if $u$ has an outgoing edge whose label starts with $T[j+1]$. If the active point can move down along such an existing edge, we simply move it onto that edge (via its corresponding edge in $\mathsf{STree}(T)$). Otherwise, we create a new leaf $\ell$ and a new edge $e=(u,\ell)$, connecting them to their corresponding leaf and edge of $\mathsf{STree}(T)$ as described above.

In Ukkonen's algorithm, a new node or edge is created only when the active point reaches it, so every node/edge creation can be simulated as described. Node/edge deletions are straightforward: we simply disconnect the node/edge from $\mathsf{STree}(T)$ and remove it. Therefore, we can simulate the sliding suffix tree over $T$ in amortized $O(1)$ time per sliding operation. We have shown the next theorem:

Theorem 4.

Given a string $T$ of length $n$ over a linearly sortable alphabet in an offline manner, we can simulate a suffix tree for a sliding window over $T$ in a total of $O(n)$ time.

By Theorem 4 and the algorithm described in Section 3, we obtain the following:

Corollary 2.

We can count the distinct closed factors in $T$ over a linearly sortable alphabet in $O(n)$ time using $O(n)$ space.

5 Enumerating closed factors offline

In this section, we discuss the enumeration of (distinct) closed factors of an offline string $T$. We first consider the open-close array $\mathsf{OC}_{T}$ of $T$, which is the binary sequence of length $n$ such that $\mathsf{OC}_{T}[i]=1$ iff the prefix $T[1..i]$ is closed [28]. Since an open-close array can be computed in $O(n)$ time, we can enumerate all the occurrences of closed factors of $T$ in $\Theta(n^2)$ time by constructing $\mathsf{OC}_{T[i..n]}$ for all $1\leq i\leq n$. We then map the occurrences to their corresponding loci on $\mathsf{STree}(T)$ using constant-time WAQs $O(n^2)$ times. Thus, we have the following:

Proposition 1.

We can enumerate all the occurrences of closed factors of $T$ and the distinct closed factors of $T$ in $\Theta(n^2)$ time.
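As a concrete (and deliberately naive) illustration of Proposition 1, the following Python sketch computes the open-close array of every suffix directly from the definitions and collects all occurrences of closed factors; replacing the naive per-suffix construction with the linear-time one mentioned above yields the stated $\Theta(n^2)$ bound. All function names are ours, and the prefix-based $\mathsf{OC}$ convention follows the definition given in this section.

```python
def occurrences(S, P):
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def is_closed(S):
    # Closed: |S| = 1, or the longest border of S occurs exactly twice in S.
    if len(S) <= 1:
        return len(S) == 1
    for l in range(len(S) - 1, 0, -1):
        if S[:l] == S[-l:]:                       # longest border of S
            return len(occurrences(S, S[:l])) == 2
    return False

def oc_array(S):
    # OC_S[k] = 1 iff the prefix S[1..k] is closed (1-indexed in the text).
    return [1 if is_closed(S[:k]) else 0 for k in range(1, len(S) + 1)]

def closed_factor_occurrences(T):
    # All pairs (i, j), 1-indexed and inclusive, such that T[i..j] is closed.
    result = []
    for i in range(1, len(T) + 1):
        oc = oc_array(T[i - 1:])                  # OC array of the suffix T[i..n]
        result.extend((i, i + k) for k, bit in enumerate(oc) if bit)
    return result

print(closed_factor_occurrences("abaab"))
# [(1, 1), (1, 3), (1, 5), (2, 2), (2, 5), (3, 3), (3, 4), (4, 4), (5, 5)]
```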

Next, we propose an alternative approach that can achieve subquadratic time when there are few closed factors to output. Using the algorithms described in Sections 3 and 4, we can enumerate the ending positions and the longest borders of the distinct closed factors in $O(n+\mathsf{output})$ time. For each such border $b$ of a closed factor, we need to determine the starting position of the closed factor by finding the nearest occurrence of $b$ to the left. Such occurrences can be computed efficiently by utilizing range predecessor queries over the list of (the integer labels of) the leaves of $\mathsf{STree}(T)$ as follows. Let $b=T[s..t]$ be a border of some closed factor $w$ such that $b$ occurs exactly twice in $w$. We first find the locus of $b$ in $\mathsf{STree}(T)$ by a WAQ. Let $v$ be the nearest descendant node of that locus (or the locus itself if it is a node). Then, the leaves in the subtree rooted at $v$ represent the occurrences of $b$ in $T$. Thus, the nearest occurrence of $b$ to the left equals the predecessor value of $s$ among the leaves under $v$. By applying Belazzougui and Puglisi's range predecessor data structure [11], we obtain the following:

Proposition 2.

We can enumerate the distinct closed factors of $T$ in $O(n\sqrt{\log n}+\mathsf{output}\cdot\log^{\epsilon}n)$ time, where $\mathsf{output}\in O(n^2)$ is the number of distinct closed factors in $T$ and $\epsilon$ is a fixed constant with $0<\epsilon<1$.

6 Conclusions

In this paper, we proposed two algorithms for counting the distinct closed factors of a string. The first algorithm runs in $O(n\log\sigma)$ time using $O(n)$ space for a string given in an online manner. The second algorithm runs in $O(n)$ time and space for a static string over a linearly sortable alphabet. Additionally, we discussed how to enumerate the distinct closed factors, and showed a $\Theta(n^2)$-time algorithm using open-close arrays, as well as an $O(n\sqrt{\log n}+\mathsf{output}\cdot\log^{\epsilon}n)$-time algorithm that combines our counting algorithm with a range predecessor data structure. Since there can be $\Omega(n^2)$ distinct closed factors in a string [5, 31], the $\mathsf{output}\cdot\log^{\epsilon}n$ term can be superquadratic in the worst case. This leads to the following open question: for the enumeration problem, can we achieve $O(n\,\mathsf{polylog}(n)+\mathsf{output})$ time, which is linear in the output size?

References

  • [1] Hayam Alamro, Mai Alzamel, Costas S. Iliopoulos, Solon P. Pissis, Wing-Kin Sung, and Steven Watts. Efficient identification of k-closed strings. Int. J. Found. Comput. Sci., 31(5):595–610, 2020.
  • [2] Mai Alzamel, Costas S. Iliopoulos, W. F. Smyth, and Wing-Kin Sung. Off-line and on-line algorithms for closed string factorization. Theor. Comput. Sci., 792:12–19, 2019.
  • [3] Golnaz Badkobeh, Hideo Bannai, Keisuke Goto, Tomohiro I, Costas S. Iliopoulos, Shunsuke Inenaga, Simon J. Puglisi, and Shiho Sugimoto. Closed factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, September 1-3, 2014, pages 162–168. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2014.
  • [4] Golnaz Badkobeh, Hideo Bannai, Keisuke Goto, Tomohiro I, Costas S. Iliopoulos, Shunsuke Inenaga, Simon J. Puglisi, and Shiho Sugimoto. Closed factorization. Discret. Appl. Math., 212:23–29, 2016.
  • [5] Golnaz Badkobeh, Gabriele Fici, and Zsuzsanna Lipták. On the number of closed factors in a word. In Language and Automata Theory and Applications - 9th International Conference, LATA 2015, Nice, France, March 2-6, 2015, Proceedings, volume 8977 of Lecture Notes in Computer Science, pages 381–390. Springer, 2015.
  • [6] Golnaz Badkobeh, Alessandro De Luca, Gabriele Fici, and Simon J. Puglisi. Maximal closed substrings. In String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science, pages 16–23. Springer, 2022.
  • [7] Golnaz Badkobeh, Alessandro De Luca, Gabriele Fici, and Simon J. Puglisi. Maximal closed substrings. CoRR, abs/2209.00271, 2022.
  • [8] Hideo Bannai, Shunsuke Inenaga, Tomasz Kociumaka, Arnaud Lefebvre, Jakub Radoszewski, Wojciech Rytter, Shiho Sugimoto, and Tomasz Walen. Efficient algorithms for longest closed factor array. In String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings, volume 9309 of Lecture Notes in Computer Science, pages 95–102. Springer, 2015.
  • [9] Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees. J. Comput. Syst. Sci., 117:1–22, 2021.
  • [10] Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted ancestors in suffix trees revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 8:1–8:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [11] Djamal Belazzougui and Simon J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 2053–2071. SIAM, 2016.
  • [12] Michelangelo Bucci, Aldo de Luca, and Alessandro De Luca. Rich and periodic-like words. In Developments in Language Theory, 13th International Conference, DLT 2009, Stuttgart, Germany, June 30 - July 3, 2009. Proceedings, volume 5583 of Lecture Notes in Computer Science, pages 145–155. Springer, 2009.
  • [13] Arturo Carpi and Aldo de Luca. Periodic-like words, periodicity, and boxes. Acta Informatica, 37(8):597–618, 2001.
  • [14] Maxime Crochemore, Lucian Ilie, and Wojciech Rytter. Repetitions in strings: Algorithms and combinatorics. Theor. Comput. Sci., 410(50):5227–5235, 2009.
  • [15] Fabien Durand. A characterization of substitutive sequences using return words. Discret. Math., 179(1-3):89–101, 1998.
  • [16] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997.
  • [17] Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Commun. ACM, 32(4):490–505, 1989.
  • [18] Gabriele Fici. A classification of trapezoidal words. In Proceedings 8th International Conference Words 2011, Prague, Czech Republic, 12-16th September 2011, volume 63 of EPTCS, pages 129–137, 2011.
  • [19] Gabriele Fici. Open and closed words. Bulletin of EATCS, (123), 2017.
  • [20] Zvi Galil and Joel I. Seiferas. Time-space-optimal string matching. J. Comput. Syst. Sci., 26(3):280–294, 1983.
  • [21] Amy Glen, Jacques Justin, Steve Widmer, and Luca Q. Zamboni. Palindromic richness. Eur. J. Comb., 30(2):510–531, 2009.
  • [22] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018.
  • [23] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
  • [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in $\delta$-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024.
  • [25] N. Jesper Larsson. Extended application of suffix trees to data compression. In Proceedings of the 6th Data Compression Conference (DCC ’96), Snowbird, Utah, USA, March 31 - April 3, 1996, pages 190–199. IEEE Computer Society, 1996.
  • [26] Laurentius Leonard, Shunsuke Inenaga, Hideo Bannai, and Takuya Mieno. Constant-time edge label and leaf pointer maintenance on sliding suffix trees. CoRR, abs/2307.01412, 2024.
  • [27] Alessandro De Luca and Gabriele Fici. Open and closed prefixes of Sturmian words. In Combinatorics on Words - 9th International Conference, WORDS 2013, Turku, Finland, September 16-20, 2013, Proceedings, volume 8079 of Lecture Notes in Computer Science, pages 132–142. Springer, 2013.
  • [28] Alessandro De Luca, Gabriele Fici, and Luca Q. Zamboni. The sequence of open and closed prefixes of a Sturmian word. Adv. Appl. Math., 90:27–45, 2017.
  • [29] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022.
  • [30] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022.
  • [31] Olga G. Parshina and Svetlana Puzynina. Finite and infinite closed-rich words. Theor. Comput. Sci., 984:114315, 2024.
  • [32] Olga G. Parshina and Luca Q. Zamboni. Open and closed factors in Arnoux-Rauzy words. Adv. Appl. Math., 107:22–31, 2019.
  • [33] Martin Senft. Suffix tree for a sliding window: An overview. In 14th Annual Conference of Doctoral Students - WDS 2005, pages 41–46. Matfyzpress, 2005.
  • [34] W. F. Smyth. Computing regularities in strings: A survey. Eur. J. Comb., 34(1):3–14, 2013.
  • [35] Wataru Sumiyoshi, Takuya Mieno, and Shunsuke Inenaga. Faster and simpler online/sliding rightmost Lempel-Ziv factorizations. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, volume 14899 of Lecture Notes in Computer Science, pages 321–335. Springer, 2024.
  • [36] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
  • [37] Laurent Vuillon. A characterization of Sturmian words by return words. Eur. J. Comb., 22(2):263–275, 2001.
  • [38] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973.