
Online and Offline Algorithms for Counting Distinct Closed Factors via Sliding Suffix Trees

Takuya Mieno The University of Electro-Communications, Japan Shun Takahashi Hokkaido University, Japan Kazuhisa Seto Hokkaido University, Japan Takashi Horiyama Hokkaido University, Japan
Abstract

A string is said to be closed if its length is one, or if it has a non-empty factor that occurs both as a prefix and as a suffix of the string, but does not occur elsewhere. The notion of closed words was introduced by [Fici, WORDS 2011]. Recently, the maximum number of distinct closed factors occurring in a string was investigated by [Parshina and Puzynina, Theor. Comput. Sci. 2024], and an asymptotically tight bound was proved. In this paper, we propose two algorithms to count the distinct closed factors in a string $T$ of length $n$ over an alphabet of size $\sigma$. The first algorithm runs in $O(n\log\sigma)$ time using $O(n)$ space for string $T$ given in an online manner. The second algorithm runs in $O(n)$ time using $O(n)$ space for string $T$ given in an offline manner. Both algorithms utilize suffix trees for sliding windows.

1 Introduction

String processing is a fundamental area in computer science, with significant importance ranging from theoretical foundations to practical applications. One of the most active areas of this field is the study of repetitive structures within strings, which has driven advances in areas such as pattern matching algorithms [23, 20] and compressed string indices [9, 22, 24]. Understanding repetitive structures in strings is important for the advancement of information processing technology. For surveys on these topics, see [14, 34] and [30, 29]. The concept of closed words [18] is one such repetitive structure of strings. A string is said to be closed if its length is one, or if it has a non-empty factor that occurs both as a prefix and as a suffix of the string, but does not occur elsewhere (the notion of closed words is equivalent to those of return words [15, 21] and periodic-like words [13]). For example, the string $\mathtt{abaab}$ is closed because $\mathtt{ab}$ occurs both as a prefix and as a suffix, but does not occur elsewhere in $\mathtt{abaab}$. Closed words have been studied primarily in the field of combinatorics on finite and infinite words [37, 12, 18, 27, 5, 19, 32, 6, 31]. Regarding the number of closed factors (i.e., substrings) appearing in a string, an asymptotically tight bound $\Theta(n^2)$ on the maximum number of distinct closed factors of a string is known [5]. More recently, Parshina and Puzynina refined this bound to $\sim\frac{n^2}{6}$ in 2024 [31]. Despite this progress on the number of closed factors, to our knowledge there is no non-trivial algorithm for computing the exact number of distinct closed factors of a given string.

In this paper, we present both online and offline algorithms for counting the number of distinct closed factors of a given string $T$ of length $n$. The first counting algorithm is an online approach running in $O(n\log\sigma)$ time and $O(n)$ space, where $\sigma$ is the number of distinct characters in the string. The second counting algorithm is an offline approach running in $O(n)$ time and $O(n)$ space, assuming $T$ is drawn from an integer alphabet of size $n^{O(1)}$. We begin by characterizing the number of distinct closed factors of $T$ through the repeating suffixes of some prefixes and factors of $T$. Based on this characterization, we design an online algorithm that utilizes Ukkonen's online suffix tree construction [36], as well as suffix trees for a sliding window [17, 25, 33, 26]. We then design a linear-time offline algorithm by simulating sliding-window suffix trees within the static suffix tree of the entire string $T$. This simulation is of independent interest, as it has the potential to speed up sliding-window algorithms for strings in an offline setting. Furthermore, we explore the enumeration of (distinct) closed factors in a string and propose an algorithm that combines our counting method with a geometric data structure for handling points in the two-dimensional plane [11], resulting in a somewhat faster solution.

Related work.

Recent work has highlighted algorithmic advances in the study of closed factors and related problems over the past decade [8, 4, 28, 2, 1, 6, 7, 35]. The line of algorithmic research on closed factors was initiated by Badkobeh et al. [3, 4], who addressed various problems related to factorizing a string into a sequence of closed factors and proposed efficient algorithms for these tasks. In the domain of online string algorithms, Alzamel et al. [2] proposed an algorithm that computes closed factorizations in an online setting, and more recently, Sumiyoshi et al. [35] achieved a $\log\log n$-factor speedup in total execution time. Additionally, a new concept, maximal closed substrings (MCS), was introduced in [6], along with an $O(n\log n)$-time enumeration algorithm [7].

2 Preliminaries

2.1 Basic notations

Let $\Sigma$ be an ordered alphabet. An element of $\Sigma$ is called a character. An element of $\Sigma^{\star}$ is called a string. The empty string, denoted by $\varepsilon$, is the string of length 0. For a string $T\in\Sigma^{\star}$, the length of $T$ is denoted by $|T|$. If $T=xyz$ for some strings $x,y,z\in\Sigma^{\star}$, we call $x$, $y$, and $z$ a prefix, a factor, and a suffix of $T$, respectively. A string $b\neq T$ is said to be a border of $T$ if $b$ is both a prefix of $T$ and a suffix of $T$. We denote by $\mathit{bord}(T)$ the longest border of $T$. Note that the longest border always exists since $\varepsilon$ is a border of any string. For each $i$ with $1\leq i\leq|T|$, we denote by $T[i]$ the $i$th character of $T$. For each $i,j$ with $1\leq i\leq j\leq|T|$, we denote by $T[i..j]$ the factor of $T$ that starts at position $i$ and ends at position $j$. For convenience, we define $T[i..j]=\varepsilon$ for any $i>j$. For strings $T$ and $S$, the set $\mathit{occ}_{T}(S)=\{i\mid T[i..i+|S|-1]=S\}$ of integers is said to be the occurrences of $S$ in $T$. If $\mathit{occ}_{T}(S)\neq\emptyset$, we say that $S$ occurs in $T$ as a factor. A string $T$ is said to be closed if there is a border of $T$ that occurs exactly twice in $T$. If $|\mathit{occ}_{T}(S)|=1$, we say that $S$ is a unique factor of $T$. Also, if $|\mathit{occ}_{T}(S)|\geq 2$, we say that $S$ is a repeating factor of $T$. We denote by $\mathit{lrs}(T)$ the longest repeating suffix of string $T$. Further, we denote $\mathit{lrs}^{2}(T)=\mathit{lrs}(\mathit{lrs}(T))$.
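To keep these definitions concrete, the following is a minimal Python sketch of them. The function names mirror the notation above ($\mathit{occ}$, $\mathit{bord}$, $\mathit{lrs}$), but the code itself is ours and runs in naive polynomial time; it is purely illustrative. The closedness test treats a length-one string as closed, following the definition in the introduction, and otherwise checks only the longest border, since a border occurring exactly twice is necessarily the longest one.

```python
def occ(T, S):
    # occ_T(S): all starting positions (1-indexed) of S in T, overlaps included.
    return [i for i in range(1, len(T) - len(S) + 2)
            if T[i - 1:i - 1 + len(S)] == S]

def bord(T):
    # bord(T): the longest border of T, i.e., the longest proper prefix of T
    # that is also a suffix of T (possibly the empty string).
    for l in range(len(T) - 1, 0, -1):
        if T[:l] == T[len(T) - l:]:
            return T[:l]
    return ""

def is_closed(T):
    # T is closed iff |T| = 1, or some border of T occurs exactly twice in T
    # (such a border, if one exists, must be the longest border).
    if len(T) <= 1:
        return len(T) == 1
    b = bord(T)
    return b != "" and len(occ(T, b)) == 2

def lrs(T):
    # lrs(T): the longest repeating suffix of T (possibly empty).
    for l in range(len(T) - 1, 0, -1):
        if len(occ(T, T[len(T) - l:])) >= 2:
            return T[len(T) - l:]
    return ""

assert bord("abaab") == "ab" and lrs("abaab") == "ab"
assert is_closed("abaab") and not is_closed("abaa")
```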

In what follows, we fix a non-empty string $T$ of arbitrary length $n$ over an alphabet $\Sigma$ of size $\sigma$. This paper assumes the standard word RAM model with word size $\Omega(\log n)$.

2.2 Suffix trees

The most important tool of this paper is a suffix tree of string $T$ [38]. A suffix tree of $T$, denoted by $\mathsf{STree}(T)$, is a compact trie for the set of suffixes of $T$. Below, we summarize some known properties of suffix trees and define related notations:

  • Each edge of $\mathsf{STree}(T)$ is labeled with a factor of $T$ of length one or more.

  • Each internal node $u$, including the root, has at least two children, and the first characters of the labels of the outgoing edges of $u$ are mutually distinct (unless $T$ is a unary string).

  • For each node $v$ of $\mathsf{STree}(T)$, we denote by $\mathsf{str}_{T}(v)$ the string spelled out from the root to $v$. We also define $\mathsf{strlen}_{T}(u)=|\mathsf{str}_{T}(u)|$ for a node $u$.

  • There is a one-to-one correspondence between the leaves of $\mathsf{STree}(T)$ and the unique suffixes of $T$. More precisely, for each leaf $\ell$, the string $\mathsf{str}_{T}(\ell)$ equals some unique suffix of $T$, and vice versa. Note that repeating suffixes of $T$ are not always represented by a node in $\mathsf{STree}(T)$.

Throughout this paper, we assume for convenience that the last character of $T$ is unique. For each $1\leq i\leq n$, we denote by $\mathsf{leaf}_{T}(i)$ the leaf $\ell$ of $\mathsf{STree}(T)$ such that $\mathsf{str}_{T}(\ell)=T[i..n]$.

It is known that $\mathsf{STree}(T)$ for a given string $T$ can be constructed in $O(n)$ time [16] if $\Sigma$ is linearly sortable, i.e., any $n$ characters from $\Sigma$ can be sorted in $O(n)$ time. (A typical example of a linearly sortable alphabet is an integer alphabet $\{1,2,\ldots,n^{c}\}$ for some constant $c$: any $n$ characters from such an alphabet can be sorted in $O(n)$ time using radix sort.) Additionally, when the input string $T$ is given in an online manner, $\mathsf{STree}(T)$ can be constructed in $O(n\log\sigma)$ time using Ukkonen's algorithm [36]. In both algorithms, the resulting suffix trees are edge-sorted, meaning that the first characters of the labels of the outgoing edges of each node are (lexicographically) sorted. Thus, we assume that all suffix trees in this paper are edge-sorted.

Theorem 1 ([36]).

For incremental $i=1,2,\ldots,n$, we can maintain the suffix tree of $T[1..i]$ and the length of $\mathit{lrs}(T[1..i])$ in a total of $O(n\log\sigma)$ time.

In other words, for every $j=1,2,\ldots,n-1$, we can update $\mathsf{STree}(T[1..j])$ to $\mathsf{STree}(T[1..j+1])$ in amortized $O(\log\sigma)$ time. Ukkonen's algorithm maintains the active point in the suffix tree, which is the locus of the longest repeating suffix of the string. To maintain the active point efficiently, Ukkonen's algorithm uses auxiliary data structures called suffix links: each internal node $u$ of the suffix tree has a suffix link that points to the node $v$ such that $\mathsf{str}_{T}(v)$ is the suffix of $\mathsf{str}_{T}(u)$ of length $|\mathsf{str}_{T}(u)|-1$. See Fig. 1 for examples of a suffix tree and the related notations.

Figure 1: The suffix tree of string $T=\mathtt{babcab}$. Each leaf of the tree represents a suffix of $T$, with the integer inside each leaf indicating the starting position of that suffix; namely, the leaf labeled with number $i$ corresponds to $\mathsf{leaf}_{T}(i)$. In this suffix tree, $\mathsf{str}_{T}(u)=\mathtt{b}$, $\mathsf{strlen}_{T}(u)=1$, $\mathsf{str}_{T}(\mathsf{leaf}_{T}(3))=\mathtt{bcab}$, and $\mathsf{strlen}_{T}(\mathsf{leaf}_{T}(3))=4$. The star symbol indicates the locus of the active point, which represents the longest repeating suffix $\mathtt{ab}$ of $T$. The dotted arrows represent suffix links.

Furthermore, based on Ukkonen's algorithm, we can maintain suffix trees for a sliding window over $T$ in a total of $O(n\log\sigma)$ time, using space proportional to the window size:

Theorem 2 ([17, 25, 33, 26]).

Using $O(W)$ space, a suffix tree for a variable-width sliding window can be maintained in amortized $O(\log\sigma)$ time per operation, where an operation is either (1) appending a character to the right end of the window or (2) deleting the leftmost character from the window, and $W$ is the maximum width of the window. Additionally, the length of the longest repeating suffix of the window can be maintained within the same time complexity.

In other words, for every incremental pair of indices $i,j$ with $i\leq j$, we can update $\mathsf{STree}(T[i..j])$ to either $\mathsf{STree}(T[i..j+1])$ or $\mathsf{STree}(T[i+1..j])$ in amortized $O(\log\sigma)$ time. Later, we use the above algorithmic results as black boxes.

2.3 Weighted ancestor queries

For a node-weighted tree $\mathcal{T}$, where the weight of each non-root node is greater than the weight of its parent, a weighted ancestor query (WAQ) is defined as follows: given a node $u$ of the tree and an integer $t$, the query returns the farthest (highest) ancestor of $u$ whose weight is at least $t$. We denote by $\mathsf{WAQ}_{\mathcal{T}}(u,t)$ the node returned for weighted ancestor query $(u,t)$ on tree $\mathcal{T}$. The subscript $\mathcal{T}$ will be omitted when clear from the context. In general, a WAQ on a tree of size $N$ can be answered in $O(\log\log N)$ time, which is worst-case optimal with linear space. However, if the input is the suffix tree of some string of length $n$, and the weight function is $\mathsf{strlen}_{T}$ as defined in the previous subsection, any WAQ can be answered in $O(1)$ time after $O(n)$-time preprocessing [10].
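To make the query semantics concrete, the following is a naive Python sketch that answers a WAQ by simply walking towards the root; the parent/weight encoding, the convention that $u$ counts as its own ancestor, and the function name are ours, and the cited $O(\log\log N)$- and $O(1)$-time solutions are of course far more involved.

```python
def waq(parent, weight, u, t):
    # Return the farthest (closest to the root) ancestor of u, u itself included,
    # whose weight is at least t; return None if even u's weight is below t.
    answer = None
    v = u
    while v is not None:
        if weight[v] >= t:
            answer = v      # still qualifies; a higher ancestor may qualify too
            v = parent[v]   # weights decrease towards the root, so keep climbing
        else:
            break           # first ancestor below t: nothing higher can qualify
    return answer

# Toy example: a path root - a - b - c with strlen-like weights 0, 1, 3, 6.
parent = {"root": None, "a": "root", "b": "a", "c": "b"}
weight = {"root": 0, "a": 1, "b": 3, "c": 6}
assert waq(parent, weight, "c", 2) == "b"   # highest ancestor with weight >= 2
assert waq(parent, weight, "c", 4) == "c"
assert waq(parent, weight, "c", 7) is None
```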

3 Counting closed factors online

In this section, we propose an algorithm to count the distinct closed factors of a string given in an online manner. Since there may be $\Omega(n^2)$ distinct closed factors in a string of length $n$ [5, 31], enumerating them requires quadratic time in the worst case. However, for counting the closed factors, we can achieve subquadratic time even when the string is given in an online manner, as described below.

3.1 Changes in number of closed factors

First, we consider the changes in the number of distinct closed factors when a character is appended. Let $\mathcal{C}(T)$ be the set of distinct closed factors occurring in $T$. Let $d_j=|\mathcal{C}(T[1..j])|-|\mathcal{C}(T[1..j-1])|$ for $2\leq j\leq n$. For convenience, let $d_1=1$.

Observation 1.

For any closed suffixes $u$ and $v$ of the same string, $|\mathit{bord}(u)|\neq|\mathit{bord}(v)|$ holds if $u\neq v$.

Based on this observation, we count the distinct borders of closed factors instead of the closed factors themselves. We show the following:

Lemma 1.

For each $j$ with $1\leq j\leq n$, $d_j=|\mathit{lrs}(T[1..j])|-|\mathit{lrs}^{2}(T[1..j])|$ holds if $\mathit{lrs}(T[1..j])\neq\varepsilon$. Also, $d_j=1$ holds if $\mathit{lrs}(T[1..j])=\varepsilon$.

Proof.

The latter case is obvious since $\mathit{lrs}(T[1..j])=\varepsilon$ implies that $T[j]$ is a unique character in $T[1..j]$, making it the only new closed factor. In the following, we assume $\mathit{lrs}(T[1..j])\neq\varepsilon$ and denote $t_j=\mathit{lrs}(T[1..j])$ and $z_j=\mathit{lrs}^{2}(T[1..j])$. It can be observed that, for any new closed factor $u\in\mathcal{C}(T[1..j])\setminus\mathcal{C}(T[1..j-1])$, $u$ is a unique suffix of $T[1..j]$ and the longest border of $u$ repeats in $T[1..j]$. Equivalently, $|u|>|t_j|$ and $|\mathit{bord}(u)|\leq|t_j|$ hold. Additionally, $|u|>|t_j|$ implies $|\mathit{bord}(u)|>|z_j|$. Thus $|z_j|<|\mathit{bord}(u)|\leq|t_j|$, and hence $d_j\leq|t_j|-|z_j|$ by Observation 1. Conversely, for any suffix $b$ of $T[1..j]$ satisfying $|z_j|<|b|\leq|t_j|$, there exists exactly one closed suffix whose longest border is $b$, since $b$ is repeating in $T[1..j]$. Further, since $|b|>|z_j|$, this closed suffix is longer than $t_j$, and thus it is unique in $T[1..j]$. Therefore, $d_j\geq|t_j|-|z_j|$ holds. ∎

We note that the upper bound $d_j\leq|\mathit{lrs}(T[1..j])|-|\mathit{lrs}^{2}(T[1..j])|$ on $d_j$ was already claimed in [31]; however, the equality was not shown there. By the definition of $d_j$ and Lemma 1, the next corollary immediately follows:

Corollary 1.

The total number of distinct closed factors of $T$ is
$$\sum_{j\in J}|\mathit{lrs}(T[1..j])|\;-\;\sum_{j\in J}|\mathit{lrs}^{2}(T[1..j])|\;+\;n-|J|,$$
where $J=\{j\in[1,n]\mid\mathit{lrs}(T[1..j])\neq\varepsilon\}$.
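The following naive Python sketch (quadratic time, with its own compact copies of the helper functions so that it is self-contained) checks Corollary 1 against brute force on small strings. It is a reference implementation of the characterization only, not of the efficient algorithms developed below; all function names are ours.

```python
def occurrences(S, P):
    # All starting indices (0-based) of P in S, overlapping occurrences included.
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def lrs(S):
    # Longest repeating suffix of S (naive).
    for l in range(len(S) - 1, 0, -1):
        if len(occurrences(S, S[len(S) - l:])) >= 2:
            return S[len(S) - l:]
    return ""

def is_closed(S):
    # Closed: |S| = 1, or the longest border of S occurs exactly twice in S.
    if len(S) <= 1:
        return len(S) == 1
    for l in range(len(S) - 1, 0, -1):
        if S[:l] == S[-l:]:
            return len(occurrences(S, S[:l])) == 2
    return False

def count_closed_by_corollary_1(T):
    total = 0
    for j in range(1, len(T) + 1):
        t = lrs(T[:j])                                  # t_j = lrs(T[1..j])
        total += (len(t) - len(lrs(t))) if t else 1     # d_j, by Lemma 1
    return total

def count_closed_brute_force(T):
    return len({T[i:j] for i in range(len(T))
                       for j in range(i + 1, len(T) + 1) if is_closed(T[i:j])})

for T in ["aa", "abaab", "banana", "abcabc"]:
    assert count_closed_by_corollary_1(T) == count_closed_brute_force(T)
```

For instance, for $T=\mathtt{abaab}$ both functions return 6, corresponding to the distinct closed factors $\mathtt{a}$, $\mathtt{b}$, $\mathtt{aa}$, $\mathtt{aba}$, $\mathtt{baab}$, and $\mathtt{abaab}$.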

3.2 Online algorithm

By Theorem 1, we can compute $\sum_{j\in J}|\mathit{lrs}(T[1..j])|$ in $O(n\log\sigma)$ time online. Thus, by Corollary 1, our remaining task is to compute $\sum_{j\in J}|\mathit{lrs}^{2}(T[1..j])|$ online, or equivalently, to compute the sequence $\mathcal{Z}=(|z_1|,|z_2|,\ldots,|z_n|)$ where $z_j=\mathit{lrs}^{2}(T[1..j])$ for each $j$. To compute this sequence, we use the sliding suffix trees of Theorem 2. It is known that the starting position of $\mathit{lrs}(T[1..j])$ is not smaller than that of $\mathit{lrs}(T[1..j-1])$. Namely, starting from $t_1=\mathit{lrs}(T[1..1])=\varepsilon$, the sequence $\langle t_1,t_2,\ldots,t_n\rangle$ of factors of $T$, where $t_j=\mathit{lrs}(T[1..j])$ for each $j$, can be obtained by $O(n)$ sliding operations on $T$, each of which either (1) appends a character to the right or (2) deletes the leftmost character of the current factor. Thus, by Theorem 2, we can obtain every $|z_j|\in\mathcal{Z}$ in amortized $O(\log\sigma)$ time by appropriately applying sliding operations so as to reach the windows $t_j$ for all $j\in J$ in order. Finally, we obtain the following:

Theorem 3.

For a string $T$ of length $n$ given in an online manner, we can count the distinct closed factors of $T$ in $O(n\log\sigma)$ time using $O(n)$ space, where $\sigma$ is the alphabet size.
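The following Python sketch illustrates the window schedule behind Theorem 3. Here the window is driven by plain index arithmetic, and a naive $\mathit{lrs}$-length computation stands in for the suffix trees of Theorems 1 and 2, which would report the same lengths in amortized $O(\log\sigma)$ time per operation; the function names are ours and the code is purely illustrative.

```python
def occurrences(S, P):
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def lrs_len(S):
    # Length of the longest repeating suffix of S (naive stand-in).
    for l in range(len(S) - 1, 0, -1):
        if len(occurrences(S, S[len(S) - l:])) >= 2:
            return l
    return 0

def z_lengths(T):
    n = len(T)
    z = []              # z[j-1] will hold |lrs^2(T[1..j])| = |lrs(t_j)|
    i = 1               # the window is T[i..j] (1-indexed); initially empty
    ops = 0
    for j in range(1, n + 1):
        ops += 1                        # operation (1): append T[j] to the window
        t_len = lrs_len(T[:j])          # |t_j|; maintained online by Theorem 1
        while j - i + 1 > t_len:        # |t_j| <= |t_{j-1}| + 1, so we only shrink
            i += 1                      # operation (2): delete the leftmost character
            ops += 1
        # now the window T[i..j] equals t_j = lrs(T[1..j])
        z.append(lrs_len(T[i - 1:j]))   # |z_j|; reported online by Theorem 2
    assert ops <= 2 * n                 # O(n) sliding operations in total
    return z

print(z_lengths("abaab"))    # [0, 0, 0, 0, 0]
print(z_lengths("banana"))   # [0, 0, 0, 0, 0, 1]
```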

4 Counting closed factors offline: shaving log factor

In this section, we assume that the alphabet is linearly sortable, allowing us to construct $\mathsf{STree}(T)$ in $O(n)$ time offline [16]. Our goal is to remove the $\log\sigma$ factor from the complexity in Theorem 3, which arises from using dynamic predecessor dictionaries (e.g., AVL trees) at non-leaf nodes. To shave this logarithmic factor, we do not rely on such dynamic dictionaries. Instead, we use a WAQ data structure constructed over $\mathsf{STree}(T)$.

Simulating a sliding suffix tree in linear time.

The idea is to simulate the online algorithm from Section 3 using the suffix tree of the entire string $T$. In the algorithms underlying Theorems 1 and 2, the topology of the suffix tree and the locus of the active point change step by step. The changes include (i) moving the active point, (ii) adding a node, an edge, or a suffix link, and (iii) removing a node, an edge, or a suffix link.

Our data structure consists of two suffix trees:

  • (1) The suffix tree $\mathsf{STree}(T)$ of $T$, enhanced with a WAQ data structure.

  • (2) The suffix tree $\mathsf{STree}(w)$ of the factor $w=T[i..j]$, which represents a sliding window.

Note that $\mathsf{STree}(w)$ includes the active point. All nodes and edges of $\mathsf{STree}(w)$ are connected to their corresponding nodes and edges in $\mathsf{STree}(T)$. Specifically, an internal node $u$ of $\mathsf{STree}(w)$ is connected to the node $\tilde{u}$ of $\mathsf{STree}(T)$ such that $\mathsf{str}_{w}(u)=\mathsf{str}_{T}(\tilde{u})$. Similarly, an edge $(u,v)$ of $\mathsf{STree}(w)$ is connected to the edge $(\tilde{u},v')$ of $\mathsf{STree}(T)$ whose label starts with the same first character. Such a node $\tilde{u}$ and edge $(\tilde{u},v')$ always exist in $\mathsf{STree}(T)$: an internal node $u$ implies that, for some distinct characters $c_1$ and $c_2$, both $\mathsf{str}_{w}(u)c_1$ and $\mathsf{str}_{w}(u)c_2$ occur in $w$, and thus also in $T$. Furthermore, a leaf $\ell$ of $\mathsf{STree}(w)$ representing a suffix $T[k..j]$ of $w$ is connected to the leaf of $\mathsf{STree}(T)$ representing the suffix $T[k..n]$ of $T$. Note that the suffix links of $\mathsf{STree}(w)$ do not connect to $\mathsf{STree}(T)$ and are maintained within $\mathsf{STree}(w)$. See Fig. 2 for an illustration.

Figure 2: The suffix tree of string $T=\mathtt{ababcabcac\$}$ is shown on the left. The suffix tree of $T[2..7]=\mathtt{babcab}$, which is the same as the one in Fig. 1, is shown on the right. Suffix links are omitted in this figure. In the tree on the left, each node and edge enclosed in a bold line is connected to a corresponding one in the tree on the right. For clarity, those connections are not drawn.

We maintain the active point in $\mathsf{STree}(w)$ as follows:

  • When the active point is on an edge and is to move down: if the character on the edge following the active point matches the next character $T[j+1]$, the active point simply moves down. Otherwise, the active point cannot move down, so we create a new branching node $u$ at its locus and add a new leaf $\ell$ and a new edge $e=(u,\ell)$. The new leaf and edge are connected to their corresponding leaf and edge $\tilde{e}$ of $\mathsf{STree}(T)$ as in the discussion above. The edge $\tilde{e}$ of $\mathsf{STree}(T)$ can be found in constant time using a weighted ancestor query on $\mathsf{STree}(T)$.

  • Similarly, when the active point is at a node $u$ and is to move down: if $u$ represents $T[i'..j]$, we query $\mathsf{WAQ}_{\mathsf{STree}(T)}(\mathsf{leaf}_{T}(i'),j-i'+2)$ and check the edge of $\mathsf{STree}(T)$ entering the returned node. This edge of $\mathsf{STree}(T)$ is already connected to an edge of the smaller suffix tree if and only if $u$ has an outgoing edge whose label starts with $T[j+1]$. If the active point can move down along such an existing edge, we simply move it onto that edge (via its corresponding edge in $\mathsf{STree}(T)$). Otherwise, we create a new leaf $\ell$ and a new edge $e=(u,\ell)$, connecting them to their corresponding leaf and edge of $\mathsf{STree}(T)$ as described above.

In Ukkonen's algorithm, a new node or edge is created only when the active point reaches it, so every node/edge creation can be simulated as described. Node/edge deletions are straightforward: we simply disconnect the node/edge from $\mathsf{STree}(T)$ and remove it. Therefore, we can simulate the sliding suffix tree over $T$ in amortized $O(1)$ time per sliding operation. We have shown the next theorem:

Theorem 4.

Given a string $T$ of length $n$ over a linearly sortable alphabet in an offline manner, we can simulate a suffix tree for a sliding window over $T$ in a total of $O(n)$ time.

By Theorem 4 and the algorithm described in Section 3, we obtain the following:

Corollary 2.

We can count the distinct closed factors in $T$ over a linearly sortable alphabet in $O(n)$ time using $O(n)$ space.

5 Enumerating closed factors offline

In this section, we discuss the enumeration of (distinct) closed factors of an offline string $T$. We first consider the open-close array $\mathsf{OC}_{T}$ of $T$, which is the binary sequence of length $n$ such that $\mathsf{OC}_{T}[i]=1$ iff the prefix $T[1..i]$ is closed [28]. Since an open-close array can be computed in $O(n)$ time, we can enumerate all the occurrences of closed factors of $T$ in $\Theta(n^2)$ time by constructing $\mathsf{OC}_{T[i..n]}$ for all $1\leq i\leq n$. We then map the occurrences to their corresponding loci on $\mathsf{STree}(T)$ using constant-time WAQs $O(n^2)$ times. Thus, we have the following:

Proposition 1.

We can enumerate all the occurrences of closed factors of $T$ and the distinct closed factors of $T$ in $\Theta(n^2)$ time.
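As a concrete (and deliberately naive) illustration of Proposition 1, the following Python sketch computes the open-close array of every suffix directly from the definitions and collects all occurrences of closed factors; replacing the naive per-suffix construction with the linear-time one mentioned above yields the stated $\Theta(n^2)$ bound. All function names are ours, and the prefix-based $\mathsf{OC}$ convention follows the definition given in this section.

```python
def occurrences(S, P):
    return [k for k in range(len(S) - len(P) + 1) if S.startswith(P, k)]

def is_closed(S):
    # Closed: |S| = 1, or the longest border of S occurs exactly twice in S.
    if len(S) <= 1:
        return len(S) == 1
    for l in range(len(S) - 1, 0, -1):
        if S[:l] == S[-l:]:                       # longest border of S
            return len(occurrences(S, S[:l])) == 2
    return False

def oc_array(S):
    # OC_S[k] = 1 iff the prefix S[1..k] is closed (1-indexed in the text).
    return [1 if is_closed(S[:k]) else 0 for k in range(1, len(S) + 1)]

def closed_factor_occurrences(T):
    # All pairs (i, j), 1-indexed and inclusive, such that T[i..j] is closed.
    result = []
    for i in range(1, len(T) + 1):
        oc = oc_array(T[i - 1:])                  # OC array of the suffix T[i..n]
        result.extend((i, i + k) for k, bit in enumerate(oc) if bit)
    return result

print(closed_factor_occurrences("abaab"))
# [(1, 1), (1, 3), (1, 5), (2, 2), (2, 5), (3, 3), (3, 4), (4, 4), (5, 5)]
```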

Next, we propose an alternative approach that can achieve subquadratic time when there are few closed factors to output. Using the algorithms described in Sections 3 and 4, we can enumerate the ending positions and the longest borders of the distinct closed factors in $O(n+\mathsf{output})$ time. For each such border $b$ of a closed factor, we need to determine the starting position of the closed factor by finding the nearest occurrence of $b$ to the left. Such occurrences can be computed efficiently by utilizing range predecessor queries over the list of (the integer labels of) the leaves of $\mathsf{STree}(T)$ as follows. Let $b=T[s..t]$ be a border of some closed factor $w$ such that $b$ occurs exactly twice in $w$. We first find the locus of $b$ in $\mathsf{STree}(T)$ by a WAQ. Let $v$ be the nearest descendant node of that locus (or the locus itself if it is a node). Then, the leaves in the subtree rooted at $v$ represent the occurrences of $b$ in $T$. Thus, the nearest occurrence of $b$ to the left equals the predecessor value of $s$ among the leaves under $v$. By applying Belazzougui and Puglisi's range predecessor data structure [11], we obtain the following:

Proposition 2.

We can enumerate the distinct closed factors of $T$ in $O(n\sqrt{\log n}+\mathsf{output}\cdot\log^{\epsilon}n)$ time, where $\mathsf{output}\in O(n^2)$ is the number of distinct closed factors in $T$ and $\epsilon$ is a fixed constant with $0<\epsilon<1$.

6 Conclusions

In this paper, we proposed two algorithms for counting the distinct closed factors of a string. The first algorithm runs in $O(n\log\sigma)$ time using $O(n)$ space for a string given in an online manner. The second algorithm runs in $O(n)$ time and space for a static string over a linearly sortable alphabet. Additionally, we discussed how to enumerate the distinct closed factors, and showed a $\Theta(n^2)$-time algorithm using open-close arrays, as well as an $O(n\sqrt{\log n}+\mathsf{output}\cdot\log^{\epsilon}n)$-time algorithm that combines our counting algorithm with a range predecessor data structure. Since there can be $\Omega(n^2)$ distinct closed factors in a string [5, 31], the $\mathsf{output}\cdot\log^{\epsilon}n$ term can be superquadratic in the worst case. This leads to the following open question: for the enumeration problem, can we achieve $O(n\,\mathsf{polylog}(n)+\mathsf{output})$ time, which is linear in the output size?

References

  • [1] Hayam Alamro, Mai Alzamel, Costas S. Iliopoulos, Solon P. Pissis, Wing-Kin Sung, and Steven Watts. Efficient identification of k-closed strings. Int. J. Found. Comput. Sci., 31(5):595–610, 2020.
  • [2] Mai Alzamel, Costas S. Iliopoulos, W. F. Smyth, and Wing-Kin Sung. Off-line and on-line algorithms for closed string factorization. Theor. Comput. Sci., 792:12–19, 2019.
  • [3] Golnaz Badkobeh, Hideo Bannai, Keisuke Goto, Tomohiro I, Costas S. Iliopoulos, Shunsuke Inenaga, Simon J. Puglisi, and Shiho Sugimoto. Closed factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, September 1-3, 2014, pages 162–168. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2014.
  • [4] Golnaz Badkobeh, Hideo Bannai, Keisuke Goto, Tomohiro I, Costas S. Iliopoulos, Shunsuke Inenaga, Simon J. Puglisi, and Shiho Sugimoto. Closed factorization. Discret. Appl. Math., 212:23–29, 2016.
  • [5] Golnaz Badkobeh, Gabriele Fici, and Zsuzsanna Lipták. On the number of closed factors in a word. In Language and Automata Theory and Applications - 9th International Conference, LATA 2015, Nice, France, March 2-6, 2015, Proceedings, volume 8977 of Lecture Notes in Computer Science, pages 381–390. Springer, 2015.
  • [6] Golnaz Badkobeh, Alessandro De Luca, Gabriele Fici, and Simon J. Puglisi. Maximal closed substrings. In String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science, pages 16–23. Springer, 2022.
  • [7] Golnaz Badkobeh, Alessandro De Luca, Gabriele Fici, and Simon J. Puglisi. Maximal closed substrings. CoRR, abs/2209.00271, 2022.
  • [8] Hideo Bannai, Shunsuke Inenaga, Tomasz Kociumaka, Arnaud Lefebvre, Jakub Radoszewski, Wojciech Rytter, Shiho Sugimoto, and Tomasz Walen. Efficient algorithms for longest closed factor array. In String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings, volume 9309 of Lecture Notes in Computer Science, pages 95–102. Springer, 2015.
  • [9] Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees. J. Comput. Syst. Sci., 117:1–22, 2021.
  • [10] Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted ancestors in suffix trees revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 8:1–8:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [11] Djamal Belazzougui and Simon J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 2053–2071. SIAM, 2016.
  • [12] Michelangelo Bucci, Aldo de Luca, and Alessandro De Luca. Rich and periodic-like words. In Developments in Language Theory, 13th International Conference, DLT 2009, Stuttgart, Germany, June 30 - July 3, 2009. Proceedings, volume 5583 of Lecture Notes in Computer Science, pages 145–155. Springer, 2009.
  • [13] Arturo Carpi and Aldo de Luca. Periodic-like words, periodicity, and boxes. Acta Informatica, 37(8):597–618, 2001.
  • [14] Maxime Crochemore, Lucian Ilie, and Wojciech Rytter. Repetitions in strings: Algorithms and combinatorics. Theor. Comput. Sci., 410(50):5227–5235, 2009.
  • [15] Fabien Durand. A characterization of substitutive sequences using return words. Discret. Math., 179(1-3):89–101, 1998.
  • [16] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997.
  • [17] Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Commun. ACM, 32(4):490–505, 1989.
  • [18] Gabriele Fici. A classification of trapezoidal words. In Proceedings 8th International Conference Words 2011, Prague, Czech Republic, 12-16th September 2011, volume 63 of EPTCS, pages 129–137, 2011.
  • [19] Gabriele Fici. Open and closed words. Bulletin of EATCS, (123), 2017.
  • [20] Zvi Galil and Joel I. Seiferas. Time-space-optimal string matching. J. Comput. Syst. Sci., 26(3):280–294, 1983.
  • [21] Amy Glen, Jacques Justin, Steve Widmer, and Luca Q. Zamboni. Palindromic richness. Eur. J. Comb., 30(2):510–531, 2009.
  • [22] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018.
  • [23] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977.
  • [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in $\delta$-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024.
  • [25] N. Jesper Larsson. Extended application of suffix trees to data compression. In Proceedings of the 6th Data Compression Conference (DCC ’96), Snowbird, Utah, USA, March 31 - April 3, 1996, pages 190–199. IEEE Computer Society, 1996.
  • [26] Laurentius Leonard, Shunsuke Inenaga, Hideo Bannai, and Takuya Mieno. Constant-time edge label and leaf pointer maintenance on sliding suffix trees. CoRR, abs/2307.01412, 2024.
  • [27] Alessandro De Luca and Gabriele Fici. Open and closed prefixes of Sturmian words. In Combinatorics on Words - 9th International Conference, WORDS 2013, Turku, Finland, September 16-20, 2013, Proceedings, volume 8079 of Lecture Notes in Computer Science, pages 132–142. Springer, 2013.
  • [28] Alessandro De Luca, Gabriele Fici, and Luca Q. Zamboni. The sequence of open and closed prefixes of a Sturmian word. Adv. Appl. Math., 90:27–45, 2017.
  • [29] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022.
  • [30] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022.
  • [31] Olga G. Parshina and Svetlana Puzynina. Finite and infinite closed-rich words. Theor. Comput. Sci., 984:114315, 2024.
  • [32] Olga G. Parshina and Luca Q. Zamboni. Open and closed factors in Arnoux-Rauzy words. Adv. Appl. Math., 107:22–31, 2019.
  • [33] Martin Senft. Suffix tree for a sliding window: An overview. In 14th Annual Conference of Doctoral Students - WDS 2005, pages 41–46. Matfyzpress, 2005.
  • [34] W. F. Smyth. Computing regularities in strings: A survey. Eur. J. Comb., 34(1):3–14, 2013.
  • [35] Wataru Sumiyoshi, Takuya Mieno, and Shunsuke Inenaga. Faster and simpler online/sliding rightmost Lempel-Ziv factorizations. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, volume 14899 of Lecture Notes in Computer Science, pages 321–335. Springer, 2024.
  • [36] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
  • [37] Laurent Vuillon. A characterization of Sturmian words by return words. Eur. J. Comb., 22(2):263–275, 2001.
  • [38] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973.