Lower Bounds and Improved Algorithms for Asymmetric Streaming Edit Distance and Longest Common Subsequence
In this paper, we study edit distance (ED) and longest common subsequence (LCS) in the asymmetric streaming model, introduced by Saks and Seshadhri [SS13]. As an intermediate model between the random access model and the streaming model, this model allows one to have streaming access to one string and random access to the other string. Meanwhile, ED and LCS are both fundamental problems that are often studied on large strings, thus the (asymmetric) streaming model is ideal for studying these problems.
Our first main contribution is a systematic study of space lower bounds for ED and LCS in the asymmetric streaming model. Previously, there were no explicitly stated results in this context, although some lower bounds about LCS can be inferred from the lower bounds for longest increasing subsequence (LIS) in [SW07, GG10, EJ08]. Yet these bounds only work for large alphabet size. In this paper, we develop several new techniques to handle ED in general and LCS for small alphabet size, thus establishing strong lower bounds for both problems. In particular, our lower bound for ED provides an exponential separation between edit distance and Hamming distance in the asymmetric streaming model. Our lower bounds also extend to LIS and longest non-decreasing subsequence (LNS) in the standard streaming model. Together with previous results, our bounds provide an almost complete picture for these two problems.
As our second main contribution, we give improved algorithms for ED and LCS in the asymmetric streaming model. For ED, we improve the space complexity of the constant factor approximation algorithms in [FHRS20, CJLZ20] from Õ(n^δ) to Õ(d^δ), where n is the length of each string and d is the edit distance between the two strings. For LCS, we give the first algorithm achieving a (1/2 + ε) approximation (in standard notation) with O(n^δ) space for any constant δ > 0, over a binary alphabet. Our work leaves a plethora of intriguing open questions, including establishing lower bounds and designing algorithms for a natural generalization of LIS and LNS, which we call longest non-decreasing subsequence with threshold (LNST).
1 Introduction
Edit distance (ED) and longest common subsequence (LCS) are two classical problems studied in the context of measuring similarities between two strings. Edit distance is defined as the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform one string into the other, while the longest common subsequence is defined as the longest string that appears as a subsequence in both strings. These two problems have found wide applications in areas such as bioinformatics, text and speech processing, compiler design, data analysis, image analysis, and so on. In turn, these applications have led to an extensive study of both problems.
With the era of information explosion, nowadays these two problems are often studied on very large strings. For example, in bioinformatics a human genome can be represented as a string with roughly 3 billion letters (base pairs). Such data poses a huge challenge to algorithms for ED and LCS, as the standard dynamic programming algorithms for these two problems need Θ(n²) time and space, where n is the length of each string. These bounds quickly become infeasible or too costly as n becomes large, such as in the human genome example. In particular, some less powerful computers may not even have enough memory to store the data, let alone process it.
One appealing approach to dealing with big data is designing streaming algorithms, which are algorithms that process the input as a data stream. Typically, the goal is to compute or approximate the solution by using sublinear space (e.g., n^δ for some constant δ < 1, or even polylog(n)) and a few (ideally one) passes over the data stream. These algorithms have become increasingly popular, and have attracted a lot of research activity recently.
Designing streaming algorithms for ED and LCS, however, is not an easy task. For ED, only a couple of positive results are known. In particular, assuming that the edit distance between the two strings is bounded by some parameter k, [CGK16a] gives a randomized one pass algorithm achieving an O(k) approximation of ED, using linear time and O(log n) space, in a variant of the streaming model where one can scan the two strings simultaneously in a coordinated way. In the same model, [CGK16a] also gives randomized one pass algorithms computing ED exactly, with space and time polynomial in k; the space and time complexities were later improved in [CGK16b, BZ16]. Furthermore, [BZ16] gives a randomized one pass algorithm computing ED exactly, with space and time polynomial in k, in the standard streaming model. We note that all of these algorithms are only interesting if k is small, e.g., k ≤ n^δ for some small constant δ, otherwise the space complexity can be as large as Ω(n). For LCS, strong lower bounds are given in [LNVZ05, SW07], which show that for exact computation, even constant pass randomized algorithms need Ω(n) space; while any constant pass deterministic algorithm achieving a (1+ε) approximation of LCS also needs space polynomial in n, if the alphabet size is large enough.
Motivated by this situation and inspired by the work of [AKO10], Saks and Seshadhri [SS13] studied the asymmetric data streaming model. This model is a relaxation of the standard streaming model, where one has streaming access to one string (say x), and random access to the other string (say y). In this model, [SS13] gives a deterministic one pass algorithm achieving a (1+ε) approximation of ED using sublinear space, as well as a randomized one pass algorithm achieving an additive εn approximation of ED, using space that depends on the maximum number of times any symbol appears in y. Another work by Saha [Sah17] also gives an algorithm in this model that achieves an additive εn approximation of ED using sublinear space.
The asymmetric streaming model is interesting for several reasons. First, it still inherits the spirit of streaming algorithms, and is particularly suitable for a distributed setting. For example, a local, less powerful computer can use the streaming access to process the string x, while sending queries to a remote, more powerful server which has random access to y. Second, because it is a relaxation of the standard streaming model, one can hope to design better algorithms for ED, or to beat the strong lower bounds for LCS in this model. The latter point is indeed verified by two recent works [FHRS20, CJLZ20] (recently accepted to ICALP as a combined paper [CFH+21]), which give a deterministic one pass algorithm achieving a constant factor approximation of ED, using Õ(n^δ) space and polynomial time for any constant δ > 0, as well as deterministic one pass algorithms achieving a (1+ε) approximation of ED and LCS, using Õ(√(n/ε)) space and polynomial time.
A natural question is how much we can improve these results. Towards answering this question, we study both lower bounds and upper bounds for the space complexity of and in the asymmetric streaming model, and we obtain several new, non-trivial results.
Related work.
On a different topic, there are many works that study the time complexity of ED and LCS. In particular, while [BI15, ABW15] showed that ED and LCS cannot be computed exactly in truly sub-quadratic time unless the Strong Exponential Time Hypothesis [IPZ01] is false, a successful line of work [CDG+19, BR20, KS20, AN20, HSSS19, RSSS19, RS20] has led to randomized algorithms that achieve constant factor approximations of ED in near linear time, and randomized algorithms that provide various non-trivial approximations of LCS in linear or sub-quadratic time. Another related work is [AKO10], where the authors proved a lower bound on the query complexity of computing ED in the asymmetric query model, where one has random access to one string but only a limited number of queries to the other string.
1.1 Our Contribution
We initiate a systematic study of lower bounds for computing or approximating ED and LCS in the asymmetric streaming model. To simplify notation, we always use (1+ε) approximation for some ε > 0, i.e., outputting a value Ṽ such that Ṽ ≤ V ≤ (1+ε)Ṽ, where V is either ED(x, y) or LCS(x, y). We note that for LCS, this is equivalent to a 1/(1+ε) approximation in the standard notation.
Previously, there were no explicitly stated space lower bounds in this model, although, as we will discuss later, some lower bounds about LCS can be inferred from the lower bounds for longest increasing subsequence (LIS) in [SW07, GG10, EJ08]. As our first contribution, we prove strong lower bounds for ED in the asymmetric streaming model.
Theorem 1.
There is a constant c > 1 such that for any k with ck ≤ n, given an alphabet Σ, any R-pass randomized algorithm in the asymmetric streaming model that decides if ED(x, y) ≥ k for two strings x, y of length n with success probability at least 2/3 must use Ω(min(k, |Σ|)/R) space.
This theorem implies the following corollary.
Corollary 1.1.
Given an alphabet Σ, the following space lower bounds hold for any constant pass randomized algorithm with success probability at least 2/3 in the asymmetric streaming model.
1. Ω(min(n, |Σ|)) for computing ED(x, y) of two strings x, y of length n.
2. Ω(min(1/ε, |Σ|)) for a (1+ε) approximation of ED(x, y) for two strings x, y of length n, if ε ≥ c/n.
Our theorems thus provide a justification for the study of approximating ED in the asymmetric streaming model. Furthermore, we note that previously, unconditional lower bounds for ED in various computational models are either weak, or almost identical to the bounds for Hamming distance. For example, a simple reduction from the equality function implies that the deterministic two party communication complexity (and hence also the space lower bound in the standard streaming model) for computing or even approximating ED is Ω(n). (We include this bound in the appendix for completeness, as we cannot find any explicit statement in the literature.) However, the same bound holds for Hamming distance. Thus it has been an intriguing question to prove a rigorous, unconditional separation of the complexity of ED and Hamming distance. To the best of our knowledge, the only previous example achieving this is the work of [AK10] and [AJP10], which showed a super-constant randomized two party communication complexity lower bound for achieving a (1+ε) approximation of ED, while the same problem for Hamming distance has an upper bound depending only on ε. Thus if ε is a constant, this provides a separation of a super-constant bound vs. a constant. However, this result also has some disadvantages: (1) It only works in the randomized setting; (2) The separation becomes obsolete when ε is small; and (3) The lower bound for ED is still weak and thus it does not apply to the streaming setting, as there even recording the index needs space log n.
Our result from Corollary 1.1, on the other hand, complements the above result in the aforementioned aspects by providing another strong separation of ED and Hamming distance. Note that even exact computation of the Hamming distance between x and y is easy in the asymmetric streaming model, with one pass and O(log n) space. Thus our result provides an exponential gap between edit distance and Hamming distance, in terms of the space complexity in the asymmetric streaming model (and also the communication model, since our proof uses communication complexity), even for deterministic exact computation.
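To make the contrast concrete, here is a minimal sketch (with illustrative names of our own) of the one pass, O(log n) space Hamming distance computation in the asymmetric streaming model: the only state maintained across stream items is a position and a counter.

```python
# Minimal sketch: exact Hamming distance in the asymmetric streaming model.
# We stream x one symbol at a time and use random access to y; the working
# state is just the current index and a mismatch counter, O(log n) bits.
def hamming_asymmetric(x_stream, y):
    mismatches = 0
    for i, xi in enumerate(x_stream):  # single pass over the streamed string x
        if xi != y[i]:                 # random access into y
            mismatches += 1
    return mismatches

print(hamming_asymmetric(iter("10110"), "11100"))  # 2
```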
Next we turn to LCS, which can be viewed as a generalization of LIS. For example, we can fix the string y to be the concatenation of all symbols of the alphabet in increasing order; it is then easy to see that LCS(x, y) = LIS(x). Therefore, the lower bound for computing LIS with a randomized streaming algorithm in [SW07] also implies a similar bound for LCS in the asymmetric streaming model. However, the bound in [SW07] does not apply to the harder case where x is a permutation of y, and their lower bound for smaller alphabets is actually for the longest non-decreasing subsequence, which does not give a similar bound for LCS in the asymmetric streaming model. (One can get a similar reduction to LNS, but then y needs to be the sorted version of x, which gives additional information about x in the asymmetric streaming model since we have random access to y.) Therefore, we first prove a strong lower bound for LCS in general.
Theorem 2.
There is a constant c > 1 such that for any k with ck ≤ n, given an alphabet Σ, any R-pass randomized algorithm in the asymmetric streaming model that decides if LCS(x, y) ≥ k for two strings x, y of length n with success probability at least 2/3 must use Ω(min(k, |Σ|)/R) space. Moreover, this holds even if y is a permutation of x, for suitable ranges of k and |Σ|.
Similar to the case of ED, this theorem also implies the following corollary.
Corollary 1.2.
Given an alphabet Σ, the following space lower bounds hold for any constant pass randomized algorithm with success probability at least 2/3 in the asymmetric streaming model.
1. Ω(min(n, |Σ|)) for computing LCS(x, y) of two strings x, y of length n.
2. Ω(min(1/ε, |Σ|)) for a (1+ε) approximation of LCS(x, y) for two strings x, y of length n, if ε ≥ c/n.
We then consider deterministic approximation of LCS. Here, the work of [GG10, EJ08] gives a lower bound of Ω(√(n/ε)) for any constant pass deterministic streaming algorithm achieving a (1+ε) approximation of LIS, which also implies a similar lower bound for LCS in the asymmetric streaming model when the alphabet is large. These bounds match the O(√(n/ε)) upper bound in [GJKK07] for LIS and LNS, and in [FHRS20, CJLZ20] for LCS. However, a major drawback of this bound is that it gives nothing when the alphabet size is small. For even smaller alphabet size, the bound does not even give anything for exact computation. For example, in the case of a binary alphabet, we know that LCS(x, y) is an integer between 0 and n, and thus taking ε < 1/n corresponds to exact computation. Yet for such parameters the bound becomes vacuous (it even evaluates to a negative number).
This is somewhat disappointing, as in most applications of ED and LCS the alphabet size is actually a fixed constant. These include, for example, the English language and the human DNA sequence (where the alphabet size is 4, for the bases). Therefore, in this paper we focus on the case where the alphabet size is small, and we have the following theorem.
Theorem 3.
Given an alphabet Σ of constant size, for any ε with c/n ≤ ε ≤ 1 where c is some constant, any R-pass deterministic algorithm in the asymmetric streaming model that computes a (1+ε) approximation of LCS(x, y) for two strings x, y of length n must use Ω(1/(εR)) space.
Thus, even for a binary alphabet, achieving a (1+ε) approximation for small ε (e.g., ε < 1/n, which corresponds to exact computation) can take space as large as Ω(n) for any constant pass algorithm. Further note that by taking ε = √(ε₀/n), we recover the Ω(√(n/ε₀)) bound with a much smaller alphabet.
Finally, we turn to LIS and longest non-decreasing subsequence (LNS), as well as a natural generalization of LIS and LNS which we call longest non-decreasing subsequence with threshold (LNST). Given a string x and a threshold t, LNST(x, t) denotes the length of the longest non-decreasing subsequence in x such that each symbol appears at most t times. It is easy to see that the case of t = 1 corresponds to LIS and the case of t = n corresponds to LNS. Thus LNST is indeed a generalization of both LIS and LNS. It is also a special case of LCS for suitable parameters, as we can take y to be the concatenation of t copies of each symbol, in ascending order (possibly padding some symbols not in x). How hard is LNST? We note that in the cases of t = 1 (LIS) and t = n (LNS), a simple dynamic programming can solve the problem in one pass with O(|Σ| log n) space, and a (1+ε) approximation can be achieved in one pass with O(√(n/ε)) space by [GJKK07]. Thus one can ask what is the situation for other t. Again we focus on the case of a small alphabet and have the following theorem.
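As a concrete (non-streaming) illustration of the definition, the following sketch computes LNST(x, t) by a simple dynamic program; the function name and state layout are our own, and the t = 1 and t = |x| cases recover LIS and LNS respectively. Since the subsequence is non-decreasing, all occurrences of its last symbol form its final run.

```python
# A minimal DP sketch for LNST(x, t): each symbol may appear at most t times.
# State: best[s][r] = length of the longest valid subsequence seen so far that
# ends with symbol s, where s occurs exactly r times (1 <= r <= t).
def lnst(x, t, alphabet):
    order = {s: i for i, s in enumerate(sorted(alphabet))}
    best = {s: [0] * (t + 1) for s in alphabet}  # index r = 0 is unused
    for c in x:
        # Extend a subsequence already ending in c (descending r avoids
        # reusing the current occurrence of c twice) ...
        for r in range(t, 1, -1):
            if best[c][r - 1] > 0:
                best[c][r] = max(best[c][r], best[c][r - 1] + 1)
        # ... or start a new run of c after some strictly smaller symbol.
        prev = max((best[s][r] for s in alphabet if order[s] < order[c]
                    for r in range(1, t + 1)), default=0)
        best[c][1] = max(best[c][1], prev + 1)
    return max(best[s][r] for s in alphabet for r in range(t + 1))

x = "abacbc"
print(lnst(x, 1, "abc"))       # t = 1 recovers LIS: 3 (e.g. "abc")
print(lnst(x, len(x), "abc"))  # t = n recovers LNS: 4 (e.g. "aabc")
```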
Theorem 4.
Given an alphabet Σ, for deterministic (1+ε) approximation of LNST(x, t) for a string x ∈ Σⁿ in the streaming model with R passes, we have the following space lower bounds:
1. Ω(|Σ|/R) for any constant t (this includes LIS, i.e., t = 1), when ε is any constant.
2. Ω(|Σ|/R) for t = n (this includes LNS), when ε is any constant.
3. Ω(√n/R) for some intermediate t, when |Σ| = O(1).
Thus, cases 1 and 2 show that even for any constant factor approximation, any constant pass streaming algorithm for LIS and LNS needs Ω(|Σ|) space, matching the O(|Σ| log n) upper bound up to a logarithmic factor. Taking |Σ| = Θ(√n) and a constant ε for example, we further get a lower bound of Ω(√n) for approximating LIS or LNS using any constant pass streaming algorithm. This matches the O(√(n/ε)) upper bound. These results complement the bounds in [GG10, EJ08, GJKK07] for the important case of small alphabet, and together they provide an almost complete picture for LIS and LNS. Case 3 shows that for certain choices of t and |Σ|, the space we need for LNST can be significantly larger than for LIS and LNS. It is an intriguing question to completely characterize the behavior of LNST for all regimes of parameters.
We also give improved algorithms for asymmetric streaming ED and LCS. For ED, [FHRS20, CJLZ20] give a constant factor approximation algorithm with Õ(n^δ) space for any constant δ > 0. We further reduce the space needed from Õ(n^δ) to Õ(d^δ), where d = ED(x, y). Specifically, we have the following theorem.
Theorem 5.
Let d = ED(x, y). In the asymmetric streaming model, there are one-pass deterministic algorithms running in polynomial time with the following parameters:
1. A constant factor approximation of ED using Õ(d) space.
2. For any constant δ > 0, a 2^{O(1/δ)} factor approximation of ED using Õ(d^δ) space.
For LCS over a large alphabet, the upper bounds in [FHRS20, CJLZ20] match the lower bounds implied by [GG10, EJ08]. We thus again focus on small alphabets. Note that our Theorem 3 does not give anything useful if |Σ| is small and ε is large (e.g., both are constants). Thus a natural question is whether one can get better bounds. In particular, is the dependence on 1/ε linear as in our theorem, or is there a threshold beyond which the space jumps to, say, Ω(√n) or Ω(n)? We note that there is a trivial one pass, O(log n) space algorithm even in the standard streaming model that gives a 1/2 approximation of LCS in standard notation, and no better approximation using sublinear space is known even in the asymmetric streaming model. Thus one may wonder whether this is the threshold. We show that this is not the case, by giving a one pass algorithm in the asymmetric streaming model over the binary alphabet that achieves a (1/2 + ε) approximation in standard notation, using O(n^δ) space for any constant δ > 0.
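Before stating our result, here is one natural instantiation of the trivial baseline just mentioned, for binary strings (an illustrative sketch with our own naming, not necessarily the exact algorithm referenced above): any common subsequence consists of a 0's and b 1's where a and b are bounded by the corresponding symbol counts of both strings, so the larger of the two minima is at least LCS(x, y)/2, and four counters suffice.

```python
# Trivial one-pass 1/2-approximation of LCS for binary strings: count symbol
# frequencies in both strings and return the larger of the two minima.
def half_approx_lcs(x, y):
    zeros = min(x.count('0'), y.count('0'))
    ones = min(x.count('1'), y.count('1'))
    return max(zeros, ones)  # >= LCS(x, y) / 2, and <= LCS(x, y)

print(half_approx_lcs("0101", "0011"))  # returns 2; here LCS = 3 ("011")
```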
Theorem 6.
For any constant δ > 0, there exist a constant ε > 0 and a one-pass deterministic algorithm that outputs a (1/2 + ε) approximation (in standard notation) of LCS(x, y) for any two binary strings x, y of length n, with Õ(n^δ) space and polynomial time in the asymmetric streaming model.
Finally, as mentioned before, we now have an almost complete picture for LIS and LNS, but for the more general LNST the situation is still far from clear. Since LNST is a special case of LCS, the upper bound of [FHRS20, CJLZ20] still applies in the corresponding parameter regime, and this matches our lower bound in case 3 of Theorem 4 by taking ε to be a constant. One can then ask the natural question of whether we can get a matching upper bound for the case of small alphabet. We are not able to achieve this, but we provide a simple algorithm that can use much smaller space for certain regimes of parameters in this case.
Theorem 7.
Given an alphabet Σ of constant size, for any t and any constant ε > 0, there is a one-pass streaming algorithm that computes a (1+ε) approximation of LNST(x, t) for any x ∈ Σⁿ with O(((log t)/ε)^{|Σ|} · log n) space.
1.2 Overview of our Techniques
Here we provide an informal overview of the techniques used in this paper.
1.2.1 Lower Bounds
Our lower bounds use the general framework of communication complexity. To limit the power of random access to the string y, we always fix y to be a specific string, and consider different strings x. In turn, we divide x into several blocks and consider the two party/multi party communication complexity of ED or LCS, where each party holds one block of x. However, we need to develop several new techniques to handle edit distance and small alphabets.
Edit distance.
We start with edit distance. One difficulty here is to handle substitutions: with substitutions, edit distance becomes similar to Hamming distance, and this is exactly one of the reasons why strong complexity results separating edit distance and Hamming distance are rare. Indeed, if we define ED'(x, y) to be the smallest number of insertions and deletions (without substitutions) needed to transform x into y, then ED'(x, y) = |x| + |y| − 2·LCS(x, y), and thus a lower bound for exactly computing LCS (e.g., those implied from [GG10, EJ08]) would translate directly into the same bound for exactly computing ED'. On the other hand, with substitutions things become more complicated: if the LCS of the two strings is small, then in many cases (such as examples obtained by reducing from [GG10, EJ08]) the best option to transform x into y is just replacing each symbol in x by the corresponding symbol in y whenever they differ, which makes ED(x, y) exactly the same as their Hamming distance.
To get around this, we need to ensure that the LCS of the two strings is large. We demonstrate our ideas by first describing an Ω(n) lower bound for the deterministic two party communication complexity of ED, using a reduction from the equality function, which is well known to have an Ω(n) communication complexity bound. Towards this, we fix y to be an increasing sequence interleaved with a special symbol *. We divide x into two parts x = x1 ∘ x2, such that x1 is obtained from an increasing sequence by replacing some of its symbols by *, while x2 is obtained from another increasing sequence by replacing some of its symbols by *. Note that the way we choose y ensures that the LCS of x and y is large before replacing any symbol by *.
Intuitively, we want to argue that the best way to transform x into y is to delete a substring at the end of x1 and a substring at the beginning of x2, so that the resulting string becomes an increasing sequence that is as long as possible. Then, we insert symbols into this string to make it match y except for the * symbols. Finally, we replace the * symbols by substitutions. If this is true then we can finish the argument as follows. Let S and T be two subsets of the same size, where for any index in S (resp. T), the corresponding symbol in x1 (resp. x2) is replaced by *. Now if S = T, then no matter where we choose to delete the substrings in x1 and x2, the number of edit operations is always the same, by a direct calculation. On the other hand, if S ≠ T, and assuming for simplicity that the smallest element where they differ is an element of S, then there is a way to save one substitution, and the number of edit operations becomes smaller.
The key part is now proving our intuition. For this, we consider all possible ways in which x1 is transformed into a prefix of y and x2 into the remaining suffix, and compute the two edit distances respectively. To analyze the edit distance, we first show by a greedy argument that without loss of generality, we can assume that we apply deletions first, followed by insertions, with substitutions last. This reduces the edit distance problem to the following problem: for a fixed number of deletions and insertions, what is the best way to minimize the Hamming distance (or maximize the number of agreements of symbols at the same indices) at the end. Now we break the analysis into two cases. Case 1 is where the number of deletions is large. In this case, the number of insertions must also be large, and we argue that the number of agreements is bounded. Case 2 is where the number of deletions is small. In this case, the number of insertions must also be small. Now we crucially use the structure of x and y, and argue that the symbols with original index beyond a certain point are guaranteed to be out of agreement, so the number of agreements is again bounded. In each case, combining the bounds gives us a lower bound on the total number of operations. The situation for x2 is completely symmetric, and this proves our intuition.
In the above construction, x and y have different lengths. We can fix this by appending a long enough string with symbols distinct from those in x to the end of both x and y, and then padding further symbols at the end of the shorter string. We argue that the best way to do the transformation is to transform x into y as before, and then insert the remaining symbols. To show this, we first argue that at least one symbol of the appended string must be kept, for otherwise the number of operations is already larger than that of the previous transformation. Then, using a greedy argument, we show that the entire appended string must be kept, and thus the natural transformation is optimal.
To extend the bound to randomized algorithms, we modify the above construction and reduce from Set Disjointness (DIS), which is known to have randomized communication complexity Ω(n). Given two strings a, b representing the characteristic vectors of two sets, DIS(a, b) = 1 if and only if the two sets are disjoint. For the reduction, we first create two new strings which are "balanced" versions of a and b, i.e., each is padded so that it contains a fixed number of 1's regardless of the input; the two balanced strings are created slightly differently from each other. Now that both balanced strings have the same number of 1's, we can use them as the characteristic vectors of the two sets in the previous construction. A similar argument now leads to the bound for randomized algorithms.
Longest common subsequence.
Our lower bounds for randomized algorithms computing LCS exactly are obtained by a similar and simpler reduction from DIS: we still fix y to be an increasing sequence of length n and divide it evenly into blocks of constant size. Now x1 consists of the blocks with an odd index, while x2 consists of the blocks with an even index. Thus x = x1 ∘ x2 is a permutation of y. Next, from the DIS inputs we create balanced strings in a slightly different way and use them to modify the blocks in x1 and x2 respectively: if a bit is 1 then we arrange the corresponding block in increasing order, otherwise we arrange the corresponding block in decreasing order. A similar argument as before now gives the desired bound. We note that [SW07] has similar results for LIS by reducing from DIS. However, our reduction and analysis are different from theirs. Thus we can handle LCS, and even the harder case where y is a permutation of x.
We now turn to LCS over a small alphabet. To illustrate our ideas, let us first consider the warm-up case y = 1^{n/2} 0^{n/2}, so that any common subsequence of x and y has the form 1^a 0^b. We now represent each string x as follows: at any index i, we record a pair (p_i, q_i), where p_i is the number of 1's in x up to index i and q_i is the number of 0's in x after index i (each capped at n/2). Thus, if we read x from left to right, then upon reading a 1, p may increase by 1 and q does not change; while upon reading a 0, p does not change and q may decrease by 1. Hence if we use the horizontal axis to stand for p and the vertical axis to stand for q, then these points form a polygonal chain. We call p_i + q_i the value at point i, and it is easy to see that LCS(x, y) must be the value of an endpoint of some chain segment.
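For this warm-up case, the value just described can be computed in a single scan; the following sketch (with our own naming, and with the cap m = n/2 made explicit as a parameter) computes LCS(x, 1^m 0^m) as the best split of x into a prefix contributing 1's and a suffix contributing 0's.

```python
# Warm-up instance: LCS(x, y) for y = 1^m 0^m. A common subsequence has the
# form 1^a 0^b, so we maximize min(#1s in prefix, m) + min(#0s in suffix, m)
# over all split points of x.
def lcs_with_ones_then_zeros(x, m):
    total_zeros = x.count('0')
    ones_prefix, zeros_suffix = 0, total_zeros
    best = min(zeros_suffix, m)          # split before the first symbol
    for c in x:
        ones_prefix += c == '1'
        zeros_suffix -= c == '0'
        best = max(best, min(ones_prefix, m) + min(zeros_suffix, m))
    return best

print(lcs_with_ones_then_zeros("0110100", 2))  # 4, matching 1^2 0^2 = "1100"
```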
Using the above representation, we now fix y and choose x = x1 ∘ x2. Since any common subsequence between x and y must decompose appropriately, it suffices to consider the common subsequence between x1 and a prefix of y, and that between x2 and the remaining suffix, and combine them together. Towards that, we impose the following properties on x: (1) the number of occurrences of each symbol in each part is bounded; (2) in the polygonal chain representation of each part, the values of the endpoints strictly increase as the number of 1's increases; and (3) for any endpoint in x1 where the number of 1's is some a, there is a corresponding endpoint in x2, and the values of these two endpoints sum up to a fixed number. Note that property (2) implies that LCS(x, y) must be the sum of the values of an endpoint in x1 and the corresponding endpoint in x2, while property (3) implies that for any string x1, there is a unique corresponding string x2, and LCS(x, y) equals the fixed number (regardless of the choice of x1).
We show that under these properties, all possible strings x1 form a set S of exponential size, and this set gives a fooling set for the two party communication problem of computing LCS(x, y). Indeed, for any x1 ∈ S paired with its corresponding x2, LCS(x, y) equals the same fixed number. On the other hand, for any two distinct members of S, the values must differ at some endpoint. Hence by switching, one of the two mixed concatenations will have an LCS with y that is strictly longer. Standard arguments now imply an Ω(n) communication complexity lower bound. A more careful analysis shows that we can even eliminate one of the symbols, thus resulting in a binary alphabet.
The above argument can be easily modified to give a bound for (1+ε) approximation of LCS, by taking the string length to be a suitable function of ε. To get a better bound, we combine our technique with the technique in [EJ08] and consider the following direct sum problem: we create several copies of the strings x and y, where each copy uses a distinct alphabet. Now y again consists of the copies of the reference string, while x consists of the copies x_i. The direct sum problem is to decide between the following two cases: (1) there are many copies i for which LCS(x_i, y_i) is small, and (2) LCS(x_i, y_i) is maximal for all i. We do this by arranging the x_i's row by row into a matrix (each entry is a short string) and letting x be the concatenation of the columns. We call these strings the contents of the matrix, and let y be the concatenation of the y_i's. Now intuitively, case (1) and case (2) correspond to deciding whether LCS(x, y) is below or above a threshold, which implies a (1+ε) approximation. The lower bound follows by analyzing the k-party communication complexity of this problem, where each party holds a column of the matrix.
However, unlike the constructions in [GG10, EJ08], which are relatively easy to analyze because all symbols in x (respectively y) are distinct, the repeated symbols in our construction make the analysis of LCS(x, y) much more complicated (we could also use distinct symbols, but that would only give us a weaker bound). To ensure that the optimal strategy is to match each x_i to the corresponding y_i, we use additional symbols and add buffers of large size between adjacent copies, and we do the same thing for y correspondingly. Moreover, it turns out we need to arrange the buffers carefully to avoid unwanted issues: in each row, between adjacent copies we use a buffer of a new symbol. Thus the buffers are added to each row sequentially, and this is the same for every row. That is, in each row the contents use the same alphabet but the buffers use different alphabets. Now we have a matrix, and we again let x be the concatenation of the columns while y is the concatenation of the rows. We use a careful analysis to argue that case (1) and case (2) again correspond to deciding whether LCS(x, y) is below or above a threshold, which implies a (1+ε) approximation. The lower bound follows by analyzing the k-party communication complexity of this problem, and we show a lower bound by generalizing our previous fooling set construction to the multi-party case, where we use a good error correcting code to create the gap.
The above technique works for one range of parameters. For the remaining case, our bound for LCS can be derived directly from our bound for LNS, which we describe next.
Longest increasing/non-decreasing subsequence.
Our lower bound over a small alphabet is achieved by modifying the construction in [EJ08] and providing a better analysis. Similar to before, we consider a matrix of binary strings with a constant number of rows, where each party in a k-party communication problem holds one column. The problem is to decide between the following two cases for a large enough constant: (1) for each row of the matrix, there are many 0's between any two 1's, and (2) there exists a row of the matrix which has many more 1's. We can use a similar argument as in [EJ08] to show that the total communication complexity of this problem is large, and hence at least one party needs a substantial amount of communication. The difference is that [EJ08] sets the relevant parameter to a fixed constant, while we need to pick a larger constant to handle the small alphabet case. For this we use the Lovász Local Lemma with a probabilistic argument to show the existence of a large fooling set. To reduce to LIS, we define another matrix obtained by an appropriate symbol replacement. Now let x be the concatenation of all columns of this matrix. We show that case (2) implies a large LIS and case (1) implies a small LIS. This implies a (1+ε) approximation lower bound for any constant ε by setting the parameters appropriately.
The construction is slightly different for LNS. This is because if we keep the 0's in x, they would already form a very long non-decreasing subsequence and we would not get any gap. Thus, we now let the matrix have a number of rows that can be any constant. We replace all 0's in each column with a distinct symbol per column, ordered appropriately, and similarly we replace all 1's in each row with a distinct symbol per row, ordered appropriately. We can show that the two cases now correspond to a large and a small value of LNS, respectively.
We further prove a space lower bound for (1+ε) approximation of LNS when the alphabet is small. This is similar to our previous construction for LCS, except we do not need buffers here, and we only need to record the number of occurrences of certain symbols. More specifically, let S be the set of all strings over the alphabet with a given length satisfying appropriate counting constraints. Then S has exponential size, and for each member of S the number of 1's is fixed. Further, for any two distinct members of S, one of their two concatenations has more 1's than the fixed number. We now consider the matrix where each row consists of strings from S, such that within the same row all copies use the same alphabet while for different rows the alphabets are disjoint. To make sure the LNS of the concatenation of the columns is roughly the sum of the numbers of 1's, we require that the alphabets are ordered consistently. Now we analyze the k-party communication problem of deciding whether the concatenation of the columns has LNS above or below a threshold, which implies a (1+ε) approximation. The lower bound is again achieved by generalizing the set S to a fooling set for the k-party communication problem using an error correcting code based approach.
In Theorem 4, we give three lower bounds for LNST. The first two lower bounds are adapted from our lower bounds for LIS and LNS, while the last lower bound is adapted from our lower bound for LCS by ensuring that all symbols in different rows or columns of the matrix there are different.
1.2.2 Improved Algorithms
We now turn to our improved algorithms for ED and LCS in the asymmetric streaming model.
Edit distance.
Our algorithm for edit distance builds on and improves the algorithm in [FHRS20, CJLZ20]. The key idea of that algorithm is to use the triangle inequality. Given a constant δ, the algorithm first divides x evenly into blocks. Then for each block of x, the algorithm recursively finds an approximation of the closest substring to this block in y. That is, for each block the algorithm finds a substring of y and a value that upper bounds the edit distance between the block and this substring, and is at most a bounded factor larger than the edit distance between the block and any substring of y. Let ỹ be the concatenation of these substrings. Then, using the triangle inequality, [FHRS20] showed that the sum of these values is a bounded factor approximation of ED(x, y). The Õ(n^δ) space bound is achieved by recursively applying this idea, which results in a constant factor approximation.
To further reduce the space complexity, our key observation is that, instead of dividing x into blocks of equal length, we can divide it according to the positions of the edit operations that transform y into x. More specifically, assume we are given a value u with d ≤ u = O(d); we show how to design a constant factor approximation algorithm using Õ(u) space. Towards this, note that x and y can each be divided into blocks such that each block of x is close in edit distance to the corresponding block of y. However, such a partition of x and y is not known to us. Instead, we start from the first position of x and find the largest index k such that the block x[1, k] is within the prescribed edit distance of some substring of y. To do this, we increase k one position at a time and try all substrings of y of comparable length. If there is some substring of y within the prescribed edit distance of the current block, we record k and store all the edit operations that transform the closest such substring into the block. We continue doing this with larger k until we cannot find a substring of y within the prescribed edit distance of the block.
One problem here is that the block can be much larger than u, and we cannot store it explicitly within our space budget. However, since we have stored a substring of y (we only need to store its two endpoint indices) and the bounded number of edit operations that transform it into the block, we can still query every bit of the block using small space.
After we find the largest possible index k, we store it together with the corresponding pair of indices into y and the recorded value. We then start from the (k+1)-th position of x and do the same thing again to find the next largest index such that there is a substring of y within the prescribed edit distance of the next block. We continue doing this until we have processed the entire string x. Assume this gives us a sequence of pairs of indices and integers; we can store them within our space budget. We show by induction that each recorded block is contained in a bounded region of y. Recall the ideal partition above; the process must therefore end within a bounded number of steps. Then, let ỹ be the concatenation of the recorded substrings of y. Using techniques developed in [FHRS20], we can show that the result is a constant factor approximation of ED(x, y). For any small constant, we can compute the needed inner approximations with small space using the algorithm in [CJLZ20]. This gives us a constant factor approximation algorithm with Õ(d) space.
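The following toy sketch illustrates the greedy partition step just described, using exact edit distance as the inner subroutine; the actual algorithm replaces this inner call with the recursive, space-bounded approximation discussed below, and all names here are our own.

```python
# Illustrative (non-streaming) sketch of the greedy block-growing step:
# grow a block of x while some substring of y stays within edit distance u.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[-1]

def within_u(block, y, u):
    # Is some substring of y within edit distance u of block?
    return any(edit_distance(block, y[s:e]) <= u
               for s in range(len(y) + 1)
               for e in range(s, len(y) + 1))

def greedy_partition(x, y, u):
    cuts, j = [], 0
    while j < len(x):
        k = j + 1  # simplification: assume the one-symbol block is u-close
        while k < len(x) and within_u(x[j:k + 1], y, u):
            k += 1
        cuts.append((j, k))
        j = k
    return cuts

print(greedy_partition("abcxdef", "abcdef", 1))  # one block: [(0, 7)]
```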
Similar to [FHRS20], we can use recursion to further reduce the space. Let δ be a small constant and let a value u be given as before. There is a way to partition x and y each into blocks such that corresponding blocks are close in edit distance. Now similarly, we want to find the largest index k such that there is a substring of y within the prescribed edit distance of x[1, k]. However, naively this would require too much space for the exact edit distance computation. Thus again we turn to approximation.
We introduce a recursive algorithm for this task. It takes two additional parameters as inputs: an integer and a parameter for the amount of space we can use. It outputs a three-tuple: an index, a pair of indices, and an integer. Let k* be the largest index such that there is a substring of y within the prescribed edit distance of the current block of x.
We show the following two properties of the algorithm: (1) the returned index is at least k*, and (2) for any substring of y, the returned integer approximates the corresponding edit distance within a factor that depends on the recursion depth. If the allowed space is large enough, the algorithm outputs the exact answer and the closest substring of y by direct computation. Otherwise, it calls itself a bounded number of times with a smaller space parameter. This gives a list of outputs; we concatenate the returned substrings of y, find the pair of indices whose substring minimizes the edit distance to this concatenation, and output accordingly. We then use induction to show that properties (1) and (2) hold for these outputs, with the approximation factor growing by a constant factor per level of recursion.
This gives an Õ(d^δ) space algorithm as follows. We run the recursive algorithm with the appropriate initial parameters to find the list of tuples. Again, let ỹ be the concatenation of the recorded substrings of y. Similar to the Õ(d) space algorithm, we can show that the output is a constant factor approximation of ED(x, y). Since the depth of the recursion is at most O(1/δ) and each level of recursion needs Õ(d^δ) space, the algorithm uses Õ(d^δ) space in total.
The two algorithms above both require a given value u. To remove this constraint, our observation is that the two previous algorithms actually only need u to satisfy the following relaxed condition: there is a partition of x into a bounded number of blocks such that for each block, there is a substring of y within the prescribed edit distance of it. Thus, when such a u is not given, we can do the following. We first set u to be a large constant. While the algorithm reads x from left to right, let c be the number of blocks we have stored so far; each time we run the procedure at this level, c increases by 1. If the current u satisfies the relaxed condition, then by a similar argument as before, c should never exceed a fixed bound. Thus, whenever c exceeds this bound, we increase u by a constant factor. Assume that u is updated several times in total; we show that the final value of u is O(d) (but it may be much smaller than d). To see this, consider the first time u reaches the range of the relaxed condition, and let this happen at some position of x. From this position on, it is possible to divide the remaining part of x into a bounded number of blocks such that for each block, there is a substring of y within the prescribed edit distance of it. By property (1) and a similar argument as before, we will run the procedure only a bounded number of times until we reach the end of x, so the count stays below the threshold and u will not be updated again. Therefore the final u is O(d). Running the algorithm with this u takes Õ(d^δ) space, and the number of intermediate results is bounded accordingly. This gives us a constant factor approximation algorithm with space complexity Õ(d^δ).
LCS and LNST.
We show that the reduction from LCS to ED discovered in [RS20] can work in the asymmetric streaming model with a slight modification. Combined with our algorithm for ED, this gives an Õ(n^δ) space algorithm for LCS that achieves a (1/2 + ε) approximation (in standard notation) for binary strings. We stress that the original reduction in [RS20] is not in the asymmetric streaming model, and hence our result does not follow directly from previous works.
We also provide a simple algorithm to approximate LNST. Assume the alphabet is ordered; the idea is to create a set S of candidate non-decreasing strings, and for each s ∈ S, check whether s is a subsequence of x. We show that for a properly chosen S, whose size is controlled by sparsifying the possible symbol counts, there is some s ∈ S such that s is a subsequence of x and |s| is a good approximation of LNST(x, t).
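A hedged sketch of this candidate-checking idea is given below: we enumerate non-decreasing patterns whose per-symbol counts are sparsified to powers of (1 + ε) (capped at t), and test each pattern with a single pass over x. The precise sparsification in our algorithm may differ; this sketch is only meant to illustrate the mechanism, and the number of candidates is (log t / ε)^{O(|Σ|)} as in Theorem 7.

```python
# Sketch of the candidate-checking idea for approximating LNST(x, t).
from itertools import product

def is_subsequence(pattern, x):
    it = iter(x)
    return all(any(c == d for d in it) for c in pattern)

def approx_lnst(x, t, alphabet, eps=0.5):
    # Sparsified count choices per symbol: 0 and powers of (1 + eps) up to t.
    counts, c = {0, t}, 1.0
    while c <= t:
        counts.add(int(c))
        c *= 1 + eps
    best = 0
    for choice in product(sorted(counts), repeat=len(alphabet)):
        pattern = "".join(s * r for s, r in zip(sorted(alphabet), choice))
        if is_subsequence(pattern, x):
            best = max(best, len(pattern))
    return best

print(approx_lnst("abacbc", 2, "abc"))  # 4, close to LNST("abacbc", 2) = 4
```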
1.3 Open Problems
Our work leaves a plethora of intriguing open problems. The main one is to close the gap between our lower bounds and the upper bounds of known algorithms, especially for the case of small alphabets and large (say constant) approximation factors. We believe that in this case it is possible to improve both the lower bounds and the upper bounds. Another interesting problem is to completely characterize the space complexity of LNST.
2 Preliminaries
We use the following conventional notation. Let x be a string of length n over alphabet Σ. By |x| we mean the length of x. We denote the i-th character of x by x_i, and the substring from the i-th character to the j-th character by x[i, j]. We denote the concatenation of two strings x and y by x ∘ y. By [n], we mean the set of positive integers no larger than n.
Edit Distance. The edit distance (or Levenshtein distance) between two strings x and y, denoted by ED(x, y), is the smallest number of edit operations (insertion, deletion, and substitution) needed to transform one into the other. The insertion (deletion) operation adds (removes) a character at some position. The substitution operation replaces a character with another character from the alphabet Σ.
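For reference, the standard textbook dynamic program computing ED(x, y), written to keep only two rows of the table:

```python
# Classic DP for edit distance; prev[j] = ED(current prefix of x, y[:j]).
def ED(x, y):
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        cur = [i]
        for j, yj in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,               # delete x_i
                           cur[j - 1] + 1,            # insert y_j
                           prev[j - 1] + (xi != yj))) # substitute (or match)
        prev = cur
    return prev[-1]

print(ED("kitten", "sitting"))  # 3
```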
Longest Common Subsequence. We say a string z is a subsequence of x if there exist indices 1 ≤ i_1 < i_2 < ⋯ < i_{|z|} ≤ |x| such that z_j = x_{i_j} for all j. A string z is called a common subsequence of strings x and y if z is a subsequence of both x and y. Given two strings x and y, we denote the length of the longest common subsequence (LCS) of x and y by LCS(x, y).
In the proofs, we sometimes consider matchings between x and y. By a matching, we mean a function f : [|x|] → [|y|] ∪ {⊥} such that if f(i) ≠ ⊥, we have x_i = y_{f(i)}. We require the matching to be non-crossing; that is, for i < j, if f(i) and f(j) are both not ⊥, we have f(i) < f(j). The size of a matching is the number of i such that f(i) ≠ ⊥. We say a matching is a best matching if it achieves the maximum size. Each matching between x and y corresponds to a common subsequence. Thus, the size of a best matching between x and y is equal to LCS(x, y).
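Similarly, the standard dynamic program for LCS(x, y); a best non-crossing matching can be recovered from the table by backtracking:

```python
# Classic DP for LCS; dp[i][j] = LCS length of x[:i] and y[:j].
def LCS(x, y):
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj \
                       else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

print(LCS("10110", "01101"))  # 4 (e.g. "0110")
```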
Longest Increasing Subsequence. In the longest increasing subsequence problem, we assume there is a given total order on the alphabet Σ. We say a string z is an increasing subsequence of x if there exist indices i_1 < i_2 < ⋯ < i_{|z|} such that z_j = x_{i_j} for all j and x_{i_1} < x_{i_2} < ⋯ < x_{i_{|z|}}. We denote the length of the longest increasing subsequence (LIS) of a string x by LIS(x). In our discussion, we let −∞ and ∞ be two imaginary characters such that −∞ < σ < ∞ for all σ ∈ Σ.
Longest Non-decreasing Subsequence. The longest non-decreasing subsequence problem is a variant of the longest increasing subsequence problem. The difference is that in a non-decreasing subsequence, we only require x_{i_1} ≤ x_{i_2} ≤ ⋯ ≤ x_{i_{|z|}}.
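Both quantities admit a one pass computation via patience sorting; for LIS, the array below never has more entries than there are distinct symbols, which is the O(|Σ| log n) space bound mentioned in the introduction. A minimal sketch:

```python
from bisect import bisect_left, bisect_right

def lis(x):  # strictly increasing: replace the first tail >= c
    tails = []
    for c in x:
        k = bisect_left(tails, c)
        if k == len(tails):
            tails.append(c)
        else:
            tails[k] = c
    return len(tails)

def lns(x):  # non-decreasing: replace the first tail > c
    tails = []
    for c in x:
        k = bisect_right(tails, c)
        if k == len(tails):
            tails.append(c)
        else:
            tails[k] = c
    return len(tails)

print(lis([1, 3, 2, 3, 5]), lns([1, 3, 2, 3, 3]))  # 4 4
```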
Lemma 2.1 (Lovász Local Lemma).
Let A be a finite set of events. For A ∈ A, let Γ(A) denote the neighbours of A in the dependency graph (in the dependency graph, mutually independent events are not adjacent). If there exists an assignment of reals r : A → (0, 1) to the events such that for all A ∈ A,
Pr[A] ≤ r(A) ∏_{B ∈ Γ(A)} (1 − r(B)),
then, for the probability that none of the events in A happens, we have
Pr[⋂_{A ∈ A} Ā] ≥ ∏_{A ∈ A} (1 − r(A)) > 0.
In the Set Disjointness problem (DIS), we consider a two party game. Each party holds a binary string of length n, say a and b, and the goal is to compute the function DIS(a, b), which indicates whether the sets with characteristic vectors a and b are disjoint, i.e., whether there is no index i with a_i = b_i = 1.
We define the randomized communication complexity of DIS as the minimum number of bits required to be sent between the two parties in any randomized multi-round communication protocol with 2-sided error at most 1/3. The following is a well-known result: this complexity is Ω(n).
We will consider the one-way k-party communication model, where players P_1, …, P_k hold inputs x_1, …, x_k respectively. The goal is to compute a function f(x_1, …, x_k). In the one-way communication model, each player speaks in turn and player P_i can only send a message to player P_{i+1}. We sometimes consider multiple rounds of communication. In an R-round protocol, during each round, each player speaks in turn, with player P_i sending a message to P_{i+1}; at the end of each round except the last, player P_k sends a message to P_1. At the end of round R, player P_k must output the answer of the protocol. We note that our lower bounds also hold in the stronger blackboard model. In this model, the players write messages (in the order P_1, P_2, …, P_k) on a blackboard that is visible to all other players.
We define the total communication complexity of f in the k-party one-way communication model, denoted by CC(f), as the minimum number of bits required to be sent by the players in any deterministic communication protocol that always outputs a correct answer. The total communication complexity in the blackboard model is the total length of the messages written on the blackboard by the players.
For a deterministic protocol P that always outputs the correct answer, we let maxcom(P) be the maximum number of bits required to be sent by some player in protocol P. We define the maximum communication complexity of f as min_P maxcom(P), where P ranges over all deterministic protocols that output a correct answer. The maximum communication complexity is at least CC(f)/(kR), where R is the number of rounds.
Let S be a subset of U^k, where U is some finite universe and k is an integer. Define the span of S by span(S) = {(x_1, …, x_k) ∈ U^k : ∀ i ∈ [k], ∃ y ∈ S such that y_i = x_i}. The notion of a λ-fooling set introduced in [EJ08] is defined as follows.
Definition 2.1 (λ-fooling set).
Let S ⊆ U^k, where U is some finite universe. Let f be a function on U^k. For some integer λ, we say S is a λ-fooling set for f iff for each y ∈ S and for each subset T of S with cardinality λ, the span of T contains a member z such that f(z) ≠ f(y).
We have the following.
Lemma 2.3 (Fact 4.1 from [EJ08]).
Let S be a λ-fooling set for f; then CC(f) ≥ log(|S|/λ).
3 Lower Bounds for Edit Distance
We show a reduction from the Set Disjointness problem (DIS) to computing ED between two strings in the asymmetric streaming model. For this, we define the following two party communication problem between Alice and Bob.
Given an alphabet Σ and three integer parameters, suppose Alice holds a string x1 and Bob holds a string x2. There is another fixed reference string y that is known to both Alice and Bob. Alice and Bob now try to compute ED(x1 ∘ x2, y). We call this the two party ED problem. We prove the following theorem.
Theorem 8.
Suppose each input string to the two party ED problem has length Θ(n), with a suitably chosen reference string y. Then the randomized communication complexity of the two party ED problem is Ω(n).
To prove this theorem, we first construct the strings based on the inputs a, b to DIS. From a, Alice constructs a balanced string ā in which the number of 1's is the same regardless of a. Similarly, from b, Bob constructs a balanced string b̄. Now Alice lets x1 be a modification of a fixed string, where for each i, if ā_i = 1 then the symbol at the corresponding index is replaced by *. Similarly, Bob lets x2 be a modification of another fixed string, where for each i, if b̄_i = 1 then the symbol at the corresponding index is replaced by *.
We have the following lemma.
Lemma 3.1.
If then .
To prove the lemma, we observe that in a series of edit operations that transforms x1 ∘ x2 to y, there exists an index such that x1 is transformed into the prefix of y up to that index, and x2 is transformed into the remaining suffix. We analyze the edit distance of each part. We first have the following claim:
Claim 3.1.
For any two strings x and y, there is a sequence of optimal edit operations (insertion/deletion/substitution) that transforms x to y, where all deletions happen first, followed by all insertions, and all substitutions happen at the end of the operations.
Proof.
Note that a substitution does not change the indices of the symbols. Thus, in any sequence of such edit operations, consider the last substitution, which happens at some index l. If there are no insertions/deletions after it, then we are done. Otherwise, consider what happens if we switch this substitution with the insertions/deletions that come after it (and after the second to last substitution). The only symbol that may be affected is the symbol that index l is changed into. Thus, depending on this position, we may or may not need a substitution, which results in a sequence of edit operations whose total number is at most the original number. In this way, we can move all substitutions to the end.
Further notice that we can assume without loss of generality that any deletion only deletes original symbols of x, because otherwise we would be deleting an inserted symbol, and these two operations cancel each other. Therefore, in a sequence of optimal edit operations, all the deletions can happen before any insertion. ∎
For any i, let c(i) denote the number of * symbols up to index i in x1. Note that the total count is equal to the number of 1's in ā. We have the following lemma.
Lemma 3.2.
For any , let where , then if and if .
Proof.
By Claim 3.1 we can first consider deletions and insertions, and then compute the Hamming distance after these operations (for substitutions).
We consider the three different cases of . Let the number of insertions be and the number of deletions be . Note that . We define the number of agreements between two strings to be the number of positions where the two corresponding symbols are equal.
The case of and .
Here again we have two cases.
Case (a): In this case, notice that the LCS after the operations between and is at most the original . With insertions, the number of agreements can be at most , thus the Hamming distance at the end is at least . Therefore, in this case the number of edit operations is at least , and equality is achieved when .
Case (b): In this case, notice that all original symbols in larger than (or beyond index before the insertions) are guaranteed to be out of agreement. Thus the only possible original symbols in that are in agreement with after the operations are the symbols with original index at most . Note that the LCS between and is . Thus with insertions the number of agreements is at most , and the Hamming distance at the end is at least .
Therefore the number of edit operations is at least . Now notice that and the quantity is non-decreasing as increases. Thus the number of edit operations is at least .
The other case of is similar, as follows.
The case of .
Here again we have two cases.
Case (a): In this case, notice that the LCS after the operations between and is at most the original . With insertions, the number of agreements can be at most , thus the Hamming distance at the end is at least . Therefore, in this case the number of edit operations is at least , and equality is achieved when .
Case (b): In this case, notice that all original symbols in larger than (or beyond index before the insertions) are guaranteed to be out of agreement. Thus the only possible original symbols in that are in agreement with after the operations are the symbols with original index at most . Note that the LCS between and is . Thus with insertions the number of agreements is at most , and the Hamming distance at the end is at least .
Therefore the number of edit operations is at least . Now notice that and the quantity is non-decreasing as increases. Thus the number of edit operations is at least .
∎
We can now prove a similar lemma for . For any , let denote the number of symbols from index to in . Note that is equal to the number of ’s in .
Lemma 3.3.
Let where , then if and if .
Proof.
We can reduce to Lemma 3.2. To do this, we subtract every symbol in both strings from a fixed value, while keeping all the * symbols unchanged. Now, reading both strings from right to left, x2 becomes a string with some symbols replaced by *'s, of exactly the form treated in Lemma 3.2, and similarly for the corresponding part of y.
If we regard these as the strings in Lemma 3.2 and define the corresponding quantities as in that lemma, we can see that the two settings coincide.
Now the lemma basically follows from Lemma 3.2. In the case of , we have
In the case of , we have
∎
We can now prove Lemma 3.1.
Proof of Lemma 3.1.
We show that for any , . First we have the following claim.
Claim 3.2.
If , then for any , we have .
To see this, note that when is even, we have and so . Now consider the case of being odd and let for some . We know and , so we only need to look at and and count the number of symbols ’s in them. If the number of ’s is at least , then we are done.
The only possible situation where the number of ’s is is that which means and this contradicts the fact that .
We now have the following cases.
- Case (a):
- Case (b):
- Case (c):
∎
We now prove Theorem 8.
Proof of Theorem 8.
We begin by upper bounding when .
Claim 3.3.
If then .
To see this, note that if then there exists a such that . Thus , and , . Note that the number of ’s in is and thus . Similarly the number of ’s in is and thus . To transform to , we choose , transform to , and transform to .
Therefore, the two cases of DIS lead to two distinguishable values of the edit distance. Thus any protocol that solves the two party ED problem can also solve DIS, hence the theorem follows. ∎
In the proof of Theorem 8, the two strings x1 ∘ x2 and y have different lengths; however, we can extend the construction to the case where the two strings have the same length and prove the following theorem.
Theorem 9.
Suppose each input string to the two party ED problem has length Θ(n), with a suitably chosen reference string. Append auxiliary padding strings so that the two sides have equal length, and define the corresponding two party communication problem of computing the edit distance between the padded strings. Then its randomized communication complexity is also Ω(n).
Proof.
We extend the construction of Theorem 8 as follows. From the input to DIS, first construct x1, x2, and y as before. Then append the auxiliary padding strings described above to both sides, noting that the resulting strings have the same length.
We finish the proof by establishing the following two lemmas.
Lemma 3.4.
If then .
Proof.
First we can see that since we can first use at most edit operations to change into (note that the first symbols are the same), and then add symbols of at the end.
Now we have the following claim:
Claim 3.4.
In an optimal sequence of edit operations that transforms to , at the end some symbol in must be kept and thus matched to at the same position.
To see this, assume for the sake of contradiction that none of the symbols in is kept, then this already incurs at least edit operations, contradicting the fact that .
We now have a second claim:
Claim 3.5.
In an optimal sequence of edit operations that transforms to , at the end all symbols in must be kept and thus matched to at the same positions.
To see this, we use Claim 3.4 and first argue that some symbol of is kept and matched to . Assume this is the symbol . Then we can grow this symbol both to the left and to the right and argue that all the other symbols of must be kept. For example, consider the symbol if . There is a symbol that is matched to in the end. If this symbol is not the original , then the original one must be removed either by deletion or substitution, since there cannot be two symbols of in the end. Thus instead, we can keep the original symbol and reduce the number of edit operations.
More precisely, if this symbol is an insertion, then we can keep the original symbol and get rid of this insertion and the deletion of the original, which saves operations. If this symbol is a substitution, then we can keep the original symbol, delete the symbol being substituted, and get rid of the deletion of the original, which saves one operation. The other direction is completely symmetric. Continuing this way, we see that all symbols of the appended string must be kept and thus matched at the same positions.
Now, we can see the optimal sequence of edit operations must transform into , and transform the empty string into . Thus by Lemma 3.1 we have
∎
We now have the next lemma.
Lemma 3.5.
If then .
Proof.
Again, to transform to , we can first transform into , and insert at the end. If then by Claim 3.3 . Therefore we have
Thus any protocol that solves this problem can also solve DIS, hence the theorem follows. ∎
∎
From Theorem 9 we immediately have the following theorem.
Theorem 10.
Any R-pass randomized algorithm in the asymmetric streaming model that computes ED(x, y) exactly between two strings of length n with success probability at least 2/3 must use space at least Ω(n/R).
We can generalize the theorem to the case of deciding whether ED(x, y) is at least a given number k. First we prove the following lemmas.
Lemma 3.6.
Let Σ be an alphabet. For any σ ∈ Σ and m ≥ 0, let x, y ∈ Σ* be two strings and let w = σ^m. Then ED(x ∘ w, y ∘ w) = ED(x, y).
Proof.
First it is clear that , since we can just transform to . Next we show that .
To see this, suppose a series of edit operations transforms to or for some and transforms to the other part of (called ). Then by triangle inequality we have . Also note that . Thus the number of edit operations is at least . ∎
Lemma 3.7.
Let Σ be an alphabet, and let x, y, w, z be four strings over Σ. If there is no common symbol between any of the three pairs of strings (x, z), (y, w), and (w, z), then ED(x ∘ w, y ∘ z) = ED(x, y) + max(|w|, |z|).
Proof.
First it is clear that ED(x ∘ w, y ∘ z) ≤ ED(x, y) + max(|w|, |z|), since we can just transform x to y and then transform w into z. Next we show the matching lower bound.
To see this, suppose a series of edit operations transforms to for some and transforms to the other part of . Then by triangle inequality we have . Since there is no common symbol between and , we have . Thus the number of edit operations is at least . The case of transforming to for some is completely symmetric since equivalently it is transforming to for some . ∎
We have the following two theorems.
Theorem 11.
There is a constant c > 1 such that for any k with ck ≤ n, and any alphabet Σ with |Σ| ≥ ck, any R-pass randomized algorithm in the asymmetric streaming model that decides if ED(x, y) ≥ k between two strings x, y of length n with success probability at least 2/3 must use space at least Ω(k/R).
Proof.
Theorem 9 and Theorem 10 can be viewed as deciding if ED(x, y) ≥ k for two strings of length Θ(k) over an alphabet of size O(k). Thus we can first use the constructions there to reduce DIS to the problem of deciding if ED(x, y) ≥ k with a fixed string y of length Θ(k); the number of symbols used is O(k) as well. Now, to increase the length of the strings to n, we pad a run of a single symbol at the end of both x and y until the length reaches n. By Lemma 3.6 the edit distance stays the same, and thus the problem is still deciding if ED(x, y) ≥ k. By Theorem 9 the communication complexity is Ω(k), and thus the theorem follows. ∎
Theorem 12.
There is a constant c > 1 such that for any k with ck ≤ n, and any alphabet Σ with c|Σ| ≤ k, any R-pass randomized algorithm in the asymmetric streaming model that decides if ED(x, y) ≥ k between two strings x, y of length n with success probability at least 2/3 must use space at least Ω(|Σ|/R).
Proof.
Theorem 9 and Theorem 10 can be viewed as deciding if the edit distance is above a threshold for two strings over an alphabet of size O(|Σ|). Thus we can first use the constructions there to reduce DIS to such a decision problem with a fixed string y, where the number of symbols used is O(|Σ|). Now we take the 2 unused symbols and pad runs of these two symbols of suitable length at the end of x and y respectively. By Lemma 3.7 the edit distance increases by exactly this length, and thus the problem becomes deciding if ED(x, y) ≥ k. Next, to increase the length of the strings to n, we pad a run of a single symbol at the end of both strings until the length reaches n. By Lemma 3.6 the edit distance stays the same, and thus the problem is still deciding if ED(x, y) ≥ k. By Theorem 9 the communication complexity is Ω(|Σ|), and thus the theorem follows. ∎
Combining the previous two theorems we have the following theorem, which is a restatement of Theorem 1.
Theorem 13 (Restatement of Theorem 1).
There is a constant c > 1 such that for any k with ck ≤ n, given an alphabet Σ, any R-pass randomized algorithm in the asymmetric streaming model that decides if ED(x, y) ≥ k between two strings x, y of length n with success probability at least 2/3 must use space at least Ω(min(k, |Σ|)/R).
For (1+ε) approximation, by taking k = Θ(1/ε) we also get the following corollary:
Corollary 3.1.
Given an alphabet Σ, for any ε ≥ c/n, any R-pass randomized algorithm in the asymmetric streaming model that achieves a (1+ε) approximation of ED(x, y) between two strings x, y of length n with success probability at least 2/3 must use space at least Ω(min(1/ε, |Σ|)/R).
4 Lower Bounds for LCS
In this section, we study the space lower bounds for asymmetric streaming LCS.
4.1 Exact computation
4.1.1 Binary alphabet, deterministic algorithm
In this section, we assume can be divided by and let . We assume the alphabet is . Consider strings of the form
(1)
That is, contains blocks of consecutive symbols. Between each block of symbols, we insert ’s, and we also add ’s to the front and the end of . The are integers such that
(2)
(3)
Thus, the length of is and it contains exactly ’s.
Let be the set of all of form (1) satisfying equations (2) and (3). For each string , we can define a string as follows. Assume ; we set . That is, is simply with the first ’s removed. We denote .
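The following sketch only illustrates the shape of the strings in (1) and the map defined above; the parameters ones_runs (playing the role of the integers constrained by (2) and (3)) and zeros_run are hypothetical choices of ours, not the exact values from the proof.

    def build_x(ones_runs, zeros_run):
        # a string of form (1): runs of 1's separated (and surrounded) by runs of 0's
        s = "0" * zeros_run
        for r in ones_runs:
            s += "1" * r + "0" * zeros_run
        return s

    def f(x, prefix_len):
        # the map f strips a fixed-length prefix of 0's from x
        return x[prefix_len:]

    x = build_x([3, 1, 2], zeros_run=4)
    print(x, f(x, prefix_len=4))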
Claim 4.1.
.
Proof.
Notice that for , if , then . We have .
The size of equals the number of choices of integers that satisfy (2) and (3). For a lower bound on , we can pick of the integers to be , and set the remaining to be or . Thus the number of such choices is at least
∎
We first show the following lemma.
Lemma 4.1.
Let . For every ,
For any two distinct ,
Proof of Lemma 4.1.
We first show . Notice that is of the form
It contains blocks of ’s, each consisting of consecutive ’s. These blocks of ’s are separated by some ’s. Also, has ’s and ’s. Let be the first position of the -th block of ’s.
Let us consider a matching between and .
If the matching does not match any ’s in to , the size of such a matching is at most since it is the number of ’s in .
Now we assume some 1’s in are matched to . Without loss of generality, we can assume the first symbol in is matched. This is because all ’s in are consecutive and if the first in (i.e. ) is not matched, we can find another matching of the same size that matches . For the same reason, we can assume the first in is matched to position for some . Assume is the number of ’s matched. Again, without loss of generality, we can assume the first ’s starting from position in are matched since all ’s in are consecutive and there are no matched ’s between two matched ’s. We have two cases.
- Case 1: is matched to for some . Let be the number of matched ’s. We know since there are ’s in .
If , we match the first ’s in starting from position . Consider the number of ’s that are still free to match in . The number of ’s before is . Since , is at most , we can match all of them to the first third of . Also, we need blocks of ’s to match all ’s in . The number of ’s after the last matched in is (which is zero when ). Again, we can match all these ’s since . In total, we can match ’s. This gives us a matching of size . We argue that the best strategy is to always match ’s. To see this, if we remove matched ’s, this lets us match additional ’s for some . By our construction of , is strictly smaller than for all . If we keep doing this, the size of the matching will decrease. Thus, the largest matching we can find in this case is .
- Case 2: is matched to for some . By the same argument as in Case 1, the best strategy is to match as many ’s as possible. The number of ’s in starting from position is . The number of ’s that are free to match is . Since for all , the number of ’s that can be matched is strictly smaller than . The largest matching we can find in this case is smaller than .
This proves that the size of the largest matching we can find is exactly . We have .
For the second statement in the lemma, say and are two distinct strings in . For convenience, we assume , and . Let be the smallest integer such that . We have since . Without loss of generality, we assume . We show that . Notice that is of the form
By the same notation, let be the first position of the -th block of ’s in
Consider the matching that matches the first in to position . The number of ’s before is . We match all ’s in . The number of ’s after the last matched in is . This gives us a matching of size . By our choice of , we have for . The size of the matching equals , which is larger than . Thus, the length of the LCS between and is larger than . This finishes our proof.
∎
Lemma 4.2.
In the asymmetric streaming model, any deterministic protocol that computes for any , in passes of needs space.
Proof.
Consider a two-party game where player 1 holds a string and player 2 holds a string . The goal is to verify whether . It is known that the total communication complexity of testing the equality of two elements from a set is ; see [KN97] for example. We can reduce this to computing the length of the LCS. To see this, we first compute and with . By Lemma 4.1, if both , we know ; otherwise, . Here, is known to both parties.
The above reduction shows the total communication complexity of this game is since . If we only allow rounds of communication, the size of the longest message sent by the players is . Thus, in the asymmetric model, any protocol that computes in passes of needs space.
∎
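On small instances, the quantities in Lemma 4.1 and the reduction above can be checked directly with the textbook dynamic program for LCS; this is only a testing aid and plays no role in the streaming protocol itself.

    def lcs(x, y):
        # standard O(|x| * |y|) dynamic program for the length of the LCS
        n = len(y)
        prev = [0] * (n + 1)
        for a in x:
            cur = [0] * (n + 1)
            for j, b in enumerate(y, 1):
                cur[j] = prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1])
            prev = cur
        return prev[n]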
4.1.2 size alphabet, randomized algorithm
Lemma 4.3.
Assume . In the asymmetric streaming model, any randomized protocol that computes correctly with probability at least for any , in passes of needs space. The lower bound also holds when is a permutation of .
Proof of Lemma 4.3.
The proof is by giving a reduction from set-disjointness.
In the following, we assume the alphabet set is the set of integers from 1 to . We let the online string be a permutation of and the offline string be the concatenation of from 1 to . Then computing is equivalent to computing , since any common subsequence of and must be increasing.
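For intuition, the length of the LIS of a permutation can be computed with patience sorting; a minimal sketch:

    import bisect

    def lis_length(seq):
        # patience sorting: tails[i] holds the smallest possible tail
        # of an increasing subsequence of length i + 1
        tails = []
        for v in seq:
            i = bisect.bisect_left(tails, v)
            if i == len(tails):
                tails.append(v)
            else:
                tails[i] = v
        return len(tails)

    assert lis_length([2, 5, 1, 3, 4]) == 3  # e.g. 1, 3, 4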
We now describe our approach. For convenience, let . Without loss of generality, we assume can be divided by . Consider a string with the following property.
(4) |
For each , we consider subsets for as defined below
Notice that if and . For an odd , . Conversely, for an even , . Also notice that for distinct with , and .
We abuse the notation a bit and let be a string such that if , the string consists of elements in set arranged in increasing order, and if , the string is arranged in decreasing order.
Let be the concatenation of for such that . By the definition of ’s, we know is a permutation of .
For convenience, let and be two subsequences of such that
If , then there exist some such that . Notice that in , we have . Thus, , since only one of and is increasing and the other is decreasing. For any , we have
(5)
(6)
If , we prove that . We only need to consider, in a longest increasing subsequence, when we first pick some elements from the second half of . Say the first element picked in the second half of is in or for some . We have
This is because and cannot both be 1, , and . The length of the of the substring of before is , and the length of the LIS after is . Thus, we have .
This gives a reduction from computing to computing . Now assume player 1 holds the first half of and player 2 holds the second half. Both players have access to . Since , any randomized protocol that computes with success probability at least has total communication complexity . Thus, any randomized asymmetric streaming algorithm with passes of needs space. ∎
We can generalize the above lemma to the following.
Theorem 14 (Restatement of Theorem 2).
There is a constant such that for any with , given an alphabet , any -pass randomized algorithm in the asymmetric streaming model that decides if between two strings with success probability at least must use at least space.
Proof.
Without loss of generality, assume so . Since we assume , we have .
Let . We let the offline string be the concatenation of two parts and , where is the concatenation of the symbols in in ascending order and is the symbol repeated times. Thus, . also consists of two parts and , such that is over alphabet with length and is the symbol repeated times. The symbol does not appear in and the symbol does not appear in ; thus, . By Lemma 4.3, any randomized algorithm that computes with probability at least 2/3 using passes of requires space.
∎
4.2 Approximation
We now show a lower bound for deterministic approximation of in the asymmetric streaming model.
Theorem 15 (Restatement of Theorem 3).
Assume , and . In the asymmetric streaming model, any deterministic protocol that computes an approximation of for any , with a constant number of passes of needs space.
Proof of Theorem 15.
In the following, we assume the size of the alphabet set is such that
We let such that can be divided by 60.
For any distinct , we can build a set in the same way as we did in Section 4.1.1, except that we replace with (we used the notation instead of ). Thus, . Similarly, we can define the function and . Let . By Claim 4.1 and Lemma 4.1, we know . For any , we have . For any two distinct , we know at least one of and is larger than .
Consider such that and for . Thus, can be viewed as an element in
For alphabet , we can similarly define , and . We let be similarly defined as but over alphabet .
Let . We can define function such that
(7) |
Consider an error-correcting code over alphabet with constant rate and constant distance . We can pick and , for example. Then the size of the code is . For any code word where , we can write
Let . Then
(8) |
We consider a -player one-way game where the goal is to compute the function on input . In this game, player holds for and can only send a message to player (here, ). We now show the following claim.
Claim 4.2.
is a fooling set for the function .
Proof.
Consider any two different codewords . Denote and . By our definition of , we know and for . We have . Thus, .
The span of is the set . We need to show that there exists some in the span of such that . Since are two different codewords of , we know there are at least indices such that . Let be the set of indices such that . Then, for , we have
We can build a as follows. For , if , we set and ; otherwise, we set and . For , we set and . Thus, for at least indices , we have . We must have . ∎
Consider a matrix of size of the following form
where and for (the elements of matrix are strings over alphabet ). Thus, the -th row of is an element in . We define the following function .
(9) |
where is the -th row of matrix and is the same as except the inputs are over alphabet instead of for . Also, we define in exactly the same way as except elements in are over alphabet instead of .
Consider a player game where each player holds one column of . The goal is to compute . We first show the following Claim.
Claim 4.3.
The set of all matrices such that is a fooling set for .
Proof of Claim 4.3.
Consider any two matrices such that . We know . There is some row such that . By Claim 4.2, there is some element in the span of and such that . Thus, there is some element in the span of and such that . Here, by in the span of and , we mean that the -th column is either or . ∎
Since , by the above claim we have a fooling set for of size . Thus, , since .
We now show how to reduce computing to approximating the length of LCS in the asymmetric streaming model.
We consider a matrix of size such that
(10) |
In other words, is obtained by inserting a column of symbols into at every third position. For , let . That is, is the concatenation of the elements in the -th column of . We can set to be the concatenation of the columns of . Thus, since for any , we have
For , we have defined . We let and be two non-empty strings such that . We consider another matrix of size such that
(11) |
For , let be the concatenation of elements in the -th row of . We can set to be the concatenation of rows of . Thus,
We now show the following Claim.
Claim 4.4.
If , . If , .
Proof.
We first divide into blocks such that
where . We know is the -th column of , is the -th column of , and is the symbol repeated times.
If , we show . We consider a matching between and . For our analysis, we let be the largest integer such that some symbol in is matched to a symbol in the -th row of (recall that is the concatenation of the rows of ). If no symbol in is matched, we let and set ; we have
We now show that there is an optimal matching such that is matched to at most symbols in . There are two cases:
- Case (a): . In this case, can only be matched to the -th row of , . consists of symbols and and is of the form
We first show that we can assume and symbols in are only matched to the block between the block of and the block of .
Suppose that in , some , symbols before the block of are matched to , and say the number of such matches is at most . We know there are at most , symbols in . In this case, notice that no symbol in can be matched to . We can build another matching between and by removing these matches and adding matches between symbols in and . We can do this since there are symbols in , and before the -th row we can match at most symbols, so there are at least unmatched symbols in . The size of the new matching is at least the size of the old matching.
Similarly, if some , symbols after the block of in are matched to , then no symbol in can be matched to . We can remove these matches and add matched symbols. This gives us a matching of size at least the size of the old matching.
Thus, we only need to consider the case where is matched to the part of after the block of symbols and before the block of symbols, which is . Since , we know is exactly . Also, we can match at most symbols. Thus, is matched to at most symbols in .
- Case (b): . We can assume that in , apart from symbols, only symbols are matched to . To see this, assume with is the largest integer such that some symbol or in is matched to . By symbols, we mean symbols and . We now consider how many symbols in can be matched to . We only need to consider the substring
Let be the largest integer such that some symbol or from is matched. Notice that for any , the symbols , only appear in the -th row of , and is the concatenation of the rows of . For with , only those in can be matched, since block is after in . For with , only those in can be matched, by the assumption on .
Notice that for any , has length and has length . Thus, the number of matched symbols is at most . We can build another matching by first removing the matches of symbols in . Then, we match another symbols in to the symbols in from the -th row to the -th row. These symbols are not matched, since we assume is only matched to the first rows of in the original matching. Further, we can match to the block between the block of and the block of . This gives us additional matches. Since , we know the number of added matches is at least the number of removed matches. In the new matching, . Thus, we can assume that in , apart from symbols, only symbols are matched to .
By the same argument as in Case (a), we can assume and symbols in are only matched to the block between the block of and the block of . Thus, we can match and symbols. Also, we can match at most symbols, since there are this many symbols in from the -th row to the -th row. Thus, is matched to at most symbols in .
In total, the size of the matching is at most
Thus, if , we know
If , that means there is some row of such that for at least positions , we have .
We now build a matching between and of size at least . We first match , which is a subsequence of , to the first block in ; this gives us at least matches. Then, we match all the symbols in the first rows of to . This gives us matches.
We consider the string
It is a subsequence of . Also, for at least positions , we know . For the rest of the positions, we know . Thus, .
After the -th row of , we can match another symbols to . This gives us a matching of size at least . Thus, if , .
∎
Assume where is some constant. Let . If we can give a approximation of , we can distinguish and .
Thus, we can reduce computing in the player setting to computing . The string (the offline string) is known to all players and contains no information about . The -th player holds the -th column of . If is odd, the player knows the -th column of . If is even, the player knows the -th row of , which is the -th row of , and the -th row of , which consists of only .
∎
5 Lower Bounds for LIS and LNS
In this section, we present our space lower bounds for and .
Let ; then can be seen as a binary string of length . For each integer , we can define a function whose domain is a subset of . Let be some constant. We have the following definition:
(12) |
We leave undefined otherwise. Let be a matrix and denote the -th row of by . We can define as the direct sum of copies of . Let
(13) |
That is, if there is some such that and if for all , .
In the following, we consider computing and in the -party one-way communication model. When computing , player holds the -th element of for . When computing , player holds the -th column of matrix for . We use to denote the total communication complexity of , and to denote the total communication complexity of . We also consider multiple rounds of communication and denote the number of rounds by .
Lemma 5.1.
For any constant , there exists a constant (depending on ), such that there is a -fooling set for function of size for some constant .
We note that Lemma 4.2 of [EJ08] proved the same result for the case .
Proof of Lemma 5.1.
We consider sampling randomly from as follows. For , we independently pick with probability and with probability . We set for some large constant . For , we let be the event that there are two 1’s in the substring . Let be the probability that event happens. By a union bound, we have
(14) |
Let . Notice that since we are sampling each position of independently, the event is dependent on at most other events in . We set . For large enough , we have
(15) |
Here, the second inequality follows from the fact that is a constant and , so we can pick to be large enough, say (or to be small enough) to guarantee .
Thus, we can use the Lovász Local Lemma here. By Lemma 2.1, we have
(16) |
Notice that “there are at least ’s between any two ’s in ” is equivalent to the statement that none of the events happens. We say a sampled string is good if none of the happens. Thus, for any good string , we have . The probability that a sampled string is good is at least . For convenience, we let .
Assume we independently sample strings in this way; the expected number of good strings is . Let be independent random samples. We consider a string in the span of these strings such that, for , if there is some such that ; then is in the span of these strings. We now consider the probability that has at least 1’s, i.e. . Notice that with probability , thus . Let . The expected number of 1’s in is . Let , and let be the probability that has fewer than 1’s. Using a Chernoff bound, we have
(17) |
Let and . We consider the probability that these samples do not form a -fooling set for . For any samples, they fail to be a -fooling set with probability at most . Let denote the event that these samples form a -fooling set. Using a union bound, the probability that does not happen is
Let be a random variable equal to the number of good strings among the samples. As we have shown, the expectation of is . Also notice that . Thus, with positive probability, there are good samples and they form a -fooling set.
(18) |
Notice that
(19) |
Since we assume is a constant, this is larger than 1 when is a large enough constant (depending on ). This finishes the proof.
∎
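The sampling argument above can also be eyeballed numerically. The sketch below substitutes rejection sampling for the Lovász Local Lemma and uses entirely hypothetical parameters; it merely contrasts the number of 1's in a single good string with the number of 1's in the coordinate-wise OR (the span) of several good strings, which is the gap the fooling set exploits.

    import random

    def sample_good(n, p, gap):
        # rejection sampling: redraw until every two 1's are at least `gap` apart
        while True:
            s = [1 if random.random() < p else 0 for _ in range(n)]
            ones = [i for i, b in enumerate(s) if b]
            if all(b - a >= gap for a, b in zip(ones, ones[1:])):
                return s

    n, p, gap, q = 200, 0.02, 20, 5  # hypothetical parameters
    samples = [sample_good(n, p, gap) for _ in range(q)]
    union = [max(bits) for bits in zip(*samples)]
    print(sum(samples[0]), sum(union))  # few 1's in one string, many in the span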
The following lemma is essentially the same as Lemma 4.3 in [EJ08].
Lemma 5.2.
Let be a -fooling set for . Then the set of all matrices such that is a -fooling set for .
Lemma 5.3.
Proof.
5.1 Lower bound for streaming LIS over small alphabet
We now present our space lower bound for approximating LIS in the streaming model.
Lemma 5.4.
For with and any constant , any deterministic algorithm that makes passes of and outputs a -approximation of requires space.
Proof of Lemma 5.4.
We assume the alphabet set , which has size . Let be a large constant and assume can be divided by for simplicity. We set and . Consider a matrix of size . We denote the element in the -th row and -th column by . Also, we require that is either or . For each row of , say , either there are at least 0’s between any two nonzeros or it has more than nonzeros. We let be a binary matrix such that if and if for .
Without loss of generality, we can view for , or for as a string. More specifically, for , and for .
We let . Thus, is a string of length . For convenience, we denote . Here, we require the length of . If , we can pad with 0 symbols to make it have length . This will not affect the length of the longest increasing subsequence of .
We first show that if there is some row of that contains more than nonzeros, then . Say contains more than nonzeros. By our definition of , is strictly increasing when restricted to the nonzero positions. Thus, . Also notice that is a subsequence of . This is because contains element for and is the concatenation of ’s for from 1 to . Thus, .
Otherwise, for any row , there are at least zeros between any two nonzero positions. We show that . Assume and let be a longest increasing subsequence of .
We can think of as a path on the matrix. By going down steps from , we mean walking from to . By going right steps from , we mean walking from to . Then corresponds to the path . Notice that for each step , we have . Thus, by our construction of the matrix , each step can only be one of three types:
- Type 1: and . If , then for any , we have if and are both nonzero.
- Type 2: and . This corresponds to the case where and are picked from the same row of . However, since we assume in each row of , the number of 0’s between any two nonzeros is at least . Since and are both nonzeros and , we must have .
- Type 3: and . When , . Since we require , we must have .
For , let be the number of steps of Type in the path . Then . We say is a step for , and let
Equivalently, is the distance we go downward with steps of Type 1, and is the distance we go upward with steps of Type 3. Since only steps of Type 1 and Type 3 move up or down, we know
(20) |
As for the distance we go right: each step of Type 2 moves at least positions to the right, and each step of Type 3 moves right at least steps. The total distance we go right is , so we have
(21) |
This shows that if , we have , and if , . Here, and can be any large constants of our choice, and is fixed. For any , we can choose and such that . This gives us a reduction from computing to computing a -approximation of .
In the -party game for computing , each player holds one column of . Thus, player also holds , since is determined by . If the players can compute a approximation of in the one-way communication model, we can distinguish the cases and . Thus, any -pass deterministic streaming algorithm that approximates within a factor requires space at least . By Lemma 5.3, .
∎
5.2 Longest Non-decreasing Subsequence
We can prove a similar space lower bound for approximating the length of the longest non-decreasing subsequence in the streaming model.
Lemma 5.5.
For with and any constant , any deterministic algorithm that makes passes of and outputs a -approximation of requires space.
Proof of Lemma 5.5.
In this proof, we let the alphabet set . The size of the alphabet is . Without loss of generality, we can assume .
Let be a binary matrix such that for any row of , say for , either there are at least 0’s between any two 1’s, or has at least 1’s.
Similar to the proof of Lemma 5.4, we show a reduction from computing to approximating the length of . In the following, we set and for some constant such that is an integer.
Let us consider a matrix such that for any :
(22) |
Thus, for all positions such that , we know . Also, for , assume for some , we have . For positions such that , we have .
Consider the sequence where is the concatenation of symbols in the -th column of matrix .
We now show that if , and if , . (Here, we assume . If , we can repeat each symbol in times and show , and if , . The proof is the same.)
If , consider a longest non-decreasing subsequence of , denoted by . Then can be divided into two parts and such that consists of symbols from and consists of symbols from . Similarly to the proof of Lemma 5.4, corresponds to a path on the matrix . Since we are concatenating the columns of , the path can never go left. Each step either goes right at least positions, since there are at least 0’s between any two 1’s in the same row of , or goes downward to another row. Thus, the total number of steps is at most , since has rows and columns. For , if we restrict to positions that are in , symbols in column of must be smaller than symbols in column if . Thus, the length of must be at most the length of for any , which is at most . Thus the length of is at most .
If , then there is some such that row of contains at least 1’s. We know if . Thus, contains a non-decreasing subsequence of length at least . Since is a subsequence of , we know .
For any constant , we can pick constants and such that . Thus, if we can approximate to within a factor, we can distinguish the case of and . The lower bound then follows from Lemma 5.3.
∎
Lemma 5.6.
Let and such that . Then any deterministic algorithm that makes a constant number of passes of and outputs a approximation of requires space.
Proof.
Let the alphabet set and assume that .
We assume . Since we assume , if , we can use fewer symbols in for our construction, and this will not affect the result. Let be such that can be divided by . For any two symbols , we consider the set such that
We define a function such that for any , . Thus, for any , the string has exactly symbols and symbols.
We let . We know .
Consider an error-correcting code over alphabet set with constant rate and constant distance . We can pick and . Then the size of the code is . For any code word where , we can write
Let . Since the code has constant rate , the size of is .
Let . We can define function such that
(23) |
Claim 5.1.
is a fooling set for .
Proof.
Let and be two distinct elements in . Let and . By the definition, we know for at least positions , we have . Also, by the construction of set , if , then one of and has more than symbols.
We say is in the span of and if, for each , either or . We can find a in the span of and such that .
∎
For , we can define , , , similarly except the alphabet is instead of .
Consider a matrix of size such that and . We define a function such that
(24) |
In the following, we consider a -party one way game. In the game, player holds . The goal is to compute .
Claim 5.2.
The set of all matrices such that for is a fooling set for .
Proof.
Consider any two matrices such that . We know . There is some row such that . By Claim 5.1, there is some element in the span of and such that . Thus, there is some element in the span of and such that . ∎
Thus, we get a -fooling set for function in the -party setting. The size of the fooling set is . Thus, and .
Consider a matrix of size such that is obtained by inserting a column of symbols into at every third position. Thus,
Let .
We now show how to reduce computing to approximating the length of . We claim the following.
Claim 5.3.
If , . If , .
Proof.
We first divide into parts such that
where . We know is the -th column of , is the -th column of and .
If , we show . Let be a longest non-decreasing subsequence of . We can divide into parts such that is a subsequence of . For our analysis, we set . If is empty or contains no symbols, let . Otherwise, we let be the largest integer such that appears in . We have
We now show that for any , the length of is at most . To see this, if , then . Since , there are exactly symbols in . Thus .
If , suppose there is some symbol in included in for some . Then cannot include symbols in for all . Also, for any , the number of symbols in is but the number in is . Thus, the optimal strategy is to pick the symbols in and then add symbols ( ’s for ) in to .
In total, the length of is at most
Thus, if , we know .
If , that means there is some row of such that for at least positions , the number of symbols in is at least .
We now build a non-decreasing subsequence of with length at least . We set to be empty initially. There are at least symbols in ; we add all of them to . Then, we add all the symbols for in to . This adds symbols to .
We consider the string
It is a subsequence of . There are symbols in . This is because for at least positions , we know contains more than symbols. For the rest of the positions, we know contains symbols.
Finally, we add all the symbols for in to . This adds another symbols to . The sequence has length at least . ∎
Assume where is some constant. Let . If we can give a approximation of , we can distinguish and .
By the fact that , any deterministic streaming algorithm with a constant number of passes of that approximates within a factor requires space. ∎
5.3 Longest Non-decreasing Subsequence with Threshold
We now consider a variant of the problem, which we call longest non-decreasing subsequence with threshold (). In this section, we assume the alphabet is . In this problem, we are given a sequence and a threshold ; the longest non-decreasing subsequence with threshold is the longest non-decreasing subsequence of in which each symbol appears at most times. We denote the length of such a subsequence by .
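Since this variant is new here, a brute-force reference implementation (exponential, for very small inputs only) pins down the definition: among all non-decreasing subsequences in which no symbol repeats more than t times, take the longest.

    from itertools import combinations
    from collections import Counter

    def lnst(seq, t):
        # brute force over all subsequences; only usable on tiny inputs
        for k in range(len(seq), 0, -1):
            for idx in combinations(range(len(seq)), k):
                sub = [seq[i] for i in idx]
                if all(a <= b for a, b in zip(sub, sub[1:])) \
                   and max(Counter(sub).values()) <= t:
                    return k
        return 0

    assert lnst([1, 1, 1, 2, 2], t=2) == 4  # e.g. 1, 1, 2, 2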
We first give our lower bounds for .
Theorem 16 (Restatement of Theorem 4).
Given an alphabet , for deterministic approximation of for a string in the streaming model with passes, we have the following space lower bounds:
1. for any constant (this includes ), when is any constant.
2. for (this includes ), when .
3. for , when .
Theorem 16 follows from the following four lemmas.
Lemma 5.7.
Let and , and let be any constant. Let . Then any -pass deterministic algorithm that outputs a approximation of requires space.
Proof.
In the proof of Lemma 5.5, each symbol in the sequence appears no more than times. If and , we have . The lower bound follows from our lower bound for in Lemma 5.5.
∎
Lemma 5.8.
Let and , and let be any constant. Let be some constant. Then any -pass deterministic algorithm that outputs a approximation of requires space.
Proof.
Let . If we repeat every symbol in times, we get a string . Then . When is a constant, the lower bound follows from the lower bound for in Lemma 5.4. ∎
Lemma 5.9.
Let . Assume and such that . Let . Then any -pass deterministic algorithm that outputs a approximation of requires space.
Proof.
In the proof of Lemma 5.6, we considered strings where each symbol appears at most times, where . We have . Thus, . The lower bound follows from Lemma 5.6.
∎
Lemma 5.10.
Assume , , and . Let . Then any -pass deterministic algorithm that outputs an approximation of requires space.
Proof.
The lower bound is achieved using the same construction as in the proof of Theorem 15, with some modifications. In Section 4.1.1, for any , we built a fooling set and (we used the notation instead of in Section 4.1.1) such that , where the function simply deletes the first 10 ’s in . We proved Lemma 4.1 and Claim 4.1. We modify the construction of and with three symbols . The modification is to replace every symbol in the strings in with symbols. This gives us a new set . Thus, the function now becomes: on input , first remove 10 ’s and then replace all symbols with .
Let . We can show that for every ,
Also, for any two distinct ,
The proof is the same as the proof of Lemma 4.1.
We now modify the construction in the proof of Theorem 15. Since , we choose such that . More specifically, for the matrix (see equation (11)), we make the following modification. For any , , we replace with the string . Here, are three symbols in such that . In the matrix (see equation (10)), for , we replace every symbols with and symbols with . For , we replace symbols with and symbols with .
We also replace the block in the -th row of both and with . Here, are three different symbols in such that .
is the concatenation of the rows of . We require that symbols appearing earlier in are smaller. Recall that is the concatenation of all symbols appearing in , with each symbol in repeated times. After the symbol replacement, we have . Also notice that the alphabet size is now instead of as in the proof of Theorem 15. The space lower bound then follows from an analysis similar to that in the proof of Theorem 15.
∎
We now prove a trivial space upper bound for .
Lemma 5.11.
Given an alphabet with , for any and , there is a one-pass streaming algorithm that computes a approximation of for any with space.
Proof.
We assume and the input is a string . Let be a longest non-decreasing subsequence of with threshold , and we can write , where is the number of times symbol is repeated and .
We let be the set . Thus, if , . Consider the set such that . Thus, . For convenience, we let . We initialize the set to be empty. For each , run in parallel, we check whether is a subsequence of . If is a subsequence of , we add to .
Meanwhile, we also compute exactly with an additional bits of space. If , we output . Otherwise, we output .
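A one-pass subsequence test underlies the parallel checks above: each candidate only needs a pointer into itself while the stream goes by. A minimal sketch (the construction of the candidate set follows the description above and is omitted):

    def is_subsequence(z, stream):
        # advance a pointer into z whenever the next needed symbol arrives
        i = 0
        for c in stream:
            if i < len(z) and c == z[i]:
                i += 1
        return i == len(z)

Running one such pointer per candidate keeps the whole procedure to a single pass over the online string.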
We now show the output is a -approximation of . Let be the largest element in that is no larger than for and .
If , we have and
since we assume . Note that is a subsequence of and thus also a subsequence of . Thus, we add to the set . That means the final output will be at least . Denoting the final output by , we have
On the other hand, if the output is for some , we have found a subsequence of and thus . We know
If , no symbol in is repeated more than times. Thus, . Thus, we output . Notice that if , either some symbol in the longest non-decreasing subsequence is repeated more than times, or . In either case, we have and is a approximation of .
∎
6 Algorithms for Edit Distance and LCS
6.1 Algorithm for edit distance
In this section, we present our space-efficient algorithms for approximating edit distance in the asymmetric streaming setting.
We can compute edit distance exactly with dynamic programming using the following recurrence equation. We initialize and for . Then for ,
(25) |
where is the edit distance between and . The three options in the case each correspond to one of the edit operations (substitution, insertion, and deletion). Thus, we can run a dynamic program to compute the matrix . When the edit distance is bounded by , we only need to compute the diagonal stripe of the matrix with width . Thus, we have the following.
Lemma 6.1.
Assume we are given streaming access to the online string and random access to the offline string . We can check whether or not, with bits of space in time. If , we can compute exactly with bits of space in time.
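A minimal sketch of the banded dynamic program behind Lemma 6.1 is below; this is the classical technique, and the exact bookkeeping in our pseudocode may differ. Only a stripe of width O(k) around the main diagonal is kept, rather than the full table.

    def bounded_edit_distance(x, y, k):
        # returns ED(x, y) if it is at most k, and None otherwise
        INF = k + 1
        m, n = len(x), len(y)
        if abs(m - n) > k:
            return None
        prev = {j: j for j in range(min(n, k) + 1)}  # row i = 0 of the DP table
        for i in range(1, m + 1):
            cur = {}
            for j in range(max(0, i - k), min(n, i + k) + 1):
                if j == 0:
                    best = i
                else:
                    best = min(cur.get(j - 1, INF) + 1,                        # insertion
                               prev.get(j - 1, INF) + (x[i - 1] != y[j - 1]))  # substitution
                best = min(best, prev.get(j, INF) + 1)                         # deletion
                cur[j] = best
            prev = cur
        d = prev.get(n, INF)
        return d if d <= k else None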
Claim 6.1.
For two strings , assume ; then we can output the edit operations that transform to with bits of space in time.
Proof.
The idea is to run the dynamic program times. We initialize . If , we find the largest integer such that , set and , and continue. If , we compute the elements , , and . By (25), we know there is some such that . We set and , output the corresponding edit operation, and continue. We need to run the dynamic program times. The running time is . ∎
Lemma 6.2.
For two strings , assume . For any integer , there is a way to divide and each into parts so that and (we allow or to be empty for some ), such that and for all .
Proof of Lemma 6.2.
Since , we can find edit operations on that transform into . We can order these edit operations by where they occur in . Then, we first find the largest and such that the first edit operations transform to . Notice that (or ) is 0 if the first edit operations insert before (or delete the first symbols in ). We can set and and continue doing this until we have seen all edit operations. This will divide and each into at most parts. ∎
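The partition in Lemma 6.2 can be made concrete on small inputs: extract an optimal edit script by backtracking through the full dynamic-programming table, then cut both strings after every group of roughly k/b operations. The sketch below is for illustration only (it uses the full O(n^2) table, which the streaming algorithm cannot afford); edit_ops returns the (i, j) cut points of the operations in left-to-right order.

    def edit_ops(x, y):
        # full DP table with backtracking; returns the (i, j) position of each operation
        m, n = len(x), len(y)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0 or j == 0:
                    d[i][j] = i + j
                else:
                    d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                                  d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
        cuts, i, j = [], m, n
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and x[i - 1] == y[j - 1]:
                i, j = i - 1, j - 1                        # match, no operation
            elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
                i, j = i - 1, j - 1; cuts.append((i, j))   # substitution
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                i -= 1; cuts.append((i, j))                # deletion
            else:
                j -= 1; cuts.append((i, j))                # insertion
        return list(reversed(cuts))

    def partition(x, y, b):
        # cut x and y after every ceil(k / b) operations, as in Lemma 6.2
        cuts = edit_ops(x, y)
        if not cuts:
            return [(x, y)]
        step = -(-len(cuts) // b)  # ceil
        bounds = [(0, 0)] + cuts[step - 1::step] + [(len(x), len(y))]
        return [(x[i0:i1], y[j0:j1]) for (i0, j0), (i1, j1) in zip(bounds, bounds[1:])]

This yields pieces with roughly k/b operations each, matching the statement of the lemma up to boundary conventions.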
In our construction, we utilize the following result from [CJLZ20].
Lemma 6.3.
[Theorem 1.1 of [CJLZ20]] For two strings , there is a deterministic algorithm that computes a -approximation of with bits of space in polynomial time.
We now present our algorithm that gives a constant-factor approximation of edit distance with improved space complexity compared to the algorithms in [FHRS20, CJLZ20]. The main ingredient of our algorithm is a recursive procedure called . The pseudocode is given in Algorithm 2. It takes four inputs: an online string , an offline string , an upper bound on the edit distance, and an upper bound on the space available. The output of is a triple: a pair of indices and two integers and . Throughout the analysis, we assume is a small constant of our choice.
We have the following Lemma.
Lemma 6.4.
Let be the output of on input . Assume is the largest integer such that there is a substring of , say , with . Then we have . Also, assume is the substring of that is closest to in edit distance. We have
(26) |
Here, if , and if .
Proof of Lemma 6.4.
In the following, we let be the output of with input . We let be the largest integer such that there is a substring of , say , with and is the substring of that minimizes the edit distance to .
first compares and . If , we first set ; then we know for any such that , since we can transform to with substitutions. Thus, we are guaranteed to find such a pair, and we can record the edit operations that transform to . We set , and continue with . When (), we may not be able to store in memory for random access, since our algorithm uses at most bits of space. However, we have remembered a pair of indices and at most edit operations that can transform to . This allows us to query each bit of from left to right once with time. Thus, for each substring of y, we can compute its edit distance from . Once we find such a substring with , by Claim 6.1, we can then compute the edit operations that transform to with space. Thus, if , we can find the largest integer such that there is a substring of , denoted by , with , with bits of space. If there is no substring with , we terminate the procedure and return the current , , and .
If , then since is a substring of , we can find a substring of whose edit distance to is at most . Thus, will not terminate at , and we must have .
Also notice that when , we always do exact computation. is the substring in that minimizes the edit distance to . Thus, Lemma 6.4 is correct when .
For the case , needs to call itself recursively. Notice that each time the algorithm calls itself and enters the next level, the upper bound on the edit distance is reduced by a factor of . The recursion ends when . Thus, we denote the depth of the recursion by , where . We assume the algorithm starts from level 1, and at level for , the upper bound on the edit distance becomes .
We prove the Lemma by induction on the depth of recursion. The base case of the induction is when , for which we have shown the correctness of Lemma 6.4. We now assume Lemma 6.4 holds when the input is for any strings .
We first show . Notice that the while loop at line 2 terminates when either or . If , we know is set to . Since by definition, we must have .
Suppose the while loop terminates when . By the definition of , we know and can be divided into blocks, say
where , such that
For convenience, we denote and . By the definition, we know and , and all are non-negative. We have .
We show that for all by induction on . For the base case , let be the largest integer such that there is a substring of within edit distance of . We know . Since we assume Lemma 6.4 holds for inputs , we know . Thus, and . Now, assuming holds for some , we show that . If , then holds, so we can assume . We show that . To see this, let be the largest integer such that there is a substring of within edit distance of . We know , since we assume Lemma 6.4 holds for inputs . Notice that , so is at least , since is a substring of . We know . Thus, . Thus, we must have .
We now prove inequality (26). After the while loop, the algorithm finds a substring of , , that minimizes , where is a approximation of . For convenience, we denote
Thus,
(27) |
Let be the substring of that is closest to in edit distance. By the inductive hypothesis, we assume the output of satisfies Lemma 6.4. We know
(28)
By the optimality of and the triangle inequality, we have
By the triangle inequality
Also notice that we can write such that
(29) |
We set . Since we assume , we know . This proves inequality (26).
∎
Lemma 6.5.
Given any , let . Then runs in polynomial time. It queries from left to right in one pass and uses bits of extra space.
Proof of Lemma 6.5.
If , we need to store at most edit operations that transform to for the current and . This takes bits of space. Notice that when , we do not need to remember ; instead, we query by looking at and the edit operations.
If , the algorithm is recursive. Let us consider the recursion tree. We assume the algorithm starts from level 1 (the root of the tree) and the depth of the recursion tree is . At level for , the upper bound on the edit distance (the third input to algorithm ) is . The recursion stops when . Thus, the depth of the recursion is by our assumption on and . The order of computation on the recursion tree is the same as a depth-first search, and we only need to query at the bottom level. There are at most leaf nodes (nodes at the bottom level). For the -th leaf node of the recursion tree, we compute with , where is the last position visited by the previous leaf node. Thus, we only need to access from left to right in one pass.
For each inner node, we need to remember pairs of indices and integers for , which takes space. For computing an -approximation of , we can use the space-efficient approximation algorithm from [CJLZ20] that uses only space. Thus, each inner node takes bits of space. For the leaf nodes, we have . Thus, we can compute it with bits of extra memory. Since the order of computation is the same as depth-first search, we only need to maintain one node in each recursive level and we can reuse the space for those nodes we have already explored. Since the depth of recursion is , the total space required is .
For the time complexity, notice that the space-efficient algorithm for approximating takes polynomial time. The depth of the recursion is , at each recursive level the number of nodes is polynomial in , and we need to try different at each node except the leaf nodes. Thus, the running time is still polynomial.
∎
We now present our algorithm for approximating edit distance in the asymmetric streaming model. The pseudocode of our algorithm is given in Algorithm 3. It takes three inputs: an online string , an offline string , and a parameter .
Lemma 6.6.
Assume . Then Algorithm 3 can be run with bits of space in polynomial time.
Proof.
Notice that is initially set to be a constant such that is an integer, and is multiplied by whenever . We assume that in total, is updated times, and after the -th update, , where . We let . Thus, and . We denote the value of before the -th while loop by , so that and for .
We first show the following claim.
Claim 6.2.
Proof.
Assume the contrary; then we have , and thus .
Let . That is, after is updated to , is updated from to . For convenience, we denote . Since , there is a substring of , say , such that . Let . By Lemma 6.2, we can partition and each into parts such that
and for . We denote for , so that starts with the -th symbol and ends with the -th symbol of . We have and .
We first show that for , . Assume there is some such that and . By Lemma 6.4, is at least the largest integer such that there is a substring of within edit distance of . Since is a substring of and , there is a substring of within edit distance of . Thus, we must have , and hence . This contradicts our assumption that .
We now show that . If , we have . By Lemma 6.4, we must have . Since , we will terminate the while loop and set . Thus, we have .
Meanwhile, by the assumption that , we have . If is updated times, is at least . Thus, . This contradicts . ∎
By the above claim, we have . Thus, we can remember all for with bits of space. For convenience, let . We can use the space efficient algorithm from [CJLZ20] (Lemma 6.3) to compute a -approximation of with bits of space in polynomial time. By Lemma 6.5, we can run with space since . The total amount of space used is .
We run times and compute a -approximation of in polynomial time. The total running time is still polynomial. ∎
Lemma 6.7.
Assume . There is a one-pass deterministic algorithm that outputs a -approximation of in the asymmetric streaming model, with bits of space in polynomial time.
Proof of Lemma 6.7.
Notice that Algorithm 3 executes times and records the outputs for . We also denote with . Thus, we can partition into parts such that
where . Since , by Lemma 6.4, we know .
We now show that the output is a approximation of . Let . Notice that and . We have
By the triangle inequality
On the other hand, we can divide into parts such that and guarantee that
(30) |
Also, since by Lemma 6.4 is the substring of that is closest to in edit distance, we know
(31) |
We have
By (30)
By (31)
Since we can pick to be any constant, we pick and the output is indeed a approximation of . ∎
Lemma 6.8.
Assume . For any constant , there is an algorithm that outputs a -approximation of with bits of space in polynomial time in the asymmetric streaming model.
Proof of Lemma 6.8.
We run Algorithm 3 with parameter . Without loss of generality, we can assume is an integer. The time and space complexity follow from Lemma 6.6.
Again we can divide into parts such that where . By Lemma 6.4, we have .
Let be a partition of such that
We now show that is a approximation of . Similarly, we let . We have
By the triangle inequality
Let be the substring of that is closest to in edit distance. By Lemma 6.4, we know
This finishes the proof. ∎
6.2 Algorithm for LCS
We show that the algorithm presented in [RS20] for approximating to a factor can be slightly modified to work in the asymmetric streaming model. The following is a restatement of Theorem 6.
Theorem 17 (Restatement of Theorem 6).
Given two strings , for any constant , there exists a constant , such that there is a one-pass deterministic algorithm that outputs a approximation of with bits of space in the asymmetric streaming model in polynomial time.
The algorithm from [RS20] uses four algorithms as ingredients , , , . We give a description here and show why they can be modified to work in the asymmetric streaming model.
Algorithm takes three inputs: two strings and a symbol . Let be the number of occurrences of symbol in string , and similarly define . computes the length of the longest common subsequence between and that consists of only symbols. Thus, .
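This first ingredient only needs symbol counts, so it is trivially streamable with one counter per string; a sketch (the function name is ours):

    def lcs_single_symbol(x, y, c):
        # LCS restricted to a single symbol c is the smaller of the two counts
        return min(x.count(c), y.count(c))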
Algorithm takes two strings as inputs. It is defined as .
Algorithm takes three strings as inputs. It finds the optimal partition that maximizes . The output is and .
Algorithm takes two strings as inputs. Here we require the input strings have equal length. first computes a constant approximation of , denoted by . The output is .
In the following, we assume and that the input strings to the main algorithm both have length . All functions are normalized with respect to . Thus, we have , , , .
The algorithm first reduces the input to the perfectly unbalanced case.
We first introduce a few parameters.
.
is a constant. It can be smaller than by an arbitrary constant factor.
is a parameter that depends on the accuracy of the approximation algorithm for that we assume.
Definition 6.1 (Perfectly unbalanced).
We say two strings are perfectly unbalanced if
(34) |
and
(35) |
Here, we require to be a sufficiently small constant such that and .
To see why we only need to consider the perfectly unbalanced case, [RS20] proved the following two lemmas.
Lemma 6.9.
If , then
Lemma 6.10.
Let be sufficiently small constants. If , then
If the two input strings are not perfectly unbalanced, we can compute and to get a approximation of for some small constant .
Given two strings in the perfectly unbalanced case, without loss of generality, we assume . The algorithm first divides both strings into three parts such that and , where . Then, the inputs are divided into six cases according to the first-order statistics (the numbers of 0’s and 1’s) of , , , . For each case, we can use the four ingredient algorithms to get a approximation of for some small constant . We refer readers to [RS20] for the pseudocode and analysis of the algorithm. We omit the details here.
We now prove Theorem 17.
Proof of Theorem 17.
As usual, we assume is the online string and is the offline string. If the two input strings are not perfectly unbalanced, we can compute in space in the asymmetric streaming model, since we only need to compare the first-order statistics of and . Also, for any constant we can compute a constant approximation (dependent on ) of using space by Lemma 6.8. Thus, we only need to consider the case where and are perfectly unbalanced.
Notice that the algorithm from [RS20] needs to compute , , , with input strings chosen from , , , , and , , , .
If we know the number of 1’s and 0’s in , and , then we can compute and with any pair of input strings from , , , , and , , , , .
For , according to the algorithm in [RS20], we only need to compute , , and . For any constant , we can get a constant approximation (dependent on ) of edit distance with space in the asymmetric streaming model by Lemma 6.8.
For , there are two cases. In the first case, the online string is divided into two parts and the input strings are , where . Notice that can only be or . In this case, we only need to remember , , and . Since the lengths of , , and are all fixed, we know , and similarly for and . We know for
Given , we know , , and . Thus we only need to read from left to right once and remember the index that maximizes .
For the case when the input strings are , if we know , similarly, we can compute by reading from left to right once with bits of space. Here, is not known to us before computation. However, in the perfectly unbalanced case, we assume is a sufficiently small constant. We can simply assume and run in the asymmetric streaming model. This will add an error of at most . The algorithm still outputs a approximation of for some small constant .
∎
References
- [ABW15] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for lcs and other sequence similarity measures. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on. IEEE, 2015.
- [AJP10] Alexandr Andoni, T.S. Jayram, and Mihai Patrascu. Lower bounds for edit distance and product metrics via poincare type inequalities. In Proceedings of the twenty first annual ACM-SIAM symposium on Discrete algorithms, pages 184–192, 2010.
- [AK10] Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SIAM Journal on Discrete Mathematics, 39(6):2398–2429, 2010.
- [AKO10] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In Foundations of Computer Science (FOCS), 2010 IEEE 51st Annual Symposium on. IEEE, 2010.
- [AN20] Alexandr Andoni and Negev Shekel Nosatzki. Edit distance in near-linear time: it’s a constant factor. In Proceedings of the 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020.
- [BI15] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless seth is false). In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC). IEEE, 2015.
- [BR20] Joshua Brakensiek and Aviad Rubinstein. Constant-factor approximation of near-linear edit distance in near-linear time. In Proceedings of the 52nd annual ACM symposium on Theory of computing (STOC), 2020.
- [BZ16] Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science, pages 51–60. IEEE, 2016.
- [CDG+19] Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucky, and Michael Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In Foundations of Computer Science (FOCS), 2018 IEEE 59th Annual Symposium on. IEEE, 2019.
- [CFH+21] Kuan Cheng, Alireza Farhadi, MohammadTaghi Hajiaghayi, Zhengzhong Jin, Xin Li, Aviad Rubinstein, Saeed Seddighin, and Yu Zheng. Streaming and small space approximation algorithms for edit distance and longest common subsequence. In International Colloquium on Automata, Languages, and Programming. Springer, 2021.
- [CGK16a] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Low distortion embedding from edit to hamming distance using coupling. In Proceedings of the 48th IEEE Annual Annual ACM SIGACT Symposium on Theory of Computing. ACM, 2016.
- [CGK16b] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for computing edit distance without exploiting suffix trees. arXiv preprint arXiv:1607.03718, 2016.
- [CJLZ20] Kuan Cheng, Zhengzhong Jin, Xin Li, and Yu Zheng. Space efficient deterministic approximation of string measures. arXiv preprint arXiv:2002.08498, 2020.
- [EJ08] Funda Ergun and Hossein Jowhari. On distance to monotonicity and longest increasing subsequence of a data stream. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 730–736, 2008.
- [FHRS20] Alireza Farhadi, MohammadTaghi Hajiaghayi, Aviad Rubinstein, and Saeed Seddighin. Streaming with oracle: New streaming algorithms for edit distance and lcs. arXiv preprint arXiv:2002.11342, 2020.
- [GG10] Anna Gál and Parikshit Gopalan. Lower bounds on streaming algorithms for approximating the length of the longest increasing subsequence. SIAM Journal on Computing, 39(8):3463–3479, 2010.
- [GJKK07] Parikshit Gopalan, TS Jayram, Robert Krauthgamer, and Ravi Kumar. Estimating the sortedness of a data stream. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 318–327. Society for Industrial and Applied Mathematics, 2007.
- [HSSS19] MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, and Xiaorui Sun. Approximating lcs in linear time: beating the barrier. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1181–1200. Society for Industrial and Applied Mathematics, 2019.
- [IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity. Journal of Computer and System Sciences, 63(4):512–530, 2001.
- [KN97] Eyal Kushilevitz and Noam Nisan. Communication Complexity. Cambridge Press, 1997.
- [KS92] Bala Kalyanasundaram and Georg Schintger. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics, 5(4):545–557, 1992.
- [KS20] Michal Kouckỳ and Michael E Saks. Constant factor approximations to edit distance on far input pairs in nearly linear time. In Proceedings of the 52nd annual ACM symposium on Theory of computing (STOC), 2020.
- [LNVZ05] David Liben-Nowell, Erik Vee, and An Zhu. Finding longest increasing and common subsequences in streaming data. In COCOON, 2005.
- [Raz90] Alexander A Razborov. On the distributional complexity of disjointness. In International Colloquium on Automata, Languages, and Programming, pages 249–253. Springer, 1990.
- [RS20] Aviad Rubinstein and Zhao Song. Reducing approximate longest common subsequence to approximate edit distance. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1591–1600. SIAM, 2020.
- [RSSS19] Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algorithms for lcs and lis with truly improved running times. In Foundations of Computer Science (FOCS), 2019 IEEE 60th Annual Symposium on. IEEE, 2019.
- [Sah17] Barna Saha. Fast & space-efficient approximations of language edit distance and RNA folding: An amnesic dynamic programming approach. In FOCS, 2017.
- [SS13] Michael Saks and C Seshadhri. Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1698–1709. SIAM, 2013.
- [SW07] Xiaoming Sun and David P Woodruff. The communication and streaming complexity of computing the longest common and increasing subsequences. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 336–345, 2007.
- [SZ99] Leonard J Schulman and David Zuckerman. Asymptotically good codes correcting insertions, deletions, and transpositions. IEEE transactions on information theory, 45(7):2552–2557, 1999.
Appendix A Lower Bound for ED in the Standard Streaming Model
Theorem 18.
There exists a constant such that for strings , any deterministic pass streaming algorithm achieving an additive approximation of needs space.
Proof.
Consider an asymptotically good insertion-deletion code over a binary alphabet (see [SZ99] for example). Assume has rate and distance . Both and are constants larger than 0, and we have . Also, for any with , we have . Let and consider the two-party communication problem where player 1 holds and player 2 holds . The goal is to decide whether . Any deterministic protocol has communication complexity at least . Note that any algorithm that approximates within an additive error can decide whether . Thus the theorem follows. ∎
We note that the same bound holds for Hamming distance by the same argument.