Approximate Trace Reconstruction
Abstract
In the usual trace reconstruction problem, the goal is to exactly reconstruct an unknown string of length after it passes through a deletion channel many times independently, producing a set of traces (i.e., random subsequences of the string). We consider the relaxed problem of approximate reconstruction. Here, the goal is to output a string that is close to the original one in edit distance while using far fewer traces than are needed for exact reconstruction. We present several algorithms that can approximately reconstruct strings that belong to certain classes, where the estimate is within edit distance, and where we only use traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and which are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. To complement our algorithms, we present a general black-box lower bound for approximate reconstruction, building on a lower bound for distinguishing between two candidate input strings in the worst case. In particular, this shows that approximating to within edit distance requires traces for in the worst case.
1 Introduction
In the trace reconstruction problem, we observe noisy samples of an unknown binary string after passing it through a deletion channel several times [BKKM04, Lev01]. For a parameter , the channel deletes each bit of the string with probability independently, resulting in a trace. The deletions for different traces are also independent. We only observe the concatenation of the surviving bits, without any information about the deleted bits or their locations.
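As a concrete illustration of the channel, the trace-generation process can be simulated in a few lines. This is a self-contained sketch; the example string, deletion probability, and seed below are arbitrary choices, not values from the paper.

```python
import random

def deletion_channel(s: str, q: float, rng: random.Random) -> str:
    """Pass s through a deletion channel: each bit is deleted
    independently with probability q, and the surviving bits
    are concatenated in order (positions are not observed)."""
    return "".join(b for b in s if rng.random() >= q)

rng = random.Random(0)
trace = deletion_channel("1101001110", q=0.5, rng=rng)
# Every trace is a (random) subsequence of the input string.
```

Independent traces are obtained simply by calling `deletion_channel` repeatedly with fresh randomness.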
How many samples (traces) from the deletion channel does it take to exactly recover the unknown string with high probability? Despite two decades of work, this question is still wide open. For a worst-case string, very recent work shows that traces suffice [Cha20b], building upon the previous best bound of [DOS19, NP17]; furthermore, traces are necessary [Cha20a, HL20]. Improved upper bounds are known in the average-case setting, where the unknown string is uniformly random [BKKM04, HMPW08, HPP18, MPV14, PZ17, VS08], in the coded setting, where the string is guaranteed to reside in a pre-defined set [BLS20, CGMR20, SYY20, SDDF18, SDDF20], and in the smoothed-analysis setting where the unknown string is perturbed before trace generation [CDL+21].
Given that exact reconstruction may be challenging, we relax the requirements and ask: when is it possible to approximately reconstruct an unknown string with much less information? More precisely, the algorithm should output a string that is close to the true string under some metric. Since the channel involves deletions, we consider edit distance, defined as the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. Letting denote the length of the unknown string, we investigate the necessary and sufficient number of traces to approximate the string up to edit distance. We call this the -approximate reconstruction problem.
Trace reconstruction has received much recent attention because of DNA data storage, where reconstruction algorithms are used to recover the data [OAC+18, CGK12, BPRS20, GBC+13, YGM17, LCA+19]. Biochemical advances have made it possible to store digital data using synthetic DNA molecules with higher information density than electromagnetic devices. During the data retrieval process, each DNA molecule is imperfectly replicated several times, leading to a set of noisy strings that contain insertion, substitution, and deletion errors. Error-correcting codes are used to deal with missing data, and hence, an approximate reconstruction algorithm would be practically useful. Decreasing the number of traces would reduce the time and cost of data retrieval.
For any deletion probability , a single trace achieves a -approximation in expectation. On the other hand, if , then it is not clear whether -approximate reconstruction requires asymptotically fewer traces than exact reconstruction. More generally, we propose the following goal: determine the smallest such that any string can be -approximately reconstructed with traces. Here is a parameter that may depend on and . Although we focus on an information-theoretic formulation (measuring the number of traces), the reconstruction algorithm should also be computationally efficient (polynomial time in and the number of traces).
A natural approach would be to transform exact reconstruction methods into more efficient approximation algorithms. Unfortunately, revising these algorithms to allow some incorrect bits seems nontrivial or perhaps impossible. For example, certain algorithms assume that the string has been perfectly recovered up to some point, and they use this to align the traces and determine the next bit [BKKM04, HMPW08, HPP18]. Another technique involves computing the bit-wise average of the traces and outputting the string that most closely matches the average. These mean-based statistics suffice to distinguish any pair of strings, but only if there are traces [DOS19, NP17]. Also, the maximum likelihood solution is poorly understood for the deletion channel, and current analyses are limited to a small number of traces [Mit09, SYY20, SDDF18, SDDF20].
Designing algorithms to find approximate solutions may in fact require fundamentally different methods than exact reconstruction. For a simple supporting example, consider the family of strings containing all ones except for a single zero that lies in some position between and , e.g., . Determining the exact position of the zero requires traces when the deletion probability is a constant [BKKM04, MPV14]. However, if the string comes from this family, we can output the all-ones vector and achieve an approximation to within Hamming distance one.
As a starting point, we consider classes of strings defined by run-length assumptions. For instance, if the 0-runs are sufficiently long and the 1-runs are either short or long, we can -approximately reconstruct the string using traces. We then strengthen our upper bound to work even when the string can be partitioned into regions that are either locally dense or sparse. Finally, we prove new lower bounds on the necessary trace complexity; for example, approximating arbitrary strings to within edit distance requires traces for any constant .
1.1 Related work
The trace reconstruction problem was introduced to the theoretical computer science community by Batu, Kannan, Khanna, and McGregor [BKKM04]. There is an exponential gap between the known upper and lower bounds for the number of traces needed to reconstruct an arbitrary string with constant deletion rate [Cha20a, DOS19, Cha20b, HL20, NP17]. The main open theoretical question is whether a polynomial number of traces suffice. There has also been experimental and theoretical work on maximum likelihood decoding, where approximation algorithms have been developed for average-case strings given a constant number of traces [SYY20, SDDF18, SDDF20]. Holden, Pemantle, and Peres show that traces suffice for reconstructing a random string, building on previous work [BKKM04, HMPW08, HPP18, PZ17, VS08]. Chase extended work by Holden and Lyons to show that traces are necessary for random strings [Cha20a, HL20].
A related question to ours is to understand the limitations of known techniques for distinguishing pairs of strings that are close in edit distance. Grigorescu, Sudan, and Zhu show that there exist pairs that cannot be distinguished with a small number of traces when using a mean-based algorithm [GSZ20]. They further identify strings that are separated in edit distance, yet can be exactly reconstructed with few traces. Their results are incomparable to ours because the sets of strings they consider are different. Instead of considering local density assumptions, they consider local agreement up to single-bit shifts. Their algorithm uses a subexponential number of traces only when the edit distance separation is at most .
Many other variants of trace reconstruction have been studied as stand-alone problems, united by the goal of broadening our understanding of reconstruction problems. Krishnamurthy, Mazumdar, McGregor, and Pal consider matrices (rows/columns deleted) and sparse strings [KMMP19]. Davies, Rácz, and Rashtchian consider labeled trees, where the additional structure of some trees leads to more efficient reconstruction [DRR19]. Circular trace reconstruction considers strings and traces up to circular rotations of the bits [NR21]. Population recovery reconstructs multiple unknown strings simultaneously [BCF+19, BCSS19, Nar21]. Going beyond i.i.d. deletions, algorithms have also been developed for position- or character-dependent error rates [HHP18], or for ancestral state reconstruction, where deletions are based on a Markov chain [ADHR12]. It should not go without mention that forms of approximate trace reconstruction have been studied in more applied frameworks; in particular, Srinivasavaradhan, Du, Diggavi, and Fragouli study heuristics for reconstructing approximately given one or two traces [SDDF18].
Comparison to Coded Trace Reconstruction.
Cheraghchi, Gabrys, Milenkovic, and Ribeiro explore coded trace reconstruction, where the unknown string is assumed to come from a code, and they show that codewords can be reconstructed with high probability using far fewer traces than average-case reconstruction [CGMR20] (see also [DM07, Lev01, Mit09]). Brakensiek, Li, and Spang extend this work and present codes with rate that can be reconstructed using traces [BLS20]. Improved coded reconstruction results are known when the number of errors in a trace is a fixed constant [AVDGiF19, CKY20, HM14, KNY20, SYY20].
An existing approach for coded trace reconstruction does use approximation as an intermediate step, where the original string can be recovered after error correction [BLS20]. Our focus is different, and our results are incomparable to those from coded trace reconstruction. We investigate classes of strings that are very different from codes (e.g., pairs of strings in our classes can be very close). We also consider strings that require traces to exactly reconstruct, whereas the work on coded trace reconstruction shows that their classes of strings can be exactly reconstructed with a sublinear number of traces. Overall, we do not aim to optimize the “rate” of our classes of strings. Instead, our main contribution is the effectiveness of new algorithmic techniques and local approximation methods, including novel alignment ideas and the use of runs in approximating edit distance.
Additionally, coded trace reconstruction lower bounds can be used as a black box to obtain lower bounds for approximate trace reconstruction by constructing a code that is an -net [BLS20]. However, these lower bounds reduce to results on average-case reconstruction, and hence, this approach currently leads to lower bounds for approximate reconstruction that are exponentially smaller than what we prove.
1.2 Our results
We assume that the deletion probability is a fixed constant and is the retention probability. In our statements, denote constants, and hides constants, where these may depend on . Unless stated otherwise, has base . The phrase with high probability means probability at least . A run in a string is a substring of consecutive bits of the same value, and we often refer specifically to -runs and -runs. We use bold to denote runs, or more generally substrings, and let denote its length (number of bits). Table 1 summarizes our results, and we restate the theorems in the relevant sections for the reader’s convenience.
Algorithms for approximate reconstruction
Our results exhibit the ability to approximately reconstruct strings based on various run-length or density assumptions. For these classes of strings, we develop new polynomial-time, alignment-based algorithms, and we show that traces suffice. We assume that the algorithms know , , , and the class that the unknown string comes from, though the last assumption is not necessary for Theorem 1. We also provide warm-up examples (see Proposition 8 and Proposition 9 in Section 2), which may be helpful to the reader before diving into the algorithms in Section 3.
Our first theorem only requires -runs to be long, while the length of the -runs is more flexible; they can be either long or short, assuming there is a gap.
Theorem 1.
Let be a string on bits such that all of its -runs have length at least and none of its -runs have length between and . There exists some constant such that if , then can be -approximately reconstructed with traces.
| Classes of strings | # samples for -approx. | Reference |
| --- | --- | --- |
| All runs have length | | Proposition 8 & Corollary 14 |
| The -runs have length | 1 | Proposition 9 |
| Long -runs; either long or short -runs | | Theorem 1 & Theorem 2 |
| Intervals of length , density | 1 | Theorem 3 |
| Arbitrary strings, edit distance , | | Theorem 4 & Corollary 5 |
The following theorem extends Theorem 1 to a wider class of strings by allowing many of the bits in the runs to be arbitrarily flipped to the opposite value.
Theorem 2.
Suppose that . Let be a string on bits such that all of its -runs have length at least and none of its -runs have length between and . Suppose that is formed from by modifying at most arbitrary bits in each run of . If , then can be -approximately reconstructed with traces.
For the final class, we consider a slightly different relaxation of having long runs. We impose a local density or sparsity constraint on contiguous intervals. If this holds, then a single trace suffices.
Theorem 3.
There exists some constant such that for , if can be divided into contiguous intervals with all having length at least and density at least of s or s, then can be -approximately reconstructed with a single trace in polynomial time.
The algorithm for Theorem 3 extends to handle independent insertions at a rate of , since the proof relies on finding high density regions, which are unchanged by such insertions.
We provide some justification for the strings considered in the above theorems. Strings that either contain long runs or that are locally dense are a natural class to examine in order to understand the advantage gained by approximate reconstruction over exact reconstruction. Strings with sufficiently long runs require traces to reconstruct exactly, as exact reconstruction for this set involves distinguishing between our prior example strings and , but they can be approximately reconstructed with substantially fewer traces for large enough values of . We then relax the condition that strings have long runs to the condition that strings are locally dense. Both strings with long runs and strings that are locally dense also look very different from average-case strings (i.e., uniformly random), which have runs with length at most with high probability and can be exactly reconstructed with traces [HPP18].
Lower bounds for approximate reconstruction
We prove lower bounds on the number of traces required for -approximate reconstruction. We present two results, for edit distance and Hamming distance, respectively. The more challenging result is Theorem 4, which shows that any algorithm that reconstructs a length arbitrary string within edit distance requires traces, where denotes the minimum number of traces for distinguishing a pair of -bit strings. Currently, is the best known lower bound for exact reconstruction, which argues via a pair of strings that are hard to distinguish [Cha20a].
Theorem 4.
Suppose that traces are required to distinguish between two length strings and with probability at least , where . Then there exist absolute constants such that for , any algorithm that -approximately reconstructs arbitrary length strings with probability must use at least traces.
Plugging in the bound on , our theorem shows that traces are required for -approximate reconstruction. For example, if , then we obtain the following.
Corollary 5.
For any constant , we have that traces are necessary to -approximately reconstruct an arbitrary -bit string with probability .
Theorem 4 also allows for to be as small as , implying that a very close approximation is not possible with substantially fewer traces than exact reconstruction.
Our lower bound for Hamming distance in Theorem 6 is simpler. It shows that traces are necessary to achieve an approximation closer than in Hamming distance to the actual string. In particular, we get a linear lower bound for a linear Hamming distance approximation, which is much stronger than our result for edit distance.
Theorem 6.
Any algorithm that can output an approximation within Hamming distance of an arbitrary length string with probability at least must use traces.
1.3 Technical overview
The high-level strategy for all of our algorithms is the following. First, we identify the remnants of structured substrings, that is, long runs and dense substrings, from the original string in the traces. Then, when given more than one trace, we can use these substrings to align traces. After aligning traces, we capitalize on the approximate nature of our objective by estimating lengths of runs which are close in edit distance to substrings of the original string.
The gap condition for -runs in Theorem 1 states that the unknown string only contains -runs with length either less than or greater than , for large enough (and nothing in the middle). We show that there exist values , with , such that with high probability there does not exist a -run of length at least in the original string that has been reduced to a -run of length less than in a trace, nor a -run of length less than reduced to a -run of length more than . This implies that we can always distinguish between short and long runs of s in all of the traces (which would be challenging without the gap condition). We can align long runs of s from the traces and take a scaled average of the lengths of the th long run of s across all traces. By using a scaled average across traces, we can estimate the number of bits between consecutive long runs of s. Then, our algorithm outputs a run of s here, which accounts for long runs of s and short runs of s. Note that this piece of the algorithm is inherently approximate since we replace short runs of s with s. This completes what we call our algorithm for identifying long runs.
The algorithm for Theorem 2 is similar to that of Theorem 1. We identify long -runs from in each of the traces and align by these -runs, then approximate the rest using s. However, the alignment step is more difficult since the long -runs from may not be -runs in and so are not easily found in the traces. Instead, we identify long 0-dense substrings in each trace that with high probability originate from long -runs in . We refer to this as the algorithm for identifying dense substrings. Then we align and average as in Theorem 1 to approximate the unknown string.
Our algorithm in Theorem 3 takes a uniform partition of a single trace, where each part has length , and it outputs a series of runs, where each run has length and bit value equal to the majority bit of its part. Note that the parts have length at most an fraction of the high-density intervals. Therefore, in any high-density interval of the original string, most of the parts of the trace originating from that interval will also have a high density of the same bit. We refer to the method for this result as the algorithm for majority voting in substrings.
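This majority-voting step can be sketched as follows. Here `block_len` stands in for the elided part length and `scale` for the rescaling factor 1/p accounting for deletions; both are hypothetical parameters made explicit for illustration.

```python
def majority_vote_reconstruct(trace: str, block_len: int, scale: float) -> str:
    """Cut the trace into consecutive blocks of length block_len and
    replace each block by a run of its majority bit, with the run
    length rescaled (scale ~ 1/(1 - q)) to account for deletions."""
    out = []
    for i in range(0, len(trace), block_len):
        block = trace[i:i + block_len]
        majority = "1" if 2 * block.count("1") >= len(block) else "0"
        out.append(majority * round(len(block) * scale))
    return "".join(out)
```

Note that a single trace suffices: no alignment across traces is performed, only local majority votes.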
The algorithms and analyses for these three theorems are in Section 3.
Lower Bounds.
For the edit distance approximation in Theorem 4, we start with two strings of length that require traces to distinguish for some constant and . We then construct a hard distribution over length strings by concatenating substrings, where each substring is an independent random choice between the two strings. Our strategy is to show that if the algorithm outputs an approximation within edit distance, then it must correctly determine a large number of the component strings. However, proving this requires some work because the guarantee of the reconstruction algorithm is in terms of an edit distance approximation. To handle this challenge, we provide a technical lemma that relates the edit distance of any pair of strings to a sum of binary indicator vectors for the equivalence of certain substrings (Lemma 13). Then, we use this lemma to argue that the algorithm’s output must be far from the true string if the number of traces is less than because many substrings must disagree.
For the Hamming distance lower bound in Theorem 6, we use a more straightforward argument. We start with a known lower bound from Batu, Kannan, Khanna, and McGregor [BKKM04]. They observe that traces are necessary to determine if a string starts with or zeros. We then construct a hard pair of strings of length roughly such that if the algorithm misjudges the prefix length, then it must incur a cost of at least in Hamming distance. Since , we obtain the desired lower bound.
The proofs for both lower bounds appear in Section 4.
1.4 Preliminaries
Let denote the edit distance between and , defined as the minimum number of insertions, deletions, and substitutions that are required to transform into . Note that edit distance is a metric. For each class of strings that we consider, we present an algorithm and argue that it can -approximately reconstruct any string from the class. Our algorithms output a string , an approximation of , satisfying with high probability.
We denote a single run by and a set of runs by . Our convention is to let denote the unknown string that we wish to reconstruct, and will often be a modified version. A single trace will be denoted by and a set of traces by . Tildes will also be used to mark runs and intervals of traces. Some strings we partition into substrings ; their concatenation to form is denoted as .
Some of our algorithms reconstruct by partitioning it into substrings and reconstructing these substrings approximately. Specifically, we will find strings such that the edit distance between and is at most , and then we will invoke the following lemma to see that and have edit distance at most .
Lemma 7.
Let and be strings on bits. If the edit distance between and is at most for all , then .
Proof.
We will use the fact that edit distance satisfies the triangle inequality. Consider bit strings and . Then,
This extends to and by recursively applying the above inequality. ∎
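For reference, the edit distance used throughout can be computed with the standard dynamic program; none of the code below is specific to the paper, and the final assertion simply checks the subadditivity property of Lemma 7 on small, arbitrary example strings.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions transforming x into y."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute a -> b
        prev = cur
    return prev[-1]

# Lemma 7 in action: the distance between concatenations is at most
# the sum of the blockwise distances.
a1, a2, b1, b2 = "0011", "1100", "0111", "1000"
assert edit_distance(a1 + a2, b1 + b2) <= edit_distance(a1, b1) + edit_distance(a2, b2)
```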
2 Warm-up: Approximating strings that only have long runs
We begin with two simple cases that demonstrate some of our algorithmic techniques. For this section, we defer proofs to Appendix A. We note that other methods may lead to similar or slightly better results in some regimes, but we follow this presentation as a prelude to Section 3.
The first algorithm -approximately reconstructs a string with long runs using traces by scaling an average of the run lengths across all traces.
Proposition 8.
Let be a string on bits such that all of its runs have length at least . Then can be -approximately reconstructed with traces.
Algorithm
-
Set-up:
String on bits such that all of its runs have length at least .
-
1.
Sample traces, , from the deletion channel with deletion probability . Fail if the traces do not all have the same number of runs. Otherwise, let denote the number of runs in each trace.
-
2.
Compute for all , where are the runs of .
-
3.
Output , where has length and bit value matching run of the traces.
The analysis is a basic use of Chernoff bounds; see Appendix A for details.
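The steps above can be sketched in code. This is a simplified illustration, not the paper's exact procedure: the failure case returns `None`, the retention probability is written as p = 1 - q, and the example trace values in the test are arbitrary.

```python
def runs(s: str):
    """Split s into maximal runs, returned as [bit, length] pairs."""
    out = []
    for b in s:
        if out and out[-1][0] == b:
            out[-1][1] += 1
        else:
            out.append([b, 1])
    return out

def reconstruct_from_runs(traces, q):
    """Run-averaging sketch: if every trace has the same number of
    runs, average the j-th run length over traces and rescale by
    the retention probability p = 1 - q; otherwise fail."""
    p = 1.0 - q
    per_trace = [runs(t) for t in traces]
    k = len(per_trace[0])
    if any(len(r) != k for r in per_trace):
        return None  # alignment failed: run counts disagree
    out = []
    for j in range(k):
        bit = per_trace[0][j][0]
        avg = sum(r[j][1] for r in per_trace) / len(traces)
        out.append(bit * round(avg / p))
    return "".join(out)
```

Since each surviving run length is a binomial with mean p times the true length, dividing the empirical average by p is an unbiased estimate, and Chernoff bounds control the rounding error.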
Ideally we would only require that -runs have length , without restricting the length of -runs. The following result shows that if we require the -runs to be , which is an order of larger than in Theorem 1, then approximate reconstruction is possible using one trace.
Proposition 9.
Let be a string on bits such that all of its -runs have length at least . Then there exists a constant such that for , can be -approximately reconstructed with a single trace.
Algorithm
-
Set-up:
String on bits such that all its -runs have length at least .
-
1.
Sample 1 trace from the deletion channel with deletion probability .
-
2.
Let ; ,…, be -runs in with length at least ; and , for , be the bits in before , between and , and after , respectively.
-
3.
Output , where is a -run, length , and is a -run, length .
The algorithm for Proposition 9 no longer attempts to align multiple traces. Step three is approximate by design because we use -runs to fill in the gaps between the long -runs. The error is from the variance of how many bits of each run are deleted by the deletion channel. See Appendix A for the proof.
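A sketch of this single-trace procedure follows, under the assumption (consistent with the later sections) that the long runs are 0-runs and the gaps are filled with 1-runs; `threshold` is a hypothetical stand-in for the elided length cutoff in step 2, and p = 1 - q is the retention probability.

```python
def single_trace_reconstruct(trace: str, q: float, threshold: int) -> str:
    """Keep each 0-run of length >= threshold (rescaled by 1/p);
    replace everything between consecutive long 0-runs by a single
    1-run whose length is the rescaled number of intervening bits."""
    p = 1.0 - q
    out = []
    gap = 0  # bits seen since the last long 0-run
    i = 0
    while i < len(trace):
        j = i
        while j < len(trace) and trace[j] == trace[i]:
            j += 1  # advance to the end of the current run
        length = j - i
        if trace[i] == "0" and length >= threshold:
            out.append("1" * round(gap / p))
            out.append("0" * round(length / p))
            gap = 0
        else:
            gap += length
        i = j
    out.append("1" * round(gap / p))
    return "".join(out)
```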
3 Approximating more general classes of strings
Moving on from our warm-ups, we reconstruct larger classes of strings. Our first two algorithms in this section reconstruct strings that still contain some long runs, where these help us align traces. Our third algorithm reconstructs from a single trace by approximately preserving local density.
3.1 Identifying long runs
To weaken the assumptions of Proposition 8, we want to consider strings where -runs can be any length but -runs must still be long and have length . When relaxing the length restriction on the -runs, the alignment step, step 1, of the algorithm for Proposition 8 begins to fail—entire runs of s may be deleted, combining consecutive -runs and making it difficult to identify which runs align together between traces. To still use an alignment algorithm that averages run lengths, we impose the weaker condition on the -runs that they must be divided into short -runs and long -runs. As long as there is a gap of sufficiently large size such that there are no -runs with length in the gap, then in the traces we can identify which -runs are long and which are short.
See 1
Algorithm for identifying long runs
-
Set-up:
String on bits such that all of its -runs have length at least , where , and all of its -runs have length either greater than or less than .
-
1.
Sample traces, , from the deletion channel with probability .
-
2.
Define , and for all , index the -runs in with length at least as . For , let be the bits between and in and let be the bits before and the bits after for all .
-
3.
If there exist such that , then fail without output. Otherwise, let .
-
4.
Compute for all and for all .
-
5.
Output , where is a -run, length , and is a -run, length .
Observe that the algorithm is inherently approximate, as we fill in the gaps between the long 0-runs with -runs, omitting any short -runs.
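The algorithm above can be sketched as follows, again under the assumption that the long runs are 0-runs; `threshold` is a hypothetical stand-in for the elided cutoff that separates long from short runs in the traces, and the failure case is simplified to returning `None`.

```python
def identify_long_runs(traces, q, threshold):
    """For each trace, record the lengths of its 0-runs of length
    >= threshold and the number of bits in the gaps between them.
    If all traces agree on the number of long runs, output
    alternating 1-runs (for the gaps) and 0-runs (for the long
    runs) with lengths averaged over traces and rescaled by 1/p."""
    p = 1.0 - q
    parsed = []  # per trace: (long 0-run lengths, gap lengths)
    for t in traces:
        longs, gaps, gap, i = [], [], 0, 0
        while i < len(t):
            j = i
            while j < len(t) and t[j] == t[i]:
                j += 1
            if t[i] == "0" and j - i >= threshold:
                gaps.append(gap)
                longs.append(j - i)
                gap = 0
            else:
                gap += j - i
            i = j
        gaps.append(gap)
        parsed.append((longs, gaps))
    m = len(parsed[0][0])
    if any(len(longs) != m for longs, _ in parsed):
        return None  # traces disagree on the number of long runs
    T = len(traces)
    out = []
    for k in range(m + 1):
        out.append("1" * round(sum(g[k] for _, g in parsed) / (T * p)))
        if k < m:
            out.append("0" * round(sum(lg[k] for lg, _ in parsed) / (T * p)))
    return "".join(out)
```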
Analysis
Proof of Theorem 1.
Let be a string on bits such that all of its -runs have length at least , where , and all of its -runs have length either greater than or less than . Take traces of . By a Chernoff bound, with probability at least , no -run is fully deleted in any trace; in the following we assume that we are on this event.
We will justify that in the traces we can identify all -runs that had length at least in . Let be a -run from with length . Using a Chernoff bound, the probability that in a single trace is transformed into a run with is bounded by
Similarly, for any -run in such that , the probability that is reduced to a run with is bounded by
It follows that, with probability at least , there does not exist any -run and any trace such that either of the “unlikely” inequalities above holds. On this event, we have that for any -run of length at least , and any trace , we can identify the image of in trace . In particular, on this event, the number of -runs in each trace that have length at least is equal to the number of -runs in of length at least ; thus . The algorithm and proof now proceed very similarly to those of Proposition 9, except that, since we have more than a single trace, we estimate lengths of subsequences by scaling an average of the corresponding subsequences from the traces.
Let and find every -run in with length at least , indexing them as . For , let be the bits between the last bit of and the first bit of in and let be the bits before and the bits after . Let be the contiguous substring of from which came and the contiguous substring of from which came.
For all , we will approximate with a -run of length , for , and we will approximate with , a -run of length , for . Applying a Chernoff bound and then a union bound, and .
Since contains alternating -runs with length at least and -runs with length at most , has at least a density of s. Therefore and . Letting , we see from Lemma 7 that
If we apply this algorithm and analysis with instead of , the result follows. Constants were taken large enough to account for this factor of 2. ∎
Note that the above theorem holds when the constant is unknown. Given traces of , we can determine whether or not had such a gap, and the corresponding value, with high probability. We can then execute the algorithm as stated.
3.2 Identifying dense substrings
Here we extend the class of strings we can approximately reconstruct, proving a robust version of Theorem 1. Specifically, we consider strings with similar properties to those in Theorem 1, allowing for additional bit flips.
See 2
The general goal of the algorithm is similar to that of Theorem 1, which is to identify long -runs from in each trace of and to align by these -runs; then, we approximate the rest of with -runs. Because and have small edit distance, a good approximation for is also good for . Unfortunately, the long -runs from are no longer necessarily -runs in , and therefore they are more difficult to find in the traces. Instead, we find long -dense substrings in .
Let and be as in the theorem statement. We also fix throughout this subsection. Fix a trace of , as well as an index . Let denote the length of the trace. Define the indices and to be those that are s to the left and right of in , respectively, if such indices exist. We count the s in between indices and with the quantity
Note that is not defined if or are not defined. We use a slightly different quantity on the boundary of the trace to handle this. Letting the definition of and remain the same, if or is not defined, then we consider or , respectively. Combining the interior and boundary quantities, let if there are 1s to the left and right of , let if there are 1s to the right of but not the left, and let if there are 1s to the left of but not the right.
In each trace we identify a set of substrings of that are -dense, and then decide whether each such substring is long or short using ; that is, whether the corresponding unknown -runs in are long (length at least the upper bound of the gap) or short (length at most the lower bound of the gap). If the traces all agree on the number of long -dense substrings, we align the traces by these substrings and reconstruct in a manner similar to that of Theorem 1.
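The scanning portion of this step can be sketched as follows. Both `window` and `density` are hypothetical stand-ins for the elided quantities, and the disjoint-window scan is a simplification of the actual statistic used by the algorithm.

```python
def find_dense_substrings(trace: str, window: int, density: float):
    """Scan left to right for disjoint windows of length `window`
    in which the fraction of 0s is at least `density`; return the
    (start, end) index pairs of the windows found."""
    found, i = [], 0
    while i + window <= len(trace):
        w = trace[i:i + window]
        if w.count("0") >= density * window:
            found.append((i, i + window))
            i += window  # skip past this dense window
        else:
            i += 1
    return found
```

Classifying each dense substring as long or short (against the gap condition), and then aligning and averaging across traces, proceeds as in Theorem 1.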
Algorithm for identifying dense substrings
-
Set-up:
String on bits formed by flipping at most bits in each run of , where is a string on bits such that all of its -runs have length at least , for , and all of its -runs have length either greater than or less than .
-
1.
Sample traces, , from the deletion channel with deletion probability .
-
2.
Set and . For each trace , let be the smallest index of such that and . Let be the smallest index such that and . Compute . Starting bits to the right of the last bit counted in , continue scanning to the right and repeat this process, finding indices and computing , for .
-
3.
Set . For every trace , let . If is not the same across all traces, the algorithm fails. Otherwise, define and for all , we let be a -run of length , for .
-
4.
Define and , for and as in the definition of . Let be -runs where has length for , has length , and has length .
5.
Output .
Analysis
Let be fixed such that , and let be fixed such that . Suppose is a string on bits such that every -run in has length at least and all of its -runs have length either greater than or less than . Let be a string on bits that is formed by flipping at most bits within each run of . Let be a trace of . A -run in may have some bits flipped from 0 to 1 in , becoming the substring , so let denote the number of s in .
Next, we prove several properties of when the bit at index in trace was from a -run in and .
Lemma 10.
Let be a random trace from , and let be an index of such that . If the bit at is from a 0-run in , then the following holds for the quantity :
1.
(Property 1) With probability at least the bits at indices and come from a -run adjacent to .
2.
(Property 2) If indices and come from a -run adjacent to , then is upper bounded by a random variable from the distribution .
3.
(Property 3) If and the bits at indices and come from a -run adjacent to , then with probability at least ,
Proof of Property 1.
It suffices to prove the claim for . Index is s to the left of , and therefore not from , since at most s of were flipped to s. Further, by a Chernoff bound, with probability at least the -run left-adjacent to in has at least bits surviving in . At most bits of the -run left-adjacent to in are flipped to , so at least s from this -run survive in . It follows that came from the -run left-adjacent to in . ∎
Proof of Property 2.
Recall that is the number of s in that were not flipped to in . This component of is from the distribution . Let the contribution to by any s not from be the random variable . Each bit that was flipped to in either -run adjacent to in can contribute with probability at most to . From the assumption on and , any other from will be outside of the range . Therefore we can upper bound the contribution of by a random variable sampled from . ∎
Proof of Property 3.
By Property 2, is upper bounded by a random variable from the distribution . By a Chernoff bound, with probability the first binomial term varies from its mean by at most . The second binomial term is upper bounded by and . ∎
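The Chernoff-style concentration invoked in Properties 2 and 3 can be checked empirically; the constants below are illustrative, not the ones in the proof.

```python
import math
import random

def binomial_sample(m, p, rng):
    """One draw from Binomial(m, p)."""
    return sum(1 for _ in range(m) if rng.random() < p)

def concentration_failure_rate(m=400, q=0.3, delta=0.01,
                               trials=2000, seed=0):
    """Fraction of Binomial(m, 1-q) samples that deviate from the
    mean m(1-q) by more than sqrt(3 m ln(1/delta)) -- empirically
    far below delta, as the Chernoff bound predicts."""
    rng = random.Random(seed)
    dev = math.sqrt(3 * m * math.log(1 / delta))
    mean = m * (1 - q)
    fails = sum(1 for _ in range(trials)
                if abs(binomial_sample(m, 1 - q, rng) - mean) > dev)
    return fails / trials
```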
Now we are ready to prove Theorem 2.
Proof of Theorem 2.
Define . Take traces of , , and fix a trace . Our first goal is to find long -dense substrings in ; we can also think of these long -dense substrings as corresponding to long -runs in . Let be the smallest index of such that and there are at least 0s in within indices of , i.e.
Next find the index such that and there are exactly 0s in within the interval of indices , i.e. The goal of this procedure is to find an index such that the bit at is from a -run in with high probability.
With probability at least , every -run in is reduced to a substring with at least s in . This implies that the length interval contains bits from at most two runs in and at most one run with probability . By construction, this interval contains at least s (the inequality coming from the fact that ). Since each -run had at most bits flipped to , there must be at least s in the interval that came from some -run in . In this construction, the s from the that survived in are nested between at most s that were flipped from the left-adjacent -run to in and at most s that were flipped from the right-adjacent -run to in . This implies that the th in this interval must be from the -run .
Compute . Note that with high probability, if a trace does not have 1s to the right of , the original string can be well-approximated by outputting the all 0s string with length . Starting bits to the right of the last bit counted in , continue scanning to the right and repeat this process, finding indices and computing , for . We jump ahead bits to the right between iterations because this forces the next bit that satisfies the condition to not overlap with the previous -run with high probability by Property 1.
We justify that this process succeeds, meaning that it catches all long -runs from , in all traces, with high probability. For each -run in such that , with probability at least , at least bits from all such -runs survive in all traces. Further, there are at most s among these bits. Therefore, with probability at least , we have at least s that have at most s inserted among them, and this triggers the calculation of for some .
By the theorem assumptions, there exists an interval such that no -run in has in the gap . Let be the middle of the gap scaled by , so . By Property 3 and a union bound, with probability at least , all -runs in with will trigger the calculation of an with in all traces, and all -runs in with will either not trigger an calculation, or if they do, will have for all traces.
For every trace , let . If is not the same across all traces, the algorithm fails. Otherwise let for all , and for each trace relabel the with as .
The proof now proceeds similarly to that of Theorem 1. We approximate long -runs in , which are close to some long -dense substrings of with high probability, with -runs, and the rest is approximated with -runs. We first estimate the distance between the -runs in . Consider a -run that generates an estimate of , and take and , for and as in the definition of . The average of the indices can be at most bits to the left of the first from , and therefore is at most off by bits. The same is true for . By a Chernoff bound, is an estimate of the distance between -runs with accuracy with probability at least . The substring between these -runs also has at least a density of s, so we can fill with -runs for a good approximation. Let be -runs where has length for , has length , and has length . Hence by Lemma 7 the -runs contribute at most to the edit distance error.
It remains to estimate the lengths of the long -runs in . Fix , let be a -run of length , for . For every , define as above (the number of s from in ). With probability at least the average of over traces is within of the mean . Combining this with Property 2, with probability at least ,
Since , we have that
This is at worst an approximation of with edit distance error at most
where we use and . Taking a union bound over all , and applying Lemma 7, with probability at least the long -run estimates contribute error at most . Putting this all together, we output the string . One more application of Lemma 7 implies that . Since is within edit distance from , the triangle inequality lets us conclude that .
If we apply this algorithm and analysis with instead of , the result follows. Constants were taken large enough to account for this factor of . ∎
As before, the theorem holds when the constant is unknown. Given traces of , we can find whether has a gap, and the corresponding value, with high probability.
3.3 Majority voting in substrings
A natural follow-up question to the previous theorems is what happens when the string no longer has long runs, but instead has long dense regions.
See 3
Algorithm for majority voting in substrings
Set-up:
String on bits such that can be divided into contiguous intervals all of length at least and density at least of s or s.
1.
Sample a single trace from the deletion channel with probability .
2.
Uniformly partition into contiguous substrings of length , so , with a shorter last interval if needed.
3.
Output , where is a run of length with value the majority bit of for .
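A minimal sketch of the single-trace majority-voting algorithm; the rescaling of each window by 1/(1-q) to compensate for deletions is an assumption standing in for the paper's symbolic run length, and the window width is a free parameter.

```python
def majority_vote_reconstruct(t, w, q=0.0):
    """Partition trace t into contiguous windows of width w (the
    last may be shorter) and replace each window by a single run
    whose value is the window's majority bit and whose length is
    the window length rescaled by 1/(1-q)."""
    out = []
    for i in range(0, len(t), w):
        window = t[i:i + w]
        bit = "1" if 2 * window.count("1") >= len(window) else "0"
        out.append(bit * round(len(window) / (1 - q)))
    return "".join(out)
```

With q = 0 the output simply snaps each window to its majority bit, which already shows the smoothing effect on dense regions.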
Analysis
We first present three properties of the traces generated by high density strings with large length.
Lemma 11.
Fix and . Let be a string on at least bits with density at least of either or . For a trace of , the following properties hold with probability at least .
1.
(Property 1)
2.
(Property 2) .
3.
(Property 3) has density at least of s.
Proof.
Assume w.l.o.g. that has density at least of . Applying a Chernoff bound gives that with probability at least , the length of is in the range . Taking this lower bound gives . Since , we see that , completing the proof of Property 1. Another way of writing the same Chernoff bound result is that , proving Property 2.
Applying a Chernoff bound to the number of s in , with probability at least , the number of non-deleted s is at least . Combining this with the first application of a Chernoff bound, a union bound gives that with probability at least , the density of s in the trace (denoted ) satisfies the following inequalities:
Note that the second inequality comes from the fact that the expression to the left is increasing in , and therefore is minimized at . ∎
Using these results, we can now proceed to the main proof of this section.
Proof of Theorem 3.
Suppose is a binary string on bits that can be divided into intervals such that all intervals have length at least and density at least of either or . Take a trace . Define . Divide the trace into consecutive intervals of width denoted as , where (with shorter if necessary).
Our approximate string is , where is a run of length with value the majority bit of for , and define to be the range of bits in that correspond to the bits in . Define as the bits present in from the interval in .
Consider for some that w.l.o.g. has majority bit . By Property 3 of Lemma 11, at most bits in are . Consider all intervals such that . There are at least such intervals . At most of these intervals can have majority bit . Therefore, the fraction of these intervals that have majority bit is upper bounded by the following for , where we use Property 1 of Lemma 11 to say that :
Thus, in the concatenation of the majority bits of all such that , the fraction of s is at most . Furthermore, the length of this concatenation is within of , where the first term comes from Property 2 in Lemma 11 and the second term comes from the two intervals that could cross both and either or . This approximation of therefore has density at least of and length differing by a fraction of from . Therefore, this is a total of a approximation of . This is true for all .
The last error that needs to be considered in our algorithm is the bits from for all such that for all (in other words is on a boundary). We can assume that the bits in the output string from these are all errors, and there are at most such boundaries. Therefore, this contributes a total error of bits. Putting it all together with Lemma 7, . If we apply this algorithm and analysis with instead of , the result follows. ∎
4 Lower bounds for approximate reconstruction
We turn our attention to proving limitations of approximate reconstruction. We provide two results, one for edit distance approximation and another for Hamming distance. Throughout this section we fix the deletion probability to be a constant .
4.1 Lower bound for edit distance approximation
Let denote a fixed constant. Let be a lower bound on the number of traces required to distinguish between two length strings and with probability at least . We can take to be as small as we like by slightly decreasing the lower bound, and therefore, we assume that . Previous work identifies two strings such that , where the hides the factor [Cha20a, HL20]. They use and for . Our strategy holds for any family of pairs that witness the lower bound. However, we note that this specific pair is already close in edit distance, and hence, outputting either of them would always be an approximation within edit distance two.
We instead form a string of length by concatenating a sequence of blocks, where each block is a uniformly random choice between and . Setting the block length to be , we show that any algorithm that approximates within edit distance must require at least traces for a constant . Our strategy follows previous results on exact reconstruction lower bounds that argue based on traces being independent of the choice of string in each block [Cha20a, HL20, MPV14]. However, the proof is not a straightforward extension because we must account for the algorithm being approximate. In essence, we argue that if the algorithm outputs a good enough approximation, then it must be able to distinguish between in many blocks.
Input Distribution and Indistinguishable Blocks
We define the hard distribution as follows. Let and be strings of length . We construct a random string of length by concatenating blocks . Each of the substrings is set to be or uniformly and independently. The approximate reconstruction algorithm receives traces for . By assumption, with traces from or , the algorithm must fail to distinguish between them with probability at least . As this is an information-theoretic statement, we next argue that the traces are independent of the choice between and with probability at least .
To formalize this claim, we introduce some notation. Let denote a set of traces generated from the random string described above by passing through the deletion channel times. Since the channel deletes bits independently, we can equivalently determine the set of traces by passing each block for through the channel one at a time and then concatenating the subsequences to form a trace from . We let denote the distribution over sets of traces where generates these traces. By our assumption, any algorithm that receives traces must fail to distinguish between and with probability at least .
Next, we decompose the trace distribution in a way that relates the failure probability to the event that the traces are independent of . We express the distribution over traces of as a convex combination of two distributions and , where intuitively sampling from corresponds to being unable to determine with any advantage (see Definition 1 below). Formally, we take and to be any distributions over traces of such that for some we have
(1)
where , and moreover, the following three properties hold: (i) is independent of , (ii) is not independent of whether or , and (iii) the distributions and have disjoint supports. We sketch how to construct distributions as in Eq. (1). The distribution from is discrete over the -wise product of distributions over . Depending on , the distribution gives different weights to each subsequence based on its length and the number of times it is a subsequence of . Assume that some traces have higher probability of occurring under than . Assign the mass in that comes from to and the remainder to (if the probability is higher for , swap and ). Doing this for all multisets of subsequences leads to and having disjoint support. The parameter normalizes the distributions.
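The construction sketched above has a simple concrete analogue for two explicit finite distributions: split off the common (indistinguishable) part min(P, Q) and normalize the two disjointly supported remainders. This is a simplified illustration, not the paper's construction over multisets of traces.

```python
def decompose(P, Q):
    """Write P = p*A + (1-p)*BP and Q = p*A + (1-p)*BQ, where A is
    the normalized common part min(P, Q) and BP, BQ have disjoint
    supports, mirroring the structure of Eq. (1).  P and Q are
    dicts mapping outcomes to probabilities summing to 1, assumed
    neither identical nor disjointly supported."""
    keys = set(P) | set(Q)
    common = {x: min(P.get(x, 0.0), Q.get(x, 0.0)) for x in keys}
    p = sum(common.values())
    A = {x: c / p for x, c in common.items() if c > 0}
    BP = {x: (P.get(x, 0.0) - common[x]) / (1 - p)
          for x in keys if P.get(x, 0.0) > common[x]}
    BQ = {x: (Q.get(x, 0.0) - common[x]) / (1 - p)
          for x in keys if Q.get(x, 0.0) > common[x]}
    return p, A, BP, BQ
```

Here p equals the common mass 1 - TV(P, Q), so sampling from the shared part A reveals nothing about which of the two distributions generated the data.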
We now argue that by claiming that there is an algorithm using traces with failure probability at most . By our hypothesis, with traces, any algorithm has failure probability at least . This implies that , which leads to . Since and have disjoint supports, the traces from these distributions identify , and the algorithm correctly determines . From Eq. (1), with probability , the traces are sampled from or . Otherwise, with probability , traces are sampled from . When traces come from , an algorithm that outputs either or has probability of being correct.
Now, define a binary latent variable such that with probability and with probability . If , then samples traces from , and if , it samples from . Using this notation, we can define the event that the traces are independent of a block in . Recall that we sample traces from by sampling traces from for each and then concatenating the traces of the blocks (using an arbitrary but fixed ordering of the traces).
Definition 1.
For , we say that the block is indistinguishable from the traces of if the distribution samples the traces of the block from , or in other words, if .
Lemma 12.
If and the number of blocks satisfies , then at least blocks are indistinguishable with probability at least .
Proof.
Using the notation and arguments surrounding Eq. (1), we have that , which implies that with probability at least . Hence, the expected number of indistinguishable blocks is at least . Since traces are generated for each block independently, the binary random variables are independent. By a Chernoff bound, the probability that the number of indistinguishable blocks deviates from its mean by is at most , where we have used that and . ∎
From Indistinguishable Blocks to Edit Distance Error
We move on to a technical lemma that allows us to lower bound the edit distance by looking at the indicator vectors for the agreement of substrings in an optimal alignment. In what follows, we consider partitions into substrings, which are collections of non-overlapping, contiguous sequences of characters (a substring may be empty; substrings in a partition may have varying lengths).
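The edit distance used throughout can be computed by the standard dynamic program; a minimal implementation for reference:

```python
def edit_distance(x, y):
    """Levenshtein distance between strings x and y
    (unit-cost insertions, deletions, and substitutions)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                  # delete cx
                           cur[j - 1] + 1,               # insert cy
                           prev[j - 1] + (cx != cy)))    # substitute
        prev = cur
    return prev[-1]
```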
Lemma 13.
Let and be strings. For an integer , assume that is partitioned into substrings . Then, there exists a partition of into substrings such that¹

¹ It is tempting to conjecture that equality can be achieved in Lemma 13 if we instead take the minimum over all partitions of . However, an example shows that this does not always hold. Over the alphabet , consider the pair and . Their edit distance is . Using four blocks, partition . Decompose . Summing the indicator vectors only equals two, and not four.
Proof.
Let . We proceed by induction on the number of substrings . For the base case, , we have that the edit distance between and is zero if and only if . For the inductive step, assume the lemma holds up to substrings with . We consider two cases, where in both we will split into two substrings .
For the first case, assume that matches the prefix of , so that . Then, we have that . Applying the inductive hypothesis with substrings for the pair and finishes this case.
For the second case, does not match the prefix of , and hence, any minimum edit distance alignment between and uses at least one edit in the portion. Consider any alignment between and with edits. Let denote the partition where is aligned to and is aligned to . Since the prefixes differ, we have , which implies that . Applying the inductive hypothesis with substrings to the pair and leads to a partition such that . We conclude that for this partition of . ∎
Using the above lemmas, we can now prove the edit distance lower bound theorem.
See 4
Proof.
Let be a small constant such that and , where we set . Assume that the approximate reconstruction algorithm receives traces.
Let denote the output of the reconstruction algorithm on input where and . Assume for contradiction that with high probability. Using Lemma 13, we can partition into blocks such that
(2)
Since , Lemma 12 establishes that there are at least blocks in that are indistinguishable with high probability using the traces. For each of these blocks, the algorithm cannot guess between or with any advantage. While we do not know how the alignment corresponds to the indistinguishable blocks, we know that for at least values , we have that with probability at least 1/2. Thus, the sum in Eq. (2) is at least in expectation, and by a Chernoff bound, it is at least with high probability. This implies that , contradicting the edit distance being at most . ∎
Corollary 5 now follows immediately from this theorem and the previous trace reconstruction lower bounds [Cha20a], showing that for , we have that traces are necessary to -approximately reconstruct an arbitrary -bit string with probability .
4.2 Lower Bound for Hamming Distance Approximation
See 6
Proof.
Let . Define to be the string of zeros followed by pairs of and ending with zeros. Define to be zeros followed by pairs of and ending with zeros. These two strings have Hamming distance .
Differentiating between and is equivalent to determining the number of s at the beginning or end of them (as this is a promise problem). It is known that it requires traces to determine whether the length of the 0-run at the beginning is even or odd with probability at least , because the problem reduces to distinguishing between two binomial distributions [BKKM04]. Therefore, with probability at least , a reconstruction algorithm using fewer traces must output a string that is at least Hamming distance away from the actual string. ∎
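The binomial reduction can be made quantitative: a 0-run of length n survives the channel as a Binomial(n, 1-q) count, so distinguishing run lengths n and n+1 amounts to distinguishing two binomials, whose total variation distance shrinks as n grows. A sketch:

```python
import math

def binom_pmf(n, p, k):
    """P[Binomial(n, p) = k]."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def tv_binomials(n, q):
    """Total variation distance between Binomial(n, 1-q) and
    Binomial(n+1, 1-q): the distributions of the surviving bits of
    0-runs of lengths n and n+1.  Roughly 1/TV^2 traces are needed
    to distinguish the two with constant advantage."""
    p = 1 - q
    return 0.5 * sum(abs(binom_pmf(n, p, k) - binom_pmf(n + 1, p, k))
                     for k in range(n + 2))
```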
5 Conclusion
We studied the challenge of determining the relative trace complexity of approximate versus exact string reconstruction. Outputting a string close to the original in edit distance with few traces is a central problem in DNA data storage that has gone largely unstudied in favor of exact reconstruction. We presented algorithms for classes of strings, where these classes lend themselves to techniques in every theoretician’s toolbox (e.g., concentration bounds, estimates from averages), while introducing new alignment techniques that may be useful for other algorithms. Additionally, these classes of strings are hard to reconstruct exactly (they contain the set of -bit strings with Hamming weight , which suffices to derive an lower bound on the trace complexity).
We leave open the intriguing question of whether -approximate reconstruction is actually easier than exact reconstruction for all strings; we showed that it is easier for at least some strings. Our algorithms output a string within edit distance from the original string using traces for large classes of strings. In some cases, we showed how to approximately reconstruct with a single trace. We also presented lower bounds that interpolate between the hardness of approximate and exact trace reconstruction.
Algorithms with small sample complexity for the approximate trace reconstruction problem could also provide insight into exact solutions. If we know that the unknown string belongs to a specified Hamming ball of radius , then one can recover the string exactly with traces by estimating the histogram of length subsequences [KR97, KMMP19]. It is an open question whether an analogous claim can be proven for edit distance [GSZ20]. Do traces suffice if we know an edit ball of radius that contains the string? If this is true, then an algorithm satisfying our notion of edit distance approximation would imply an exact reconstruction result.
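The histogram of length-k subsequences mentioned above (the "k-deck") can be computed exactly by dynamic programming; a minimal sketch:

```python
from itertools import product

def k_deck(x, k):
    """Map each length-k binary string w to the number of times w
    occurs as a (not necessarily contiguous) subsequence of x."""
    def count(w):
        # dp[j] = number of embeddings of w[:j] in the prefix of x
        # scanned so far
        dp = [1] + [0] * len(w)
        for c in x:
            for j in range(len(w), 0, -1):
                if w[j - 1] == c:
                    dp[j] += dp[j - 1]
        return dp[-1]
    return {"".join(w): count(w) for w in product("01", repeat=k)}
```

Estimating these counts from traces is what drives the Hamming-ball recovery result; whether an analogous statistic works for edit balls is the open question above.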
Approximate trace reconstruction is also a specialization of list decoding for the deletion channel, where the goal is to output a small set of strings that contains the correct one with high probability. We are not aware of any work on list decoding in the context of trace reconstruction, even though it seems like a natural problem to study. Using an approximate reconstruction algorithm, we could output the whole edit ball around the approximate string. For more on list decoding with insertions and deletions, see the work by Guruswami, Haeupler, and Shahrasbi and references therein [GHS20].
6 Acknowledgments
We thank João Ribeiro and Josh Brakensiek for discussions on coded trace reconstruction, as well as the anonymous reviewers for helpful feedback on an earlier version of the paper.
References
- [ADHR12] Alexandr Andoni, Constantinos Daskalakis, Avinatan Hassidim, and Sebastien Roch. Global alignment of molecular sequences via ancestral state reconstruction. Stochastic Processes and their Applications, 122(12):3852–3874, 2012.
- [AVDGiF19] Mahed Abroshan, Ramji Venkataramanan, Lara Dolecek, and Albert Guillén i Fàbregas. Coding for deletion channels with multiple traces. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1372–1376. IEEE, 2019.
- [BCF+19] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 745–768. IEEE Computer Society, 2019.
- [BCSS19] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), volume 145 of LIPIcs, pages 44:1–44:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
- [BKKM04] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 910–918, 2004.
- [BLS20] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In IEEE Annual Symposium on Foundations of Computer Science, FOCS, 2020.
- [BPRS20] V. Bhardwaj, P. A. Pevzner, C. Rashtchian, and Y. Safonova. Trace Reconstruction Problems in Computational Biology. IEEE Transactions on Information Theory, pages 1–1, 2020.
- [CDL+21] Xi Chen, Anindya De, Chin Ho Lee, Rocco A Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
- [CGK12] George M. Church, Yuan Gao, and Sriram Kosuri. Next-Generation Digital Information Storage in DNA. Science, 337(6102):1628, 2012.
- [CGMR20] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020.
- [Cha20a] Zachary Chase. New lower bounds for trace reconstruction. Annales de l’Institut Henri Poincaré (to appear), 2020. Preprint at https://arxiv.org/abs/1905.03031.
- [Cha20b] Zachary Chase. New upper bounds for trace reconstruction. Preprint available at https://arxiv.org/abs/2009.03296, 2020.
- [CKY20] Johan Chrisnata, Han Mao Kiah, and Eitan Yaakobi. Optimal Reconstruction Codes for Deletion Channels. Preprint available at https://arxiv.org/abs/2004.06032, 2020.
- [DM07] Eleni Drinea and Michael Mitzenmacher. Improved lower bounds for the capacity of iid deletion and duplication channels. IEEE Transactions on Information Theory, 53(8):2693–2714, 2007.
- [DOS19] Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
- [DRR19] Sami Davies, Miklós Z. Rácz, and Cyrus Rashtchian. Reconstructing trees from traces. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory (COLT), volume 99 of Proceedings of Machine Learning Research, pages 961–978. PMLR, 2019.
- [GBC+13] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, 2013.
- [GHS20] Venkatesan Guruswami, Bernhard Haeupler, and Amirbehshad Shahrasbi. Optimally resilient codes for list-decoding from insertions and deletions. In Proc. 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 524–537, 2020.
- [GSZ20] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Distance. Preprint available at https://arxiv.org/abs/2011.13737, 2020.
- [HHP18] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO), pages 54–61, 2018.
- [HL20] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020.
- [HM14] Bernhard Haeupler and Michael Mitzenmacher. Repeated deletion channels. In 2014 IEEE Information Theory Workshop (ITW 2014), pages 152–156. IEEE, 2014.
- [HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proc. 19th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 389–398, 2008.
- [HPP18] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Proceedings of the 31st Conference On Learning Theory (COLT), pages 1799–1840, 2018.
- [KMMP19] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. Preprint at https://arxiv.org/abs/1904.09618, 2019.
- [KNY20] Han Mao Kiah, Tuan Thanh Nguyen, and Eitan Yaakobi. Coding for Sequence Reconstruction for Single Edits. In IEEE International Symposium on Information Theory (ISIT), 2020.
- [KR97] Ilia Krasikov and Yehuda Roditty. On a Reconstruction Problem for Sequences. Journal of Combinatorial Theory, Series A, 77(2):344–348, 1997.
- [LCA+19] Randolph Lopez, Yuan-Jyue Chen, Siena Dumas Ang, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Georg Seelig, Karin Strauss, and Luis Ceze. DNA assembly for nanopore data storage readout. Nature Communications, 10(1):1–9, 2019.
- [Lev01] Vladimir I. Levenshtein. Efficient Reconstruction of Sequences from Their Subsequences or Supersequences. Journal of Combinatorial Theory, Series A, 93(2):310–332, 2001.
- [Mit09] Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1–33, 2009.
- [MPV14] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace Reconstruction Revisited. In European Symposium on Algorithms (ESA), pages 689–700. Springer, 2014.
- [Nar21] Shyam Narayanan. Population Recovery from the Deletion Channel: Nearly Matching Trace Reconstruction Bounds. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021. Preprint at https://arxiv.org/abs/2004.06828.
- [NP17] Fedor Nazarov and Yuval Peres. Trace reconstruction with samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1042–1046, 2017.
- [NR21] Shyam Narayanan and Michael Ren. Circular Trace Reconstruction. In Proceedings of Innovations in Theoretical Computer Science (ITCS), 2021. Preprint at https://arxiv.org/abs/2009.01346.
- [OAC+18] Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher N Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, and Karin Strauss. Random access in large-scale DNA data storage. Nature Biotechnology, 36:242–248, 2018.
- [PZ17] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 228–239. IEEE Computer Society, 2017.
- [SDDF18] Sundara Rajan Srinivasavaradhan, Michelle Du, Suhas Diggavi, and Christina Fragouli. On maximum likelihood reconstruction over multiple deletion channels. In IEEE International Symp. on Information Theory (ISIT), pages 436–440, 2018.
- [SDDF20] Sundara Rajan Srinivasavaradhan, Michelle Du, Suhas Diggavi, and Christina Fragouli. Algorithms for reconstruction over single and multiple deletion channels. Preprint available at https://arxiv.org/abs/2005.14388, 2020.
- [SYY20] Omer Sabary, Eitan Yaakobi, and Alexander Yucovich. The error probability of maximum-likelihood decoding over two deletion channels. Preprint available at https://arxiv.org/abs/2001.05582, 2020.
- [VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved String Reconstruction Over Insertion-Deletion Channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 399–408, 2008.
- [YGM17] SM Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free DNA-based data storage. Scientific reports, 7(1):1–6, 2017.
Appendix A Appendix
The following are the proofs omitted from our warm-up approximate reconstruction algorithms.
A.1 Analysis of first warm-up algorithm
Proof of Proposition 8.
It is straightforward to check that if contains runs, then with probability at least all traces contain runs. Next, we estimate the lengths of the runs in . For traces , label the runs in as , and recall that denotes the length of the th run, , in . For , the scaled average estimates for . Applying a Chernoff bound and then a union bound, . Let , where substring is a run with length and bit value matching run of the traces. We have seen that with probability at least , for every the edit distance between and is at most . On this event, has edit distance at most from , by Lemma 7. ∎
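The scaled-average estimator in this proof can be sketched as follows; the filtering step (keeping only traces that preserve every run) mirrors Corollary 14 below, and all numeric parameters are illustrative.

```python
import random

def trace(x, q, rng):
    """One pass of x through the deletion channel."""
    return "".join(b for b in x if rng.random() > q)

def runs(s):
    """Maximal runs of s, e.g. '11100' -> [('1', 3), ('0', 2)]."""
    out = []
    for b in s:
        if out and out[-1][0] == b:
            out[-1][1] += 1
        else:
            out.append([b, 1])
    return [(b, l) for b, l in out]

def estimate_runs(traces, q):
    """Estimate each run of the unknown string by the average of its
    observed lengths across traces, scaled by 1/(1-q).  Assumes all
    traces preserve every run (same run count)."""
    per_trace = [runs(t) for t in traces]
    k = len(per_trace[0])
    assert all(len(r) == k for r in per_trace), "traces disagree on run count"
    est = []
    for i in range(k):
        avg = sum(r[i][1] for r in per_trace) / len(per_trace)
        est.append((per_trace[0][i][0], round(avg / (1 - q))))
    return est
```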
We can also achieve slightly stronger guarantees. If the number of traces in Proposition 8 is linear, then the algorithm in fact reconstructs exactly with high probability. Moreover, the algorithm for Proposition 8 approximately reconstructs strings that only approximately satisfy the current assumptions, as stated in the following corollary.
Corollary 14 (Robustness).
Let be an -bit string such that all runs have length at least except for at most runs. We can -approximately reconstruct with traces.
Proof.
Taking , with probability every long run (those with length at least ) will not be entirely deleted, and with probability at least none of the short runs are entirely deleted. By a Chernoff bound, with probability at least the number of traces where no short run is entirely deleted is at least . We identify the traces with the maximum number of runs and then run the algorithm for Proposition 8 on these traces. ∎
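The trace-selection step in this proof can be sketched as follows. Since deletions can only merge runs (never split them), a trace with more runs lost fewer runs entirely; keeping the traces with the most runs therefore favors traces in which no short run vanished. The function names and interface are ours, assuming binary strings.

```python
from itertools import groupby

def num_runs(s):
    """Count maximal runs of equal bits in s, e.g. '0101' has 4, '0011' has 2."""
    return sum(1 for _ in groupby(s))

def select_best_traces(traces, k):
    """Keep the k traces with the most runs: deletions only merge runs,
    so a higher run count indicates fewer runs were deleted entirely."""
    return sorted(traces, key=num_runs, reverse=True)[:k]
```

The selected traces are then fed directly into the run-averaging algorithm of Proposition 8.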
A.2 Analysis of second warm-up algorithm
Proof of Proposition 9.
Suppose that all of the -runs of have length at least . Take a single trace . By a Chernoff bound, with probability at least , every -run from with length at least will have length at least in . Find every -run in with length at least and index them as ,…,. For , let be the bits between the last bit of and the first bit of and let be the bits before and the bits after . Let be the contiguous substring of from which came and the contiguous substring of from which came. For all , we will approximate with a -run of length , and with , a -run of length .
Since contains alternating -runs with length at least and -runs with length at most , has at least a density of s. By a Chernoff bound, . Therefore and have edit distance at most . If , then, as before, by a Chernoff bound , and so has edit distance at most from . If then the approximation of s has edit distance at most from with probability at least .
Let and observe that the number of -runs is at most , since there are at most this many -runs, which separate the -runs. Then applying Lemma 7, we have with probability at least that
The proposition follows by taking . ∎
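The single-trace approximation in this proof can be sketched as follows, under simplifying assumptions: the long 0-runs in the trace serve as anchors, each anchor is replaced by a rescaled 0-run, and every stretch between anchors is approximated entirely by 1s (since those stretches are assumed dense in 1s). The `threshold` parameter plays the role of the run-length cutoff; the names and the per-run rounding convention are our own.

```python
from itertools import groupby

def approx_from_single_trace(trace, q, threshold):
    """Single-trace sketch: '0'-runs of length >= threshold act as anchors.
    Each anchor becomes a rescaled '0'-run; everything between anchors is
    approximated by a rescaled '1'-run, as the gaps are assumed dense in 1s."""
    scale = 1.0 / (1.0 - q)
    pieces = []  # list of (bit, length) for the output string
    for bit, group in groupby(trace):
        length = len(list(group))
        if bit == "0" and length >= threshold:
            pieces.append(("0", round(length * scale)))       # anchor run
        elif pieces and pieces[-1][0] == "1":
            # fold short runs into the surrounding gap of 1s
            pieces[-1] = ("1", pieces[-1][1] + round(length * scale))
        else:
            pieces.append(("1", round(length * scale)))
    return "".join(bit * n for bit, n in pieces)
```

For example, with no deletions (q = 0) and cutoff 5, the short 0-run in `110100000011` is absorbed into the surrounding 1s while the long 0-run is kept as an anchor, yielding `111100000011`.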
Appendix B Chernoff-Hoeffding Bound
In many proofs, we use the following concentration bound:
Lemma 15 (Chernoff-Hoeffding bound).
Let $X_1, \ldots, X_n$ be independent random variables taking values in $[0,1]$. Let $X = \sum_{i=1}^{n} X_i$ with $\mu = \mathbb{E}[X]$. Then for $0 < \varepsilon < 1$, $\Pr[X \geq (1+\varepsilon)\mu] \leq e^{-\varepsilon^2 \mu / 3}$, $\Pr[X \leq (1-\varepsilon)\mu] \leq e^{-\varepsilon^2 \mu / 2}$, and the following holds: $\Pr[|X - \mu| \geq \varepsilon\mu] \leq 2e^{-\varepsilon^2 \mu / 3}$.
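As a quick sanity check, the standard multiplicative upper-tail form of this bound, $\Pr[X \geq (1+\varepsilon)\mu] \leq e^{-\varepsilon^2\mu/3}$, can be compared against a Monte Carlo estimate for Bernoulli sums. The parameters below are illustrative choices of ours, not values used in the paper.

```python
import math
import random

def chernoff_upper_bound(mu, eps):
    """Standard upper-tail bound: Pr[X >= (1+eps)*mu] <= exp(-eps^2 * mu / 3)."""
    return math.exp(-eps * eps * mu / 3)

def empirical_upper_tail(n, p, eps, trials=20000):
    """Estimate Pr[X >= (1+eps)*mu] for X = sum of n independent Bernoulli(p) bits."""
    mu = n * p
    threshold = (1 + eps) * mu
    hits = sum(
        sum(random.random() < p for _ in range(n)) >= threshold
        for _ in range(trials)
    )
    return hits / trials
```

For, say, n = 200, p = 0.5, and ε = 0.3, the empirical tail probability falls well below the analytic bound exp(−3) ≈ 0.0498, as expected.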