
Approximate Trace Reconstruction

Sami Davies, University of Washington (daviess@uw.edu). Miklós Z. Rácz, Princeton University (mracz@princeton.edu); research supported in part by NSF grant DMS 1811724 and by a Princeton SEAS Innovation Award. Cyrus Rashtchian, Dept. of Computer Science & Engineering, University of California, San Diego (crashtchian@eng.ucsd.edu). Benjamin G. Schiffer, Princeton University (bgs3@princeton.edu).
Abstract

In the usual trace reconstruction problem, the goal is to exactly reconstruct an unknown string of length $n$ after it passes through a deletion channel many times independently, producing a set of traces (i.e., random subsequences of the string). We consider the relaxed problem of approximate reconstruction. Here, the goal is to output a string that is close to the original one in edit distance while using far fewer traces than are needed for exact reconstruction. We present several algorithms that can approximately reconstruct strings that belong to certain classes, where the estimate is within $n/\mathrm{polylog}(n)$ edit distance and where we only use $\mathrm{polylog}(n)$ traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and that are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. To complement our algorithms, we present a general black-box lower bound for approximate reconstruction, building on a lower bound for distinguishing between two candidate input strings in the worst case. In particular, this shows that approximating to within $n^{1/3-\delta}$ edit distance requires $n^{1+3\delta/2}/\mathrm{polylog}(n)$ traces for $0<\delta<1/3$ in the worst case.

1 Introduction

In the trace reconstruction problem, we observe noisy samples of an unknown binary string after passing it through a deletion channel several times [BKKM04, Lev01]. For a parameter $q\in(0,1)$, the channel deletes each bit of the string with probability $q$ independently, resulting in a trace. The deletions for different traces are also independent. We only observe the concatenation of the surviving bits, without any information about the deleted bits or their locations.
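To make the channel model concrete, here is a minimal Python sketch of trace generation; representing strings as Python str objects of '0'/'1' characters and the function name deletion_channel are our own illustrative choices, not part of the formal model.

```python
import random

def deletion_channel(x: str, q: float) -> str:
    """Delete each bit of x independently with probability q and
    concatenate the surviving bits to form a trace."""
    return "".join(b for b in x if random.random() > q)

# Draw several independent traces of the same unknown string.
x = "1" * 40 + "0" * 25 + "1" * 35
traces = [deletion_channel(x, q=0.5) for _ in range(5)]
```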

How many samples (traces) from the deletion channel does it take to exactly recover the unknown string with high probability? Despite two decades of work, this question is still wide open. For a worst-case string, very recent work shows that $\exp(\widetilde{O}(n^{1/5}))$ traces suffice [Cha20b], building upon the previous best bound of $\exp(O(n^{1/3}))$ [DOS19, NP17]; furthermore, $\widetilde{\Omega}(n^{3/2})$ traces are necessary [Cha20a, HL20]. Improved upper bounds are known in the average-case setting, where the unknown string is uniformly random [BKKM04, HMPW08, HPP18, MPV14, PZ17, VS08], in the coded setting, where the string is guaranteed to reside in a pre-defined set [BLS20, CGMR20, SYY20, SDDF18, SDDF20], and in the smoothed-analysis setting, where the unknown string is perturbed before trace generation [CDL+21].

Given that exact reconstruction may be challenging, we relax the requirements and ask: when is it possible to approximately reconstruct an unknown string with much less information? More precisely, the algorithm should output a string that is close to the true string under some metric. Since the channel involves deletions, we consider edit distance, which measures the minimum number of insertions, deletions, and substitutions between a pair of strings. Letting $n$ denote the length of the unknown string, we investigate the necessary and sufficient number of traces to approximate the string up to $\varepsilon n$ edit distance. We call this the $\varepsilon n$-approximate reconstruction problem.

Trace reconstruction has received much recent attention because of DNA data storage, where reconstruction algorithms are used to recover the data [OAC+18, CGK12, BPRS20, GBC+13, YGM17, LCA+19]. Biochemical advances have made it possible to store digital data using synthetic DNA molecules with higher information density than electromagnetic devices. During the data retrieval process, each DNA molecule is imperfectly replicated several times, leading to a set of noisy strings that contain insertion, substitution, and deletion errors. Error-correcting codes are used to deal with missing data, and hence, an approximate reconstruction algorithm would be practically useful. Decreasing the number of traces would reduce the time and cost of data retrieval.

For any deletion probability $q$, a single trace achieves a $qn$-approximation in expectation. On the other hand, if $q=1/2$, then it is not clear whether $(n/100)$-approximate reconstruction requires asymptotically fewer traces than exact reconstruction. More generally, we propose the following goal: determine the smallest $\varepsilon$ such that any string can be $\varepsilon n$-approximately reconstructed with $\mathrm{poly}(n)$ traces. Here $\varepsilon$ is a parameter that may depend on $n$ and $q$. Although we focus on an information-theoretic formulation (measuring the number of traces), the reconstruction algorithm should also be computationally efficient (polynomial time in $n$ and the number of traces).

A natural approach would be to transform exact reconstruction methods into more efficient approximation algorithms. Unfortunately, revising these algorithms to allow some incorrect bits seems nontrivial or perhaps impossible. For example, certain algorithms assume that the string has been perfectly recovered up to some point, and they use this to align the traces and determine the next bit [BKKM04, HMPW08, HPP18]. Another technique involves computing the bit-wise average of the traces and outputting the string that most closely matches the average. These mean-based statistics suffice to distinguish any pair of strings, but only if there are $\exp(\Omega(n^{1/3}))$ traces [DOS19, NP17]. Also, the maximum likelihood solution is poorly understood for the deletion channel, and current analyses are limited to a small number of traces [Mit09, SYY20, SDDF18, SDDF20].

Designing algorithms to find approximate solutions may in fact require fundamentally different methods than exact reconstruction. For a simple supporting example, consider the family of strings containing all ones except for a single zero that lies in some position between $n/3$ and $2n/3$, e.g., $11\cdots 11011\cdots 11$. Determining the exact position of the zero requires $\Omega(n)$ traces when the deletion probability is a constant [BKKM04, MPV14]. However, if the string comes from this family, we can output the all-ones vector and achieve an approximation to within Hamming distance one.

As a starting point, we consider classes of strings defined by run-length assumptions. For instance, if the 1-runs are sufficiently long and the 0-runs are either short or long, we can $\varepsilon n$-approximately reconstruct the string using $O(\log(n)/\varepsilon^{2})$ traces. We then strengthen our upper bound to work even when the string can be partitioned into regions that are either locally dense or sparse. Finally, we prove new lower bounds on the necessary trace complexity; for example, approximating arbitrary strings to within $n^{1/3-\delta}$ edit distance requires $n^{1+3\delta/2}/\mathrm{polylog}(n)$ traces for any constant $0<\delta<1/3$.

1.1 Related work

The trace reconstruction problem was introduced to the theoretical computer science community by Batu, Kannan, Khanna, and McGregor [BKKM04]. There is an exponential gap between the known upper and lower bounds for the number of traces needed to reconstruct an arbitrary string with constant deletion rate [Cha20a, DOS19, Cha20b, HL20, NP17]. The main open theoretical question is whether a polynomial number of traces suffices. There has also been experimental and theoretical work on maximum likelihood decoding, where approximation algorithms have been developed for average-case strings given a constant number of traces [SYY20, SDDF18, SDDF20]. Holden, Pemantle, and Peres show that $\exp(O(\log^{1/3}n))$ traces suffice for reconstructing a random string, building on previous work [BKKM04, HMPW08, HPP18, PZ17, VS08]. Chase extended work by Holden and Lyons to show that $\widetilde{\Omega}(\log^{5/2}n)$ traces are necessary for random strings [Cha20a, HL20].

A related question to ours is to understand the limitations of known techniques for distinguishing pairs of strings that are close in edit distance. Grigorescu, Sudan, and Zhu show that there exist pairs that cannot be distinguished with a small number of traces when using a mean-based algorithm [GSZ20]. They further identify strings that are separated in edit distance, yet can be exactly reconstructed with few traces. Their results are incomparable to ours because the sets of strings they consider are different. Instead of considering local density assumptions, they consider local agreement up to single-bit shifts. Their algorithm uses a subexponential number of traces only when the edit distance separation is at most $o(\sqrt{n})$.

Many other variants of trace reconstruction have been studied as stand-alone problems, united by the goal of broadening our understanding of reconstruction problems. Krishnamurthy, Mazumdar, McGregor, and Pal consider matrices (rows/columns deleted) and sparse strings [KMMP19]. Davies, Rácz, and Rashtchian consider labeled trees, where the additional structure of some trees leads to more efficient reconstruction [DRR19]. Circular trace reconstruction considers strings and traces up to circular rotations of the bits [NR21]. Population recovery reconstructs multiple unknown strings simultaneously [BCF+19, BCSS19, Nar21]. Going beyond i.i.d. deletions, algorithms have also been developed for position- or character-dependent error rates [HHP18], and for ancestral state reconstruction, where deletions are based on a Markov chain [ADHR12]. We also mention that forms of approximate trace reconstruction have been studied in more applied frameworks; in particular, Srinivasavaradhan, Du, Diggavi, and Fragouli study heuristics for approximate reconstruction given one or two traces [SDDF18].

Comparison to Coded Trace Reconstruction.

Cheraghchi, Gabrys, Milenkovic, and Ribeiro explore coded trace reconstruction, where the unknown string is assumed to come from a code, and they show that codewords can be reconstructed with high probability using far fewer traces than average-case reconstruction [CGMR20] (see also [DM07, Lev01, Mit09]). Brakensiek, Li, and Spang extend this work and present codes with rate $1-\gamma$ that can be reconstructed using $\exp(O(\log^{1/3}(1/\gamma)))$ traces [BLS20]. Improved coded reconstruction results are known when the number of errors in a trace is a fixed constant [AVDGiF19, CKY20, HM14, KNY20, SYY20].

An existing approach for coded trace reconstruction does use approximation as an intermediate step, where the original string can be recovered after error correction [BLS20]. Our focus is different, and our results are incomparable to those from coded trace reconstruction. We investigate classes of strings that are very different from codes (e.g., pairs of strings in our classes can be very close). We also consider strings that require $\Omega(n)$ traces to exactly reconstruct, whereas the work on coded trace reconstruction shows that their classes of strings can be exactly reconstructed with a sublinear number of traces. Overall, we do not aim to optimize the “rate” of our classes of strings. Instead, our main contribution is the effectiveness of new algorithmic techniques and local approximation methods, including novel alignment ideas and the use of runs in approximating edit distance.

Additionally, coded trace reconstruction lower bounds can be used as a black box to obtain lower bounds for approximate trace reconstruction by constructing a code that is an $\varepsilon n$-net [BLS20]. However, these lower bounds reduce to results on average-case reconstruction, and hence, this approach currently leads to lower bounds for approximate reconstruction that are exponentially smaller than what we prove.

1.2 Our results

We assume that the deletion probability $q$ is a fixed constant and $p:=1-q$ is the retention probability. In our statements, $C,C^{\prime},C^{\prime\prime},C_{1},C_{2},\ldots$ denote constants, and $O(\cdot)$ hides constants, where these may depend on $p,q$. Unless stated otherwise, $\log(\cdot)$ has base $1/q$. The phrase with high probability means probability at least $1-O(1/n)$. A run in a string is a substring of consecutive bits of the same value, and we often refer specifically to 0-runs and 1-runs. We use bold $\mathbf{r}$ to denote a run, or more generally a substring, and let $|\mathbf{r}|$ denote its length (number of bits). Table 1 summarizes our results, and we restate the theorems in the relevant sections for the reader’s convenience.

Algorithms for approximate reconstruction

Our results exhibit the ability to approximately reconstruct strings based on various run-length or density assumptions. For these classes of strings, we develop new polynomial-time, alignment-based algorithms, and we show that $O(\log(n)/\varepsilon^{2})$ traces suffice. We assume that the algorithms know $n$, $q$, $\varepsilon$, and the class that the unknown string comes from, though the last assumption is not necessary for Theorem 1. We also provide warm-up examples (see Proposition 8 and Proposition 9 in Section 2), which may be helpful to the reader before diving into the algorithms in Section 3.

Our first theorem only requires 1-runs to be long, while the length of the 0-runs is more flexible; they can be either long or short, assuming there is a gap.

Theorem 1.

Let $X$ be a string on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon$ and none of its 0-runs have length between $C^{\prime}\log(n)$ and $3C^{\prime}\log(n)$. There exists some constant $C$ such that if $C^{\prime}\geqslant C$, then $X$ can be $\varepsilon n$-approximately reconstructed with $O(\log(n)/\varepsilon^{2})$ traces.

Table 1: Table of sample complexity bounds for $\varepsilon n$-approximate reconstruction.
Classes of strings | # samples for $\varepsilon n$-approx. | Reference
All runs have length $\geqslant 5\log(n)$ | $O(\log(n)/\varepsilon^{2})$ | Proposition 8 & Corollary 14
The 1-runs have length $\geqslant C^{\prime}\log(n)/\varepsilon^{2}$ | 1 | Proposition 9
Long 1-runs; either long or short 0-runs | $O(\log(n)/\varepsilon^{2})$ | Theorem 1 & Theorem 2
Intervals of length $\geqslant C^{\prime}\log(n)/\varepsilon^{2}$, density $\geqslant 1-\frac{\varepsilon}{12}$ | 1 | Theorem 3
Arbitrary strings, $n^{1/3-\delta}$ edit distance, $\delta\in(0,\frac{1}{3})$ | $\widetilde{\Omega}(n^{1+3\delta/2})$ | Theorem 4 & Corollary 5

The following theorem extends Theorem 1 to a wider class of strings by allowing many of the bits in the runs to be arbitrarily flipped to the opposite value.

Theorem 2.

Suppose that $p>3\varepsilon$. Let $Y$ be a string on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon$ and none of its 0-runs have length between $C^{\prime}\log(n)$ and $3C^{\prime}\log(n)$. Suppose that $X$ is formed from $Y$ by modifying at most $\varepsilon C^{\prime}\log(n)$ arbitrary bits in each run of $Y$. If $C^{\prime}\geqslant 1000/p$, then $X$ can be $\varepsilon n$-approximately reconstructed with $O(\log(n)/\varepsilon^{2})$ traces.

For the final class, we consider a slightly different relaxation of having long runs. We impose a local density or sparsity constraint on contiguous intervals. If this holds, then a single trace suffices.

Theorem 3.

There exists some constant $C$ such that for $C^{\prime}\geqslant C$, if $X$ can be divided into contiguous intervals $I_{1},\ldots,I_{m}$, with all $I_{i}$ having length at least $C^{\prime}\log(n)/\varepsilon^{2}$ and density at least $1-\frac{\varepsilon}{12}$ of 0s or 1s, then $X$ can be $\varepsilon n$-approximately reconstructed with a single trace in polynomial time.

The algorithm for Theorem 3 extends to handle independent insertions at a rate of $O(\varepsilon)$, since the proof relies on finding high-density regions, which are unchanged by such insertions.

We provide some justification for the strings considered in the above theorems. Strings that either contain long runs or that are locally dense are a natural class to examine in order to understand the advantage gained by approximate reconstruction over exact reconstruction. Strings with sufficiently long runs require $\Omega(n)$ traces to reconstruct exactly, as exact reconstruction for this set involves distinguishing between our prior example strings $1^{n/2}01^{n/2-1}$ and $1^{n/2-1}01^{n/2}$, but they can be approximately reconstructed with substantially fewer traces for large enough values of $\varepsilon$. We then relax the condition that strings have long runs to the condition that strings are locally dense. Both strings with long runs and strings that are locally dense also look very different from average-case strings (i.e., uniformly random), which have runs of length at most $2\log n$ with high probability and can be exactly reconstructed with $O(\exp(\log^{1/3}(n)))$ traces [HPP18].

Lower bounds for approximate reconstruction

We prove lower bounds on the number of traces required for $\varepsilon n$-approximate reconstruction. We present two results, for edit distance and Hamming distance, respectively. The more challenging result is Theorem 4, which shows that any algorithm that reconstructs an arbitrary length-$n$ string to within $\varepsilon n$ edit distance requires $f(C/\varepsilon)$ traces, where $f(n)$ denotes the minimum number of traces for distinguishing a pair of $n$-bit strings. Currently, $f(n)=\widetilde{\Omega}(n^{1.5})$ is the best known lower bound for exact reconstruction, which argues via a pair of strings that are hard to distinguish [Cha20a].

Theorem 4.

Suppose that $f(n)$ traces are required to distinguish between two length-$n$ strings $X^{\prime}$ and $Y^{\prime}$ with probability at least $1/2+\alpha$, where $\alpha=1/8$. Then there exist absolute constants $C,\varepsilon^{\star}>0$ such that for $\varepsilon^{\star}\geqslant\varepsilon\geqslant\log(n)/n$, any algorithm that $\varepsilon n$-approximately reconstructs arbitrary length-$n$ strings with probability $1-1/n$ must use at least $f(C/\varepsilon)$ traces.

Plugging in the bound on $f(1/\varepsilon)$, our theorem shows that $(1/\varepsilon)^{3/2}/\mathrm{polylog}(1/\varepsilon)$ traces are required for $\varepsilon n$-approximate reconstruction. For example, if $\varepsilon=n^{-2/3-\delta}$, then we obtain the following.

Corollary 5.

For any constant $\delta\in(0,1/3)$, we have that $n^{1+3\delta/2}/\mathrm{polylog}(n)$ traces are necessary to $n^{1/3-\delta}$-approximately reconstruct an arbitrary $n$-bit string with probability $1-1/n$.

Theorem 4 also allows $\varepsilon$ to be as small as $\log(n)/n$, implying that a very close approximation is not possible with substantially fewer traces than exact reconstruction.

Our lower bound for Hamming distance in Theorem 6 is simpler. It shows that $\Omega(n)$ traces are necessary to achieve an approximation closer than $n/4$ in Hamming distance to the actual string. In particular, we get a linear lower bound for a linear Hamming distance approximation, which is much stronger than our result for edit distance.

Theorem 6.

Any algorithm that can output an approximation within Hamming distance $n/4-1$ of an arbitrary length-$n$ string with probability at least $3/4$ must use $\Omega(n)$ traces.

1.3 Technical overview

The high-level strategy for all of our algorithms is the following. First, we identify the remnants of structured substrings, namely long runs and dense substrings of the original string, in the traces. Then, when given more than one trace, we use these substrings to align the traces. After aligning the traces, we capitalize on the approximate nature of our objective by estimating run lengths, outputting runs that are close in edit distance to substrings of the original string.

The gap condition for 0-runs in Theorem 1 states that the unknown string only contains 0-runs with length either less than $a_{1}:=C^{\prime}\log(n)$ or greater than $a_{2}:=3C^{\prime}\log(n)$, for large enough $C^{\prime}$ (and nothing in the middle). We show that there exist values $a_{1}^{\prime},a_{2}^{\prime}$, with $pa_{1}<a_{1}^{\prime}<a_{2}^{\prime}<pa_{2}$, such that with high probability there does not exist a 0-run of length at least $a_{2}$ in the original string that has been transformed into a 0-run of length less than $a_{2}^{\prime}$ in a trace, nor a 0-run of length less than $a_{1}$ transformed into a 0-run of length more than $a_{1}^{\prime}$. This implies that we can always distinguish between short and long runs of 0s in all of the traces (which would be challenging without the gap condition). We can align the long runs of 0s from the traces and take a scaled average of the lengths of the $i$th long run of 0s across all $T$ traces. By using a scaled average across traces, we can estimate the number of bits between consecutive long runs of 0s. Then, our algorithm outputs a run of 1s here, which accounts for long runs of 1s and short runs of 0s. Note that this piece of the algorithm is inherently approximate, since we replace short runs of 0s with 1s. This completes what we call our algorithm for identifying long runs.

The algorithm for Theorem 2 is similar to that for Theorem 1. We identify long 0-runs from $Y$ in each of the traces and align by these 0-runs, then approximate the rest using 1s. However, the alignment step is more difficult, since the long 0-runs from $Y$ may not be 0-runs in $X$ and are not easily found in traces. Instead, we identify long 0-dense substrings in each trace that with high probability originate from long 0-runs in $Y$. We refer to this as the algorithm for identifying dense substrings. Then we align and average as in Theorem 1 to approximate the unknown string.

Our algorithm in Theorem 3 takes a uniform partition of a single trace, where each part has length $C\log(n)/\varepsilon$, and it outputs a series of runs, where each run has length $C\log(n)/(\varepsilon p)$ and bit value equal to the majority bit of the corresponding part. Note that the parts have length at most an $O(\varepsilon)$ fraction of the high-density intervals. Therefore, in any high-density interval of the original string, most of the parts of the trace originating from that interval will also have a high density of the same bit. We refer to the method for this result as the algorithm for majority voting in substrings.
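As a rough illustration of this majority-voting step, consider the following Python sketch; the window length w stands in for $C\log(n)/\varepsilon$, and rounding the rescaled run lengths to integers is an implementation choice of ours rather than part of the analysis.

```python
def majority_vote_reconstruct(trace: str, w: int, p: float) -> str:
    """Replace each length-w window of a single trace by a run of the
    window's majority bit, scaled by 1/p to compensate for deletions."""
    out = []
    for i in range(0, len(trace), w):
        window = trace[i:i + w]
        bit = "1" if 2 * window.count("1") >= len(window) else "0"
        out.append(bit * round(len(window) / p))
    return "".join(out)
```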

The algorithms and analyses for these three theorems are in Section 3.

Lower Bounds.

For the edit distance approximation in Theorem 4, we start with two strings of length $C/\varepsilon$ that require $f(C/\varepsilon)$ traces to distinguish, for some constant $C\in(0,1)$ and $\varepsilon<C$. We then construct a hard distribution over length-$n$ strings by concatenating $\varepsilon n/C$ substrings, where each substring is an independent random choice between the two strings. Our strategy is to show that if the algorithm outputs an approximation within $\varepsilon n$ edit distance, then it must correctly determine a large number of the component strings. However, proving this requires some work because the guarantee of the reconstruction algorithm is in terms of an edit distance approximation. To handle this challenge, we provide a technical lemma that relates the edit distance of any pair of strings to a sum of binary indicator vectors for the equivalence of certain substrings (Lemma 13). Then, we use this lemma to argue that the algorithm’s output must be far from the true string if the number of traces is less than $f(C/\varepsilon)$, because many substrings must disagree.

For the Hamming distance lower bound in Theorem 6, we use a more straightforward argument. We start with a known lower bound from Batu, Kannan, Khanna, and McGregor [BKKM04]. They observe that $\Omega(k)$ traces are necessary to determine if a string starts with $k$ or $k+1$ zeros. We then construct a hard pair of strings of length roughly $4k$ such that if the algorithm misjudges the prefix length, then it must incur a cost of at least $2k$ in Hamming distance. Since $k=\Omega(n)$, we obtain the desired lower bound.

The proofs for both lower bounds appear in Section 4.

1.4 Preliminaries

Let $d_{\mathsf{E}}(X,X^{\prime})$ denote the edit distance between $X$ and $X^{\prime}$, defined as the minimum number of insertions, deletions, and substitutions that are required to transform $X$ into $X^{\prime}$. Note that edit distance is a metric. For each class of strings that we consider, we present an algorithm and argue that it can $\varepsilon n$-approximately reconstruct any string from the class. Our algorithms output a string $\widehat{X}$, an approximation of $X$, satisfying $d_{\mathsf{E}}(X,\widehat{X})\leqslant\varepsilon n$ with high probability.
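For concreteness, edit distance can be computed with the standard dynamic program below; this is the textbook Levenshtein recurrence, included only as a reference implementation and not as part of our reconstruction algorithms.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    transforming x into y, via the standard two-row DP."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))  # distances from the empty prefix of x
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # delete x[i-1]
                         cur[j - 1] + 1,                        # insert y[j-1]
                         prev[j - 1] + (x[i - 1] != y[j - 1]))  # substitute
        prev = cur
    return prev[n]
```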

We denote a single run by $\mathbf{r}$ and a set of runs by $\mathbf{r}_{1},\ldots,\mathbf{r}_{k}$. Our convention is to let $X$ denote the unknown string that we wish to reconstruct, and $Y$ will often be a modified version. A single trace will be denoted by $\widetilde{X}$ and a set of traces by $\widetilde{X}_{1},\ldots,\widetilde{X}_{T}$. Tildes will also be used to mark runs and intervals of traces. Some strings $X$ we partition into $\ell$ substrings $X^{1},\ldots,X^{\ell}$; their concatenation to form $X$ is denoted as $X=X^{1}X^{2}\cdots X^{\ell}$.

Some of our algorithms reconstruct $X$ by partitioning it into substrings $X^{1},\ldots,X^{\ell}$ and reconstructing these substrings approximately. Specifically, we will find strings $\widehat{X}^{i}$ such that the edit distance between $\widehat{X}^{i}$ and $X^{i}$ is at most $\varepsilon|X^{i}|$, and then we will invoke the following lemma to see that $X=X^{1}\cdots X^{\ell}$ and $\widehat{X}=\widehat{X}^{1}\cdots\widehat{X}^{\ell}$ have edit distance at most $\varepsilon n$.

Lemma 7.

Let $X=X^{1}X^{2}\cdots X^{\ell}$ and $\widehat{X}=\widehat{X}^{1}\cdots\widehat{X}^{\ell}$ be strings on $n$ bits. If the edit distance between $X^{i}$ and $\widehat{X}^{i}$ is at most $\varepsilon|X^{i}|$ for all $i\in[\ell]$, then $d_{\mathsf{E}}(X,\widehat{X})\leqslant\varepsilon n$.

Proof.

We will use the fact that edit distance satisfies the triangle inequality. Consider bit strings $X=X^{1}X^{2}$ and $\widehat{X}=\widehat{X}^{1}\widehat{X}^{2}$. Then,

$$d_{\mathsf{E}}(X^{1}X^{2},\widehat{X}^{1}\widehat{X}^{2})\leqslant d_{\mathsf{E}}(X^{1}X^{2},\widehat{X}^{1}X^{2})+d_{\mathsf{E}}(\widehat{X}^{1}X^{2},\widehat{X}^{1}\widehat{X}^{2})=d_{\mathsf{E}}(X^{1},\widehat{X}^{1})+d_{\mathsf{E}}(X^{2},\widehat{X}^{2}).$$

This extends to $X=X^{1}\cdots X^{\ell}$ and $\widehat{X}=\widehat{X}^{1}\cdots\widehat{X}^{\ell}$ by recursively applying the above inequality. ∎

2 Warm-up: Approximating strings that only have long runs

We begin with two simple cases that demonstrate some of our algorithmic techniques. For this section, we defer proofs to Appendix A. We note that other methods may lead to similar or slightly better results in some regimes, but we follow this presentation as a prelude to Section 3.

The first algorithm $\varepsilon n$-approximately reconstructs a string with long runs using $\Omega(\log(n)/\varepsilon^{2})$ traces by scaling an average of the run lengths across all traces.

Proposition 8.

Let $X$ be a string on $n$ bits such that all of its runs have length at least $\log(n^{5})$. Then $X$ can be $\varepsilon n$-approximately reconstructed with $O(\log(n)/\varepsilon^{2})$ traces.

Algorithm

Set-up: String $X$ on $n$ bits such that all of its runs have length at least $\log(n^{5})$.

1. Sample $T=\frac{2}{p\varepsilon^{2}}\log(n)$ traces, $\widetilde{X}_{1},\ldots,\widetilde{X}_{T}$, from the deletion channel with deletion probability $q$. Fail if the traces do not all have the same number of runs. Otherwise, let $k$ denote the number of runs in every trace.

2. Compute $\widetilde{\mu}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{r}}^{j}_{i}|$ for all $i\in[k]$, where $\widetilde{\mathbf{r}}^{j}_{1},\widetilde{\mathbf{r}}^{j}_{2},\ldots,\widetilde{\mathbf{r}}^{j}_{k}$ are the $k$ runs of $\widetilde{X}_{j}$.

3. Output $\widehat{X}=\widehat{X}_{1}\cdots\widehat{X}_{k}$, where $\widehat{X}_{i}$ has length $\widetilde{\mu}_{i}/p$ and bit value matching run $i$ of the traces.

The analysis is a basic use of Chernoff bounds; see Appendix A for details.
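The following Python sketch implements the steps above; representing traces as '0'/'1' strings and rounding $\widetilde{\mu}_{i}/p$ to the nearest integer are our own implementation choices, and the sketch returns None where the algorithm fails.

```python
import itertools

def runs(s: str):
    """Return the runs of s as (bit, length) pairs."""
    return [(b, len(list(g))) for b, g in itertools.groupby(s)]

def reconstruct_long_runs(traces, p: float):
    """Average the i-th run length across traces, rescale by 1/p,
    and output the concatenation of the rescaled runs."""
    run_lists = [runs(t) for t in traces]
    k = len(run_lists[0])
    if any(len(rl) != k for rl in run_lists):
        return None  # traces disagree on the number of runs: fail
    T = len(traces)
    out = []
    for i in range(k):
        bit = run_lists[0][i][0]
        mu = sum(rl[i][1] for rl in run_lists) / T
        out.append(bit * round(mu / p))
    return "".join(out)
```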

Ideally, we would only require that 1-runs have length $\Omega(\log(n))$, without restricting the length of 0-runs. The following result shows that if we require the 1-runs to have length $\Omega(\frac{1}{\varepsilon^{2}}\log(n))$, which is a factor of $1/\varepsilon$ larger than in Theorem 1, then approximate reconstruction is possible using one trace.

Proposition 9.

Let $X$ be a string on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon^{2}$. Then there exists a constant $C$ such that for $C^{\prime}\geqslant C$, $X$ can be $\varepsilon n$-approximately reconstructed with a single trace.

Algorithm

Set-up: String $X$ on $n$ bits such that all of its 1-runs have length at least $\frac{6}{p}\log(n)/\varepsilon^{2}$.

1. Sample one trace $\widetilde{X}$ from the deletion channel with deletion probability $q$.

2. Let $L:=\frac{\log(n)}{10\varepsilon}$; let $\widetilde{\mathbf{r}}_{1},\ldots,\widetilde{\mathbf{r}}_{k}$ be the 0-runs in $\widetilde{X}$ with length at least $L$; and let $\widetilde{\mathbf{s}}_{i}$, for $i\in\{0,1,\ldots,k+1\}$, be the bits in $\widetilde{X}$ before $\widetilde{\mathbf{r}}_{1}$, between $\widetilde{\mathbf{r}}_{i}$ and $\widetilde{\mathbf{r}}_{i+1}$, and after $\widetilde{\mathbf{r}}_{k}$, respectively.

3. Output $\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{k}\widehat{0}_{k}\widehat{1}_{k+1}$, where $\widehat{1}_{i}$ is a 1-run of length $\frac{|\widetilde{\mathbf{s}}_{i}|}{p}$, and $\widehat{0}_{i}$ is a 0-run of length $\frac{|\widetilde{\mathbf{r}}_{i}|}{p}$.

The algorithm for Proposition 9 no longer attempts to align multiple traces. Step three is approximate by design, because we use 1-runs to fill in the gaps between the long 0-runs. The error comes from the variance of how many bits of each run are deleted by the deletion channel. See Appendix A for the proof.
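A minimal single-trace sketch of this procedure, under the same string conventions as before; the threshold L plays the role of $\log(n)/(10\varepsilon)$ from step 2, and rounding is again our choice.

```python
import itertools

def reconstruct_single_trace(trace: str, L: int, p: float) -> str:
    """Keep the 0-runs of length >= L, rescale them by 1/p, and replace
    each stretch between consecutive long 0-runs by a single 1-run."""
    out, seg = [], 0  # seg counts bits since the previous long 0-run
    for bit, grp in itertools.groupby(trace):
        length = len(list(grp))
        if bit == "0" and length >= L:
            out.append("1" * round(seg / p))    # the stretch s_i becomes a 1-run
            out.append("0" * round(length / p))
            seg = 0
        else:
            seg += length
    out.append("1" * round(seg / p))
    return "".join(out)
```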

3 Approximating more general classes of strings

Moving on from our warm-ups, we reconstruct larger classes of strings. Our first two algorithms in this section reconstruct strings that still contain some long runs, where these help us align traces. Our third algorithm reconstructs from a single trace by approximately preserving local density.

3.1 Identifying long runs

To weaken the assumptions of Proposition 8, we want to consider strings where 0-runs can be any length, but 1-runs must still be long, with length $\Omega(\log n)$. When relaxing the length restriction on the 0-runs, the alignment step (step 1) of the algorithm for Proposition 8 begins to fail: entire runs of 0s may be deleted, combining consecutive 1-runs and making it difficult to identify which runs align together between traces. To still use an alignment algorithm that averages run lengths, we impose the weaker condition on the 0-runs that they must be divided into short 0-runs and long 0-runs. As long as there is a gap of sufficiently large size such that there are no 0-runs with length in the gap, then in the traces we can identify which 0-runs are long and which are short.

Recall the statement of Theorem 1.

Algorithm for identifying long runs

Set-up: String $X$ on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon$, where $C^{\prime}\geqslant 100/p$, and all of its 0-runs have length either greater than $3C^{\prime}\log n$ or less than $C^{\prime}\log n$.

1. Sample $T=\frac{2}{p^{2}\varepsilon^{2}}\log(n)$ traces, $\widetilde{X}_{1},\ldots,\widetilde{X}_{T}$, from the deletion channel with deletion probability $q$.

2. Define $L:=2C^{\prime}p\log n$, and for all $j\in[T]$, index the 0-runs in $\widetilde{X}_{j}$ with length at least $L$ as $\widetilde{\mathbf{r}}^{j}_{1},\ldots,\widetilde{\mathbf{r}}^{j}_{k_{j}}$. For $i\in[k_{j}-1]$, let $\widetilde{\mathbf{s}}^{j}_{i}$ be the bits between $\widetilde{\mathbf{r}}^{j}_{i}$ and $\widetilde{\mathbf{r}}^{j}_{i+1}$ in $\widetilde{X}_{j}$, and let $\widetilde{\mathbf{s}}^{j}_{0}$ be the bits before $\widetilde{\mathbf{r}}^{j}_{1}$ and $\widetilde{\mathbf{s}}^{j}_{k_{j}+1}$ the bits after $\widetilde{\mathbf{r}}^{j}_{k_{j}}$, for all $j\in[T]$.

3. If there exist $j\neq j^{\prime}\in[T]$ such that $k_{j}\neq k_{j^{\prime}}$, then fail without output. Otherwise, let $k:=k_{1}=k_{2}=\cdots=k_{T}$.

4. Compute $\widetilde{\mu}^{\mathbf{r}}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{r}}^{j}_{i}|$ for all $i\in[k]$ and $\widetilde{\mu}^{\mathbf{s}}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{s}}^{j}_{i}|$ for all $i\in\{0\}\cup[k+1]$.

5. Output $\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{k}\widehat{0}_{k}\widehat{1}_{k+1}$, where $\widehat{1}_{i}$ is a 1-run of length $\frac{\widetilde{\mu}^{\mathbf{s}}_{i}}{p}$, and $\widehat{0}_{i}$ is a 0-run of length $\frac{\widetilde{\mu}^{\mathbf{r}}_{i}}{p}$.

Observe that the algorithm is inherently approximate, as we fill in the gaps between the long 0-runs with 1-runs, omitting any short 0-runs.
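A Python sketch of the full procedure, under the same conventions as the earlier sketches; here the threshold L plays the role of $2C^{\prime}p\log n$ from step 2, and the averaging mirrors steps 4 and 5.

```python
import itertools

def segment_trace(trace: str, L: int):
    """Split a trace into the lengths of its long 0-runs (length >= L)
    and the lengths of the stretches between them."""
    r_lens, s_lens, seg = [], [], 0
    for bit, grp in itertools.groupby(trace):
        length = len(list(grp))
        if bit == "0" and length >= L:
            s_lens.append(seg)
            r_lens.append(length)
            seg = 0
        else:
            seg += length
    s_lens.append(seg)
    return r_lens, s_lens

def identify_long_runs(traces, L: int, p: float):
    """Align traces by their long 0-runs, average the aligned lengths,
    and rescale by 1/p; returns None if the alignment fails."""
    segs = [segment_trace(t, L) for t in traces]
    k = len(segs[0][0])
    if any(len(r) != k for r, _ in segs):
        return None  # traces disagree on the number of long 0-runs
    T = len(traces)
    out = []
    for i in range(k + 1):
        mu_s = sum(s[i] for _, s in segs) / T
        out.append("1" * round(mu_s / p))      # fill the gap with 1s
        if i < k:
            mu_r = sum(r[i] for r, _ in segs) / T
            out.append("0" * round(mu_r / p))  # rescaled long 0-run
    return "".join(out)
```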

Analysis

Proof of Theorem 1.

Let $X$ be a string on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon$, where $C^{\prime}\geqslant 100/p$, and all of its 0-runs have length either greater than $3C^{\prime}\log n$ or less than $C^{\prime}\log n$. Take $T=\frac{2}{p^{2}\varepsilon^{2}}\log(n)$ traces of $X$. By a Chernoff bound, with probability at least $1-\frac{1}{n^{2}}$, no 1-run is fully deleted in any trace; in the following we assume that we are on this event.

We will justify that in the traces we can identify all 0-runs that had length at least $3C^{\prime}\log(n)$ in $X$. Let $\mathbf{r}$ be a 0-run from $X$ with length $|\mathbf{r}|\geqslant 3C^{\prime}\log(n)$. Using a Chernoff bound, the probability that in a single trace $\mathbf{r}$ is transformed into a run $\widetilde{\mathbf{r}}$ with $|\widetilde{\mathbf{r}}|\leqslant 2C^{\prime}p\log(n)$ is bounded by

$$\mathbf{P}\big(|\widetilde{\mathbf{r}}|\leqslant 2C^{\prime}p\log(n)\big)\ \leqslant\ \mathbf{P}\big(\big||\widetilde{\mathbf{r}}|-p|\mathbf{r}|\big|\geqslant C^{\prime}p\log(n)\big)\ \leqslant\ 2n^{-3}.$$

Similarly, for any 0-run $\mathbf{r}$ in $X$ such that $|\mathbf{r}|\leqslant C^{\prime}\log(n)$, the probability that $\mathbf{r}$ is transformed into a run $\widetilde{\mathbf{r}}$ with $|\widetilde{\mathbf{r}}|\geqslant 2C^{\prime}p\log(n)$ is bounded by

$$\mathbf{P}\big(|\widetilde{\mathbf{r}}|\geqslant 2C^{\prime}p\log(n)\big)\ \leqslant\ \mathbf{P}\big(\big||\widetilde{\mathbf{r}}|-p|\mathbf{r}|\big|\geqslant C^{\prime}p\log n\big)\ \leqslant\ 2n^{-3}.$$

It follows that, with probability at least $1-\frac{4T}{n^{2}}$, there does not exist any 0-run and any trace such that either of the “unlikely” inequalities above holds. On this event, we have that for any 0-run $\mathbf{r}$ of length at least $3C^{\prime}\log n$, and any trace $\widetilde{X}_{j}$, we can identify the image $\widetilde{\mathbf{r}}^{j}$ of $\mathbf{r}$ in trace $\widetilde{X}_{j}$. In particular, on this event, the number of 0-runs in each trace that have length at least $2C^{\prime}p\log(n)$ is equal to the number of 0-runs in $X$ of length at least $3C^{\prime}\log(n)$; thus $k_{1}=k_{2}=\cdots=k_{T}=:k$. The algorithm and proof now proceed very similarly to those of Proposition 9, except that, since we have more than a single trace, we estimate lengths of substrings by scaling an average of the corresponding substrings from the traces.

Let $L:=2C^{\prime}p\log n$ and find every 0-run in $\widetilde{X}_{j}$ with length at least $L$, indexing them as $\widetilde{\mathbf{r}}^{j}_{1},\ldots,\widetilde{\mathbf{r}}^{j}_{k}$. For $i\in[k-1]$, let $\widetilde{\mathbf{s}}^{j}_{i}$ be the bits between the last bit of $\widetilde{\mathbf{r}}^{j}_{i}$ and the first bit of $\widetilde{\mathbf{r}}^{j}_{i+1}$ in $\widetilde{X}_{j}$, and let $\widetilde{\mathbf{s}}^{j}_{0}$ be the bits before $\widetilde{\mathbf{r}}^{j}_{1}$ and $\widetilde{\mathbf{s}}^{j}_{k+1}$ the bits after $\widetilde{\mathbf{r}}^{j}_{k}$. Let $\mathbf{s}_{i}$ be the contiguous substring of $X$ from which $\widetilde{\mathbf{s}}^{1}_{i},\ldots,\widetilde{\mathbf{s}}^{T}_{i}$ came and $\mathbf{r}_{i}$ the contiguous substring of $X$ from which $\widetilde{\mathbf{r}}^{1}_{i},\ldots,\widetilde{\mathbf{r}}^{T}_{i}$ came.

For all $i$, we will approximate $\mathbf{r}_{i}$ with $\widehat{0}_{i}$, a 0-run of length $\widetilde{\mu}^{\mathbf{r}}_{i}/p$, for $\widetilde{\mu}^{\mathbf{r}}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{r}}^{j}_{i}|$, and we will approximate $\mathbf{s}_{i}$ with $\widehat{1}_{i}$, a 1-run of length $\widetilde{\mu}^{\mathbf{s}}_{i}/p$, for $\widetilde{\mu}^{\mathbf{s}}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{s}}^{j}_{i}|$. Applying a Chernoff bound and then a union bound, $\mathbf{P}(\exists i:|\widetilde{\mu}^{\mathbf{r}}_{i}/p-|\mathbf{r}_{i}||\geqslant\varepsilon|\mathbf{r}_{i}|)\leqslant 2n^{-3}$ and $\mathbf{P}(\exists i:|\widetilde{\mu}^{\mathbf{s}}_{i}/p-|\mathbf{s}_{i}||\geqslant\varepsilon|\mathbf{s}_{i}|)\leqslant 2n^{-3}$.

Since $\mathbf{s}_{i}$ contains alternating 1-runs with length at least $C^{\prime}\log(n)/\varepsilon$ and 0-runs with length at most $C^{\prime}\log(n)$, $\mathbf{s}_{i}$ has at least a $1-\varepsilon$ density of 1s. Therefore $d_{\mathsf{E}}(\mathbf{s}_{i},\widehat{1}_{i})\leqslant 2\varepsilon|\mathbf{s}_{i}|$ and $d_{\mathsf{E}}(\mathbf{r}_{i},\widehat{0}_{i})\leqslant\varepsilon|\mathbf{r}_{i}|$. Let $\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{k}\widehat{0}_{k}\widehat{1}_{k+1}$; from Lemma 7 we see that

$$d_{\mathsf{E}}(X,\widehat{X})\leqslant\sum_{i=1}^{k}\big(d_{\mathsf{E}}(\widehat{0}_{i},\mathbf{r}_{i})+d_{\mathsf{E}}(\widehat{1}_{i},\mathbf{s}_{i})\big)+d_{\mathsf{E}}(\widehat{1}_{0},\mathbf{s}_{0})+d_{\mathsf{E}}(\widehat{1}_{k+1},\mathbf{s}_{k+1})\leqslant\sum_{i=1}^{k}\big(\varepsilon|\mathbf{r}_{i}|+2\varepsilon|\mathbf{s}_{i}|\big)+2\varepsilon|\mathbf{s}_{0}|+2\varepsilon|\mathbf{s}_{k+1}|\leqslant 2\varepsilon n.$$

If we apply this algorithm and analysis with $\varepsilon/2$ instead of $\varepsilon$, the result follows. The constants were taken large enough to account for this factor of 2. ∎

Note that the above theorem holds even when the constant $C^{\prime}$ is unknown. Given $T=O(\log n/\varepsilon^{2})$ traces of $X$, we can determine whether or not $X$ has such a gap, and the corresponding value of $C^{\prime}$, with high probability. We can then execute the algorithm as stated.

3.2 Identifying dense substrings

Here we extend the class of strings we can approximately reconstruct, proving a robust version of Theorem 1. Specifically, we consider strings with similar properties to those in Theorem 1, allowing for additional bit flips.

Recall the statement of Theorem 2.

The general goal of the algorithm is similar to that of Theorem 1, which is to identify long 0-runs from $Y$ in each trace of $X$ and to align by these 0-runs; then, we approximate the rest of $X$ with 1-runs. Because $X$ and $Y$ have small edit distance, a good approximation for $Y$ is also good for $X$. Unfortunately, the long 0-runs from $Y$ are no longer necessarily 0-runs in $X$, and therefore they are more difficult to find in the traces. Instead, we find long 0-dense substrings in $X$.

Let $X$ and $Y$ be as in the theorem statement. We also fix $m:=C^{\prime}\varepsilon\log(n)$ throughout this subsection. Fix a trace $\widetilde{X}$ of $X$, as well as an index $\ell$. Let $\widetilde{n}$ denote the length of the trace. Define the indices $i_{\ell}$ and $j_{\ell}$ to be those that are $(m+1)$ 1s to the left and right of $\ell$ in $\widetilde{X}$, respectively, if such indices exist. We count the 0s in $\widetilde{X}$ between indices $i_{\ell}$ and $j_{\ell}$ with the quantity

$$S_{\text{int}}(\widetilde{X},\ell):=\sum_{k=i_{\ell}}^{j_{\ell}}\mathds{1}_{\widetilde{X}[k]=0}.$$

Note that $S_{\text{int}}(\widetilde{X},\ell)$ is not defined if $i_{\ell}$ or $j_{\ell}$ is not defined. We use a slightly different quantity on the boundary of the trace to handle this. Letting the definitions of $i_{\ell}$ and $j_{\ell}$ remain the same, if $i_{\ell}$ or $j_{\ell}$ is not defined, then we consider $S_{\text{L-bound}}(\widetilde{X},\ell):=\sum_{k=0}^{j_{\ell}}\mathds{1}_{\widetilde{X}[k]=0}$ or $S_{\text{R-bound}}(\widetilde{X},\ell):=\sum_{k=i_{\ell}}^{\widetilde{n}}\mathds{1}_{\widetilde{X}[k]=0}$, respectively. Combining the interior and boundary quantities, let $S(\widetilde{X}_{j},\ell)=S_{\text{int}}(\widetilde{X}_{j},\ell)$ if there are $(m+1)$ 1s to the left and right of $\ell$; let $S(\widetilde{X}_{j},\ell)=S_{\text{L-bound}}(\widetilde{X}_{j},\ell)$ if there are $(m+1)$ 1s to the right of $\ell$ but not the left; and let $S(\widetilde{X}_{j},\ell)=S_{\text{R-bound}}(\widetilde{X}_{j},\ell)$ if there are $(m+1)$ 1s to the left of $\ell$ but not the right.
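The following is a direct Python transcription of $S(\widetilde{X},\ell)$, assuming 0-indexed strings; it returns None in the remaining case where neither side of $\ell$ has $(m+1)$ 1s, when $S$ is undefined.

```python
def S(trace: str, ell: int, m: int):
    """Count the 0s of trace between the index (m+1) ones to the left of ell
    and the index (m+1) ones to the right, falling back to the boundary
    variants S_L-bound / S_R-bound when one of the indices does not exist."""
    ones, i = 0, ell
    while i > 0 and ones < m + 1:   # walk left until (m+1) ones are seen
        i -= 1
        ones += trace[i] == "1"
    left_ok = ones == m + 1         # i equals i_ell exactly when this holds
    ones, j = 0, ell
    while j < len(trace) - 1 and ones < m + 1:  # walk right symmetrically
        j += 1
        ones += trace[j] == "1"
    right_ok = ones == m + 1        # j equals j_ell exactly when this holds
    if not left_ok and not right_ok:
        return None                 # S is undefined in this case
    if not left_ok:
        i = 0                       # boundary variant S_L-bound
    if not right_ok:
        j = len(trace) - 1          # boundary variant S_R-bound
    return trace[i:j + 1].count("0")
```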

In each trace we identify a set of substrings of $X$ that are 0-dense, and then decide whether each such substring is long or short using $S(\widetilde{X}_{j},\ell)$; that is, whether the corresponding unknown 0-runs in $Y$ are long (length at least the upper bound of the gap) or short (length at most the lower bound of the gap). If the traces all agree on the number of long 0-dense substrings, we align the traces by these substrings and reconstruct in a manner similar to that of Theorem 1.

Algorithm for identifying dense substrings

Set-up: String $X$ on $n$ bits formed by flipping at most $\varepsilon C^{\prime}\log(n)$ bits in each run of $Y$, where $Y$ is a string on $n$ bits such that all of its 1-runs have length at least $C^{\prime}\log(n)/\varepsilon$, for $C^{\prime}\geqslant 1000/p$, and all of its 0-runs have length either greater than $3C^{\prime}\log n$ or less than $C^{\prime}\log n$.

1. Sample $T=\frac{2}{p^{2}\varepsilon^{2}}\log n$ traces, $\widetilde{X}_{1},\ldots,\widetilde{X}_{T}$, from the deletion channel with deletion probability $q$.

2. Set $m:=\varepsilon C^{\prime}\log n$ and $a:=pC^{\prime}\log n$. For each trace $\widetilde{X}_{j}$, let $i$ be the smallest index of $\widetilde{X}_{j}$ such that $\widetilde{X}_{j}[i]=0$ and $|\{k:\widetilde{X}_{j}[k]=0,|i-k|\leqslant a+m\}|\geqslant a$. Let $\ell^{j}_{1}$ be the smallest index such that $\widetilde{X}_{j}[\ell^{j}_{1}]=0$ and $|\{k:\widetilde{X}_{j}[k]=0,i-(a+m)\leqslant k<\ell^{j}_{1}\}|=m$. Compute $S(\widetilde{X}_{j},\ell^{j}_{1})$. Starting $m+1$ bits to the right of the last bit counted in $S(\widetilde{X}_{j},\ell^{j}_{1})$, continue scanning to the right and repeat this process, finding indices $\ell^{j}_{t}$ and computing $S(\widetilde{X}_{j},\ell^{j}_{t})$, for $t\geqslant 2$.

3. Set $\bar{G}=2C^{\prime}p\log n$. For every trace $\widetilde{X}_{j}$, let $I_{j}=\{t:S(\widetilde{X}_{j},\ell^{j}_{t})>\bar{G}\}$. If $|I_{j}|$ is not the same across all $T$ traces, the algorithm fails. Otherwise, define $I=|I_{j}|$, and for all $t\in[I]$, let $\widehat{0}_{t}$ be a 0-run of length $\widetilde{\mu}_{t}/p$, for $\widetilde{\mu}_{t}=\frac{1}{T}\sum_{j=1}^{T}S(\widetilde{X}_{j},\ell^{j}_{t})$.

4. Define $\widehat{i}_{t}=\frac{1}{T}\sum_{j^{\prime}=1}^{T}i_{\ell^{j^{\prime}}_{t}}$ and $\widehat{j}_{t}=\frac{1}{T}\sum_{j^{\prime}=1}^{T}j_{\ell^{j^{\prime}}_{t}}$, for $i_{\ell^{j^{\prime}}_{t}}$ and $j_{\ell^{j^{\prime}}_{t}}$ as in the definition of $S(\widetilde{X}_{j^{\prime}},\ell^{j^{\prime}}_{t})$. Let $\widehat{1}_{0},\ldots,\widehat{1}_{I}$ be 1-runs, where $\widehat{1}_{t}$ has length $|\widehat{i}_{t+1}-\widehat{j}_{t}|/p$ for $t\in[I-1]$, $\widehat{1}_{0}$ has length $\widehat{i}_{1}/p$, and $\widehat{1}_{I}$ has length $|pn-\widehat{j}_{I}|/p$.

5. Output $\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{I-1}\widehat{0}_{I-1}\widehat{1}_{I}$.

Analysis

Let $\varepsilon,p$ be fixed such that $p>3\varepsilon$, and let $C^{\prime}$ be fixed such that $C^{\prime}\geqslant\frac{1000}{p}$. Suppose $Y$ is a string on $n$ bits such that every 1-run in $Y$ has length at least $C^{\prime}\log(n)/\varepsilon$ and all of its 0-runs have length either greater than $3C^{\prime}\log n$ or less than $C^{\prime}\log n$. Let $X$ be a string on $n$ bits that is formed by flipping at most $m=C^{\prime}\varepsilon\log(n)$ bits within each run of $Y$. Let $\widetilde{X}$ be a trace of $X$. A 0-run $\mathbf{r}$ in $Y$ may have some bits flipped from 0 to 1 in $X$, becoming the substring $\mathbf{r}_{X}$, so let $|\mathbf{r}_{X}^{0}|$ denote the number of 0s in $\mathbf{r}_{X}$.

Next, we prove several properties of $S(\widetilde{X},\ell)$ when the bit at index $\ell$ in trace $\widetilde{X}$ came from a 0-run in $Y$ and $X$.

Lemma 10.

Let $\widetilde{X}$ be a random trace of $X$, and let $\ell$ be an index of $\widetilde{X}$ such that $\widetilde{X}[\ell]=0$. If the bit at $\widetilde{X}[\ell]$ is from a 0-run $\mathbf{r}$ in $Y$, then the following hold for the quantity $S(\widetilde{X},\ell)$:

1. (Property 1) With probability at least $1-n^{-6}$, the bits at indices $i_{\ell}$ and $j_{\ell}$ come from a 1-run adjacent to $\mathbf{r}$.

2. (Property 2) If the bits at indices $i_{\ell}$ and $j_{\ell}$ come from a 1-run adjacent to $\mathbf{r}$, then $S(\widetilde{X},\ell)$ is upper bounded by a random variable from the distribution $\mathrm{Bin}(|\mathbf{r}_{X}^{0}|,p)+\mathrm{Bin}(2m,p)$.

3. (Property 3) If $|\mathbf{r}|\geqslant C^{\prime}\log n$ and the bits at indices $i_{\ell}$ and $j_{\ell}$ come from a 1-run adjacent to $\mathbf{r}$, then with probability at least $1-n^{-6}$, $|S(\widetilde{X},\ell)-p|\mathbf{r}||\leqslant\frac{p|\mathbf{r}|}{4}+3m$.

Proof of Property 1.

It suffices to prove the claim for $i_{\ell}$. Index $i_{\ell}$ is $m+1$ 1s to the left of $\ell$, and is therefore not from $\mathbf{r}$, since at most $m$ 0s of $\mathbf{r}$ were flipped to 1s. Further, by a Chernoff bound, with probability at least $1-n^{-6}$ the 1-run left-adjacent to $\mathbf{r}$ in $Y$ has at least $2m+1$ bits surviving in $\widetilde{X}$. At most $m$ bits of the left-adjacent 1-run to $\mathbf{r}$ in $Y$ are flipped to 0, so at least $m+1$ 1s from this 1-run survive in $\widetilde{X}$. It follows that $i_{\ell}$ came from the left-adjacent 1-run to $\mathbf{r}$ in $Y$. ∎

Proof of Property 2.

Recall that $|\mathbf{r}_{X}^{0}|$ is the number of 0s in $\mathbf{r}$ that were not flipped to 1 in $X$. This component of $S(\widetilde{X},\ell)$ is from the distribution $\mathrm{Bin}(|\mathbf{r}_{X}^{0}|,p)$. Let the contribution to $S(\widetilde{X},\ell)$ from any 0s not from $\mathbf{r}$ be the random variable $Z_{\mathbf{r}}(\ell)$. Each bit that was flipped to 0 in either 1-run adjacent to $\mathbf{r}$ in $Y$ can contribute 1 with probability at most $p$ to $Z_{\mathbf{r}}(\ell)$. From the assumption on $i_{\ell}$ and $j_{\ell}$, any other 0 from $X$ will be outside of the range $[i_{\ell},j_{\ell}]$. Therefore we can upper bound the contribution of $Z_{\mathbf{r}}(\ell)$ by a random variable sampled from $\mathrm{Bin}(2m,p)$. ∎

Proof of Property 3.

By Property 2, $S(\widetilde{X},\ell)$ is upper bounded by a random variable from the distribution $\mathrm{Bin}(|\mathbf{r}_{X}^{0}|,p)+Z_{\mathbf{r}}(\ell)$. By a Chernoff bound, with probability $1-n^{-6}$ the first binomial term varies from its mean by at most $p|\mathbf{r}|/4$. The second term is upper bounded by $2m$, and $||\mathbf{r}_{X}^{0}|-|\mathbf{r}||\leqslant m$. ∎

Now we are ready to prove Theorem 2.

Proof of Theorem 2.

Define $a:=pC^{\prime}\log(n)$. Take $T=\frac{2}{p^{2}\varepsilon^{2}}\log n$ traces of $X$, $\widetilde{X}_{1},\ldots,\widetilde{X}_{T}$, and fix a trace $\widetilde{X}_{j}$. Our first goal is to find long 0-dense substrings in $X$; we can also think of these long 0-dense substrings as corresponding to long 0-runs in $Y$. Let $i$ be the smallest index of $\widetilde{X}_{j}$ such that $\widetilde{X}_{j}[i]=0$ and there are at least $a$ 0s in $\widetilde{X}_{j}$ within $a+m$ indices of $i$, i.e.,

$$|\{k:\widetilde{X}_{j}[k]=0,|i-k|\leqslant a+m\}|\geqslant a.$$

Next, find the index $\ell^{j}_{1}$ such that $\widetilde{X}_{j}[\ell^{j}_{1}]=0$ and there are exactly $m$ 0s in $\widetilde{X}_{j}$ within the interval of indices $[i-(a+m),\ell^{j}_{1}]$, i.e., $|\{k:\widetilde{X}_{j}[k]=0,\ i-(a+m)\leqslant k<\ell^{j}_{1}\}|=m$. The goal of this procedure is to find an index $\ell^{j}_{1}$ such that the bit at $\widetilde{X}_{j}[\ell^{j}_{1}]$ is from a 0-run in $Y$ with high probability.

With probability at least $1-n^{-6}$, every 1-run in $Y$ is reduced to a substring with at least $2(a+m)$ 1s in $\widetilde{X}_{j}$. This implies that the length-$2(a+m)$ interval $\widetilde{X}_{j}[i-(a+m),i+a+m]$ contains bits from at most two 1-runs in $Y$ and at most one 0-run, with probability $1-n^{-6}$. By construction, this interval contains at least $a>3m$ 0s (the inequality coming from the fact that $p>3\varepsilon$). Since each 1-run had at most $m$ bits flipped to 0, there must be at least $a-2m>m$ 0s in the interval $\widetilde{X}_{j}[i-(a+m),i+a+m]$ that came from some 0-run $\mathbf{r}$ in $Y$. In this construction, the 0s from $\mathbf{r}$ that survived in $\widetilde{X}_{j}$ are nested between at most $m$ 0s that were flipped from the left-adjacent 1-run to $\mathbf{r}$ in $Y$ and at most $m$ 0s that were flipped from the right-adjacent 1-run to $\mathbf{r}$ in $Y$. This implies that the $(m+1)$th 0 in this interval must be from the 0-run $\mathbf{r}$.

Compute $S(\widetilde{X}_{j},\ell^{j}_{1})$. Note that with high probability, if a trace does not have $(m+1)$ 1s to the right of $\ell^{j}_{1}$, the original string can be well approximated by outputting the all-0s string of length $\frac{1}{T}\sum_{j=1}^{T}|\widetilde{X}_{j}|/p$. Starting $m+1$ bits to the right of the last bit counted in $S(\widetilde{X}_{j},\ell^{j}_{1})$, continue scanning to the right and repeat this process, finding indices $\ell^{j}_{t}$ and computing $S(\widetilde{X}_{j},\ell^{j}_{t})$, for $t\geqslant 2$. We jump ahead $m+1$ bits to the right between iterations because this forces the next bit $i$ that satisfies the condition $|\{k:\widetilde{X}_{j}[k]=0,|i-k|\leqslant a+m\}|\geqslant a$ to not overlap with the previous 0-run with high probability, by Property 1.

We justify that this process succeeds, meaning that it catches all long 0-runs from $Y$, in all $T$ traces, with high probability. For every 0-run $\mathbf{r}$ in $Y$ such that $|\mathbf{r}|\geqslant 3C^{\prime}\log(n)$, with probability at least $1-n^{-6}$, at least $a+m$ bits from each such 0-run survive in every trace. Further, there are at most $m$ 1s among these bits. Therefore, with probability at least $1-n^{-6}$, we have at least $a$ 0s with at most $m$ 1s inserted among them, and this triggers the calculation of $\ell^{j}_{t}$ for some $t$.

By the theorem assumptions, no 0-run $\mathbf{r}$ in $Y$ has $|\mathbf{r}|$ in the gap $[C^{\prime}\log n,3C^{\prime}\log n]$. Let $\bar{G}$ be the middle of the gap scaled by $p$, so $\bar{G}=2C^{\prime}p\log n$. By Property 3 and a union bound, with probability at least $1-n^{-4}$, all 0-runs $\mathbf{r}$ in $Y$ with $|\mathbf{r}|\geqslant 3C^{\prime}\log n$ will trigger the calculation of an $\ell^{j}_{t}$ with $S(\widetilde{X}_{j},\ell^{j}_{t})>\bar{G}$ in all traces, and all 0-runs $\mathbf{r}$ in $Y$ with $|\mathbf{r}|<C^{\prime}\log n$ will either not trigger an $\ell^{j}_{t}$ calculation or, if they do, will have $S(\widetilde{X}_{j},\ell^{j}_{t})<\bar{G}$ in all traces.

For every trace $\widetilde{X}_{j}$, let $I_{j}=\{t:S(\widetilde{X}_{j},\ell^{j}_{t})>\bar{G}\}$. If $|I_{j}|$ is not the same across all $T$ traces, the algorithm fails. Otherwise, let $I=|I_{j}|$ for all $j$, and for each trace $\widetilde{X}_{j}$ relabel the $\ell^{j}_{t}$ with $S(\widetilde{X}_{j},\ell^{j}_{t})>\bar{G}$ as $\ell^{j}_{1},\ldots,\ell^{j}_{I}$.

The proof now proceeds similarly to that of Theorem 1. We approximate the long 0-runs $\mathbf{r}_{t}$ in $Y$, which with high probability are close to long 0-dense substrings of $X$, with 0-runs, and the rest is approximated with 1-runs. We first estimate the distance between the 0-runs in $Y$. Consider a 0-run $\mathbf{r}_{t}$ that generates an estimate of $\widetilde{\mu}^{\mathbf{r}}_{t}/p$, and take $\widehat{i}_{t}=\frac{1}{T}\sum_{j^{\prime}=1}^{T}i_{\ell^{j^{\prime}}_{t}}$ and $\widehat{j}_{t}=\frac{1}{T}\sum_{j^{\prime}=1}^{T}j_{\ell^{j^{\prime}}_{t}}$, for $i_{\ell^{j^{\prime}}_{t}}$ and $j_{\ell^{j^{\prime}}_{t}}$ as in the definition of $S(\widetilde{X}_{j^{\prime}},\ell^{j^{\prime}}_{t})$. The average of the indices $\widehat{i}_{t}$ can be at most $m$ bits to the left of the first 0 from $\mathbf{r}_{t}$, and therefore is off by at most $m$ bits. The same is true for $\widehat{j}_{t}$. By a Chernoff bound, $|\widehat{i}_{t+1}-\widehat{j}_{t}|/p$ is an estimate of the distance between consecutive 0-runs with accuracy $2\varepsilon|\mathbf{r}_{t}|$ with probability at least $1-n^{-6}$. The substring between these 0-runs also has at least a $1-\varepsilon$ density of 1s, so we can fill in with 1-runs for a good approximation. Let $\widehat{1}_{0},\ldots,\widehat{1}_{I}$ be 1-runs, where $\widehat{1}_{t}$ has length $|\widehat{i}_{t+1}-\widehat{j}_{t}|/p$ for $t\in[I-1]$, $\widehat{1}_{0}$ has length $\widehat{i}_{1}/p$, and $\widehat{1}_{I}$ has length $|pn-\widehat{j}_{I}|/p$. Hence, by Lemma 7, the 1-runs contribute at most $3\varepsilon n$ to the edit distance error.

It remains to estimate the lengths of the long 0-runs 𝐫1,,𝐫I\mathbf{r}_{1},\ldots,\mathbf{r}_{I} in YY. Fix t[I]t\in[I] and let 0^t\widehat{0}_{t} be a 0-run of length μ~t𝐫/p\widetilde{\mu}^{\mathbf{r}}_{t}/p, for μ~t𝐫=1Tj=1TS(X~j,tj)\widetilde{\mu}^{\mathbf{r}}_{t}=\frac{1}{T}\sum_{j=1}^{T}S(\widetilde{X}_{j},\ell^{j}_{t}). For every 𝐫t{𝐫1,,𝐫I}\mathbf{r}_{t}\in\{\mathbf{r}_{1},\ldots,\mathbf{r}_{I}\}, define 𝐫tX0{\mathbf{r}_{t}}_{X}^{0} as above (the 0s from 𝐫t\mathbf{r}_{t} in XX). With probability at least 1n61-n^{-6}, the average of Bin(|𝐫tX0|,p)\mathrm{Bin}(|{\mathbf{r}_{t}}_{X}^{0}|,p) over T=O(log(n)/ε2)T=O(\log(n)/\varepsilon^{2}) traces is within εp|𝐫tX0|\varepsilon p|{\mathbf{r}_{t}}_{X}^{0}| of the mean p|𝐫tX0|p|{\mathbf{r}_{t}}_{X}^{0}|. Combining this with Property 2, with probability at least 1n31-n^{-3},

|μ~t𝐫p|𝐫tX0||εp|𝐫tX0|+2m.|\widetilde{\mu}^{\mathbf{r}}_{t}-p|{\mathbf{r}_{t}}_{X}^{0}||\leqslant\varepsilon p|{\mathbf{r}_{t}}_{X}^{0}|+2m.

Since ||𝐫tX0||𝐫t||m\left||{\mathbf{r}_{t}}_{X}^{0}|-|\mathbf{r}_{t}|\right|\leqslant m, we have that

|p|𝐫𝐭|μ~t𝐫|εp|𝐫t|+2m+pm=εp|𝐫t|+3m.|p|\mathbf{\mathbf{r}_{t}}|-\widetilde{\mu}^{\mathbf{r}}_{t}|\leqslant\varepsilon p|\mathbf{\mathbf{r}}_{t}|+2m+pm=\varepsilon p|\mathbf{\mathbf{r}}_{t}|+3m.

Hence μ~t𝐫/p\widetilde{\mu}^{\mathbf{r}}_{t}/p approximates |𝐫t||\mathbf{r}_{t}| with relative edit distance error at most

ε+3mp|𝐫t|ε+3mp(a2m)ε+9εp2C′′ε,\varepsilon+\frac{3m}{p|\mathbf{\mathbf{r}}_{t}|}\leqslant\varepsilon+\frac{3m}{p(a-2m)}\leqslant\varepsilon+\frac{9\varepsilon}{p^{2}}\leqslant C^{\prime\prime}\varepsilon,

where we use a>3ma>3m and C′′=1+9p2C^{\prime\prime}=1+\frac{9}{p^{2}}. Taking a union bound over all 𝐫t{𝐫1,,𝐫I}{\mathbf{r}}_{t}\in\{\mathbf{r}_{1},\ldots,\mathbf{r}_{I}\}, and applying Lemma 7, with probability at least 1n21-n^{-2} the long 0-run estimates contribute error at most C′′εnC^{\prime\prime}\varepsilon n. Putting this all together, we output the string X^=1^00^11^11^I10^I1^I\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{I-1}\widehat{0}_{I}\widehat{1}_{I}. One more application of Lemma 7 implies that d𝖤(Y,X^)(C′′+3)εnd_{\mathsf{E}}(Y,\widehat{X})\leqslant(C^{\prime\prime}+3)\varepsilon n. Since YY is within εn\varepsilon n edit distance of XX, the triangle inequality lets us conclude that d𝖤(X,X^)(C′′+4)εnd_{\mathsf{E}}(X,\widehat{X})\leqslant(C^{\prime\prime}+4)\varepsilon n.

If we apply this algorithm and analysis with εC′′+4\frac{\varepsilon}{C^{\prime\prime}+4} instead of ε\varepsilon, the result follows. Constants were taken large enough to account for this factor of C′′+4C^{\prime\prime}+4. ∎
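For concreteness, a minimal sketch (ours) of the final assembly step in the proof above: it consumes the averaged quantities î_t, ĵ_t, and μ̃_t^r, rounds all lengths to integers, and concatenates the runs in the stated order.

def assemble(i_hat, j_hat, mu, p, n):
    # i_hat[t], j_hat[t]: averaged endpoints of the (t+1)-st detected 0-run;
    # mu[t]: averaged statistic estimating p times its length (0-indexed, t < I).
    I = len(mu)
    out = ["1" * round(i_hat[0] / p)]                 # leading 1-run of length i_hat_1 / p
    for t in range(I):
        out.append("0" * round(mu[t] / p))            # 0-run of length mu_t / p
        if t + 1 < I:
            out.append("1" * round((i_hat[t + 1] - j_hat[t]) / p))  # 1-run between 0-runs
    out.append("1" * round((p * n - j_hat[-1]) / p))  # trailing 1-run
    return "".join(out)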

As before, the theorem holds even when the constant CC^{\prime} is unknown. Given T=O(logn/ε2)T=O(\log n/\varepsilon^{2}) traces of XX, we can determine whether XX has a gap, and the corresponding value of CC^{\prime}, with high probability.

3.3 Majority voting in substrings

A natural follow-up question to the previous theorems is what happens when the string no longer has long runs, but instead has long dense regions.

See Theorem 3.

Algorithm for majority voting in substrings

  Set-up: String XX on nn bits such that XX can be divided into contiguous intervals, all of length at least L=50logn/(p2ε2)L=50\log n/(p^{2}\varepsilon^{2}) and with density at least 1ε121-\frac{\varepsilon}{12} of 0s or 11s.

  1. Sample a single trace X~\widetilde{X} from the deletion channel with deletion probability qq.

  2. Uniformly partition X~\widetilde{X} into contiguous substrings of length w=εpLw=\varepsilon pL, so X~=X~1X~n/w\widetilde{X}=\widetilde{X}_{1}\cdots\widetilde{X}_{\lceil n/w\rceil}, with a shorter last interval if needed.

  3. Output X^=X^1X^n/w\widehat{X}=\widehat{X}_{1}\cdots\widehat{X}_{\lceil n/w\rceil}, where X^i\widehat{X}_{i} is a run of length w/pw/p whose value is the majority bit of X~i\widetilde{X}_{i}, for i[n/w]i\in[\lceil n/w\rceil].
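A minimal end-to-end sketch of the algorithm above (ours), assuming p = 1 − q and rounding w and w/p to integers:

import random

def deletion_channel(x, q):
    # each bit survives independently with probability p = 1 - q
    return "".join(b for b in x if random.random() > q)

def majority_vote_reconstruct(trace, p, L, eps):
    w = max(1, round(eps * p * L))        # window width w = eps * p * L
    out = []
    for i in range(0, len(trace), w):
        window = trace[i:i + w]
        maj = "1" if 2 * window.count("1") > len(window) else "0"
        out.append(maj * round(w / p))    # a run of length w/p of the majority bit
    return "".join(out)

For example, majority_vote_reconstruct(deletion_channel(x, q), 1 - q, L, eps) produces the single-trace approximation X̂.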

Analysis

We first present three properties of traces generated by high-density strings of large length.

Lemma 11.

Fix ε\varepsilon and pp. Let XX be a string on at least LL bits, where L=50p2ε2log(n)L=\frac{50}{p^{2}\varepsilon^{2}}\log(n), with density at least 1ε1-\varepsilon of either 0s or 11s. For a trace X~\widetilde{X} of XX, the following properties hold with probability at least 1n41-n^{-4}.

  1. (Property 1) |X~|Lp2\frac{|\widetilde{X}|}{L}\geqslant\frac{p}{2}.

  2. (Property 2) ||X||X~|p|ε|X|\Big{|}|X|-\frac{|\widetilde{X}|}{p}\Big{|}\leqslant\varepsilon|X|.

  3. (Property 3) X~\widetilde{X} has density at least 12ε1-2\varepsilon of the majority bit of XX.

Proof.

Assume w.l.o.g. that XX has density at least 1ε1-\varepsilon of 0s. Applying a Chernoff bound gives that with probability at least 1n61-n^{-6}, the length of X~\widetilde{X} is in the range p|X|±3|X|log(n)p|X|\pm\sqrt{3|X|\log(n)}. Taking this lower bound gives |X~||X|p3|X|log(n)|X|\frac{|\widetilde{X}|}{|X|}\geqslant p-\frac{\sqrt{3|X|\log(n)}}{|X|}. Since |X|L|X|\geqslant L, we see that 3|X|log(n)|X|p/2\frac{\sqrt{3|X|\log(n)}}{|X|}\leqslant p/2, completing the proof of Property 1. Dividing the same Chernoff bound by pp gives ||X||X~|p|3|X|log(n)/pε|X|\left||X|-\frac{|\widetilde{X}|}{p}\right|\leqslant\sqrt{3|X|\log(n)}/p\leqslant\varepsilon|X|, proving Property 2.

Applying a Chernoff bound to the number of 0s in XX, with probability at least 1n61-n^{-6}, the number of non-deleted 0s is at least p|X|(1ε)3|X|(1ε)log(n)p|X|(1ε)3|X|log(n)p|X|(1-\varepsilon)-\sqrt{3|X|(1-\varepsilon)\log(n)}\geqslant p|X|(1-\varepsilon)-\sqrt{3|X|\log(n)}. Combining this with the first application of a Chernoff bound, a union bound gives that with probability at least 1n51-n^{-5}, the density of 0s in the trace (denoted ρ\rho) satisfies the following inequalities:

ρp|X|(1ε)3|X|log(n)p|X|+3|X|log(n)50(1ε)150ε50+150ε12ε.\displaystyle\rho\geqslant\frac{p|X|(1-\varepsilon)-\sqrt{3|X|\log(n)}}{p|X|+\sqrt{3|X|\log(n)}}\geqslant\frac{50(1-\varepsilon)-\sqrt{150}\varepsilon}{50+\sqrt{150}\varepsilon}\geqslant 1-2\varepsilon.

Note that the second inequality comes from the fact that the expression to the left is increasing in |X||X|, and therefore is minimized at |X|=L|X|=L. ∎

Using these results, we can now proceed to the main proof of this section.

Proof of Theorem 3.

Suppose XX is a binary string on nn bits that can be divided into intervals I1,,ImI_{1},\ldots,I_{m} such that all intervals IiI_{i} have length at least L:=50p2ε2log(n)L:=\frac{50}{p^{2}\varepsilon^{2}}\log(n) and density at least 1ε1-\varepsilon of either 0 or 11. Take a trace X~\widetilde{X}. Define w=εpLw=\varepsilon pL. Divide the trace X~\widetilde{X} into consecutive intervals of width ww denoted as X~1,,X~k\widetilde{X}_{1},\ldots,\widetilde{X}_{k}, where X~i=X~[(i1)w,iw]\widetilde{X}_{i}=\widetilde{X}[(i-1)w,iw] (with X~k\widetilde{X}_{k} shorter if necessary).

Our approximate string is X^=X^1X^k\widehat{X}=\widehat{X}_{1}\cdots\widehat{X}_{k}, where X^i\widehat{X}_{i} is a run of length w/pw/p whose value is the majority bit of X~i\widetilde{X}_{i} for i[k]i\in[k]. Define XiX_{i} to be the range of bits in XX that correspond to the bits in X~i\widetilde{X}_{i}, and define I~i\widetilde{I}_{i} as the bits present in X~\widetilde{X} from the interval IiI_{i} in XX.

Consider IiI_{i} for some ii that w.l.o.g. has majority bit 0. By Property 3 of Lemma 11, at most 2ε|I~i|2\varepsilon|\widetilde{I}_{i}| bits in I~i\widetilde{I}_{i} are 11s. Consider all intervals X~j\widetilde{X}_{j} such that X~jI~i\widetilde{X}_{j}\subset\widetilde{I}_{i}; there are at least |I~i|2ww\frac{|\widetilde{I}_{i}|-2w}{w} of them. At most 2ε|I~i|w2=4ε|I~i|w\frac{2\varepsilon|\widetilde{I}_{i}|}{\frac{w}{2}}=\frac{4\varepsilon|\widetilde{I}_{i}|}{w} of these intervals X~j\widetilde{X}_{j} can have majority bit 11, since each such interval must contain at least w/2w/2 11s. Therefore, for ε18\varepsilon\leqslant\frac{1}{8}, the fraction of these intervals that have majority bit 11 is upper bounded as follows, where we use Property 1 of Lemma 11 to say that |Ii~|pL2|\widetilde{I_{i}}|\geqslant\frac{pL}{2}:

4ε12w|Ii~|4ε14ε8ε.\frac{4\varepsilon}{1-2\frac{w}{|\widetilde{I_{i}}|}}\leqslant\frac{4\varepsilon}{1-4\varepsilon}\leqslant 8\varepsilon.

Thus, in the concatenation of the length-w/p\frac{w}{p} majority runs of all XjX_{j} such that XjIiX_{j}\subset I_{i}, the fraction of 11s is at most 8ε8\varepsilon. Furthermore, the length of this concatenation is within ε|Ii|+2wp\varepsilon|I_{i}|+\frac{2w}{p} of |Ii||I_{i}|, where the first term comes from Property 2 in Lemma 11 and the second term comes from the two intervals XjX_{j} that could cross both IiI_{i} and either Ii1I_{i-1} or Ii+1I_{i+1}. This approximation of IiI_{i} therefore has density at least 18ε1-8\varepsilon of 0s and length differing from |Ii||I_{i}| by at most a ε+2wp|Ii|3ε\varepsilon+\frac{2w}{p|I_{i}|}\leqslant 3\varepsilon fraction. In total, this is an 11ε11\varepsilon-approximation of IiI_{i}, and the same holds for every ii.

The last source of error in our algorithm is the bits from X~j\widetilde{X}_{j} for all jj such that XjIiX_{j}\not\subset I_{i} for all ii (in other words, XjX_{j} lies on a boundary). We can assume that the bits in the output string from these X~j\widetilde{X}_{j} are all errors, and there are at most nL\frac{n}{L} such boundaries. Therefore, this contributes a total error of wpnLεn\frac{w}{p}\cdot\frac{n}{L}\leqslant\varepsilon n bits. Putting it all together with Lemma 7, d𝖤(X,X^)12εnd_{\mathsf{E}}(X,\widehat{X})\leqslant 12\varepsilon n. If we apply this algorithm and analysis with ε/12\varepsilon/12 instead of ε\varepsilon, the result follows. ∎

4 Lower bounds for approximate reconstruction

We turn our attention to proving limitations of approximate reconstruction. We provide two results, one for edit distance approximation and another for Hamming distance. Throughout this section we fix the deletion probability to be a constant q=Θ(1)q=\Theta(1).

4.1 Lower bound for edit distance approximation

Let α(0,1/2)\alpha\in(0,1/2) denote a fixed constant. Let f(n)f(n^{\prime}) be a lower bound on the number of traces required to distinguish between two length nn^{\prime} strings XX^{\prime} and YY^{\prime} with probability at least 1/2+α1/2+\alpha. We can take α\alpha to be as small as we like by slightly decreasing the lower bound, and therefore, we assume that α=1/8\alpha=1/8. Previous work identifies two strings such that f(n)=Ω~(n1.5)f(n)=\widetilde{\Omega}(n^{1.5}), where the Ω~\widetilde{\Omega} hides the 1/polylog(n)1/\mathrm{polylog}(n) factor [Cha20a, HL20]. They use X=(01)k1(01)k+1X^{\prime}=(01)^{k}1(01)^{k+1} and Y=(01)k+11(01)kY^{\prime}=(01)^{k+1}1(01)^{k} for n=4k+3n^{\prime}=4k+3. Our strategy holds for any family of pairs X,YX^{\prime},Y^{\prime} that witness the lower bound. However, we note that this specific pair is already close in edit distance, and hence, outputting either of them would always be an approximation within edit distance two.

We instead form a string VV of length nn by concatenating a sequence of blocks, where each block is a uniformly random choice between XX^{\prime} and YY^{\prime}. Setting the block length to be C/εC/\varepsilon, we show that any algorithm that approximates VV within edit distance εn\varepsilon n must require at least f(C/ε)f(C/\varepsilon) traces for a constant C(0,1)C\in(0,1). Our strategy follows previous results on exact reconstruction lower bounds that argue based on traces being independent of the choice of string in each block [Cha20a, HL20, MPV14]. However, the proof is not a straightforward extension because we must account for the algorithm being approximate. In essence, we argue that if the algorithm outputs a good enough approximation, then it must be able to distinguish between X,YX^{\prime},Y^{\prime} in many blocks.

Input Distribution and Indistinguishable Blocks

We define the hard distribution as follows. Let XX^{\prime} and YY^{\prime} be strings of length ⌈1/(128ε)⌉\lceil 1/(128\varepsilon)\rceil. We construct a random string VV of length nn by concatenating b=⌈128εn⌉b=\lceil 128\varepsilon n\rceil blocks V=V1V2Vb.V=V_{1}V_{2}\cdots V_{b}. Each of the substrings ViV_{i} is set to be XX^{\prime} or YY^{\prime} uniformly and independently. The approximate reconstruction algorithm receives T<f(C/ε)T<f(C/\varepsilon) traces for C=1/128C=1/128. By assumption, with TT traces from XX^{\prime} or YY^{\prime}, the algorithm must fail to distinguish between them with probability at least 1/2α1/2-\alpha. As this is an information-theoretic statement, we next argue that the TT traces are independent of the choice between XX^{\prime} and YY^{\prime} with probability at least 12α1-2\alpha.
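A sketch of sampling from this hard distribution (ours; block sizes rounded, using the pair X′, Y′ from [Cha20a, HL20]):

import random

def hard_instance(n, eps):
    # block length 4k + 3 is approximately 1/(128 * eps)
    k = max(1, round((1 / (128 * eps) - 3) / 4))
    Xp = "01" * k + "1" + "01" * (k + 1)
    Yp = "01" * (k + 1) + "1" + "01" * k
    b = max(1, n // len(Xp))              # about 128 * eps * n blocks
    return "".join(random.choice([Xp, Yp]) for _ in range(b))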

To formalize this claim, we introduce some notation. Let 𝒜\mathcal{A} denote a set of T<f(C/ε)T<f(C/\varepsilon) traces generated from the random string VV described above by passing VV through the deletion channel TT times. Since the channel deletes bits independently, we can equivalently determine the set 𝒜\mathcal{A} of traces by passing each block ViV_{i} for i[b]i\in[b] through the channel one at a time and then concatenating the subsequences to form a trace from VV. We let 𝒟i\mathcal{D}_{i} denote the distribution over sets of TT traces where ViV_{i} generates these traces. By our assumption, any algorithm that receives T<f(C/ε)T<f(C/\varepsilon) traces must fail to distinguish between Vi=XV_{i}=X^{\prime} and Vi=YV_{i}=Y^{\prime} with probability at least 1/2α1/2-\alpha.

Next, we decompose the trace distribution 𝒟i\mathcal{D}_{i} in a way that relates the failure probability to the event that the TT traces are independent of ViV_{i}. We express the distribution 𝒟i\mathcal{D}_{i} over TT traces of ViV_{i} as a convex combination of two distributions \mathcal{F} and 𝒢Vi\mathcal{G}_{V_{i}}, where intuitively sampling from \mathcal{F} corresponds to being unable to determine ViV_{i} with any advantage (see Definition 1 below). Formally, we take \mathcal{F} and 𝒢Vi\mathcal{G}_{V_{i}} to be any distributions over TT traces of ViV_{i} such that for some γ[0,1]\gamma\in[0,1] we have

𝒟i=(1γ)+γ𝒢Vi,\mathcal{D}_{i}=(1-\gamma)\cdot\mathcal{F}+\gamma\cdot\mathcal{G}_{V_{i}}, (1)

where 𝒢Vi=12(𝒢X+𝒢Y)\mathcal{G}_{V_{i}}=\frac{1}{2}(\mathcal{G}_{X^{\prime}}+\mathcal{G}_{Y^{\prime}}), and moreover, the following three properties hold: (i) \mathcal{F} is independent of ViV_{i}, (ii) 𝒢Vi\mathcal{G}_{V_{i}} is not independent of whether Vi=XV_{i}=X^{\prime} or Vi=YV_{i}=Y^{\prime}, and (iii) the distributions 𝒢X\mathcal{G}_{X^{\prime}} and 𝒢Y\mathcal{G}_{Y^{\prime}} have disjoint supports. We sketch how to construct distributions as in Eq. (1). The distribution 𝒟i\mathcal{D}_{i} from ViV_{i} is discrete over the TT-wise product of distributions over {0,1}n\{0,1\}^{\leqslant n}. Depending on ViV_{i}, the distribution gives different weights to each subsequence based on its length and the number of times it is a subsequence of ViV_{i}. Consider a multiset of TT traces that has higher probability of occurring under XX^{\prime} than under YY^{\prime}. Assign the mass in 𝒟i\mathcal{D}_{i} that comes from YY^{\prime} to \mathcal{F} and the remainder to 𝒢X\mathcal{G}_{X^{\prime}} (if the probability is higher under YY^{\prime}, swap the roles of XX^{\prime} and YY^{\prime}). Doing this for all multisets of TT subsequences leads to 𝒢X\mathcal{G}_{X^{\prime}} and 𝒢Y\mathcal{G}_{Y^{\prime}} having disjoint supports. The parameter γ\gamma normalizes the distributions.
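The construction just sketched is the standard overlap decomposition of two discrete distributions; a concrete version (ours) for a single block follows, with dictionaries standing in for distributions over multisets of T traces.

def overlap_decomposition(pX, pY):
    # pX[a], pY[a]: probability of trace-multiset a when the block is X' or Y'
    F, GX, GY = {}, {}, {}
    for a in set(pX) | set(pY):
        x, y = pX.get(a, 0.0), pY.get(a, 0.0)
        if min(x, y) > 0:
            F[a] = min(x, y)      # mass common to both: reveals nothing about V_i
        if x > y:
            GX[a] = x - y         # leftover mass, possible only when V_i = X'
        elif y > x:
            GY[a] = y - x
    gamma = sum(GX.values())      # equals sum(GY.values()), the total variation distance
    # normalizing F by (1 - gamma) and G_X, G_Y by gamma recovers Eq. (1);
    # G_X and G_Y have disjoint supports by construction
    return F, GX, GY, gamma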

We now argue that γ2α\gamma\leqslant 2\alpha by exhibiting an algorithm using TT traces whose failure probability is at most (1γ)/2(1-\gamma)/2. From Eq. (1), with probability γ\gamma, the traces are sampled from 𝒢X\mathcal{G}_{X^{\prime}} or 𝒢Y\mathcal{G}_{Y^{\prime}}; since these distributions have disjoint supports, the traces identify ViV_{i}, and the algorithm correctly determines ViV_{i}. Otherwise, with probability (1γ)(1-\gamma), the traces are sampled from \mathcal{F}, and an algorithm that outputs either XX^{\prime} or YY^{\prime} is correct with probability 1/21/2. Hence the failure probability is at most (1γ)/2(1-\gamma)/2. By our hypothesis, with T<f(C/ε)T<f(C/\varepsilon) traces, any algorithm has failure probability at least 1/2α1/2-\alpha. This implies that (1γ)/21/2α(1-\gamma)/2\geqslant 1/2-\alpha, which leads to 2αγ2\alpha\geqslant\gamma.

Now, define a binary latent variable i\mathcal{E}_{i} such that i=1\mathcal{E}_{i}=1 with probability 1γ1-\gamma and i=0\mathcal{E}_{i}=0 with probability γ\gamma. If i=1\mathcal{E}_{i}=1, then 𝒟i\mathcal{D}_{i} samples TT traces from \mathcal{F}, and if i=0\mathcal{E}_{i}=0, it samples from 𝒢Vi\mathcal{G}_{V_{i}}. Using this notation, we can define the event that the traces are independent of a block in VV. Recall that we sample TT traces 𝒜\mathcal{A} from VV by sampling TT traces from 𝒟i\mathcal{D}_{i} for each i[b]i\in[b] and then concatenating the traces of the blocks (using an arbitrary but fixed ordering of the traces).

Definition 1.

For i[b]i\in[b], we say that the ithi^{\mathrm{th}} block is indistinguishable from the TT traces 𝒜\mathcal{A} of VV if the distribution 𝒟i\mathcal{D}_{i} samples the traces of the ithi^{\mathrm{th}} block ViV_{i} from \mathcal{F}, or in other words, if i=1\mathcal{E}_{i}=1.

Lemma 12.

If α=1/8\alpha=1/8 and the number of blocks bb satisfies b128lognb\geqslant 128\log n, then at least (14α)b(1-4\alpha)b blocks are indistinguishable with probability at least 12/n21-2/n^{2}.

Proof.

Using the notation and arguments by Eq. (1), we have that γ2α\gamma\leqslant 2\alpha, which implies that i=1\mathcal{E}_{i}=1 with probability at least (12α)(1-2\alpha). Hence, the expected number of indistinguishable blocks is at least (12α)b(1-2\alpha)b. Since traces are generated for each block independently, the binary random variables {i}i=1b\{\mathcal{E}_{i}\}_{i=1}^{b} are independent. By a Chernoff bound, the probability that the number of indistinguishable blocks deviates from its mean by 2αb2\alpha b is at most 2e4α2(12α)b/32e2logn=2n22e^{-4\alpha^{2}(1-2\alpha)b/3}\leqslant 2e^{-2\log n}=2n^{-2}, where we have used that (12α)=3/4(1-2\alpha)=3/4 and α2b(128/64)logn=2logn\alpha^{2}b\geqslant(128/64)\log n=2\log n. ∎

From Indistinguishable Blocks to Edit Distance Error

We move on to a technical lemma that allows us to lower bound the edit distance by looking at the indicator vectors for the agreement of substrings in an optimal alignment. In what follows, we consider partitions into substrings, which are collections of non-overlapping, contiguous sequences of characters (a substring may be empty; substrings in a partition may have varying lengths).

Lemma 13.

Let UU and VV be strings. For an integer b1b\geqslant 1, assume that VV is partitioned into bb substrings V=V1V2VbV=V_{1}V_{2}\cdots V_{b}. Then, there exists a partition of UU into bb substrings U=U1U2UbU=U_{1}U_{2}\cdots U_{b} such that

d𝖤(U,V)i=1b𝟙{UiVi}.d_{\mathsf{E}}(U,V)\geqslant\sum_{i=1}^{b}\mathds{1}_{\{U_{i}\neq V_{i}\}}.

(Footnote: It is tempting to conjecture that equality can be achieved in Lemma 13 if we instead take the minimum over all partitions of UU. However, an example shows that this does not always hold. Over the alphabet {𝗑,𝗒,𝗓}\{\mathsf{x},\mathsf{y},\mathsf{z}\}, consider the pair U=𝗒𝗓𝗓𝗓𝗑U=\mathsf{y}\mathsf{z}\mathsf{z}\mathsf{z}\mathsf{x} and V=𝗑𝗒𝗒𝗑V=\mathsf{x}\mathsf{y}\mathsf{y}\mathsf{x}. Their edit distance is d𝖤(U,V)=4d_{\mathsf{E}}(U,V)=4. Using four blocks, partition V=[𝗑][𝗒][𝗒][𝗑]V=[\mathsf{x}][\mathsf{y}][\mathsf{y}][\mathsf{x}] and decompose U=[][𝗒][𝗓𝗓𝗓][𝗑]U=[\varnothing][\mathsf{y}][\mathsf{z}\mathsf{z}\mathsf{z}][\mathsf{x}]. The sum of the indicators equals only two, not four.)
Proof.

Let d=d𝖤(U,V)d=d_{\mathsf{E}}(U,V). We proceed by induction on the number of substrings bb. For the base case, b=1b=1, we have that the edit distance between UU and VV is zero if and only if U=VU=V. For the inductive step, assume the lemma holds up to b1b-1 substrings with b2b\geqslant 2. We consider two cases, where in both we will split UU into two substrings U=U1UU=U_{1}U^{\prime}.

For the first case, assume that V1V_{1} matches the prefix of UU, so that U=U1U=V1UU=U_{1}U^{\prime}=V_{1}U^{\prime}. Then, we have that d𝖤(U,V)=d𝖤(U,V2Vb)d_{\mathsf{E}}(U,V)=d_{\mathsf{E}}(U^{\prime},V_{2}\cdots V_{b}). Applying the inductive hypothesis with b1b-1 substrings for the pair UU^{\prime} and V2VbV_{2}\cdots V_{b} finishes this case.

For the second case, V1V_{1} does not match the prefix of UU, and hence, any minimum edit distance alignment between UU and VV uses at least one edit in the V1V_{1} portion. Consider any alignment between UU and VV with d=d𝖤(U,V)d=d_{\mathsf{E}}(U,V) edits. Let U=U1UU=U_{1}U^{\prime} denote the partition where U1U_{1} is aligned to V1V_{1} and UU^{\prime} is aligned to V2VbV_{2}\cdots V_{b}. Since the prefixes differ, we have d𝖤(U1,V1)1d_{\mathsf{E}}(U_{1},V_{1})\geqslant 1, which implies that d𝖤(U,V2Vb)d1d_{\mathsf{E}}(U^{\prime},V_{2}\cdots V_{b})\leqslant d-1. Applying the inductive hypothesis with b1b-1 substrings to the pair UU^{\prime} and V2VbV_{2}\cdots V_{b} leads to a partition U=U2UbU^{\prime}=U_{2}\cdots U_{b} such that i=2b𝟙{UiVi}d1\sum_{i=2}^{b}\mathds{1}_{\{U_{i}\neq V_{i}\}}\leqslant d-1. We conclude that d𝖤(U,V)=d=1+(d1)𝟙{U1V1}+i=2b𝟙{UiVi}d_{\mathsf{E}}(U,V)=d=1+(d-1)\geqslant\mathds{1}_{\{U_{1}\neq V_{1}\}}+\sum_{i=2}^{b}\mathds{1}_{\{U_{i}\neq V_{i}\}} for this partition of UU. ∎
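The footnote's example can be verified with the standard edit distance recurrence; the following check (ours) confirms that d_E(U,V) = 4.

def edit_distance(u, v):
    # Levenshtein distance with insertions, deletions, and substitutions
    m, n = len(u), len(v)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # delete u[i-1]
                          d[i][j - 1] + 1,                           # insert v[j-1]
                          d[i - 1][j - 1] + (u[i - 1] != v[j - 1]))  # substitute
    return d[m][n]

assert edit_distance("yzzzx", "xyyx") == 4  # the pair U, V from the footnote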

Using the above lemmas, we can now prove the edit distance lower bound theorem.

See Theorem 4.

Proof.

Let ε\varepsilon^{\star} be a small constant such that εε<C\varepsilon\leqslant\varepsilon^{\star}<C and f(C/ε)>1f(C/\varepsilon^{\star})>1, where we set C=1/128C=1/128. Assume that the approximate reconstruction algorithm receives T<f(C/ε)T<f(C/\varepsilon) traces.

Let X^\widehat{X} denote the output of the reconstruction algorithm on input V=V1V2Vb,V=V_{1}V_{2}\cdots V_{b}, where Vi{X,Y}V_{i}\in\{X^{\prime},Y^{\prime}\} and b=⌈128εn⌉b=\lceil 128\varepsilon n\rceil. Assume for contradiction that d𝖤(X^,V)εnd_{\mathsf{E}}(\widehat{X},V)\leqslant\varepsilon n with high probability. Using Lemma 13, we can partition X^\widehat{X} into bb blocks X^=X^1X^2X^b\widehat{X}=\widehat{X}_{1}\widehat{X}_{2}\cdots\widehat{X}_{b} such that

d𝖤(X^,V)i=1b𝟙{X^iVi}.d_{\mathsf{E}}(\widehat{X},V)\geqslant\sum_{i=1}^{b}\mathds{1}_{\{\widehat{X}_{i}\neq V_{i}\}}. (2)

Since b128lognb\geqslant 128\log n, Lemma 12 establishes that there are at least (14α)b(1-4\alpha)b blocks in VV that are indistinguishable with high probability using the TT traces. For each of these blocks, the algorithm cannot guess between Vi=XV_{i}=X^{\prime} or Vi=YV_{i}=Y^{\prime} with any advantage. While we do not know how the alignment corresponds to the indistinguishable blocks, we know that for at least (14α)b(1-4\alpha)b values j[b]j\in[b], we have that {X^jVj}\{\widehat{X}_{j}\neq V_{j}\} with probability at least 1/2. Thus, the sum in Eq. (2) is at least 12(14α)b=b/4\frac{1}{2}(1-4\alpha)b=b/4 in expectation, and by a Chernoff bound, it is at least b/8b/8 with high probability. This implies that d𝖤(X^,V)16εnd_{\mathsf{E}}(\widehat{X},V)\geqslant 16\varepsilon n, contradicting the edit distance being at most εn\varepsilon n. ∎

Corollary 5 now follows immediately from this theorem and the previous trace reconstruction lower bounds [Cha20a], showing that for δ(0,1/3)\delta\in(0,1/3), we have that n1+3δ/2/polylog(n)n^{1+3\delta/2}/\mathrm{polylog}(n) traces are necessary to n1/3δn^{1/3-\delta}-approximately reconstruct an arbitrary nn-bit string with probability 11/n1-1/n.

4.2 Lower Bound for Hamming Distance Approximation

See Theorem 6.

Proof.

Let n=4k+1n=4k+1. Define X=0k(01)k0k+1X=0^{k}(01)^{k}0^{k+1} to be the string of kk zeros followed by kk pairs of 0101 and ending with k+1k+1 zeros. Define Y=0k+1(01)k0kY=0^{k+1}(01)^{k}0^{k} to be k+1k+1 zeros followed by kk pairs of 0101 and ending with kk zeros. These two strings have Hamming distance 2k=(n1)/22k=(n-1)/2.

Differentiating between XX and YY is equivalent to determining the number of 0s at the beginning or end of them (as this is a promise problem). It is known that it requires Ω(k)=Ω(n)\Omega(k)=\Omega(n) traces to determine if the length of the 0-run at the beginning is even or odd with probability at least 2/32/3 because the problem reduces to differentiating between two binomial distributions [BKKM04]. Therefore, with probability at least 1/31/3, a reconstruction algorithm using fewer traces must output a string that is at least Hamming distance k=(n1)/4k=(n-1)/4 away from the actual string. ∎
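To see the binomial reduction quantitatively (our illustration): the total variation distance between Bin(k, p) and Bin(k+1, p) is Θ(1/√k), which is why Θ(k) traces are needed to tell the run lengths apart. A quick numeric check:

from math import comb

def tv_binomials(k, p):
    # total variation distance between Bin(k, p) and Bin(k + 1, p)
    def pmf(n, i):
        return comb(n, i) * p**i * (1 - p)**(n - i) if 0 <= i <= n else 0.0
    return 0.5 * sum(abs(pmf(k, i) - pmf(k + 1, i)) for i in range(k + 2))

# tv_binomials(100, 0.5) is roughly 0.04, and it shrinks like 1/sqrt(k)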

5 Conclusion

We studied the challenge of determining the relative trace complexity of approximate versus exact string reconstruction. Outputting a string close to the original in edit distance with few traces is a central problem in DNA data storage that has gone largely unstudied in favor of exact reconstruction. We present algorithms for classes of strings, where these classes lend themselves to techniques in every theoretician’s toolbox (e.g., concentration bounds, estimates from averages), while introducing new alignment techniques that may be useful for other algorithms. Additionally, these classes of strings are hard to reconstruct exactly (they contain the set of nn-bit strings with Hamming weight n1n-1, which suffices to derive an Ω(n)\Omega(n) lower bound on the trace complexity).

We leave open the intriguing question of whether εn\varepsilon n-approximate reconstruction is actually easier than exact reconstruction for all strings. We did show, however, that it is easier for at least some strings. Our algorithms output a string within edit distance εn\varepsilon n from the original string using O(logn/ε2)O(\log n/\varepsilon^{2}) traces for large classes of strings. In some cases, we showed how to approximately reconstruct with a single trace. We also presented lower bounds that interpolate between the hardness of approximate and exact trace reconstruction.

Algorithms with small sample complexity for the approximate trace reconstruction problem could also provide insight into exact solutions. If we know that the unknown string belongs to a specified Hamming ball of radius kk, then one can recover the string exactly with nO(k)n^{O(k)} traces by estimating the histogram of length kk subsequences [KR97, KMMP19]. It is an open question whether an analogous claim can be proven for edit distance [GSZ20]. Do nO(k)n^{O(k)} traces suffice if we know an edit ball of radius kk that contains the string? If this is true, then an algorithm satisfying our notion of edit distance approximation would imply an exact reconstruction result.

Approximate trace reconstruction is also a specialization of list decoding for the deletion channel, where the goal is to output a small set of strings that contains the correct one with high probability. We are not aware of any work on list decoding in the context of trace reconstruction, even though it seems like a natural problem to study. Using an approximate reconstruction algorithm, we could output the whole edit ball around the approximate string. For more on list decoding with insertions and deletions, see the work by Guruswami, Haeupler, and Shahrasbi and references therein [GHS20].

6 Acknowledgments

We thank João Ribeiro and Josh Brakensiek for discussions on coded trace reconstruction, as well as the anonymous reviewers for helpful feedback on an earlier version of the paper.

References

  • [ADHR12] Alexandr Andoni, Constantinos Daskalakis, Avinatan Hassidim, and Sebastien Roch. Global alignment of molecular sequences via ancestral state reconstruction. Stochastic Processes and their Applications, 122(12):3852–3874, 2012.
  • [AVDGiF19] Mahed Abroshan, Ramji Venkataramanan, Lara Dolecek, and Albert Guillén i Fàbregas. Coding for deletion channels with multiple traces. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1372–1376. IEEE, 2019.
  • [BCF+19] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 745–768. IEEE Computer Society, 2019.
  • [BCSS19] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), volume 145 of LIPIcs, pages 44:1–44:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [BKKM04] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 910–918, 2004.
  • [BLS20] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In IEEE Annual Symposium on Foundations of Computer Science, FOCS, 2020.
  • [BPRS20] Vinnu Bhardwaj, Pavel A. Pevzner, Cyrus Rashtchian, and Yana Safonova. Trace Reconstruction Problems in Computational Biology. IEEE Transactions on Information Theory, 2020.
  • [CDL+21] Xi Chen, Anindya De, Chin Ho Lee, Rocco A Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
  • [CGK12] George M. Church, Yuan Gao, and Sriram Kosuri. Next-Generation Digital Information Storage in DNA. Science, 337(6102):1628, 2012.
  • [CGMR20] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020.
  • [Cha20a] Zachary Chase. New lower bounds for trace reconstruction. Annales de l’Institut Henri Poincaré (to appear), 2020. Preprint at https://arxiv.org/abs/1905.03031.
  • [Cha20b] Zachary Chase. New upper bounds for trace reconstruction. Preprint at https://arxiv.org/abs/2009.03296, 2020.
  • [CKY20] Johan Chrisnata, Han Mao Kiah, and Eitan Yaakobi. Optimal Reconstruction Codes for Deletion Channels. Preprint at https://arxiv.org/abs/2004.06032, 2020.
  • [DM07] Eleni Drinea and Michael Mitzenmacher. Improved lower bounds for the capacity of i.i.d. deletion and duplication channels. IEEE Transactions on Information Theory, 53(8):2693–2714, 2007.
  • [DOS19] Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
  • [DRR19] Sami Davies, Miklós Z. Rácz, and Cyrus Rashtchian. Reconstructing trees from traces. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory (COLT), volume 99 of Proceedings of Machine Learning Research, pages 961–978. PMLR, 2019.
  • [GBC+13] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, 2013.
  • [GHS20] Venkatesan Guruswami, Bernhard Haeupler, and Amirbehshad Shahrasbi. Optimally resilient codes for list-decoding from insertions and deletions. In Proc. 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 524–537, 2020.
  • [GSZ20] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Distance. Preprint at https://arxiv.org/abs/2011.13737, 2020.
  • [HHP18] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics (ANALCO), pages 54–61, 2018.
  • [HL20] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020.
  • [HM14] Bernhard Haeupler and Michael Mitzenmacher. Repeated deletion channels. In 2014 IEEE Information Theory Workshop (ITW 2014), pages 152–156. IEEE, 2014.
  • [HMPW08] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proc. 19th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 389–398, 2008.
  • [HPP18] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Proceedings of the 31st Conference On Learning Theory (COLT), pages 1799–1840, 2018.
  • [KMMP19] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. Preprint at https://arxiv.org/abs/1904.09618, 2019.
  • [KNY20] Han Mao Kiah, Tuan Thanh Nguyen, and Eitan Yaakobi. Coding for Sequence Reconstruction for Single Edits. In IEEE International Symposium on Information Theory (ISIT), 2020.
  • [KR97] Ilia Krasikov and Yehuda Roditty. On a Reconstruction Problem for Sequences. Journal of Combinatorial Theory, Series A, 77(2):344–348, 1997.
  • [LCA+19] Randolph Lopez, Yuan-Jyue Chen, Siena Dumas Ang, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Georg Seelig, Karin Strauss, and Luis Ceze. DNA assembly for nanopore data storage readout. Nature Communications, 10(1):1–9, 2019.
  • [Lev01] Vladimir I. Levenshtein. Efficient Reconstruction of Sequences from Their Subsequences or Supersequences. Journal of Combinatorial Theory, Series A, 93(2):310–332, 2001.
  • [Mit09] Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1–33, 2009.
  • [MPV14] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace Reconstruction Revisited. In European Symposium on Algorithms (ESA), pages 689–700. Springer, 2014.
  • [Nar21] Shyam Narayanan. Population Recovery from the Deletion Channel: Nearly Matching Trace Reconstruction Bounds. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021. Preprint at https://arxiv.org/abs/2004.06828.
  • [NP17] Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O(n1/3))\exp(O(n^{1/3})) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1042–1046, 2017.
  • [NR21] Shyam Narayanan and Michael Ren. Circular Trace Reconstruction. In Proceedings of Innovations in Theoretical Computer Science (ITCS), 2021. Preprint at https://arxiv.org/abs/2009.01346.
  • [OAC+18] Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher N Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, and Karin Strauss. Random access in large-scale DNA data storage. Nature Biotechnology, 36:242–248, 2018.
  • [PZ17] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 228–239. IEEE Computer Society, 2017.
  • [SDDF18] Sundara Rajan Srinivasavaradhan, Michelle Du, Suhas Diggavi, and Christina Fragouli. On maximum likelihood reconstruction over multiple deletion channels. In IEEE International Symp. on Information Theory (ISIT), pages 436–440, 2018.
  • [SDDF20] Sundara Rajan Srinivasavaradhan, Michelle Du, Suhas Diggavi, and Christina Fragouli. Algorithms for reconstruction over single and multiple deletion channels. Preprint at https://arxiv.org/abs/2005.14388, 2020.
  • [SYY20] Omer Sabary, Eitan Yaakobi, and Alexander Yucovich. The error probability of maximum-likelihood decoding over two deletion channels. Preprint at https://arxiv.org/abs/2001.05582, 2020.
  • [VS08] Krishnamurthy Viswanathan and Ram Swaminathan. Improved String Reconstruction Over Insertion-Deletion Channels. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 399–408, 2008.
  • [YGM17] SM Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free DNA-based data storage. Scientific reports, 7(1):1–6, 2017.

Appendix A Appendix

The following are omitted proofs from our warm-up approximate reconstruction algorithms.

A.1 Analysis of first warm-up algorithm

Proof of Proposition 8.

It is straightforward to check that if XX contains kk runs, then with probability at least 11n21-\frac{1}{n^{2}} all T=2pε2log(n)T=\frac{2}{p\varepsilon^{2}}\log(n) traces contain kk runs. Next, we estimate the lengths of runs in XX. For traces X~1,,X~T\widetilde{X}_{1},\ldots,\widetilde{X}_{T}, label the runs in X~j\widetilde{X}_{j} as 𝐫~1j,𝐫~2j,,𝐫~kj\widetilde{\mathbf{r}}^{j}_{1},\widetilde{\mathbf{r}}^{j}_{2},\ldots,\widetilde{\mathbf{r}}^{j}_{k}, and recall that |𝐫i||{\mathbf{r}}_{i}| denotes the length of the iith run, 𝐫i{\mathbf{r}}_{i}, in XX. For μ~i=1Tj=1T|𝐫~ij|\widetilde{\mu}_{i}=\frac{1}{T}\sum_{j=1}^{T}|\widetilde{\mathbf{r}}^{j}_{i}|, the scaled average μ~ip\frac{\widetilde{\mu}_{i}}{p} estimates |𝐫i||{\mathbf{r}}_{i}| for i[k]i\in[k]. Applying a Chernoff bound and then a union bound, 𝐏(i:|μ~i/p|𝐫i||ε|𝐫i|)2n3\mathbf{P}(\exists i:|\widetilde{\mu}_{i}/p-|{\mathbf{r}}_{i}||\geqslant\varepsilon|{\mathbf{r}}_{i}|)\leqslant 2n^{-3}. Let X^=X^1X^k\widehat{X}=\widehat{X}_{1}\cdots\widehat{X}_{k}, where substring X^i\widehat{X}_{i} is a run with length μ~ip\frac{\widetilde{\mu}_{i}}{p} and bit value matching run ii of the traces. We have seen that with probability at least 11n1-\frac{1}{n}, for every i[k]i\in[k] the edit distance between X^i\widehat{X}_{i} and 𝐫i{\mathbf{r}}_{i} is at most ε|𝐫i|\varepsilon|{\mathbf{r}}_{i}|. On this event, X^\widehat{X} has edit distance at most εn\varepsilon n from XX, by Lemma 7. ∎
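A minimal sketch of this run-averaging procedure (ours), assuming every trace retained all k runs:

from itertools import groupby

def run_lengths(s):
    # run-length encoding of a binary string: [(bit, length), ...]
    return [(b, len(list(g))) for b, g in groupby(s)]

def reconstruct_from_runs(traces, p):
    encodings = [run_lengths(t) for t in traces]
    k = len(encodings[0])
    if any(len(e) != k for e in encodings):
        raise ValueError("a run was entirely deleted in some trace")
    out = []
    for i in range(k):
        bit = encodings[0][i][0]
        avg = sum(e[i][1] for e in encodings) / len(encodings)  # average length of run i
        out.append(bit * max(1, round(avg / p)))                # rescale by 1/p
    return "".join(out)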

We can also achieve slightly stronger guarantees. If the number of traces in Proposition 8 is linear, then the algorithm actually reconstructs exactly with high probability. Also, the output X^\widehat{X} from the algorithm for Proposition 8 will approximately reconstruct strings that do not quite satisfy the current assumptions, as described in the premises of the following corollary.

Corollary 14 (Robustness).

Let XX be an nn-bit string such that all runs have length at least log(n5)\log(n^{5}) except for at most ss runs. We can εn\varepsilon n-approximately reconstruct XX with O(log(n)/ε2(1p)s)O(\log(n)/\varepsilon^{2}\cdot(\frac{1}{p})^{s}) traces.

Proof.

Taking C=8/pC=8/p, with probability 11n31-\frac{1}{n^{3}} every long run (those with length at least log(n5)\log(n^{5})) will not be entirely deleted, and with probability at least psp^{s} none of the ss short runs is entirely deleted. By a Chernoff bound, with probability at least 1n31-n^{-3}, the number of traces where no short run is entirely deleted is at least 3ε2plog(n)\frac{3}{\varepsilon^{2}p}\log(n). We identify the traces with the maximum number of runs and then apply the algorithm for Proposition 8 to these traces. ∎

A.2 Analysis of second warm-up algorithm

Proof of Proposition 9.

Suppose that all of the 11-runs of XX have length at least 6pε2log(n)\frac{6}{p\varepsilon^{2}}\log(n). Take a single trace X~\widetilde{X}. By a Chernoff bound, with probability at least 1n21-n^{-2}, every 0-run from XX with length at least 6pεlog(n)\frac{6}{p\varepsilon}\log(n) will have length at least L:=log(n)10εL:=\frac{\log(n)}{10\varepsilon} in X~\widetilde{X}. Find every 0-run in X~\widetilde{X} with length at least LL and index them as 𝐫~1\widetilde{\mathbf{r}}_{1},…,𝐫~k\widetilde{\mathbf{r}}_{k}. For i[k1]i\in[k-1], let 𝐬~i\widetilde{\mathbf{s}}_{i} be the bits between the last bit of 𝐫~i\widetilde{\mathbf{r}}_{i} and the first bit of 𝐫~i+1\widetilde{\mathbf{r}}_{i+1}, let 𝐬~0\widetilde{\mathbf{s}}_{0} be the bits before 𝐫~1\widetilde{\mathbf{r}}_{1}, and let 𝐬~k+1\widetilde{\mathbf{s}}_{k+1} be the bits after 𝐫~k\widetilde{\mathbf{r}}_{k}. Let 𝐬i{\mathbf{s}}_{i} be the contiguous substring of XX from which 𝐬~i\widetilde{\mathbf{s}}_{i} came and 𝐫i{\mathbf{r}}_{i} the contiguous substring of XX from which 𝐫~i\widetilde{\mathbf{r}}_{i} came. For all ii, we will approximate 𝐬i{\mathbf{s}}_{i} with 1^i,\widehat{1}_{i}, a 11-run of length |𝐬~i|/p|\widetilde{\mathbf{s}}_{i}|/p, and 𝐫i{\mathbf{r}}_{i} with 0^i\widehat{0}_{i}, a 0-run of length |𝐫~i|/p|\widetilde{\mathbf{r}}_{i}|/p.

Since 𝐬i{\mathbf{s}}_{i} contains alternating 11-runs with length at least 6pε2log(n)\frac{6}{p\varepsilon^{2}}\log(n) and 0-runs with length at most 6pεlog(n)\frac{6}{p\varepsilon}\log(n), 𝐬i{\mathbf{s}}_{i} has at least a 1ε1-\varepsilon density of 11s. By a Chernoff bound, 𝐏(||𝐬~i|p|𝐬i||ε|𝐬i|)n2\mathbf{P}\left(\left|\frac{|\widetilde{\mathbf{s}}_{i}|}{p}-|{\mathbf{s}}_{i}|\right|\geqslant\varepsilon|{\mathbf{s}}_{i}|\right)\leqslant n^{-2}. Therefore 1^i\widehat{1}_{i} and 𝐬i{\mathbf{s}}_{i} have edit distance at most 2ε|𝐬i|2\varepsilon|{\mathbf{s}}_{i}|. If |𝐫i|6pε2log(n)|{\mathbf{r}}_{i}|\geqslant\frac{6}{p\varepsilon^{2}}\log(n), then, as before, by a Chernoff bound 𝐏(||𝐫i||𝐫~i|p|ε|𝐫i|)n2\mathbf{P}\left(\left||{\mathbf{r}}_{i}|-\frac{|\widetilde{\mathbf{r}}_{i}|}{p}\right|\geqslant\varepsilon|{\mathbf{r}}_{i}|\right)\leqslant n^{-2}, and so 0^i\widehat{0}_{i} has edit distance at most 2ε|𝐫i|2\varepsilon|{\mathbf{r}}_{i}| from 𝐫i{\mathbf{r}}_{i}. If |𝐫i|6pε2log(n)|{\mathbf{r}}_{i}|\leqslant\frac{6}{p\varepsilon^{2}}\log(n) then the approximation of |𝐫~i|/p|\widetilde{\mathbf{r}}_{i}|/p 0s has edit distance at most 6pεlog(n)\frac{6}{p\varepsilon}\log(n) from 𝐫i{\mathbf{r}}_{i} with probability at least 1n21-n^{-2}.

Let X^=1^00^11^11^k10^k1^k+1\widehat{X}=\widehat{1}_{0}\widehat{0}_{1}\widehat{1}_{1}\cdots\widehat{1}_{k-1}\widehat{0}_{k}\widehat{1}_{k+1} and observe that the number of 0-runs is at most pε2n6log(n)\frac{p\varepsilon^{2}n}{6\log(n)}, since there are at most this many 11-runs, which separate the 0-runs. Then applying Lemma 7, we have with probability at least 11/n1-1/n that

d𝖤(X,X^)\displaystyle d_{\mathsf{E}}(X,\widehat{X}) i=1k(d𝖤(0^i,𝐫i)+d𝖤(1^i,𝐬i))+d𝖤(1^0,𝐬0)+d𝖤(1^k+1,𝐬k+1)\displaystyle\leqslant\sum_{i=1}^{k}(d_{\mathsf{E}}(\widehat{0}_{i},{\mathbf{r}}_{i})+d_{\mathsf{E}}(\widehat{1}_{i},{\mathbf{s}}_{i}))+d_{\mathsf{E}}(\widehat{1}_{0},{\mathbf{s}}_{0})+d_{\mathsf{E}}(\widehat{1}_{k+1},{\mathbf{s}}_{k+1})
i=1k(2ε|𝐫i|+6pεlog(n)+2ε|𝐬i|)+2ε|𝐬0|+2ε|𝐬k+1|\displaystyle\leqslant\sum_{i=1}^{k}\left(2\varepsilon|{\mathbf{r}}_{i}|+\frac{6}{p\varepsilon}\log(n)+2\varepsilon|{\mathbf{s}}_{i}|\right)+2\varepsilon|{\mathbf{s}}_{0}|+2\varepsilon|{\mathbf{s}}_{k+1}|
2εn+6pεlog(n)n6log(n)/(pε2)3εn.\displaystyle\leqslant 2\varepsilon n+\frac{6}{p\varepsilon}\log(n)\cdot\frac{n}{6\log(n)/(p\varepsilon^{2})}\leqslant 3\varepsilon n.

The proposition follows by taking ε=ε3\varepsilon=\frac{\varepsilon^{*}}{3}. ∎
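A minimal single-trace sketch of this procedure (ours), with the integer threshold L0 standing in for L = log(n)/(10ε):

import re

def one_trace_reconstruct(trace, p, L0):
    # keep the 0-runs of the trace with length >= L0 and rescale them by 1/p;
    # everything between consecutive long 0-runs becomes a rescaled 1-run
    out, prev = [], 0
    for m in re.finditer("0{%d,}" % L0, trace):
        out.append("1" * round((m.start() - prev) / p))      # 1-run approximating s_i
        out.append("0" * round((m.end() - m.start()) / p))   # 0-run approximating r_i
        prev = m.end()
    out.append("1" * round((len(trace) - prev) / p))         # trailing 1-run
    return "".join(out)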

Appendix B Chernoff-Hoeffding Bound

In many proofs, we use the following concentration bound:

Lemma 15 (Chernoff-Hoeffding bound).

Let X1,,Xn{0,1}X_{1},\ldots,X_{n}\in\{0,1\} be independent. Let b1,,bn0b_{1},\ldots,b_{n}\geqslant 0 with b=max{bi}b=\max\{b_{i}\}. Then for 0<δ<10<\delta<1, X=i=1nbiXiX=\sum_{i=1}^{n}b_{i}X_{i}, and μ=𝔼[X]\mu=\mathbb{E}[X] the following holds:

𝐏(|Xμ|δμ)2exp(μδ2/(3b)).\mathbf{P}\left(\left|X-\mu\right|\geqslant\delta\mu\right)\leqslant 2\exp(-\mu\delta^{2}/(3b)).
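As a typical instantiation (our example), consider estimating the length \ell of a run: the number of surviving bits is a sum of independent Bernoulli(p)\mathrm{Bernoulli}(p) variables, so taking bi=b=1b_{i}=b=1 in Lemma 15 with X\sim\mathrm{Bin}(\ell,p) and μ=p\mu=p\ell gives

\mathbf{P}\left(\left|X-p\ell\right|\geqslant\delta p\ell\right)\leqslant 2\exp(-p\ell\delta^{2}/3),

which is at most 2n62n^{-6} once 18log(n)/(pδ2)\ell\geqslant 18\log(n)/(p\delta^{2}).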