Reconstruction from Noisy Substrings

Hengjia Wei, Moshe Schwartz

H. Wei is with the Peng Cheng Laboratory, Shenzhen 518055, China. He is also with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China, and the Pazhou Laboratory (Huangpu), Guangzhou 510555, China (e-mail: hjwei05@gmail.com). M. Schwartz is on a leave of absence from the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel. He is now with the Department of Electrical and Computer Engineering at McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: schwartz.moshe@mcmaster.ca). G. Ge is with the School of Mathematical Sciences, Capital Normal University, Beijing 100048, China (e-mail: gnge@zju.edu.cn). This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFA0712100, the National Natural Science Foundation of China under Grant 11971325, Grant 12231014 and Grant 12371523, Beijing Scholars Program, the major key project of Peng Cheng Laboratory under grant PCL2023AS1-2, and the Zhejiang Lab BioBit Program under Grant 2022YFB507.

Abstract

This paper studies the problem of encoding messages into sequences which can be uniquely recovered from some noisy observations about their substrings. The observed reads comprise consecutive substrings with some given minimum overlap. This coded reconstruction problem has applications to DNA storage. We consider both single-strand reconstruction codes and multi-strand reconstruction codes, where the message is encoded into a single strand or a set of multiple strands, respectively. Various parameter regimes are studied. New codes are constructed, some of whose rates asymptotically attain the upper bounds.

Index Terms:
DNA storage, sequence (string) reconstruction, substitution, substring-distant sequences, robust positioning sequences.

I Introduction

Sequence (string) reconstruction refers to a large class of problems of reconstructing a sequence from partial (perhaps noisy) observations of it. Instances of this problem include reconstruction from multiple erroneous copies of the sequence [13, 12, 3], some substrings of the sequence [11, 10], all the length-$k$ subsequences [15, 20, 8], and compositions of the sequence’s substrings or prefixes/suffixes [1, 18].

In this paper, we shall consider the problem of encoding messages into sequences which can be uniquely recovered from observations about their substrings. This coding problem is motivated by applications to DNA-based data storage systems, where data are encoded into long DNA sequences. In some DNA sequencing technologies (e.g., shotgun sequencing), a long DNA strand is first replicated multiple times, and these replicas are then fragmented into short substrings so that they can be read. In order to retrieve the data, the original long sequence must be reconstructed from the observations about these short substrings.

This coded reconstruction problem has been studied in different models with different assumptions on the substrings. Gabrys and Milenkovic [10] considered the problem of reconstructing a sequence of length $n$ from its $L$-multispectrum, i.e., the multiset of all of its length-$L$ substrings. They constructed two classes of reconstruction codes with redundancies $2$ and $O(\log\log n)$ for $L>2\log n$ and $\log n<L\leqslant 2\log n$, respectively. They also studied the noisy settings in which some substrings/observations may be lost or be corrupted by errors, and constructed codes to combat these effects. Subsequently, Marcovich and Yaakobi [16] followed this noisy setup and provided more code constructions. The constructions in [10, 16] are based on the so-called $(L,d)$-substring distant (SD) sequence, a sequence in which every two length-$L$ substrings are at Hamming distance at least $d$ from each other. When $d=1$, such sequences are also known as $L$-substring unique sequences or $L$-repeat free sequences. Efficient encoding algorithms can be found in [9] for $L>\log n$. For general $d$, Marcovich and Yaakobi [16] proposed an encoding algorithm of $(L,d)$-SD sequences for $L>2\log n$.

Another model is the torn-paper channel, which randomly tears the input sequence into small pieces of different sizes. The output of this channel is a set of substrings of the input sequence with no overlap, and the message which is carried by the input sequence should be recovered from these substrings. This problem has been researched in the probabilistic setting in [21, 19, 17]. Recently, Bar-Lev et al. [2] considered this problem in the worst case. They studied both the noiseless setup and the noisy setup, and proposed several index-based constructions to encode messages into sequences, each of which can be uniquely recovered from its non-overlapping substrings. Furthermore, motivated by DNA sequencing technologies where multiple strings are sequenced simultaneously, they extended the single-strand reconstruction problem to a multi-strand reconstruction problem. They constructed multi-strand reconstruction codes whose rates asymptotically behave like those of single-strand reconstruction codes. Another related paper is by Wang et al. [23], which, unlike [2], does not restrict the length of the torn substrings, but rather their number. For this setting they construct codes that attain the upper bound on the rate up to asymptotically small factors.

In a recent paper, Yehezkeally et al. [25] proposed a general model, which includes the two models above as extreme cases. In this model, the reconstruction is based on the sequence’s $(L_{\rm min},L_{\rm over})$-trace, which is a multiset of substrings where every substring has length at least $L_{\rm min}$ and the overlap of every two consecutive substrings has length at least $L_{\rm over}$. They focused on the noiseless setup, and constructed a class of trace reconstruction codes whose rate can asymptotically achieve the upper bound. They also studied the multi-strand reconstruction problem in the $L$-multispectrum model, and proposed reconstruction codes whose rates are asymptotically $1$.

In this paper, we shall follow the model in [25] and study the coding problem for both single-strand reconstruction and multi-strand reconstruction in the noisy setup. We aim to encode a message into a sequence which can be uniquely recovered from its $(L_{\rm min},L_{\rm over},e)$-erroneous trace, where each substring may suffer from at most $e$ substitution errors, or to encode a message into a set of $k$ sequences which can be recovered from the union of their $(L_{\rm min},L_{\rm over},e)$-erroneous traces. Our contributions are listed as follows.

  1. We first give an algorithm which can encode messages into $(L,d)$-SD sequences for $L=\lceil a\log n\rceil$, where $a>1$ is an arbitrary real constant. The rates of the encoded sequences asymptotically approach $1$. In contrast, the encoding algorithm in [16] requires a single redundancy bit but works only when $L>2\log n$.

  2. For single-strand reconstruction, by using the proposed encoding algorithm for SD sequences, we construct two classes of $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes whose rates asymptotically achieve the upper bound.

  3. For multi-strand reconstruction, we present some upper bounds on the rates of multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes, as well as some code constructions. In some parameter regimes, our constructions yield codes whose rates asymptotically attain the upper bounds. Interestingly, when $\log k=\kappa n$, $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$, the maximal rates of multi-strand reconstruction codes depend not only on $\kappa,a,\gamma$, but also on the congruence class of $n$ modulo $L_{\rm min}-L_{\rm over}$.

II Preliminaries

For a positive integer $n\in\mathbb{N}$, let $[n]$ denote the set $\{0,1,2,\ldots,n-1\}$. Let $\Sigma$ denote a finite alphabet. Throughout this paper, we always consider the binary case, i.e., $\Sigma=\{0,1\}$; however, our results can be easily generalized to non-binary cases. We use $\log x$ to denote the logarithm of $x$ to base $2$. When generalizing our results to the $q$-ary alphabet case, it suffices to replace $\log$ with $\log_{q}$.

Assume $\mathbf{x}=(x_{0},x_{1},\ldots,x_{n-1})\in\Sigma^{n}$ is a sequence over $\Sigma$. We denote its length by $\lvert\mathbf{x}\rvert=n$, and its Hamming weight by $\operatorname{wt}_{H}(\mathbf{x})$. Given two sequences $\mathbf{x}$ and $\mathbf{y}$ over $\Sigma$, we denote their concatenation by $\mathbf{x}\circ\mathbf{y}$. If $\mathbf{x}$ and $\mathbf{y}$ have the same length, we use $d_{H}(\mathbf{x},\mathbf{y})$ to denote their Hamming distance.

A substring of $\mathbf{x}$ is a sequence of the form $(x_{a},x_{a+1},\dots,x_{b})$, where $0\leqslant a\leqslant b<\lvert\mathbf{x}\rvert$, and we use $\mathbf{x}[a,b]$ to denote it. We also use $\mathbf{x}_{i+[L]}$, where $i\in[n-L+1]$, to denote the substring of $\mathbf{x}$ which starts at position $i$ and has length $L$, i.e., $\mathbf{x}_{i+[L]}=(x_{i},x_{i+1},\ldots,x_{i+L-1})=\mathbf{x}[i,i+L-1]$.

A code is simply a set $\mathcal{C}\subseteq\Sigma^{n}$, whose elements are referred to as codewords. We say $n$ is the length of the code. The rate of the code is defined as $R(\mathcal{C})=\frac{1}{n}\log\lvert\mathcal{C}\rvert$, and the redundancy of the code is $n-\log\lvert\mathcal{C}\rvert=n(1-R(\mathcal{C}))$.

II-A Reconstruction from the $L$-Multispectrum

For a sequence $\mathbf{x}\in\Sigma^{n}$ and a positive integer $L\leqslant n$, the $L$-multispectrum of $\mathbf{x}$, denoted by $\mathcal{S}_{L}(\mathbf{x})$, is the multiset of all its length-$L$ substrings, namely,

$$\mathcal{S}_{L}(\mathbf{x})=\left\{\mathbf{x}_{0+[L]},\mathbf{x}_{1+[L]},\ldots,\mathbf{x}_{n-L+[L]}\right\}.$$

If $\mathbf{x}$ can be uniquely reconstructed from its $L$-multispectrum, then we say it is $L$-reconstructible. It was proved in [22] that if all the length-$(L-1)$ substrings of $\mathbf{x}$ are distinct, then $\mathbf{x}$ is $L$-reconstructible. Such a sequence is referred to as an $L$-substring unique sequence. In the works [10, 9], algorithms were proposed to construct a set of $L$-substring unique sequences of rate approaching $1$, where $L=\lceil a\log n\rceil$ for any constant real number $a>1$.
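As a small illustration of the definitions above, the following Python sketch computes the $L$-multispectrum of a binary string and checks whether all of its length-$L$ windows are distinct. The function names `multispectrum` and `is_substring_unique` are hypothetical helpers introduced only for this example, and sequences are assumed to be given as strings over `{'0','1'}`.

```python
from collections import Counter


def multispectrum(x: str, L: int) -> Counter:
    """Return the L-multispectrum of x as a multiset of length-L substrings."""
    if not 1 <= L <= len(x):
        raise ValueError("L must satisfy 1 <= L <= len(x)")
    # One window per starting position i in [n - L + 1].
    return Counter(x[i:i + L] for i in range(len(x) - L + 1))


def is_substring_unique(x: str, L: int) -> bool:
    """True if all length-L substrings of x are pairwise distinct."""
    return all(count == 1 for count in multispectrum(x, L).values())
```

For instance, `multispectrum("0101", 2)` contains the window `"01"` twice, so `"0101"` has a repeated length-2 substring.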

In [10], Gabrys and Milenkovic further studied the problem of reconstructing sequences from their noisy multispectra. They first considered the scenario where some substrings are not included in the readout spectrum. For a subset $\hat{\mathcal{S}}\subset\mathcal{S}_{L}(\mathbf{x})$, if the maximum number of consecutive substrings which are not included in $\hat{\mathcal{S}}$ is $G$, we say $\hat{\mathcal{S}}$ has maximal coverage gap $G$. A code is called an $(L,G)$-reconstruction code if every codeword $\mathbf{x}$ can be uniquely reconstructed from any subset $\hat{\mathcal{S}}\subset\mathcal{S}_{L}(\mathbf{x})$ with maximal coverage gap $G$. Gabrys and Milenkovic proposed a construction for such codes [10] by restricting each codeword $\mathbf{x}$ to be $\hat{L}$-substring unique with $\hat{L}<L-G$ and imposing some constraints on their prefixes.

Gabrys and Milenkovic also researched the scenario where the observations about the substrings suffer from substitution errors. Let $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ be a multiset consisting of $m$ strings of length $L$. If there is a subset $\hat{\mathcal{S}}=\{\mathbf{x}_{i_{0}},\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{m-1}}\}\subset\mathcal{S}_{L}(\mathbf{x})$ with maximal coverage gap $G$ such that $d_{H}(\mathbf{y}_{j},\mathbf{x}_{i_{j}})\leqslant e$ for all $j\in[m]$, then we say $\mathcal{Y}$ is an $(L,G,e)$-constrained erroneous multispectrum of $\mathbf{x}$. Moreover, $\mathcal{Y}$ is said to be reliable if, for every symbol in $\mathbf{x}$, there are more copies of the correct value than of an incorrect value of the symbol. A code is called an $(L,G,e)$-reconstruction code if every codeword can be uniquely reconstructed from any of its reliable $(L,G,e)$-constrained erroneous multispectra.¹ Gabrys and Milenkovic constructed an $(L,G,e)$-reconstruction code of redundancy $O(\log\log n)$ for $L=6\log n+O(\log\log n)$. Their construction is based on $(L,d)$-substring distant sequences, whose definition is presented as follows.

¹ We emphasize that the multispectrum $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ is just a multiset, and the order/index $i$ of each $\mathbf{y}_{i}$ cannot be directly read when reconstructing.

Definition 1.

    A sequence $\mathbf{w}\in\Sigma^{n}$ is called $(L,d)$-substring distant (SD) if the minimum Hamming distance of its $L$-multispectrum is at least $d$, that is, $d_{H}(\mathbf{w}_{i+[L]},\mathbf{w}_{j+[L]})\geqslant d$ for any $0\leqslant i<j\leqslant n-L$.

Remark.

    We observe that an $(L,d)$-substring distant sequence is also $(L^{\prime},d)$-substring distant, for any $L^{\prime}\geqslant L$. Thus, we may equivalently say that $\mathbf{w}\in\Sigma^{n}$ is $(L,d)$-substring distant (SD) if $d_{H}(\mathbf{w}_{i+[L^{\prime}]},\mathbf{w}_{j+[L^{\prime}]})\geqslant d$ for any integer $L^{\prime}\geqslant L$ and $0\leqslant i<j\leqslant n-L^{\prime}$. This equivalent definition allows $L$ to be a real number, which we shall conveniently use in the future.
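Definition 1 can be checked directly by brute force. The sketch below is only illustrative: `hamming` and `is_substring_distant` are hypothetical names, sequences are assumed to be binary strings, and the quadratic pairwise comparison is meant for small examples rather than for the sequence lengths considered in this paper.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def is_substring_distant(w: str, L: int, d: int) -> bool:
    """True if every two length-L windows of w (at distinct positions)
    are at Hamming distance at least d, i.e., w is (L, d)-SD."""
    windows = [w[i:i + L] for i in range(len(w) - L + 1)]
    return all(hamming(u, v) >= d for u, v in combinations(windows, 2))
```

Consistent with the remark, a sequence that passes the check for some $L$ also passes it for any larger window length, since extending two windows can only keep or increase their distance.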

In [16], Marcovich and Yaakobi followed the noisy setup of Gabrys and Milenkovic. They studied the case of $G=0$, i.e., no substring losses. Instead of reconstructing $\mathbf{x}$ from a reliable erroneous multispectrum, they aimed to reconstruct, from an $(L,0,e)$-erroneous multispectrum $\mathcal{Y}$, the so-called maximum reconstructible-string, i.e., a string of length $n$ that takes at every position $i$ the majority value of the occurrences of $x_{i}$ in $\mathcal{Y}$. Obviously, if $\mathcal{Y}$ is reliable, then the maximum reconstructible-string is equal to $\mathbf{x}$. A sequence $\mathbf{x}$ is called $(L,0,e)$-reconstructible² if one can always reconstruct the maximum reconstructible-string from any of its $(L,0,e)$-erroneous multispectra.

² The notion here is a bit different from that in [16], where Marcovich and Yaakobi further assumed that there are at most $t$ substrings in $\mathcal{Y}$, each of which is affected by at most $e$ errors, and referred to it as a $(t,e)$-erroneous multispectrum. They proposed two constructions for reconstructible codes: one is independent of $t$ and thus can combat any number of erroneous substrings, while the other one depends on $t$. In this paper, we focus on reconstructible codes which are independent of $t$.

Proposition 2 ([16, Theorem 16]).

    If $\mathbf{x}$ is $(L-1,4e+1)$-SD, then it is $(L,0,e)$-reconstructible.

For positive integers $n,d,L$ with $d\leqslant L<n$, we use $\mathcal{Z}_{n}(L,d)$ to denote the set of $(L,d)$-SD sequences of $\Sigma^{n}$. For fixed $d$ and $a>1$, Marcovich and Yaakobi showed that the asymptotic rate of the set $\mathcal{Z}_{n}(a\log n,d)$ is $1$, by using the Lovász Local Lemma. Note that when $a<1$, even a single $(a\log n)$-substring unique sequence of length $n$ does not exist.

Theorem 3 ([16, Theorem 19]).

    For fixed $d$ and $a>1$,

$$\lim_{n\to\infty}\frac{\log\lvert\mathcal{Z}_{n}(a\log n,d)\rvert}{n}=1.$$

Marcovich and Yaakobi also presented a deterministic algorithm which uses a single redundancy bit to encode $(a\log n,d)$-SD sequences for $a>2$.

Theorem 4 ([16, Algorithm 4 and Theorem 25]).

    Let $d>0$ be a fixed integer. There is an encoding algorithm which uses a single redundancy bit to encode $(L,d)$-SD sequences of length $n$, for

$$L=2\log n+2(d-1+\epsilon)\log\log n,$$

where $\epsilon>0$ is a small constant and $n$ is sufficiently large.

In Section III, we shall present an algorithm which can encode $(a\log n,d)$-SD sequences of length $n$ for any $a>1$, while its redundancy is $o(n)$. According to Proposition 2, this implies an $(L,0,e)$-reconstructible code whose rate approaches $1$, for $L=\lceil a\log n\rceil+1$ and $e=\lfloor\frac{d-1}{4}\rfloor$.

II-B Reconstruction from an $(L_{\rm min},L_{\rm over})$-trace

In [25], Yehezkeally et al. studied an extension of the problem of reconstructing from substrings. Let $\mathbf{x}\in\Sigma^{n}$ be a sequence. A substring trace of $\mathbf{x}$ is a multiset of substrings $\{\mathbf{x}_{i_{0}+[L_{0}]},\mathbf{x}_{i_{1}+[L_{1}]},\ldots,\mathbf{x}_{i_{m-1}+[L_{m-1}]}\}$ for some positive integer $m$, where $i_{0}<i_{1}<\cdots<i_{m-1}$. If $i_{0}=0$, $i_{j+1}<i_{j}+L_{j}$ for all $j<m-1$, and $i_{m-1}+L_{m-1}=n$, then the substring trace is called complete. Let $L_{\rm min}$ and $L_{\rm over}$ be two positive integers such that $L_{\rm over}<L_{\rm min}<n$. An $(L_{\rm min},L_{\rm over})$-trace is a complete trace such that:

  1. every substring has length at least $L_{\rm min}$, i.e., $L_{i}\geqslant L_{\rm min}$ for all $i\in[m]$;

  2. the overlap of every two consecutive substrings has length at least $L_{\rm over}$, i.e., $i_{j}+L_{j}-i_{j+1}\geqslant L_{\rm over}$ for all $j\in[m-1]$.
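The two conditions, together with completeness, can be verified mechanically from the start positions and lengths of the substrings. The following Python sketch does this; the function name `is_trace` and the representation of a trace as a list of `(start, length)` pairs are assumptions made only for this illustration, and the check follows the definition literally (in particular, it uses the strict completeness condition $i_{j+1}<i_{j}+L_{j}$ stated for positive $L_{\rm over}$).

```python
def is_trace(n: int, pieces: list[tuple[int, int]], L_min: int, L_over: int) -> bool:
    """Check whether pieces = [(i_0, L_0), ..., (i_{m-1}, L_{m-1})] describes
    an (L_min, L_over)-trace of a length-n sequence."""
    if not pieces:
        return False
    starts = [i for i, _ in pieces]
    # Start positions must be strictly increasing: i_0 < i_1 < ... < i_{m-1}.
    if any(a >= b for a, b in zip(starts, starts[1:])):
        return False
    # Completeness: first piece starts at 0 and last piece ends at n.
    last_start, last_len = pieces[-1]
    if pieces[0][0] != 0 or last_start + last_len != n:
        return False
    # Condition 1: every substring has length at least L_min.
    if any(length < L_min for _, length in pieces):
        return False
    for (i, L), (i_next, _) in zip(pieces, pieces[1:]):
        overlap = i + L - i_next
        # Completeness needs overlap >= 1; condition 2 needs overlap >= L_over.
        if overlap < 1 or overlap < L_over:
            return False
    return True
```

For example, with $n=10$ the pieces `[(0, 5), (3, 4), (5, 5)]` form a $(4,2)$-trace, since consecutive pieces overlap in exactly two positions.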

For a sequence $\mathbf{x}$, let $\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x})$ denote the set of all $(L_{\rm min},L_{\rm over})$-traces of $\mathbf{x}$. A code $\mathcal{C}$ is referred to as an $(L_{\rm min},L_{\rm over})$-trace reconstruction code if $\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x})\cap\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x}^{\prime})=\emptyset$ for all $\mathbf{x}\neq\mathbf{x}^{\prime}\in\mathcal{C}$, or equivalently, every codeword can be uniquely reconstructed from any of its $(L_{\rm min},L_{\rm over})$-traces.

Proposition 5 ([25, Lemma 1]).

    Let $\mathbf{x}$ be an $L_{\rm over}$-substring unique sequence. Then $\mathbf{x}$ can be uniquely reconstructed from any of its $(L_{\rm min},L_{\rm over})$-traces.

By refining the constructions of substring unique sequences, Yehezkeally et al. obtained the following result.

Theorem 6 ([25, Corollary 6]).

    There is an $(L_{\rm min},L_{\rm over})$-trace reconstruction code of $\Sigma^{n}$ whose rate approaches $1$, for $L_{\rm over}\geqslant\lceil\log n\rceil+3\lceil\log\log n\rceil+12$ and sufficiently large $n$.

They also studied other parameter regimes.

Lemma 7 ([25, Lemma 8]).

    If $L_{\rm min}=a\log n+O(1)$ and $L_{\rm over}=\gamma L_{\rm min}+O(1)$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$, then for any $(L_{\rm min},L_{\rm over})$-trace reconstruction code $\mathcal{C}\subseteq\Sigma^{n}$, its rate $R(\mathcal{C})$ must satisfy

$$R(\mathcal{C})\leqslant\frac{1-1/a}{1-\gamma}+O\left(\frac{\log\log n}{\log n}\right).$$

Theorem 8 ([25, Theorem 15]).

    Let $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$. If $n$ is sufficiently large, then there is an $(L_{\rm min},L_{\rm over})$-trace reconstruction code $\mathcal{C}\subseteq\Sigma^{n}$ with rate

$$R(\mathcal{C})\geqslant\frac{1-1/a}{1-\gamma}-\frac{(\log n)^{\epsilon}}{a\sqrt{\log n}}-O\left(\frac{1}{\sqrt{\log n}}\right),$$

where $\epsilon>0$ is a small number which is independent of $n$.

In this paper, we shall study the problem of reconstructing sequences from their noisy substring traces. Let $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ be a multiset of sequences over $\Sigma$, and let $L_{j}=\lvert\mathbf{y}_{j}\rvert$ for $j\in[m]$. We say $\mathcal{Y}$ is an $(L_{\rm min},L_{\rm over},e)$-erroneous trace of $\mathbf{x}$ if there exists an $(L_{\rm min},L_{\rm over})$-trace $\{\mathbf{x}_{i_{0}+[L_{0}]},\mathbf{x}_{i_{1}+[L_{1}]},\ldots,\mathbf{x}_{i_{m-1}+[L_{m-1}]}\}$ such that $d_{H}(\mathbf{y}_{j},\mathbf{x}_{i_{j}+[L_{j}]})\leqslant e$ for all $j\in[m]$. Namely, each string $\mathbf{y}_{j}$ in $\mathcal{Y}$ is an erroneous copy of the substring $\mathbf{x}_{i_{j}+[L_{j}]}$ of $\mathbf{x}$ with at most $e$ errors. The index $i_{j}$ is referred to as the location of $\mathbf{y}_{j}$ in $\mathbf{x}$. For a sequence $\mathbf{x}$ and any of its $(L_{\rm min},L_{\rm over},e)$-erroneous traces $\mathcal{Y}$, if one can always determine the location of every $\mathbf{y}_{j}\in\mathcal{Y}$ in $\mathbf{x}$, then we say $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$-trace reconstructible. We note that once the locations of all the $\mathbf{y}_{j}$’s are identified, the maximum reconstructible-string of $\mathcal{Y}$ can be determined by taking at every position $i$ the majority value of the occurrences of $x_{i}$ in $\mathcal{Y}$. Hence, an $(L_{\rm min},L_{\rm over},e)$-trace reconstructible sequence $\mathbf{x}$ can be uniquely reconstructed as long as $\mathcal{Y}$ is reliable.
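The majority step described above is simple once the locations are known. The following Python sketch assumes the reads have already been placed (each read is paired with its location), which is exactly the part that trace reconstructibility is meant to guarantee; the function name `majority_reconstruct` and the use of `'?'` for tied or uncovered positions are choices made only for this illustration.

```python
from collections import Counter


def majority_reconstruct(n: int, placed_reads: list[tuple[int, str]]) -> str:
    """Given reads paired with their locations in a length-n sequence,
    return the position-wise majority string ('?' on ties or no coverage)."""
    votes = [Counter() for _ in range(n)]
    for location, read in placed_reads:
        for offset, symbol in enumerate(read):
            votes[location + offset][symbol] += 1
    out = []
    for v in votes:
        ranked = v.most_common(2)
        if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
            out.append("?")  # no strict majority at this position
        else:
            out.append(ranked[0][0])
    return "".join(out)
```

For example, for $\mathbf{x}=01101001$ and reads at locations $0,1,2$ of length $6$, a single substitution in the middle read at position $3$ is outvoted by the two other reads covering that position.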

A code is called an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code if every codeword $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$-trace reconstructible.³ In Section IV, we will give two constructions for $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes where the number of errors $e$ is fixed. Our results are akin to Theorem 6 and Theorem 8. In particular, when $L_{\rm over}=a\log n$ for some $a>1$, we construct a class of $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes whose rates approach $1$. When $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$, the proposed $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes have rates close to $\frac{1-1/a}{1-\gamma}$. These results are summarized in Table I. Our constructions are based on robust positioning sequences and window-weight limited sequences, which are reviewed in Section II-D.

³ Unlike the noiseless case, in an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code it might be possible that two codewords share a common $(L_{\rm min},L_{\rm over},e)$-erroneous trace. Nevertheless, they cannot have a common reliable trace.

We note that when $L_{\rm over}=0$, $(L_{\rm min},0)$-reconstruction codes were researched by Bar-Lev et al. in [2] under the name of adversarial torn-paper codes. In the same paper, they also consider the scenario where the DNA strand may suffer from substitution errors before sequencing. Such errors cannot be corrected by majority decoding. Yehezkeally and Polyanskii studied a similar problem for the $(L+1,L)$-trace reconstruction [26]. They introduced the notion of a $(t,L)$-resilient repeat free sequence, which satisfies the property that the result of any $t$ substitution errors to it is $L$-repeat free, and proposed an algorithm to directly encode such sequences. Interestingly, [26, Lemma 6] shows that an $(L,2t+1)$-SD sequence is $(t,L)$-resilient repeat free. In Section IV, we will also study errors before sequencing and modify our code construction for $(L_{\rm min},L_{\rm over},e)$-trace reconstruction to combat such errors.

TABLE I: Lower and upper bounds on the code rate of single-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes of $\Sigma^{n}$.

| Parameter regimes | Lower bound | Ref. | Upper bound | Ref. |
|---|---|---|---|---|
| $L_{\rm over}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d$, where $d=4e+1$ | $1-o(1)$ | Corollary 19 | $1$ | |
| $L_{\rm min}=\lceil a\log(n)\rceil$, $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$, where $a>1$ and $0\leqslant a\gamma\leqslant 1$ | $\frac{1-1/a}{1-\gamma}-o(1)$ | Theorem 25 & Theorem 28 | $\frac{1-1/a}{1-\gamma}+o(1)$ | Lemma 7 |

II-C Multi-strand reconstruction

Motivated by DNA sequencing technologies where multiple DNA strands are sequenced simultaneously, the reconstruction problem has been extended to the multi-strand case in [25, 2], i.e., reconstructing a multiset of $k$ sequences of length $n$ from the union of their traces.

Define

$$\mathcal{X}_{n,k}\triangleq\left\{\left\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\right\}~:~\mathbf{x}_{i}\in\Sigma^{n}\text{ for all }i\in[k]\right\}.$$

Then $\lvert\mathcal{X}_{n,k}\rvert=\binom{k+2^{n}-1}{k}$. The rate of a multi-strand code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is defined as

$$R(\mathcal{C})\triangleq\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}.$$
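To make the normalization concrete, the following Python sketch evaluates $\lvert\mathcal{X}_{n,k}\rvert$ and the resulting rate for small parameters. The function names `num_multisets` and `multistrand_rate` are hypothetical and introduced only for this example; the binary alphabet is assumed, as in the rest of the paper.

```python
from math import comb, log2


def num_multisets(n: int, k: int) -> int:
    """Number of multisets of k binary sequences of length n,
    i.e., |X_{n,k}| = C(k + 2^n - 1, k)."""
    return comb(k + 2**n - 1, k)


def multistrand_rate(code_size: int, n: int, k: int) -> float:
    """Rate of a multi-strand code with code_size codewords in X_{n,k}."""
    return log2(code_size) / log2(num_multisets(n, k))
```

For $n=1$ and $k=2$ there are three multisets, namely $\{0,0\}$, $\{0,1\}$ and $\{1,1\}$, which matches `num_multisets(1, 2)`.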

For a multiset $\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{X}_{n,k}$, its $(L_{\rm min},L_{\rm over})$-trace is a (multiset) union $\mathcal{Y}=\bigcup_{i=0}^{k-1}\mathcal{Y}_{i}$, where each $\mathcal{Y}_{i}$ is an $(L_{\rm min},L_{\rm over})$-trace of $\mathbf{x}_{i}$. A code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is referred to as a multi-strand $(L_{\rm min},L_{\rm over})$-trace reconstruction code if every codeword can be reconstructed from its $(L_{\rm min},L_{\rm over})$-trace. Two classes of multi-strand trace reconstruction codes whose rates asymptotically attain the upper bound have been constructed in [2] and [25], for $L_{\rm over}=0$ and $L_{\rm over}=L_{\rm min}-1$, respectively.

Theorem 9 ([2, Theorem 12]).

    Suppose that $\log k=o(n)$ and $L_{\rm min}=a\log(nk)$ with $a>1$. Then there is a class of multi-strand $(L_{\rm min},0)$-trace reconstruction codes of rate $1-1/a-o(1)$.

Theorem 10 ([25, Corollary 23]).

    Suppose that $\limsup_{n\to\infty}\log k/n<1$ and $L_{\rm min}\geqslant\log(nk)+3\log\log(nk)+12$. Then there is a class of multi-strand $(L_{\rm min},L_{\rm min}-1)$-trace reconstruction codes of rate $1-o(1)$.

In this paper, we will also study the problem of reconstructing multiple strands from their noisy traces. For a multiset $\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{X}_{n,k}$, its $(L_{\rm min},L_{\rm over},e)$-erroneous trace is a (multiset) union $\mathcal{Y}=\bigcup_{i=0}^{k-1}\mathcal{Y}_{i}$, where each $\mathcal{Y}_{i}$ is an $(L_{\rm min},L_{\rm over},e)$-erroneous trace of $\mathbf{x}_{i}$. We aim to reconstruct $\mathcal{S}$ from its $(L_{\rm min},L_{\rm over},e)$-erroneous trace. If for any $(L_{\rm min},L_{\rm over},e)$-erroneous trace $\mathcal{Y}$ of $\mathcal{S}$ and any $\mathbf{y}\in\mathcal{Y}$, it is possible to determine the index $i$ such that $\mathbf{y}\in\mathcal{Y}_{i}$ as well as the location of $\mathbf{y}$ in $\mathbf{x}_{i}$, then we say $\mathcal{S}$ is $(L_{\rm min},L_{\rm over},e)$-trace reconstructible. A code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is called a multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code if each of its codewords is $(L_{\rm min},L_{\rm over},e)$-trace reconstructible.

Following the research in [25], we assume that $\limsup_{n\to\infty}\log k/n<1$, which is of great interest in applications. In Section V, we shall present some upper bounds on the multi-strand trace code rate and propose some codes whose rates asymptotically attain these bounds. Our results are summarized in Table II and Table III. Among others, when $\log k=\kappa n$ with $0<\kappa<1$, we obtain a class of multi-strand $(L_{\rm min},0,e)$-trace reconstruction codes of rate $\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}-o(1)$, where $L^{*}\equiv n\pmod{L_{\rm min}}$. Note that $L^{*}\in[L_{\rm min}]$ and $L_{\rm min}=a\log(nk)=\Theta(n)$. The term $\frac{L^{*}}{n}$ could be a non-vanishing number, depending on the congruence class of $n$ modulo $L_{\rm min}$. In contrast, when $\log k=o(n)$, the rate of the multi-strand $(L_{\rm min},0)$-trace reconstruction codes in [2, Theorem 12] is $1-1/a-o(1)$, which is the same as that of single-strand reconstruction codes.

TABLE II: Lower and upper bounds on the code rate of multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes of $\mathcal{X}_{n,k}$, where $\log k=o(n)$.

| Parameter regimes | Lower bound | Ref. | Upper bound | Ref. |
|---|---|---|---|---|
| $L_{\rm over}=\log(nk)+(24e+13)\log\log(nk)+O(1)$ | $1-o(1)$ | Theorem 31 | $1$ | |
| $L_{\rm min}=\lceil a\log(nk)\rceil$, $L_{\rm over}=\lceil\gamma\log(nk)\rceil$, where $a>1$ and $0\leqslant a\gamma\leqslant 1$ | $\frac{1-1/a}{1-\gamma}-o(1)$ | Theorem 37 | $\frac{1-1/a}{1-\gamma}+o(1)$ | Lemma 33 |
| $L_{\rm min}\leqslant\log(nk)+o(\log(nk))$ | | | $o(1)$ | Corollary 34 |

TABLE III: Lower and upper bounds on the code rate of multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes of $\mathcal{X}_{n,k}$, where $\log k=\kappa n$ and $L^{*}=(n-L_{\rm over})\bmod(L_{\rm min}-L_{\rm over})$.

| Parameter regimes | Lower bound | Ref. | Upper bound | Ref. |
|---|---|---|---|---|
| $L_{\rm over}=\log(nk)+(24e+13)\log\log(nk)+O(1)$ | $1-o(1)$ | Theorem 31 | $1$ | |
| $L_{\rm min}=\lceil a\log(nk)\rceil$, $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$, where $a>1$ and $0\leqslant a\gamma\leqslant 1$ | $\frac{1-a\gamma\kappa}{1-\kappa}\left(\frac{1-1/a}{1-\gamma}\right)-o(1)$ | Theorem 37 | $\frac{1-a\gamma\kappa}{1-\kappa}\left(\frac{1-1/a}{1-\gamma}\right)+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\frac{L^{*}}{n}+o(1)$ | Lemma 33 |
| $L_{\rm over}=0$, $L_{\rm min}=\lceil a\log(nk)\rceil$, $a>1$, and $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ | $\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}-o(1)$ | Theorem 39 | $\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}+o(1)$ | Lemma 33 |
| $L_{\rm min}=\log(nk)+o(\log(nk))$ and $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$ | | | $o(1)$ | Lemma 36 |
| $L_{\rm min}=\lceil a\log(nk)\rceil$ with $a<1$ | | | $o(1)$ | Lemma 35 |

II-D Robust positioning sequences

An $(L,d)$-substring distant sequence $\mathbf{x}$ is also known as an $(L,d)$-robust positioning sequence, since the contents of any length-$L$ substring can locate the substring’s position in $\mathbf{x}$, even if they are corrupted by at most $\lfloor(d-1)/2\rfloor$ errors. In the context of robust positioning sequences, given $L$ and $d$, it is of interest to construct a (single) long $(L,d)$-robust positioning sequence with an efficient locating algorithm. This problem, as well as its 2-dimensional extension, has been discussed in [5, 4, 6, 7, 24]. Among others, Chee et al. [6] constructed a class of $(L,d)$-robust positioning sequences of length $2^{L}/(cL^{3d+6.5})$ for some constant $c>0$. Their construction was refined in [24] to obtain sequences of length $2^{L}/(cL^{\lceil(d-1)/2\rceil+8})$, which is nearly optimal. The constructions in [6, 24] require the following notions.
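The positioning property can be illustrated with a naive locator that scans all windows. This is only a brute-force sketch under the assumption that sequences are binary strings; `locate` is a hypothetical name, and the efficient locating algorithms of [6, 24] are not reproduced here. The example sequence `0011010` is $(5,3)$-SD, so a length-5 read with at most one error has a unique candidate position.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def locate(x: str, y: str, max_errors: int):
    """Return the unique position i with d_H(x[i:i+L], y) <= max_errors,
    where L = len(y), or None if there is no such unique position."""
    L = len(y)
    hits = [i for i in range(len(x) - L + 1)
            if hamming(x[i:i + L], y) <= max_errors]
    return hits[0] if len(hits) == 1 else None


x = "0011010"                    # a (5, 3)-SD sequence: windows 00110, 01101, 11010
position = locate(x, "01001", 1)  # window at position 1 with one flipped bit
```

Uniqueness of the hit follows from the triangle inequality: if two distinct windows were both within $\lfloor(d-1)/2\rfloor$ of the read, they would be within $d-1$ of each other.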

Theorem 11 ($d$-Auto-Cyclic Sequences [14]).

Let $\ell=d\lceil\log d\rceil+2d$. Set $\mathbf{u}$ to be the sequence

$$\mathbf{u}=1^{d}\circ\mathbf{u}_{0}\circ\mathbf{u}_{1}\circ\cdots\circ\mathbf{u}_{\lceil\log d\rceil},\mbox{ where }\mathbf{u}_{i}=((1^{2^{i}}\circ 0^{2^{i}})^{d})[0,d-1].$$

Then for all $1\leqslant i\leqslant d$, we have that

$$d_{H}(\mathbf{u},0^{i}\circ\mathbf{u}[0,\ell-i-1])\geqslant d,$$

and $\mathbf{u}$ is called a $d$-auto-cyclic sequence.
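The construction in Theorem 11 is easy to check numerically for small $d$. The following is a minimal sketch, assuming logarithms are base 2 and representing sequences as Python bit strings; the helper names `ceil_log2`, `auto_cyclic`, `hamming` and `shift_distances` are illustrative and do not come from [14].

```python
def ceil_log2(m):
    """Return ceil(log2(m)) for an integer m >= 1 (base-2 logs are an assumption)."""
    return (m - 1).bit_length()


def auto_cyclic(d):
    """Build u = 1^d . u_0 . u_1 ... u_{ceil(log d)} as a bit string."""
    blocks = ["1" * d]
    for i in range(ceil_log2(d) + 1):
        period = "1" * (2 ** i) + "0" * (2 ** i)
        # u_i = ((1^{2^i} 0^{2^i})^d)[0, d-1]: keep the first d symbols.
        blocks.append((period * d)[:d])
    return "".join(blocks)


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def shift_distances(d):
    """Return d_H(u, 0^i . u[0, l-i-1]) for i = 1, ..., d."""
    u = auto_cyclic(d)
    length = len(u)
    return [hamming(u, "0" * i + u[: length - i]) for i in range(1, d + 1)]
```

For instance, `auto_cyclic(3)` yields a sequence of length $\ell=3\lceil\log 3\rceil+6=12$, and every entry of `shift_distances(3)` is at least $3$, as the theorem states.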

Definition 12.

Let $n,L,d$ be positive integers such that $d<L<n$. We say a sequence $\mathbf{x}\in\Sigma^{n}$ satisfies the $(L,d)$-window weight limited (WWL) constraint, and is called an $(L,d)$-WWL sequence, if $\operatorname{wt}_{H}(\mathbf{x}_{i+[L]})\geqslant d$ for any $i\in[n-L+1]$.
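Definition 12 can be tested with a sliding window. The sketch below assumes a binary alphabet given as a Python string of `'0'` and `'1'` and that $[n]$ denotes $\{0,\ldots,n-1\}$; the function name `is_wwl` is hypothetical.

```python
def is_wwl(x, L, d):
    """Return True if every length-L window of x has Hamming weight >= d."""
    n = len(x)
    weight = x[:L].count("1")  # weight of the window starting at position 0
    if weight < d:
        return False
    for i in range(1, n - L + 1):
        # Slide the window by one: add the entering symbol, drop the leaving one.
        weight += (x[i + L - 1] == "1") - (x[i - 1] == "1")
        if weight < d:
            return False
    return True
```

The check runs in time linear in $n$, since each step updates the window weight in constant time.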

Proposition 13 ([6, Construction 1 and Theorem 3.7]).

Given $L$ and $d$, choose $K$ such that $\ell<K$ and $K+\ell<L$, where $\ell=d\lceil\log d\rceil+2d$. Let $\mathbf{u}$ be a $d$-auto-cyclic vector of length $\ell$ from Theorem 11 and set $L_{p}=K+\ell$. Let $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ be a collection of length-$(L-L_{p})$ binary vectors satisfying the following conditions:

  1. (P1) $\mathbf{s}_{i}$ is a $(K,d)$-WWL vector for $i\in[M]$;

  2. (P2) $\mathbf{s}_{i+1}[0,j-1]\circ\mathbf{s}_{i}[j,L-L_{p}-1]$ is a $(K,d)$-WWL vector for $i\in[M-1]$ and $j\in[L-L_{p}-1]$; and

  3. (P3) the concatenation $\mathbf{s}_{0}\circ\mathbf{s}_{1}\circ\mathbf{s}_{2}\circ\cdots\circ\mathbf{s}_{M-1}$ is an $(L-L_{p},d)$-modular robust positioning sequence, that is, a sequence $\mathbf{w}$ with $d_{H}(\mathbf{w}_{i+[L-L_{p}]},\mathbf{w}_{j+[L-L_{p}]})\geqslant d$ for any $i\equiv j\pmod{L-L_{p}}$ and $i\neq j$.

Then the sequence

$$\mathbf{s}\triangleq 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{M-1}$$

is an $(L,d)$-robust positioning (substring distant) sequence.
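The assembly step of Proposition 13 is a plain concatenation. The sketch below shows it together with a brute-force substring-distance check that is only practical for toy inputs. It does not verify conditions (P1)-(P3), and the names `assemble` and `is_substring_distant` are illustrative assumptions rather than notation from [6].

```python
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assemble(K, u, blocks):
    """Return 0^K . u . s_0 . 0^K . u . s_1 ... for the given list of blocks s_i."""
    return "".join("0" * K + u + s for s in blocks)


def is_substring_distant(x, L, d):
    """Brute force: every two length-L windows at distinct positions differ in >= d places."""
    windows = [x[i:i + L] for i in range(len(x) - L + 1)]
    return all(hamming(a, b) >= d for a, b in combinations(windows, 2))
```

The quadratic checker is meant purely as a sanity test on short sequences; the point of the proposition is that the property holds by construction, without such a search.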

Theorem 14 ([6, Construction 1A and Corollary 3.12]).

Given $d$ and $L$, set $K=3\lceil(3\log L)/2\rceil=\frac{9}{2}\log L+O(1)$. There is an explicit construction of sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ of length $L-K-\ell$, where $\log M=L-3d\log L-7.5\log L-O(1)$, such that the conditions (P1)–(P3) in Proposition 13 are satisfied.

Remark.

We note that for each $i\in[M]$, the concatenation $0^{K}\circ\mathbf{u}\circ\mathbf{s}_{i}$ is an $(L_{p},d)$-WWL sequence, since the length-$d$ prefix of $\mathbf{u}$ is $1^{d}$ and $\mathbf{s}_{i}$ is $(K,d)$-WWL.

III Encoding of $(a\log n,d)$-Substring Distant Sequences for $a>1$

In this section we present an encoding method that generates a set of $(a\log n,d)$-SD sequences of length $n$ (with $a>1$ a real number) whose rate asymptotically approaches $1$. We shall, in fact, construct $(L,d)$-SD sequences with $L=\log n+(6d+7)\log\log n+O(1)$, but using the remark following Definition 1, we shall find it more convenient to denote these sequences as $(a\log n,d)$-SD.

We first require some notation. For a sequence $\mathbf{w}\in\Sigma^{n}$, we say that $(i,j)$ (where $0\leqslant i<j\leqslant n-L$) is an $(L,\rho)$-close window pair in $\mathbf{w}$ if $d_{H}(\mathbf{w}_{i+[L]},\mathbf{w}_{j+[L]})\leqslant\rho$. Moreover, $(i,j)$ is called primal if for any other $(L,\rho)$-close window pair $(i^{\prime},j^{\prime})$ in $\mathbf{w}$ we have $j\leqslant j^{\prime}$. Let $\mathbf{x},\mathbf{x}^{\prime}\in\Sigma^{L}$ be two sequences with $d_{H}(\mathbf{x},\mathbf{x}^{\prime})\leqslant\rho$ for some integer $\rho\leqslant L$. Let $p_{1},p_{2},\ldots,p_{d_{H}(\mathbf{x},\mathbf{x}^{\prime})}$ denote the indices of the entries where $\mathbf{x}$ and $\mathbf{x}^{\prime}$ do not agree. For every $1\leqslant i\leqslant\rho$ let

$$\mathbf{b}_{i}=\begin{cases}b(p_{i})&\text{if $i\leqslant d_{H}(\mathbf{x},\mathbf{x}^{\prime})$},\\ 0^{\lceil\log(L+1)\rceil}&\text{otherwise},\end{cases}\tag{1}$$

where $b(i)$ is the binary representation of $i$ with $\lceil\log(L+1)\rceil$ symbols. Let

$$\operatorname{EncDist}_{L,\rho}(\mathbf{x},\mathbf{x}^{\prime})\triangleq\mathbf{b}_{1}\circ\mathbf{b}_{2}\circ\cdots\circ\mathbf{b}_{\rho}.$$

Then $\operatorname{EncDist}_{L,\rho}(\mathbf{x},\mathbf{x}^{\prime})$ encodes the difference between $\mathbf{x}$ and $\mathbf{x}^{\prime}$, and its length is $\rho\lceil\log(L+1)\rceil$.
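The two notions just introduced can be sketched in a few lines. The code below assumes bit strings, base-2 logarithms, and that the differing indices $p_{i}$ are numbered from $1$ to $L$, so that the all-zero block is free to serve as padding; this indexing is an inference from the block width $\lceil\log(L+1)\rceil$, not something the text states. When several close pairs share the smallest $j$, the sketch returns the one with the smallest $i$, which is also a choice made here. All function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def primal_close_pair(w, L, rho):
    """Return an (L, rho)-close window pair (i, j), i < j, with the smallest j, or None."""
    for j in range(1, len(w) - L + 1):
        for i in range(j):
            if hamming(w[i:i + L], w[j:j + L]) <= rho:
                return (i, j)
    return None


def enc_dist(x, xp, rho):
    """EncDist_{L,rho}(x, xp): rho blocks of ceil(log2(L+1)) bits each."""
    L = len(x)
    width = L.bit_length()  # equals ceil(log2(L + 1)) for L >= 1
    diff = [p + 1 for p in range(L) if x[p] != xp[p]]  # 1-based positions (assumption)
    if len(diff) > rho:
        raise ValueError("sequences differ in more than rho positions")
    blocks = [format(p, "0{}b".format(width)) for p in diff]
    blocks += ["0" * width] * (rho - len(diff))  # zero blocks pad unused slots
    return "".join(blocks)


def apply_dist(x, code, rho):
    """Invert enc_dist: recover xp from x and the encoded difference."""
    width = len(x).bit_length()
    out = list(x)
    for t in range(rho):
        p = int(code[t * width:(t + 1) * width], 2)
        if p:  # p == 0 marks a padding block
            out[p - 1] = "1" if out[p - 1] == "0" else "0"
    return "".join(out)
```

With `x = "1010"` and `xp = "1000"`, the only disagreement is at (1-based) position 3, so `enc_dist(x, xp, 2)` is the block `011` followed by one padding block `000`.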

Given a fixed $d$ and a sufficiently large $n$, we now present an encoding algorithm that produces $(L,d)$-SD sequences of length $n$. Set

$$\begin{aligned}
L_{1}&\triangleq\lceil\log n\rceil+(2d-1)\lceil\log\lceil\log n\rceil\rceil+6d+\lceil\log(d+1)\rceil,\\
K_{1}&\triangleq d\lceil\log\lceil\log n\rceil\rceil+d,\\
L_{2}&\triangleq\lceil\log n\rceil+(3d+7)\lceil\log\lceil\log n\rceil\rceil,\\
K_{2}&\triangleq 3\left\lceil\frac{3}{2}\log L_{2}\right\rceil,\\
K_{\max}&\triangleq\max\{K_{1},K_{2}\}.
\end{aligned}$$

Additionally, set

$$\begin{aligned}
\ell&\triangleq d\lceil\log d\rceil+2d,\\
L&\triangleq\max\{L_{1}+K_{2}+K_{\max}+\ell,\;L_{2}+2K_{1}+K_{\max}+\ell\}.
\end{aligned}$$

Assume that $d$ is fixed and $n$ is sufficiently large. Then $L=L_{2}+2K_{1}+K_{\max}+\ell$, and $K_{1}>K_{2}$ if and only if $d\geqslant 5$. Note that

$$K_{2}=3\left\lceil 1.5\log L_{2}\right\rceil\leqslant 3\left\lceil 1.5\log\left\lceil\log n\right\rceil+1.5\right\rceil\leqslant 4.5\left\lceil\log\left\lceil\log n\right\rceil\right\rceil+7.5.$$

Thus, we have that

$$L\begin{cases}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d&\text{if $d\geqslant 5$},\\ \leqslant\lceil\log n\rceil+(5d+11.5)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+4d+7.5&\text{otherwise.}\end{cases}$$
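To get a feeling for the sizes involved, the parameters above can be evaluated directly. This is a sketch under the assumption that all logarithms are base 2; `clog2` and `sd_parameters` are hypothetical helper names.

```python
import math


def clog2(m):
    """ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def sd_parameters(n, d):
    """Evaluate L1, K1, L2, K2, Kmax, ell and L for given n and d."""
    log_n = clog2(n)
    loglog_n = clog2(log_n)
    L1 = log_n + (2 * d - 1) * loglog_n + 6 * d + clog2(d + 1)
    K1 = d * loglog_n + d
    L2 = log_n + (3 * d + 7) * loglog_n
    K2 = 3 * math.ceil(1.5 * math.log2(L2))
    Kmax = max(K1, K2)
    ell = d * clog2(d) + 2 * d
    L = max(L1 + K2 + Kmax + ell, L2 + 2 * K1 + Kmax + ell)
    return {"L1": L1, "K1": K1, "L2": L2, "K2": K2,
            "Kmax": Kmax, "ell": ell, "L": L}
```

For $n=2^{20}$ and $d=5$ this gives $K_{1}=30$ and $K_{2}=33$, so the relation $K_{1}>K_{2}$ for $d\geqslant 5$ has not yet set in at this length; it is an asymptotic statement, as the text indicates.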

Our encoder resembles the encoding algorithms in [10, 9] and consists of the following three parts:

  1. We first use the encoder presented in [14] to encode a message sequence $\mathbf{m}\in\Sigma^{n^{\prime}}$ into a $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL sequence $\mathbf{w}$ of length $n-K_{1}-K_{2}$. According to [14, Corollary 20], this encoder, denoted by $\mathcal{E}_{1}$, requires approximately $2d\cdot 2^{\mathcal{F}(n-K_{1}-K_{2},d)-d\lceil\log\lceil\log n\rceil\rceil}$ redundancy symbols, where

    $$\mathcal{F}(n,d)=\lceil\log n\rceil+(d-1)(\lceil\log\lceil\log n\rceil\rceil+C)+2$$

    for some constant $C$. Hence,

    $$n^{\prime}=n-K_{1}-K_{2}-2d\cdot 2^{\mathcal{F}(n-K_{1}-K_{2},d)-d\lceil\log\lceil\log n\rceil\rceil}=n-K_{1}-K_{2}-\Theta(n/\log n).\tag{2}$$

  2. Then we encode the $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL sequence $\mathbf{w}$ into an $(L_{1},d)$-SD sequence $\bar{\mathbf{w}}$ by eliminating the pairs of substrings of small distance and attaching some information about their positions and difference. This encoder, denoted by $\mathcal{E}_{2}$, is presented in Algorithm 1, and it additionally guarantees that the output sequence is $(K_{1},d)$-WWL.

  3. As an output of Algorithm 1, the sequence $\bar{\mathbf{w}}$ is usually shorter than the sequence $\mathbf{w}$. Thus, we need an expansion step to increase the sequence length while keeping the substring-distant property. Let $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ be a collection of $(K_{2},d)$-WWL sequences of length $L_{2}-L_{p}$ as in Theorem 14. Set

    $$\bar{\mathbf{s}}\triangleq 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{M-1},$$

    where $\mathbf{u}$ is the $d$-auto-cyclic vector of length $\ell$ from Theorem 11. Finally, let

    $$\hat{\mathbf{w}}\triangleq\mathcal{E}_{3}(\bar{\mathbf{w}})\triangleq(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1].$$

We shall show that $\hat{\mathbf{w}}$ is the required $(L,d)$-SD sequence of length $n$.

We first describe the encoding presented in Algorithm 1. This procedure encodes a $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL sequence $\mathbf{w}$ into a sequence $\bar{\mathbf{w}}$ that is simultaneously $(L_{1},d)$-SD and $(K_{1},d)$-WWL. Initialize $\bar{\mathbf{w}}=\mathbf{w}$. If there are no $(L_{1},d-1)$-close window pairs in $\bar{\mathbf{w}}$, then the algorithm returns $\bar{\mathbf{w}}$ as the output. We observe that since $\mathbf{w}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL and $K_{1}\geqslant d\lceil\log\lceil\log n\rceil\rceil$, the sequence $\mathbf{w}$ is also $(K_{1},d)$-WWL.

Otherwise, we choose a primal $(L_{1},d-1)$-close window pair, say $(i,j)$. We replace the substring $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence

$$1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\circ 1^{d},\tag{3}$$

where $B:[n]\longrightarrow\Sigma^{\lceil\log n\rceil+d}$ is the encoding function in [14, Algorithm 2], which can encode integers in $[n]$ into $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL sequences in $O(n)$ time. We note that this sequence is $(K_{1},d)$-WWL and contains the information about the position $i$ and the difference between $\bar{\mathbf{w}}_{i+[L_{1}]}$ and $\bar{\mathbf{w}}_{j+[L_{1}]}$. Moreover, the substring $0^{d\lceil\log\lceil\log n\rceil\rceil}$ serves as a marker which indicates the position $j$ of the removed substring $\bar{\mathbf{w}}_{j+[L_{1}]}$.

We repeat this procedure until there are no $(L_{1},d-1)$-close window pairs in $\bar{\mathbf{w}}$. However, to ensure that $\mathbf{w}$ can be recovered from the output of the algorithm, some additional care is needed. We note that in [10] the inserted sequences always start with a marker $0^{2\log\log n}$ and end with a symbol ‘1’. This pattern, together with the rule that only primal pairs can be chosen and replaced, guarantees that after each replacement the latest inserted substring always starts with the rightmost $0^{2\log\log n}$ in $\bar{\mathbf{w}}$. Due to this property, there is a decoding algorithm which can recover $\mathbf{w}$ from $\bar{\mathbf{w}}$: Let $\bar{\mathbf{w}}^{(k)}$ denote the sequence $\bar{\mathbf{w}}$ after the $k$-th replacement. One can search for the rightmost $0^{2\log\log n}$ in $\bar{\mathbf{w}}^{(k)}$ to find the position $j$ of the inserted substring in the $k$-th replacement. By replacing the inserted substring with the removed substring, one can recover $\bar{\mathbf{w}}^{(k-1)}$ from $\bar{\mathbf{w}}^{(k)}$. Doing this iteratively, one can eventually recover $\mathbf{w}$ from $\bar{\mathbf{w}}$.

In our encoding, the inserted substring should always contain $1^{d}$ as both prefix and suffix to maintain the property of being $(K_{1},d)$-WWL. We have to modify the substring $0^{\lceil\log(d+1)\rceil}$ in (3) to ensure that the latest inserted substring always starts with the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}$. Let $j_{p}$ and $j$ be the positions of the removed substrings in the previous replacement and in the current replacement, respectively. Since we only choose primal pairs, necessarily $j>j_{p}-L_{1}$. If $j>j_{p}-L_{1}+d$, then we still replace the substring $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence in (3), since the marker $0^{d\lceil\log\lceil\log n\rceil\rceil}$ inserted in the previous replacement will be destroyed by the suffix $1^{d}$ of this inserted sequence. If $j_{p}-L_{1}<j\leqslant j_{p}-L_{1}+d$, we first set $\bar{\mathbf{w}}[j_{p}+d]$ to be ‘1’ to destroy the previous marker $0^{d\lceil\log\lceil\log n\rceil\rceil}$. Then we replace $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence

$$1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ b(j-j_{p}+L_{1})\circ 1^{d},\tag{4}$$

where $b(j-j_{p}+L_{1})$ is the binary encoding of $j-j_{p}+L_{1}$ with $\lceil\log(d+1)\rceil$ symbols, which suffices since $1\leqslant j-j_{p}+L_{1}\leqslant d$.

Note that the substring $B(i)$ and the substring $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})$ have length $\lceil\log n\rceil+d$ and length at most $(d-1)(\lceil\log\lceil\log n\rceil\rceil+1)$, respectively. It follows that in the loop we replace substrings of length $L_{1}$ with substrings of length at most

$$\begin{aligned}
&4d+d\lceil\log\lceil\log n\rceil\rceil+(\lceil\log n\rceil+d)+(d-1)\lceil\log(L_{1}+1)\rceil+\lceil\log(d+1)\rceil\\
&\leqslant 4d+d\lceil\log\lceil\log n\rceil\rceil+(\lceil\log n\rceil+d)+(d-1)(\lceil\log\lceil\log n\rceil\rceil+1)+\lceil\log(d+1)\rceil\\
&=L_{1}-1,
\end{aligned}$$

where the inequality is obtained by noting that for all sufficiently large $n$ we have $L_{1}+1\leqslant 2\lceil\log n\rceil$. Hence, each replacement shortens $\bar{\mathbf{w}}$ by at least one symbol, so the loop executes at most $\lvert\mathbf{w}\rvert-L_{1}+1$ times and the algorithm eventually terminates.
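The length bound above only holds once $n$ is large enough that $L_{1}+1\leqslant 2\lceil\log n\rceil$. The following sketch evaluates both sides for concrete values, assuming base-2 logarithms; the function names are hypothetical.

```python
def clog2(m):
    """ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def replacement_lengths(n, d):
    """Return (L1, length of the inserted sequence in (3)) for given n and d."""
    log_n = clog2(n)
    loglog_n = clog2(log_n)
    L1 = log_n + (2 * d - 1) * loglog_n + 6 * d + clog2(d + 1)
    marker_and_ones = 4 * d + d * loglog_n   # four runs 1^d plus the zero marker
    index_part = log_n + d                   # |B(i)|
    diff_part = (d - 1) * clog2(L1 + 1)      # |EncDist_{L1, d-1}(., .)|
    tail = clog2(d + 1)                      # 0^{ceil(log(d+1))} or b(j - j_p + L1)
    return L1, marker_and_ones + index_part + diff_part + tail
```

For $d=2$, the inserted sequence has length $71=L_{1}-1$ when $n=2^{40}$, whereas for $n=2^{10}$ it has length $36=L_{1}$, so no shortening occurs; this illustrates why the statement is restricted to sufficiently large $n$.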

Input: a $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL sequence $\mathbf{w}\in\Sigma^{n-K_{1}-K_{2}}$
Output: a sequence $\bar{\mathbf{w}}\in\Sigma^{\leqslant n-K_{1}-K_{2}}$
Set $\bar{\mathbf{w}}=\mathbf{w}$ and $j_{p}=0$
while there are two length-$L_{1}$ substrings in $\bar{\mathbf{w}}$ whose Hamming distance is at most $d-1$ do
     Suppose $(i,j)$ is a primal $(L_{1},d-1)$-close window pair in $\bar{\mathbf{w}}$ (then necessarily $j>j_{p}-L_{1}$)
     if $j>j_{p}-L_{1}+d$ then
         Remove the substring of length $L_{1}$ starting at position $j$ and replace it with the sequence
         $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\circ 1^{d}$
     else
         Set $\bar{\mathbf{w}}[j_{p}+d]$ to be ‘1’
         Remove the substring of length $L_{1}$ starting at position $j$ and replace it with the sequence
         $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ b(j-j_{p}+L_{1})\circ 1^{d}$
     end if
     $j_{p}\leftarrow j$
end while
return $\bar{\mathbf{w}}$
Algorithm 1 Primal Pair Elimination Encoder $\mathcal{E}_{2}$ for Generating $(L_{1},d)$-SD Sequences
Lemma 15.

The output sequence $\bar{\mathbf{w}}$ is $(K_{1},d)$-WWL and $(L_{1},d)$-SD, and the input sequence $\mathbf{w}$ can be recovered from $\bar{\mathbf{w}}$, for all sufficiently large $n$.

Proof:

The while loop ensures that the output $\bar{\mathbf{w}}$ of Algorithm 1 is an $(L_{1},d)$-SD sequence. Moreover, since $\mathbf{w}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL and $K_{1}=d\lceil\log\lceil\log n\rceil\rceil+d$, one can verify by a routine case analysis that for all large enough $n$, $\bar{\mathbf{w}}$ is $(K_{1},d)$-WWL. In particular, even if $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})$ is all zeros, for all large enough $n$

$$K_{1}-\left\lvert\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\right\rvert\geqslant d,$$

and a substring of length $K_{1}$ containing $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}$ must also contain at least $d$ of the surrounding $1$’s.

Next, we show that after each replacement the latest inserted substring always starts with the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$. Let $\bar{\mathbf{w}}^{(k)}$ be the sequence $\bar{\mathbf{w}}$ after the $k$-th replacement. We prove this by induction on $k$. When $k=1$, since $\mathbf{w}=\bar{\mathbf{w}}^{(0)}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$-WWL, the marker $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ appears exactly once in $\bar{\mathbf{w}}^{(1)}$, and so the claim holds. Now, in the $k$-th replacement, $j$ denotes the position of the substring removed in this replacement, while $j_{p}$ denotes the position of the substring removed in the $(k-1)$-th replacement. According to the inductive assumption, the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k-1)}$ starts at the position $j_{p}$. If $j\geqslant j_{p}$, then the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ is $\bar{\mathbf{w}}^{(k)}_{j+[d\lceil\log\lceil\log n\rceil\rceil+d]}$. If $j_{p}-L_{1}+d<j<j_{p}$, the overlap of $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ and $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$ has length greater than $d$. Since the sequence inserted in the $k$-th replacement ends with a symbol ‘1’, it destroys the marker in $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$. If $j_{p}-L_{1}<j\leqslant j_{p}-L_{1}+d$, we set $\bar{\mathbf{w}}^{(k)}[j_{p}+d]$ to be ‘1’ to destroy the marker in $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$. In all cases, the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ is always $\bar{\mathbf{w}}^{(k)}_{j+[d\lceil\log\lceil\log n\rceil\rceil+d]}$.

Now, given the sequence $\bar{\mathbf{w}}^{(k)}$, we first search for the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ to determine the position $j$. Then from the substring $\bar{\mathbf{w}}^{(k)}_{j+[L_{1}-1]}$ we can decode $i$, the difference between $\bar{\mathbf{w}}^{(k-1)}_{i+[L_{1}]}$ and $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$, and $b(j-j_{p}+L_{1})$. Note that $\bar{\mathbf{w}}^{(k-1)}_{i+[\min\{L_{1},j-i\}]}=\bar{\mathbf{w}}^{(k)}_{i+[\min\{L_{1},j-i\}]}$, so we can recover $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ symbol by symbol from left to right. We remove $\bar{\mathbf{w}}^{(k)}_{j+[L_{1}-1]}$ from $\bar{\mathbf{w}}^{(k)}$ and replace it with $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$. If $b(j-j_{p}+L_{1})\neq 0^{\lceil\log(d+1)\rceil}$, we further set the symbol in the position $j_{p}+d$ to be ‘0’. In this way, we recover the sequence $\bar{\mathbf{w}}^{(k-1)}$. We repeat this procedure until there is no substring $0^{d\lceil\log\lceil\log n\rceil\rceil}$. Then the resulting sequence is the required $\mathbf{w}$. ∎
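Two ingredients of this decoding step can be sketched without the index encoder $B$ of [14], which is not reproduced here: locating the rightmost marker, and rebuilding the removed window when the two close windows overlap. In the sketch, the difference is given as a list of 0-based offsets inside the window (an assumption made for brevity rather than the $\operatorname{EncDist}$ format), and both function names are hypothetical.

```python
def rightmost_marker(w, d, zeros):
    """Start index of the rightmost 1^d 0^zeros in w, or -1 if it does not occur."""
    return w.rfind("1" * d + "0" * zeros)


def recover_removed_window(prefix, i, j, L1, diff_offsets):
    """Rebuild the removed window w^{(k-1)}[j : j + L1].

    prefix       -- w^{(k-1)}[0 : j], which the k-th replacement left unchanged
    diff_offsets -- 0-based offsets t where the windows at i and j disagree
    """
    flips = set(diff_offsets)
    known = list(prefix)  # symbols of w^{(k-1)} recovered so far
    for t in range(L1):
        # Position i + t is already known because i + t < j + t.
        bit = known[i + t]
        if t in flips:
            bit = "1" if bit == "0" else "0"
        known.append(bit)  # this is w^{(k-1)}[j + t]
    return "".join(known[j:])
```

The left-to-right order matters when $j-i<L_{1}$: the window at $i$ then reaches into the removed window itself, and each newly recovered symbol becomes available just before it is needed.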

Now, we need to extend the sequence $\bar{\mathbf{w}}$ to a long sequence of length $n$ while keeping the property of being $(L,d)$-SD.

Lemma 16.

Assume $n$ is sufficiently large. Let $\bar{\mathbf{w}}$ be an output of Algorithm 1. Recall that $K_{2}=3\lceil\frac{3}{2}\log L_{2}\rceil$. By invoking Theorem 14 with parameters “$K=K_{2}$” and “$L=L_{2}$”, we get a collection of $(K_{2},d)$-WWL sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ of length $L_{2}-L_{p}$, where $L_{p}=K_{2}+d\lceil\log d\rceil+2d$. Let

$$\bar{\mathbf{s}}\triangleq 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{M-1},$$

where $K_{\max}=\max\{K_{1},K_{2}\}$. Set

$$\hat{\mathbf{w}}=\mathcal{E}_{3}(\bar{\mathbf{w}})\triangleq(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1].$$

Then $\hat{\mathbf{w}}$ is a $(K,d)$-WWL and $(L,d)$-SD sequence where $K=2(K_{1}+K_{2})$ and $L=\max\{L_{1}+K_{2}+K_{\max}+\ell,L_{2}+2K_{1}+K_{\max}+\ell\}$. Moreover, $\bar{\mathbf{w}}$ can be recovered from $\hat{\mathbf{w}}$.

Proof:

We first prove that $\bar{\mathbf{s}}$ is a $(K_{\max}+K_{2},d)$-WWL and $(L_{2}+K_{\max}-K_{2},d)$-SD sequence of length at least $n$. According to the construction, the length of $\bar{\mathbf{s}}$ is $M(L_{2}+K_{\max}-K_{2})\geqslant ML_{2}$. Recall that $\log M=L_{2}-3d\log L_{2}-7.5\log L_{2}-O(1)$ and $L_{2}=\lceil\log n\rceil+(3d+7)\lceil\log\lceil\log n\rceil\rceil$. Then

$$ML_{2}=2^{L_{2}-3d\log L_{2}-6.5\log L_{2}-O(1)}=\frac{2^{L_{2}}}{2^{O(1)}L_{2}^{3d+6.5}}\geqslant\frac{n(\log n)^{3d+7}}{2^{O(1)}(\log n+(3d+6.5)\log\log n)^{3d+6.5}}>n.\tag{5}$$

Hence, $\bar{\mathbf{s}}$ has length at least $n$. Note that each $\mathbf{s}_{i}$ is a $(K_{2},d)$-WWL sequence and the length-$d$ prefix of $\mathbf{u}$ is $1^{d}$. It follows that $\bar{\mathbf{s}}$ is a $(K_{\max}+K_{2},d)$-WWL sequence. Moreover, note that the sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ satisfy the conditions (P1)-(P3) with “$K=K_{2}$”. If $K_{2}\geqslant K_{1}$ (namely, $K_{\max}=K_{2}$), then by Proposition 13, the sequence $\bar{\mathbf{s}}$ is an $(L_{2},d)$-SD sequence, hence also an $(L_{2}+K_{\max}-K_{2},d)$-SD sequence. If $K_{2}<K_{1}$, since the property of being $(K_{2},d)$-WWL implies the property of being $(K_{\max},d)$-WWL, the sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ also satisfy the conditions (P1)-(P3) with “$K=K_{\max}$”. (In this case, we take “$L=L_{2}+K_{\max}-K_{2}$”, “$K=K_{\max}$”, “$L_{p}=K+\ell$”, and so “$L-L_{p}=L_{2}-K_{2}-\ell$”, which is equal to the length of the $\mathbf{s}_{i}$’s.) Again, by Proposition 13, the sequence $\bar{\mathbf{s}}$ is an $(L_{2}+K_{\max}-K_{2},d)$-SD sequence.

We have shown that $\bar{\mathbf{s}}$ is a $(K_{\max}+K_{2},d)$-WWL sequence in the above paragraph and that $\bar{\mathbf{w}}$ is a $(K_{1},d)$-WWL sequence in Lemma 15. By using the fact that $K_{1}>d$ and that the $\mathbf{u}$ substring of $\bar{\mathbf{s}}$ starts with $1^{d}$, it follows that the sequence $\hat{\mathbf{w}}=(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1]$ is $(2(K_{1}+K_{2}),d)$-WWL. Now, we shall show that it is also $(L,d)$-SD. For any two substrings $\hat{\mathbf{w}}_{i+[L]}$ and $\hat{\mathbf{w}}_{j+[L]}$ with $i,j\in[n-L+1]$ and $i<j$, we consider the following cases:

Case 1: $i<j\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$. Then

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\geqslant d,$$

where the first inequality holds since $L\geqslant L_{1}$ and the second inequality holds since $\bar{\mathbf{w}}$ is an $(L_{1},d)$-SD sequence.

Case 2: $i\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$ and $\lvert\bar{\mathbf{w}}\rvert-L_{1}+1\leqslant j\leqslant\lvert\bar{\mathbf{w}}\rvert$. Since $L-L_{1}\geqslant K_{2}+K_{\max}+\ell$, where $\ell$ is the length of $\mathbf{u}$, the substring $\hat{\mathbf{w}}_{j+[L]}$ must contain $0^{K_{2}+K_{\max}}\circ\mathbf{u}$ as a substring. Assume that $\hat{\mathbf{w}}_{j+\delta+[K_{2}+K_{\max}+\ell]}=0^{K_{2}+K_{\max}}\circ\mathbf{u}$ for some $\delta\in[L_{1}]$. If $j-i\leqslant d$, then

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+K_{2}+K_{\max}+[\ell]},\hat{\mathbf{w}}_{j+\delta+K_{2}+K_{\max}+[\ell]})=d_{H}(0^{j-i}\circ\mathbf{u}[0,\ell-(j-i)-1],\mathbf{u})\geqslant d,$$

where the last inequality follows from the definition of a $d$-auto-cyclic sequence. If $d<j-i\leqslant K_{2}+K_{\max}$, since the prefix of $\mathbf{u}$ is $1^{d}$, then

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+K_{2}+K_{\max}+[d]},\hat{\mathbf{w}}_{j+\delta+K_{2}+K_{\max}+[d]})=d_{H}(0^{d},1^{d})=d.$$

If $j-i>K_{2}+K_{\max}$, then $i+\delta+K_{2}+K_{\max}<j+\delta$, and so $\hat{\mathbf{w}}_{i+\delta+[K_{2}+K_{\max}]}$ is a substring of $\bar{\mathbf{w}}$. Hence,

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+[K_{2}+K_{\max}]},\hat{\mathbf{w}}_{j+\delta+[K_{2}+K_{\max}]})=d_{H}(\hat{\mathbf{w}}_{i+\delta+[K_{2}+K_{\max}]},0^{K_{2}+K_{\max}})\geqslant d,$$

where the last inequality holds since $\bar{\mathbf{w}}$ is a $(K_{1},d)$-WWL sequence.

Case 3 and Case 4, which now follow, together cover the case of $i\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$ and $j>\lvert\bar{\mathbf{w}}\rvert$ and the case of $\lvert\bar{\mathbf{w}}\rvert-L_{1}<i<\lvert\bar{\mathbf{w}}\rvert$ and $i<j$.

Case 3: $i\leqslant\lvert\bar{\mathbf{w}}\rvert-(L_{2}+2K_{1}-K_{2})$ $(\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1})$ and $j>\lvert\bar{\mathbf{w}}\rvert$. Denote $L^{\prime}\triangleq(L_{2}-K_{2})+2K_{1}$. Then $L\geqslant L^{\prime}$. Note that $\hat{\mathbf{w}}_{j+[L^{\prime}]}$ always contains $0^{K_{1}}$ as a substring, and $\hat{\mathbf{w}}_{i+[L^{\prime}]}$ is a substring of $\bar{\mathbf{w}}$, which is $(K_{1},d)$-WWL. Hence,

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+[L^{\prime}]},\hat{\mathbf{w}}_{j+[L^{\prime}]})\geqslant d.$$

Case 4: $\lvert\bar{\mathbf{w}}\rvert-(L_{2}+2K_{1}-K_{2})+1\leqslant i<\lvert\bar{\mathbf{w}}\rvert$ and $i<j$. Since $L\geqslant(L_{2}+2K_{1}-K_{2})+K_{2}+K_{\max}+\ell$, the substring $\hat{\mathbf{w}}_{i+[L]}$ must contain $0^{K_{2}+K_{\max}}\circ\mathbf{u}$ as a substring. If $j-i\leqslant K_{2}+K_{\max}$, then $\hat{\mathbf{w}}_{j+[L]}$ must contain $\mathbf{u}$ as a substring, and so, with the same argument as that in Case 2, one can show that $d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d$. If $j-i>K_{2}+K_{\max}$, assume that $\hat{\mathbf{w}}_{i+\delta^{\prime}+[K_{2}+K_{\max}]}$ is the all-zero substring of length $K_{2}+K_{\max}$. Then $j+\delta^{\prime}>i+\delta^{\prime}+K_{2}+K_{\max}$. It follows that $\hat{\mathbf{w}}_{j+\delta^{\prime}+[K_{2}+K_{\max}]}$ is a substring of $\bar{\mathbf{s}}$, which is $(K_{2}+K_{\max},d)$-WWL. Hence,

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta^{\prime}+[K_{2}+K_{\max}]},\hat{\mathbf{w}}_{j+\delta^{\prime}+[K_{2}+K_{\max}]})\geqslant d.$$

Case 5: $\lvert\bar{\mathbf{w}}\rvert\leqslant i<j$. Then

$$d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+K_{2}+[L-K_{2}]},\hat{\mathbf{w}}_{j+K_{2}+[L-K_{2}]})=d_{H}(\bar{\mathbf{s}}_{i-\lvert\bar{\mathbf{w}}\rvert+[L-K_{2}]},\bar{\mathbf{s}}_{j-\lvert\bar{\mathbf{w}}\rvert+[L-K_{2}]})\geqslant d,$$

where the second inequality holds since $L-K_{2}\geqslant L_{2}+K_{\max}-K_{2}$ and $\bar{\mathbf{s}}$ is $(L_{2}+K_{\max}-K_{2},d)$-SD.

Finally, note that in the sequence $\hat{\mathbf{w}}$ there is exactly one run of ‘0’ which has length at least $K_{2}+K_{\max}$. So we can search for the rightmost $0^{K_{2}+K_{\max}}$ in $\hat{\mathbf{w}}$ and remove this substring as well as the suffix after it to recover the sequence $\bar{\mathbf{w}}$. ∎
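The expansion map $\mathcal{E}_{3}$ and the recovery rule from the last paragraph of the proof are short enough to sketch directly. The sequence $\bar{\mathbf{s}}$ is taken as an input here, since the construction of the blocks $\mathbf{s}_{i}$ from [6] is not reproduced; `expand` and `recover` are hypothetical names, and the toy values used to exercise them are not valid outputs of Theorem 14.

```python
def expand(w_bar, s_bar, K2, n):
    """E_3: return (w_bar . 0^{K2} . s_bar)[0, n-1]."""
    return (w_bar + "0" * K2 + s_bar)[:n]


def recover(w_hat, K2, Kmax):
    """Drop the rightmost 0^{K2+Kmax} and everything after it."""
    start = w_hat.rfind("0" * (K2 + Kmax))
    if start < 0:
        raise ValueError("separator 0^{K2+Kmax} not found")
    return w_hat[:start]
```

Because the rightmost occurrence of $0^{K_{2}+K_{\max}}$ ends immediately before the first copy of $\mathbf{u}$, it starts exactly at position $\lvert\bar{\mathbf{w}}\rvert$, even when $\bar{\mathbf{w}}$ itself ends with zeros.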

Theorem 17.

Let $\mathcal{E}_{\mathtt{SD}}(\cdot)\triangleq\mathcal{E}_{3}(\mathcal{E}_{2}(\mathcal{E}_{1}(\cdot)))$. Then, for $n$ large enough, $\mathcal{E}_{\mathtt{SD}}:\Sigma^{n^{\prime}}\rightarrow\Sigma^{n}$ is invertible and can encode sequences of $\Sigma^{n^{\prime}}$ into $(K,d)$-WWL and $(L,d)$-SD sequences where $K=(2d+9)\log\log n+O(1)$ and

$$L\begin{cases}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d&\text{if $d\geqslant 5$},\\ \leqslant\lceil\log n\rceil+(5d+11.5)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+4d+7.5&\text{otherwise.}\end{cases}$$

Moreover, $n-n^{\prime}=\Theta(n/\log n)$, and so, we have that

$$\lim_{n\to\infty}\frac{n^{\prime}}{n}=1.$$

Proof:

The statement about $\mathcal{E}_{\mathtt{SD}}$ follows from Lemma 15 and Lemma 16. Recall that the encoder $\mathcal{E}_{1}$ requires $\Theta(n/\log n)$ redundancy symbols (see (2)) and $K_{1}+K_{2}=\Theta(\log\log n)$. Hence,

$$n-n^{\prime}=K_{1}+K_{2}+\Theta(n/\log n)=\Theta(n/\log n).$$

IV Generalized Reconstruction from Noisy Substring Trace

In this section, we give constructions of $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes. Our first result generalizes Proposition 2 and Proposition 5, and shows that the property of being $(L_{\rm over},d)$-substring distant implies the property of being $(L_{\rm min},L_{\rm over},e)$-trace reconstructible.

Proposition 18.

Suppose that $L_{\rm min}>L_{\rm over}$. If a sequence $\mathbf{x}\in\Sigma^{n}$ is $(L_{\rm over},4e+1)$-substring distant, then $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$-trace reconstructible.

Proof:

Let $\mathcal{Y}=\{\mathbf{y}^{(0)},\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(m-1)}\}$ be an $(L_{\rm min},L_{\rm over},e)$-erroneous trace of $\mathbf{x}$ where the location of each $\mathbf{y}^{(j)}$ in $\mathbf{x}$ is $i_{j}$. Since $\mathbf{x}$ is $(L_{\rm over},4e+1)$-substring distant, for any two substrings $\mathbf{y}^{(j)}$ and $\mathbf{y}^{(j^{\prime})}$ and any two of their subsubstrings $\mathbf{y}^{(j)}_{k+[L_{\rm over}]}$ and $\mathbf{y}^{(j^{\prime})}_{k^{\prime}+[L_{\rm over}]}$, we have that

$$d_{H}\left(\mathbf{y}^{(j)}_{k+[L_{\rm over}]},\mathbf{y}^{(j^{\prime})}_{k^{\prime}+[L_{\rm over}]}\right)\begin{cases}\geqslant 2e+1&\text{ if $i_{j}+k\neq i_{j^{\prime}}+k^{\prime}$},\\ \leqslant 2e&\text{ if $i_{j}+k=i_{j^{\prime}}+k^{\prime}$}.\end{cases}$$

Therefore, 𝐲(0)\mathbf{y}^{(0)} can be identified as the unique substring 𝐲𝒴\mathbf{y}\in\mathcal{Y} whose length-LoverL_{\rm over} prefix is of Hamming distance at least 2e+12e+1 from every length-LoverL_{\rm over} subsubstring of any other 𝐲𝒴\{𝐲}\mathbf{y}^{\prime}\in\mathcal{Y}\backslash\{\mathbf{y}\}. Denote the length-LoverL_{\rm over} suffix of 𝐲(0)\mathbf{y}^{(0)} as 𝐬0\mathbf{s}_{0}. Then we can identify the substrings 𝐲\mathbf{y}’s in 𝒴\mathcal{Y} which overlap 𝐲(0)\mathbf{y}^{(0)} at at least LoverL_{\rm over} positions, since each of them contains a unique length-LoverL_{\rm over} subsubstring 𝐰\mathbf{w} whose distance from 𝐬0\mathbf{s}_{0} is at most 2e2e. Furthermore, the locations of these substrings in 𝐱\mathbf{x} can be determined by aligning the subsubstring 𝐰\mathbf{w} and the suffix 𝐬0\mathbf{s}_{0}. Assume that there are mm^{\prime} such substrings. Then we have identified the substrings 𝐲(1),,𝐲(m)𝒴\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(m^{\prime})}\in\mathcal{Y}. Next, we consider the length-LoverL_{\rm over} suffix of 𝐲(m)\mathbf{y}^{(m^{\prime})} and we can identify all the subsrings in 𝒴\mathcal{Y} which overlap 𝐲(m)\mathbf{y}^{(m^{\prime})} at at least LoverL_{\rm over} positions. We repeat the procedure above. Finally, we can determine the location of every substring 𝐲𝒴\mathbf{y}\in\mathcal{Y} in 𝐱\mathbf{x}. ∎
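The alignment step in the proof can be illustrated with a small sketch. It is only a toy under stated assumptions: the helper names `hamming`, `is_substring_distant`, and `find_overlap` are hypothetical, the substring-distant test assumes the property means that any two length-$L$ windows at distinct positions differ in at least $d$ places, and the example uses the noiseless case $e=0$ on a short binary string.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def is_substring_distant(x: str, L: int, d: int) -> bool:
    """Assumed definition: any two length-L windows of x at distinct
    positions are at Hamming distance at least d."""
    windows = [x[i:i + L] for i in range(len(x) - L + 1)]
    return all(hamming(u, v) >= d for u, v in combinations(windows, 2))

def find_overlap(prev: str, nxt: str, L: int, e: int):
    """Return the unique offset k such that nxt[k:k+L] is within distance 2e
    of the length-L suffix of prev, or None if there is no unique match."""
    s = prev[-L:]
    hits = [k for k in range(len(nxt) - L + 1)
            if hamming(nxt[k:k + L], s) <= 2 * e]
    return hits[0] if len(hits) == 1 else None

# Toy example: all length-3 windows of x are distinct, so x is (3, 1)-substring distant.
x = "0001011100"
prev_read, next_read = x[0:6], x[2:9]
offset = find_overlap(prev_read, next_read, L=3, e=0)
```

Here the suffix of `prev_read` starts at position 3 of `x`, and the matching window of `next_read` sits at offset 1, which places `next_read` at position 2 as expected.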

Combining Theorem 17 and Proposition 18, we have the following result.

Corollary 19.

Suppose that $L_{\rm over}=\lceil\log n\rceil+(24e+13)\lceil\log\lceil\log n\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$ and $L_{\rm min}>L_{\rm over}$. If $n$ is sufficiently large, then there is an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code of $\Sigma^{n}$ whose rate is $1-o(1)$.
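As a sanity check on the constants, the value of $L_{\rm over}$ in Corollary 19 is what one gets by substituting $d=4e+1$ into the $d\geqslant 5$ branch of Theorem 17. The sketch below evaluates both expressions; the function names are hypothetical and the logarithm is assumed to be base 2.

```python
def ceil_log2(m: int) -> int:
    """Exact ceil(log2(m)) for a positive integer m."""
    return (m - 1).bit_length()

def theorem17_length(n: int, d: int) -> int:
    """Window length from the d >= 5 branch of Theorem 17."""
    log_n = ceil_log2(n)
    return log_n + (6 * d + 7) * ceil_log2(log_n) + d * ceil_log2(d) + 5 * d

def corollary19_overlap(n: int, e: int) -> int:
    """L_over as written in Corollary 19 (requires e >= 1 so that 4e+1 >= 5)."""
    log_n = ceil_log2(n)
    return (log_n + (24 * e + 13) * ceil_log2(log_n)
            + (4 * e + 1) * ceil_log2(4 * e + 1) + 20 * e + 5)
```

For instance, with $n=2^{20}$ and $e=1$ both expressions evaluate to $20+37\cdot 5+5\cdot 3+25=245$.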

Now, we consider another parameter regime. Suppose that

$$\begin{aligned}L_{\rm min}&=\lceil a\log n\rceil,\\ L_{\rm over}&=\lceil\gamma L_{\rm min}\rceil,\end{aligned}$$

where $a>1$ and $0<a\gamma\leqslant 1$ are real constants. We are going to construct an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code whose rate approaches $\frac{1-1/a}{1-\gamma}$. The basic idea of our code construction is similar to the one in [16] for the noiseless scenario: A message $\mathbf{m}$ is encoded into a codeword $\mathbf{w}=\mathbf{w}_{0}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{w}_{2^{I}-1}$ such that

  1. (i)

    the index $i$ can be decoded from any length-$L_{\rm min}$ substring of $\mathbf{w}_{i}$ even if the substring is corrupted by at most $e$ errors;

  2. (ii)

    $\mathbf{w}_{i}$ can be reconstructed from any of its $(L_{\rm min},L_{\rm over},e)$-erroneous traces.

To this end, our construction leverages the map $\mathcal{E}_{\mathtt{SD}}$ from Section III, which can encode WWL and SD sequences, as well as the following coded indices $\mathbf{c}_{i}$, which are generated from a robust positioning sequence.

Construction A (Index Construction).

Given $e$, let

$$\begin{aligned}d_{1}&\triangleq 2e+1,\\ d_{2}&\triangleq 4e+1.\end{aligned}$$

Additionally, set

$$\begin{aligned}I&\triangleq\left\lceil\frac{1-\gamma a}{1-\gamma}\log n+(\log n)^{0.5+\epsilon}\right\rceil,\\ r_{I}&\triangleq\left\lceil(3d_{1}+8)\log I\right\rceil,\end{aligned}$$

where $0<\epsilon<0.5$ is an arbitrary fixed number which is independent of $n$. Then

$$(I+r_{I})-(3d_{1}+7.5)\log(I+r_{I})-O(1)=I+0.5\log I-O(1)>I,$$

where we assume $e,a,\gamma,\epsilon$ are constants, and $n\to\infty$. Applying Theorem 14 with $L=I+r_{I}$, there is an explicit construction of sequences $\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}}$ such that the concatenation

$$\mathbf{c}\triangleq\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$$

is an $(I+r_{I},d_{1})$-SD sequence. Moreover, according to the remark following Theorem 14, each $\mathbf{c}_{i}$ is $(3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}},d_{1})$-WWL where

$$\ell_{d_{1}}\triangleq d_{1}\lceil\log d_{1}\rceil+2d_{1}$$

is the length of the $d_{1}$-auto-cyclic sequence $\mathbf{u}$. Denote

$$\begin{aligned}K&\triangleq\left\lceil\sqrt{\log n}\right\rceil,\\ F&\triangleq\left\lceil\frac{I+r_{I}}{K}\right\rceil.\end{aligned}$$

For each $i\in[2^{I}]$, we partition the sequence $\mathbf{c}_{i}$ into segments $\mathbf{c}_{i}^{(0)},\mathbf{c}_{i}^{(1)},\ldots,\mathbf{c}_{i}^{(F-1)}$, each of length $\lceil\frac{I+r_{I}}{F}\rceil$ or $\lfloor\frac{I+r_{I}}{F}\rfloor$.
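To get a feel for the sizes involved, the parameters of Construction A can be evaluated numerically. The following is a minimal sketch, assuming base-2 logarithms and taking $\log n$ directly as an input; `index_parameters` and `split_balanced` are hypothetical helper names, and the sketch does not build the robust positioning sequence itself.

```python
import math

def index_parameters(log_n: float, a: float, gamma: float, eps: float, e: int) -> dict:
    """Evaluate d1, d2, I, r_I, K, F and ell_{d1} from Construction A."""
    d1, d2 = 2 * e + 1, 4 * e + 1
    I = math.ceil((1 - gamma * a) / (1 - gamma) * log_n + log_n ** (0.5 + eps))
    r_I = math.ceil((3 * d1 + 8) * math.log2(I))
    K = math.ceil(math.sqrt(log_n))
    F = math.ceil((I + r_I) / K)
    ell_d1 = d1 * math.ceil(math.log2(d1)) + 2 * d1
    return dict(d1=d1, d2=d2, I=I, r_I=r_I, K=K, F=F, ell_d1=ell_d1)

def split_balanced(s: str, F: int) -> list:
    """Split s into F consecutive segments whose lengths are
    ceil(len(s)/F) or floor(len(s)/F)."""
    q, rem = divmod(len(s), F)
    out, pos = [], 0
    for h in range(F):
        size = q + (1 if h < rem else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out

params = index_parameters(log_n=1000.0, a=2.0, gamma=0.25, eps=0.25, e=1)
segments = split_balanced("0" * (params["I"] + params["r_I"]), params["F"])
```

Because $F\geqslant (I+r_I)/K$, every segment produced this way has length at most $K$.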

In the following, we first consider the case of $L_{\rm min}\mid n$ and give the code construction. Then we will show how to modify this construction to settle the other cases.

IV-A The case of $L_{\rm min}\mid n$

Let us define

$$\begin{aligned}r&\triangleq I+r_{I}+K+\ell_{d_{1}}+d_{1},\\ L&\triangleq\left\lceil\left(L_{\rm over}-K-\ell_{d_{1}}-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right)\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}\right\rceil.\end{aligned}$$

We note that by our choice of parameters, $L_{\rm min}>r$ for all sufficiently large $n$. Assume that $L_{\rm min}\mid n$ and denote $n_{L}\triangleq\frac{n}{L_{\rm min}}$. For each $i\in[2^{I}]$, let

$$N_{i}\triangleq\begin{cases}\lceil n_{L}/2^{I}\rceil(L_{\rm min}-r)&\text{if } i<n_{L}\bmod 2^{I},\\ \lfloor n_{L}/2^{I}\rfloor(L_{\rm min}-r)&\text{otherwise.}\end{cases}\qquad(6)$$

Then $\sum_{i\in[2^{I}]}N_{i}=n_{L}(L_{\rm min}-r)$.
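The identity $\sum_{i}N_{i}=n_{L}(L_{\rm min}-r)$ simply says that the $n_{L}$ blocks are spread as evenly as possible over the $2^{I}$ indices. A short sketch (with a hypothetical function name and toy parameter values) makes this concrete.

```python
def block_lengths(n_L: int, I: int, L_min: int, r: int) -> list:
    """N_i from equation (6): the first (n_L mod 2^I) indices receive
    ceil(n_L / 2^I) blocks, the rest floor(n_L / 2^I) blocks,
    each block carrying L_min - r payload symbols."""
    q, rem = divmod(n_L, 2 ** I)
    return [(q + (1 if i < rem else 0)) * (L_min - r) for i in range(2 ** I)]

# Toy values: 37 blocks over 2^3 = 8 indices, 30 payload symbols per block.
N = block_lengths(n_L=37, I=3, L_min=50, r=20)
```

With these toy values the first five indices get five blocks each and the remaining three get four, for a total of $37\cdot 30=1110$ payload symbols.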

Lemma 20.

Let $K,L,N_{i}$ be defined as above, and assume $n$ is large enough. Then for each $i\in[2^{I}]$ there is an integer $m(N_{i})$ with $N_{i}-m(N_{i})=\Theta(N_{i}/\log N_{i})$ and an invertible map $\mathcal{E}_{\mathtt{SD}}^{(i)}:\Sigma^{m(N_{i})}\rightarrow\Sigma^{N_{i}}$ which can encode sequences of $\Sigma^{m(N_{i})}$ into $(\lfloor K/4\rfloor,d_{2})$-WWL and $(L,d_{2})$-SD sequences.

Proof:

We shall apply Theorem 17 to prove this lemma. To this end, we first need to verify that $N_{i}$ can be arbitrarily large. As noted before, $L_{\rm min}-r>0$. Additionally, $n_{L}=\Theta(n/\log n)$ and $2^{I}=n^{\frac{1-\gamma a}{1-\gamma}(1+o(1))}$, and by our choice of parameters, $\frac{1-\gamma a}{1-\gamma}<1$ is a constant. Hence, $N_{i}\to\infty$ as $n\to\infty$.

Next, we need to verify that $\lfloor K/4\rfloor$ and $L$ satisfy the two conditions in Theorem 17. Regarding the value of $K$, we need to show that $\lfloor K/4\rfloor\geqslant(2d_{2}+9)\log\log N_{i}+O(1)$. Noting that $r_{I}=\lceil(3d_{1}+8)\log I\rceil=O(\log\log n)$ and $K=O(\sqrt{\log n})$, we have that

$$\begin{aligned}1-\frac{r}{L_{\rm min}}&=1-\frac{I+r_{I}+K+\ell_{d_{1}}+d_{1}}{L_{\rm min}}=1-\frac{\left(\frac{1-\gamma a}{1-\gamma}\right)\log n+(\log n)^{0.5+\epsilon}+O(\sqrt{\log n})}{a\log n+O(1)}\\ &=1-\left(\frac{1/a-\gamma}{1-\gamma}+\frac{1}{a(\log n)^{0.5-\epsilon}}+O\left(\frac{1}{\sqrt{\log n}}\right)\right)\frac{a\log n}{a\log n+O(1)}\\ &=1-\left(\frac{1/a-\gamma}{1-\gamma}+\frac{1}{a(\log n)^{0.5-\epsilon}}+O\left(\frac{1}{\sqrt{\log n}}\right)\right)\left(1-O\left(\frac{1}{\log n}\right)\right)\\ &=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left(\frac{1}{\sqrt{\log n}}\right).\end{aligned}$$

It follows that

$$\begin{aligned}\log N_{i}&=\log\left(\frac{n_{L}}{2^{I}}(L_{\rm min}-r)\right)\pm O(1)=\log\left(\frac{n}{2^{I}}\left(1-\frac{r}{L_{\rm min}}\right)\right)\pm O(1)\\ &=\log n-I\pm O(1)=\frac{\gamma a-\gamma}{1-\gamma}\log n-(\log n)^{0.5+\epsilon}\pm O(1).\end{aligned}$$

Since $K=\left\lceil\sqrt{\log n}\right\rceil$, we have that $\lfloor K/4\rfloor$ is substantially larger than $(2d_{2}+9)\log\log N_{i}+O(1)$.

Now, we verify the condition on $L$, namely that $L\geqslant\log N_{i}+(6d_{2}+7)\log\log N_{i}+O(1)$. Note that

$$\begin{aligned}\frac{I+r_{I}}{L_{\rm min}-r}&=\frac{I+O(\log\log n)}{L_{\rm min}-I-O(\sqrt{\log n})}=\frac{I}{L_{\rm min}-I}\cdot\frac{1+O(\log\log n/\log n)}{1-O(1/\sqrt{\log n})}\\ &=\frac{I}{L_{\rm min}-I}\left(1+O\left(\frac{1}{\sqrt{\log n}}\right)\right)=\frac{I}{L_{\rm min}-I}+O\left(\frac{1}{\sqrt{\log n}}\right),\end{aligned}$$

and

$$\left\lceil\frac{I+r_{I}}{F}\right\rceil=\left\lceil\frac{I+r_{I}}{\lceil(I+r_{I})/K\rceil}\right\rceil\leqslant\frac{I+r_{I}}{(I+r_{I})/K}+1=K+1.$$

Hence, we have that

$$\begin{aligned}L&=\left\lceil\left(L_{\rm over}-K-\ell_{d_{1}}-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right)\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}\right\rceil\\ &\geqslant\frac{L_{\rm over}-3K-\ell_{d_{1}}-d_{1}-2}{1+(I+r_{I})/(L_{\rm min}-r)}=\frac{L_{\rm over}-O(\sqrt{\log n})}{\frac{L_{\rm min}}{L_{\rm min}-I}+O\left(1/\sqrt{\log n}\right)}\\ &=\frac{L_{\rm over}(L_{\rm min}-I)}{L_{\rm min}}\cdot\frac{1-O(1/\sqrt{\log n})}{1+O(1/\sqrt{\log n})}\\ &\geqslant\gamma\left(a\log n-\frac{1-\gamma a}{1-\gamma}\log n-(\log n)^{0.5+\epsilon}-1\right)\left(1-O\left(\frac{1}{\sqrt{\log n}}\right)\right)\\ &=\frac{\gamma a-\gamma}{1-\gamma}\log n-\gamma(\log n)^{0.5+\epsilon}-O(\sqrt{\log n}).\end{aligned}$$

It follows that

$$L-\log N_{i}=(1-\gamma)(\log n)^{0.5+\epsilon}-O(\sqrt{\log n})=\omega(\log\log N_{i}).$$

We can conclude that $L$ is substantially larger than $\log N_{i}+(6d_{2}+7)\log\log N_{i}+O(1)$. ∎

Now, we present our code construction.

Construction B.

Let the integers $m(N_{i})$ be defined as in Lemma 20. We now describe a mapping from $\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$ to $\Sigma^{n}$. For any message $\mathbf{m}\in\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$, partition $\mathbf{m}$ into $2^{I}$ substrings:

$$\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{2^{I}-1},$$

where each $\mathbf{m}_{i}$ has length $m(N_{i})$. For each $i\in[2^{I}]$, let

$$\mathbf{v}_{i}=\mathcal{E}_{\mathtt{SD}}^{(i)}(\mathbf{m}_{i})\in\Sigma^{N_{i}},$$

where $\mathcal{E}_{\mathtt{SD}}^{(i)}$ is the map mentioned in Lemma 20. We partition each $\mathbf{v}_{i}$ into substrings of length $L_{\rm min}-r$:

$$\mathbf{v}_{i}=\begin{cases}\mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\lceil n_{L}/2^{I}\rceil-1}&\text{if } i<n_{L}\bmod 2^{I},\\ \mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\lfloor n_{L}/2^{I}\rfloor-1}&\text{otherwise}.\end{cases}$$

Then the total number of substrings $\mathbf{v}_{i,j}$ is $n_{L}$. We further partition each $\mathbf{v}_{i,j}$ into $F$ segments of lengths $\lceil(L_{\rm min}-r)/F\rceil$ or $\lfloor(L_{\rm min}-r)/F\rfloor$:

$$\mathbf{v}_{i,j}=\mathbf{v}_{i,j}^{(0)}\circ\mathbf{v}_{i,j}^{(1)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}.$$

Recall $\mathbf{c}_{i}^{(m)}$ from the index construction, Construction A. Let

$$\mathbf{w}_{i,j}\triangleq\begin{cases}0^{d_{1}}\circ\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}&\text{if } j=0,\\ 1^{d_{1}}\circ\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}&\text{otherwise}.\end{cases}$$

Finally, let

$$\mathbf{w}_{i}=\begin{cases}\mathbf{p}\circ\mathbf{w}_{i,0}\circ\mathbf{p}\circ\mathbf{w}_{i,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{i,\lceil n_{L}/2^{I}\rceil-1}&\text{if } i<n_{L}\bmod 2^{I},\\ \mathbf{p}\circ\mathbf{w}_{i,0}\circ\mathbf{p}\circ\mathbf{w}_{i,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{i,\lfloor n_{L}/2^{I}\rfloor-1}&\text{otherwise},\end{cases}$$

where $\mathbf{p}\triangleq 0^{K}\circ\mathbf{u}$ and $\mathbf{u}$ is the $d_{1}$-auto-cyclic sequence in Theorem 11. Denote

$$\mathbf{w}\triangleq\mathbf{w}_{0}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{w}_{2^{I}-1}.$$

The constructed code, $\mathcal{C}_{\rm Trace}$, is the image of the mapping described above.
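The layout of a single block $\mathbf{p}\circ\mathbf{w}_{i,j}$ can be sketched as follows. This is only an illustration of the interleaving: `build_block` and `split_balanced` are hypothetical names, the payload and index strings are arbitrary toy inputs rather than outputs of $\mathcal{E}_{\mathtt{SD}}$ or of Theorem 14, and the string passed as `u` is a placeholder that is not claimed to be auto-cyclic.

```python
def split_balanced(s: str, F: int) -> list:
    """Split s into F consecutive segments of near-equal length."""
    q, rem = divmod(len(s), F)
    out, pos = [], 0
    for h in range(F):
        size = q + (1 if h < rem else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out

def build_block(v_ij: str, c_i: str, F: int, d1: int, K: int, u: str, first: bool) -> str:
    """Assemble p . w_{i,j}: marker p = 0^K . u, then the flag 0^{d1} (j = 0)
    or 1^{d1} (j > 0), then payload and index segments interleaved."""
    p = "0" * K + u
    flag = ("0" if first else "1") * d1
    body = "".join(v + c for v, c in zip(split_balanced(v_ij, F), split_balanced(c_i, F)))
    return p + flag + body

# Toy inputs (placeholders, not real SD/WWL sequences).
block = build_block(v_ij="1111111111", c_i="010101", F=3, d1=3, K=4, u="0110", first=True)
```

The block length is $K+|\mathbf{u}|+d_{1}+|\mathbf{v}_{i,j}|+|\mathbf{c}_{i}|$, which in the construction equals $L_{\rm min}$ by the definition of $r$.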

Lemma 21.

Let $\mathcal{C}_{\rm Trace}$ be the code obtained by Construction B. Then $\mathcal{C}_{\rm Trace}\subseteq\Sigma^{n}$ and its rate is

$$R(\mathcal{C}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left(\frac{1}{\sqrt{\log n}}\right).$$
Proof:

In our construction, every sequence $\mathbf{w}_{i,j}$ has length $L_{\rm min}-r+d_{1}+\lvert\mathbf{c}_{i}\rvert=L_{\rm min}-K-\ell_{d_{1}}$, and so, the concatenation $\mathbf{p}\circ\mathbf{w}_{i,j}$ has length $L_{\rm min}$. It follows that the codeword $\mathbf{w}$ has length $n_{L}L_{\rm min}=n$. Noting that the map $\mathcal{E}_{\mathtt{SD}}$ is invertible, we can uniquely recover $\mathbf{m}$ from $\mathbf{w}$. Therefore, the code $\mathcal{C}_{\rm Trace}$ has rate $\sum_{i\in[2^{I}]}m(N_{i})/n$.

We have shown in the proof of Lemma 20 that

$$1-\frac{r}{L_{\rm min}}=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left(\frac{1}{\sqrt{\log n}}\right),$$

and for each $i\in[2^{I}]$,

$$\log N_{i}=\Theta(\log n).$$

Hence,

$$\begin{aligned}R(\mathcal{C}_{\rm Trace})&=\frac{\sum_{i\in[2^{I}]}m(N_{i})}{n}=\frac{\sum_{i\in[2^{I}]}\left(N_{i}-\Theta\left(N_{i}/\log N_{i}\right)\right)}{n}\\ &=\frac{\sum_{i\in[2^{I}]}N_{i}}{n}\left(1-\Theta\left(\frac{1}{\log n}\right)\right)=\frac{n_{L}(L_{\rm min}-r)}{n}\left(1-\Theta\left(\frac{1}{\log n}\right)\right)\\ &=\left(1-\frac{r}{L_{\rm min}}\right)\left(1-\Theta\left(\frac{1}{\log n}\right)\right)=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left(\frac{1}{\sqrt{\log n}}\right).\end{aligned}$$

In the following, we shall show that the code $\mathcal{C}_{\rm Trace}$ is an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code.

Lemma 22 (Construction 1 and Lemma 3.6 in [6]).

Let $\mathbf{w}=\mathbf{p}\circ\mathbf{w}_{0,0}\circ\mathbf{p}\circ\mathbf{w}_{0,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{2^{I}-1,\lfloor n_{L}/2^{I}\rfloor-1}$ be a codeword of $\mathcal{C}_{\rm Trace}$. Assume that the substrings $\mathbf{w}_{i,j}$ satisfy the following conditions:

  1. (P1)

    $\mathbf{w}_{i,j}$ is a $(K,d_{1})$-WWL sequence for each $(i,j)$; and

  2. (P2)

    $\mathbf{w}_{i,j}[0,\mu-1]\circ\mathbf{w}_{i^{\prime},j^{\prime}}[\mu,L_{\rm min}-K-\ell_{d_{1}}-1]$ is a $(K,d_{1})$-WWL sequence for all $(i,j),(i^{\prime},j^{\prime})$ such that $(i,j)\neq(i^{\prime},j^{\prime})$ and all $\mu\in[L_{\rm min}-K-\ell_{d_{1}}]$.

Then for every substring $\mathbf{y}=\mathbf{w}_{i_{0}+[L_{\rm min}]}$ in $\mathbf{w}$ and each $i\in[L_{\rm min}]$, the following hold. (Footnote 6: If $i\in[L_{\rm min}-K+\ell_{d_{1}},L_{\rm min}-1]$, we let $\mathbf{y}_{i+[K+\ell_{d_{1}}]}$ denote the concatenation $\mathbf{y}[i,L_{\rm min}-1]\circ\mathbf{y}[0,K+\ell_{d_{1}}-(L_{\rm min}-i)-1]$.)

  1. (i)

    If $i+i_{0}\equiv 0\pmod{L_{\rm min}}$, then $\mathbf{y}_{i+[K+\ell_{d_{1}}]}=\mathbf{p}$.

  2. (ii)

    If $i+i_{0}\not\equiv 0\pmod{L_{\rm min}}$, then $d_{H}(\mathbf{y}_{i+[K+\ell_{d_{1}}]},\mathbf{p})\geqslant d_{1}$.

Lemma 23.

Assume $n$ is sufficiently large. Let $\mathbf{y}$ be an arbitrary length-$L_{\rm min}$ substring of $\mathbf{w}\in\mathcal{C}_{\rm Trace}$. Then $\mathbf{y}$ contains a length-$(I+r_{I}-\mu)$ suffix of a coded index $\mathbf{c}_{i}$ and a length-$\mu$ prefix of either $\mathbf{c}_{i}$ or $\mathbf{c}_{i+1}$ for some $i\in[2^{I}]$ and $\mu\in[I+r_{I}]$. Furthermore, even if $\mathbf{y}$ is corrupted by at most $e$ errors, we can still identify the positions where the said suffix and prefix appear, and so reconstruct them with at most $e$ errors.

Proof:

We note that the length of $\mathbf{p}\circ\mathbf{w}_{i,j}$ is $L_{\rm min}$, and that $\mathbf{w}$ is a concatenation of such strings. Hence, the first statement follows directly from the code construction. Now, assume that $\mathbf{y}$ is corrupted by at most $e$ errors. We shall use Lemma 22 to identify the location of the marker $\mathbf{p}$ in $\mathbf{y}$. Recall that every $\mathbf{c}_{i}$ is $(3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}},d_{1})$-WWL (see the index construction, Construction A) and every $\mathbf{v}_{i}$ is $(\lfloor K/4\rfloor,d_{2})$-WWL (see Lemma 20). Since $3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}}<\lfloor K/4\rfloor$ and $d_{1}<d_{2}$, all the segments $\mathbf{c}_{i}^{(h)}$ and $\mathbf{v}_{i,j}^{(h)}$ are $(\lfloor K/4\rfloor,d_{1})$-WWL. Hence, the sequences $\mathbf{w}_{i,j}$ satisfy the conditions in Lemma 22: any substring of length $K$ contains a substring of length $\lfloor K/4\rfloor$ that is fully contained within a segment of the form $\mathbf{c}_{i}^{(h)}$ or $\mathbf{v}_{i,j}^{(h)}$, which provides the minimum weight of $d_{1}$ as claimed.

Since $\mathbf{y}$ suffers from at most $e$ errors and $d_{1}=2e+1$, by Lemma 22 there is a unique index $i\in[L_{\rm min}]$ such that

$$d_{H}(\mathbf{y}_{i+[K+\ell_{d_{1}}]},\mathbf{p})\leqslant e.$$

Hence, by comparing the marker $\mathbf{p}$ with each length-$(K+\ell_{d_{1}})$ substring of $\mathbf{y}$, we can identify the location of the marker in $\mathbf{y}$. Once the marker $\mathbf{p}$ is located, the positions in which the symbols of the coded index segments $\mathbf{c}_{i}^{(h)}$ appear can also be determined. Then we can reconstruct a suffix $\mathbf{c}_{i}[\mu,I+r_{I}-1]$ and a prefix $\mathbf{c}_{i}[0,\mu-1]$ or $\mathbf{c}_{i+1}[0,\mu-1]$ for some $\mu\in[I+r_{I}]$ with at most $e$ errors. ∎
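The marker search used in this proof amounts to a cyclic sliding-window comparison. Below is a minimal sketch, assuming the cyclic convention of Lemma 22 for windows that wrap around the end of the read; `locate_marker` is a hypothetical name, and the toy strings do not satisfy the WWL conditions, so uniqueness is only illustrated for exact matching.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))

def locate_marker(y: str, p: str, e: int) -> list:
    """Return every offset i in [len(y)] whose cyclic length-|p| window of y
    is within Hamming distance e of the marker p (requires len(p) <= len(y))."""
    doubled = y + y[:len(p) - 1]          # unrolls the cyclic wrap-around
    return [i for i in range(len(y))
            if hamming(doubled[i:i + len(p)], p) <= e]

# Toy read: the marker "00001" starts at offset 8 and wraps around the end.
y = "111111110000"
p = "00001"
exact_hits = locate_marker(y, p, e=0)
```

Under the hypotheses of Lemma 22 the list returned for a read with at most $e$ errors would contain exactly one offset; in this toy example that is only guaranteed for `e=0`.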

The following lemma ensures that every length-$L_{\rm over}$ substring of $\mathbf{w}$ contains a long-enough substring of the $(L,d_{2})$-SD sequence $\mathbf{v}_{i}$.

Lemma 24.

Assume $n$ is sufficiently large. Let $\mathbf{w}$ be a codeword of $\mathcal{C}_{\rm Trace}$. Then every length-$L_{\rm over}$ substring of $\mathbf{w}$ contains at least $L$ consecutive symbols of $\mathbf{v}=\mathbf{v}_{0}\circ\mathbf{v}_{1}\circ\cdots\circ\mathbf{v}_{2^{I}-1}$.

Proof:

Note that the concatenation

$$\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}$$

consists of $\lvert\mathbf{v}_{i,j}\rvert+\lvert\mathbf{c}_{i}\rvert=L_{\rm min}-r+I+r_{I}$ symbols, out of which $\lvert\mathbf{v}_{i,j}\rvert=L_{\rm min}-r$ symbols are from $\mathbf{v}$. Then according to the construction, every length-$L_{\rm over}$ substring of $\mathbf{w}$ contains at least

$$\left(L_{\rm over}-(K+\ell_{d_{1}})-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right)\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}$$

consecutive symbols of $\mathbf{v}$, where $L_{\rm over}-(K+\ell_{d_{1}})-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil$ accounts for the worst case in which the substring both begins and ends with segments of the coded indices (of length $\left\lceil\frac{I+r_{I}}{F}\right\rceil$ or $\left\lfloor\frac{I+r_{I}}{F}\right\rfloor$) and contains a copy of $\mathbf{p}\circ 0^{d_{1}}$ or $\mathbf{p}\circ 1^{d_{1}}$. ∎

Theorem 25.

The code $\mathcal{C}_{\rm Trace}$ obtained in Construction B is an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code of $\Sigma^{n}$ with rate

$$R(\mathcal{C}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left(\frac{1}{\sqrt{\log n}}\right).$$
Proof:

The code rate has been calculated in Lemma 21. Let $\mathbf{w}$ be a codeword of $\mathcal{C}_{\rm Trace}$ and $\mathcal{Y}$ be an $(L_{\rm min},L_{\rm over},e)$-erroneous trace of $\mathbf{w}$. For each $\mathbf{y}$ in $\mathcal{Y}$, since the length of $\mathbf{y}$ is at least $L_{\rm min}$, according to Lemma 23, we can extract a corrupted copy $\mathbf{c}_{\rm suf}$ of the length-$(I+r_{I}-\mu)$ suffix of $\mathbf{c}_{i}$, and a corrupted copy $\mathbf{c}_{\rm pre}$ of a length-$\mu$ prefix of either $\mathbf{c}_{i}$ or $\mathbf{c}_{i+1}$, with the total number of errors being no more than $e$. Consider the following cases.

  1. 1.

    If $\mu=0$, then $\mathbf{c}_{\rm suf}$ is a corrupted copy of $\mathbf{c}_{i}$, and so, we can run the locating algorithm of the robust positioning sequence $\mathbf{c}=\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ on the corrupted $\mathbf{c}_{\rm suf}$ to determine the index $i$.

  2. 2.

    If $\mu>0$, then $\mathbf{y}$ contains a copy of either $\mathbf{p}\circ 0^{d_{1}}$ or $\mathbf{p}\circ 1^{d_{1}}$ with at most $e$ errors. Since $d_{1}=2e+1$, we can distinguish these two cases.

    1. (a)

      If $\mathbf{y}$ contains a copy of $\mathbf{p}\circ 0^{d_{1}}$, then $\mathbf{c}_{\rm pre}$ is a prefix of $\mathbf{c}_{i+1}$, and so, we run the locating algorithm of $\mathbf{c}$ on $\mathbf{c}_{\rm suf}\circ\mathbf{c}_{\rm pre}$ to decode the index $i$.

    2. (b)

      If $\mathbf{y}$ contains a copy of $\mathbf{p}\circ 1^{d_{1}}$, then $\mathbf{c}_{\rm pre}$ is a prefix of $\mathbf{c}_{i}$, and so, we run the locating algorithm of $\mathbf{c}$ on $\mathbf{c}_{\rm pre}\circ\mathbf{c}_{\rm suf}$ to decode the index $i$.
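The case analysis above can be summarized in a few lines. The sketch below only covers the bookkeeping (reading the $d_{1}$-bit flag by majority vote and ordering the two index fragments); the function names are hypothetical, and the locating algorithm of the robust positioning sequence, which would then be run on the returned window, is not implemented here.

```python
def read_flag(flag_bits: str) -> str:
    """Majority vote over the d1 = 2e+1 flag bits; correct whenever
    at most e of them are corrupted."""
    return "0" if flag_bits.count("0") > len(flag_bits) // 2 else "1"

def index_window(c_suf: str, c_pre: str, flag_bits: str) -> str:
    """Return the length-(I + r_I) window of c on which the locating
    algorithm should be run."""
    if not c_pre:                       # mu = 0: c_suf is a full copy of c_i
        return c_suf
    if read_flag(flag_bits) == "0":     # p . 0^{d1}: c_pre belongs to c_{i+1}
        return c_suf + c_pre
    return c_pre + c_suf                # p . 1^{d1}: both pieces belong to c_i
```

For example, with fragments `"110"` and `"01"`, a flag read as `0` yields the window `"11001"`, while a flag read as `1` yields `"01110"`.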

The discussion above shows that for every string $\mathbf{y}\in\mathcal{Y}$, we can decode the index $i$. If $\mathbf{y}$ intersects both $\mathbf{v}_{i}$ and $\mathbf{v}_{i+1}$, then we can determine its location in $\mathbf{w}$ by identifying the location of the marker $\mathbf{p}$ in $\mathbf{y}$. For the other strings with index $i$, since $\mathbf{v}_{i}$ is an $(L,4e+1)$-SD sequence, according to Lemma 24 and Proposition 18, there is a unique way to determine the correct order of these strings and correctly match the suffix and the prefix of consecutive strings. By taking the majority value at every position, we can reconstruct a sequence $\mathbf{w}_{i}^{\prime}$, which is a long substring of $\mathbf{w}_{i}$ possibly with some errors. It remains to determine the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$, which can be done as follows.

  1. 1.

    If $\mathbf{w}_{i}^{\prime}$ contains a corrupted copy of $\mathbf{p}\circ 0^{d_{1}}$ with at most $e$ errors, then the location of this marker in $\mathbf{w}_{i}^{\prime}$ determines the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$, since $\mathbf{w}_{i}$ contains only one copy of $\mathbf{p}\circ 0^{d_{1}}$.

  2. 2.

    If $\mathbf{w}_{i}^{\prime}$ does not contain any corrupted copy of $\mathbf{p}\circ 0^{d_{1}}$ up to $e$ errors, then there is a string $\hat{\mathbf{y}}\in\mathcal{Y}$ which intersects both $\mathbf{w}_{i-1}$ and $\mathbf{w}_{i}$ and contains $\mathbf{p}\circ 0^{d_{1}}$ as a substring with at most $e$ errors, since the length of $\mathbf{p}\circ 0^{d_{1}}$ is less than $L_{\rm over}$.

    1. (a)

      If $\hat{\mathbf{y}}$ overlaps $\mathbf{w}_{i}$ in at most $L_{\rm over}$ positions, since $L_{\rm over}<L_{\rm min}$, $\mathbf{w}_{i}^{\prime}$ must contain a copy of the first $\mathbf{p}\circ 1^{d_{1}}$ of $\mathbf{w}_{i}$, and so, the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$ can be determined by identifying the first occurrence of the marker $\mathbf{p}$ in $\mathbf{w}_{i}^{\prime}$.

    2. (b)

      If $\hat{\mathbf{y}}$ overlaps $\mathbf{w}_{i}$ in at least $L_{\rm over}$ positions, then $\hat{\mathbf{y}}$ and the length-$L_{\rm over}$ prefix of $\mathbf{w}_{i}^{\prime}$ share a length-$L$ substring of $\mathbf{v}_{i}$. Since $\mathbf{v}_{i}$ is $(L,4e+1)$-SD, we can match the suffix of $\hat{\mathbf{y}}$ and the prefix of $\mathbf{w}_{i}^{\prime}$ correctly. Then the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}$ can be deduced from the location of $\hat{\mathbf{y}}$ in $\mathbf{w}$.

IV-B The case of $L_{\rm min}\nmid n$

Now, we consider the case that $L_{\rm min}$ does not divide $n$. Take $n_{L}=\lfloor n/L_{\rm min}\rfloor$. Construction B can yield a trace reconstruction code of block length $n_{L}L_{\rm min}$. Our approach is to extend this code to have length $n$. Let $N_{i}$ be defined as in (6) and $m(N_{i})$ be defined as in Lemma 20. For any message $\mathbf{m}\in\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$, partition $\mathbf{m}$ into $2^{I}$ substrings, where the $i$-th substring has length $m(N_{i})$:

$$\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{2^{I}-1}.$$

For each $i\in[2^{I}-1]$, let

$$\mathbf{v}_{i}=\mathcal{E}_{\mathtt{SD}}^{(i)}(\mathbf{m}_{i})\in\Sigma^{N_{i}}.$$

The main difference from the previous case is the encoding of $\mathbf{m}_{2^{I}-1}$. We recall that the encoder $\mathcal{E}_{\mathtt{SD}}^{(i)}$ first encodes the message $\mathbf{m}_{i}$ into an SD and WWL sequence whose length may be less than $N_{i}$. Then it extends the sequence by appending a sequence $\bar{\mathbf{s}}$ and taking the first $N_{i}$ bits of the concatenation. For $i=2^{I}-1$, we modify the encoder $\mathcal{E}_{\mathtt{SD}}^{(2^{I}-1)}$ by taking the first $N_{2^{I}-1}+L_{\rm min}-r$ bits of the concatenation. This is possible since asymptotically the length of $\bar{\mathbf{s}}$ is larger than $N_{2^{I}-1}+L_{\rm min}-r$, see (5). We denote this modified encoder by $\mathcal{E}_{\mathtt{SDE}}^{(2^{I}-1)}$ and let

$$\mathbf{v}_{2^{I}-1}=\mathcal{E}_{\mathtt{SDE}}^{(2^{I}-1)}(\mathbf{m}_{2^{I}-1}).$$

Then $\mathbf{v}_{2^{I}-1}$ is $(\lfloor K/4\rfloor,d_{2})$-WWL and $(L,d_{2})$-SD and has length $N_{2^{I}-1}+L_{\rm min}-r=\lceil n_{L}/2^{I}\rceil(L_{\rm min}-r)$. Moreover, the message $\mathbf{m}_{2^{I}-1}$ can be decoded from the first $N_{2^{I}-1}$ bits of $\mathbf{v}_{2^{I}-1}$. In other words, the last $L_{\rm min}-r$ bits are redundant.

Then, we proceed similarly as in Construction B and obtain an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code of block length $(n_{L}+1)L_{\rm min}$. Note that the last $L_{\rm min}$ bits are redundant, and so, we delete $(n_{L}+1)L_{\rm min}-n$ of them to form an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code of length $n$, with code rate

$$\frac{\sum_{i\in[2^{I}]}m(N_{i})}{n}=\left(\frac{1-1/a}{1-\gamma}-o(1)\right)\frac{n_{L}L_{\rm min}}{n}=\frac{1-1/a}{1-\gamma}-o(1).$$

IV-C Handling noise which occurs before sequencing

Up to now, we have studied $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes, which allow reconstructing the maximum reconstructible string from an erroneous trace $\mathcal{Y}$ of a codeword $\mathbf{w}$. We use $M(\mathcal{Y})$ to denote the maximum reconstructible string of $\mathcal{Y}$. If $\mathcal{Y}$ is reliable, then $M(\mathcal{Y})=\mathbf{w}$. However, if $\mathcal{Y}$ is not reliable, then $M(\mathcal{Y})$ might be different from $\mathbf{w}$. This may happen especially when the sequence $\mathbf{w}$ is subject to errors before its substrings are sampled. In the remainder of this section, we shall modify Construction B to combat such errors.

Let $\mathcal{Y}$ be an $(L_{\rm min},L_{\rm over},e)$-erroneous trace of $\mathbf{w}$ such that $d_{H}(M(\mathcal{Y}),\mathbf{w})\leqslant\tau$; we refer to such a trace as an $(L_{\rm min},L_{\rm over},e,\tau)$-erroneous trace. We aim to reconstruct $\mathbf{w}$ from $\mathcal{Y}$, and so retrieve the message which is stored in $\mathbf{w}$. Our construction, which is presented below, borrows the idea from [2, Construction B].

Construction C.

Assume that $L_{\rm min}\mid n$ and take $n_{L}=n/L_{\rm min}$. Let $N\triangleq\lfloor n_{L}/2^{I}\rfloor(L_{\rm min}-r)$. According to Lemma 20, there is an integer $m(N)$ with $N-m(N)=\Theta(N/\log N)$ and an invertible map $\mathcal{E}_{\mathtt{SD}}:\Sigma^{m(N)}\rightarrow\Sigma^{N}$ which can encode sequences of $\Sigma^{m(N)}$ into $(\lfloor K/4\rfloor,d_{2})$-WWL and $(L,d_{2})$-SD sequences. Let $\mathcal{E}_{\mathtt{SDE}}:\Sigma^{m(N)}\rightarrow\Sigma^{N+L_{\rm min}-r}$ be an encoder which modifies $\mathcal{E}_{\mathtt{SD}}$ by taking the first $N+L_{\rm min}-r$ bits of the concatenation.

For any message 𝐦Σ(2I2τ)m(N)\mathbf{m}\in\Sigma^{(2^{I}-2\tau)m(N)}, we first use a [2I,2I2τ,2τ+1]2m(N)[2^{I},2^{I}-2\tau,2\tau+1]_{2^{m(N)}} Reed-Solomon code777The Reed-Solomon code is over the finite field of size 2m(N)2^{m(N)}. The message is partitioned into groups of m(N)m(N) bits, and each group is translated to a single symbol from the finite field. After encoding the reverse translation to bits is performed. Note that m(N)=NΘ(N/logN)m(N)=N-\Theta(N/\log N), log(N)=Θ(logn)\log(N)=\Theta(\log n) and I=O(logn)I=O(\log n). Hence, m(N)>Im(N)>I and so, the Reed-Solomon code exists. to encode 𝐦\mathbf{m} into a codeword 𝐦¯Σ2Im(N)\bar{\mathbf{m}}\in\Sigma^{{2^{I}}m(N)}. We partition 𝐦¯\bar{\mathbf{m}} into sequences of length LminrL_{\rm min}-r:

𝐦¯=𝐦¯0𝐦¯1𝐦¯2I1.\bar{\mathbf{m}}=\bar{\mathbf{m}}_{0}\circ\bar{\mathbf{m}}_{1}\circ\cdots\circ\bar{\mathbf{m}}_{2^{I}-1}.

For each i[2I]i\in[2^{I}], let

𝐯i{𝚂𝙳𝙴(𝐦¯i)ΣN+Lminrif i<nLmod2I,𝚂𝙳(𝐦¯i)ΣNotherwise.\mathbf{v}_{i}\triangleq\begin{cases}\mathcal{E}_{\mathtt{SDE}}(\bar{\mathbf{m}}_{i})\in\Sigma^{N+L_{\rm min}-r}&\text{if $i<n_{L}\bmod 2^{I}$},\\ \mathcal{E}_{\mathtt{SD}}(\bar{\mathbf{m}}_{i})\in\Sigma^{N}&\text{otherwise}.\\ \end{cases}

Then we proceed similarly as in Construction B to obtain a sequence 𝐰\mathbf{w} of length nn. We use 𝒞^Trace\hat{\mathcal{C}}_{\rm Trace} to denote the code produced by this construction.
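The case distinction in the definition of $\mathbf{v}_{i}$ can be checked with a short Python sketch. It computes the lengths of the $2^{I}$ blocks and confirms that they sum to $n_{L}(L_{\rm min}-r)$. The function name and the numeric values are illustrative assumptions; $I$ and $r$ are the parameters fixed earlier in the paper and are simply passed in as arguments.

```python
def block_lengths(n_L, I, L_min, r):
    """Lengths of v_0, ..., v_{2^I - 1} in Construction C (illustrative helper)."""
    blocks = 2 ** I
    N = (n_L // blocks) * (L_min - r)
    extra = n_L % blocks  # the first `extra` blocks use E_SDE and are longer
    return [N + (L_min - r) if i < extra else N for i in range(blocks)]


# Arbitrary example values, not parameters taken from the paper.
lengths = block_lengths(n_L=37, I=3, L_min=20, r=6)
print(lengths, sum(lengths) == 37 * (20 - 6))
```

The total is independent of how $n_{L}$ splits into quotient and remainder modulo $2^{I}$, which is why the longer encoder $\mathcal{E}_{\mathtt{SDE}}$ is only needed for the first $n_{L}\bmod 2^{I}$ blocks.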

Lemma 26.

Let 𝐰\mathbf{w} be a codeword of 𝒞^Trace\hat{\mathcal{C}}_{\rm Trace} and 𝒴\mathcal{Y} be an (Lmin,Lover,e,τ)(L_{\rm min},L_{\rm over},e,\tau)-erroneous trace of 𝐰\mathbf{w}. Then we can recover 𝐦\mathbf{m} from 𝒴\mathcal{Y}.

Proof:

With the same argument as the proof of Theorem 25, we can show that 𝒞^Trace\hat{\mathcal{C}}_{\rm Trace} is an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code of Σn\Sigma^{n}. Since 𝒴\mathcal{Y} is also an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-erroneous trace of 𝐰\mathbf{w}, the maximum reconstructible-substring M(𝒴)M(\mathcal{Y}) can be decoded from 𝒴\mathcal{Y}. By reversing the operations in Construction C, we obtain a sequence 𝐦¯Σ2Im(N)\bar{\mathbf{m}}^{\prime}\in\Sigma^{{2^{I}}m(N)} from M(𝒴)M(\mathcal{Y}). We partition 𝐦¯\bar{\mathbf{m}}^{\prime} into 2I2^{I} segments of the same length, i.e., 𝐦¯=𝐦¯0𝐦¯1𝐦¯2I1\bar{\mathbf{m}}^{\prime}=\bar{\mathbf{m}}_{0}^{\prime}\circ\bar{\mathbf{m}}_{1}^{\prime}\circ\cdots\circ\bar{\mathbf{m}}_{2^{I}-1}^{\prime}. Since dH(M(𝒴),𝐰)τd_{H}(M(\mathcal{Y}),\mathbf{w})\leqslant\tau, then there are at most τ\tau indices i[2I]i\in[2^{I}] such that 𝐦¯i𝐦¯i\bar{\mathbf{m}}_{i}\neq\bar{\mathbf{m}}_{i}^{\prime}. Hence, we can run the decoder of the Reed-Solomon code on 𝐦¯\bar{\mathbf{m}}^{\prime} to recover 𝐦¯\bar{\mathbf{m}}. ∎
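The role of the outer Reed-Solomon code in this proof can be illustrated with a small Python sketch. It only does the parameter bookkeeping: a $[2^{I},2^{I}-2\tau,2\tau+1]$ MDS code corrects up to $\tau$ symbol errors, and it exists over a field of size $2^{m(N)}$ whenever $m(N)\geqslant I$. The function names are assumptions made for this illustration, and no actual Reed-Solomon decoder is implemented.

```python
def rs_parameters(I, tau):
    """Return (length, dimension, distance) of the outer code (illustrative)."""
    length = 2 ** I
    dimension = length - 2 * tau
    if dimension <= 0:
        raise ValueError("tau is too large for a code of length 2^I")
    distance = length - dimension + 1  # MDS property: equals 2*tau + 1
    return length, dimension, distance


def correctable_symbol_errors(distance):
    """Number of symbol errors guaranteed correctable by unique decoding."""
    return (distance - 1) // 2


def field_is_large_enough(m, I):
    """A Reed-Solomon code of length 2^I fits in GF(2^m) when m >= I."""
    return m >= I


length, dimension, distance = rs_parameters(I=4, tau=3)
print(length, dimension, distance, correctable_symbol_errors(distance))
```

Each symbol corresponds to one block $\bar{\mathbf{m}}_{i}$, so at most $\tau$ corrupted blocks translate into at most $\tau$ symbol errors.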

Theorem 27.

Suppose that τ=O(n1γa1γ)\tau=O\left\lparen n^{\frac{1-\gamma a}{1-\gamma}}\right\rparen. Then the code 𝒞^Trace\hat{\mathcal{C}}_{\rm Trace} obtained in Construction C is an (Lmin,Lover,e,τ)(L_{\rm min},L_{\rm over},e,\tau)-trace reconstruction code of Σn\Sigma^{n} with rate

R(𝒞^Trace)=11/a1γo(1).R(\hat{\mathcal{C}}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-o(1).
Proof:

Since τ=O(n1γa1γ)\tau=O\left\lparen n^{\frac{1-\gamma a}{1-\gamma}}\right\rparen, we have that 2τ/2I=o(1)2\tau/2^{I}=o(1). Hence, the code rate

\begin{align*}
R(\hat{\mathcal{C}}_{\rm Trace}) &=\frac{(2^{I}-2\tau)m(N)}{n}=\frac{2^{I}m(N)}{n}-\frac{2\tau N}{n}\left(1-\Theta\left(\frac{1}{\log N}\right)\right)\\
&\geqslant\frac{2^{I}m(N)}{n}-\frac{2\tau}{2^{I}}\left(1-\frac{r}{L_{\rm min}}\right)\left(1-\Theta\left(\frac{1}{\log N}\right)\right)\\
&=\frac{2^{I}m(N)}{n}-o(1).
\end{align*}

Consider the NiN_{i}’s which are defined in (6). We have that

\[
N_{i}\triangleq\begin{cases}N+L_{\rm min}-r&\text{if $i<n_{L}\bmod 2^{I}$},\\ N&\text{otherwise}.\end{cases}
\]

Hence,

\begin{align*}
R(\hat{\mathcal{C}}_{\rm Trace}) &=\frac{2^{I}m(N)}{n}-o(1)\\
&\geqslant\frac{\sum_{i\in[2^{I}]}m(N_{i})-2^{I}(L_{\rm min}-r)}{n}-o(1)\\
&=R(\mathcal{C}_{\rm Trace})-o(1)=\frac{1-1/a}{1-\gamma}-o(1).
\end{align*}

IV-D $(L_{\rm min},0,e)$-Reconstruction Codes

In this subsection, we consider the case of Lover=0L_{\rm over}=0.

Construction D.

Suppose that Lmin=alognL_{\rm min}=\lceil a\log n\rceil, Lover=0L_{\rm over}=0 and LminnL_{\rm min}\mid n. As before, we denote nLnLminn_{L}\triangleq\frac{n}{L_{\rm min}} and KlognK\triangleq\left\lceil\sqrt{\log n}\right\rceil. However, this time, we let IlognLI\triangleq\lceil\log n_{L}\rceil and rI(3d+8)logIr_{I}\triangleq\lceil(3d+8)\log I\rceil where d=2e+1d=2e+1 and =dlogd+2d\ell=d\lceil\log d\rceil+2d. Then according to Theorem 14, there is a collection of (332log(I+rI)+,d)(3\lceil\frac{3}{2}\log(I+r_{I})\rceil+\ell,d)-WWL sequences 𝐜0,𝐜1,,𝐜2I1ΣI+rI\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}} such that the concatenation 𝐜0𝐜1𝐜2I1\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1} is an (I+rI,d)(I+r_{I},d)-SD sequence.

Denote $m^{\prime}\triangleq L_{\rm min}-(I+r_{I}+K+\ell)$. Let $\mathcal{E}_{\mathtt{WWL}}$ be the encoder in [14, Algorithm 2] which can encode sequences of $\Sigma^{m^{\prime}-d}$ into $(\lceil K/4\rceil,d)$-WWL sequences\footnote{Note that $m^{\prime}=\Theta(\log n)$ and $K=\left\lceil\sqrt{\log n}\right\rceil$. Hence, $K/4\gg\mathcal{F}(m^{\prime},d)=\log m^{\prime}+(d-1)\log\log m^{\prime}+O(1)$. Then according to Lemma 19 in [14], the encoder $\mathcal{E}_{\mathtt{WWL}}$ does work.} of $\Sigma^{m^{\prime}}$. For a message $\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{n_{L}-1}$ where $\mathbf{m}_{i}\in\Sigma^{m^{\prime}-d}$ for $i\in[n_{L}]$, let $\mathbf{w}_{i}\triangleq\mathcal{E}_{\mathtt{WWL}}(\mathbf{m}_{i})$ for all $i\in[n_{L}]$.

Denote 𝐩0K𝐮\mathbf{p}\triangleq 0^{K}\circ\mathbf{u} where 𝐮\mathbf{u} is a dd-auto-cyclic sequence of length \ell. Let

𝐰=𝐩𝐜0𝐰0𝐩𝐜1𝐰1𝐩𝐜nL1𝐰nL1.\mathbf{w}=\mathbf{p}\circ\mathbf{c}_{0}\circ\mathbf{w}_{0}\circ\mathbf{p}\circ\mathbf{c}_{1}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{p}\circ\mathbf{c}_{n_{L}-1}\circ\mathbf{w}_{n_{L}-1}.

Output 𝐰\mathbf{w} as the codeword which encodes the message 𝐦\mathbf{m}. The image under this mapping is the code that we construct.

Theorem 28.

The code obtained in Construction D is an (Lmin,0,e)(L_{\rm min},0,e)-trace reconstruction code of Σn\Sigma^{n} with rate

11aO(1logn).1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.
Proof:

The code has rate

nL(md)n=mdLmin=Lmin(I+rI+K++d)Lmin=11aO(1logn).\frac{n_{L}(m^{\prime}-d)}{n}=\frac{m^{\prime}-d}{L_{\rm min}}=\frac{L_{\rm min}-(I+r_{I}+K+\ell+d)}{L_{\rm min}}=1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.

Now, let 𝐲\mathbf{y} be a length-LminL_{\rm min} substring of some codeword 𝐰\mathbf{w}. Then 𝐲\mathbf{y} must contain either a copy of 𝐩𝐜i\mathbf{p}\circ\mathbf{c}_{i} or a suffix of 𝐩𝐜i\mathbf{p}\circ\mathbf{c}_{i} together with a prefix of 𝐩𝐜i+1\mathbf{p}\circ\mathbf{c}_{i+1}. Since 𝐰i\mathbf{w}_{i}’s and 𝐜j\mathbf{c}_{j}’s are WWL sequences, even if 𝐲\mathbf{y} suffers from ee errors, we can still locate the marker 𝐩\mathbf{p} in 𝐲\mathbf{y}. Then we can run the locating algorithm of the robust positioning sequence 𝐜0𝐜1𝐜2I1\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1} to determine the index ii or i+1i+1, and hence the location of 𝐲\mathbf{y}. ∎
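To get a feel for the rate expression in this proof, the following Python sketch evaluates $(m^{\prime}-d)/L_{\rm min}$ for concrete parameters. It is only an illustration under stated assumptions: logarithms are taken to base 2, $n_{L}$ is computed with a floor rather than assuming $L_{\rm min}\mid n$, and the function name is hypothetical.

```python
import math


def construction_d_rate(n, a, e):
    """Evaluate (m' - d) / L_min for Construction D (base-2 logs assumed)."""
    d = 2 * e + 1
    ell = d * math.ceil(math.log2(d)) + 2 * d
    L_min = math.ceil(a * math.log2(n))
    n_L = n // L_min  # assumption: floor instead of requiring L_min | n
    I = math.ceil(math.log2(n_L))
    r_I = math.ceil((3 * d + 8) * math.log2(I))
    K = math.ceil(math.sqrt(math.log2(n)))
    m_prime = L_min - (I + r_I + K + ell)
    return (m_prime - d) / L_min


# Arbitrary example: n = 2^40, a = 4, e = 1.
print(construction_d_rate(2 ** 40, 4, 1), 1 - 1 / 4)
```

For moderate $n$ the lower-order terms $r_{I}$, $K$ and $\ell$ take up a large share of each length-$L_{\rm min}$ block, so the computed value approaches $1-1/a$ only slowly.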

For the case of LminnL_{\rm min}\nmid n, let nL=n/Lminn_{L}=\lceil n/L_{\rm min}\rceil. We first construct an (Lmin,0)(L_{\rm min},0)-trace reconstruction code of ΣnLLmin\Sigma^{n_{L}L_{\rm min}}, where the length-LminL_{\rm min} suffix of every codeword is fixed. Then we truncate it to be of length nn. In this way, we get a code of rate

n/Lmin(Lmin(I+rI+K++d))n(1Lmin1n)(1I+rI+K++dLmin)=11aO(1logn).\frac{\lfloor n/L_{\rm min}\rfloor(L_{\rm min}-(I+r_{I}+K+\ell+d))}{n}\geqslant\left\lparen 1-\frac{L_{\rm min}-1}{n}\right\rparen\left\lparen 1-\frac{I+r_{I}+K+\ell+d}{L_{\rm min}}\right\rparen=1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.

For (Lmin,0,e,τ)(L_{\rm min},0,e,\tau)-erroneous trace reconstruction, we proceed similarly as in [2, Construction B]. We first use an (nL,2(md)(nLr),2τ+1)2md(n_{L},2^{(m^{\prime}-d)(n_{L}-r)},2\tau+1)_{2^{m^{\prime}-d}} code to encode a message 𝐦=𝐦0𝐦1𝐦nLr1Σ(md)(nLr)\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{n_{L}-r-1}\in\Sigma^{(m^{\prime}-d)(n_{L}-r)} to a sequence 𝐦¯=𝐦¯0𝐦¯1𝐦¯nL1Σ(md)nL\bar{\mathbf{m}}=\bar{\mathbf{m}}_{0}\circ\bar{\mathbf{m}}_{1}\circ\cdots\circ\bar{\mathbf{m}}_{n_{L}-1}\in\Sigma^{(m^{\prime}-d)n_{L}}. Then we use the encoder outlined in Construction D to get a codeword 𝐰\mathbf{w}. We note that Construction B in [2] only concerns errors before sequencing, while our construction incorporates errors both before and after sequencing.

V Multi-Strand Reconstruction

In this section, instead of reconstructing a single sequence, we consider the problem of reconstructing a multiset of kk sequences of length nn from the union of their traces. The following construction of multi-strand (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction codes is adapted from [25, Construction C].

Construction E.

Let Nk(nLover)+LoverN\triangleq k(n-L_{\rm over})+L_{\rm over}. We take an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code 𝒞\mathcal{C} of ΣN\Sigma^{N}. For each codeword 𝐱𝒞\mathbf{x}\in\mathcal{C}, let

𝒮(𝐱){𝐱0+[n],𝐱nLover+[n],𝐱2(nLover)+[n],,𝐱(k1)(nLover)+[n]}𝒳n,k.\mathcal{S}(\mathbf{x})\triangleq\left\{\mathbf{x}_{0+[n]},\mathbf{x}_{n-L_{\rm over}+[n]},\mathbf{x}_{2(n-L_{\rm over})+[n]},\ldots,\mathbf{x}_{(k-1)(n-L_{\rm over})+[n]}\right\}\in\mathcal{X}_{n,k}.

The code we construct is 𝒟\mathcal{D}, defined as,

𝒟{𝒮(𝐱):𝐱𝒞}𝒳n,k.\mathcal{D}\triangleq\left\{\mathcal{S}(\mathbf{x})~{}:~{}\mathbf{x}\in\mathcal{C}\right\}\subseteq\mathcal{X}_{n,k}.
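The map $\mathcal{S}$ in Construction E cuts a long codeword into $k$ windows of length $n$, with consecutive windows sharing $L_{\rm over}$ symbols. A minimal Python sketch follows; the function names are hypothetical, and a list is returned only for readability, whereas $\mathcal{S}(\mathbf{x})$ in the paper is unordered.

```python
def split_into_strands(x, n, k, L_over):
    """Cut x of length k*(n - L_over) + L_over into k overlapping strands."""
    step = n - L_over
    if len(x) != k * step + L_over:
        raise ValueError("len(x) must equal k*(n - L_over) + L_over")
    return [x[j * step: j * step + n] for j in range(k)]


def glue_strands(strands, L_over):
    """Inverse of split_into_strands when the strand order is known."""
    x = strands[0]
    for s in strands[1:]:
        x += s[L_over:]
    return x


x = ("01101001" * 3)[:17]  # arbitrary binary string of length 3*(7-2)+2
strands = split_into_strands(x, n=7, k=3, L_over=2)
print(strands, glue_strands(strands, 2) == x)
```

In the paper the strand order is not given to the decoder. It is recovered because every read can be located in $\mathbf{x}$, which is the content of Lemma 29.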

Lemma 29.

Let Lmin>LoverL_{\rm min}>L_{\rm over}. Then the code 𝒟\mathcal{D} from Construction E is a multi-strand (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k}.

Proof:

It is easy to see that an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-erroneous trace 𝒴\mathcal{Y} of 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is also an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-erroneous trace of 𝐱\mathbf{x}. Since 𝒞\mathcal{C} is a trace reconstruction code, then for each 𝐲𝒴\mathbf{y}\in\mathcal{Y}, we can determine its location in 𝐱\mathbf{x}. Hence, we can determine the index ii such that 𝐲𝒴i\mathbf{y}\in\mathcal{Y}_{i} and determine the location of 𝐲\mathbf{y} in 𝐱i\mathbf{x}_{i}. ∎

Lemma 30 ([25, Lemma 16]).

$\log\lvert\mathcal{X}_{n,k}\rvert=k(n-\log(k/\mathsf{e}))+o(k)$.\footnote{We use $\mathsf{e}$ to denote $\exp(1)$ in order to avoid confusion with $e$ which denotes the number of errors.}

Theorem 31.

Suppose that lim supnlogk/n<1\limsup_{n\to\infty}\log k/n<1, Lover=log(nk)+(24e+13)loglog(nk)+(4e+1)log(4e+1)+20e+5L_{\rm over}=\lceil\log(nk)\rceil+(24e+13)\lceil\log\lceil\log(nk)\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5 and Lmin>LoverL_{\rm min}>L_{\rm over}. For sufficiently large nn, there is a multi-strand (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k} whose rate is 1o(1)1-o(1).

Proof:

Let $N=k(n-L_{\rm over})+L_{\rm over}$. Since $N\leqslant nk$, we have $L_{\rm over}\geqslant\lceil\log N\rceil+(24e+13)\lceil\log\lceil\log N\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$. According to Corollary 19, there is an $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code $\mathcal{C}$ of $\Sigma^{N}$ whose rate is $1-o(1)$. Applying Construction E with this code, we obtain a multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ with $\lvert\mathcal{D}\rvert=\lvert\mathcal{C}\rvert$. Note that

\begin{align*}
\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert} &=\frac{k(n-L_{\rm over})+L_{\rm over}}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{n-L_{\rm over}+L_{\rm over}/k}{n-\log k+O(1)}\\
&=1-\frac{L_{\rm over}-\log k-L_{\rm over}/k+O(1)}{n-\log k+O(1)}=1-O\left(\frac{\log n}{n}\right).
\end{align*}

Hence, the code rate is

R(𝒟)=log|𝒟|log|𝒳n,k|=log|𝒞|NNlog|𝒳n,k|=(1o(1))(1O(lognn))=1o(1).\displaystyle R(\mathcal{D})=\frac{\log\lvert\mathcal{D}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{N}\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}=(1-o(1))\left\lparen 1-O\left\lparen\frac{\log n}{n}\right\rparen\right\rparen=1-o(1).

Now, we consider the case of Loverlog(nk)L_{\rm over}\leqslant\log(nk). Assume that Lmin=alog(nk)L_{\rm min}=a\log(nk) and Lover=γLminL_{\rm over}=\gamma L_{\rm min} where a>1a>1 and 0aγ10\leqslant a\gamma\leqslant 1. Let

L(nLover)mod(LminLover).L^{*}\triangleq(n-L_{\rm over})\bmod(L_{\rm min}-L_{\rm over}).

We first present some upper bounds on the rate of multi-strand (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction codes.

Lemma 32 ([25, In the proof of Lemma 8]).

For all vu0v\geqslant u\geqslant 0, log(u+vu)<u(2log𝖾+logvlogu).\log\binom{u+v}{u}<u(2\log\mathsf{e}+\log v-\log u).
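The inequality of Lemma 32 is easy to probe numerically. The Python sketch below evaluates both sides with base-2 logarithms (an assumption consistent with the binary alphabet) for $v\geqslant u\geqslant 1$; the case $u=0$ is skipped because $\log u$ is undefined there. The function name is hypothetical.

```python
import math


def lemma32_gap(u, v):
    """Right-hand side minus left-hand side of Lemma 32 (base-2 logs)."""
    lhs = math.log2(math.comb(u + v, u))
    rhs = u * (2 * math.log2(math.e) + math.log2(v) - math.log2(u))
    return rhs - lhs


# Smallest gap over a small grid; a positive value means the bound held.
print(min(lemma32_gap(u, v) for v in range(1, 41) for u in range(1, v + 1)))
```

This is a spot check on a finite grid, not a proof; the lemma itself is taken from [25].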

Lemma 33.

Suppose that Lmin=alog(nk)L_{\rm min}=\lceil a\log(nk)\rceil and Lover=γLminL_{\rm over}=\lceil\gamma L_{\rm min}\rceil where a>1a>1 and 0aγ10\leqslant a\gamma\leqslant 1. Let 𝒞\mathcal{C} be a multi-strand (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k}. Then it holds that

log|𝒞|nk(11/a1γ)(1γLminn)+1/aγ1γLn+O(lognn).\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-\gamma\frac{L_{\rm min}}{n}\right\rparen+\frac{1/a-\gamma}{1-\gamma}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.

In particular, if logk=o(n)\log k=o(n), then the code rate satisfies

R(𝒞)11/a1γ+o(1),R(\mathcal{C})\leqslant\frac{1-1/a}{1-\gamma}+o(1),

and if logk=κn\log k=\kappa n where 0<κ<10<\kappa<1 is a real constant, then the code rate satisfies

R(𝒞)1aγκ1κ(11/a1γ)+1/aγ(1γ)(1κ)Ln+O(lognn).R(\mathcal{C})\leqslant\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.
Proof:

For a sequence 𝐱Σn\mathbf{x}\in\Sigma^{n}, let

\[
\hat{\mathcal{Y}}(\mathbf{x})\triangleq\left\{\mathbf{x}_{i(L_{\rm min}-L_{\rm over})+[L_{\rm min}]}~:~i\in\left[\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}-1\right]\right\}\cup\{\mathbf{x}[n-L_{\rm min}-L^{*},n-1]\}.
\]

For a codeword 𝒮={𝐱0,𝐱1,,𝐱k1}𝒞\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{C}, let 𝒴^(𝒮)i=0k1𝒴^(𝐱i).\hat{\mathcal{Y}}(\mathcal{S})\triangleq\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i}). Then 𝒴^(𝒮)\hat{\mathcal{Y}}(\mathcal{S}) is an (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace of 𝒮\mathcal{S}.
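The specific trace $\hat{\mathcal{Y}}(\mathbf{x})$ used in this counting argument can be written out explicitly. The Python sketch below lists its reads for a single strand: windows of length $L_{\rm min}$ spaced $L_{\rm min}-L_{\rm over}$ apart, followed by one final read of length $L_{\rm min}+L^{*}$. The function name is hypothetical, and a list is used for readability although the paper treats the trace as a multiset.

```python
def lemma33_reads(x, L_min, L_over):
    """Reads of the trace Y^(x) from the proof of Lemma 33, and L*."""
    n = len(x)
    step = L_min - L_over
    L_star = (n - L_over) % step
    q = (n - L_over - L_star) // step
    if q < 1:
        raise ValueError("x is shorter than L_min")
    reads = [x[i * step: i * step + L_min] for i in range(q - 1)]
    reads.append(x[n - L_min - L_star:])  # final read, length L_min + L*
    return reads, L_star


# 23 distinct symbols make the overlaps easy to see.
reads, L_star = lemma33_reads("abcdefghijklmnopqrstuvw", L_min=6, L_over=2)
print(L_star, reads)
```

Consecutive reads overlap in exactly $L_{\rm over}$ positions, so the reads form a valid $(L_{\rm min},L_{\rm over})$-trace with as few reads as possible, which is what makes the resulting bound tight.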

Since 𝒞\mathcal{C} is an (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction code, necessarily 𝒴^(𝒮)𝒴^(𝒮)\hat{\mathcal{Y}}(\mathcal{S})\neq\hat{\mathcal{Y}}(\mathcal{S}^{\prime}) for any two different codewords 𝒮\mathcal{S} and 𝒮\mathcal{S}^{\prime}. It follows that

|𝒞||{𝒴^(𝒮):𝒮𝒞}|.\lvert\mathcal{C}\rvert\leqslant\left\lvert\left\{\hat{\mathcal{Y}}(\mathcal{S})~{}:~{}\mathcal{S}\in\mathcal{C}\right\}\right\rvert.

Note that 𝒴^(𝒮)\hat{\mathcal{Y}}(\mathcal{S}) is a multiset consisting of knLminLLminLoverk\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}} sequences of ΣLmin\Sigma^{L_{\rm min}} and kk sequences of ΣLmin+L\Sigma^{L_{\rm min}+L^{*}}. Hence,

\begin{equation}
\lvert\mathcal{C}\rvert\leqslant\binom{k\left(\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}}\right)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\cdot\binom{k+2^{L_{\rm min}+L^{*}}-1}{2^{L_{\rm min}+L^{*}}-1}.\tag{7}
\end{equation}

We denote the first binomial coefficient in (7) as AA and the second one as BB. Since 2Lmin(nk)a>k(nLmin)LminLover2^{L_{\rm min}}\geqslant(nk)^{a}>\frac{k(n-L_{\rm min})}{L_{\rm min}-L_{\rm over}} and 2Lmin+L>k2^{L_{\rm min}+L^{*}}>k, according to Lemma 32, we have that

\begin{align}
\frac{\log A}{nk} &<\frac{k}{nk}\left(\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}}\right)\left(2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm min}-L^{*}))+\log(L_{\rm min}-L_{\rm over})\right)\notag\\
&=\frac{1-(L_{\rm min}+L^{*})/n}{L_{\rm min}-L_{\rm over}}(L_{\rm min}-\log(nk)+O(\log\log(nk)))\notag\\
&=\left(1-\frac{L_{\rm min}+L^{*}}{n}\right)\frac{L_{\rm min}-\log(nk)}{L_{\rm min}-L_{\rm over}}+O\left(\frac{\log\log(nk)}{\log(nk)}\right)\notag\\
&=\frac{1-1/a}{1-\gamma}\left(1-\frac{L_{\rm min}+L^{*}}{n}\right)+O\left(\frac{\log\log(nk)}{\log(nk)}\right),\tag{8}
\end{align}

and

\begin{equation}
\frac{\log B}{nk}<\frac{1}{n}\left(2\log\mathsf{e}+L_{\rm min}+L^{*}-\log k\right)=\frac{(1-1/a)L_{\rm min}}{n}+\frac{L^{*}}{n}+O\left(\frac{\log n}{n}\right).\tag{9}
\end{equation}

Combining (7), (8) and (9), we have that

\begin{equation}
\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left(\frac{1-1/a}{1-\gamma}\right)\left(1-\gamma\frac{L_{\rm min}}{n}\right)+\frac{1/a-\gamma}{1-\gamma}\cdot\frac{L^{*}}{n}+O\left(\frac{\log n}{n}\right).\tag{10}
\end{equation}

If logk=o(n)\log k=o(n), then Lmin/n=alog(nk)/n=o(1)L_{\rm min}/n=a\log(nk)/n=o(1) and L/n<Lmin/n=o(1)L^{*}/n<L_{\rm min}/n=o(1). It follows that

log|𝒞|nk(11/a1γ)(1o(1))+o(1)=11/a1γ+o(1).\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-o(1)\right\rparen+o(1)=\frac{1-1/a}{1-\gamma}+o(1).

Recall that log|𝒳n,k|=k(nlog(k/𝖾))+o(k)\log\lvert\mathcal{X}_{n,k}\rvert=k(n-\log(k/\mathsf{e}))+o(k). Hence, the code rate

R(𝒞)=log|𝒞|log|𝒳n,k|=log|𝒞|nknkk(nlog(k/𝖾))+o(k)(11/a1γ+o(1))11o(1)=11/a1γ+o(1).\displaystyle R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{k(n-\log(k/\mathsf{e}))+o(k)}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}+o(1)\right\rparen\frac{1}{1-o(1)}=\frac{1-1/a}{1-\gamma}+o(1).

If logk=κn\log k=\kappa n where 0<κ<10<\kappa<1 is a real constant, then

nklog|𝒳n,k|=nkk(nlog(k/𝖾))+o(k)=11κ+O(1/n)=11κO(1n).\displaystyle\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{nk}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{1}{1-\kappa+O(1/n)}=\frac{1}{1-\kappa}-O\left\lparen\frac{1}{n}\right\rparen.

Therefore, it follows from (10) that the code rate satisfies

\begin{align*}
R(\mathcal{C}) &=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}\\
&\leqslant\left(\left(\frac{1-1/a}{1-\gamma}\right)\left(1-a\gamma\kappa\right)+\frac{1/a-\gamma}{1-\gamma}\frac{L^{*}}{n}+O\left(\frac{\log n}{n}\right)\right)\left(\frac{1}{1-\kappa}-O\left(\frac{1}{n}\right)\right)\\
&=\frac{1-a\gamma\kappa}{1-\kappa}\left(\frac{1-1/a}{1-\gamma}\right)+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\cdot\frac{L^{*}}{n}+O\left(\frac{\log n}{n}\right).
\end{align*}

Corollary 34.

Suppose that logk=o(n)\log k=o(n). Let 𝒞\mathcal{C} be a multi-strand (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k}. If Lminlog(nk)+o(log(nk))L_{\rm min}\leqslant\log(nk)+o(\log(nk)), then R(𝒞)=o(1)R(\mathcal{C})=o(1).

Proof:

Since 𝒞\mathcal{C} is also a multi-strand (alog(nk),0)(\lceil a\log(nk)\rceil,0)-trace reconstruction code for any a>1a>1, it follows from Lemma 33 that R(𝒞)11/a+o(1)R(\mathcal{C})\leqslant 1-1/a+o(1) for all a>1a>1. Hence, R(𝒞)=o(1)R(\mathcal{C})=o(1). ∎

Lemma 35.

Suppose that logkκn\log k\leqslant\kappa n where κ<1\kappa<1 is a constant. Let 𝒞\mathcal{C} be a multi-strand (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k}. If Lmin=alog(nk)L_{\rm min}=\lceil a\log(nk)\rceil for some a<1a<1, then R(𝒞)=o(1)R(\mathcal{C})=o(1).

Proof:

The proof is similar to that of Lemma 33. In this case, we denote

𝒴^(𝐱){𝐱0+[Lmin],𝐱1+[Lmin],,𝐱nLmin+[Lmin]}.\hat{\mathcal{Y}}(\mathbf{x})\triangleq\{\mathbf{x}_{0+[L_{\rm min}]},\mathbf{x}_{1+[L_{\rm min}]},\ldots,\mathbf{x}_{n-L_{\rm min}+[L_{\rm min}]}\}.

Then each $\hat{\mathcal{Y}}(\mathcal{S})=\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i})$ is still an $(L_{\rm min},L_{\rm over})$-trace, and it consists of $k(n-L_{\rm min}+1)$ sequences of $\Sigma^{L_{\rm min}}$, and so,

|𝒞|(k(nLmin+1)+2Lmin12Lmin1).\lvert\mathcal{C}\rvert\leqslant\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}.

We observe that k(nLmin+1)k(nalognalogk)k((1aκ)nalogn)cnkk(n-L_{\rm min}+1)\geqslant k(n-a\log n-a\log k)\geqslant k\left\lparen(1-a\kappa)n-a\log n\right\rparen\geqslant cnk for some constant cc and 2Lmin2(nk)a2^{L_{\rm min}}\leqslant 2(nk)^{a}. Since a<1a<1, when nn is sufficiently large, we have that k(nLmin+1)2Lmink(n-L_{\rm min}+1)\geqslant 2^{L_{\rm min}}. Using the inequality in Lemma 32, we get that

\begin{equation}
\frac{1}{nk}\log\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\leqslant\frac{2^{L_{\rm min}}}{nk}\left(2\log\mathsf{e}+\log(k(n-L_{\rm min}+1))-L_{\rm min}\right).\tag{11}
\end{equation}

Noting that kn>k(nLmin+1)cnkkn>k(n-L_{\rm min}+1)\geqslant cnk, we have that log(k(nLmin+1))=log(nk)O(1)\log(k(n-L_{\rm min}+1))=\log(nk)-O(1). Continuing (11),

\begin{align*}
\frac{1}{nk}\log\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1} &\leqslant\frac{2^{L_{\rm min}}}{nk}\left(2\log\mathsf{e}+\log(k(n-L_{\rm min}+1))-L_{\rm min}\right)\\
&\leqslant\frac{2^{L_{\rm min}}}{nk}\left((1-a)\log(nk)+O(1)\right)\\
&=\frac{(1-a)\log(nk)+O(1)}{(nk)^{1-a}}=o(1).
\end{align*}

Hence,

R(𝒞)=log|𝒞|log|𝒳n,k|=log|𝒞|nknklog|𝒳n,k|=o(1).R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=o(1).

Lemma 36.

Suppose that k2nk\leqslant 2^{n}. Let 𝒞\mathcal{C} be a multi-strand (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace reconstruction code of 𝒳n,k\mathcal{X}_{n,k}. If Lminlog(nk)+o(log(nk))L_{\rm min}\leqslant\log(nk)+o(\log(nk)) and LminLover=Θ(log(nk))L_{\rm min}-L_{\rm over}=\Theta(\log(nk)), then R(𝒞)=o(1)R(\mathcal{C})=o(1).

Proof:

It suffices to consider the case of Lmin=log(nk)+o(log(nk))L_{\rm min}=\log(nk)+o(\log(nk)). In this case, we denote

𝒴^(𝐱){𝐱i(LminLover)+[Lmin]:i[nLoverLLminLover]}{𝐱[nLmin,n1]}.\hat{\mathcal{Y}}(\mathbf{x})\triangleq\left\{\mathbf{x}_{i(L_{\rm min}-L_{\rm over})+[L_{\rm min}]}~{}:~{}i\in\left[\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}\right]\right\}\cup\{\mathbf{x}[n-L_{\rm min},n-1]\}.

Since LminLLoverL_{\rm min}-L^{*}\geqslant L_{\rm over}, each 𝒴^(𝒮)=i=0k1𝒴^(𝐱i)\hat{\mathcal{Y}}(\mathcal{S})=\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i}) is still an (Lmin,Lover)(L_{\rm min},L_{\rm over})-trace, and it consists of k(nLoverLLminLover+1)k\left\lparen\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}+1\right\rparen sequences of ΣLmin\Sigma^{L_{\rm min}}. Hence, we have that

|𝒞|(k(n+Lmin2LoverL)LminLover+2Lmin12Lmin1).\lvert\mathcal{C}\rvert\leqslant\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}.

Since Lmin=log(nk)+o(log(nk))L_{\rm min}=\log(nk)+o(\log(nk)), we have k(n+Lmin2LoverL)LminLover<2Lmin\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}<2^{L_{\rm min}} . Using the inequality in Lemma 32, we get that

\begin{align}
&\frac{1}{nk}\log\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\notag\\
\leqslant{}&\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left(2\log\mathsf{e}+L_{\rm min}-\log(k(n+L_{\rm min}-2L_{\rm over}-L^{*}))+\log(L_{\rm min}-L_{\rm over})\right)\notag\\
\leqslant{}&\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left(2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm over}))+\log(L_{\rm min}-L_{\rm over})\right).\tag{12}
\end{align}

Since Lminlog(nk)+o(log(nk))L_{\rm min}\leqslant\log(nk)+o(\log(nk)) and LminLover=Θ(log(nk))L_{\rm min}-L_{\rm over}=\Theta(\log(nk)), we have that Loverc1log(nk)c2nL_{\rm over}\leqslant c_{1}\log(nk)\leqslant c_{2}n for some constants c1,c2<1c_{1},c_{2}<1. It follows that log(k(nLover))=log(nk)O(1)\log(k(n-L_{\rm over}))=\log(nk)-O(1). Continuing (12), we have that

\begin{align*}
&\frac{1}{nk}\log\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\\
\leqslant{}&\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left(2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm over}))+\log(L_{\rm min}-L_{\rm over})\right)\\
\leqslant{}&\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left(2\log\mathsf{e}+L_{\rm min}-\log(nk)+O(\log\log(nk))\right)\\
={}&\left(1+\frac{L_{\rm min}-2L_{\rm over}-L^{*}}{n}\right)\frac{o(\log(nk))}{L_{\rm min}-L_{\rm over}}=o(1),
\end{align*}

where the last equality holds since LminLover=Θ(log(nk))L_{\rm min}-L_{\rm over}=\Theta(\log(nk)). Hence,

R(𝒞)=log|𝒞|log|𝒳n,k|=log|𝒞|nknklog|𝒳n,k|=o(1).R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=o(1).

Remark.

We note that the condition LminLover=Θ(log(nk))L_{\rm min}-L_{\rm over}=\Theta(\log(nk)) in Lemma 36 cannot be removed. A counterexample is the (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction codes of rate 1o(1)1-o(1) in Theorem 31, where Lover=lognk+(24e+13)loglognk+(4e+1)log(4e+1)+20e+5L_{\rm over}=\lceil\log nk\rceil+(24e+13)\lceil\log\lceil\log nk\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5 and LminLover+1L_{\rm min}\geqslant L_{\rm over}+1.

Note that a multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code is also a multi-strand $(L_{\rm min},L_{\rm over})$-trace reconstruction code. Hence, the upper bounds in Lemmas 33-36 also hold for multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction codes.

In the following, we study the lower bounds.

Theorem 37.

Let Lmin=alog(nk)L_{\rm min}=\lceil a\log(nk)\rceil and Lover=γLminL_{\rm over}=\lceil\gamma L_{\rm min}\rceil, where a>1a>1 and 0aγ10\leqslant a\gamma\leqslant 1. For all sufficiently large nn,

  1. if $\log k=o(n)$, then there is a multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ of rate

    $R(\mathcal{D})=\frac{1-1/a}{1-\gamma}-o(1);$

  2. if $\log k=\kappa n$ where $0<\kappa<1$ is a real constant, then there is a multi-strand $(L_{\rm min},L_{\rm over},e)$-trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ of rate

    $R(\mathcal{D})=\frac{1-a\gamma\kappa}{1-\kappa}\left(\frac{1-1/a}{1-\gamma}\right)-o(1).$
Proof:

Let N=k(nLover)+LoverN=k(n-L_{\rm over})+L_{\rm over}. Then LminalogNL_{\rm min}\geqslant\lceil a\log N\rceil. According to Theorem 25, there is an (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code 𝒞\mathcal{C} of ΣN\Sigma^{N} whose rate is 11/a1γo(1)\frac{1-1/a}{1-\gamma}-o(1). Applying Construction E with this code, we obtain a multi-strand (Lmin,Lover,e)(L_{\rm min},L_{\rm over},e)-trace reconstruction code 𝒟\mathcal{D} of 𝒳n,k\mathcal{X}_{n,k} with |𝒟|=|𝒞|\lvert\mathcal{D}\rvert=\lvert\mathcal{C}\rvert. Note that

\begin{align*}
\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert} &=\frac{k(n-L_{\rm over})+L_{\rm over}}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{n-L_{\rm over}+L_{\rm over}/k}{n-\log k+O(1)}\\
&=1-\frac{L_{\rm over}-\log k-L_{\rm over}/k+O(1)}{n-\log k+O(1)}=1-\frac{(a\gamma-1)\log k+O(\log n)}{n-\log k+O(1)}.
\end{align*}

If logk=o(n)\log k=o(n), then N/log|𝒳n,k|=1o(1){N}/{\log\lvert\mathcal{X}_{n,k}\rvert}=1-o(1), and so, we have that

R(𝒟)=(11/a1γo(1))(1o(1))=11/a1γo(1).\displaystyle R(\mathcal{D})=\left\lparen\frac{1-1/a}{1-\gamma}-o(1)\right\rparen(1-o(1))=\frac{1-1/a}{1-\gamma}-o(1).

If logk=κn\log k=\kappa n, then

Nlog|𝒳n,k|=1(aγ1)κ1κo(1)=1aγκ1κo(1),\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}=1-\frac{(a\gamma-1)\kappa}{1-\kappa}-o(1)=\frac{1-a\gamma\kappa}{1-\kappa}-o(1),

and so, we have that

R(𝒟)=(11/a1γo(1))(1aγκ1κo(1))=1aγκ1κ(11/a1γ)o(1).\displaystyle R(\mathcal{D})=\left\lparen\frac{1-1/a}{1-\gamma}-o(1)\right\rparen\left\lparen\frac{1-a\gamma\kappa}{1-\kappa}-o(1)\right\rparen=\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen-o(1).

When logk=o(n)\log k=o(n) or when logk=κn\log k=\kappa n and L=o(n)L^{*}=o(n), the lower bounds in Theorem 37 asymptotically achieve the upper bound in Lemma 33.
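The common leading term of the upper bound in Lemma 33 and the lower bound in Theorem 37 can be tabulated with a few lines of Python. This is a sketch with a hypothetical function name; setting `kappa=0` stands in for the regime $\log k=o(n)$, and all lower-order terms (including the $L^{*}/n$ term) are ignored.

```python
def rate_limit(a, gamma, kappa=0.0):
    """Leading term (1 - a*gamma*kappa)/(1 - kappa) * (1 - 1/a)/(1 - gamma)."""
    if not (a > 1 and 0 <= a * gamma <= 1 and 0 <= kappa < 1):
        raise ValueError("need a > 1, 0 <= a*gamma <= 1 and 0 <= kappa < 1")
    return (1 - a * gamma * kappa) / (1 - kappa) * (1 - 1 / a) / (1 - gamma)


# Arbitrary example values of (a, gamma, kappa).
for a, gamma, kappa in [(2, 0.0, 0.0), (2, 0.25, 0.0), (2, 0.25, 0.25)]:
    print(a, gamma, kappa, rate_limit(a, gamma, kappa))
```

Larger overlap $\gamma$ and larger $\kappa$ both increase the achievable rate. Note that reads of length $L_{\rm min}=\lceil a\log(nk)\rceil$ must fit inside a strand of length $n$, which restricts the admissible triples $(a,\gamma,\kappa)$ beyond the checks made in this sketch.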

Next, we show that when logk=κn\log k=\kappa n and Lover=0L_{\rm over}=0, if LLmin(1+ϵ)log(nk)=(a1ϵ)log(nk)L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)=(a-1-\epsilon)\log(nk) for a positive ϵ\epsilon which is independent of nn, then the upper bound in Lemma 33 still can be achieved.

Construction F.

Suppose that Lmin=alog(nk)L_{\rm min}=\lceil a\log(nk)\rceil and Lover=0L_{\rm over}=0. Denote n¯nLLmin\bar{n}\triangleq\frac{n-L^{*}}{L_{\rm min}} and Klog(nk)K\triangleq\lceil\sqrt{\log(nk)}\rceil. Let Ilog(n¯k)I\triangleq\lceil\log(\bar{n}k)\rceil and rI(3d+8)logIr_{I}\triangleq\lceil(3d+8)\log I\rceil where d=2e+1d=2e+1. Then according to Theorem 14, there is a collection of (332log(I+rI)+,d)(3\lceil\frac{3}{2}\log(I+r_{I})\rceil+\ell,d)-WWL sequences 𝐜0,𝐜1,,𝐜2I1ΣI+rI\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}} such that the concatenation 𝐜0𝐜1𝐜2I1\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1} is an (I+rI,d)(I+r_{I},d)-SD sequence.

Denote $n^{\prime}\triangleq\bar{n}(L_{\rm min}-(I+r_{I}+K+\ell))+L^{*}$. Let $\mathcal{E}_{\mathtt{WWL}}$ be the encoder in [14, Algorithm 2] which can encode sequences of $\Sigma^{n^{\prime}-d}$ into $(\lceil K/4\rceil,d)$-WWL sequences\footnote{Note that $n^{\prime}=\Theta(n)$ and $K=\sqrt{\log(nk)}=\Theta(\sqrt{n})$. Hence, $K/4\gg\mathcal{F}(n^{\prime},d)=\log n^{\prime}+(d-1)\log\log n^{\prime}+O(1)$. Then according to Lemma 19 in [14], the encoder $\mathcal{E}_{\mathtt{WWL}}$ does work.} of $\Sigma^{n^{\prime}}$. For a message $\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{k-1}$ where $\mathbf{m}_{i}\in\Sigma^{n^{\prime}-d}$ for $i\in[k]$, let $\mathbf{v}_{i}\triangleq\mathcal{E}_{\mathtt{WWL}}(\mathbf{m}_{i})$ for all $i\in[k]$. We partition each $\mathbf{v}_{i}$ into $\bar{n}+1$ substrings as follows:

\[
\mathbf{v}_{i}=\mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\bar{n}-1}\circ\mathbf{v}_{i,\bar{n}}
\]

where |𝐯i,j|=Lmin(I+rI+K+)\lvert\mathbf{v}_{i,j}\rvert=L_{\rm min}-(I+r_{I}+K+\ell) for j[n¯]j\in[\bar{n}] and |𝐯i,n¯|=L.\lvert\mathbf{v}_{i,\bar{n}}\rvert=L^{*}.

Denote 𝐩0K𝐮\mathbf{p}\triangleq 0^{K}\circ\mathbf{u} where 𝐮\mathbf{u} is a dd-auto-cyclic sequence of length \ell. For each i[k]i\in[k], let

\[\mathbf{w}_{i}=\mathbf{v}_{i,0}\circ\mathbf{p}\circ\mathbf{c}_{i\bar{n}}\circ\mathbf{v}_{i,1}\circ\mathbf{p}\circ\mathbf{c}_{i\bar{n}+1}\circ\cdots\circ\mathbf{v}_{i,\bar{n}-1}\circ\mathbf{p}\circ\mathbf{c}_{(i+1)\bar{n}-1}\circ\mathbf{v}_{i,\bar{n}}.\]

Output $\{\mathbf{w}_{0},\mathbf{w}_{1},\ldots,\mathbf{w}_{k-1}\}$ as the codeword which encodes the message $\{\mathbf{m}_{0},\mathbf{m}_{1},\ldots,\mathbf{m}_{k-1}\}$. The image of the mapping described here is the constructed code.
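
The layout of a single strand in Construction F can be sketched in a few lines of Python. This is only an illustration of how the blocks $\mathbf{v}_{i,j}$, the marker $\mathbf{p}$ and the index sequences $\mathbf{c}_{j}$ are interleaved. The function name `build_strand`, the toy parameters, the letters standing in for data symbols, and the plain binary expansions used in place of the sequences $\mathbf{c}_{j}$ are all assumptions made for the example; the WWL encoder of [14] and the SD sequence from Theorem 14 are not implemented.

```python
def build_strand(v, i, n_bar, block_len, tail_len, p, index_seqs):
    """Interleave the data blocks of v with the marker p and index sequences.

    v          -- stand-in for the (already encoded) sequence v_i
    i          -- strand number, 0 <= i < k
    n_bar      -- number of data blocks that are followed by a marker
    block_len  -- length of each block v_{i,j}, j < n_bar
    tail_len   -- length L* of the final block v_{i,n_bar}
    p          -- marker string (0^K followed by u in the construction)
    index_seqs -- stand-ins for c_0, c_1, ... (NOT real SD/WWL sequences)
    """
    assert len(v) == n_bar * block_len + tail_len
    parts = []
    for j in range(n_bar):
        parts.append(v[j * block_len:(j + 1) * block_len])  # v_{i,j}
        parts.append(p)                                      # marker p
        parts.append(index_seqs[i * n_bar + j])              # c_{i*n_bar + j}
    parts.append(v[n_bar * block_len:])                      # tail v_{i,n_bar}
    return "".join(parts)


# Toy parameters (hypothetical): k = 2 strands, n_bar = 2, K = 2, u = "1".
marker = "0" * 2 + "1"
toy_index_seqs = [format(j, "02b") for j in range(4)]  # plain binary indices
w0 = build_strand("abcdefghijk", 0, 2, 4, 3, marker, toy_index_seqs)
w1 = build_strand("ABCDEFGHIJK", 1, 2, 4, 3, marker, toy_index_seqs)
print(w0)
print(w1)
```

Every data block together with its marker and index occupies exactly $L_{\rm min}$ positions, so each strand has length $\bar{n}L_{\rm min}+L^{*}=n$ (here $2\cdot 9+3=21$).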

Lemma 38.

Suppose that $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ for a positive $\epsilon$ which is independent of $n$. Then the code obtained in Construction F is a multi-strand $(L_{\rm min},0,e)$-trace reconstruction code of $\mathcal{X}_{n,k}$.

Proof:

Let $\mathbf{y}$ be a length-$L_{\rm min}$ substring of $\mathbf{w}_{i}$ for some $\mathbf{w}_{i}\in\{\mathbf{w}_{0},\mathbf{w}_{1},\ldots,\mathbf{w}_{k-1}\}$. Note that $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ and $\lvert\mathbf{p}\circ\mathbf{c}_{j}\rvert=K+\ell+I+r_{I}<(1+\epsilon)\log(nk)$. Then $\mathbf{y}$ must contain either a copy of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j}$ or a suffix of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j}$ together with a prefix of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j+1}$. Since the $\mathbf{v}_{i}$'s and $\mathbf{c}_{j}$'s are WWL sequences, even if $\mathbf{y}$ suffers from $e$ errors, we can still locate their positions in $\mathbf{y}$ by searching for the marker $\mathbf{p}$. Then we can run the locating algorithm of the robust positioning sequence $\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ to determine the index $i\bar{n}+j$ or $i\bar{n}+j+1$, and hence the location of $\mathbf{y}$. ∎
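
The case distinction at the start of the proof can be checked by brute force on a toy layout. The sketch below is a hedged illustration, not part of the proof: `classify_windows` and its parameters are hypothetical names, a "tag" stands for one block $\mathbf{p}\circ\mathbf{c}_{j}$, errors are ignored, and a window is labelled "split" when the tag pieces it sees add up to exactly one tag length (a suffix of one tag plus a prefix of the next).

```python
def classify_windows(n_bar, block_len, tag_len, tail_len):
    """Label every length-L_min window of a strand by how it meets the tags.

    The strand layout is (data block, tag) repeated n_bar times, followed
    by a tail of length tail_len, so L_min = block_len + tag_len.
    Returns one label per window start: "full", "split" or "bad".
    """
    l_min = block_len + tag_len
    n = n_bar * l_min + tail_len
    # Half-open intervals [start, end) occupied by the tags.
    tags = [(j * l_min + block_len, (j + 1) * l_min) for j in range(n_bar)]
    labels = []
    for s in range(n - l_min + 1):
        e = s + l_min
        if any(s <= a and b <= e for a, b in tags):
            labels.append("full")      # a whole tag lies inside the window
            continue
        # Total number of tag symbols that fall inside the window.
        covered = sum(max(0, min(e, b) - max(s, a)) for a, b in tags)
        labels.append("split" if covered == tag_len else "bad")
    return labels


short_tail = classify_windows(n_bar=2, block_len=4, tag_len=5, tail_len=3)
long_tail = classify_windows(n_bar=2, block_len=4, tag_len=5, tail_len=6)
print(short_tail.count("bad"), long_tail.count("bad"))
```

In the construction, the assumption on $L^{*}$ together with $\lvert\mathbf{p}\circ\mathbf{c}_{j}\rvert<(1+\epsilon)\log(nk)$ gives $L^{*}<L_{\rm min}-\lvert\mathbf{p}\circ\mathbf{c}_{j}\rvert$, so the tail is shorter than a data block and every window is "full" or "split". With a longer tail, some windows near the end of the strand see only a proper suffix of the last tag.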

Theorem 39.

Suppose that $\log k=\kappa n$, $L_{\rm min}=\lceil a\log(nk)\rceil$ and $L_{\rm over}=0$, where $0<\kappa<1$ and $a>1$. If $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ for a fixed positive $\epsilon$ which is independent of $n$, then there is a multi-strand $(L_{\rm min},0,e)$-trace reconstruction code which has code rate

\[\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\cdot\frac{L^{*}}{n}-o(1).\]

Proof:

Note that

\begin{align*}
\frac{n^{\prime}-d}{n} &=\frac{\bar{n}(L_{\rm min}-(I+r_{I}+K+\ell))+L^{*}-d}{n}=\frac{n-\bar{n}(I+r_{I}+K+\ell)-d}{n}\\
&=1-\frac{1-L^{*}/n}{L_{\rm min}}(I+r_{I}+K+\ell)-O\left(\frac{1}{n}\right)\\
&=1-\left(1-\frac{L^{*}}{n}\right)\frac{\log(nk)+O(\sqrt{\log(nk)})}{a\log(nk)}-O\left(\frac{1}{n}\right)\\
&=1-\frac{1}{a}+\frac{L^{*}}{an}-O\left(\frac{1}{\sqrt{\log(nk)}}\right).
\end{align*}

Hence, the code rate is

\[\frac{(n^{\prime}-d)k}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{(n^{\prime}-d)k}{nk}\cdot\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=\left(1-\frac{1}{a}+\frac{L^{*}}{an}-o(1)\right)\left(\frac{1}{1-\kappa}-o(1)\right)=\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\cdot\frac{L^{*}}{n}-o(1).\]

Finally, we note that the multi-strand $(L_{\rm min},0,e)$-trace reconstruction code in Construction F only guarantees recovery of the message from reliable $(L_{\rm min},0,e)$-erroneous traces, the occurrence of which might be rare since $L_{\rm over}=0$ and each symbol is usually included in a small number of substrings in $\mathcal{Y}$. Nevertheless, we can use a $(k,2^{(n^{\prime}-d)(k-r_{o})},2\tau+1)_{2^{n^{\prime}-d}}$ code to encode the message, as we did in Construction C, so that even if there are in total $\tau$ errors in $\mathcal{Y}$, we can still decode the message. The rate of this trace reconstruction code is

\[\left(1-\frac{r_{o}}{k}\right)\left(\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\cdot\frac{L^{*}}{n}\right)-o(1).\]
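
To get a feel for the two rate expressions, the leading terms can be evaluated for concrete parameters. The helper below is a small sketch under stated assumptions: the name `rate_leading_term` and its arguments are hypothetical, the $o(1)$ terms are simply dropped, and exact rational arithmetic is used so that no rounding enters the example. Setting `ro_over_k` to zero gives the leading term of Theorem 39.

```python
from fractions import Fraction


def rate_leading_term(a, kappa, tail_fraction, ro_over_k=0):
    """Leading term of the code rate, ignoring the o(1) corrections.

    a             -- constant in L_min = ceil(a * log(nk)), a > 1
    kappa         -- constant in log k = kappa * n, 0 < kappa < 1
    tail_fraction -- the ratio L* / n
    ro_over_k     -- the ratio r_o / k of the outer code (0 for Theorem 39)
    """
    a, kappa = Fraction(a), Fraction(kappa)
    t, r = Fraction(tail_fraction), Fraction(ro_over_k)
    assert a > 1 and 0 < kappa < 1
    base = (1 - 1 / a) / (1 - kappa) + t / (a * (1 - kappa))
    return (1 - r) * base


# Example: a = 3, kappa = 1/10, L*/n = 1/10, with and without an outer code.
print(rate_leading_term(3, "1/10", "1/10"))
print(rate_leading_term(3, "1/10", "1/10", ro_over_k="1/10"))
```

Passing the ratios as strings or `Fraction` objects keeps the arithmetic exact; Python floats would also be accepted but would carry their binary rounding error into the result.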

References

  • [1] J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan, “String reconstruction from substring compositions,” SIAM J. Discrete Math., vol. 29, no. 3, pp. 1340–1371, 2015.
  • [2] D. Bar-Lev, S. Marcovich, E. Yaakobi, and Y. Yehezkeally, “Adversarial torn-paper codes,” in Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, Finland, Jun. 2022, pp. 2934–2939.
  • [3] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing strings from random traces,” in Proc. of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 2004, pp. 910–918.
  • [4] R. Berkowitz and S. Kopparty, “Robust positioning patterns,” in Proc. of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, USA, 2016, pp. 1937–1951.
  • [5] A. M. Bruckstein, T. Etzion, R. Giryes, N. Gordon, R. J. Holt, and D. Shuldiner, “Simple and robust binary self-location patterns,” IEEE Trans. Inform. Theory, vol. 58, no. 7, pp. 4884–4889, 2012.
  • [6] Y. M. Chee, D. T. Dao, H. M. Kiah, S. Ling, and H. Wei, “Robust positioning patterns with low redundancy,” SIAM J. Comput., vol. 49, no. 2, pp. 284–317, 2020.
  • [7] D. T. Dao, H. M. Kiah, and H. Wei, “Maximum length of robust positioning sequences,” in Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT2020), Los Angeles, CA, USA, 2020, pp. 108–113.
  • [8] M. Dudik and L. J. Schulman, “Reconstruction from subsequences,” J. Combin. Theory Ser. A, vol. 103, no. 2, pp. 337–348, 2003.
  • [9] O. Elishco, R. Gabrys, E. Yaakobi, and M. Médard, “Repeat-free codes,” IEEE Trans. Inform. Theory, vol. 67, no. 9, pp. 5749–5764, 2021.
  • [10] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded strings from multiset substring spectra,” IEEE Trans. Inform. Theory, vol. 65, no. 12, pp. 7682–7696, 2019.
  • [11] H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence profiles,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3125–3146, Jun. 2016.
  • [12] V. I. Levenshtein, “Efficient reconstruction of sequences from their subsequences or supersequences,” J. Combin. Theory Ser. A, vol. 93, no. 2, pp. 310–332, 2001.
  • [13] V. I. Levenshtein, “Efficient reconstruction of sequences,” IEEE Trans. Inform. Theory, vol. 47, no. 1, pp. 2–22, 2001.
  • [14] M. Levy and E. Yaakobi, “Mutually uncorrelated codes for DNA storage,” IEEE Trans. Inform. Theory, vol. 65, no. 6, pp. 3671–3691, 2019.
  • [15] B. Manvel, A. Meyerowitz, A. Schwenk, K. Smith, and P. Stockmeyer, “Reconstruction of sequences,” Discrete Math., vol. 94, no. 3, pp. 209–219, 1991.
  • [16] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” IEEE Trans. Inform. Theory, vol. 67, no. 7, pp. 4369–4384, 2021.
  • [17] S. Nassirpour, I. Shomorony, and A. Vahid, “Reassembly codes for the chop-and-shuffle channel,” Jan. 2022. [Online]. Available: http://arxiv.org/abs/2201.03590
  • [18] S. Pattabiraman, R. Gabrys, and O. Milenkovic, “Coding for polymer-based data storage,” IEEE Trans. on Inform. Theory (Early Access), 2023.
  • [19] A. N. Ravi, A. Vahid, and I. Shomorony, “Capacity of the torn paper channel with lost pieces,” in Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT2021), Melbourne, Victoria, Australia, Jul. 2021, pp. 1937–1942.
  • [20] A. D. Scott, “Reconstructing sequences,” Discrete Math., vol. 175, pp. 231–238, 1997.
  • [21] I. Shomorony and A. Vahid, “Torn-paper coding,” IEEE Trans. Inform. Theory, vol. 67, no. 12, pp. 7904–7913, 2021.
  • [22] E. Ukkonen, “Approximate string-matching with $q$-grams and maximal matches,” Theoret. Comp. Sci., vol. 92, no. 1, pp. 191–211, 1992.
  • [23] C. Wang, J. Sima, and N. Raviv, “Break-resilient codes for forensic 3D fingerprinting,” arXiv preprint arXiv:2310.03897, 2023.
  • [24] H. Wei, “Nearly optimal robust positioning patterns,” IEEE Trans. Inform. Theory, vol. 68, no. 1, pp. 193–203, 2022.
  • [25] Y. Yehezkeally, D. Bar-Lev, S. Marcovich, and E. Yaakobi, “Generalized unique reconstruction from substrings,” IEEE Trans. Inform. Theory, vol. 69, no. 9, pp. 5648–5659, Sep. 2023.
  • [26] Y. Yehezkeally and N. Polyanskii, “On codes for the noisy substring channel,” Sep. 2023. [Online]. Available: http://arxiv.org/abs/2102.01412v3