Reconstruction from Noisy Substrings

Hengjia Wei, Moshe Schwartz

H. Wei is with the Peng Cheng Laboratory, Shenzhen 518055, China. He is also with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China, and the Pazhou Laboratory (Huangpu), Guangzhou 510555, China (e-mail: hjwei05@gmail.com).

M. Schwartz is on a leave of absence from the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel. He is now with the Department of Electrical and Computer Engineering at McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: schwartz.moshe@mcmaster.ca).

G. Ge is with the School of Mathematical Sciences, Capital Normal University, Beijing 100048, China (e-mail: gnge@zju.edu.cn).

This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFA0712100, the National Natural Science Foundation of China under Grant 11971325, Grant 12231014 and Grant 12371523, Beijing Scholars Program, the major key project of Peng Cheng Laboratory under grant PCL2023AS1-2, and the Zhejiang Lab BioBit Program under Grant 2022YFB507.

Abstract

This paper studies the problem of encoding messages into sequences which can be uniquely recovered from some noisy observations about their substrings. The observed reads comprise consecutive substrings with some given minimum overlap. This coded reconstruction problem has applications to DNA storage. We consider both single-strand reconstruction codes and multi-strand reconstruction codes, where the message is encoded into a single strand or a set of multiple strands, respectively. Various parameter regimes are studied. New codes are constructed, some of whose rates asymptotically attain the upper bounds.

Index Terms:

DNA storage, sequence (string) reconstruction, substitution, substring-distant sequences, robust positioning sequences.

I Introduction

Sequence (string) reconstruction refers to a large class of problems of reconstructing a sequence from partial (perhaps noisy) observations of it. Instances of this problem include reconstruction from multiple erroneous copies of the sequence [13, 12, 3], some substrings of the sequence [11, 10], all the length- $k$ subsequences [15, 20, 8], and compositions of the sequence’s substrings or prefixes/suffixes [1, 18].

In this paper, we shall consider the problem of encoding messages into sequences which can be uniquely recovered from observations about their substrings. This coding problem is motivated by applications to DNA-based data storage systems, where data are encoded into long DNA sequences. In some DNA sequencing technologies (e.g., shotgun sequencing), a long DNA strand is first replicated multiple times, and these replicas are then fragmented into short substrings so that they can be read. In order to retrieve the data, the original long sequence must be reconstructed from the observations of these short substrings.

This coded reconstruction problem has been studied in different models with different assumptions on the substrings. Gabrys and Milenkovic [10] considered the problem of reconstructing a sequence of length $n$ from its $L$ -multispectrum, i.e., the multiset of all of its length- $L$ substrings. They constructed two classes of reconstruction codes with redundancies $2$ and $O(\log\log n)$ for $L>2\log n$ and $\log n<L\leqslant 2\log n$ , respectively. They also studied the noisy settings in which some substrings/observations may be lost or be corrupted by errors, and constructed codes to combat these effects. Subsequently, Marcovich and Yaakobi [16] followed this noisy setup and provided more code constructions. The constructions in [10, 16] are based on the so-called $(L,d)$ -substring distant (SD) sequence, a sequence in which every two length- $L$ substrings are of Hamming distance at least $d$ apart. When $d=1$ , such sequences are also known as $L$ -substring unique sequences or $L$ -repeat free sequences. Efficient encoding algorithms can be found in [9] for $L>\log n$ . For general $d$ , Marcovich and Yaakobi [16] proposed an encoding algorithm of $(L,d)$ -SD sequences for $L>2\log n$ .

Another model is the torn-paper channel, which randomly tears the input sequence into small pieces of different sizes. The output of this channel is a set of substrings of the input sequence with no overlap, and the message which is carried by the input sequence should be recovered from these substrings. This problem has been researched in the probabilistic setting in [21, 19, 17]. Recently, Bar-Lev et al. [2] considered this problem in the worst-case. They studied both the noiseless setup and the noisy setup, and proposed a couple of index-based constructions to encode messages into sequences each of which can be uniquely recovered from its non-overlapping substrings. Furthermore, motivated by DNA sequencing technologies where multiple strings are sequenced simultaneously, they extended the single-strand reconstruction problem to a multi-strand reconstruction problem. They constructed multi-strand reconstruction codes whose rates asymptotically behave like those of single-strand reconstruction codes. Another related paper is by Wang et al. [23], which, unlike [2], does not restrict the length of the torn substrings, but rather their number. For this setting they construct codes that attain the upper bound on the rate up to asymptotically small factors.

In a recent paper, Yehezkeally et al. [25] proposed a general model, which includes the two models above as extreme cases. In this model, the reconstruction is based on the sequence’s $(L_{\rm min},L_{\rm over})$ -trace, which is a multiset of substrings where every substring has length at least $L_{\rm min}$ and the overlap of every two consecutive substrings has length at least $L_{\rm over}$ . They focused on the noiseless setup, and constructed a class of trace reconstruction codes whose rate can asymptotically achieve the upper bound. They also studied the multi-strand reconstruction problem in the $L$ -multispectrum model, and proposed reconstruction codes whose rates are asymptotically $1$ .

In this paper, we shall follow the model in [25] and study the coding problem for both single-strand reconstruction and multi-strand reconstruction in the noisy setup. We aim to encode a message into a sequence which can be uniquely recovered from its $(L_{\rm min},L_{\rm over},e)$ -erroneous trace, where each substring may suffer from at most $e$ substitution errors, or to encode a message into a set of $k$ sequences which can be recovered from the union of their $(L_{\rm min},L_{\rm over},e)$ -erroneous traces. Our contributions are listed as follows.

1. We first give an algorithm which can encode messages into $(L,d)$ -SD sequences for $L=\lceil a\log n\rceil$ where $a>1$ is an arbitrary real constant. The rates of the encoded sequences asymptotically approach $1$ . In contrast, the encoding algorithm in [16] requires a single redundancy bit but works only when $L>2\log n$ .

2. For single-strand reconstruction, by using the proposed encoding algorithm for SD sequences, we construct two classes of $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes whose rates asymptotically achieve the upper bound.

3. For multi-strand reconstruction, we present some upper bounds on the rates of multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes, as well as some code constructions. In some parameter regimes, our constructions yield codes whose rates asymptotically attain the upper bounds. Interestingly, when $\log k=\kappa n$ , $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$ , the maximal rates of multi-strand reconstruction codes not only depend on $\kappa,a,\gamma$ , but also depend on the congruence class of $n$ modulo $L_{\rm min}-L_{\rm over}$ .

II Preliminaries

For a positive integer $n\in\mathbb{N}$ , let $[n]$ denote the set $\{0,1,2,\ldots,n-1\}$ . Let $\Sigma$ denote a finite alphabet. Throughout this paper, we consider the binary case, i.e., $\Sigma=\{0,1\}$ ; however, our results can be easily generalized to non-binary alphabets. We use $\log x$ to denote the logarithm of $x$ to base $2$ . When generalizing our results to the $q$ -ary alphabet case, it suffices to replace $\log$ with $\log_{q}$ .

Assume $\mathbf{x}=(x_{0},x_{1},\ldots,x_{n-1})\in\Sigma^{n}$ is a sequence over $\Sigma$ . We denote its length by $\lvert\mathbf{x}\rvert=n$ , and its Hamming weight by $\operatorname{wt}_{H}(\mathbf{x})$ . Given two sequences $\mathbf{x}$ and $\mathbf{y}$ over $\Sigma$ , we denote their concatenation by $\mathbf{x}\circ\mathbf{y}$ . If $\mathbf{x}$ and $\mathbf{y}$ have the same length, we use $d_{H}(\mathbf{x},\mathbf{y})$ to denote their Hamming distance.

A substring of $\mathbf{x}$ is a sequence of the form $(x_{a},x_{a+1},\dots,x_{b})$ , where $0\leqslant a\leqslant b<\lvert\mathbf{x}\rvert$ , and we use $\mathbf{x}[a,b]$ to denote it. We also use $\mathbf{x}_{i+[L]}$ , where $i\in[n-L+1]$ , to denote the substring of $\mathbf{x}$ which starts at the position $i$ and has length $L$ , i.e., $\mathbf{x}_{i+[L]}=(x_{i},x_{i+1},\ldots,x_{i+L-1})=\mathbf{x}[i,i+L-1]$ .

A code is simply a set $\mathcal{C}\subseteq\Sigma^{n}$ , whose elements are referred to as codewords. We say $n$ is the length of the code. The rate of the code is defined as $R(\mathcal{C})=\frac{1}{n}\log\lvert\mathcal{C}\rvert$ , and the redundancy of the code is $n-\log\lvert\mathcal{C}\rvert$ .

II-A Reconstruction from the $L$ -Multispectrum

For a sequence $\mathbf{x}\in\Sigma^{n}$ and a positive integer $L\leqslant n$ , the $L$ -multispectrum of $\mathbf{x}$ , denoted by $\mathcal{S}_{L}(\mathbf{x})$ , is the multiset of all its length- $L$ substrings, namely,

\mathcal{S}_{L}(\mathbf{x})=\left\{\mathbf{x}_{0+[L]},\mathbf{x}_{1+[L]},\ldots,\mathbf{x}_{n-L+[L]}\right\}.

If $\mathbf{x}$ can be uniquely reconstructed from its $L$ -multispectrum, then we say it is $L$ -reconstructible. It was proved in [22] that if all the length- $(L-1)$ substrings of $\mathbf{x}$ are distinct, then $\mathbf{x}$ is $L$ -reconstructible. Such a sequence is referred to as an $L$ -substring unique sequence. In the works [10, 9], algorithms were proposed to construct a set of $L$ -substring unique sequences of rate approaching $1$ , where $L=\lceil a\log n\rceil$ for any constant real number $a>1$ .
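As a small illustration of the definition above, the $L$ -multispectrum can be computed directly as a multiset of sliding windows. The following is only a sketch: sequences are represented as Python strings over "0" and "1", and the function name `multispectrum` is chosen here for illustration and does not come from the cited works.

```python
from collections import Counter


def multispectrum(x: str, L: int) -> Counter:
    """Return the L-multispectrum of x as a multiset (Counter) of substrings.

    The Counter keeps multiplicities but deliberately forgets the positions,
    which is exactly the information a decoder does not have.
    """
    if not 1 <= L <= len(x):
        raise ValueError("L must satisfy 1 <= L <= len(x)")
    # One window x[i : i + L] for every start position i in [n - L + 1].
    return Counter(x[i:i + L] for i in range(len(x) - L + 1))
```

For instance, `multispectrum("0000", 2)` contains the single substring "00" with multiplicity three, which shows why the multispectrum has to be treated as a multiset rather than a set.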

In [10], Gabrys and Milenkovic further studied the problem of reconstructing sequences from their noisy multispectra. They first considered the scenario where some substrings are not included in the readout spectrum. For a subset $\hat{\mathcal{S}}\subset\mathcal{S}_{L}(\mathbf{x})$ , if the maximum number of consecutive substrings which are not included in $\hat{\mathcal{S}}$ is $G$ , we say $\hat{\mathcal{S}}$ has maximal coverage gap $G$ . A code is called an $(L,G)$ -reconstruction code if every codeword $\mathbf{x}$ can be uniquely reconstructed from any subset $\hat{\mathcal{S}}\subset\mathcal{S}_{L}(\mathbf{x})$ with maximal coverage gap $G$ . Gabrys and Milenkovic proposed a construction for such codes [10] by restricting each codeword $\mathbf{x}$ to be $\hat{L}$ -substring unique with $\hat{L}<L-G$ and imposing some constraints on their prefixes.

Gabrys and Milenkovic also researched the scenario where the observations about the substrings suffer from substitution errors. Let $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ be a multiset consisting of $m$ strings of length $L$ . If there is a subset $\hat{\mathcal{S}}=\{\mathbf{x}_{i_{0}},\mathbf{x}_{i_{1}},\ldots,\mathbf{x}_{i_{m-1}}\}\subset\mathcal{S}_{L}(\mathbf{x})$ with maximal coverage gap $G$ such that $d_{H}(\mathbf{y}_{j},\mathbf{x}_{i_{j}})\leqslant e$ for all $j\in[m]$ , then we say $\mathcal{Y}$ is an $(L,G,e)$ -constrained erroneous multispectrum of $\mathbf{x}$ . Moreover, $\mathcal{Y}$ is said to be reliable if, for every symbol in $\mathbf{x}$ , there are more copies of the correct value than of an incorrect value of the symbol. A code is called an $(L,G,e)$ -reconstruction code if every codeword can be uniquely reconstructed from any of its reliable $(L,G,e)$ -constrained erroneous multispectra.¹ Gabrys and Milenkovic constructed an $(L,G,e)$ -reconstruction code of redundancy $O(\log\log n)$ for $L=6\log n+O(\log\log n)$ . Their construction is based on $(L,d)$ -substring distant sequences, whose definition is presented as follows.

¹ We emphasize that the multispectrum $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ is just a multiset, and the order/index $i$ of each $\mathbf{y}_{i}$ cannot be directly read when reconstructing.

Definition 1.

A sequence $\mathbf{w}\in\Sigma^{n}$ is called $(L,d)$ -substring distant (SD) if the minimum Hamming distance of its $L$ -multispectrum is at least $d$ , that is, $d_{H}(\mathbf{w}_{i+[L]},\mathbf{w}_{j+[L]})\geqslant d$ for any $0\leqslant i<j\leqslant n-L$ .

Remark.

We observe that an $(L,d)$ -substring distant sequence is also $(L^{\prime},d)$ -substring distant, for any $L^{\prime}\geqslant L$ . Thus, we may equivalently say that $\mathbf{w}\in\Sigma^{n}$ is $(L,d)$ -substring distant (SD) if $d_{H}(\mathbf{w}_{i+[L^{\prime}]},\mathbf{w}_{j+[L^{\prime}]})\geqslant d$ for any integer $L^{\prime}\geqslant L$ and $0\leqslant i<j\leqslant n-L^{\prime}$ . This equivalent definition allows $L$ to be a real number, which we shall conveniently use in the future.
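Definition 1 can be checked by brute force on short sequences, and the same Hamming distances give a simple way to locate a noisy window by nearest-neighbour search. The sketch below is illustrative only: the function names and the example sequence "01001101" are chosen here and do not appear in the cited works, and the quadratic-time search is not meant as an efficient locating algorithm.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(s != t for s, t in zip(a, b))


def is_sd(w: str, L: int, d: int) -> bool:
    """Check whether w is (L, d)-substring distant by comparing all window pairs."""
    windows = [w[i:i + L] for i in range(len(w) - L + 1)]
    return all(hamming(a, b) >= d for a, b in combinations(windows, 2))


def locate(w: str, L: int, read: str) -> int:
    """Return the start position whose length-L window is closest to `read`.

    If w is (L, d)-SD and `read` has at most floor((d - 1) / 2) substitution
    errors, the triangle inequality makes the closest window unique.
    """
    return min(range(len(w) - L + 1), key=lambda i: hamming(w[i:i + L], read))
```

With `w = "01001101"`, every two length-6 windows are at distance 4, so `is_sd(w, 6, 3)` holds, and the read "000110" (the window starting at position 1 with its first bit flipped) is located at position 1. In line with the remark, the same sequence also passes `is_sd(w, 7, 3)`.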

In [16], Marcovich and Yaakobi followed the noisy setup of Gabrys and Milenkovic. They studied the case of $G=0$ , i.e., no substring losses. Instead of reconstructing $\mathbf{x}$ from a reliable erroneous multispectrum, they aimed to reconstruct, from an $(L,0,e)$ -erroneous multispectrum $\mathcal{Y}$ , the so-called maximum reconstructible-string, i.e., a string of length $n$ that takes at every position $i$ the majority value of the occurrences of $x_{i}$ in $\mathcal{Y}$ . Obviously, if $\mathcal{Y}$ is reliable, then the maximum reconstructible-string is equal to $\mathbf{x}$ . A sequence $\mathbf{x}$ is called $(L,0,e)$ -reconstructible² if one can always reconstruct the maximum reconstructible-string from any of its $(L,0,e)$ -erroneous multispectra.

² The notion here is a bit different from that in [16], where Marcovich and Yaakobi further assumed that there are at most $t$ substrings in $\mathcal{Y}$ each of which is affected by at most $e$ errors and referred to it as a $(t,e)$ -erroneous multispectrum. They proposed two constructions for reconstructible codes: one is independent of $t$ and thus can combat any number of erroneous substrings, while the other one depends on $t$ . In this paper, we focus on reconstructible codes which are independent of $t$ .

Proposition 2 ([16, Theorem 16]).

If $\mathbf{x}$ is $(L-1,4e+1)$ -SD, then it is $(L,0,e)$ -reconstructible.

For positive integers $n,d,L$ with $d\leqslant L<n$ , we use $\mathcal{Z}_{n}(L,d)$ to denote the set of $(L,d)$ -SD sequences of $\Sigma^{n}$ . For fixed $d$ and $a>1$ , Marcovich and Yaakobi showed that the asymptotic rate of the set $\mathcal{Z}_{n}(a\log n,d)$ is $1$ , by using the Lovász Local Lemma. Note that when $a<1$ , even a single $(a\log n)$ -substring unique sequence of length $n$ does not exist.

Theorem 3 ([16, Theorem 19]).

For fixed $d$ and $a>1$ ,

\lim_{n\to\infty}\frac{\log\lvert\mathcal{Z}_{n}(a\log n,d)\rvert}{n}=1.

Marcovich and Yaakobi also presented a deterministic algorithm which uses a single redundancy bit to encode $(a\log n,d)$ -SD sequences for $a>2$ .

Theorem 4 ([16, Algorithm 4 and Theorem 25]).

Let $d>0$ be a fixed integer. There is an encoding algorithm which uses a single redundancy bit to encode $(L,d)$ -SD sequences of length $n$ , for

L=2\log n+2(d-1+\epsilon)\log\log n,

where $\epsilon>0$ is a small constant number and $n$ is sufficiently large.

In Section III, we shall present an algorithm which can encode $(a\log n,d)$ -SD sequences of length $n$ for any $a>1$ , while its redundancy is $o(n)$ . According to Proposition 2, this implies an $(L,0,e)$ -reconstructible code whose rate approaches $1$ , for $L=\lceil a\log n\rceil+1$ and $e=\lfloor\frac{d-1}{4}\rfloor$ .

II-B Reconstruction from an $(L_{\rm min},L_{\rm over})$ -trace

In [25], Yehezkeally et al. studied an extension of the problem of reconstructing from substrings. Let $\mathbf{x}\in\Sigma^{n}$ be a sequence. A substring trace of $\mathbf{x}$ is a multiset of substrings $\{\mathbf{x}_{i_{0}+[L_{0}]},\mathbf{x}_{i_{1}+[L_{1}]},\ldots,\mathbf{x}_{i_{m-1}+[L_{m-1}]}\}$ for some positive integer $m$ , where $i_{0}<i_{1}<\cdots<i_{m-1}$ . If $i_{0}=0$ , $i_{j+1}<i_{j}+L_{j}$ for all $j<m-1$ , and $i_{m-1}+L_{m-1}=n$ , then the substring trace is called complete. Let $L_{\rm min}$ and $L_{\rm over}$ be two positive integers such that $L_{\rm over}<L_{\rm min}<n$ . An $(L_{\rm min},L_{\rm over})$ -trace is a complete trace such that:

1. every substring has length at least $L_{\rm min}$ , i.e., $L_{i}\geqslant L_{\rm min}$ for all $i\in[m]$ ;

2. the overlap of every two consecutive substrings has length at least $L_{\rm over}$ , i.e., $i_{j}+L_{j}-i_{j+1}\geqslant L_{\rm over}$ for all $j\in[m-1]$ .

For a sequence $\mathbf{x}$ , let $\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x})$ denote the set of all $(L_{\rm min},L_{\rm over})$ -traces of $\mathbf{x}$ . A code $\mathcal{C}$ is referred to as an $(L_{\rm min},L_{\rm over})$ -trace reconstruction code if $\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x})\cap\mathcal{T}_{L_{\rm min}}^{L_{\rm over}}(\mathbf{x}^{\prime})=\emptyset$ for all $\mathbf{x}\neq\mathbf{x}^{\prime}\in\mathcal{C}$ , or equivalently, every codeword can be uniquely reconstructed from any of its $(L_{\rm min},L_{\rm over})$ -traces.
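The conditions defining an $(L_{\rm min},L_{\rm over})$ -trace can be checked mechanically once the start positions and lengths of the substrings are known. The following is a minimal sketch under the assumptions stated in the text ( $L_{\rm over}\geqslant 1$ ); the function name `is_trace` and the representation of a trace as a list of `(start, length)` pairs are choices made here for illustration.

```python
def is_trace(n: int, pieces: list[tuple[int, int]], l_min: int, l_over: int) -> bool:
    """Check whether `pieces` describes an (l_min, l_over)-trace of a length-n sequence.

    `pieces` is a list of (start, length) pairs, one per substring, listed in
    the order i_0, i_1, ..., i_{m-1}.
    """
    if not pieces:
        return False
    starts = [start for start, _ in pieces]
    # Start positions must be strictly increasing: i_0 < i_1 < ... < i_{m-1}.
    if any(a >= b for a, b in zip(starts, starts[1:])):
        return False
    # Completeness: the first piece starts at 0 and the last one ends at n.
    if starts[0] != 0 or sum(pieces[-1]) != n:
        return False
    # Every piece lies inside the sequence and has length at least l_min.
    if any(length < l_min or start + length > n for start, length in pieces):
        return False
    for (start, length), nxt in zip(pieces, starts[1:]):
        overlap = start + length - nxt
        # Completeness needs a nonempty overlap; the trace needs at least l_over.
        if overlap < 1 or overlap < l_over:
            return False
    return True
```

For example, with $n=10$ the pieces `[(0, 5), (3, 4), (5, 5)]` form a $(4,2)$ -trace, since both consecutive overlaps have length exactly 2, but they do not form a $(4,3)$ -trace.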

Proposition 5 ([25, Lemma 1]).

Let $\mathbf{x}$ be an $L_{\rm over}$ -substring unique sequence. Then $\mathbf{x}$ can be uniquely reconstructed from any of its $(L_{\rm min},L_{\rm over})$ -traces.

By refining the constructions of substring unique sequences, Yehezkeally et al. obtained the following result.

Theorem 6 ([25, Corollary 6]).

There is an $(L_{\rm min},L_{\rm over})$ -trace reconstruction code of $\Sigma^{n}$ whose rate approaches $1$ , for $L_{\rm over}\geqslant\lceil\log n\rceil+3\lceil\log\log n\rceil+12$ and sufficiently large $n$ .

They also studied the other parameter regimes.

Lemma 7 ([25, Lemma 8]).

If $L_{\rm min}=a\log n+O(1)$ and $L_{\rm over}=\gamma L_{\rm min}+O(1)$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$ , then for any $(L_{\rm min},L_{\rm over})$ -trace reconstruction code $\mathcal{C}\subseteq\Sigma^{n}$ , its rate $R(\mathcal{C})$ must satisfy

R(\mathcal{C})\leqslant\frac{1-1/a}{1-\gamma}+O\left\lparen\frac{\log\log n}{\log n}\right\rparen.

Theorem 8 ([25, Theorem 15]).

Let $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$ . If $n$ is sufficiently large, then there is an $(L_{\rm min},L_{\rm over})$ -trace reconstruction code $\mathcal{C}\subseteq\Sigma^{n}$ with rate

R(\mathcal{C})\geqslant\frac{1-1/a}{1-\gamma}-\frac{(\log n)^{\epsilon}}{a\sqrt{\log n}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen,

where $\epsilon>0$ is a small number which is independent of $n$ .
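As a quick sanity check on the leading term shared by Lemma 7 and Theorem 8 (ignoring the vanishing terms; the sample values $a=2$ and $\gamma=1/4$ are chosen here only for illustration), the two endpoints of the admissible range of $\gamma$ and one interior point give

\left.\frac{1-1/a}{1-\gamma}\right|_{\gamma=0}=1-\frac{1}{a},\qquad\left.\frac{1-1/a}{1-\gamma}\right|_{\gamma=1/a}=1,\qquad\left.\frac{1-1/a}{1-\gamma}\right|_{a=2,\,\gamma=1/4}=\frac{1/2}{3/4}=\frac{2}{3}.

Thus the leading term grows from $1-1/a$ when there is no guaranteed overlap to $1$ when $\gamma=1/a$ , i.e., when $L_{\rm over}=\gamma L_{\rm min}=\log n$ .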

In this paper, we shall study the problem of reconstructing sequences from their noisy substring traces. Let $\mathcal{Y}=\{\mathbf{y}_{0},\mathbf{y}_{1},\ldots,\mathbf{y}_{m-1}\}$ be a multiset of sequences over $\Sigma$ , and let $L_{j}=\lvert\mathbf{y}_{j}\rvert$ for $j\in[m]$ . We say $\mathcal{Y}$ is an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{x}$ if there exists an $(L_{\rm min},L_{\rm over})$ -trace $\{\mathbf{x}_{i_{0}+[L_{0}]},\mathbf{x}_{i_{1}+[L_{1}]},\ldots,\mathbf{x}_{i_{m-1}+[L_{m-1}]}\}$ such that $d_{H}(\mathbf{y}_{j},\mathbf{x}_{{i_{j}}+[L_{j}]})\leqslant e$ for all $j\in[m]$ . Namely, each string $\mathbf{y}_{j}$ in $\mathcal{Y}$ is an erroneous copy of the substring $\mathbf{x}_{{i_{j}}+[L_{j}]}$ in $\mathbf{x}$ with at most $e$ errors. The index $i_{j}$ is referred to as the location of $\mathbf{y}_{j}$ in $\mathbf{x}$ . For a sequence $\mathbf{x}$ , if one can always determine the location of every $\mathbf{y}_{j}\in\mathcal{Y}$ in $\mathbf{x}$ for any of its $(L_{\rm min},L_{\rm over},e)$ -erroneous traces $\mathcal{Y}$ , then we say $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible. We note that once the locations of all the $\mathbf{y}_{j}$ ’s are identified, the maximum reconstructible-string of $\mathcal{Y}$ can be determined by taking at every position $i$ the majority value of the occurrences of $x_{i}$ in $\mathcal{Y}$ . Hence, the $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible sequence $\mathbf{x}$ can be uniquely reconstructed as long as $\mathcal{Y}$ is reliable.
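The majority step described above can be sketched as follows, assuming the locations of all reads have already been determined. The function name `majority_reconstruct` and the input format (a list of `(location, read)` pairs) are illustrative choices made here; how the locations are found is the actual subject of the constructions and is not shown.

```python
from collections import Counter


def majority_reconstruct(n: int, located_reads: list[tuple[int, str]]) -> str:
    """Compute the maximum reconstructible-string from reads with known locations.

    Each read votes for the symbols of the positions it covers; every position
    takes the most frequent vote. Ties are broken by first occurrence, which
    cannot happen when the trace is reliable.
    """
    votes = [Counter() for _ in range(n)]
    for location, read in located_reads:
        for offset, symbol in enumerate(read):
            votes[location + offset][symbol] += 1
    if any(not v for v in votes):
        raise ValueError("some position is not covered by any read")
    return "".join(v.most_common(1)[0][0] for v in votes)
```

For the sequence "01001101", the reads `(0, "01001")`, `(2, "01110")` and `(3, "01101")` cover every position; the second read carries one substitution error at position 3, which is outvoted by the other two reads covering that position, so the original sequence is recovered.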

A code is called an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code if every codeword $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible.³ In Section IV, we will give two constructions for $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes where the number of errors $e$ is fixed. Our results are akin to Theorem 6 and Theorem 8. In particular, when $L_{\rm over}=a\log n$ for some $a>1$ , we construct a class of $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes whose rates approach $1$ . When $L_{\rm min}=a\log n$ and $L_{\rm over}=\gamma L_{\rm min}$ for some $a>1$ and $0\leqslant\gamma\leqslant\frac{1}{a}$ , the proposed $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes have rates close to $\frac{1-1/a}{1-\gamma}$ . These results are summarized in Table I. Our constructions are based on robust positioning sequences and window-weight limited sequences, which are reviewed in Section II-D.

³ Unlike the noiseless case, in an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code it might be possible that two codewords share a common $(L_{\rm min},L_{\rm over},e)$ -erroneous trace. Nevertheless, they cannot have a common reliable trace.

We note that when $L_{\rm over}=0$ , $(L_{\rm min},0)$ -reconstruction codes were studied by Bar-Lev et al. in [2] under the name of adversarial torn-paper codes. In the same paper, they also consider the scenario where the DNA strand may suffer from substitution errors before sequencing. Such errors cannot be corrected by majority decoding. Yehezkeally and Polyanskii studied a similar problem for the $(L+1,L)$ -trace reconstruction [26]. They introduced the notion of a $(t,L)$ -resilient repeat free sequence, which satisfies the property that the result of any $t$ substitution errors to it is $L$ -repeat free, and proposed an algorithm to directly encode such sequences. Interestingly, [26, Lemma 6] shows that an $(L,2t+1)$ -SD sequence is $(t,L)$ -resilient repeat free. In Section IV, we will also study errors before sequencing and modify our code construction for $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction to combat such errors.

TABLE I: Lower and upper bounds on the code rate of single-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes of $\Sigma^{n}$

Parameter regimes	Lower bound	Ref.	Upper bound	Ref.
$L_{\rm over}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d$ , where $d=4e+1$	$1-o(1)$	Corollary 19	$1$	
$L_{\rm min}=\lceil a\log(n)\rceil$ , $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$ , where $a>1$ and $0\leqslant a\gamma\leqslant 1$	$\frac{1-1/a}{1-\gamma}-o(1)$	Theorem 25 & Theorem 28	$\frac{1-1/a}{1-\gamma}+o(1)$	Lemma 7

II-C Multi-strand reconstruction

Motivated by DNA sequencing technologies where multiple DNA strands are sequenced simultaneously, the reconstruction problem has been extended to the multi-strand case in [25, 2], i.e., reconstructing a multiset of $k$ sequences of length $n$ from the union of their traces.

Define

\mathcal{X}_{n,k}\triangleq\left\{\left\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\right\}~{}:~{}\mathbf{x}_{i}\in\Sigma^{n}\textup{ for all }i\in[k]\right\}.

Then $\lvert\mathcal{X}_{n,k}\rvert=\binom{k+2^{n}-1}{k}$ . The rate of a multi-strand code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is defined as

R(\mathcal{C})\triangleq\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}.
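The count $\lvert\mathcal{X}_{n,k}\rvert=\binom{k+2^{n}-1}{k}$ is the usual number of size- $k$ multisets drawn from $2^{n}$ items, and it can be cross-checked by enumeration for tiny parameters. The sketch below is illustrative: the function names are chosen here, and the enumeration is only feasible for very small $n$ and $k$ .

```python
from itertools import combinations_with_replacement, product
from math import comb, log2


def num_multisets(n: int, k: int) -> int:
    """Size of X_{n,k}: multisets of k binary sequences of length n."""
    return comb(k + 2**n - 1, k)


def enumerate_multisets(n: int, k: int) -> list[tuple[str, ...]]:
    """Explicitly list all multisets of k binary length-n strings (tiny n, k only)."""
    strings = ["".join(bits) for bits in product("01", repeat=n)]
    # combinations_with_replacement yields each multiset exactly once,
    # as a sorted tuple of strings.
    return list(combinations_with_replacement(strings, k))


def multi_strand_rate(code_size: int, n: int, k: int) -> float:
    """Rate of a multi-strand code with `code_size` codewords, per the definition above."""
    return log2(code_size) / log2(num_multisets(n, k))
```

For $n=2$ and $k=3$ , both the formula and the enumeration give 20 multisets, so a code containing all of them has rate 1.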

For a multiset $\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{X}_{n,k}$ , its $(L_{\rm min},L_{\rm over})$ -trace is a (multiset) union $\mathcal{Y}=\bigcup_{i=0}^{k-1}\mathcal{Y}_{i}$ , where each $\mathcal{Y}_{i}$ is an $(L_{\rm min},L_{\rm over})$ -trace of $\mathbf{x}_{i}$ . A code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is referred to as a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code if every codeword can be reconstructed from its $(L_{\rm min},L_{\rm over})$ -trace. Two classes of multi-strand trace reconstruction codes whose rates asymptotically attain the upper bound have been constructed in [25, 2], for $L_{\rm over}=0$ or $L_{\rm over}=L_{\rm min}-1$ , respectively.

Theorem 9 ([2, Theorem 12]).

Suppose that $\log k=o(n)$ and $L_{\rm min}=a\log(nk)$ with $a>1$ . Then there is a class of multi-strand $(L_{\rm min},0)$ -trace reconstruction codes of rate $1-1/a-o(1)$ .

Theorem 10 ([25, Corollary 23]).

Suppose that $\limsup_{n\to\infty}\log k/n<1$ and $L_{\rm min}\geqslant\log(nk)+3\log\log(nk)+12$ . Then there is a class of multi-strand $(L_{\rm min},L_{\rm min}-1)$ -trace reconstruction codes of rate $1-o(1)$ .

In this paper, we will also study the problem of reconstructing multiple strands from their noisy traces. For a multiset $\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{X}_{n,k}$ , its $(L_{\rm min},L_{\rm over},e)$ -erroneous trace is a (multiset) union $\mathcal{Y}=\bigcup_{i=0}^{k-1}\mathcal{Y}_{i}$ , where each $\mathcal{Y}_{i}$ is an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{x}_{i}$ . We aim to reconstruct $\mathcal{S}$ from its $(L_{\rm min},L_{\rm over},e)$ -erroneous trace. If for any $(L_{\rm min},L_{\rm over},e)$ -erroneous trace $\mathcal{Y}$ of $\mathcal{S}$ and any $\mathbf{y}\in\mathcal{Y}$ , it is possible to determine the index $i$ such that $\mathbf{y}\in\mathcal{Y}_{i}$ as well as the location of $\mathbf{y}$ in $\mathbf{x}_{i}$ , then we say $\mathcal{S}$ is $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible. A code $\mathcal{C}\subseteq\mathcal{X}_{n,k}$ is called a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code if each of its codewords is $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible.

Following the research in [25], we assume that $\limsup_{n\to\infty}\log k/n<1$ , which is of great interest in applications. In Section V, we shall present some upper bounds on the multi-strand trace code rate and propose some codes whose rates asymptotically attain these bounds. Our results are summarized in Table II and Table III. Among others, when $\log k=\kappa n$ with $0<\kappa<1$ , we obtain a class of multi-strand $(L_{\rm min},0,e)$ -trace reconstruction codes of rate $\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}-o(1)$ , where $L^{*}\equiv n\pmod{L_{\rm min}}$ . Note that $L^{*}\in[L_{\rm min}]$ and $L_{\rm min}=a\log(nk)=\Theta(n)$ . The term $\frac{L^{*}}{n}$ could be a non-vanishing number, depending on the congruence class of $n$ modulo $L_{\rm min}$ . In contrast, when $\log k=o(n)$ , the rate of the multi-strand $(L_{\rm min},0)$ -trace reconstruction codes in [2, Theorem 12] is $1-1/a-o(1)$ , which is the same as that of single-strand reconstruction codes.

TABLE II: Lower and upper bounds on the code rate of multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes of $\mathcal{X}_{n,k}$ , where $\log k=o(n)$

Parameter regimes	Lower bound	Ref.	Upper bound	Ref.
$L_{\rm over}=\log(nk)+(24e+13)\log\log(nk)+O(1)$	$1-o(1)$	Theorem 31	$1$	
$L_{\rm min}=\lceil a\log(nk)\rceil$ , $L_{\rm over}=\lceil\gamma\log(nk)\rceil$ , where $a>1$ and $0\leqslant a\gamma\leqslant 1$	$\frac{1-1/a}{1-\gamma}-o(1)$	Theorem 37	$\frac{1-1/a}{1-\gamma}+o(1)$	Lemma 33
$L_{\rm min}\leqslant\log(nk)+o(\log(nk))$			$o(1)$	Corollary 34

TABLE III: Lower and upper bounds on the code rate of multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes of $\mathcal{X}_{n,k}$ , where $\log k=\kappa n$ and $L^{*}=(n-L_{\rm over})\bmod(L_{\rm min}-L_{\rm over})$

Parameter regimes	Lower bound	Ref.	Upper bound	Ref.
$L_{\rm over}=\log(nk)+(24e+13)\log\log(nk)+O(1)$	$1-o(1)$	Theorem 31	$1$	
$L_{\rm min}=\lceil a\log(nk)\rceil$ , $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$ , where $a>1$ and $0\leqslant a\gamma\leqslant 1$	$\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen-o(1)$	Theorem 37	$\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\frac{L^{*}}{n}+o(1)$	Lemma 33
$L_{\rm over}=0$ , $L_{\rm min}=\lceil a\log(nk)\rceil$ , $a>1$ , and $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$	$\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}-o(1)$	Theorem 39	$\frac{1-1/a}{1-\kappa}+\frac{L^{*}}{a(1-\kappa)n}+o(1)$	Lemma 33
$L_{\rm min}=\log(nk)+o(\log(nk))$ and $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$			$o(1)$	Lemma 36
$L_{\rm min}=\lceil a\log(nk)\rceil$ with $a<1$			$o(1)$	Lemma 35

II-D Robust positioning sequences

An $(L,d)$ -substring distant sequence $\mathbf{x}$ is also known as an $(L,d)$ -robust positioning sequence, since the contents of any length- $L$ substring determine the substring’s position in $\mathbf{x}$ , even if they are corrupted by at most $\lfloor(d-1)/2\rfloor$ errors. In the context of robust positioning sequences, given $L$ and $d$ , it is of interest to construct a (single) long $(L,d)$ -robust positioning sequence with an efficient locating algorithm. This problem, as well as its 2-dimensional extension, has been discussed in [5, 4, 6, 7, 24]. Among others, Chee et al. [6] constructed a class of $(L,d)$ -robust positioning sequences of length ${2^{L}}/(cL^{3d+6.5})$ for some constant number $c>0$ . Their construction was refined in [24] to obtain sequences of length ${2^{L}}/(cL^{\lceil(d-1)/2\rceil+8})$ , which is nearly optimal. The constructions in [6, 24] require the following notions.

Theorem 11 ( $d$ -Auto-Cyclic Sequences [14]).

Let $\ell=d\lceil\log d\rceil+2d$ . Set $\mathbf{u}$ to be the sequence

\mathbf{u}=1^{d}\circ\mathbf{u}_{0}\circ\mathbf{u}_{1}\circ\cdots\circ\mathbf{u}_{\lceil\log d\rceil},\mbox{ where }\mathbf{u}_{i}=((1^{2^{i}}\circ 0^{2^{i}})^{d})[0,d-1].

Then for all $1\leqslant i\leqslant d$ , we have that

d_{H}(\mathbf{u},0^{i}\circ\mathbf{u}[0,{\ell-i-1}])\geqslant d,

and $\mathbf{u}$ is called a $d$ -auto-cyclic sequence.
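The sequence $\mathbf{u}$ of Theorem 11 is easy to build explicitly, and its shift property can be checked directly for small $d$ . The following is a minimal sketch that transcribes the formula above; the function names are chosen here for illustration, and $\lceil\log d\rceil$ is computed as `(d - 1).bit_length()`.

```python
def auto_cyclic(d: int) -> str:
    """Build u = 1^d . u_0 . u_1 ... u_{ceil(log d)} as in Theorem 11."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    levels = (d - 1).bit_length()  # equals ceil(log2(d)) for d >= 1
    blocks = ["1" * d]
    for i in range(levels + 1):
        period = "1" * (2**i) + "0" * (2**i)
        # u_i is the length-d prefix of (1^{2^i} 0^{2^i})^d.
        blocks.append((period * d)[:d])
    return "".join(blocks)


def min_shift_distance(u: str, d: int) -> int:
    """Minimum over 1 <= i <= d of d_H(u, 0^i . u[0, len(u) - i - 1])."""
    return min(
        sum(a != b for a, b in zip(u, "0" * i + u[: len(u) - i]))
        for i in range(1, d + 1)
    )
```

For $d=2$ this gives $\mathbf{u}=111011$ , whose shifts by 1 and 2 positions are at Hamming distances 3 and 4 from $\mathbf{u}$ ; for $d=3$ it gives $\mathbf{u}=111101110111$ with minimum shift distance 5.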

Definition 12.

Let $n,L,d$ be positive integers such that $d<L<n$ . We say a sequence $\mathbf{x}\in\Sigma^{n}$ satisfies the $(L,d)$ -window weight limited (WWL) constraint, and is called an $(L,d)$ -WWL sequence, if $\operatorname{wt}_{H}(\mathbf{x}_{i+[L]})\geqslant d$ for any $i\in[n-L+1]$ .
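The WWL constraint of Definition 12 is a sliding-window weight test. A minimal sketch of a checker is given below; the function name `is_wwl` is chosen here for illustration, and sequences are again represented as strings over "0" and "1".

```python
def is_wwl(x: str, L: int, d: int) -> bool:
    """Check the (L, d)-WWL constraint: every length-L window has weight >= d."""
    return all(x[i:i + L].count("1") >= d for i in range(len(x) - L + 1))
```

For example, "0110110" is $(3,2)$ -WWL, whereas "0110010" is not, because its window "100" has weight 1.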

Proposition 13 ([6, Construction 1 and Theorem 3.7]).

Given $L$ and $d$ , choose $K$ such that $\ell<K$ and $K+\ell<L$ , where $\ell=d\lceil\log d\rceil+2d$ . Let $\mathbf{u}$ be a $d$ -auto-cyclic vector of length $\ell$ from Theorem 11 and set $L_{p}=K+\ell$ . Let $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ be a collection of length- $(L-L_{p})$ binary vectors satisfying the following conditions:

(P1)

$\mathbf{s}_{i}$ is a $(K,d)$ -WWL vector for $i\in[M]$ ;
(P2)

$\mathbf{s}_{i+1}[0,j-1]\circ\mathbf{s}_{i}[j,L-L_{p}-1]$ is a $(K,d)$ -WWL vector for $i\in[M-1]$ and $j\in[L-L_{p}-1]$ ; and
(P3)

the concatenation $\mathbf{s}_{0}\circ\mathbf{s}_{1}\circ\mathbf{s}_{2}\circ\cdots\circ\mathbf{s}_{M-1}$ is an $(L-L_{p},d)$ -modular robust positioning sequence, that is, a sequence $\mathbf{w}$ such that $d_{H}(\mathbf{w}_{i+[L-L_{p}]},\mathbf{w}_{j+[L-L_{p}]})\geqslant d$ for any $i\equiv j\pmod{L-L_{p}}$ and $i\neq j$ .

Then the sequence

\mathbf{s}\triangleq 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K}\circ\mathbf{u}\circ\mathbf{s}_{M-1}

is an $(L,d)$ -robust positioning (substring distant) sequence.

Theorem 14 ([6, Construction 1A and Corollary 3.12]).

Given $d$ and $L$ , set $K=3\lceil(3\log L)/2\rceil=\frac{9}{2}\log L+O(1)$ . There is an explicit construction of sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ of length $L-K-\ell$ , where $\log M=L-3d\log L-7.5\log L-O(1)$ , such that the conditions (P1)–(P3) in Proposition 13 are satisfied.

Remark.

We note that for each $i\in[M]$ , the concatenation $0^{K}\circ\mathbf{u}\circ\mathbf{s}_{i}$ is an $(L_{p},d)$ -WWL sequence, since the length- $d$ prefix of $\mathbf{u}$ is $1^{d}$ and $\mathbf{s}_{i}$ is $(K,d)$ -WWL.

III Encoding of $(a\log n,d)$ -Substring Distant Sequences for $a>1$

In this section we shall present an encoding method which can generate a set of $(a\log n,d)$ -SD sequences of length $n$ (with $a>1$ , a real number) whose rate asymptotically approaches $1$ . We shall, in fact, construct $(L,d)$ -SD sequences with $L=\log n+(6d+7)\log\log n+O(1)$ , but using the remark following Definition 1, we shall find it more convenient to denote these sequences as $(a\log n,d)$ -SD.

We first require some notations. For a sequence $\mathbf{w}\in\Sigma^{n}$ , we say that $(i,j)$ (where $0\leqslant i<j\leqslant n-L$ ) is an $(L,\rho)$ -close window pair in $\mathbf{w}$ if $d_{H}(\mathbf{w}_{i+[L]},\mathbf{w}_{j+[L]})\leqslant\rho$ . Moreover, $(i,j)$ is called primal, if for any other $(L,\rho)$ -close window pair $(i^{\prime},j^{\prime})$ in $\mathbf{w}$ we have $j\leqslant j^{\prime}$ . Let $\mathbf{x},\mathbf{x}^{\prime}\in\Sigma^{L}$ be two sequences with $d_{H}(\mathbf{x},\mathbf{x}^{\prime})\leqslant\rho$ for some integer $\rho\leqslant L$ . Let $p_{1},p_{2},\ldots,p_{d_{H}(\mathbf{x},\mathbf{x}^{\prime})}$ denote the indices of the entries where $\mathbf{x}$ and $\mathbf{x}^{\prime}$ do not agree. For every ${1\leqslant i\leqslant\rho}$ let

\mathbf{b}_{i}=\begin{cases}b(p_{i})&\text{if $i\leqslant d_{H}(\mathbf{x},\mathbf{x}^{\prime})$},\\ 0^{{\lceil\log(L+1)\rceil}}&\text{otherwise},\end{cases}

(1)

where $b(i)$ is the binary representation of $i$ with ${\lceil\log(L+1)\rceil}$ symbols. Let

\operatorname{EncDist}_{L,\rho}(\mathbf{x},\mathbf{x}^{\prime})\triangleq\mathbf{b}_{1}\circ\mathbf{b}_{2}\circ\cdots\circ\mathbf{b}_{\rho}.

Then $\operatorname{EncDist}_{L,\rho}(\mathbf{x},\mathbf{x}^{\prime})$ encodes the difference between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ , and its length is $\rho{\lceil\log(L+1)\rceil}$ .
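The notions of a primal close window pair and of $\operatorname{EncDist}$ can be sketched as follows. Two points are assumptions made for the sketch rather than statements from the text: the mismatch indices $p_{i}$ are taken to be 1-based, so that the all-zero block is unambiguous padding, and among close pairs with the same smallest $j$ the one with the smallest $i$ is returned. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def primal_close_pair(w: str, L: int, rho: int):
    """Return an (L, rho)-close window pair (i, j) with the smallest j, or None."""
    for j in range(1, len(w) - L + 1):
        for i in range(j):
            if hamming(w[i:i + L], w[j:j + L]) <= rho:
                return (i, j)  # tie-break on i is an assumption of this sketch
    return None


def enc_dist(x: str, xp: str, rho: int) -> str:
    """Encode the mismatch positions of x and xp into rho blocks of ceil(log(L+1)) bits."""
    L = len(x)
    width = L.bit_length()  # equals ceil(log2(L + 1))
    diffs = [p + 1 for p in range(L) if x[p] != xp[p]]  # 1-based (assumption)
    assert len(diffs) <= rho
    blocks = [format(p, "0{}b".format(width)) for p in diffs]
    blocks += ["0" * width] * (rho - len(diffs))  # zero padding
    return "".join(blocks)


def dec_dist(x: str, code: str, rho: int) -> str:
    """Recover xp from x and enc_dist(x, xp, rho) over the binary alphabet."""
    width = len(x).bit_length()
    out = list(x)
    for t in range(rho):
        p = int(code[t * width:(t + 1) * width], 2)
        if p:  # a zero block is padding
            out[p - 1] = "1" if out[p - 1] == "0" else "0"
    return "".join(out)


if __name__ == "__main__":
    print(primal_close_pair("0101101011", 4, 0))
    print(enc_dist("10110", "10011", 3))
```

The output of `enc_dist` always has length $\rho\lceil\log(L+1)\rceil$ , matching the length stated above.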

Given a fixed $d$ and a sufficiently large $n$ , we are going to present an encoding algorithm which can encode $(L,d)$ -SD sequences of length $n$ . Set

	$\displaystyle L_{1}$	$\displaystyle\triangleq\lceil\log n\rceil+(2d-1)\lceil\log\lceil\log n\rceil\rceil+6d+\lceil\log(d+1)\rceil,$
	$\displaystyle K_{1}$	$\displaystyle\triangleq d\lceil\log\lceil\log n\rceil\rceil+d,$
	$\displaystyle L_{2}$	$\displaystyle\triangleq\lceil\log n\rceil+(3d+7)\lceil\log\lceil\log n\rceil\rceil,$
	$\displaystyle K_{2}$	$\displaystyle\triangleq 3\left\lceil\frac{3}{2}\log L_{2}\right\rceil,$
	$\displaystyle K_{\max}$	$\displaystyle\triangleq\max\{K_{1},K_{2}\}.$

Additionally, set

	$\displaystyle\ell$	$\displaystyle\triangleq d\lceil\log d\rceil+2d,$
	$\displaystyle L$	$\displaystyle\triangleq\max\{L_{1}+K_{2}+K_{\max}+\ell,L_{2}+2K_{1}+K_{\max}+\ell\}.$

Assume that $d$ is fixed and $n$ is sufficiently large. Then $L=L_{2}+2K_{1}+K_{\max}+\ell$ , and $K_{1}>K_{2}$ if and only if $d\geqslant 5$ . Note that

K_{2}=3\left\lceil 1.5\log L_{2}\right\rceil\leqslant 3\left\lceil 1.5\log\left\lceil\log n\right\rceil+1.5\right\rceil\leqslant 4.5\left\lceil\log\left\lceil\log n\right\rceil\right\rceil+7.5.

Thus, we have that

L\begin{cases}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d&\text{if $d\geqslant 5$},\\ \leqslant\lceil\log n\rceil+(5d+11.5)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+4d+7.5&\text{otherwise.}\end{cases}
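The parameter definitions above can be evaluated numerically to see when the asymptotic claims take effect. The sketch below assumes base-2 logarithms throughout; the function name `sd_parameters` and the dictionary keys are illustrative only.

```python
import math


def ceil_log2(m: int) -> int:
    """Return ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def sd_parameters(n: int, d: int) -> dict:
    """Evaluate L_1, K_1, L_2, K_2, K_max, ell and L for given n and d."""
    log_n = ceil_log2(n)
    ll = ceil_log2(log_n)  # ceil(log ceil(log n))
    L1 = log_n + (2 * d - 1) * ll + 6 * d + ceil_log2(d + 1)
    K1 = d * ll + d
    L2 = log_n + (3 * d + 7) * ll
    K2 = 3 * math.ceil(1.5 * math.log2(L2))
    Kmax = max(K1, K2)
    ell = d * ceil_log2(d) + 2 * d
    L = max(L1 + K2 + Kmax + ell, L2 + 2 * K1 + Kmax + ell)
    return {"log_n": log_n, "ll": ll, "L1": L1, "K1": K1, "L2": L2,
            "K2": K2, "Kmax": Kmax, "ell": ell, "L": L}


if __name__ == "__main__":
    print(sd_parameters(2 ** 20, 5))    # K_2 still exceeds K_1 here
    print(sd_parameters(2 ** 1024, 5))  # K_1 > K_2, closed form for d >= 5 applies
```

For $d=5$ and $n=2^{20}$ the sketch gives $K_{2}=33>K_{1}=30$ , so "sufficiently large $n$ " is not yet reached, while for $n=2^{1024}$ it gives $K_{1}=55>K_{2}=48$ and $L=1434$ , which agrees with the closed form for $d\geqslant 5$ .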

Our encoder resembles the encoding algorithms in [10, 9] and consists of the following three parts:

1.

We first use the encoder presented in [14] to encode a message sequence $\mathbf{m}\in\Sigma^{n^{\prime}}$ into a $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL sequence $\mathbf{w}$ of length $n-K_{1}-K_{2}$ . According to [14, Corollary 20], this encoder, denoted by $\mathcal{E}_{1}$ , requires approximately $2d\cdot 2^{\mathcal{F}(n-K_{1}-K_{2},d)-d\lceil\log\lceil\log n\rceil\rceil}$ redundancy symbols, where

\mathcal{F}(n,d)=\lceil\log n\rceil+(d-1)(\lceil\log\lceil\log n\rceil\rceil+C)+2

for some constant $C$ . Hence,

n^{\prime}=n-K_{1}-K_{2}-2d\cdot 2^{\mathcal{F}(n-K_{1}-K_{2},d)-d\lceil\log\lceil\log n\rceil\rceil}=n-K_{1}-K_{2}-\Theta(n/\log n).

(2)

2.

Then we encode the $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL sequence $\mathbf{w}$ into an $(L_{1},d)$ -SD sequence $\bar{\mathbf{w}}$ by eliminating the pairs of substrings of small distance and attaching some information about their positions and difference. This encoder, denoted by $\mathcal{E}_{2}$ , is presented in Algorithm 1, and it can additionally guarantee the output sequence is $(K_{1},d)$ -WWL.

3.

As an output of Algorithm 1, the sequence $\bar{\mathbf{w}}$ is usually shorter than the sequence $\mathbf{w}$ . Thus, we need an expansion step to increase the sequence length while keeping the substring-distant property. Let $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ be a collection of $(K_{2},d)$ -WWL sequences of length $L_{2}-L_{p}$ as in Theorem 14. Set

\bar{\mathbf{s}}\triangleq 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{M-1},

where $\mathbf{u}$ is the $d$ -auto-cyclic vector of length $\ell$ from Theorem 11. Finally, let

\hat{\mathbf{w}}\triangleq\mathcal{E}_{3}(\bar{\mathbf{w}})\triangleq(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1].

We shall show $\hat{\mathbf{w}}$ is the required $(L,d)$ -SD sequence of length $n$ .

We first describe the encoding presented in Algorithm 1. This procedure encodes a $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL sequence $\mathbf{w}$ into a sequence $\bar{\mathbf{w}}$ that is simultaneously $(L_{1},d)$ -SD and $(K_{1},d)$ -WWL. Initialize $\bar{\mathbf{w}}=\mathbf{w}$ . We observe that since $\mathbf{w}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL and $K_{1}\geqslant d\lceil\log\lceil\log n\rceil\rceil$ , the sequence $\mathbf{w}$ is also $(K_{1},d)$ -WWL. If there are no $(L_{1},d-1)$ -close window pairs in $\bar{\mathbf{w}}$ , then the algorithm returns $\bar{\mathbf{w}}$ as the output.

Otherwise, we choose a primal $(L_{1},d-1)$ -close window pair, say $(i,j)$ . We replace the substring $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence

1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\circ 1^{d},

(3)

where $B:[n]\longrightarrow\Sigma^{\lceil\log n\rceil+d}$ is the encoding function in [14, Algorithm 2], which can encode integers in $[n]$ into $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL sequences in $O(n)$ time. We note that this sequence is $(K_{1},d)$ -WWL and contains the information about the position $i$ and the difference between $\bar{\mathbf{w}}_{i+[L_{1}]}$ and $\bar{\mathbf{w}}_{j+[L_{1}]}$ . Moreover, the substring $0^{d\lceil\log\lceil\log n\rceil\rceil}$ serves as a marker which indicates the position $j$ of the removed substring $\bar{\mathbf{w}}_{j+[L_{1}]}$ .

We shall repeat this procedure until there are no $(L_{1},d-1)$ -close window pairs in $\bar{\mathbf{w}}$ . To ensure that $\mathbf{w}$ can be recovered from the output of the algorithm, however, some additional care is needed. We note that in [10] the inserted sequences always start with a marker $0^{2\log\log n}$ and end with a symbol ‘ $1$ ’. This pattern, together with the rule that only primal pairs can be chosen and replaced, guarantees that after each replacement the latest inserted substring always starts with the rightmost $0^{2\log\log n}$ in $\bar{\mathbf{w}}$ . Due to this property, there is a decoding algorithm which can recover $\mathbf{w}$ from $\bar{\mathbf{w}}$ : Let $\bar{\mathbf{w}}^{(k)}$ denote the sequence $\bar{\mathbf{w}}$ after the $k$ -th replacement. One can search for the rightmost $0^{2\log\log n}$ in $\bar{\mathbf{w}}^{(k)}$ to find the position $j$ of the inserted substring in the $k$ -th replacement. By replacing the inserted substring with the removed substring, one can recover $\bar{\mathbf{w}}^{(k-1)}$ from $\bar{\mathbf{w}}^{(k)}$ . Doing this iteratively, one can eventually recover $\mathbf{w}$ from $\bar{\mathbf{w}}$ .

In our encoding, the inserted substring should always contain $1^{d}$ as both prefix and suffix to maintain the property of being $(K_{1},d)$ -WWL. We have to modify the substring $0^{\lceil\log(d+1)\rceil}$ in (3) to ensure the latest inserted substring always starts with the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}$ . Let $j_{p}$ and $j$ be the positions of the removed substrings in the previous replacement and in the current replacement, respectively. Since we only choose the primal pairs, necessarily, $j>j_{p}-L_{1}$ . If $j>j_{p}-L_{1}+d$ , then we still replace the substring $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence in (3), since the marker $0^{d\lceil\log\lceil\log n\rceil\rceil}$ which is inserted in the previous replacement will be destroyed by the suffix $1^{d}$ of this inserted sequence. If $j_{p}-L_{1}<j\leqslant j_{p}-L_{1}+d$ , we first set $\bar{\mathbf{w}}[j_{p}+d]$ to be ‘1’ to destroy the previous marker $0^{d\lceil\log\lceil\log n\rceil\rceil}$ . Then we replace $\bar{\mathbf{w}}_{j+[L_{1}]}$ with the sequence

1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ b(j-j_{p}+L_{1})\circ 1^{d},

(4)

where $b(j-j_{p}+L_{1})$ is the binary encoding of $j-j_{p}+L_{1}$ with $\lceil\log(d+1)\rceil$ symbols, since $1\leqslant j-j_{p}+L_{1}\leqslant d$ .

Note that the substring $B(i)$ and the substring $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})$ have length $\lceil\log n\rceil+d$ and length at most $(d-1)(\lceil\log\lceil\log n\rceil\rceil+1)$ , respectively. It follows that in the loop we replace substrings of length $L_{1}$ with substrings of length at most

	$\displaystyle 4d+d\lceil\log\lceil\log n\rceil\rceil+(\lceil\log n\rceil+d)+(d-1)\lceil\log(L_{1}+1)\rceil+\lceil\log(d+1)\rceil$
	$\displaystyle\leqslant 4d+d\lceil\log\lceil\log n\rceil\rceil+(\lceil\log n\rceil+d)+(d-1)(\lceil\log\lceil\log n\rceil\rceil+1)+\lceil\log(d+1)\rceil$
	$\displaystyle=L_{1}-1,$

where the first inequality is obtained by noting that for all sufficiently large $n$ we have $L_{1}+1\leqslant 2\lceil\log n\rceil$ . Hence, the loop will execute at most $\lvert\mathbf{w}\rvert-L_{1}+1$ times and the algorithm will terminate eventually.
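The length bound above can be verified by adding up the fields of the inserted sequence in (3). The following sketch assumes base-2 logarithms; `inserted_length` is a hypothetical helper that returns both the exact length of the inserted sequence and $L_{1}$ .

```python
def ceil_log2(m: int) -> int:
    """Return ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def inserted_length(n: int, d: int):
    """Return (length of the inserted sequence, L_1) for given n and d."""
    log_n = ceil_log2(n)
    ll = ceil_log2(log_n)
    L1 = log_n + (2 * d - 1) * ll + 6 * d + ceil_log2(d + 1)
    marker = d + d * ll + d              # 1^d . 0^{d ll} . 1^d
    index = log_n + d                    # B(i)
    separator = d                        # 1^d
    diff = (d - 1) * ceil_log2(L1 + 1)   # EncDist_{L_1, d-1}
    tail = ceil_log2(d + 1) + d          # 0^{ceil(log(d+1))} or b(.), then 1^d
    return marker + index + separator + diff + tail, L1


if __name__ == "__main__":
    print(inserted_length(2 ** 1024, 5))  # length equals L_1 - 1
    print(inserted_length(2 ** 20, 5))    # n too small: length exceeds L_1
```

For $d=5$ the bound $L_{1}+1\leqslant 2\lceil\log n\rceil$ does not yet hold at $n=2^{20}$ , where the inserted sequence would be longer than $L_{1}$ ; at $n=2^{1024}$ the length is exactly $L_{1}-1$ , so each replacement shortens the sequence by one symbol.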

Algorithm 1 Primal Pair Elimination Encoder $\mathcal{E}_{2}$ for Generating $(L_{1},d)$ -SD Sequences

Input: a $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL sequence $\mathbf{w}\in\Sigma^{n-K_{1}-K_{2}}$

Output: a sequence $\bar{\mathbf{w}}\in\Sigma^{\leqslant n-K_{1}-K_{2}}$

Set $\bar{\mathbf{w}}=\mathbf{w}$ and $j_{p}=0$

while there are two length- $L_{1}$ substrings in $\bar{\mathbf{w}}$ whose Hamming distance is at most $d-1$ do

  Suppose $(i,j)$ is a primal $(L_{1},d-1)$ -close window pair in $\bar{\mathbf{w}}$ (then necessarily $j>j_{p}-L_{1}$ )

  if $j>j_{p}-L_{1}+d$ then

    Remove the substring of length $L_{1}$ starting at position $j$ and replace it with the sequence $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\circ 1^{d}$

  else

    Set $\bar{\mathbf{w}}[j_{p}+d]$ to be ‘1’

    Remove the substring of length $L_{1}$ starting at position $j$ and replace it with the sequence $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}\circ 1^{d}\circ B(i)\circ 1^{d}\circ\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ b(j-j_{p}+L_{1})\circ 1^{d}$

  end if

  $j_{p}\leftarrow j$

end while

return $\bar{\mathbf{w}}$

Lemma 15.

The output sequence $\bar{\mathbf{w}}$ is $(K_{1},d)$ -WWL and $(L_{1},d)$ -SD, and the input sequence $\mathbf{w}$ can be recovered from $\bar{\mathbf{w}}$ , for all sufficiently large $n$ .

Proof:

The while loop ensures that the output $\bar{\mathbf{w}}$ of Algorithm 1 is an $(L_{1},d)$ -SD sequence. Moreover, since $\mathbf{w}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL and $K_{1}=d\lceil\log\lceil\log n\rceil\rceil+d$ , one can tediously verify that for all large enough $n$ , $\bar{\mathbf{w}}$ is $(K_{1},d)$ -WWL. In particular, even if $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})$ is all zeros, for all large enough $n$

K_{1}-\left\lvert\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}\right\rvert\geqslant d,

and a substring of length $K_{1}$ containing $\operatorname{EncDist}_{L_{1},d-1}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\circ 0^{\lceil\log(d+1)\rceil}$ must also contain at least $d$ of the surrounding $1$ ’s.

Next, we show after each replacement the latest inserted substring always starts with the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ . Let $\bar{\mathbf{w}}^{(k)}$ be the sequence $\bar{\mathbf{w}}$ after the $k$ -th replacement. We prove this by induction. When $k=1$ , since $\mathbf{w}=\bar{\mathbf{w}}^{(0)}$ is $(d\lceil\log\lceil\log n\rceil\rceil,d)$ -WWL, the marker $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ appears exactly once in $\bar{\mathbf{w}}^{(1)}$ , and so the claim holds. Now, in the $k$ -th replacement, $j$ denotes the position of the substring removed in this replacement, while $j_{p}$ denotes the position of the substring removed in the $(k-1)$ -th replacement. According to the inductive assumption, the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k-1)}$ starts at the position $j_{p}$ . If $j\geqslant j_{p}$ , then the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ is $\bar{\mathbf{w}}^{(k)}_{j+[d\lceil\log\lceil\log n\rceil\rceil+d]}$ . If $j_{p}-L_{1}+d<j<j_{p}$ , the overlap of $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ and $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$ has length greater than $d$ . Since the sequence which is inserted in the $k$ -th replacement ends with a symbol ‘1’, it can destroy the marker in $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$ . If $j_{p}-L_{1}<j\leqslant j_{p}-L_{1}+d$ , we set $\bar{\mathbf{w}}^{(k)}[j_{p}+d]$ to be ‘1’ to destroy the marker in $\bar{\mathbf{w}}^{(k-1)}_{j_{p}+[L_{1}]}$ . In all cases, the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ is always $\bar{\mathbf{w}}^{(k)}_{j+[d\lceil\log\lceil\log n\rceil\rceil+d]}$ .

Now, given the sequence $\bar{\mathbf{w}}^{(k)}$ , we first search for the rightmost $1^{d}\circ 0^{d\lceil\log\lceil\log n\rceil\rceil}$ in $\bar{\mathbf{w}}^{(k)}$ to determine the position $j$ . Then from the substring $\bar{\mathbf{w}}^{(k)}_{j+[L_{1}-1]}$ we can decode $i$ , the difference between $\bar{\mathbf{w}}^{(k-1)}_{i+[L_{1}]}$ and $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ , and $b(j-j_{p}+L_{1})$ . Note that $\bar{\mathbf{w}}^{(k-1)}_{i+[\min\{L_{1},j-i\}]}=\bar{\mathbf{w}}^{(k)}_{i+[\min\{L_{1},j-i\}]}$ . So we can recover $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ , proceeding symbol by symbol from left to right when the two windows overlap. We remove $\bar{\mathbf{w}}^{(k)}_{j+[L_{1}-1]}$ from $\bar{\mathbf{w}}^{(k)}$ and replace it with $\bar{\mathbf{w}}^{(k-1)}_{j+[L_{1}]}$ . If $b(j-j_{p}+L_{1})\neq 0^{\lceil\log(d+1)\rceil}$ , we further set the symbol in the position $j_{p}+d$ to be ‘0’. In this way, we recover the sequence $\bar{\mathbf{w}}^{(k-1)}$ . We repeat this procedure until there is no substring $0^{d\lceil\log\lceil\log n\rceil\rceil}$ . Then the resulting sequence is the required $\mathbf{w}$ . ∎
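The first step of the decoding procedure in the proof, locating the latest inserted substring and splitting it into its fields, can be sketched as follows. The sketch only parses the layout of (3) and (4); it does not invert $B(\cdot)$ , whose definition is in [14, Algorithm 2] and is not reproduced here. The function name and the toy field contents in the example are hypothetical, and the toy parameters are far too small for the length bounds of the text to apply.

```python
def locate_last_insertion(w_bar: str, d: int, ll: int, log_n: int, L1: int):
    """Find the rightmost marker 1^d 0^{d*ll} and split the inserted sequence.

    ll stands for ceil(log ceil(log n)) and log_n for ceil(log n).
    Returns (j, B-field, EncDist-field, tail-field) or None if no marker exists.
    """
    marker = "1" * d + "0" * (d * ll)
    j = w_bar.rfind(marker)
    if j < 0:
        return None
    diff_width = L1.bit_length()   # ceil(log2(L1 + 1)) bits per mismatch index
    tail_width = d.bit_length()    # ceil(log2(d + 1))
    pos = j + len(marker) + d      # skip the marker and the following 1^d
    index_field = w_bar[pos:pos + log_n + d]
    pos += log_n + d + d           # skip B(i) and the separator 1^d
    diff_field = w_bar[pos:pos + (d - 1) * diff_width]
    pos += (d - 1) * diff_width
    tail_field = w_bar[pos:pos + tail_width]
    return j, index_field, diff_field, tail_field


if __name__ == "__main__":
    # Toy inserted sequence with made-up field contents (d = 2, ll = 2, log_n = 4).
    inserted = "11" + "0000" + "11" + "101101" + "11" + "00011" + "00" + "11"
    print(locate_last_insertion("1011" + inserted + "0110", 2, 2, 4, 24))
```

A full decoder would then decode $i$ from the first field, rebuild the removed window from the second field, and use the third field to decide whether the symbol at position $j_{p}+d$ must be reset to ‘0’.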

Now, we need to extend the sequence $\bar{\mathbf{w}}$ to a long sequence of length $n$ while keeping the property of being $(L,d)$ -SD.

Lemma 16.

Assume $n$ is sufficiently large. Let $\bar{\mathbf{w}}$ be an output of Algorithm 1. Recall that $K_{2}=3\lceil\frac{3}{2}\log L_{2}\rceil$ . By invoking Theorem 14 with parameters “ $K=K_{2}$ ” and “ $L=L_{2}$ ”, we get a collection of $(K_{2},d)$ -WWL sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ of length $L_{2}-L_{p}$ , where $L_{p}=K_{2}+d\lceil\log d\rceil+2d$ . Let

\bar{\mathbf{s}}\triangleq 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{0}\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{1}\circ\cdots\circ 0^{K_{\max}}\circ\mathbf{u}\circ\mathbf{s}_{M-1},

where $K_{\max}=\max\{K_{1},K_{2}\}$ . Set

\hat{\mathbf{w}}=\mathcal{E}_{3}(\bar{\mathbf{w}})\triangleq(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1].

Then $\hat{\mathbf{w}}$ is a $(K,d)$ -WWL and $(L,d)$ -SD sequence where $K=2(K_{1}+K_{2})$ and $L=\max\{L_{1}+K_{2}+K_{\max}+\ell,L_{2}+2K_{1}+K_{\max}+\ell\}$ . Moreover, $\bar{\mathbf{w}}$ can be recovered from $\hat{\mathbf{w}}$ .

Proof:

We first prove that $\bar{\mathbf{s}}$ is a $(K_{\max}+K_{2},d)$ -WWL and $(L_{2}+K_{\max}-K_{2},d)$ -SD sequence of length at least $n$ . According to the construction, the length of $\bar{\mathbf{s}}$ is $M(L_{2}+K_{\max}-K_{2})\geqslant ML_{2}$ . Recall that $\log M=L_{2}-3d\log L_{2}-7.5\log L_{2}-O(1)$ and $L_{2}=\lceil\log n\rceil+(3d+7)\lceil\log\lceil\log n\rceil\rceil$ . Then

\displaystyle ML_{2}=2^{L_{2}-3d\log L_{2}-6.5\log L_{2}-O(1)}=\frac{2^{L_{2}}}{2^{O(1)}L_{2}^{3d+6.5}}\geqslant\frac{n(\log n)^{3d+7}}{2^{O(1)}(\log n+(3d+6.5)\log\log n)^{3d+6.5}}>n.

(5)

Hence, $\bar{\mathbf{s}}$ has length at least $n$ . Note that each $\mathbf{s}_{i}$ is a $(K_{2},d)$ -WWL sequence and the length- $d$ prefix of $\mathbf{u}$ is $1^{d}$ . It follows that $\bar{\mathbf{s}}$ is a $(K_{\max}+K_{2},d)$ -WWL sequence. Moreover, note that the sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ satisfy the conditions (P1)-(P3) with “ $K=K_{2}$ ”. If $K_{2}\geqslant K_{1}$ (namely, $K_{\max}=K_{2}$ ), then by Proposition 13, the sequence $\bar{\mathbf{s}}$ is an $(L_{2},d)$ -SD sequence, hence also an $(L_{2}+K_{\max}-K_{2},d)$ -SD sequence. If $K_{2}<K_{1}$ , since the property of being $(K_{2},d)$ -WWL implies the property of being $(K_{\max},d)$ -WWL, the sequences $\mathbf{s}_{0},\mathbf{s}_{1},\ldots,\mathbf{s}_{M-1}$ also satisfy the conditions (P1)-(P3) with “ $K=K_{\max}$ ” (in this case, we take “ $L=L_{2}+K_{\max}-K_{2}$ ”, “ $K=K_{\max}$ ”, “ $L_{p}=K+\ell$ ”, and so, “ $L-L_{p}=L_{2}-K_{2}-\ell$ ”, which is equal to the length of the $\mathbf{s}_{i}$ ’s). Again, by Proposition 13, the sequence $\bar{\mathbf{s}}$ is an $(L_{2}+K_{\max}-K_{2},d)$ -SD sequence.

We have shown that $\bar{\mathbf{s}}$ is a $({K_{\max}}+K_{2},d)$ -WWL sequence in the above paragraph and $\bar{\mathbf{w}}$ is a $(K_{1},d)$ -WWL sequence in Lemma 15. By using the fact that $K_{1}>d$ and that the $\mathbf{u}$ substring of $\bar{\mathbf{s}}$ starts with $1^{d}$ , it follows that the sequence $\hat{\mathbf{w}}=(\bar{\mathbf{w}}\circ 0^{K_{2}}\circ\bar{\mathbf{s}})[0,n-1]$ is $(2(K_{1}+K_{2}),d)$ -WWL. Now, we shall show that it is also $(L,d)$ -SD. For any two substrings $\hat{\mathbf{w}}_{i+[L]}$ and $\hat{\mathbf{w}}_{j+[L]}$ with $i,j\in[n-L+1]$ and $i<j$ , we consider the following cases:

Case 1: $i<j\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$ . Then

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\bar{\mathbf{w}}_{i+[L_{1}]},\bar{\mathbf{w}}_{j+[L_{1}]})\geqslant d,

where the first inequality holds since $L\geqslant L_{1}$ and the second inequality holds since $\bar{\mathbf{w}}$ is an $(L_{1},d)$ -SD sequence.

Case 2: $i\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$ and $\lvert\bar{\mathbf{w}}\rvert-L_{1}+1\leqslant j\leqslant\lvert\bar{\mathbf{w}}\rvert$ . Since $L-L_{1}\geqslant K_{2}+{K_{\max}}+\ell$ , where $\ell$ is the length of $\mathbf{u}$ , then $\hat{\mathbf{w}}_{j+[L]}$ must contain $0^{K_{2}+{K_{\max}}}\circ\mathbf{u}$ as a substring. Assume that $\hat{\mathbf{w}}_{j+\delta+[K_{2}+K_{\max}+\ell]}=0^{K_{2}+{K_{\max}}}\circ\mathbf{u}$ for some $\delta\in[L_{1}]$ . If $j-i\leqslant d$ , then

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+K_{2}+{K_{\max}}+[\ell]},\hat{\mathbf{w}}_{j+\delta+K_{2}+{K_{\max}}+[\ell]})=d_{H}(0^{j-i}\circ\mathbf{u}[0,\ell-(j-i)-1],\mathbf{u})\geqslant d,

where the last inequality follows from the definition of a $d$ -auto-cyclic sequence. If $d<j-i\leqslant K_{2}+{K_{\max}}$ , since the prefix of $\mathbf{u}$ is $1^{d}$ , then

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+K_{2}+{K_{\max}}+[d]},\hat{\mathbf{w}}_{j+\delta+K_{2}+{K_{\max}}+[d]})=d_{H}(0^{d},1^{d})=d.

If $j-i>K_{2}+{K_{\max}}$ , then $i+\delta+K_{2}+{K_{\max}}<j+\delta$ , and so, $\hat{\mathbf{w}}_{i+\delta+[K_{2}+{K_{\max}}]}$ is a substring of $\bar{\mathbf{w}}$ . Hence,

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta+[K_{2}+{K_{\max}}]},\hat{\mathbf{w}}_{j+\delta+[K_{2}+{K_{\max}}]})=d_{H}(\hat{\mathbf{w}}_{i+\delta+[K_{2}+{K_{\max}}]},0^{K_{2}+{K_{\max}}})\geqslant d,

where the last inequality holds since $\bar{\mathbf{w}}$ is a $(K_{1},d)$ -WWL sequence.

Case 3 and Case 4, which now follow, together cover the case of $i\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1}$ and $j>\lvert\bar{\mathbf{w}}\rvert$ and the case of $\lvert\bar{\mathbf{w}}\rvert-L_{1}<i<\lvert\bar{\mathbf{w}}\rvert$ and $i<j$ .

Case 3: $i\leqslant\lvert\bar{\mathbf{w}}\rvert-(L_{2}+2K_{1}-K_{2})$ $(\leqslant\lvert\bar{\mathbf{w}}\rvert-L_{1})$ and $j>\lvert\bar{\mathbf{w}}\rvert$ . Denote $L^{\prime}\triangleq(L_{2}-K_{2})+2K_{1}$ . Then $L\geqslant L^{\prime}$ . Note that $\hat{\mathbf{w}}_{j+[L^{\prime}]}$ always contains $0^{K_{1}}$ as a substring, and $\hat{\mathbf{w}}_{i+[L^{\prime}]}$ is a substring of $\bar{\mathbf{w}}$ , which is $(K_{1},d)$ -WWL. Hence,

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+[L^{\prime}]},\hat{\mathbf{w}}_{j+[L^{\prime}]})\geqslant d.

Case 4: $\lvert\bar{\mathbf{w}}\rvert-(L_{2}+2K_{1}-K_{2})+1\leqslant i<\lvert\bar{\mathbf{w}}\rvert$ and $i<j$ . Since $L\geqslant(L_{2}+2K_{1}-K_{2})+K_{2}+{K_{\max}}+\ell$ , $\hat{\mathbf{w}}_{i+[L]}$ must contain $0^{K_{2}+{K_{\max}}}\circ\mathbf{u}$ as a substring. If $j-i\leqslant K_{2}+{K_{\max}}$ , then $\hat{\mathbf{w}}_{j+[L]}$ must contain $\mathbf{u}$ as a substring, and so, with the same argument as that in Case 2, one can show that $d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d$ . If $j-i>K_{2}+{K_{\max}}$ , assume that $\hat{\mathbf{w}}_{i+\delta^{\prime}+[K_{2}+{K_{\max}}]}$ is the all-zero substring of length $K_{2}+{K_{\max}}$ . Then $j+\delta^{\prime}>i+\delta^{\prime}+K_{2}+{K_{\max}}$ . It follows that $\hat{\mathbf{w}}_{j+\delta^{\prime}+[K_{2}+{K_{\max}}]}$ is a substring of $\bar{\mathbf{s}}$ , which is $(K_{2}+{K_{\max}},d)$ -WWL. Hence,

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+\delta^{\prime}+[K_{2}+{K_{\max}}]},\hat{\mathbf{w}}_{j+\delta^{\prime}+[K_{2}+{K_{\max}}]})\geqslant d.

Case 5: $\lvert\bar{\mathbf{w}}\rvert\leqslant i<j$ . Then

\displaystyle d_{H}(\hat{\mathbf{w}}_{i+[L]},\hat{\mathbf{w}}_{j+[L]})\geqslant d_{H}(\hat{\mathbf{w}}_{i+K_{2}+[L-K_{2}]},\hat{\mathbf{w}}_{j+K_{2}+[L-K_{2}]})=d_{H}(\bar{\mathbf{s}}_{i-\lvert\bar{\mathbf{w}}\rvert+[L-K_{2}]},\bar{\mathbf{s}}_{j-\lvert\bar{\mathbf{w}}\rvert+[L-K_{2}]})\geqslant d,

where the second inequality holds since $L-K_{2}\geqslant L_{2}+{K_{\max}}-K_{2}$ and $\bar{\mathbf{s}}$ is $(L_{2}+{K_{\max}}-K_{2},d)$ -SD.

Finally, note that in the sequence $\hat{\mathbf{w}}$ there is exactly one run of ‘0’ which has length at least $K_{2}+K_{\max}$ . So we can search for the rightmost $0^{K_{2}+K_{\max}}$ in $\hat{\mathbf{w}}$ and remove this substring as well as the suffix after it to recover the sequence $\bar{\mathbf{w}}$ . ∎
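The expansion map $\mathcal{E}_{3}$ and the recovery step described at the end of the proof can be sketched in a few lines. The blocks used in the example are toy placeholders, not outputs of Theorem 14 or Theorem 11, and the function names are hypothetical.

```python
def expand(w_bar: str, s_bar: str, K2: int, n: int) -> str:
    """E_3: append 0^{K2} and s_bar to w_bar, then truncate to length n."""
    return (w_bar + "0" * K2 + s_bar)[:n]


def strip_expansion(w_hat: str, K2: int, Kmax: int) -> str:
    """Recover w_bar by cutting at the rightmost occurrence of 0^{K2 + Kmax}."""
    idx = w_hat.rfind("0" * (K2 + Kmax))
    if idx < 0:
        raise ValueError("no run of K2 + Kmax zeros found")
    return w_hat[:idx]


if __name__ == "__main__":
    K2, Kmax = 3, 4
    u = "1101"  # toy stand-in for the auto-cyclic sequence
    s_bar = "0" * Kmax + u + "1011" + "0" * Kmax + u + "0111"
    w_hat = expand("110100", s_bar, K2, 25)
    print(w_hat, strip_expansion(w_hat, K2, Kmax))
```

Using the rightmost occurrence matters when $\bar{\mathbf{w}}$ itself ends with zeros: the run is then longer than $K_{2}+K_{\max}$ , and the rightmost length- $(K_{2}+K_{\max})$ all-zero window still starts exactly at position $\lvert\bar{\mathbf{w}}\rvert$ .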

Theorem 17.

Let $\mathcal{E}_{\mathtt{SD}}(\cdot)\triangleq\mathcal{E}_{3}(\mathcal{E}_{2}(\mathcal{E}_{1}(\cdot)))$ . Then, for $n$ large enough, $\mathcal{E}_{\mathtt{SD}}:\Sigma^{n^{\prime}}\rightarrow\Sigma^{n}$ is invertible and can encode sequences of $\Sigma^{n^{\prime}}$ into $(K,d)$ -WWL and $(L,d)$ -SD sequences where $K=(2d+9)\log\log n+O(1)$ and

L\begin{cases}=\lceil\log n\rceil+(6d+7)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+5d&\text{if $d\geqslant 5$},\\ \leqslant\lceil\log n\rceil+(5d+11.5)\lceil\log\lceil\log n\rceil\rceil+d\lceil\log d\rceil+4d+7.5&\text{otherwise.}\end{cases}

Moreover, $n-n^{\prime}=\Theta(n/\log n)$ , and so, we have that

\lim_{n\to\infty}\frac{n^{\prime}}{n}=1.

Proof:

The statement about $\mathcal{E}_{\mathtt{SD}}$ follows from Lemma 15 and Lemma 16. Recall that the encoder $\mathcal{E}_{1}$ requires $\Theta(n/\log n)$ redundancy symbols (see (2)) and $K_{1}+K_{2}=\Theta(\log\log n)$ . Hence,

n-n^{\prime}=K_{1}+K_{2}+\Theta(n/\log n)=\Theta(n/\log n).

∎

IV Generalized Reconstruction from Noisy Substring Trace

In this section, we give constructions of $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes. Our first result generalizes Proposition 2 and Proposition 5 and shows that the property of being $(L_{\rm over},d)$ -substring distant implies the property of being $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible.

Proposition 18.

Suppose that $L_{\rm min}>L_{\rm over}$ . If a sequence $\mathbf{x}\in\Sigma^{n}$ is $(L_{\rm over},4e+1)$ -substring distant, then $\mathbf{x}$ is $(L_{\rm min},L_{\rm over},e)$ -trace reconstructible.

Proof:

Let $\mathcal{Y}=\{\mathbf{y}^{(0)},\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(m-1)}\}$ be an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{x}$ where the location of each $\mathbf{y}^{(j)}$ in $\mathbf{x}$ is $i_{j}$ . Since $\mathbf{x}$ is $(L_{\rm over},4e+1)$ -substring distant and each $\mathbf{y}^{(j)}$ contains at most $e$ errors, for any two substrings $\mathbf{y}^{(j)}$ and $\mathbf{y}^{(j^{\prime})}$ and any two of their subsubstrings $\mathbf{y}^{(j)}_{k+[L_{\rm over}]}$ and $\mathbf{y}^{(j^{\prime})}_{k^{\prime}+[L_{\rm over}]}$ , we have that

d_{H}\left\lparen\mathbf{y}^{(j)}_{k+[L_{\rm over}]},\mathbf{y}^{(j^{\prime})}_{k^{\prime}+[L_{\rm over}]}\right\rparen\begin{cases}\geqslant 2e+1&\text{ if $i_{j}+k\neq i_{j^{\prime}}+k^{\prime}$},\\ \leqslant 2e&\text{ if $i_{j}+k=i_{j^{\prime}}+k^{\prime}$}.\\ \end{cases}

Therefore, $\mathbf{y}^{(0)}$ can be identified as the unique substring $\mathbf{y}\in\mathcal{Y}$ whose length- $L_{\rm over}$ prefix is of Hamming distance at least $2e+1$ from every length- $L_{\rm over}$ subsubstring of any other $\mathbf{y}^{\prime}\in\mathcal{Y}\backslash\{\mathbf{y}\}$ . Denote the length- $L_{\rm over}$ suffix of $\mathbf{y}^{(0)}$ as $\mathbf{s}_{0}$ . Then we can identify the substrings $\mathbf{y}$ in $\mathcal{Y}$ which overlap $\mathbf{y}^{(0)}$ in at least $L_{\rm over}$ positions, since each of them contains a unique length- $L_{\rm over}$ subsubstring $\mathbf{w}$ whose distance from $\mathbf{s}_{0}$ is at most $2e$ . Furthermore, the locations of these substrings in $\mathbf{x}$ can be determined by aligning the subsubstring $\mathbf{w}$ and the suffix $\mathbf{s}_{0}$ . Assume that there are $m^{\prime}$ such substrings. Then we have identified the substrings $\mathbf{y}^{(1)},\ldots,\mathbf{y}^{(m^{\prime})}\in\mathcal{Y}$ . Next, we consider the length- $L_{\rm over}$ suffix of $\mathbf{y}^{(m^{\prime})}$ and identify all the substrings in $\mathcal{Y}$ which overlap $\mathbf{y}^{(m^{\prime})}$ in at least $L_{\rm over}$ positions. We repeat the procedure above. Finally, we can determine the location of every substring $\mathbf{y}\in\mathcal{Y}$ in $\mathbf{x}$ . ∎
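The alignment step used in the proof can be illustrated on a toy example. The sketch below compares the length- $L_{\rm over}$ suffix of one read with every length- $L_{\rm over}$ window of another read, using the threshold $2e$ . The sequence $\mathbf{x}$ and the two noisy reads are hand-made for $e=1$ and $L_{\rm over}=10$ ; the function names are hypothetical, and a brute-force check confirms that $\mathbf{x}$ is $(10,5)$ -substring distant.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def is_substring_distant(x: str, L: int, d: int) -> bool:
    """Brute-force check that all length-L windows of x are pairwise >= d apart."""
    wins = [x[i:i + L] for i in range(len(x) - L + 1)]
    return all(hamming(wins[i], wins[j]) >= d
               for i in range(len(wins)) for j in range(i + 1, len(wins)))


def overlap_offsets(y_prev: str, y_next: str, L_over: int, e: int) -> list:
    """Offsets k with d_H(suffix of y_prev, y_next[k:k+L_over]) <= 2e."""
    suffix = y_prev[-L_over:]
    return [k for k in range(len(y_next) - L_over + 1)
            if hamming(suffix, y_next[k:k + L_over]) <= 2 * e]


def relative_start(y_prev: str, y_next: str, L_over: int, e: int):
    """Start of y_next relative to the start of y_prev, or None if not aligned."""
    ks = overlap_offsets(y_prev, y_next, L_over, e)
    if len(ks) != 1:
        return None
    return len(y_prev) - L_over - ks[0]


if __name__ == "__main__":
    x = "110011001100"
    y0 = "11011100110"  # x[0:11] with one substitution
    y1 = "10011000100"  # x[1:12] with one substitution
    print(is_substring_distant(x, 10, 5), relative_start(y0, y1, 10, 1))
```

In this example the correctly aligned window is at distance 2 from the suffix, while the misaligned window is at distance 3, so the threshold $2e=2$ separates them as in the displayed case distinction.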

Combining Theorem 17 and Proposition 18, we have the following result.

Corollary 19.

Suppose that $L_{\rm over}=\lceil\log n\rceil+(24e+13)\lceil\log\lceil\log n\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$ and $L_{\rm min}>L_{\rm over}$ . If $n$ is sufficiently large, then there is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of $\Sigma^{n}$ whose rate is $1-o(1)$ .
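The value of $L_{\rm over}$ in Corollary 19 is the $d\geqslant 5$ case of Theorem 17 with $d=4e+1$ . The short check below, with hypothetical function names and base-2 logarithms assumed, confirms the substitution term by term for a range of $e$ .

```python
def ceil_log2(m: int) -> int:
    """Return ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def L_theorem17(log_n: int, ll: int, d: int) -> int:
    """L from Theorem 17 for d >= 5; ll stands for ceil(log ceil(log n))."""
    return log_n + (6 * d + 7) * ll + d * ceil_log2(d) + 5 * d


def L_over_corollary19(log_n: int, ll: int, e: int) -> int:
    """L_over as written in Corollary 19."""
    return (log_n + (24 * e + 13) * ll
            + (4 * e + 1) * ceil_log2(4 * e + 1) + 20 * e + 5)


if __name__ == "__main__":
    for e in range(1, 4):
        print(e, L_theorem17(1024, 10, 4 * e + 1), L_over_corollary19(1024, 10, e))
```

Since $e\geqslant 1$ gives $d=4e+1\geqslant 5$ , the first case of Theorem 17 is the one that applies.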

Now, we consider another parameter regime. Suppose that

	$\displaystyle L_{\rm min}$	$\displaystyle=\lceil a\log n\rceil,$
	$\displaystyle L_{\rm over}$	$\displaystyle=\lceil\gamma L_{\rm min}\rceil,$

where $a>1$ and $0<a\gamma\leqslant 1$ are real constants. We are going to construct an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code whose rate approaches $\frac{1-1/a}{1-\gamma}$ . The basic idea of our code construction is similar to the one in [16] for the noiseless scenario: A message $\mathbf{m}$ is encoded into a codeword $\mathbf{w}=\mathbf{w}_{0}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{w}_{2^{I}-1}$ such that

(i)

the index $i$ can be decoded from any length- $L_{\rm min}$ substring of $\mathbf{w}_{i}$ even if the substring is corrupted by at most $e$ errors;
(ii)

$\mathbf{w}_{i}$ can be reconstructed from any of its $(L_{\rm min},L_{\rm over},e)$ -erroneous traces.

To this end, our construction leverages the map $\mathcal{E}_{\mathtt{SD}}$ in Section III which can encode WWL and SD sequences, as well as the following coded indices $\mathbf{c}_{i}$ ’s which are generated from a robust positioning sequence.

Construction A (Index Construction).

Given $e$ , let

	$\displaystyle d_{1}$	$\displaystyle\triangleq 2e+1,$
	$\displaystyle d_{2}$	$\displaystyle\triangleq 4e+1.$

Additionally, set

	$\displaystyle I$	$\displaystyle\triangleq\left\lceil\frac{1-\gamma a}{1-\gamma}\log n+(\log n)^{0.5+\epsilon}\right\rceil,$
	$\displaystyle r_{I}$	$\displaystyle\triangleq\left\lceil(3d_{1}+8)\log I\right\rceil,$

where $0<\epsilon<0.5$ is an arbitrary fixed number which is independent of $n$ . Then

(I+r_{I})-(3d_{1}+7.5)\log(I+r_{I})-O(1)=I+0.5\log I-O(1)>I,

where we assume $e,a,\gamma,\epsilon$ are constants, and $n\to\infty$ . Applying Theorem 14 with $L=I+r_{I}$ , there is an explicit construction of sequences $\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}}$ such that the concatenation

\mathbf{c}\triangleq\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}

is an $(I+r_{I},d_{1})$ -SD sequence. Moreover, according to the remark following Theorem 14, each $\mathbf{c}_{i}$ is $(3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}},d_{1})$ -WWL where

\ell_{d_{1}}\triangleq d_{1}\lceil\log d_{1}\rceil+2d_{1}

is the length of the $d_{1}$ -auto-cyclic sequence $\mathbf{u}$ . Denote

	$\displaystyle K$	$\displaystyle\triangleq\left\lceil\sqrt{\log n}\right\rceil,$
	$\displaystyle F$	$\displaystyle\triangleq\left\lceil\frac{I+r_{I}}{K}\right\rceil.$

For each $i\in[2^{I}]$ , we partition the sequence $\mathbf{c}_{i}$ into segments $\mathbf{c}_{i}^{(0)},\mathbf{c}_{i}^{(1)},\ldots,\mathbf{c}_{i}^{(F-1)}$ , each of length $\lceil\frac{I+r_{I}}{F}\rceil$ or $\lfloor\frac{I+r_{I}}{F}\rfloor$ . ∎
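The parameters of the index construction can be evaluated numerically, as in the following minimal sketch. It assumes that $\log$ is taken base 2 (binary alphabet); the function names `ceil_log2` and `index_parameters` are hypothetical and introduced only for illustration. Since the construction is asymptotic, the values obtained for small $n$ only illustrate the formulas and need not satisfy the inequalities used later.

```python
import math

def ceil_log2(x: int) -> int:
    """Return ceil(log2(x)) for a positive integer x, using integer arithmetic."""
    return (x - 1).bit_length()

def index_parameters(n: int, a: float, gamma: float, e: int, eps: float = 0.25) -> dict:
    """Evaluate the quantities of the index construction for concrete inputs.

    Assumption (not stated in this section): log denotes log base 2.
    """
    log_n = math.log2(n)
    d1 = 2 * e + 1
    d2 = 4 * e + 1
    # I = ceil( (1 - gamma*a)/(1 - gamma) * log n + (log n)^(0.5 + eps) )
    I = math.ceil((1 - gamma * a) / (1 - gamma) * log_n + log_n ** (0.5 + eps))
    # r_I = ceil( (3*d1 + 8) * log I )
    r_I = math.ceil((3 * d1 + 8) * math.log2(I))
    # Length of the d1-auto-cyclic sequence u.
    ell_d1 = d1 * ceil_log2(d1) + 2 * d1
    K = math.ceil(math.sqrt(log_n))
    F = math.ceil((I + r_I) / K)
    return {"d1": d1, "d2": d2, "I": I, "r_I": r_I,
            "ell_d1": ell_d1, "K": K, "F": F}

params = index_parameters(n=2**20, a=2.0, gamma=0.25, e=1, eps=0.25)
```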

In the following, we first consider the case of $L_{\rm min}\mid n$ and give the code construction. Then we will show how to modify this construction to settle the other cases.

IV-A The case of $L_{\rm min}\mid n$

Let us define

	$\displaystyle r$	$\displaystyle\triangleq I+r_{I}+K+\ell_{d_{1}}+d_{1},$
	$\displaystyle L$	$\displaystyle\triangleq\left\lceil\left\lparen L_{\rm over}-K-\ell_{d_{1}}-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right\rparen\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}\right\rceil.$

We note that by our choice of parameters, $L_{\rm min}>r$ for all sufficiently large $n$ . Assume that $L_{\rm min}\mid n$ and denote $n_{L}\triangleq\frac{n}{L_{\rm min}}$ . For each $i\in[2^{I}]$ , let

N_{i}\triangleq\begin{cases}\lceil n_{L}/2^{I}\rceil(L_{\rm min}-r)&\text{if $i<n_{L}\bmod 2^{I}$,}\\ \lfloor n_{L}/2^{I}\rfloor(L_{\rm min}-r)&\text{otherwise.}\\ \end{cases}

(6)

Then $\sum_{i\in[2^{I}]}N_{i}=n_{L}(L_{\rm min}-r).$
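The definition of $N_{i}$ in (6) and the identity $\sum_{i}N_{i}=n_{L}(L_{\rm min}-r)$ can be checked on small numbers. The sketch below uses a hypothetical helper `block_lengths`, where `num_indices` plays the role of $2^{I}$; the concrete values are toy inputs chosen only for illustration.

```python
from typing import List

def block_lengths(n_L: int, num_indices: int, L_min: int, r: int) -> List[int]:
    """Return [N_0, ..., N_{num_indices-1}] as defined in (6)."""
    q, rem = divmod(n_L, num_indices)
    # The first (n_L mod num_indices) indices get ceil(n_L/num_indices) blocks,
    # the remaining ones get floor(n_L/num_indices) blocks of L_min - r symbols.
    return [((q + 1) if i < rem else q) * (L_min - r) for i in range(num_indices)]

lengths = block_lengths(n_L=11, num_indices=4, L_min=10, r=3)
```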

Lemma 20.

Let $K,L,N_{i}$ be defined as above, and assume $n$ is large enough. Then for each $i\in[2^{I}]$ there is an integer $m(N_{i})$ with $N_{i}-m(N_{i})=\Theta(N_{i}/\log N_{i})$ and an invertible map $\mathcal{E}_{\mathtt{SD}}^{(i)}:\Sigma^{m(N_{i})}\rightarrow\Sigma^{N_{i}}$ which can encode sequences of $\Sigma^{m(N_{i})}$ into $(\lfloor K/4\rfloor,d_{2})$ -WWL and $(L,d_{2})$ -SD sequences.

Proof:

We shall apply Theorem 17 to prove this lemma. To this end, we first need to verify that $N_{i}$ can be arbitrarily large. As noted before, $L_{\rm min}-r>0$ . Additionally, $n_{L}=\Theta(n/\log n)$ and $2^{I}=n^{\frac{1-\gamma a}{1-\gamma}(1+o(1))}$ , where, by our choice of parameters, $\frac{1-\gamma a}{1-\gamma}<1$ is a constant. Hence, $N_{i}\to\infty$ as $n\to\infty$ .

Next, we need to verify that $\lfloor K/4\rfloor$ and $L$ satisfy the two conditions in Theorem 17. Regarding the value of $K$ , we need to show that $\lfloor K/4\rfloor\geqslant(2d_{2}+9)\log\log N_{i}+O(1)$ . Noting that $r_{I}=\lceil(3d_{1}+8)\log I\rceil=O(\log\log n)$ and $K=O(\sqrt{\log n})$ , we have that

	$\displaystyle 1-\frac{r}{L_{\rm min}}$	$\displaystyle=1-\frac{I+r_{I}+K+\ell_{d_{1}}+d_{1}}{L_{\rm min}}=1-\frac{(\frac{1-\gamma a}{1-\gamma})\log n+(\log n)^{0.5+\epsilon}+O(\sqrt{\log n})}{{a\log n}+O(1)}$
		$\displaystyle=1-\left\lparen\frac{1/a-\gamma}{1-\gamma}+\frac{1}{a(\log n)^{0.5-\epsilon}}+O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen\right\rparen\frac{a\log n}{a\log n+O(1)}$
		$\displaystyle=1-\left\lparen\frac{1/a-\gamma}{1-\gamma}+\frac{1}{a(\log n)^{0.5-\epsilon}}+O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen\right\rparen\left\lparen 1-O\left\lparen\frac{1}{\log n}\right\rparen\right\rparen$
		$\displaystyle=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen.$

It follows that

	$\displaystyle\log N_{i}$	$\displaystyle=\log\left\lparen\frac{n_{L}}{2^{I}}(L_{\rm min}-r)\right\rparen\pm O(1)=\log\left\lparen\frac{n}{2^{I}}\left\lparen 1-\frac{r}{L_{\rm min}}\right\rparen\right\rparen\pm O(1)$
		$\displaystyle=\log n-I\pm O(1)=\frac{\gamma a-\gamma}{1-\gamma}\log n-(\log n)^{0.5+\epsilon}\pm O(1).$

Since $K=\left\lceil\sqrt{\log n}\right\rceil$ , we have that $\lfloor K/4\rfloor$ is substantially larger than $(2d_{2}+9)\log\log N_{i}+O(1).$

Now, we verify the condition on $L$ , namely that $L\geqslant\log N_{i}+(6d_{2}+7)\log\log N_{i}+O(1)$ . Note that

	$\displaystyle\frac{I+r_{I}}{L_{\rm min}-r}$	$\displaystyle=\frac{I+O(\log{\log n})}{L_{\rm min}-I-O(\sqrt{\log n})}=\frac{I}{L_{\rm min}-I}\cdot\frac{1+O(\log\log n/{\log n})}{1-O(1/\sqrt{\log n})}$
		$\displaystyle=\frac{I}{L_{\rm min}-I}\left\lparen 1+O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen\right\rparen=\frac{I}{L_{\rm min}-I}+O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen,$

and

\displaystyle\left\lceil\frac{I+r_{I}}{F}\right\rceil=\left\lceil\frac{I+r_{I}}{\lceil(I+r_{I})/K\rceil}\right\rceil\leqslant\frac{I+r_{I}}{(I+r_{I})/K}+1=K+1.

Hence, we have that

	$\displaystyle L$	$\displaystyle=\left\lceil\left\lparen L_{\rm over}-K-\ell_{d_{1}}-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right\rparen\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}\right\rceil$
		$\displaystyle\geqslant\frac{L_{\rm over}-3K-\ell_{d_{1}}-d_{1}-2}{1+(I+r_{I})/(L_{\rm min}-r)}=\frac{L_{\rm over}-O(\sqrt{\log n})}{\frac{L_{\rm min}}{L_{\rm min}-I}+O\left\lparen{1}/{\sqrt{\log n}}\right\rparen}$
		$\displaystyle=\frac{L_{\rm over}(L_{\rm min}-I)}{L_{\rm min}}\cdot\frac{1-O(1/\sqrt{\log n})}{1+O(1/\sqrt{\log n})}$
		$\displaystyle\geqslant\gamma\left\lparen a\log n-\frac{1-\gamma a}{1-\gamma}\log n-(\log n)^{0.5+\epsilon}-1\right\rparen\left\lparen 1-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen\right\rparen$
		$\displaystyle=\frac{\gamma a-\gamma}{1-\gamma}\log n-\gamma(\log n)^{0.5+\epsilon}-O(\sqrt{\log n}).$

It follows that

\displaystyle L-\log N_{i}\geqslant(1-\gamma)(\log n)^{0.5+\epsilon}-O(\sqrt{\log n})=\omega(\log\log N_{i}).

We can conclude that $L$ is substantially larger than $\log N_{i}+(6d_{2}+7)\log\log N_{i}+O(1)$ . ∎

Now, we present our code construction.

Construction B.

Let $m(N_{i})$ ’s be defined as in Lemma 20. We now describe a mapping from $\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$ to $\Sigma^{n}$ . For any message $\mathbf{m}\in\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$ , partition $\mathbf{m}$ into $2^{I}$ substrings:

\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{2^{I}-1},

where each $\mathbf{m}_{i}$ has length $m(N_{i})$ . For each $i\in[2^{I}]$ , let

\mathbf{v}_{i}=\mathcal{E}_{\mathtt{SD}}^{(i)}(\mathbf{m}_{i})\in\Sigma^{N_{i}},

where $\mathcal{E}_{\mathtt{SD}}^{(i)}$ is the map mentioned in Lemma 20. We partition each $\mathbf{v}_{i}$ into substrings of length $L_{\rm min}-r$ :

\mathbf{v}_{i}=\begin{cases}\mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\lceil n_{L}/2^{I}\rceil-1}&\text{if $i<n_{L}\bmod 2^{I}$},\\ \mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\lfloor n_{L}/2^{I}\rfloor-1}&\text{otherwise}.\\ \end{cases}

Then the total number of $\mathbf{v}_{i,j}$ ’s is $n_{L}$ . We further partition each $\mathbf{v}_{i,j}$ into $F$ segments of lengths $\lceil(L_{\rm min}-r)/F\rceil$ or $\lfloor(L_{\rm min}-r)/F\rfloor$ :

\mathbf{v}_{i,j}=\mathbf{v}_{i,j}^{(0)}\circ\mathbf{v}_{i,j}^{(1)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}.

Recall $\mathbf{c}_{i}^{(m)}$ from the index construction, Construction A. Let

\mathbf{w}_{i,j}\triangleq\begin{cases}0^{d_{1}}\circ\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}&\text{if $j=0$},\\ 1^{d_{1}}\circ\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}&\text{otherwise}.\\ \end{cases}

Finally, let

\mathbf{w}_{i}=\begin{cases}\mathbf{p}\circ\mathbf{w}_{i,0}\circ\mathbf{p}\circ\mathbf{w}_{i,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{i,\lceil n_{L}/2^{I}\rceil-1}&\text{if $i<n_{L}\bmod 2^{I}$},\\ \mathbf{p}\circ\mathbf{w}_{i,0}\circ\mathbf{p}\circ\mathbf{w}_{i,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{i,\lfloor n_{L}/2^{I}\rfloor-1}&\text{otherwise},\\ \end{cases}

where $\mathbf{p}\triangleq 0^{K}\circ\mathbf{u}$ and $\mathbf{u}$ is the $d_{1}$ -auto-cyclic sequence in Theorem 11. Denote

\mathbf{w}\triangleq\mathbf{w}_{0}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{w}_{2^{I}-1}.

The constructed code, $\mathcal{C}_{\rm Trace}$ , is the image of the mapping described above. ∎
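The interleaving step that forms $\mathbf{p}\circ\mathbf{w}_{i,j}$ can be sketched as follows. This is only an illustration under stated assumptions: strings over `"0"`/`"1"` stand for binary sequences, the helper names `split_balanced` and `build_block` are hypothetical, the marker and the data are toy values, and placing the longer segments first is one possible choice (the construction only requires segment lengths that differ by at most one).

```python
from typing import List

def split_balanced(s: str, parts: int) -> List[str]:
    """Split s into `parts` consecutive segments whose lengths differ by at most one."""
    q, rem = divmod(len(s), parts)
    out, pos = [], 0
    for h in range(parts):
        size = q + 1 if h < rem else q  # assumption: longer segments come first
        out.append(s[pos:pos + size])
        pos += size
    return out

def build_block(p: str, first: bool, d1: int, v_block: str, c_i: str, F: int) -> str:
    """Return p ∘ w_{i,j}: marker, flag (0^{d1} if j = 0, else 1^{d1}),
    then the F segments of v_{i,j} interleaved with the F segments of c_i."""
    flag = ("0" if first else "1") * d1
    v_segs = split_balanced(v_block, F)
    c_segs = split_balanced(c_i, F)
    body = "".join(v + c for v, c in zip(v_segs, c_segs))
    return p + flag + body

# Toy values: |p| = 7, d1 = 3, |v_{i,j}| = 8, |c_i| = 4, F = 2.
block = build_block(p="0001101", first=True, d1=3,
                    v_block="10110100", c_i="0110", F=2)
```

In this toy instance the block length is $\lvert\mathbf{p}\rvert+d_{1}+\lvert\mathbf{v}_{i,j}\rvert+\lvert\mathbf{c}_{i}\rvert=7+3+8+4=22$, mirroring the length accounting that gives $L_{\rm min}$ in the construction.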

Lemma 21.

Let $\mathcal{C}_{\rm Trace}$ be the code obtained by Construction B. Then $\mathcal{C}_{\rm Trace}\subseteq\Sigma^{n}$ and its rate is

R(\mathcal{C}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen.

Proof:

In our construction, every sequence $\mathbf{w}_{i,j}$ has length $L_{\rm min}-r+d_{1}+\lvert\mathbf{c}_{i}\rvert=L_{\rm min}-K-\ell_{d_{1}}$ , and so, the concatenation $\mathbf{p}\circ\mathbf{w}_{i,j}$ has length $L_{\rm min}$ . It follows that the codeword $\mathbf{w}$ has length $n_{L}L_{\rm min}=n$ . Since each map $\mathcal{E}_{\mathtt{SD}}^{(i)}$ is invertible, we can uniquely recover $\mathbf{m}$ from $\mathbf{w}$ . Therefore, the code $\mathcal{C}_{\rm Trace}$ has rate $\sum_{i\in[2^{I}]}m(N_{i})/{n}$ .

We have shown in the proof of Lemma 20 that

1-\frac{r}{L_{\rm min}}=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen,

and for each $i\in[2^{I}]$ ,

\log N_{i}=\Theta(\log n).

Hence,

	$\displaystyle R(\mathcal{C}_{\rm Trace})$	$\displaystyle=\frac{\sum_{i\in[2^{I}]}m(N_{i})}{n}=\frac{\sum_{i\in[2^{I}]}\left\lparen N_{i}-\Theta\left\lparen N_{i}/\log N_{i}\right\rparen\right\rparen}{n}$
		$\displaystyle=\frac{\sum_{i\in[2^{I}]}N_{i}}{n}\left\lparen 1-\Theta\left\lparen\frac{1}{\log n}\right\rparen\right\rparen=\frac{n_{L}(L_{\rm min}-r)}{n}\left\lparen 1-\Theta\left\lparen\frac{1}{\log n}\right\rparen\right\rparen$
		$\displaystyle=\left\lparen 1-\frac{r}{L_{\rm min}}\right\rparen\left\lparen 1-\Theta\left\lparen\frac{1}{\log n}\right\rparen\right\rparen=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen.$

∎

In the following, we shall show that the code $\mathcal{C}_{\rm Trace}$ is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code.

Lemma 22 (Construction 1 and Lemma 3.6 in [6]).

Let $\mathbf{w}=\mathbf{p}\circ\mathbf{w}_{0,0}\circ\mathbf{p}\circ\mathbf{w}_{0,1}\circ\cdots\circ\mathbf{p}\circ\mathbf{w}_{2^{I}-1,\lfloor n_{L}/2^{I}\rfloor-1}$ be a codeword of $\mathcal{C}_{\rm Trace}$ . Assume that the substrings $\mathbf{w}_{i,j}$ ’s satisfy the following conditions:

(P1)

$\mathbf{w}_{i,j}$ is a $(K,d_{1})$ -WWL sequence for each $(i,j)$ ; and
(P2)

$\mathbf{w}_{i,j}[0,\mu-1]\circ\mathbf{w}_{i^{\prime},j^{\prime}}[\mu,L_{\rm min}-K-\ell_{d_{1}}-1]$ is a $(K,d_{1})$ -WWL sequence for $(i,j),(i^{\prime},j^{\prime})$ such that $(i,j)\neq(i^{\prime},j^{\prime})$ and $\mu\in[L_{\rm min}-K-\ell_{d_{1}}]$ .

Then for every substring $\mathbf{y}=\mathbf{w}_{i_{0}+[L_{\rm min}]}$ in $\mathbf{w}$ and each $i\in[L_{\rm min}]$ , the following hold (here, if $i\in[L_{\rm min}-K+\ell_{d_{1}},L_{\rm min}-1]$ , we let $\mathbf{y}_{i+[K+\ell_{d_{1}}]}$ denote the concatenation $\mathbf{y}[i,L_{\rm min}-1]\circ\mathbf{y}[0,K+\ell_{d_{1}}-(L_{\rm min}-i)-1]$ ):

(i)

If $i+i_{0}\equiv 0~{}({\rm mod~{}}L_{\rm min})$ , then $\mathbf{y}_{i+[K+\ell_{d_{1}}]}=\mathbf{p}$ .
(ii)

If $i+i_{0}\not\equiv 0~{}({\rm mod~{}}L_{\rm min})$ , then $d_{H}(\mathbf{y}_{i+[K+\ell_{d_{1}}]},\mathbf{p})\geqslant d_{1}$ .

Lemma 23.

Assume $n$ is sufficiently large. Let $\mathbf{y}$ be an arbitrary length- $L_{\rm min}$ substring of $\mathbf{w}\in\mathcal{C}_{\rm Trace}$ . Then $\mathbf{y}$ contains a length- $(I+r_{I}-\mu)$ suffix of a coded index $\mathbf{c}_{i}$ and a length- $\mu$ prefix of either $\mathbf{c}_{i}$ or $\mathbf{c}_{i+1}$ for some $i\in[2^{I}]$ and $\mu\in[I+r_{I}]$ . Furthermore, even if $\mathbf{y}$ is corrupted by at most $e$ errors, we can still identify the positions where the said suffix and prefix appear, and so reconstruct them with at most $e$ errors.

Proof:

We note that the length of $\mathbf{p}\circ\mathbf{w}_{i,j}$ is $L_{\rm min}$ , and that $\mathbf{w}$ is a concatenation of such strings. Hence, the first statement follows directly from the code construction. Now, assume that $\mathbf{y}$ is corrupted by at most $e$ errors. We shall use Lemma 22 to identify the location of the marker $\mathbf{p}$ in $\mathbf{y}$ . Recall that every $\mathbf{c}_{i}$ is $(3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}},d_{1})$ -WWL (see the index construction, Construction A) and every $\mathbf{v}_{i}$ is $(\lfloor K/4\rfloor,d_{2})$ -WWL (see Lemma 20). Since $3\left\lceil\frac{3}{2}\log(I+r_{I})\right\rceil+\ell_{d_{1}}<\lfloor K/4\rfloor$ and $d_{1}<d_{2}$ , all the segments $\mathbf{c}_{i}^{(h)}$ ’s and $\mathbf{v}_{i,j}^{(h)}$ ’s are $(\lfloor K/4\rfloor,d_{1})$ -WWL. Hence, $\mathbf{w}_{i,j}$ ’s satisfy the conditions in Lemma 22. This follows since any substring of length $K$ contains a substring of length $\lfloor K/4\rfloor$ that is fully contained within a segment of the form $\mathbf{c}_{i}^{(h)}$ or $\mathbf{v}_{i,j}^{(h)}$ , thus providing the minimum weight of $d_{1}$ as claimed.

Since $\mathbf{y}$ suffers from at most $e$ errors and $d_{1}=2e+1$ , by Lemma 22 there is a unique index $i\in[L_{\rm min}]$ such that

d_{H}(\mathbf{y}_{i+[K+\ell_{d_{1}}]},\mathbf{p})\leqslant e.

Hence, by comparing the distance between the marker $\mathbf{p}$ and each length- $(K+\ell_{d_{1}})$ substring $\mathbf{y}_{i+[K+\ell_{d_{1}}]}$ of $\mathbf{y}$ , we can identify the location of the marker in $\mathbf{y}$ . Once the marker $\mathbf{p}$ is located, the positions in which the symbols of the coded indices $\mathbf{c}_{i}^{(h)}$ ’s appear can also be determined. Then we can reconstruct a suffix $\mathbf{c}_{i}[\mu,I+r_{I}-1]$ and a prefix $\mathbf{c}_{i}[0,\mu-1]$ or $\mathbf{c}_{i+1}[0,\mu-1]$ for some $\mu\in[I+r_{I}]$ with at most $e$ errors. ∎
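The marker search used in this proof amounts to a cyclic sliding-window comparison in Hamming distance. The following minimal sketch illustrates it; the function names are hypothetical, and the marker `"0000001011"` and the read are hand-made toy values (not the sequence $0^{K}\circ\mathbf{u}$ of Theorem 11), chosen so that every cyclic window other than the true one is at distance at least $3$ from the marker.

```python
from typing import List

def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def locate_marker(y: str, p: str, e: int) -> List[int]:
    """Return all offsets i such that the cyclic window of y starting at i,
    of length |p|, is within Hamming distance e of the marker p."""
    doubled = y + y  # makes every cyclic window a contiguous slice
    return [i for i in range(len(y)) if hamming(doubled[i:i + len(p)], p) <= e]

marker = "0000001011"
clean_read = "10111110000001011001"      # marker starts at offset 7
corrupted_read = "10111110010001011001"  # one substitution at position 9

clean_hits = locate_marker(clean_read, marker, e=1)
noisy_hits = locate_marker(corrupted_read, marker, e=1)
```

With $e=1$, a single substitution changes each window distance by at most one, so the true offset remains the only window within distance $e$, in line with the uniqueness argument above.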

The following lemma ensures that every length- $L_{\rm over}$ substring of $\mathbf{w}$ contains a long-enough substring of the $(L,d_{2})$ -SD sequence $\mathbf{v}_{i}$ .

Lemma 24.

Assume $n$ is sufficiently large. Let $\mathbf{w}$ be a codeword of $\mathcal{C}_{\rm Trace}$ . Then every length- $L_{\rm over}$ substring of $\mathbf{w}$ contains at least $L$ consecutive symbols of $\mathbf{v}=\mathbf{v}_{0}\circ\mathbf{v}_{1}\circ\cdots\circ\mathbf{v}_{2^{I}-1}$ .

Proof:

Note that the concatenation

\mathbf{v}_{i,j}^{(0)}\circ\mathbf{c}_{i}^{(0)}\circ\cdots\circ\mathbf{v}_{i,j}^{(F-1)}\circ\mathbf{c}_{i}^{(F-1)}

consists of $\lvert\mathbf{v}_{i,j}\rvert+\lvert\mathbf{c}_{i}\rvert=L_{\rm min}-r+I+r_{I}$ symbols, out of which $\lvert\mathbf{v}_{i,j}\rvert=L_{\rm min}-r$ symbols are from $\mathbf{v}$ . Then according to the construction, every length- $L_{\rm over}$ substring of $\mathbf{w}$ contains at least

\left\lparen L_{\rm over}-(K+\ell_{d_{1}})-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil\right\rparen\frac{L_{\rm min}-r}{L_{\rm min}-r+I+r_{I}}

consecutive symbols of $\mathbf{v}$ , where $L_{\rm over}-(K+\ell_{d_{1}})-d_{1}-2\left\lceil\frac{I+r_{I}}{F}\right\rceil$ accounts for the worst case where the substring both begins and ends with some segments of the coded indices (of length $\left\lceil\frac{I+r_{I}}{F}\right\rceil$ or $\left\lfloor\frac{I+r_{I}}{F}\right\rfloor$ ) and contains a copy of $\mathbf{p}\circ 0^{d_{1}}$ or $\mathbf{p}\circ 1^{d_{1}}$ . ∎

Theorem 25.

The code $\mathcal{C}_{\rm Trace}$ obtained in Construction B is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of $\Sigma^{n}$ with rate

R(\mathcal{C}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-\frac{1}{a(\log n)^{0.5-\epsilon}}-O\left\lparen\frac{1}{\sqrt{\log n}}\right\rparen.

Proof:

The code rate has been calculated in Lemma 21. Let $\mathbf{w}$ be a codeword of $\mathcal{C}_{\rm Trace}$ and $\mathcal{Y}$ be an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{w}$ . For each $\mathbf{y}$ in $\mathcal{Y}$ , since the length of $\mathbf{y}$ is at least $L_{\rm min}$ , according to Lemma 23, we can extract a corrupted copy $\mathbf{c}_{\rm suf}$ of the length- $(I+r_{I}-\mu)$ suffix of $\mathbf{c}_{i}$ , and a corrupted copy $\mathbf{c}_{\rm pre}$ of a length- $\mu$ prefix of either $\mathbf{c}_{i}$ or $\mathbf{c}_{i+1}$ , with the total number of errors being no more than $e$ . Consider the following cases.

1.

If $\mu=0$ , then $\mathbf{c}_{\rm suf}$ is a corrupted copy of $\mathbf{c}_{i}$ , and so, we can run the locating algorithm of the robust positioning sequence $\mathbf{c}=\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ on the corrupted $\mathbf{c}_{\rm suf}$ to determine the index $i$ .
2.
If $\mu>0$ , then $\mathbf{y}$ contains a copy of either $\mathbf{p}\circ 0^{d_{1}}$ or $\mathbf{p}\circ 1^{d_{1}}$ with at most $e$ errors. Since $d_{1}=2e+1$ , we can distinguish these two cases.
(a)

If $\mathbf{y}$ contains a copy of $\mathbf{p}\circ 0^{d_{1}}$ , then $\mathbf{c}_{\rm pre}$ is a prefix of $\mathbf{c}_{i+1}$ , and so, we run the locating algorithm of $\mathbf{c}$ on $\mathbf{c}_{\rm suf}\circ\mathbf{c}_{\rm pre}$ to decode the index $i$ .
(b)

If $\mathbf{y}$ contains a copy of $\mathbf{p}\circ 1^{d_{1}}$ , then $\mathbf{c}_{\rm pre}$ is a prefix of $\mathbf{c}_{i}$ , and so, we run the locating algorithm of $\mathbf{c}$ on $\mathbf{c}_{\rm pre}\circ\mathbf{c}_{\rm suf}$ to decode the index $i$ .
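The case analysis above can be sketched in code. Since the flag is $0^{d_{1}}$ or $1^{d_{1}}$ with $d_{1}=2e+1$ and suffers at most $e$ errors, a majority vote recovers it; the flag then decides in which order the two pieces are concatenated before the locating algorithm of $\mathbf{c}$ is run. The helper names below are hypothetical, and the locating algorithm itself is not reproduced here.

```python
def decode_flag(flag_bits: str) -> str:
    """Majority-decode a possibly corrupted copy of 0^{d1} or 1^{d1}."""
    ones = flag_bits.count("1")
    return "1" if 2 * ones > len(flag_bits) else "0"

def index_window(c_suf: str, c_pre: str, flag_bits: str) -> str:
    """Return the length-(I + r_I) string to feed to the locating algorithm."""
    if not c_pre:                        # case 1: mu = 0, c_suf is all of c_i
        return c_suf
    if decode_flag(flag_bits) == "0":    # case 2(a): c_pre is a prefix of c_{i+1}
        return c_suf + c_pre
    return c_pre + c_suf                 # case 2(b): c_pre is a prefix of c_i

window_a = index_window("011", "10", "010")  # flag 0^3 with one error
window_b = index_window("011", "10", "110")  # flag 1^3 with one error
```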

The discussion above shows that for every string $\mathbf{y}\in\mathcal{Y}$ , we can decode the index $i$ . If $\mathbf{y}$ intersects both $\mathbf{v}_{i}$ and $\mathbf{v}_{i+1}$ , then we can determine its location in $\mathbf{w}$ by identifying the location of the marker $\mathbf{p}$ in $\mathbf{y}$ . For the other strings with index $i$ , since $\mathbf{v}_{i}$ is an $(L,4e+1)$ -SD sequence, according to Lemma 24 and Proposition 18, there is a unique way to determine the correct order of these strings and match correctly the suffix and the prefix of consecutive strings. By taking the majority value at every position, we can reconstruct a sequence $\mathbf{w}_{i}^{\prime}$ , which is a long substring of $\mathbf{w}_{i}$ possibly with some errors. It remains to determine the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$ , which can be done as follows.

1.

If $\mathbf{w}_{i}^{\prime}$ contains a corrupted copy of $\mathbf{p}\circ 0^{d_{1}}$ with at most $e$ errors, then the location of this marker in $\mathbf{w}_{i}^{\prime}$ determines the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$ , since $\mathbf{w}_{i}$ only contains one copy of $\mathbf{p}\circ 0^{d_{1}}$ .
2.
If $\mathbf{w}_{i}^{\prime}$ does not contain any corrupted copy of $\mathbf{p}\circ 0^{d_{1}}$ up to $e$ errors, then there is a string $\hat{\mathbf{y}}\in\mathcal{Y}$ which intersects both $\mathbf{w}_{i-1}$ and $\mathbf{w}_{i}$ and contains $\mathbf{p}\circ 0^{d_{1}}$ as a substring with at most $e$ errors, since the length of $\mathbf{p}\circ 0^{d_{1}}$ is less than $L_{\rm over}$ .
(a)

If $\hat{\mathbf{y}}$ overlaps $\mathbf{w}_{i}$ in at most $L_{\rm over}$ positions, since $L_{\rm over}<L_{\rm min}$ , $\mathbf{w}_{i}^{\prime}$ must contain a copy of the first $\mathbf{p}\circ 1^{d_{1}}$ of $\mathbf{w}_{i}$ , and so, the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}_{i}$ can be determined by identifying the first occurrence of the marker $\mathbf{p}$ in $\mathbf{w}_{i}^{\prime}$ .
(b)

If $\hat{\mathbf{y}}$ overlaps $\mathbf{w}_{i}$ in at least $L_{\rm over}$ positions, then $\hat{\mathbf{y}}$ and the length- $L_{\rm over}$ prefix of $\mathbf{w}_{i}^{\prime}$ share a length- $L$ substring of $\mathbf{v}_{i}$ . Since $\mathbf{v}_{i}$ is $(L,4e+1)$ -SD, we can match the suffix of $\hat{\mathbf{y}}$ and the prefix of $\mathbf{w}_{i}^{\prime}$ correctly. Then the location of $\mathbf{w}_{i}^{\prime}$ in $\mathbf{w}$ can be deduced from the location of $\hat{\mathbf{y}}$ in $\mathbf{w}$ .

∎
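The majority step in the proof above, applied once the relative order of the reads has been determined, can be illustrated by the following sketch. It assumes that each read comes with its already determined offset; the name `consensus` and the toy reads are hypothetical, and ties (which do not occur in this example) are broken arbitrarily.

```python
from collections import Counter
from typing import List, Tuple

def consensus(reads_with_offsets: List[Tuple[int, str]], length: int) -> str:
    """Take the majority symbol at every position covered by the aligned reads.
    Positions not covered by any read are reported as '?'."""
    votes = [Counter() for _ in range(length)]
    for offset, read in reads_with_offsets:
        for j, symbol in enumerate(read):
            votes[offset + j][symbol] += 1
    return "".join(v.most_common(1)[0][0] if v else "?" for v in votes)

# Toy example: the underlying string is "011011"; the read at offset 1
# ("1101" in the noiseless case) has one substitution.
reads = [(0, "0110"), (2, "1011"), (1, "1001")]
recovered = consensus(reads, length=6)
```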

IV-B The case of $L_{\rm min}\nmid n$

Now, we consider the case that $L_{\rm min}$ does not divide $n$ . Take $n_{L}=\lfloor n/L_{\rm min}\rfloor$ . Construction B can yield a trace reconstruction code of block length $n_{L}L_{\rm min}$ . Our approach is to extend this code to have length $n$ . Let $N_{i}$ be defined as in (6) and $m(N_{i})$ be defined as in Lemma 20. For any message $\mathbf{m}\in\Sigma^{\sum_{i\in[2^{I}]}m(N_{i})}$ , partition $\mathbf{m}$ into $2^{I}$ substrings, each of length $m(N_{i})$ :

\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{2^{I}-1}.

For each $i\in[2^{I}-1]$ , let

\mathbf{v}_{i}=\mathcal{E}_{\mathtt{SD}}^{(i)}(\mathbf{m}_{i})\in\Sigma^{N_{i}}.

The main difference from the previous case is the encoding of $\mathbf{m}_{2^{I}-1}$ . We recall that the encoder $\mathcal{E}_{\mathtt{SD}}^{(i)}$ first encodes the message $\mathbf{m}_{i}$ to an SD and WWL sequence of length possibly less than $N_{i}$ . Then it extends the sequence by appending a sequence $\bar{\mathbf{s}}$ and taking the first $N_{i}$ bits of the concatenation. For $i=2^{I}-1$ , we modify the encoder $\mathcal{E}_{\mathtt{SD}}^{(2^{I}-1)}$ by taking the first $N_{2^{I}-1}+{L_{\rm min}-r}$ bits of the concatenation. This is possible since asymptotically the length of $\bar{\mathbf{s}}$ is larger than $N_{2^{I}-1}+{L_{\rm min}-r}$ (see (5)). We denote this modified encoder as $\mathcal{E}_{\mathtt{SDE}}^{(2^{I}-1)}$ and let

\mathbf{v}_{2^{I}-1}=\mathcal{E}_{\mathtt{SDE}}^{(2^{I}-1)}(\mathbf{m}_{2^{I}-1}).

Then $\mathbf{v}_{2^{I}-1}$ is $(\lfloor K/4\rfloor,d_{2})$ -WWL and $(L,d_{2})$ -SD and has length $N_{2^{I}-1}+L_{\rm min}-r=\lceil n_{L}/2^{I}\rceil(L_{\rm min}-r)$ . Moreover, the message $\mathbf{m}_{2^{I}-1}$ can be decoded from the first $N_{2^{I}-1}$ bits of $\mathbf{v}_{2^{I}-1}$ . In other words, the last $L_{\rm min}-r$ bits are redundant.

Then, we proceed similarly as in Construction B and obtain an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of block length ${(n_{L}+1)L_{\rm min}}$ . Note that the last $L_{\rm min}$ bits are redundant, and so, we delete ${(n_{L}+1)L_{\rm min}}-n$ of them to form an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of length $n$ , with code rate

\frac{\sum_{i\in[2^{I}]}m(N_{i})}{n}=\left\lparen\frac{1-1/a}{1-\gamma}-o(1)\right\rparen\frac{n_{L}L_{\rm min}}{n}=\frac{1-1/a}{1-\gamma}-o(1).

IV-C Handling noise which occurs before sequencing

Up to now, we have studied $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes, which allow reconstructing the maximum reconstructible-string from an erroneous trace $\mathcal{Y}$ of a codeword $\mathbf{w}$ . We use $M(\mathcal{Y})$ to denote the maximum reconstructible-string of $\mathcal{Y}$ . If $\mathcal{Y}$ is reliable, then $M(\mathcal{Y})=\mathbf{w}$ . However, if $\mathcal{Y}$ is not reliable, then $M(\mathcal{Y})$ might be different from $\mathbf{w}$ . This may happen especially when the sequence $\mathbf{w}$ is subject to errors before its substrings are sampled. In the remainder of this section, we shall modify Construction B to combat such errors.

Let $\mathcal{Y}$ be an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{w}$ such that $d_{H}(M(\mathcal{Y}),\mathbf{w})\leqslant\tau$ , which is referred to as an $(L_{\rm min},L_{\rm over},e,\tau)$ -erroneous trace. We aim to reconstruct $\mathbf{w}$ from $\mathcal{Y}$ , and so retrieve the message which is stored in $\mathbf{w}$ . Our construction, which is presented below, borrows the idea from [2, Construction B].

Construction C.

Assume that $L_{\rm min}\mid n$ and take $n_{L}=n/L_{\rm min}$ . Let $N\triangleq\lfloor n_{L}/2^{I}\rfloor(L_{\rm min}-r)$ . According to Lemma 20, there is an integer $m(N)$ with $N-m(N)=\Theta(N/\log N)$ and an invertible map $\mathcal{E}_{\mathtt{SD}}:\Sigma^{m(N)}\rightarrow\Sigma^{N}$ which can encode sequences of $\Sigma^{m(N)}$ into $(\lfloor K/4\rfloor,d_{2})$ -WWL and $(L,d_{2})$ -SD sequences. Let $\mathcal{E}_{\mathtt{SDE}}:\Sigma^{m(N)}\rightarrow\Sigma^{N+L_{\rm min}-r}$ be an encoder which modifies $\mathcal{E}_{\mathtt{SD}}$ by taking the first $N+L_{\rm min}-r$ bits of the concatenation.

For any message $\mathbf{m}\in\Sigma^{(2^{I}-2\tau)m(N)}$ , we first use a $[2^{I},2^{I}-2\tau,2\tau+1]_{2^{m(N)}}$ Reed-Solomon code to encode $\mathbf{m}$ into a codeword $\bar{\mathbf{m}}\in\Sigma^{{2^{I}}m(N)}$ . (The Reed-Solomon code is over the finite field of size $2^{m(N)}$ . The message is partitioned into groups of $m(N)$ bits, and each group is translated to a single symbol from the finite field. After encoding, the reverse translation to bits is performed. Note that $m(N)=N-\Theta(N/\log N)$ , $\log(N)=\Theta(\log n)$ and $I=O(\log n)$ . Hence, $m(N)>I$ and so, the Reed-Solomon code exists.) We partition $\bar{\mathbf{m}}$ into $2^{I}$ sequences of length $m(N)$ :

\bar{\mathbf{m}}=\bar{\mathbf{m}}_{0}\circ\bar{\mathbf{m}}_{1}\circ\cdots\circ\bar{\mathbf{m}}_{2^{I}-1}.

For each $i\in[2^{I}]$ , let

\mathbf{v}_{i}\triangleq\begin{cases}\mathcal{E}_{\mathtt{SDE}}(\bar{\mathbf{m}}_{i})\in\Sigma^{N+L_{\rm min}-r}&\text{if $i<n_{L}\bmod 2^{I}$},\\ \mathcal{E}_{\mathtt{SD}}(\bar{\mathbf{m}}_{i})\in\Sigma^{N}&\text{otherwise}.\\ \end{cases}

Then we proceed similarly as in Construction B to obtain a sequence $\mathbf{w}$ of length $n$ . We use $\hat{\mathcal{C}}_{\rm Trace}$ to denote the code produced by this construction. ∎

Lemma 26.

Let $\mathbf{w}$ be a codeword of $\hat{\mathcal{C}}_{\rm Trace}$ and $\mathcal{Y}$ be an $(L_{\rm min},L_{\rm over},e,\tau)$ -erroneous trace of $\mathbf{w}$ . Then we can recover $\mathbf{m}$ from $\mathcal{Y}$ .

Proof:

With the same argument as in the proof of Theorem 25, we can show that $\hat{\mathcal{C}}_{\rm Trace}$ is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of $\Sigma^{n}$ . Since $\mathcal{Y}$ is also an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{w}$ , the maximum reconstructible-string $M(\mathcal{Y})$ can be decoded from $\mathcal{Y}$ . By reversing the operations in Construction C, we obtain a sequence $\bar{\mathbf{m}}^{\prime}\in\Sigma^{{2^{I}}m(N)}$ from $M(\mathcal{Y})$ . We partition $\bar{\mathbf{m}}^{\prime}$ into $2^{I}$ segments of the same length, i.e., $\bar{\mathbf{m}}^{\prime}=\bar{\mathbf{m}}_{0}^{\prime}\circ\bar{\mathbf{m}}_{1}^{\prime}\circ\cdots\circ\bar{\mathbf{m}}_{2^{I}-1}^{\prime}$ . Since $d_{H}(M(\mathcal{Y}),\mathbf{w})\leqslant\tau$ , there are at most $\tau$ indices $i\in[2^{I}]$ such that $\bar{\mathbf{m}}_{i}\neq\bar{\mathbf{m}}_{i}^{\prime}$ . Hence, we can run the decoder of the Reed-Solomon code on $\bar{\mathbf{m}}^{\prime}$ to recover $\bar{\mathbf{m}}$ , and hence $\mathbf{m}$ . ∎
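The counting step behind this proof is that $\tau$ symbol errors can corrupt at most $\tau$ blocks, so a code of minimum distance $2\tau+1$ over the block alphabet suffices. The following sketch illustrates only this counting on equal-length blocks; it does not implement Reed-Solomon decoding or the inverse of $\mathcal{E}_{\mathtt{SD}}$, and the function names and toy strings are hypothetical.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))

def differing_blocks(x: str, y: str, block_len: int) -> int:
    """Count the length-block_len blocks in which x and y differ."""
    assert len(x) == len(y) and len(x) % block_len == 0
    return sum(x[s:s + block_len] != y[s:s + block_len]
               for s in range(0, len(x), block_len))

sent = "000000000000"
received = "011000010000"   # three bit errors, two of them in the same block

bit_errors = hamming(sent, received)
block_errors = differing_blocks(sent, received, block_len=4)
```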

Theorem 27.

Suppose that $\tau=O\left\lparen n^{\frac{1-\gamma a}{1-\gamma}}\right\rparen$ . Then the code $\hat{\mathcal{C}}_{\rm Trace}$ obtained in Construction C is an $(L_{\rm min},L_{\rm over},e,\tau)$ -trace reconstruction code of $\Sigma^{n}$ with rate

R(\hat{\mathcal{C}}_{\rm Trace})=\frac{1-1/a}{1-\gamma}-o(1).

Proof:

Since $\tau=O\left\lparen n^{\frac{1-\gamma a}{1-\gamma}}\right\rparen$ , we have that $2\tau/2^{I}=o(1)$ . Hence, the code rate

	$\displaystyle R(\hat{\mathcal{C}}_{\rm Trace})$	$\displaystyle=\frac{(2^{I}-2\tau)m(N)}{n}=\frac{2^{I}m(N)}{n}-\frac{2\tau N}{n}\left\lparen 1-\Theta\left\lparen\frac{1}{\log N}\right\rparen\right\rparen$
		$\displaystyle\geqslant\frac{2^{I}m(N)}{n}-\frac{2\tau}{2^{I}}\left\lparen 1-\frac{r}{L_{\rm min}}\right\rparen\left\lparen 1-\Theta\left\lparen\frac{1}{\log N}\right\rparen\right\rparen$
		$\displaystyle=\frac{2^{I}m(N)}{n}-o(1).$

Consider the $N_{i}$ ’s which are defined in (6). We have that

N_{i}=\begin{cases}N+L_{\rm min}-r&\text{if $i<n_{L}\bmod 2^{I}$},\\ N&\text{otherwise}.\\ \end{cases}

Hence,

	$\displaystyle R(\hat{\mathcal{C}}_{\rm Trace})$	$\displaystyle=\frac{2^{I}m(N)}{n}-o(1)$
		$\displaystyle\geqslant\frac{\sum_{i\in[2^{I}]}m(N_{i})-2^{I}(L_{\rm min}-r)}{n}-o(1)$
		$\displaystyle=R(\mathcal{C}_{\rm Trace})-o(1)=\frac{1-1/a}{1-\gamma}-o(1).$

∎

IV-D $(L_{\rm min},0,e)$ -Reconstruction Codes

In this subsection, we consider the case of $L_{\rm over}=0$ .

Construction D.

Suppose that $L_{\rm min}=\lceil a\log n\rceil$ , $L_{\rm over}=0$ and $L_{\rm min}\mid n$ . As before, we denote $n_{L}\triangleq\frac{n}{L_{\rm min}}$ and $K\triangleq\left\lceil\sqrt{\log n}\right\rceil$ . However, this time, we let $I\triangleq\lceil\log n_{L}\rceil$ and $r_{I}\triangleq\lceil(3d+8)\log I\rceil$ where $d=2e+1$ and $\ell=d\lceil\log d\rceil+2d$ . Then according to Theorem 14, there is a collection of $(3\lceil\frac{3}{2}\log(I+r_{I})\rceil+\ell,d)$ -WWL sequences $\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}}$ such that the concatenation $\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ is an $(I+r_{I},d)$ -SD sequence.

Denote $m^{\prime}\triangleq L_{\rm min}-(I+r_{I}+K+\ell)$ . Let $\mathcal{E}_{\mathtt{WWL}}$ be the encoder in [14, Algorithm 2] which can encode sequences of $\Sigma^{m^{\prime}-d}$ into $(\lceil K/4\rceil,d)$ -WWL sequences of $\Sigma^{m^{\prime}}$ . (Note that $m^{\prime}=\Theta(\log n)$ and $K=\left\lceil\sqrt{\log n}\right\rceil$ . Hence, $K/4\gg\mathcal{F}(m^{\prime},d)=\log m^{\prime}+(d-1)\log\log m^{\prime}+O(1)$ . Then according to Lemma 19 in [14], the encoder $\mathcal{E}_{\mathtt{WWL}}$ does work.) For a message $\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{n_{L}-1}$ where $\mathbf{m}_{i}\in\Sigma^{m^{\prime}-d}$ for $i\in[n_{L}]$ , let $\mathbf{w}_{i}\triangleq\mathcal{E}_{\mathtt{WWL}}(\mathbf{m}_{i})$ for all $i\in[n_{L}]$ .

Denote $\mathbf{p}\triangleq 0^{K}\circ\mathbf{u}$ where $\mathbf{u}$ is a $d$ -auto-cyclic sequence of length $\ell$ . Let

\mathbf{w}=\mathbf{p}\circ\mathbf{c}_{0}\circ\mathbf{w}_{0}\circ\mathbf{p}\circ\mathbf{c}_{1}\circ\mathbf{w}_{1}\circ\cdots\circ\mathbf{p}\circ\mathbf{c}_{n_{L}-1}\circ\mathbf{w}_{n_{L}-1}.

Output $\mathbf{w}$ as the codeword which encodes the message $\mathbf{m}$ . The image under this mapping is the code that we construct. ∎
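The layout of a codeword in Construction D, namely $n_{L}$ blocks of the form $\mathbf{p}\circ\mathbf{c}_{i}\circ\mathbf{w}_{i}$ of total length $L_{\rm min}$ each, can be illustrated as follows. This is a sketch with toy values: the marker, the coded indices and the payloads are hypothetical placeholders and are not produced by the encoders of Theorem 14 or [14, Algorithm 2].

```python
from typing import List

def payload_length(L_min: int, I: int, r_I: int, K: int, ell: int) -> int:
    """m' = L_min - (I + r_I + K + ell): room left for the WWL payload w_i."""
    return L_min - (I + r_I + K + ell)

def assemble_codeword(p: str, coded_indices: List[str], payloads: List[str]) -> str:
    """Concatenate p ∘ c_i ∘ w_i over all payloads; only the first n_L
    coded indices are used."""
    assert len(coded_indices) >= len(payloads)
    return "".join(p + c + w for c, w in zip(coded_indices, payloads))

# Toy sizes: K = 3, ell = 4 (so |p| = 7), I + r_I = 4, L_min = 16, n_L = 2.
p = "0001011"
coded_indices = ["0101", "0110", "1001"]
payloads = ["11011", "10111"]

m_prime = payload_length(L_min=16, I=3, r_I=1, K=3, ell=4)
w = assemble_codeword(p, coded_indices, payloads)
```

Because every block has length exactly $L_{\rm min}$, a length-$L_{\rm min}$ read whose marker offset and index $i$ are known can be placed in $\mathbf{w}$ without any overlap information, which is what the case $L_{\rm over}=0$ requires.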

Theorem 28.

The code obtained in Construction D is an $(L_{\rm min},0,e)$ -trace reconstruction code of $\Sigma^{n}$ with rate

1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.

Proof:

The code has rate

\frac{n_{L}(m^{\prime}-d)}{n}=\frac{m^{\prime}-d}{L_{\rm min}}=\frac{L_{\rm min}-(I+r_{I}+K+\ell+d)}{L_{\rm min}}=1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.

Now, let $\mathbf{y}$ be a length- $L_{\rm min}$ substring of some codeword $\mathbf{w}$ . Then $\mathbf{y}$ must contain either a copy of $\mathbf{p}\circ\mathbf{c}_{i}$ or a suffix of $\mathbf{p}\circ\mathbf{c}_{i}$ together with a prefix of $\mathbf{p}\circ\mathbf{c}_{i+1}$ . Since $\mathbf{w}_{i}$ ’s and $\mathbf{c}_{j}$ ’s are WWL sequences, even if $\mathbf{y}$ suffers from $e$ errors, we can still locate the marker $\mathbf{p}$ in $\mathbf{y}$ . Then we can run the locating algorithm of the robust positioning sequence $\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ to determine the index $i$ or $i+1$ , and hence the location of $\mathbf{y}$ . ∎

For the case of $L_{\rm min}\nmid n$ , let $n_{L}=\lceil n/L_{\rm min}\rceil$ . We first construct an $(L_{\rm min},0,e)$ -trace reconstruction code of $\Sigma^{n_{L}L_{\rm min}}$ , where the length- $L_{\rm min}$ suffix of every codeword is fixed. Then we truncate it to be of length $n$ . In this way, we get a code of rate

\frac{\lfloor n/L_{\rm min}\rfloor(L_{\rm min}-(I+r_{I}+K+\ell+d))}{n}\geqslant\left\lparen 1-\frac{L_{\rm min}-1}{n}\right\rparen\left\lparen 1-\frac{I+r_{I}+K+\ell+d}{L_{\rm min}}\right\rparen=1-\frac{1}{a}-O\left\lparen\frac{1}{\log n}\right\rparen.

For $(L_{\rm min},0,e,\tau)$ -erroneous trace reconstruction, we proceed similarly as in [2, Construction B]. We first use an $(n_{L},2^{(m^{\prime}-d)(n_{L}-r)},2\tau+1)_{2^{m^{\prime}-d}}$ code to encode a message $\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{n_{L}-r-1}\in\Sigma^{(m^{\prime}-d)(n_{L}-r)}$ to a sequence $\bar{\mathbf{m}}=\bar{\mathbf{m}}_{0}\circ\bar{\mathbf{m}}_{1}\circ\cdots\circ\bar{\mathbf{m}}_{n_{L}-1}\in\Sigma^{(m^{\prime}-d)n_{L}}$ . Then we use the encoder outlined in Construction D to get a codeword $\mathbf{w}$ . We note that Construction B in [2] only concerns errors before sequencing, while our construction incorporates errors both before and after sequencing.

V Multi-Strand Reconstruction

In this section, instead of reconstructing a single sequence, we consider the problem of reconstructing a multiset of $k$ sequences of length $n$ from the union of their traces. The following construction of multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes is adapted from [25, Construction C].

Construction E.

Let $N\triangleq k(n-L_{\rm over})+L_{\rm over}$ . We take an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{C}$ of $\Sigma^{N}$ . For each codeword $\mathbf{x}\in\mathcal{C}$ , let

\mathcal{S}(\mathbf{x})\triangleq\left\{\mathbf{x}_{0+[n]},\mathbf{x}_{n-L_{\rm over}+[n]},\mathbf{x}_{2(n-L_{\rm over})+[n]},\ldots,\mathbf{x}_{(k-1)(n-L_{\rm over})+[n]}\right\}\in\mathcal{X}_{n,k}.

The code we construct is $\mathcal{D}$ , defined as,

\mathcal{D}\triangleq\left\{\mathcal{S}(\mathbf{x})~{}:~{}\mathbf{x}\in\mathcal{C}\right\}\subseteq\mathcal{X}_{n,k}.

∎
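The windowing in Construction E can be sketched as follows. This is a minimal illustration only: the function name `split_into_strands` and the toy parameters are hypothetical, and the input `x` is an arbitrary string rather than a codeword of an actual trace reconstruction code. The construction outputs an unordered collection, whereas the sketch returns an ordered list for readability.

```python
def split_into_strands(x, n, k, l_over):
    """Return the k length-n windows of x that start at multiples of n - l_over."""
    step = n - l_over
    expected_len = k * step + l_over  # this is N in Construction E
    if len(x) != expected_len:
        raise ValueError(f"expected length {expected_len}, got {len(x)}")
    return [x[i * step : i * step + n] for i in range(k)]


# Toy parameters chosen only for illustration (not from the paper).
n, k, l_over = 6, 3, 2
x = "abcdefghijklmn"  # length N = 3 * (6 - 2) + 2 = 14
strands = split_into_strands(x, n, k, l_over)

# Consecutive strands share exactly l_over symbols.
overlaps_ok = all(
    strands[i][-l_over:] == strands[i + 1][:l_over] for i in range(k - 1)
)
```

Each strand has length $n$, and consecutive strands overlap in exactly $L_{\rm over}$ symbols, which is why a trace of $\mathcal{S}(\mathbf{x})$ can be viewed as a trace of $\mathbf{x}$.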

Lemma 29.

Let $L_{\rm min}>L_{\rm over}$ . Then the code $\mathcal{D}$ from Construction E is a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of $\mathcal{X}_{n,k}$ .

Proof:

It is easy to see that an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace $\mathcal{Y}$ of $\mathcal{S}(\mathbf{x})$ is also an $(L_{\rm min},L_{\rm over},e)$ -erroneous trace of $\mathbf{x}$ . Since $\mathcal{C}$ is a trace reconstruction code, then for each $\mathbf{y}\in\mathcal{Y}$ , we can determine its location in $\mathbf{x}$ . Hence, we can determine the index $i$ such that $\mathbf{y}\in\mathcal{Y}_{i}$ and determine the location of $\mathbf{y}$ in $\mathbf{x}_{i}$ . ∎
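The last step of the proof, passing from a location in $\mathbf{x}$ to a strand index and an offset, can be made explicit. The sketch below assumes that a read of length greater than $L_{\rm over}$ starts at position `t` of $\mathbf{x}$ and lies inside a single strand; the function name `locate_in_strand` is hypothetical. Because the read is longer than the overlap, it cannot fit inside the region shared by two strands, so the strand index is simply `t // (n - l_over)`.

```python
def locate_in_strand(t, length, n, k, l_over):
    """Map a read starting at position t of x to (strand index, offset in strand)."""
    step = n - l_over
    if length <= l_over:
        raise ValueError("read must be longer than the overlap")
    i = t // step
    offset = t - i * step
    if i >= k or offset + length > n:
        raise ValueError("read does not lie inside a single strand")
    return i, offset


# Exhaustive check on toy parameters (illustrative values only).
n, k, l_over = 6, 3, 2
step = n - l_over
all_recovered = all(
    locate_in_strand(i * step + offset, length, n, k, l_over) == (i, offset)
    for i in range(k)
    for offset in range(n)
    for length in range(l_over + 1, n - offset + 1)
)
```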

Lemma 30 ([25, Lemma 16]).

$\log\lvert\mathcal{X}_{n,k}\rvert=k(n-\log(k/\mathsf{e}))+o(k)$ , where $\mathsf{e}$ denotes $\exp(1)$ , to avoid confusion with $e$ , which denotes the number of errors.
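A quick numerical sanity check of the estimate in Lemma 30 is sketched below. It assumes base-2 logarithms and counts $\mathcal{X}_{n,k}$ in two ways, as $k$-subsets and as $k$-multisets of $\Sigma^{n}$ with binary $\Sigma$; both counting conventions are assumptions made here for illustration, and the helper names are hypothetical. Under either convention, the per-strand gap to $k(n-\log(k/\mathsf{e}))$ is small for the toy values used.

```python
import math


def lemma30_main_term(n, k):
    """k * (n - log2(k / e)), the main term in Lemma 30."""
    return k * (n - math.log2(k / math.e))


def log2_count_sets(n, k):
    """log2 of the number of k-subsets of the 2^n binary sequences (assumed convention)."""
    return math.log2(math.comb(2 ** n, k))


def log2_count_multisets(n, k):
    """log2 of the number of k-multisets of the 2^n binary sequences (assumed convention)."""
    return math.log2(math.comb(2 ** n + k - 1, k))


# Toy values chosen only for illustration.
n, k = 20, 1000
gap_sets = (log2_count_sets(n, k) - lemma30_main_term(n, k)) / k
gap_multisets = (log2_count_multisets(n, k) - lemma30_main_term(n, k)) / k
```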

Theorem 31.

Suppose that $\limsup_{n\to\infty}\log k/n<1$ , $L_{\rm over}=\lceil\log(nk)\rceil+(24e+13)\lceil\log\lceil\log(nk)\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$ and $L_{\rm min}>L_{\rm over}$ . For sufficiently large $n$ , there is a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code of $\mathcal{X}_{n,k}$ whose rate is $1-o(1)$ .

Proof:

Let $N=k(n-L_{\rm over})+L_{\rm over}$ . Since $N\leqslant nk$ , we have $L_{\rm over}\geqslant\lceil\log N\rceil+(24e+13)\lceil\log\lceil\log N\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$ . According to Corollary 19, there is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{C}$ of $\Sigma^{N}$ whose rate is $1-o(1)$ . Applying Construction E with this code, we obtain a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ with $\lvert\mathcal{D}\rvert=\lvert\mathcal{C}\rvert$ . Note that

	$\displaystyle\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}$	$\displaystyle=\frac{k(n-L_{\rm over})+L_{\rm over}}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{n-L_{\rm over}+L_{\rm over}/k}{n-\log k+O(1)}$
		$\displaystyle=1-\frac{L_{\rm over}-\log k-L_{\rm over}/k+O(1)}{n-\log k+O(1)}=1-O\left\lparen\frac{\log n}{n}\right\rparen.$

Hence, the code rate is

\displaystyle R(\mathcal{D})=\frac{\log\lvert\mathcal{D}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{N}\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}=(1-o(1))\left\lparen 1-O\left\lparen\frac{\log n}{n}\right\rparen\right\rparen=1-o(1).

∎

Now, we consider the case of $L_{\rm over}\leqslant\log(nk)$ . Assume that $L_{\rm min}=a\log(nk)$ and $L_{\rm over}=\gamma L_{\rm min}$ where $a>1$ and $0\leqslant a\gamma\leqslant 1$ . Let

L^{*}\triangleq(n-L_{\rm over})\bmod(L_{\rm min}-L_{\rm over}).

We first present some upper bounds on the rate of multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction codes.

Lemma 32 ([25, In the proof of Lemma 8]).

For all $v\geqslant u\geqslant 0$ , $\log\binom{u+v}{u}<u(2\log\mathsf{e}+\log v-\log u).$
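The inequality in Lemma 32 is easy to check numerically on a small grid. The sketch below assumes base-2 logarithms and restricts to $u\geqslant 1$ so that $\log u$ is defined; the function name `lemma32_gap` and the grid bounds are arbitrary illustrative choices. It is a spot check, not a proof.

```python
import math


def lemma32_gap(u, v):
    """Right-hand side minus left-hand side of Lemma 32 (positive when the bound holds)."""
    lhs = math.log2(math.comb(u + v, u))
    rhs = u * (2 * math.log2(math.e) + math.log2(v) - math.log2(u))
    return rhs - lhs


# Smallest slack over a small grid with 1 <= u <= v (illustrative range only).
smallest_gap = min(
    lemma32_gap(u, v) for u in range(1, 41) for v in range(u, 201)
)
```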

Lemma 33.

Suppose that $L_{\rm min}=\lceil a\log(nk)\rceil$ and $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$ where $a>1$ and $0\leqslant a\gamma\leqslant 1$ . Let $\mathcal{C}$ be a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code of $\mathcal{X}_{n,k}$ . Then it holds that

\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-\gamma\frac{L_{\rm min}}{n}\right\rparen+\frac{1/a-\gamma}{1-\gamma}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.

In particular, if $\log k=o(n)$ , then the code rate satisfies

R(\mathcal{C})\leqslant\frac{1-1/a}{1-\gamma}+o(1),

and if $\log k=\kappa n$ where $0<\kappa<1$ is a real constant, then the code rate satisfies

R(\mathcal{C})\leqslant\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.

Proof:

For a sequence $\mathbf{x}\in\Sigma^{n}$ , let

\hat{\mathcal{Y}}(\mathbf{x})\triangleq\left\{\mathbf{x}_{i(L_{\rm min}-L_{\rm over})+[L_{\rm min}]}~{}:~{}i\in\left[\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}-1\right]\right\}\cup\{\mathbf{x}[n-L_{\rm min}-L^{*},n-1]\}.

For a codeword $\mathcal{S}=\{\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{k-1}\}\in\mathcal{C}$ , let $\hat{\mathcal{Y}}(\mathcal{S})\triangleq\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i}).$ Then $\hat{\mathcal{Y}}(\mathcal{S})$ is an $(L_{\rm min},L_{\rm over})$ -trace of $\mathcal{S}$ .
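The windows making up $\hat{\mathcal{Y}}(\mathbf{x})$ can be enumerated directly. The sketch below is an illustration under two assumptions: $\mathbf{x}[s,t]$ denotes the substring on the closed index interval from $s$ to $t$, and $n\geqslant L_{\rm min}$. The function name `hat_Y` and the toy values are hypothetical.

```python
import string


def hat_Y(x, l_min, l_over):
    """Windows used in the proof of Lemma 33 for one strand x, plus the value of L*."""
    n = len(x)
    if n < l_min or l_min <= l_over:
        raise ValueError("need n >= l_min > l_over")
    step = l_min - l_over
    l_star = (n - l_over) % step
    m = (n - l_over - l_star) // step
    # m - 1 windows of length l_min, shifted by l_min - l_over each time.
    reads = [x[i * step : i * step + l_min] for i in range(m - 1)]
    # One final window of length l_min + l_star covering the end of x.
    reads.append(x[n - l_min - l_star :])
    return reads, l_star


# Toy values chosen only for illustration.
x = string.ascii_lowercase[:23]
reads, l_star = hat_Y(x, l_min=7, l_over=2)

# Gluing the reads along their overlaps of length l_over recovers x.
glued = reads[0] + "".join(r[2:] for r in reads[1:])
```

With these toy values there are three windows of length $L_{\rm min}=7$ and one final window of length $L_{\rm min}+L^{*}=8$, matching the count used in the proof.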

Since $\mathcal{C}$ is an $(L_{\rm min},L_{\rm over})$ -trace reconstruction code, necessarily $\hat{\mathcal{Y}}(\mathcal{S})\neq\hat{\mathcal{Y}}(\mathcal{S}^{\prime})$ for any two different codewords $\mathcal{S}$ and $\mathcal{S}^{\prime}$ . It follows that

\lvert\mathcal{C}\rvert\leqslant\left\lvert\left\{\hat{\mathcal{Y}}(\mathcal{S})~{}:~{}\mathcal{S}\in\mathcal{C}\right\}\right\rvert.

Note that $\hat{\mathcal{Y}}(\mathcal{S})$ is a multiset consisting of $k\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}}$ sequences of $\Sigma^{L_{\rm min}}$ and $k$ sequences of $\Sigma^{L_{\rm min}+L^{*}}$ . Hence,

\lvert\mathcal{C}\rvert\leqslant\binom{k\left\lparen\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}}\right\rparen+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\cdot\binom{k+2^{L_{\rm min}+L^{*}}-1}{2^{L_{\rm min}+L^{*}}-1}.

(7)

We denote the first binomial coefficient in (7) as $A$ and the second one as $B$ . Since $2^{L_{\rm min}}\geqslant(nk)^{a}>\frac{k(n-L_{\rm min})}{L_{\rm min}-L_{\rm over}}$ and $2^{L_{\rm min}+L^{*}}>k$ , according to Lemma 32, we have that

$\displaystyle\frac{\log A}{nk}$	$\displaystyle<\frac{k}{nk}\left\lparen\frac{n-L_{\rm min}-L^{*}}{L_{\rm min}-L_{\rm over}}\right\rparen\left\lparen 2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm min}-L^{*}))+\log(L_{\rm min}-L_{\rm over})\right\rparen$
	$\displaystyle=\frac{1-(L_{\rm min}+L^{*})/n}{L_{\rm min}-L_{\rm over}}(L_{\rm min}-\log(nk)+O(\log\log(nk)))$
	$\displaystyle=\left\lparen 1-\frac{L_{\rm min}+L^{*}}{n}\right\rparen\frac{L_{\rm min}-\log(nk)}{L_{\rm min}-L_{\rm over}}+O\left\lparen\frac{\log\log(nk)}{\log(nk)}\right\rparen$
	$\displaystyle=\frac{1-1/a}{1-\gamma}\left\lparen 1-\frac{L_{\rm min}+L^{*}}{n}\right\rparen+O\left\lparen\frac{\log\log(nk)}{\log(nk)}\right\rparen,$	(8)

and

\frac{\log B}{nk}<\frac{1}{n}\left\lparen 2\log\mathsf{e}+L_{\rm min}+L^{*}-\log k\right\rparen=\frac{(1-1/a)L_{\rm min}}{n}+\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.

(9)

Combining (7), (8) and (9), we have that

\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-\gamma\frac{L_{\rm min}}{n}\right\rparen+\frac{1/a-\gamma}{1-\gamma}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.

(10)
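For readers who want the algebra behind (10), the main terms of (8) and (9) combine as sketched below. The error terms are omitted here and are handled as in the text; the grouping of terms is our own presentation of the step.

```latex
\begin{align*}
&\frac{1-1/a}{1-\gamma}\left(1-\frac{L_{\rm min}+L^{*}}{n}\right)
  +\frac{(1-1/a)L_{\rm min}}{n}+\frac{L^{*}}{n}\\
&\quad=\frac{1-1/a}{1-\gamma}\left(1-\frac{L_{\rm min}}{n}+(1-\gamma)\frac{L_{\rm min}}{n}\right)
  +\left(1-\frac{1-1/a}{1-\gamma}\right)\frac{L^{*}}{n}\\
&\quad=\frac{1-1/a}{1-\gamma}\left(1-\gamma\frac{L_{\rm min}}{n}\right)
  +\frac{1/a-\gamma}{1-\gamma}\cdot\frac{L^{*}}{n}.
\end{align*}
```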

If $\log k=o(n)$ , then $L_{\rm min}/n=a\log(nk)/n=o(1)$ and $L^{*}/n<L_{\rm min}/n=o(1)$ . It follows that

\frac{\log\lvert\mathcal{C}\rvert}{nk}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-o(1)\right\rparen+o(1)=\frac{1-1/a}{1-\gamma}+o(1).

Recall that $\log\lvert\mathcal{X}_{n,k}\rvert=k(n-\log(k/\mathsf{e}))+o(k)$ . Hence, the code rate

\displaystyle R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{k(n-\log(k/\mathsf{e}))+o(k)}\leqslant\left\lparen\frac{1-1/a}{1-\gamma}+o(1)\right\rparen\frac{1}{1-o(1)}=\frac{1-1/a}{1-\gamma}+o(1).

If $\log k=\kappa n$ where $0<\kappa<1$ is a real constant, then

\displaystyle\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{nk}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{1}{1-\kappa+O(1/n)}=\frac{1}{1-\kappa}-O\left\lparen\frac{1}{n}\right\rparen.

Therefore, it follows from (10) that the code rate satisfies

	$\displaystyle R(\mathcal{C})$	$\displaystyle=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}$
		$\displaystyle\leqslant\left\lparen\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen\left\lparen 1-a\gamma\kappa\right\rparen+\frac{1/a-\gamma}{1-\gamma}\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen\right\rparen\left\lparen\frac{1}{1-\kappa}-O\left\lparen\frac{1}{n}\right\rparen\right\rparen$
		$\displaystyle=\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen+\frac{1/a-\gamma}{(1-\gamma)(1-\kappa)}\cdot\frac{L^{*}}{n}+O\left\lparen\frac{\log n}{n}\right\rparen.$

∎

Corollary 34.

Suppose that $\log k=o(n)$ . Let $\mathcal{C}$ be a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code of $\mathcal{X}_{n,k}$ . If $L_{\rm min}\leqslant\log(nk)+o(\log(nk))$ , then $R(\mathcal{C})=o(1)$ .

Proof:

Since $\mathcal{C}$ is also a multi-strand $(\lceil a\log(nk)\rceil,0)$ -trace reconstruction code for any $a>1$ , it follows from Lemma 33 that $R(\mathcal{C})\leqslant 1-1/a+o(1)$ for all $a>1$ . Hence, $R(\mathcal{C})=o(1)$ . ∎

Lemma 35.

Suppose that $\log k\leqslant\kappa n$ where $\kappa<1$ is a constant. Let $\mathcal{C}$ be a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code of $\mathcal{X}_{n,k}$ . If $L_{\rm min}=\lceil a\log(nk)\rceil$ for some $a<1$ , then $R(\mathcal{C})=o(1)$ .

Proof:

The proof is similar to that of Lemma 33. In this case, we denote

\hat{\mathcal{Y}}(\mathbf{x})\triangleq\{\mathbf{x}_{0+[L_{\rm min}]},\mathbf{x}_{1+[L_{\rm min}]},\ldots,\mathbf{x}_{n-L_{\rm min}+[L_{\rm min}]}\}.

Then each $\hat{\mathcal{Y}}(\mathcal{S})=\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i})$ is still an $(L_{\rm min},L_{\rm over})$ -trace, and it consists of $k(n-L_{\rm min}+1)$ sequences of $\Sigma^{L_{\rm min}}$ , and so,

\lvert\mathcal{C}\rvert\leqslant\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}.

We observe that $k(n-L_{\rm min}+1)\geqslant k(n-a\log n-a\log k)\geqslant k\left\lparen(1-a\kappa)n-a\log n\right\rparen\geqslant cnk$ for some constant $c$ and $2^{L_{\rm min}}\leqslant 2(nk)^{a}$ . Since $a<1$ , when $n$ is sufficiently large, we have that $k(n-L_{\rm min}+1)\geqslant 2^{L_{\rm min}}$ . Using the inequality in Lemma 32, we get that

\frac{1}{nk}\log\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}\leqslant\frac{2^{L_{\rm min}}}{nk}\left\lparen 2\log\mathsf{e}+\log(k(n-L_{\rm min}+1))-L_{\rm min}\right\rparen.

(11)

Noting that $kn>k(n-L_{\rm min}+1)\geqslant cnk$ , we have that $\log(k(n-L_{\rm min}+1))=\log(nk)-O(1)$ . Continuing (11),

	$\displaystyle\frac{1}{nk}\log\binom{k(n-L_{\rm min}+1)+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}$	$\displaystyle\leqslant\frac{2^{L_{\rm min}}}{nk}\left\lparen 2\log\mathsf{e}+\log(k(n-L_{\rm min}+1))-L_{\rm min}\right\rparen$
		$\displaystyle\leqslant\frac{2^{L_{\rm min}}}{nk}\left\lparen(1-a)\log(nk)+O(1)\right\rparen$
		$\displaystyle=\frac{(1-a)\log(nk)+O(1)}{(nk)^{1-a}}=o(1).$

Hence,

R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=o(1).

∎

Lemma 36.

Suppose that $k\leqslant 2^{n}$ . Let $\mathcal{C}$ be a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code of $\mathcal{X}_{n,k}$ . If $L_{\rm min}\leqslant\log(nk)+o(\log(nk))$ and $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$ , then $R(\mathcal{C})=o(1)$ .

Proof:

It suffices to consider the case of $L_{\rm min}=\log(nk)+o(\log(nk))$ . In this case, we denote

\hat{\mathcal{Y}}(\mathbf{x})\triangleq\left\{\mathbf{x}_{i(L_{\rm min}-L_{\rm over})+[L_{\rm min}]}~{}:~{}i\in\left[\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}\right]\right\}\cup\{\mathbf{x}[n-L_{\rm min},n-1]\}.

Since $L_{\rm min}-L^{*}\geqslant L_{\rm over}$ , each $\hat{\mathcal{Y}}(\mathcal{S})=\bigcup_{i=0}^{k-1}\hat{\mathcal{Y}}(\mathbf{x}_{i})$ is still an $(L_{\rm min},L_{\rm over})$ -trace, and it consists of $k\left\lparen\frac{n-L_{\rm over}-L^{*}}{L_{\rm min}-L_{\rm over}}+1\right\rparen$ sequences of $\Sigma^{L_{\rm min}}$ . Hence, we have that

\lvert\mathcal{C}\rvert\leqslant\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}.

Since $L_{\rm min}=\log(nk)+o(\log(nk))$ , we have $\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}<2^{L_{\rm min}}$ . Using the inequality in Lemma 32, we get that

	$\displaystyle\frac{1}{nk}\log\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}$
$\displaystyle\leqslant$	$\displaystyle\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left\lparen 2\log\mathsf{e}+L_{\rm min}-\log(k(n+L_{\rm min}-2L_{\rm over}-L^{*}))+\log(L_{\rm min}-L_{\rm over})\right\rparen$
$\displaystyle\leqslant$	$\displaystyle\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left\lparen 2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm over}))+\log(L_{\rm min}-L_{\rm over})\right\rparen.$	(12)

Since $L_{\rm min}\leqslant\log(nk)+o(\log(nk))$ and $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$ , we have that $L_{\rm over}\leqslant c_{1}\log(nk)\leqslant c_{2}n$ for some constants $c_{1},c_{2}<1$ . It follows that $\log(k(n-L_{\rm over}))=\log(nk)-O(1)$ . Continuing (12), we have that

		$\displaystyle\frac{1}{nk}\log\binom{\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{L_{\rm min}-L_{\rm over}}+2^{L_{\rm min}}-1}{2^{L_{\rm min}}-1}$
	$\displaystyle\leqslant$	$\displaystyle\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left\lparen 2\log\mathsf{e}+L_{\rm min}-\log(k(n-L_{\rm over}))+\log(L_{\rm min}-L_{\rm over})\right\rparen$
	$\displaystyle\leqslant$	$\displaystyle\frac{k(n+L_{\rm min}-2L_{\rm over}-L^{*})}{(L_{\rm min}-L_{\rm over})nk}\left\lparen 2\log\mathsf{e}+L_{\rm min}-\log(nk)+O(\log\log(nk))\right\rparen$
	$\displaystyle=$	$\displaystyle\left\lparen 1+\frac{L_{\rm min}-2L_{\rm over}-L^{*}}{n}\right\rparen\frac{o(\log(nk))}{L_{\rm min}-L_{\rm over}}=o(1),$

where the last equality holds since $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$ . Hence,

R(\mathcal{C})=\frac{\log\lvert\mathcal{C}\rvert}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{\log\lvert\mathcal{C}\rvert}{nk}\cdot\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=o(1).

∎

Remark.

We note that the condition $L_{\rm min}-L_{\rm over}=\Theta(\log(nk))$ in Lemma 36 cannot be removed. A counterexample is given by the $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes of rate $1-o(1)$ in Theorem 31, where $L_{\rm over}=\lceil\log(nk)\rceil+(24e+13)\lceil\log\lceil\log(nk)\rceil\rceil+(4e+1)\lceil\log(4e+1)\rceil+20e+5$ and $L_{\rm min}\geqslant L_{\rm over}+1$ .

Note that a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code is also a multi-strand $(L_{\rm min},L_{\rm over})$ -trace reconstruction code. Hence, the upper bounds in Lemmas 33–36 also hold for multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction codes.

In the following, we study the lower bounds.

Theorem 37.

Let $L_{\rm min}=\lceil a\log(nk)\rceil$ and $L_{\rm over}=\lceil\gamma L_{\rm min}\rceil$ , where $a>1$ and $0\leqslant a\gamma\leqslant 1$ . For all sufficiently large $n$ ,

1.

if $\log k=o(n)$ , then there is a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ of rate

$R(\mathcal{D})=\frac{1-1/a}{1-\gamma}-o(1);$
2.

if $\log k=\kappa n$ where $0<\kappa<1$ is a real constant, then there is a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ of rate

$R(\mathcal{D})=\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen-o(1).$

Proof:

Let $N=k(n-L_{\rm over})+L_{\rm over}$ . Then $L_{\rm min}\geqslant\lceil a\log N\rceil$ . According to Theorem 25, there is an $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{C}$ of $\Sigma^{N}$ whose rate is $\frac{1-1/a}{1-\gamma}-o(1)$ . Applying Construction E with this code, we obtain a multi-strand $(L_{\rm min},L_{\rm over},e)$ -trace reconstruction code $\mathcal{D}$ of $\mathcal{X}_{n,k}$ with $\lvert\mathcal{D}\rvert=\lvert\mathcal{C}\rvert$ . Note that

	$\displaystyle\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}$	$\displaystyle=\frac{k(n-L_{\rm over})+L_{\rm over}}{k(n-\log(k/\mathsf{e}))+o(k)}=\frac{n-L_{\rm over}+L_{\rm over}/k}{n-\log k+O(1)}$
		$\displaystyle=1-\frac{L_{\rm over}-\log k-L_{\rm over}/k+O(1)}{n-\log k+O(1)}=1-\frac{(a\gamma-1)\log k+O(\log n)}{n-\log k+O(1)}.$

If $\log k=o(n)$ , then ${N}/{\log\lvert\mathcal{X}_{n,k}\rvert}=1-o(1)$ , and so, we have that

\displaystyle R(\mathcal{D})=\left\lparen\frac{1-1/a}{1-\gamma}-o(1)\right\rparen(1-o(1))=\frac{1-1/a}{1-\gamma}-o(1).

If $\log k=\kappa n$ , then

\frac{N}{\log\lvert\mathcal{X}_{n,k}\rvert}=1-\frac{(a\gamma-1)\kappa}{1-\kappa}-o(1)=\frac{1-a\gamma\kappa}{1-\kappa}-o(1),

and so, we have that

\displaystyle R(\mathcal{D})=\left\lparen\frac{1-1/a}{1-\gamma}-o(1)\right\rparen\left\lparen\frac{1-a\gamma\kappa}{1-\kappa}-o(1)\right\rparen=\frac{1-a\gamma\kappa}{1-\kappa}\left\lparen\frac{1-1/a}{1-\gamma}\right\rparen-o(1).

∎

When $\log k=o(n)$ or when $\log k=\kappa n$ and $L^{*}=o(n)$ , the lower bounds in Theorem 37 asymptotically achieve the upper bound in Lemma 33.
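To get a feel for the rate expression shared by Theorem 37 and Lemma 33, the main term can be evaluated for concrete parameters. The sketch below is illustrative only: the function name `rate_main_term` is hypothetical, `kappa=0` stands in for the regime $\log k=o(n)$, and the check `a * kappa <= 1` is an assumption added here to reflect $L_{\rm min}\leqslant n$. All $o(1)$ and $L^{*}/n$ terms are ignored.

```python
def rate_main_term(a, gamma, kappa=0.0):
    """Main term (1 - a*gamma*kappa)/(1 - kappa) * (1 - 1/a)/(1 - gamma)."""
    if not (a > 1 and 0 <= a * gamma <= 1 and 0 <= kappa < 1):
        raise ValueError("parameters outside the regime of Theorem 37")
    if a * kappa > 1:
        # Added assumption: reads of length about a*log(nk) must fit in a strand.
        raise ValueError("a * kappa > 1 would force L_min > n")
    return (1 - a * gamma * kappa) / (1 - kappa) * (1 - 1 / a) / (1 - gamma)


# Toy parameter values chosen only for illustration.
rate_few_strands = rate_main_term(a=2, gamma=0.25)               # log k = o(n)
rate_many_strands = rate_main_term(a=2, gamma=0.25, kappa=0.25)  # log k = kappa * n
```

For these values the main term is $2/3$ when $\log k=o(n)$ and $7/9$ when $\kappa=1/4$.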

Next, we show that when $\log k=\kappa n$ and $L_{\rm over}=0$ , if $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)=(a-1-\epsilon)\log(nk)$ for a positive $\epsilon$ which is independent of $n$ , then the upper bound in Lemma 33 can still be achieved.

Construction F.

Suppose that $L_{\rm min}=\lceil a\log(nk)\rceil$ and $L_{\rm over}=0$ . Denote $\bar{n}\triangleq\frac{n-L^{*}}{L_{\rm min}}$ and $K\triangleq\lceil\sqrt{\log(nk)}\rceil$ . Let $I\triangleq\lceil\log(\bar{n}k)\rceil$ and $r_{I}\triangleq\lceil(3d+8)\log I\rceil$ where $d=2e+1$ . Then according to Theorem 14, there is a collection of $(3\lceil\frac{3}{2}\log(I+r_{I})\rceil+\ell,d)$ -WWL sequences $\mathbf{c}_{0},\mathbf{c}_{1},\ldots,\mathbf{c}_{2^{I}-1}\in\Sigma^{I+r_{I}}$ such that the concatenation $\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ is an $(I+r_{I},d)$ -SD sequence.

Denote $n^{\prime}\triangleq\bar{n}(L_{\rm min}-(I+r_{I}+K+\ell))+L^{*}$ . Let $\mathcal{E}_{\mathtt{WWL}}$ be the encoder in [14, Algorithm 2], which encodes sequences of $\Sigma^{n^{\prime}-d}$ into $(\lceil K/4\rceil,d)$ -WWL sequences of $\Sigma^{n^{\prime}}$ . (Note that $n^{\prime}=\Theta(n)$ and $K=\lceil\sqrt{\log(nk)}\rceil=\Theta(\sqrt{n})$ ; hence, $K/4\gg\mathcal{F}(n^{\prime},d)=\log n^{\prime}+(d-1)\log\log n^{\prime}+O(1)$ , and according to Lemma 19 in [14], the encoder $\mathcal{E}_{\mathtt{WWL}}$ does work.) For a message $\mathbf{m}=\mathbf{m}_{0}\circ\mathbf{m}_{1}\circ\cdots\circ\mathbf{m}_{k-1}$ where $\mathbf{m}_{i}\in\Sigma^{n^{\prime}-d}$ for $i\in[k]$ , let $\mathbf{v}_{i}\triangleq\mathcal{E}_{\mathtt{WWL}}(\mathbf{m}_{i})$ for all $i\in[k]$ . We partition each $\mathbf{v}_{i}$ into $\bar{n}+1$ substrings as follows:

\mathbf{v}_{i}=\mathbf{v}_{i,0}\circ\mathbf{v}_{i,1}\circ\cdots\circ\mathbf{v}_{i,\bar{n}-1}\circ\mathbf{v}_{i,\bar{n}}

where $\lvert\mathbf{v}_{i,j}\rvert=L_{\rm min}-(I+r_{I}+K+\ell)$ for $j\in[\bar{n}]$ and $\lvert\mathbf{v}_{i,\bar{n}}\rvert=L^{*}.$

Denote $\mathbf{p}\triangleq 0^{K}\circ\mathbf{u}$ where $\mathbf{u}$ is a $d$ -auto-cyclic sequence of length $\ell$ . For each $i\in[k]$ , let

\mathbf{w}_{i}=\mathbf{v}_{i,0}\circ\mathbf{p}\circ\mathbf{c}_{i\bar{n}}\circ\mathbf{v}_{i,1}\circ\mathbf{p}\circ\mathbf{c}_{i\bar{n}+1}\circ\cdots\circ\mathbf{v}_{i,\bar{n}-1}\circ\mathbf{p}\circ\mathbf{c}_{(i+1)\bar{n}-1}\circ\mathbf{v}_{i,\bar{n}}.

Output $\{\mathbf{w}_{0},\mathbf{w}_{1},\ldots,\mathbf{w}_{k-1}\}$ as the codeword which encodes the message $\{\mathbf{m}_{0},\mathbf{m}_{1},\ldots,\mathbf{m}_{k-1}\}$ . The image of the mapping described here is the constructed code. ∎
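The layout of each strand in Construction F can be sketched as below. This only illustrates how data blocks, the marker and index blocks are interleaved. The marker and index blocks used here are plain placeholders, not an actual $d$-auto-cyclic sequence or WWL/SD sequences, and the data blocks are arbitrary strings rather than outputs of $\mathcal{E}_{\mathtt{WWL}}$. The function name `assemble_strand` and all toy values are hypothetical.

```python
def assemble_strand(i, v_blocks, tail, marker, index_blocks):
    """Interleave the data blocks of strand i with the marker p and index blocks c_j."""
    n_bar = len(v_blocks)
    pieces = []
    for j, v in enumerate(v_blocks):
        # Block j of strand i is followed by p and by c_{i * n_bar + j}.
        pieces.append(v + marker + index_blocks[i * n_bar + j])
    pieces.append(tail)  # trailing data block of length L*
    return "".join(pieces)


# Placeholder components (NOT real auto-cyclic / WWL / SD sequences).
marker = "000" + "1"                                   # stands in for p = 0^K . u
index_blocks = [format(j, "02b") for j in range(4)]    # stands in for c_0, ..., c_3

w0 = assemble_strand(0, ["aaa", "bbb"], "tt", marker, index_blocks)
w1 = assemble_strand(1, ["ccc", "ddd"], "uu", marker, index_blocks)
```

In this toy layout each data block together with its marker and index block has length $9$, playing the role of $L_{\rm min}$, and each strand has length $2\cdot 9+2=20$, playing the role of $n=\bar{n}L_{\rm min}+L^{*}$.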

Lemma 38.

Suppose that $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ for a positive $\epsilon$ which is independent of $n$ . Then the code obtained in Construction F is a multi-strand $(L_{\rm min},0,e)$ -trace reconstruction code of $\mathcal{X}_{n,k}$ .

Proof:

Let $\mathbf{y}$ be a length- $L_{\rm min}$ substring of $\mathbf{w}_{i}$ for some $\mathbf{w}_{i}\in\{\mathbf{w}_{0},\mathbf{w}_{1},\ldots,\mathbf{w}_{k-1}\}$ . Note that $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ and $\lvert\mathbf{p}\circ\mathbf{c}_{j}\rvert=K+\ell+I+r_{I}<(1+\epsilon)\log(nk)$ . Then $\mathbf{y}$ must contain either a copy of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j}$ or a suffix of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j}$ together with a prefix of $\mathbf{p}\circ\mathbf{c}_{i\bar{n}+j+1}$ . Since the $\mathbf{v}_{i}$ ’s and $\mathbf{c}_{j}$ ’s are WWL sequences, even if $\mathbf{y}$ suffers from $e$ errors, we can still locate the marker $\mathbf{p}$ in $\mathbf{y}$ . Then we can run the locating algorithm of the robust positioning sequence $\mathbf{c}_{0}\circ\mathbf{c}_{1}\circ\cdots\circ\mathbf{c}_{2^{I}-1}$ to determine the index $i\bar{n}+j$ or $i\bar{n}+j+1$ , and hence the location of $\mathbf{y}$ . ∎

Theorem 39.

Suppose that $\log k=\kappa n$ , $L_{\rm min}=\lceil a\log(nk)\rceil$ and $L_{\rm over}=0$ , where $0<\kappa<1$ and $a>1$ . If $L^{*}\leqslant L_{\rm min}-(1+\epsilon)\log(nk)$ for a fixed positive $\epsilon$ which is independent of $n$ , then there is a multi-strand $(L_{\rm min},0,e)$ -trace reconstruction code which has code rate

\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\cdot\frac{L^{*}}{n}-o(1).

Proof:

Note that

	$\displaystyle\frac{n^{\prime}-d}{n}$	$\displaystyle=\frac{\bar{n}(L_{\rm min}-(I+r_{I}+K+\ell))+L^{*}-d}{n}=\frac{n-\bar{n}(I+r_{I}+K+\ell)-d}{n}$
		$\displaystyle=1-\frac{1-L^{*}/n}{L_{\rm min}}(I+r_{I}+K+\ell)-O\left\lparen\frac{1}{n}\right\rparen$
		$\displaystyle=1-\left\lparen 1-\frac{L^{*}}{n}\right\rparen\frac{\log(nk)+O(\sqrt{\log(nk)})}{a\log(nk)}-O\left\lparen\frac{1}{n}\right\rparen$
		$\displaystyle=1-\frac{1}{a}+\frac{L^{*}}{an}-O\left\lparen\frac{1}{\sqrt{\log(nk)}}\right\rparen.$

Hence, the code rate is

\frac{(n^{\prime}-d)k}{\log\lvert\mathcal{X}_{n,k}\rvert}=\frac{(n^{\prime}-d)k}{nk}\frac{nk}{\log\lvert\mathcal{X}_{n,k}\rvert}=\left\lparen 1-\frac{1}{a}+\frac{L^{*}}{an}-o(1)\right\rparen\left\lparen\frac{1}{1-\kappa}-o(1)\right\rparen=\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\frac{L^{*}}{n}-o(1).

∎

Finally, we note that the multi-strand $(L_{\rm min},0,e)$ -trace reconstruction code in Construction F only guarantees recovery of the message from reliable $(L_{\rm min},0,e)$ -erroneous traces, which might occur rarely, since $L_{\rm over}=0$ and each symbol is usually included in only a small number of substrings in $\mathcal{Y}$ . Nevertheless, we can use a $(k,2^{(n^{\prime}-d)(k-r_{o})},2\tau+1)_{2^{n^{\prime}-d}}$ code to encode the message, as in Construction C, so that even if there are $\tau$ errors in total in $\mathcal{Y}$ , we can still decode the message. The rate of this trace reconstruction code is

\left\lparen 1-\frac{r_{o}}{k}\right\rparen\left\lparen\frac{1-1/a}{1-\kappa}+\frac{1}{a(1-\kappa)}\cdot\frac{L^{*}}{n}\right\rparen-o(1).

References

[1] J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan, “String reconstruction from substring compositions,” SIAM J. Discrete Math., vol. 29, no. 3, pp. 1340–1371, 2015.
[2] D. Bar-Lev, S. Marcovich, E. Yaakobi, and Y. Yehezkeally, “Adversarial torn-paper codes,” in Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT2022), Espoo, Finland, Jun. 2022, pp. 2934–2939.
[3] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing strings from random traces,” in Proc. the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 2004, pp. 910–918.
[4] R. Berkowitz and S. Kopparty, “Robust positioning patterns,” in Proc. of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, USA, 2016, pp. 1937–1951.
[5] A. M. Bruckstein, T. Etzion, R. Giryes, N. Gordon, R. J. Holt, and D. Shuldiner, “Simple and robust binary self-location patterns,” IEEE Trans. Inform. Theory, vol. 58, no. 7, pp. 4884–4889, 2012.
[6] Y. M. Chee, D. T. Dao, H. M. Kiah, S. Ling, and H. Wei, “Robust positioning patterns with low redundancy,” SIAM J. Comput., vol. 49, no. 2, pp. 284–317, 2020.
[7] D. T. Dao, H. M. Kiah, and H. Wei, “Maximum length of robust positioning sequences,” in Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT2020), Los Angeles, CA, USA, 2020, pp. 108–113.
[8] M. Dudik and L. J. Schulman, “Reconstruction from subsequences,” J. Combin. Theory Ser. A, vol. 103, no. 2, pp. 337–348, 2003.
[9] O. Elishco, R. Gabrys, E. Yaakobi, and M. Médard, “Repeat-free codes,” IEEE Trans. Inform. Theory, vol. 67, no. 9, pp. 5749–5764, 2021.
[10] R. Gabrys and O. Milenkovic, “Unique reconstruction of coded strings from multiset substring spectra,” IEEE Trans. Inform. Theory, vol. 65, no. 12, pp. 7682–7696, 2019.
[11] H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence profiles,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3125–3146, Jun. 2016.
[12] V. I. Levenshtein, “Efficient reconstruction of sequences from their subsequences or supersequences,” J. Combin. Theory Ser. A, vol. 93, no. 2, pp. 310–332, 2001.
[13] V. I. Levenshtein, “Efficient reconstruction of sequences,” IEEE Trans. Inform. Theory, vol. 47, no. 1, pp. 2–22, 2001.
[14] M. Levy and E. Yaakobi, “Mutually uncorrelated codes for DNA storage,” IEEE Trans. Inform. Theory, vol. 65, no. 6, pp. 3671–3691, 2019.
[15] B. Manvel, A. Meyerowitz, A. Schwenk, K. Smith, and P. Stockmeyer, “Reconstruction of sequences,” Discrete Math., vol. 94, no. 3, pp. 209–219, 1991.
[16] S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings spectrum,” IEEE Trans. Inform. Theory, vol. 67, no. 7, pp. 4369–4384, 2021.
[17] S. Nassirpour, I. Shomorony, and A. Vahid, “Reassembly codes for the chop-and-shuffle channel,” Jan. 2022. [Online]. Available: http://arxiv.org/abs/2201.03590
[18] S. Pattabiraman, R. Gabrys, and O. Milenkovic, “Coding for polymer-based data storage,” IEEE Trans. Inform. Theory (Early Access), 2023.
[19] A. N. Ravi, A. Vahid, and I. Shomorony, “Capacity of the torn paper channel with lost pieces,” in Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT2021), Melbourne, Victoria, Australia, Jul. 2021, pp. 1937–1942.
[20] A. D. Scott, “Reconstructing sequences,” Discrete Math., vol. 175, pp. 231–238, 1997.
[21] I. Shomorony and A. Vahid, “Torn-paper coding,” IEEE Trans. Inform. Theory, vol. 67, no. 12, pp. 7904–7913, 2021.
[22] E. Ukkonen, “Approximate string-matching with $q$ -grams and maximal matches,” Theoret. Comp. Sci., vol. 92, no. 1, pp. 191–211, 1992.
[23] C. Wang, J. Sima, and N. Raviv, “Break-resilient codes for forensic 3D fingerprinting,” arXiv preprint arXiv:2310.03897, 2023.
[24] H. Wei, “Nearly optimal robust positioning patterns,” IEEE Trans. Inform. Theory, vol. 68, no. 1, pp. 193–203, 2022.
[25] Y. Yehezkeally, D. Bar-Lev, S. Marcovich, and E. Yaakobi, “Generalized unique reconstruction from substrings,” IEEE Trans. Inform. Theory, vol. 69, no. 9, pp. 5648–5659, Sep. 2023.
[26] Y. Yehezkeally and N. Polyanskii, “On codes for the noisy substring channel,” Sep. 2023. [Online]. Available: http://arxiv.org/abs/2102.01412v3
