Hengjia Wei, Moshe Schwartz
H. Wei is with the Peng Cheng Laboratory, Shenzhen 518055, China. He is also with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China, and the Pazhou Laboratory (Huangpu), Guangzhou 510555, China (e-mail: hjwei05@gmail.com).
M. Schwartz is on a leave of absence from the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel. He is now with the Department of Electrical and Computer Engineering at McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: schwartz.moshe@mcmaster.ca).
G. Ge is with the School of Mathematical Sciences, Capital Normal University, Beijing 100048, China (e-mail: gnge@zju.edu.cn).
This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFA0712100, the National Natural Science Foundation of China under Grant 11971325, Grant 12231014 and Grant 12371523, Beijing Scholars Program, the major key project of Peng Cheng Laboratory under Grant PCL2023AS1-2, and the Zhejiang Lab BioBit Program under Grant 2022YFB507.
Abstract
This paper studies the problem of encoding messages into sequences that can be uniquely recovered from noisy observations of their substrings. The observed reads comprise consecutive substrings with some given minimum overlap. This coded reconstruction problem has applications to DNA storage. We consider both single-strand reconstruction codes and multi-strand reconstruction codes, where the message is encoded into a single strand or a set of multiple strands, respectively. Various parameter regimes are studied. New codes are constructed, some of whose rates asymptotically attain the upper bounds.
I Introduction
Sequence (string) reconstruction refers to a large class of problems of reconstructing a sequence from partial (perhaps noisy) observations of it. Instances of this problem include reconstruction from multiple erroneous copies of the sequence [13, 12, 3], some substrings of the sequence [11, 10], all the length- subsequences [15, 20, 8], and compositions of the sequence’s substrings or prefixes/suffixes [1, 18].
In this paper, we shall consider the problem of encoding messages into sequences which can be uniquely recovered from observations about their substrings. This coding problem is motivated by applications to DNA-based data storage systems, where data are encoded into long DNA sequences. In some DNA sequencing technologies (e.g., shotgun sequencing), a long DNA strand is first replicated multiple times, and these replicas are then fragmented into short substrings so that they can be read. In order to retrieve the data, the original long sequence must be reconstructed from the observations about these short substrings.
This coded reconstruction problem has been studied in different models with different assumptions on the substrings. Gabrys and Milenkovic [10] considered the problem of reconstructing a sequence of length from its -multispectrum, i.e., the multiset of all of its length- substrings. They constructed two classes of reconstruction codes with redundancies and for and , respectively. They also studied the noisy settings in which some substrings/observations may be lost or be corrupted by errors, and constructed codes to combat these effects. Subsequently, Marcovich and Yaakobi [16] followed this noisy setup and provided more code constructions. The constructions in [10, 16] are based on the so-called -substring distant (SD) sequence, a sequence in which every two length- substrings are of Hamming distance at least apart.
When , such sequences are also known as -substring unique sequences or -repeat free sequences. Efficient encoding algorithms can be found in [9] for . For general , Marcovich and Yaakobi [16] proposed an encoding algorithm of -SD sequences for .
Another model is the torn-paper channel, which randomly tears the input sequence into small pieces of different sizes. The output of this channel is a set of non-overlapping substrings of the input sequence, and the message carried by the input sequence should be recovered from these substrings. This problem has been studied in the probabilistic setting in [21, 19, 17].
Recently, Bar-Lev et al. [2] considered this problem in the worst case. They studied both the noiseless setup and the noisy setup, and proposed a couple of index-based constructions to encode messages into sequences each of which can be uniquely recovered from its non-overlapping substrings. Furthermore, motivated by DNA sequencing technologies where multiple strings are sequenced simultaneously, they extended the single-strand reconstruction problem to a multi-strand reconstruction problem. They constructed multi-strand reconstruction codes whose rates asymptotically behave like those of single-strand reconstruction codes. Another related paper is by Wang et al. [23], which, unlike [2], does not restrict the length of the torn substrings, but rather their number. For this setting they construct codes that attain the upper bound on the rate up to asymptotically small factors.
In a recent paper, Yehezkeally et al. [25] proposed a general model, which includes the two models above as extreme cases. In this model, the reconstruction is based on the sequence’s -trace, which is a multiset of substrings where every substring has length at least and the overlap of every two consecutive substrings has length at least . They focused on the noiseless setup, and constructed a class of trace reconstruction codes whose rate can asymptotically achieve the upper bound. They also studied the multi-strand reconstruction problem in the -multispectrum model, and proposed reconstruction codes whose rates are asymptotically .
In this paper, we shall follow the model in [25] and study the coding problem for both single-strand reconstruction and multi-strand reconstruction in the noisy setup. We aim to encode a message into a sequence which can be uniquely recovered from its -erroneous trace, where each substring may suffer from at most substitution errors, or to encode a message into a set of sequences which can be recovered from the union of their -erroneous traces. Our contributions are listed as follows.
1. We first give an algorithm which can encode messages into -SD sequences for where is an arbitrary real constant. The rates of the encoded sequences asymptotically approach . In contrast, the encoding algorithm in [16] requires a single redundancy bit but works only when .
2. For single-strand reconstruction, by using the proposed encoding algorithm for SD sequences, we construct two classes of -trace reconstruction codes whose rates asymptotically achieve the upper bound.
3. For multi-strand reconstruction, we present some upper bounds on the rates of multi-strand -trace reconstruction codes, as well as some code constructions. In some parameter regimes, our constructions yield codes whose rates asymptotically attain the upper bounds.
Interestingly, when , and , the maximal rates of multi-strand reconstruction codes not only depend on , but also depend on the congruence class of modulo .
II Preliminaries
For a positive integer , let denote the set . Let denote a finite alphabet. Throughout this paper, we consider the binary case, i.e., ; however, our results can be easily generalized to non-binary cases. We use to denote the logarithm of to base . When generalizing our results to the -ary alphabet case, it suffices to replace the with .
Assume is a sequence over . We denote its length by , and its Hamming weight by . Given two sequences and over , we denote their concatenation by . If and have the same length, we use to denote their Hamming distance.
A substring of is a sequence of the form , where , and we use to denote it. We also use , where , to denote the substring of which starts at the position and has length , i.e., .
A code is simply a set , whose elements are referred to as codewords. We say is the length of the code. The rate of the code is defined as , and the redundancy of the code is .
II-A Reconstruction from the -Multispectrum
For a sequence and a positive integer , the -multispectrum of , denoted by , is the multiset of all its length- substrings, namely,
If can be uniquely reconstructed from its -multispectrum, then we say it is -reconstructible. It was proved in [22] that if all the length- substrings of are distinct, then is -reconstructible. Such a sequence is referred to as an -substring unique sequence. In the works [10, 9], algorithms were proposed to construct a set of -substring unique sequences of rate approaching , where for any constant real number .
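As an illustration of the definitions above, the following minimal Python sketch computes the multispectrum of a binary string and tests the substring-unique property. The function names `multispectrum` and `is_substring_unique`, and the representation of sequences as strings over '0' and '1', are assumptions made for this example and are not taken from the cited works.

```python
from collections import Counter


def multispectrum(x: str, ell: int) -> Counter:
    """Return the multiset of all length-ell substrings of x.

    The multiset is represented as a Counter mapping each substring
    to its multiplicity (illustrative representation, an assumption).
    """
    return Counter(x[i:i + ell] for i in range(len(x) - ell + 1))


def is_substring_unique(x: str, ell: int) -> bool:
    """Check that all length-ell substrings of x are distinct."""
    return all(count == 1 for count in multispectrum(x, ell).values())
```

For instance, `is_substring_unique("0010", 2)` holds because the windows "00", "01", "10" are distinct, whereas "0101" repeats the window "01".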
In [10], Gabrys and Milenkovic further studied the problem of reconstructing sequences from their noisy multispectra.
They first considered the scenario where some substrings are not included in the readout spectrum.
For a subset , if the maximum number of consecutive substrings which are not included in is , we say has maximal coverage gap .
A code is called an -reconstruction code if every codeword can be uniquely reconstructed from any subset with maximal coverage gap . Gabrys and Milenkovic proposed a construction for such codes [10] by restricting each codeword to be -substring unique with and imposing some constraints on their prefixes.
Gabrys and Milenkovic also studied the scenario where the observations of the substrings suffer from substitution errors. Let be a multiset consisting of strings of length . If there is a subset with maximal coverage gap such that for all , then we say is an -constrained erroneous multispectrum of . Moreover, is said to be reliable if for any symbol in , there are more copies of the correct value than of an incorrect value of the symbol. A code is called an -reconstruction code if every codeword can be uniquely reconstructed from any of its reliable -constrained erroneous multispectra. (Footnote 1: We emphasize that the multispectrum is just a multiset, and the order/index of each cannot be directly read when reconstructing.) Gabrys and Milenkovic constructed an -reconstruction code of redundancy for . Their construction is based on -substring distant sequences, whose definition is presented as follows.
Definition 1.
A sequence is called -substring distant (SD) if the minimum Hamming distance of its -multispectrum is at least , that is, for any .
Remark.
We observe that an -substring distant sequence is also -substring distant, for any . Thus, we may equivalently say that is -substring distant (SD) if for any integer and . This equivalent definition allows to be a real number, which we shall conveniently use in the future.
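The substring-distant property can be checked by brute force, as in the sketch below. This is only an illustration of Definition 1, not an efficient test: the names `hamming`, `min_window_distance` and `is_substring_distant`, and the threshold parameter `d`, are assumptions introduced for this example.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(u, v))


def min_window_distance(x: str, ell: int):
    """Minimum Hamming distance over all pairs of length-ell windows of x.

    Windows are compared by position, so two identical windows at
    different positions give distance 0. Returns None when x has fewer
    than two windows of length ell.
    """
    windows = [x[i:i + ell] for i in range(len(x) - ell + 1)]
    return min((hamming(u, v) for u, v in combinations(windows, 2)),
               default=None)


def is_substring_distant(x: str, ell: int, d: int) -> bool:
    """True if every two length-ell windows of x differ in at least d places."""
    m = min_window_distance(x, ell)
    return m is None or m >= d
```

Consistent with the remark, enlarging the window length cannot decrease the minimum distance, since two longer windows contain two shorter windows at the same pair of positions.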
In [16], Marcovich and Yaakobi followed the noisy setup of Gabrys and Milenkovic. They studied the case of , i.e., no substring losses.
Instead of reconstructing from a reliable erroneous multispectrum, they aimed to reconstruct from an -erroneous multispectrum , the so-called maximum reconstructible-string, i.e., a string of length that takes at every position the majority value of the occurrences of in . Obviously, if is reliable, then the maximum reconstructible-string is equal to . A sequence is called -reconstructible if one can always reconstruct the maximum reconstructible-string from any of its -erroneous multispectra. (Footnote 2: The notion here is a bit different from that in [16], where Marcovich and Yaakobi further assumed that there are at most substrings in each of which is affected by at most errors and referred to it as a -erroneous multispectrum. They proposed two constructions for reconstructible codes: one is independent of and thus can combat any number of erroneous substrings, while the other one depends on . In this paper, we focus on reconstructible codes which are independent of .)
For positive integers with , we use to denote the set of -SD sequences of . For fixed and , Marcovich and Yaakobi showed that the asymptotic rate of the set is , by using the Lovász Local Lemma. Note that when , even a single -substring unique sequence of length does not exist.
Let be a fixed integer. There is an encoding algorithm which uses a single redundancy bit to encode -SD sequences of length , for
where is a small constant number and is sufficiently large.
In Section III, we shall present an algorithm which can encode -SD sequences of length for any , while its redundancy is . According to Proposition 2, this implies an -reconstructible code whose rate approaches , for and .
II-B Reconstruction from an -trace
In [25], Yehezkeally et al. studied an extension of the problem of reconstructing from substrings.
Let be a sequence. A substring trace of is a multiset of substrings for some positive integer , where . If , for all , and , then the substring trace is called complete. Let and be two positive integers such that . An -trace is a complete trace such that:
1. every substring has length at least , i.e., for all ;
2. the overlap of every two consecutive substrings has length at least , i.e., for all .
For a sequence , let denote the set of all -traces of . A code is referred to as an -trace reconstruction code if for all , or equivalently, every codeword can be uniquely reconstructed from any of its -traces.
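The two conditions above can be checked mechanically. The sketch below is a hedged illustration: it represents a trace by hypothetical `(start, length)` pairs with 0-based starts, and it interprets completeness as "the first read starts at the beginning, the last read ends at the end, and consecutive reads leave no gap", which is an assumption about the elided formal condition. The names `is_valid_trace`, `min_len` and `min_overlap` are likewise chosen for this example only.

```python
def is_valid_trace(n: int, reads, min_len: int, min_overlap: int) -> bool:
    """Check the trace conditions for reads given as (start, length) pairs.

    n           -- length of the underlying sequence
    reads       -- iterable of (start, length), 0-based starts (assumption)
    min_len     -- lower bound on the length of every read
    min_overlap -- lower bound on the overlap of consecutive reads
    """
    reads = sorted(reads)
    if not reads:
        return False
    # Every read must lie inside the sequence and be long enough.
    for start, length in reads:
        if start < 0 or start + length > n or length < min_len:
            return False
    # The trace must begin at the first symbol and end at the last one.
    if reads[0][0] != 0 or reads[-1][0] + reads[-1][1] != n:
        return False
    # Consecutive reads must overlap in at least min_overlap positions;
    # a negative overlap corresponds to a coverage gap.
    for (s1, l1), (s2, _) in zip(reads, reads[1:]):
        if s1 + l1 - s2 < min_overlap:
            return False
    return True


def extract_trace(x: str, reads):
    """Return the substrings of x selected by the (start, length) pairs."""
    return [x[s:s + l] for s, l in sorted(reads)]
```

A decoder does not see the `(start, length)` pairs, only the substrings themselves; the pairs are used here solely to state the constraints.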
Let and for some and . If is sufficiently large, then there is an -trace reconstruction code with rate
where is a small number which is independent of .
In this paper, we shall study the problem of reconstructing sequences from their noisy substring traces.
Let be a multiset of sequences over , and let for . We say is an -erroneous trace of if there exists an -trace such that for all . Namely, each string in is an erroneous copy of the substring in with at most errors. The index is referred to as the location in . For a sequence and any of its -erroneous traces , if one can always determine the location of every in , then we say is -trace reconstructible. We note that once all the locations of ’s are identified, the maximum reconstructible-string of can be determined by taking at every position the majority value of the occurrences of in . Hence, the -trace reconstructible sequence can be uniquely reconstructed as long as is reliable.
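Once the reads have been located, the majority step described above is straightforward. The following sketch assumes binary reads given as hypothetical `(start, read)` pairs with 0-based starts; the function name `majority_reconstruct` and the tie-breaking rule (ties resolve to '0') are assumptions of this example, and ties do not occur when the trace is reliable.

```python
def majority_reconstruct(n: int, located_reads) -> str:
    """Majority-vote every position of a length-n binary sequence.

    located_reads -- iterable of (start, read) pairs, where read is a
                     string over '0'/'1' placed at 0-based offset start.
    """
    ones = [0] * n    # number of reads voting '1' at each position
    total = [0] * n   # number of reads covering each position
    for start, read in located_reads:
        for k, symbol in enumerate(read):
            total[start + k] += 1
            ones[start + k] += symbol == "1"
    out = []
    for pos in range(n):
        if total[pos] == 0:
            raise ValueError(f"position {pos} is not covered by any read")
        # Strict majority of ones gives '1'; ties go to '0' (assumption).
        out.append("1" if 2 * ones[pos] > total[pos] else "0")
    return "".join(out)
```

The hard part of the reconstruction problem is determining the locations; this step only shows why locating every read suffices when the trace is reliable.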
A code is called an -trace reconstruction code if every codeword is -trace reconstructible. (Footnote 3: Unlike the noiseless case, in an -trace reconstruction code it might be possible that two codewords share a common -erroneous trace. Nevertheless, they cannot have a common reliable trace.)
In Section IV, we will give two constructions for -trace reconstruction codes where the number of errors is fixed. Our results are akin to Theorem 6 and Theorem 8. In particular, when for some , we construct a class of -trace reconstruction codes whose rates approach . When and for some and , the proposed -trace reconstruction codes have rates close to . These results are summarized in Table I.
Our constructions are based on robust positioning sequences and window-weight limited sequences, which are reviewed in Section II-D.
We note that when , -reconstruction codes were studied by Bar-Lev et al. in [2] under the name of adversarial torn-paper codes. In the same paper, they also consider the scenario where the DNA strand may suffer from substitution errors before sequencing. Such errors cannot be corrected by majority decoding. Yehezkeally and Polyanskii studied a similar problem for the -trace reconstruction [26]. They introduced the notion of -resilient repeat free sequence, which satisfies the property that the result of any substitution errors to it is -repeat free, and proposed an algorithm to directly encode such sequences. Interestingly, [26, Lemma 6] shows that an -SD sequence is -resilient repeat free. In Section IV, we will also study errors before sequencing and modify our code construction for -trace reconstruction to combat such errors.
TABLE I: Lower and upper bounds on the code rate of single-strand -trace reconstruction codes of .
Motivated by DNA sequencing technologies where multiple DNA strands are sequenced simultaneously,
the reconstruction problem has been extended to the multi-strand case in [25, 2], i.e., reconstructing a multiset of sequences of length from the union of their traces.
Define
Then . The rate of a multi-strand code is defined as
For a multiset , its -trace is a (multiset) union , where each is an -trace of . A code is referred to as a multi-strand -trace reconstruction code if every codeword can be reconstructed from its -trace.
Two classes of multi-strand trace reconstruction codes whose rates asymptotically attain the upper bound have been constructed in [25, 2], for or , respectively.
Suppose that and . Then there is a class of multi-strand -trace reconstruction codes of rate .
In this paper, we will also study the problem of reconstructing multiple strands from their noisy traces.
For a multiset , its -erroneous trace is a (multiset) union , where each is an -erroneous trace of . We aim to reconstruct from its -erroneous trace. If for any -erroneous trace of and any , it is possible to determine the index such that as well as the location of in , then we say is -trace reconstructible.
A code is called a multi-strand -trace reconstruction code if each of its codewords is -trace reconstructible.
Following the research in [25], we assume that , which is of great interest in applications. In Section V, we shall present some upper bounds on the multi-strand trace code rate and propose some codes whose rates asymptotically attain these bounds. Our results are summarized in Table II and Table III. Among others, when with , we obtain a class of multi-strand -trace reconstruction codes of rate , where . Note that and . The term could be a non-vanishing number, depending on the congruence class of modulo .
In contrast, when , the rate of the multi-strand -trace reconstruction codes in [2, Theorem 12] is , which is the same as that of single-strand reconstruction codes.
TABLE II: Lower and upper bounds on the code rate of multi-strand -trace reconstruction codes of , where .
An -substring distant sequence is also known as an -robust positioning sequence, since the contents of any length- substring can locate the substring’s position in , even if they are corrupted by at most errors. In the context of robust positioning sequences, given and , it is of interest to construct a (single) long -robust positioning sequence with an efficient locating algorithm.
This problem, as well as its 2-dimensional extension, has been discussed in [5, 4, 6, 7, 24]. Among others, Chee et al. [6] constructed a class of -robust positioning sequences of length for some constant number . Their construction was refined in [24] to obtain sequences of length , which is nearly optimal. The constructions in [6, 24] require the following notions.
Let be positive integers such that . We say a sequence satisfies the -window weight limited (WWL) constraint, and is called an -WWL sequence,
if for any .
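The WWL constraint can be tested with a sliding window. The sketch below assumes that the constraint is a lower bound on the Hamming weight of every window of the given length, which is how it is used later in the text (a window must contain enough ones); the names `is_wwl`, `ell` and `w` are introduced for this illustration only.

```python
def is_wwl(x: str, ell: int, w: int) -> bool:
    """Check that every length-ell window of the binary string x
    has Hamming weight at least w (assumed form of the WWL constraint).

    A string shorter than ell has no windows and passes vacuously.
    """
    if len(x) < ell:
        return True
    weight = x[:ell].count("1")          # weight of the first window
    if weight < w:
        return False
    for i in range(ell, len(x)):
        # Slide the window one step: add x[i], drop x[i - ell].
        weight += (x[i] == "1") - (x[i - ell] == "1")
        if weight < w:
            return False
    return True
```

The sliding update keeps the check linear in the length of the sequence instead of recounting each window.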
Proposition 13 ([6, Construction 1 and Theorem 3.7]).
Given and , choose such that and , where .
Let be a -auto-cyclic vector of length from Theorem 11 and set
.
Let be a collection of length- binary vectors satisfying the following conditions:
(P1) is a -WWL vector for ;
(P2) is a -WWL vector for and ; and
(P3) the concatenation is an -modular robust positioning sequence. (Footnote 4: A sequence is an -modular robust positioning sequence if for any and .)
Then the sequence
is an -robust positioning (substring distant) sequence.
Theorem 14 ([6, Construction 1A and Corollary 3.12]).
Given and , set . There is an explicit construction of sequences of length , where , such that the conditions (P1)–(P3) in Proposition 13 are satisfied.
Remark.
We note that for each , the concatenation is an -WWL sequence, since the length- prefix of is and is -WWL.
III Encoding of -Substring Distant Sequences for
In this section we shall present an encoding method which can generate a set of -SD sequences of length (with , a real number) whose rate asymptotically approaches . We shall, in fact, construct -SD sequences with , but using the remark following Definition 1, we shall find it more convenient to denote these sequences as -SD.
We first require some notation. For a sequence , we say that (where ) is an -close window pair in if . Moreover, is called primal, if for any other -close window pair in we have .
Let be two sequences with for some integer . Let denote the indices of the entries where and do not agree.
For every let
(1)
where is the binary representation of with symbols.
Let
Then encodes the difference between and , and its length is .
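The idea of recording the difference between two close binary windows as a list of positions can be sketched as follows. This is not a reproduction of equation (1): the fixed field width, the 1-based positions, and the use of the all-zero field as padding for unused slots are all assumptions of this example, as are the names `encode_difference` and `apply_difference`.

```python
def encode_difference(u: str, v: str, d: int) -> str:
    """Encode the positions where u and v differ into d fixed-width fields.

    Positions are 1-based so that the value 0 can mark an unused slot
    (padding convention assumed for this sketch).
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    diff = [i + 1 for i in range(len(u)) if u[i] != v[i]]
    if len(diff) > d:
        raise ValueError("u and v differ in more than d positions")
    width = len(u).bit_length()          # enough bits for values 0..len(u)
    slots = diff + [0] * (d - len(diff))
    return "".join(format(p, f"0{width}b") for p in slots)


def apply_difference(u: str, code: str, d: int) -> str:
    """Recover v from u and the output of encode_difference."""
    width = len(u).bit_length()
    out = list(u)
    for j in range(d):
        p = int(code[j * width:(j + 1) * width], 2)
        if p:                            # 0 marks an unused slot
            out[p - 1] = "1" if out[p - 1] == "0" else "0"
    return "".join(out)
```

Over the binary alphabet, the differing positions alone determine one window from the other, so the encoded length grows with the number of allowed differences times the logarithm of the window length.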
Given a fixed and a sufficiently large , we are going to present an encoding algorithm which can encode -SD sequences of length . Set
Additionally, set
Assume that is fixed and is sufficiently large. Then , and if and only if . Note that
Thus, we have that
Our encoder resembles the encoding algorithms in [10, 9] and consists of the following three parts:
1. We first use the encoder presented in [14] to encode a message sequence into a -WWL sequence of length . According to [14, Corollary 20], this encoder, denoted by , requires approximately redundancy symbols, where
for some constant .
Hence,
(2)
2. Then we encode the -WWL sequence into an -SD sequence by eliminating the pairs of substrings of small distance and attaching some information about their positions and difference. This encoder, denoted by , is presented in Algorithm 1, and it additionally guarantees that the output sequence is -WWL.
3. As an output of Algorithm 1, the sequence is usually shorter than the sequence . Thus, we need an expansion step to increase the sequence length while keeping the substring-distant property. Let be a collection of -WWL sequences of length as in Theorem 14. Set
where is the -auto-cyclic vector of length from Theorem 11.
Finally, let
We shall show is the required -SD sequence of length .
We first describe the encoding presented in Algorithm 1. This procedure encodes a -WWL sequence into a sequence that is simultaneously -SD and -WWL. Initialize . If there are no -close window pairs in , then the algorithm returns as the output. We observe that since is -WWL and , then is also -WWL.
Otherwise, we choose a primal -close window pair, say . We replace the substring with the sequence
(3)
where is the encoding function in [14, Algorithm 2], which can encode integers in into -WWL sequences in time. We note that this sequence is -WWL and contains the information about the position and the difference between and . Moreover, the substring serves as a marker which indicates the position of the removed substring .
We shall repeat this procedure until there are no -close window pairs in . However, to ensure that can be recovered from the output of the algorithm, additional care is needed. We note that in [10] the inserted sequences always start with a marker and end with a symbol ‘’. This pattern, together with the rule that only the primal pairs can be chosen and replaced, guarantees that after each replacement the latest inserted substring always starts with the rightmost in . Due to this property, we have a decoding algorithm which can recover from : Let denote the sequence after the -th replacement. One can search for the rightmost in to find the position of the inserted substring in the -th replacement. By replacing the inserted substring with the removed substring, one can recover from . Doing this iteratively, one can eventually recover from .
In our encoding, the inserted substring should always contain as both prefix and suffix to maintain the property of being -WWL. We have to modify the substring in (3) to ensure the latest inserted substring always starts with the rightmost in .
Let and be the positions of the removed substrings in the previous replacement and in the current replacement, respectively. Since we only choose the primal pairs, necessarily, . If , then we still replace the substring with the sequence in (3), since the marker which is inserted in the previous replacement will be destroyed by the suffix of this inserted sequence. If , we first set to be ‘1’ to destroy the previous marker . Then we replace with the sequence
(4)
where is the binary encoding of with symbols, since .
Note that the substring and the substring have length and length at most , respectively. It follows that in the loop we replace substrings of length with substrings of length at most
where the first inequality is obtained by noting that for all sufficiently large we have . Hence, the loop will execute at most times and the algorithm will terminate eventually.
Algorithm 1 Primal Pair Elimination Encoder for Generating -SD Sequences
Input: a -WWL sequence
Output: a sequence
Set and
while there are two length- substrings in whose Hamming distance is at most do
    Suppose is a primal -close window pair in (then necessarily )
    if then
        Remove the substring of length starting at position and replace it with the sequence
    else
        Set to be ‘1’
        Remove the substring of length starting at position and replace it with the sequence
    end if
end while
return
Lemma 15.
The output sequence is -WWL and -SD, and the input sequence can be recovered from , for all sufficiently large .
Proof:
The while loop ensures that the output of Algorithm 1 is an -SD sequence. Moreover, since is -WWL and , a tedious but routine check shows that for all large enough , is -WWL. In particular, even if is all zeros, for all large enough
and a substring of length containing must also contain at least of the surrounding ’s.
Next, we show after each replacement the latest inserted substring always starts with the rightmost . Let be the sequence after the -th replacement. We prove this by induction. When , since is -WWL, the marker appears exactly once in , and so the claim holds. Now, in the -th replacement, denotes the position of the substring removed in this replacement, while denotes the position of the substring removed in the -th replacement. According to the inductive assumption, the rightmost in starts at the position . If , then the rightmost in is . If , the overlap of and has length greater than . Since the sequence which is inserted in the -th replacement ends with a symbol ‘1’, it can destroy the marker in . If , we set to be ‘1’ to destroy the marker in . In all cases, the rightmost in is always .
Now, given the sequence , we first search for the rightmost in to determine the position . Then from the substring we can decode , the difference between and , and . Note that . So we can recover . We remove from and replace it with . If , we further set the symbol in the position to be ‘0’. In this way, we recover the sequence . We repeat this procedure until there is no substring . Then the resulting sequence is the required .
∎
Now, we need to extend the sequence to a long sequence of length while keeping the property of being -SD.
Lemma 16.
Assume is sufficiently large. Let be an output of Algorithm 1. Recall that .
By invoking Theorem 14 with parameters “” and “”, we get
a collection of -WWL sequences of length , where . Let
where . Set
Then is a -WWL and -SD sequence where and . Moreover, can be recovered from .
Proof:
We first prove that is a -WWL and -SD sequence of length at least . According to the construction, the length of is . Recall that and . Then
(5)
Hence, has length at least . Note that each is a -WWL sequence and the length- prefix of is . It follows that is a -WWL sequence. Moreover,
note that the sequences satisfy the conditions (P1)-(P3) with “”.
If (namely, ), then by Proposition 13, the sequence
is an -SD sequence, hence also
an -SD sequence.
If , since the property of being -WWL implies the property of being -WWL, the sequences also satisfy the conditions (P1)–(P3) with “”. (Footnote 5: In this case, we take “”, “”, “”, and so, “”, which is equal to the length of the ’s.)
Again, by Proposition 13, the sequence is an -SD sequence.
We have shown that is a -WWL sequence in the above paragraph and is a -WWL sequence in Lemma 15. By using the fact that and that the substring of starts with , it follows that the sequence is -WWL. Now, we shall show that it is also -SD. For any two substrings and with and , we consider the following cases:
Case 1: . Then
where the first inequality holds since and the second inequality holds since is an -SD sequence.
Case 2: and . Since , where is the length of , then must contain as a substring. Assume that for some .
If , then
where the last inequality follows from the definition of a -auto-cyclic sequence.
If , since the prefix of is , then
If , then , and so, is a substring of . Hence,
where the last inequality holds since is a -WWL sequence.
Case 3 and Case 4, which now follow, together cover the case of and and the case of and .
Case 3: and . Denote . Then . Note that always contains as a substring, and is a substring of , which is -WWL. Hence,
Case 4: and . Since , must contain as a substring. If , then must contain as a substring, and so, with the same argument as that in Case 2, one can show that . If , assume that is the all-zero substring of length . Then . It follows that is a substring of , which is -WWL. Hence,
Case 5: . Then
where the second inequality holds since and is -SD.
Finally, note that in the sequence there is exactly one run of ‘0’ which has length at least . So we can search for the rightmost in and remove this substring as well as the suffix after it to recover the sequence .
∎
Theorem 17.
Let . Then, for large enough, is invertible and can encode sequences of into -WWL and -SD sequences where and
Moreover, , and so, we have that
Proof:
The statement about follows from Lemma 15 and Lemma 16. Recall that the encoder requires redundancy symbols (see (2)) and . Hence,
∎
IV Generalized Reconstruction from Noisy Substring Trace
In this section, we are going to give constructions of -trace reconstruction codes. Our first result generalizes Proposition 2 and Proposition 5, which shows that the property of being -substring distant implies the property of being -trace reconstructible.
Proposition 18.
Suppose that . If a sequence is -substring distant, then is -trace reconstructible.
Proof:
Let be an -erroneous trace of where the location of each in is .
Since is -substring distant, for any two substrings and and any two of their subsubstrings and , we have that
Therefore, can be identified as the unique substring whose length- prefix is of Hamming distance at least from every length- subsubstring of any other . Denote the length- suffix of as . Then we can identify the substrings ’s in which overlap at at least positions, since each of them contains a unique length- subsubstring whose distance from is at most . Furthermore, the locations of these substrings in can be determined by aligning the subsubstring and the suffix . Assume that there are such substrings. Then we have identified the substrings . Next, we consider the length- suffix of and we can identify all the substrings in which overlap at at least positions. We repeat the procedure above. Finally, we can determine the location of every substring in .
∎
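The alignment step used in the proof can be sketched in code as follows. The sketch takes the suffix of an already located read and searches another read for a window that is close to it in Hamming distance; the tolerance `tol`, the 0-based offsets, and the names `align_to_suffix` and `locate_next` are assumptions of this example, and the actual distance thresholds of the proof are not reproduced here.

```python
def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))


def align_to_suffix(suffix: str, read: str, tol: int):
    """Offsets k such that read[k:k+len(suffix)] is within tol of suffix."""
    w = len(suffix)
    return [k for k in range(len(read) - w + 1)
            if hamming(suffix, read[k:k + w]) <= tol]


def locate_next(prev_start: int, prev_read: str, read: str, w: int, tol: int):
    """Locate `read` relative to a read already placed at prev_start.

    Uses the length-w suffix of prev_read as an anchor. Returns the
    0-based start of `read`, or None if the alignment is missing or
    ambiguous (which the substring-distant property is meant to rule out).
    """
    suffix = prev_read[-w:]
    suffix_start = prev_start + len(prev_read) - w
    matches = align_to_suffix(suffix, read, tol)
    if len(matches) != 1:
        return None
    return suffix_start - matches[0]
```

On short toy strings the alignment may be ambiguous, in which case `locate_next` returns None; the role of the substring-distant property is precisely to guarantee a unique match for the chosen window length and tolerance.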
Combining Theorem 17 and Proposition 18, we have the following result.
Corollary 19.
Suppose that and . If is sufficiently large, then there is an -trace reconstruction code of whose rate is .
Now, we consider another parameter regime. Suppose that
where and are real constants. We are going to construct an -trace reconstruction code whose rate approaches . The basic idea of our code construction is similar to the one in [16] for the noiseless scenario: A message is encoded into a codeword such that
(i) the index can be decoded from any length- substring of even if the substring is corrupted by at most errors;
(ii) can be reconstructed from any of its -erroneous traces.
To this end, our construction leverages the map in Section III which can encode WWL and SD sequences, as well as the following coded indices ’s which are generated from a robust positioning sequence.
Construction A (Index Construction).
Given , let
Additionally, set
where is an arbitrary fixed number which is independent of . Then
where we assume are constants, and . Applying Theorem 14 with , there is an explicit construction of sequences such that the concatenation
is an -SD sequence. Moreover, according to the remark following Theorem 14, each is -WWL where
is the length of the -auto-cyclic sequence . Denote
For each , we partition the sequence into segments , each of length or .
∎
In the following, we first consider the case of and give the code construction. Then we will show how to modify this construction to settle the other cases.
IV-A The case of
Let us define
We note that by our choice of parameters, for all sufficiently large .
Assume that and denote .
For each , let
(6)
Then
Lemma 20.
Let be defined as above, and assume is large enough.
Then for each there is an integer with and an invertible map which can encode sequences of into -WWL and -SD sequences.
Proof:
We shall apply Theorem 17 to prove this lemma. To this end, we first need to verify that can be arbitrarily large. As noted before, . Additionally, , and and by our choice of parameters, is a constant. Hence, as .
Next, we need to verify that and satisfy the two conditions in Theorem 17. Regarding the value of , we need to show that . Noting that and , we have that
It follows that
Since
, we have that is substantially larger than
Now, we verify the condition on , namely that . Note that
and
Hence, we have that
It follows that
We can conclude that is substantially larger than .
∎
Now, we present our code construction.
Construction B.
Let ’s be defined as in Lemma 20. We now describe a mapping from to . For any message , partition into substrings:
where each has length .
For each , let
where is the map mentioned in Lemma 20.
We partition each into substrings of length :
Then the total number of ’s is .
We further partition each into segments of lengths or :
Recall from the index construction, Construction A. Let
Finally, let
where and is the -auto-cyclic sequence in Theorem 11.
Denote
The constructed code, , is the image of the mapping described above.
∎
Lemma 21.
Let be the code obtained by Construction B. Then and its rate is
Proof:
In our construction, every sequence has length , and so, the concatenation has length . It follows that the codeword has length .
Noting that the map is invertible, we can uniquely recover from . Therefore, the code has rate .
∎
Lemma 22.
Let be a codeword of . Assume that
the substrings ’s satisfy the following conditions:
(P1)
is a -WWL sequence for each ; and
(P2)
is a -WWL sequence for such that and .
Then for every substring in and each (if , we let denote the concatenation ), the following hold:
(i)
If , then .
(ii)
If , then .
Lemma 23.
Assume is sufficiently large. Let be an arbitrary length- substring of . Then contains a length- suffix of a coded index and a length- prefix of either or for some and . Furthermore, even if is corrupted by at most errors, we can still identify the positions where the said suffix and prefix appear, and so reconstruct them with at most errors.
Proof:
We note that the length of is , and that is a concatenation of such strings. Hence, the first statement follows directly from the code construction. Now, assume that is corrupted by at most errors. We shall use Lemma 22 to identify the location of the marker in .
Recall that every is -WWL (see the index construction, Construction A) and every is -WWL (see Lemma 20). Since and , all the segments ’s and ’s are -WWL. Hence, ’s satisfy the conditions in Lemma 22. This follows since any substring of length contains a substring of length that is fully contained within a segment of the form or , thus providing the minimum weight of as claimed.
Since suffers from at most errors and , by Lemma 22 there is a unique index such that
Hence, by comparing the distance between the marker and each length- substring of , we can identify the location of the marker in . Once the marker is located, the positions in which the symbols of the coded indices ’s appear can also be determined. Then we can reconstruct a prefix and a suffix or for some with at most errors.
∎
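The marker search used in the proof of Lemma 23 can be illustrated as follows. This is a minimal Python sketch under stated assumptions: the function name `locate_marker` and the toy marker are hypothetical, the all-zero marker is only a stand-in for the marker of Construction B, and the surrounding data is assumed to have high weight in every window (a toy version of the WWL property), so that only the true marker position lies within the error threshold.

```python
def locate_marker(window, marker, max_errors):
    """Return the unique start index at which `window` is within
    max_errors (in Hamming distance) of `marker`, or None if there is
    no such index or more than one."""
    m = len(marker)
    hits = [
        j
        for j in range(len(window) - m + 1)
        if sum(a != b for a, b in zip(window[j:j + m], marker)) <= max_errors
    ]
    return hits[0] if len(hits) == 1 else None
```

For instance, in `"1101101" + "001000" + "1011011"` the marker `"000000"` has been hit by one error, and `locate_marker(..., "000000", 1)` still returns position 7. With a threshold of 2 several windows qualify and the function returns `None`, which mirrors the role of the weight conditions in guaranteeing uniqueness.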
The following lemma ensures that every length- substring of contains a long-enough substring of the -SD sequence .
Lemma 24.
Assume is sufficiently large. Let be a codeword of . Then every length- substring of contains at least consecutive symbols of .
Proof:
Note that the concatenation
consists of symbols, out of which symbols are from . Then according to the construction, every length- substring of contains at least
consecutive symbols of , where accounts for the worst case where the substring both begins and ends with some segments of the coded indices (of length or ) and contains a copy of or .
∎
Theorem 25.
The code obtained in Construction B is an -trace reconstruction code of with rate
Proof:
The code rate has been calculated in Lemma 21.
Let be a codeword of and be an -erroneous trace of . For each in , since the length of is at least , according to Lemma 23, we can extract a corrupted copy of the length- suffix of , and a corrupted copy of a length- prefix of either or , with the total number of errors being no more than . Consider the following cases.
1.
If , then is a corrupted copy of , and so, we can run the locating algorithm of the robust positioning sequence on the corrupted to determine the index .
2.
If then contains a copy of either or with at most errors. Since , we can distinguish these two cases.
(a)
If contains a copy of , then is a prefix of , and so, we run the locating algorithm of on to decode the index .
(b)
If contains a copy of , then is a prefix of , and so, we run the locating algorithm of on to decode the index .
The discussion above shows that for every string , we can decode the index . If intersects both and , then we can determine its location in by identifying the location of the marker in . For the other strings with index , since is an -SD sequence, according to Lemma 24 and Proposition 18, there is a unique way to determine the correct order of these strings and match correctly the suffix and the prefix of consecutive strings. By taking the majority value at every position, we can reconstruct a sequence , which is a long substring of possibly with some errors. It remains to determine the location of in , which can be done as follows.
1.
If contains a corrupted copy of with at most errors, then the location of this marker in determines the location of in , since contains only one copy of .
2.
If does not contain any corrupted copy of up to errors, then there is a string which intersects both and and contains as a substring with at most errors, since the length of is less than .
(a)
If overlaps in at most positions, since , must contain a copy of the first of , and so, the location of in can be determined by identifying the first occurrence of the marker in .
(b)
If overlaps in at least positions, then and the length- prefix of share a length- substring of . Since is -SD, we can match the suffix of and the prefix of correctly. Then the location of in can be deduced from the location of in .
∎
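The majority step in the proof of Theorem 25 can be sketched as follows, once the offset of every read has been determined. This is a minimal Python illustration. The name `majority_merge` and the `"?"` placeholder for uncovered positions are assumptions made for the example and are not part of the construction.

```python
from collections import Counter


def majority_merge(reads, offsets):
    """Merge located reads into one string by a per-position majority vote.

    reads[i] is assumed to start at position offsets[i] of the underlying
    sequence. Positions covered by no read are reported as '?'.
    """
    lo = min(offsets)
    hi = max(o + len(r) for r, o in zip(reads, offsets))
    votes = [Counter() for _ in range(hi - lo)]
    for read, off in zip(reads, offsets):
        for k, symbol in enumerate(read):
            votes[off - lo + k][symbol] += 1
    return "".join(
        v.most_common(1)[0][0] if v else "?" for v in votes
    )
```

As a usage example, the reads `"011010"`, `"100001"` and `"100111"` at offsets 0, 2 and 4 merge to `"0110100111"`: the single error in the second read falls on a position covered by three reads and is outvoted. An error on a position covered by only one or two reads is not guaranteed to be corrected by this step.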
IV-B The case of
Now, we consider the case that does not divide . Take . Construction B can yield a trace reconstruction code of block length . Our approach is to extend this code to have length .
Let be defined as in (6) and be defined as in Lemma 20. For any message , partition into substrings, each of length :
For each , let
The main difference from the previous case is the encoding of .
We recall that the encoder first encodes the message to an SD and WWL sequence of length possibly less than . Then it extends the sequence by appending a sequence and taking the first bits of the concatenation. For , we modify the encoder by taking the first bits of the concatenation. This is possible since asymptotically the length of is larger than , see (5). We denote this modified encoder as and let
Then is -WWL and -SD and has length . Moreover, the message can be decoded from the first bits of . In other words, the last bits are redundant.
Then, we proceed similarly as in Construction B and obtain an -trace reconstruction code of block length . Note that the last bits are redundant, and so, we delete of them to form an -trace reconstruction code of length , with code rate
IV-C Handling noise which occurs before sequencing
Up to now, we have studied -trace reconstruction codes, which allow reconstructing the maximum reconstructible-string from an erroneous trace of a codeword . We use to denote the maximum reconstructible-string of . If is reliable, then . However, if is not reliable, then might be different from . This may happen especially when the sequence is subject to errors before its substrings are sampled. In the remainder of this section, we shall modify Construction B to combat such errors.
Let be an -erroneous trace of such that , which is referred to as an -erroneous trace. We aim to reconstruct from , and so retrieve the message which is stored in . Our construction, which is presented below, borrows the idea from [2, Construction B].
Construction C.
Assume that and take . Let . According to Lemma 20, there is an integer with and an invertible map which can encode sequences of into -WWL and -SD sequences. Let be an encoder which modifies by taking the first bits of the concatenation.
For any message , we first use a Reed-Solomon code to encode into a codeword . (The Reed-Solomon code is over the finite field of size . The message is partitioned into groups of bits, and each group is translated to a single symbol from the finite field. After encoding, the reverse translation to bits is performed. Note that , and . Hence, and so, the Reed-Solomon code exists.)
We partition into sequences of length :
For each , let
Then we proceed similarly as in Construction B to obtain a sequence of length . We use to denote the code produced by this construction.
∎
Lemma 26.
Let be a codeword of and be an -erroneous trace of . Then we can recover from .
Proof:
By the same argument as in the proof of Theorem 25, we can show that is an -trace reconstruction code of . Since is also an -erroneous trace of , the maximum reconstructible-substring can be decoded from . By reversing the operations in Construction C, we obtain a sequence from . We partition into segments of the same length, i.e., . Since , there are at most indices such that . Hence, we can run the decoder of the Reed-Solomon code on to recover .
∎
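The counting step in the proof of Lemma 26 rests on the observation that each bit error corrupts at most one segment, so a bounded number of bit errors yields at most that many erroneous Reed-Solomon symbols. The following minimal Python sketch illustrates only this counting. It is not a Reed-Solomon decoder, and the name `differing_blocks` and the block length used in the example are hypothetical.

```python
def differing_blocks(sent, received, block_len):
    """Indices of the length-`block_len` segments in which the two
    equal-length strings `sent` and `received` differ."""
    if len(sent) != len(received):
        raise ValueError("strings must have the same length")
    return [
        i // block_len
        for i in range(0, len(sent), block_len)
        if sent[i:i + block_len] != received[i:i + block_len]
    ]
```

For example, two bit errors in different segments of `"000011110000"` (block length 4) give two corrupted segments, whereas two bit errors inside the same segment give only one. In either case the number of corrupted segments does not exceed the number of bit errors, which is what the outer code has to handle.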
Theorem 27.
Suppose that .
Then the code obtained in Construction C is an -trace reconstruction code of with rate
Proof:
Since , we have that . Hence, the code rate
Consider the ’s which are defined in (6).
We have that
Hence,
∎
IV-D -Reconstruction Codes
In this subsection, we consider the case of .
Construction D.
Suppose that , and . As before, we denote and . However, this time, we let and where and . Then according to Theorem 14, there is a collection of -WWL sequences such that the concatenation is an -SD sequence.
Denote . Let be the encoder in [14, Algorithm 2] which can encode sequences of into -WWL sequences of . (Note that and . Hence, . Then, according to Lemma 19 in [14], the encoder works.) For a message where for , let for all .
Denote where is a -auto-cyclic sequence of length . Let
Output as the codeword which encodes the message . The image under this mapping is the code that we construct.
∎
Theorem 28.
The code obtained in Construction D is an -trace reconstruction code of with rate
Proof:
The code has rate
Now, let be a length- substring of some codeword . Then must contain either a copy of or a suffix of together with a prefix of . Since ’s and ’s are WWL sequences, even if suffers from errors, we can still locate the marker in . Then we can run the locating algorithm of the robust positioning sequence to determine the index or , and hence the location of .
∎
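The last decoding step in the proof of Theorem 28, determining an index from a corrupted window of the robust positioning sequence, can be illustrated by a brute-force nearest-window search. This Python sketch is only a stand-in for the locating algorithm referred to in the proof, which is not reproduced here. The name `locate_window` is hypothetical, and the search is correct only if distinct windows of the positioning sequence are sufficiently far apart in Hamming distance.

```python
def locate_window(pos_seq, window, max_errors):
    """Position of the window of `pos_seq` that is closest to `window`.

    Returns None if the closest window is further than max_errors away
    or if the closest window is not unique.
    """
    w = len(window)
    dists = [
        (sum(a != b for a, b in zip(pos_seq[i:i + w], window)), i)
        for i in range(len(pos_seq) - w + 1)
    ]
    if not dists:
        return None
    best = min(d for d, _ in dists)
    if best > max_errors:
        return None
    winners = [i for d, i in dists if d == best]
    return winners[0] if len(winners) == 1 else None
```

With `pos_seq = "0000100110101111"`, whose length-4 windows are all distinct, the clean window `"1101"` is located at position 7. The query `"1100"` is at distance 1 from two different windows, so the function returns `None`. This ambiguity is exactly what the minimum-distance property of a robust positioning sequence is meant to rule out.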
For the case of , let . We first construct an -trace reconstruction code of , where the length- suffix of every codeword is fixed. Then we truncate it to be of length . In this way, we get a code of rate
For -erroneous trace reconstruction, we proceed similarly as in [2, Construction B]. We first use an code to encode a message to a sequence . Then we use the encoder outlined in Construction D to get a codeword . We note that Construction B in [2] only concerns errors before sequencing, while our construction incorporates errors both before and after sequencing.
V Multi-Strand Reconstruction
In this section, instead of reconstructing a single sequence, we consider the problem of reconstructing a multiset of sequences of length from the union of their traces.
The following construction of multi-strand -trace reconstruction codes is adapted from [25, Construction C].
Construction E.
Let . We take an -trace reconstruction code of . For each codeword , let
The code we construct is , defined as,
∎
Lemma 29.
Let . Then the code from Construction E is a multi-strand -trace reconstruction code of .
Proof:
It is easy to see that an -erroneous trace of is also an -erroneous trace of . Since is a trace reconstruction code, for each , we can determine its location in . Hence, we can determine the index such that and determine the location of in .
∎
(We use to denote in order to avoid confusion with , which denotes the number of errors.)
Theorem 31.
Suppose that , and . For sufficiently large , there is a multi-strand -trace reconstruction code of whose rate is .
Proof:
Let . Then . According to Corollary 19, there is an -trace reconstruction code of whose rate is . Applying Construction E with this code, we obtain a multi-strand -trace reconstruction code of with . Note that
Hence, the code rate is
∎
Now, we consider the case of .
Assume that and where and . Let
We first present some upper bounds on the rate of multi-strand -trace reconstruction codes.
Suppose that .
Let be a multi-strand -trace reconstruction code of . If and , then .
Proof:
It suffices to consider the case of . In this case, we denote
Since , each is still an -trace, and it consists of sequences of . Hence, we have that
Since , we have . Using the inequality in Lemma 32, we get that
(12)
Since and , we have that for some constants . It follows that .
Continuing (12), we have that
where the last equality holds since .
Hence,
∎
Remark.
We note that the condition in Lemma 36 cannot be removed. A counterexample is the -trace reconstruction codes of rate in Theorem 31, where and .
Note that a multi-strand -trace reconstruction code is also a multi-strand -trace reconstruction code. Hence, the upper bounds in Lemmas 33–36 also hold for multi-strand -trace reconstruction codes.
In the following, we study the lower bounds.
Theorem 37.
Let and , where and . For all sufficiently large ,
1.
if , then there is a multi-strand -trace reconstruction code of of rate
2.
if where is a real constant, then there is a multi-strand -trace reconstruction code of of rate
Proof:
Let . Then . According to Theorem 25, there is an -trace reconstruction code of whose rate is . Applying Construction E with this code, we obtain a multi-strand -trace reconstruction code of with . Note that
If , then , and so, we have that
If , then
and so, we have that
∎
When or when and , the lower bounds in Theorem 37 asymptotically achieve the upper bound in Lemma 33.
Next, we show that when and , if for a positive which is independent of , then the upper bound in Lemma 33 can still be achieved.
Construction F.
Suppose that and . Denote and . Let and where . Then according to Theorem 14, there is a collection of -WWL sequences such that the concatenation is an -SD sequence.
Denote . Let be the encoder in [14, Algorithm 2] which can encode sequences of into -WWL sequences of . (Note that and . Hence, . Then, according to Lemma 19 in [14], the encoder works.) For a message where for , let for all . We partition each into substrings as follows:
where for and
Denote where is a -auto-cyclic sequence of length . For each , let
Output as the codeword which encodes the message . The image of the mapping described here is the constructed code.
∎
Lemma 38.
Suppose that for a positive which is independent of . Then the code obtained in Construction F is a multi-strand -trace reconstruction code of .
Proof:
Let be a length- substring of for some . Note that and . Then must contain either a copy of or a suffix of together with a prefix of . Since ’s and ’s are WWL sequences, even if suffers from errors, we can still locate their position in by searching for the marker . Then we can run the locating algorithm of the robust positioning sequence to determine the index or , and hence the location of .
∎
Theorem 39.
Suppose that , and , where and . If for a fixed positive which is independent of , then there is a multi-strand -trace reconstruction code which has code rate
Proof:
Note that
Hence, the code rate is
∎
Finally, we note that the multi-strand -trace reconstruction code in Construction F only guarantees recovering the message from reliable -erroneous traces, the occurrence of which might be rare since and each symbol is usually included in a small number of substrings in . Nevertheless, we can use a code to encode the message, as we did in Construction C, so that even if there are in total errors in , we can still decode the message. The rate of this trace reconstruction code is
References
[1]
J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan, “String
reconstruction from substring compositions,” SIAM J. Discrete Math.,
vol. 29, no. 3, pp. 1340–1371, 2015.
[2]
D. Bar-Lev, S. Marcovich, E. Yaakobi, and Y. Yehezkeally, “Adversarial
torn-paper codes,” in Proceedings of the 2022 IEEE International
Symposium on Information Theory (ISIT2022), Espoo, Finland, Jun. 2022, pp.
2934–2939.
[3]
T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing strings from
random traces,” in Proc. the 15th Annual ACM-SIAM Symposium on
Discrete Algorithms, New Orleans, LA, USA, 2004, pp. 910–918.
[4]
R. Berkowitz and S. Kopparty, “Robust positioning patterns,” in Proc.
of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA,
USA, 2016, pp. 1937–1951.
[5]
A. M. Bruckstein, T. Etzion, R. Giryes, N. Gordon, R. J. Holt, and
D. Shuldiner, “Simple and robust binary self-location patterns,” IEEE
Trans. Inform. Theory, vol. 58, no. 7, pp. 4884–4889, 2012.
[6]
Y. M. Chee, D. T. Dao, H. M. Kiah, S. Ling, and H. Wei, “Robust positioning
patterns with low redundancy,” SIAM J. Comput., vol. 49, no. 2, pp.
284–317, 2020.
[7]
D. T. Dao, H. M. Kiah, and H. Wei, “Maximum length of robust positioning
sequences,” in Proceedings of the 2020 IEEE International Symposium on
Information Theory (ISIT2020), Los Angeles, CA, USA, 2020, pp. 108–113.
[8]
M. Dudik and L. J. Schulman, “Reconstruction from subsequences,”
J. Combin. Theory Ser. A, vol. 103, no. 2, pp. 337–348, 2003.
[9]
O. Elishco, R. Gabrys, E. Yaakobi, and M. Médard, “Repeat-free codes,”
IEEE Trans. Inform. Theory, vol. 67, no. 9, pp. 5749–5764, 2021.
[10]
R. Gabrys and O. Milenkovic, “Unique reconstruction of coded strings from
multiset substring spectra,” IEEE Trans. Inform. Theory, vol. 65,
no. 12, pp. 7682–7696, 2019.
[11]
H. M. Kiah, G. J. Puleo, and O. Milenkovic, “Codes for DNA sequence
profiles,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp.
3125–3146, Jun. 2016.
[12]
V. I. Levenshtein, “Efficient reconstruction of sequences from their
subsequences or supersequences,” J. Combin. Theory Ser. A, vol. 93,
no. 2, pp. 310–332, 2001.
[13]
V. I. Levenshtein, “Efficient reconstruction of sequences,” IEEE
Trans. Inform. Theory, vol. 47, no. 1, pp. 2–22, 2001.
[14]
M. Levy and E. Yaakobi, “Mutually uncorrelated codes for DNA storage,”
IEEE Trans. Inform. Theory, vol. 65, no. 6, pp. 3671–3691, 2019.
[15]
B. Manvel, A. Meyerowitz, A. Schwenk, K. Smith, and P. Stockmeyer,
“Reconstruction of sequences,” Discrete Math., vol. 94, no. 3, pp.
209–219, 1991.
[16]
S. Marcovich and E. Yaakobi, “Reconstruction of strings from their substrings
spectrum,” IEEE Trans. Inform. Theory, vol. 67, no. 7, pp.
4369–4384, 2021.
[17]
S. Nassirpour, I. Shomorony, and A. Vahid, “Reassembly codes for the
chop-and-shuffle channel,” Jan. 2022. [Online]. Available:
http://arxiv.org/abs/2201.03590
[18]
S. Pattabiraman, R. Gabrys, and O. Milenkovic, “Coding for polymer-based data
storage,” IEEE Trans. on Inform. Theory (Early Access), 2023.
[19]
A. N. Ravi, A. Vahid, and I. Shomorony, “Capacity of the torn paper channel
with lost pieces,” in Proceedings of the 2021 IEEE International
Symposium on Information Theory (ISIT2021), Melbourne, Victoria, Australia,
Jul. 2021, pp. 1937–1942.
[20]
A. D. Scott, “Reconstructing sequences,” Discrete Math., vol. 175, pp.
231–238, 1997.
[21]
I. Shomorony and A. Vahid, “Torn-paper coding,” IEEE
Trans. Inform. Theory, vol. 67, no. 12, pp. 7904–7913, 2021.
[22]
E. Ukkonen, “Approximate string-matching with q-grams and maximal matches,”
Theoret. Comp. Sci., vol. 92, no. 1, pp. 191–211, 1992.
[23]
C. Wang, J. Sima, and N. Raviv, “Break-resilient codes for forensic 3D
fingerprinting,” arXiv preprint arXiv:2310.03897, 2023.
[24]
H. Wei, “Nearly optimal robust positioning patterns,” IEEE
Trans. Inform. Theory, vol. 68, no. 1, pp. 193–203, 2022.
[25]
Y. Yehezkeally, D. Bar-Lev, S. Marcovich, and E. Yaakobi, “Generalized unique
reconstruction from substrings,” IEEE Trans. Inform. Theory, vol. 69,
no. 9, pp. 5648–5659, Sep. 2023.
[26]
Y. Yehezkeally and N. Polyanskii, “On codes for the noisy substring channel,”
Sep. 2023. [Online]. Available: http://arxiv.org/abs/2102.01412v3