
Improved bounds for noisy group testing with constant tests per item

Oliver Gebhard, Oliver Johnson, Philipp Loick, Maurice Rolvien {Gebhard, Loick, Rolvien}@math.uni-frankfurt.de, Goethe University, Mathematics Institute,
10 Robert Mayer St, Frankfurt 60325, Germany.
O.Johnson@bristol.ac.uk, University of Bristol, School of Mathematics,
Woodland Road, Bristol, BS8 1UG, United Kingdom
Abstract.

The group testing problem is concerned with identifying a small set of infected individuals in a large population. At our disposal is a testing procedure that allows us to test several individuals together. In an idealized setting, a test is positive if and only if at least one infected individual is included and negative otherwise. Significant progress was made in recent years towards understanding the information-theoretic and algorithmic properties in this noiseless setting. In this paper, we consider a noisy variant of group testing where test results are flipped with a certain probability, including the realistic scenario where sensitivity and specificity can take arbitrary values. Using a test design where each individual is assigned to a fixed number of tests, we derive explicit algorithmic bounds for two commonly considered inference algorithms and thereby naturally extend the results of Scarlett & Cevher (2016) and Scarlett & Johnson (2020). We provide improved performance guarantees for the efficient algorithms in these noisy group testing models – indeed, for a large set of parameter choices the bounds provided in this paper are the strongest currently proved.

1. Introduction

1.1. Motivation and background

Suppose we have a large collection of $n$ people, a small number $k$ of whom are infected by some disease, and where only $m\ll n$ tests are available. In a landmark paper [16] from 1943, Dorfman introduced the idea of group testing. The basic idea is as follows: rather than screen one person using one test, we could mix samples from individuals in one pool, and use a single test for this whole pool. The task is to recover the infection status of all individuals using the pooled test results. Dorfman’s original work was motivated by a biological application, namely identifying individuals with syphilis. Subsequently, group testing has found a number of related applications, including detection of HIV [51], DNA sequencing [29, 37] and protein interaction experiments [35, 49]. More recently, it has been recognised as an essential tool to moderate pandemic spread [12], where identifying infected individuals fast and at a low cost is indispensable [32]. In particular, group testing has been identified as a testing scheme for the detection of COVID-19 [2, 17, 21]. From a mathematical perspective, group testing is a prime example of an inference problem where one wants to learn a ground truth from (possibly noisy) measurements [1, 8, 15]. Over the last decade, it has regained popularity, and a significant body of research has been dedicated to understanding its information-theoretic and algorithmic properties [9, 13, 14, 44, 45, 46]. In this paper, we provide improved upper bounds on the number of tests that guarantee successful inference for the noisy variant of group testing.

1.2. Related Work

1.2.1. Noiseless Group Testing

In the simplest version of group testing, we suppose that a test is positive if and only if the pool contains at least one infected individual. We refer to this as the noiseless case. In this setting, each negative test guarantees that every member of the corresponding pool is not infected, so they can be removed from further consideration. However, a positive test only tells us that at least one item in the test is defective (but not which one), and so requires further investigation. Dorfman’s original work [16] proposed a simple adaptive strategy where a small pool of individuals is tested, and where each positive test is followed up by testing every individual in the corresponding pool individually. Since then it has been an important problem to find the optimal way to recover the whole population’s infection status in the noiseless case (see [7] for a detailed survey). A simple counting argument (see for example [7, Section 1.4]) shows that to ensure recovery with zero error probability, since every possible defective set must give different test outcomes, the following must hold in the noiseless setting:

(1.1) $2^{m}\geq\binom{n}{k}\qquad\Rightarrow\qquad m\geq m^{0}_{\inf}:=\frac{1}{\log 2}k\log(n/k)$
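For illustration, with $n=10^{6}$ individuals of whom $k=100$ are infected, (1.1) gives $m\geq m^{0}_{\inf}=100\log_{2}(10^{4})\approx 1329$ tests, far fewer than the $10^{6}$ tests required to screen every individual separately.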

This can be extended to the case of recovery with small error probability, for example with the bound (see [7, Eq. (1.7)]) that the success probability

(1.2) $\mathbb{P}({\rm suc})\leq\frac{2^{m}}{\binom{n}{k}},$

meaning that the success probability must decay exponentially with the number of tests below $m^{0}_{\inf}$. Hwang [24] provided an algorithm based on repeated binary search, which is essentially optimal in terms of the number of tests required, in that it requires $m^{0}_{\inf}+O(k)$ tests, but may require many stages of testing. The question of whether non-adaptive algorithms (or even adaptive algorithms with a limited number of stages) can attain the bound (1.1) remained open until recently. [4, 14] showed that the answer depends on the prevalence of the disease, for example on the value of $\theta\in(0,1)$ in a parameterisation where the number of infected individuals $k\sim n^{\theta}$. (The result of [14] is two-fold. On the one hand, it provides a method to recover the infected individuals w.h.p. while attaining (1.1) for a certain range $\theta<\theta^{*}$. On the other hand, they show that (1.1) cannot be attained by any testing procedure for $\theta>\theta^{*}$. One finds $\theta^{*}=\log(2)\cdot(1+\log(2))^{-1}$.) Non-adaptive testing schemes can be represented through a binary $(m\times n)$-matrix that indicates which individual participates in which test. Significant research was dedicated to determining which design attains the optimal performance, although much of the recent research analysed the performance of randomized designs. Initial research focused on the case where the matrix entries are i.i.d. [3, 5, 46], which we will refer to as Bernoulli pooling. Later work considered a constant column design where each individual is assigned to a (near-)constant number of tests [6, 13, 14, 26]. Indeed, [14] showed that such a design is information-theoretically optimal in the noiseless setting, and it is to be expected that this remains true for the noisy case. To recover the ground truth from the test results and the pooling scheme, this paper focuses on two non-adaptive algorithms, COMP and DD, which are relatively simple to perform and interpret in the noiseless case. We describe them in more detail below, but in brief, COMP [10] simply builds a list of all the individuals who ever appear in a negative test and are hence certainly healthy, and assumes that the other individuals are infected. DD [5] uses COMP as a first stage and builds on it by looking for individuals who appear in a positive test that only otherwise contains individuals known to be healthy. While the noiseless case provides an interesting mathematical abstraction, it is clear that it may not be realistic in practice [40].

1.2.2. Noisy Group Testing

In medical applications [42] the two types of noise occurring in a testing procedure are related to sensitivity (the probability that a test containing an infected individual is indeed positive) and specificity (the probability that a test with only healthy individuals is indeed negative), and in that language we cannot assume the gold standard of tests with unit specificity and sensitivity. Thus, research attention in recent years has shifted towards the noisy version of group testing [10, 43, 44, 46, 47, 48]. On the one hand, the adaptive noisy case was considered in [43, 44]. On the other hand, [10, 27, 28, 33, 46, 47, 48] looked at the non-adaptive noisy case from different angles (for instance linear programming, belief propagation, and Markov Chain Monte Carlo). In [46, 47, 48] the algorithmic performance guarantees within noisy group testing under Bernoulli pooling are discussed. First, [46] obtained a converse as well as a theoretical achievability bound, but stated practical recovery as a direction for further research. Subsequently, [47, 48] shed light on this question using Bernoulli pooling. ([47] introduced an approach based on separate decoding of items for symmetric noise models. While this approach works well for small $\theta$ (in particular $\theta\rightarrow 0$), the performance drops dramatically for larger $\theta$; for most $\theta$ this approach is worse than the noisy DD discussed in [48]. There do exist some noise levels, under the very strong restriction $p=q$, where [47] improves over our results in the regime of $\theta$ very close to 0. Due to the generality of our model we will from now on focus on [48] as the benchmark for our results.) In this paper we focus on the COMP and DD algorithms, since it is possible to deduce explicit performance guarantees for them. The original COMP and DD were designed for the noiseless case and do not automatically carry over to general noisy models. However, recent work of Scarlett and Johnson [48] showed that noisy versions of these algorithms can perform well under certain noise models using i.i.d. (Bernoulli pooling) test designs, particularly focusing on $Z$ channel and reverse $Z$ channel noise. As common medical tests have different values for sensitivity and specificity [31], the analysis of a generalized noise model beyond the $Z$ and reverse $Z$ channel is warranted.

1.2.3. Model Justification

As described for example in pandemic plans developed by the EU, US and WHO [19, 38, 39], and in COVID-specific work [36], adaptive strategies may not be suitable for pandemic prevention. For example, if a test takes one day to prepare and for the results to be known, then each stage will require an extra day to perform, meaning that adaptive group testing information can be received too late to be useful. Hence the need to perform large-scale testing to identify infected individuals fast relative to the doubling time [12, 32, 36] can make adaptive group testing unsuitable to prevent an infectious disease from spreading. Furthermore, it may be difficult to preserve virus samples in a usable state for long enough to perform multi-round testing [22]. Due to its automation potential and the fact that tests can be completed in parallel (for example by the use of 96-well PCR plates [18]), the main applications of group testing such as DNA screening [11, 29, 37], HIV testing [51] and protein interaction analysis [35, 49] are non-adaptive, where all tests are specified upfront and performed in parallel. For example, while group testing strategies appear to be useful to identify individuals infected with COVID-19 (see for example [17, 21]), testing for the presence of the SARS-CoV-2 virus is not perfect [52], and so we need to understand the effect of both false positive and false negative errors in this context, with non-identical error probabilities. For this reason, we consider a general $p$-$q$ noise model in this paper. Under this model, a truly negative test is flipped with probability $p$ to display a positive test result, while a truly positive test is flipped to negative with probability $q$ (Figure 1). Its formulation is sufficiently general to accommodate the recovery of the noiseless results ($p=q=0$), the $Z$ channel ($p=0$), the reverse $Z$ channel ($q=0$) and the Binary Symmetric Channel ($p=q$). However, our results include the case of non-zero $p$ and $q$ without having to make the somewhat artificial assumption that false negative and false positive errors are equally likely. We note that it may be unrealistic to assume that the noise parameters are known exactly, and more sophisticated models may be needed to understand the real world. Nevertheless, our analysis of a generalised noise model serves as a starting point towards a full understanding of the difficulties occurring while implementing group testing algorithms in laboratories.

Figure 1. The $p$-$q$ noise model: the result of each standard noiseless group test is transmitted independently through the given noisy communication channel.

1.3. Contribution

This paper provides a simultaneous extension of [13] and [26, 48], by analysing noisy versions of COMP and DD under more general noise models for constant-column weight designs. In contrast to prior work [5, 26] assuming sampling with replacement, in this paper we use sampling without replacement, meaning that our designs have exactly the same number of tests for each item, rather than approximately the same as in those previous works. This makes little difference in practice, but may be closer to the spirit of LDPC codes for example.

We provide explicit bounds on the performance of these algorithms in a generalized noise model. We will prove that (noisy versions of) COMP as well as DD succeed with $\Theta(k\log(n/k))$ tests. Our analysis reveals the exact constants that ensure recovery with these two inference algorithms. The main results will be stated formally in Theorems 2.1 and 2.2, but we would like to give the reader a first insight into what will follow. We analyze Algorithms 1 and 2 for the constant degree model, where $m=ck\log(n/k)$ tests are performed and each individual chooses $\Delta=cd\log(n/k)$ tests uniformly at random. Let $p,q\geq 0$, $p+q<1$ and $\epsilon>0$.

We start with the performance of COMP (Algorithm 1), as stated in Theorem 2.1:

For any $\Delta:=\Delta(c,d)$ we find a threshold $\alpha:=\alpha(d,p,q)$ such that COMP succeeds in inferring the infected individuals if the number of tests

$m\geq(1+\varepsilon)m_{\text{COMP}}=\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}k\log(n/k).$

The next step on our agenda is the performance of DD (Algorithm 2), as stated in Theorem 2.2:

For any $\Delta:=\Delta(c,d)$ we find thresholds $\alpha:=\alpha(d,p,q)$ and $\beta:=\beta(d,q)$ such that DD succeeds in inferring the infected individuals if the number of tests

$m\geq(1+\varepsilon)m_{\text{DD}}(n,\theta,p,q)=\min_{\alpha,\beta,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d),c_{4}(\alpha,\beta,d)\right\}k\log(n/k).$

For all typical noise channels ($Z$, reverse $Z$ and BSC) we compare the constant-column and Bernoulli designs and find in all such instances that the required number of tests in the former is lower than in the latter, thereby improving on results from [48] and providing the strongest performance guarantees currently proved for efficient algorithms in noisy group testing.

As group testing offers an essential tool for pandemic prevention [32] and as the accuracy of medical testing is limited [31, 40], this paper provides the natural next step in the group testing literature.

1.4. Test design and notation

To formalize our notation, we write $n$ for the number of individuals in the population, $\bm{\sigma}$ for a binary vector representing the infection status of each individual, $k$ (the Hamming weight of $\bm{\sigma}$) for the number of infected individuals and $m$ for the number of tests performed. We assume that $k$ is known for the purposes of matrix design, though in practice (see [7, Remark 2.3]) it is generally enough to know $k$ up to a constant factor to design a matrix with good properties. In this paper, in line with other work such as [5], we consider a scaling $k\sim n^{\theta}$ for some fixed $\theta\in(0,1)$, referred to in [7, Remark 1.1] as the sparse regime. (The analysis directly extends to $k=\Theta(n^{\theta})$, as a constant factor in front does not influence the analysis.) In addition to the interesting phase transitions observed using this scaling, this sparse regime is particularly relevant as it was found suitable to model the early stage of a pandemic [50].

Let us next introduce the test design. With $V=(x_{i})_{i\in[n]}$ denoting the set of $n$ individuals (we use $[n]$ as an abbreviation for the set $\{1,\dots,n\}$) and $F=(a_{i})_{i\in[m]}$ the set of $m$ tests, the test design can be envisioned as a bipartite factor graph with $n$ variable nodes "on the left" and $m$ factor nodes "on the right". We draw a configuration $\bm{\sigma}\in\{0,1\}^{V}$, encoding the infection status of each individual, uniformly at random from vectors of Hamming weight $k$. The set of healthy individuals will be denoted by $V_{0}$ and the set of infected individuals by $V_{1}$. In symbols,

$V_{0}=\left\{x\in V:\bm{\sigma}(x)=0\right\}\qquad\text{and}\qquad V_{1}=V\setminus V_{0}=\left\{x\in V:\bm{\sigma}(x)=1\right\}.$

The lower bound from (1.1) suggests that in the noisy group testing setting it is natural to compare the performance of algorithms and matrix designs in terms of the prefactor of $k\log(n/k)$ in the number of tests required. To be precise, we carry out $m$ tests, and each item is assigned to exactly $\Delta$ tests chosen uniformly at random without replacement. We parameterize $m$ and $\Delta$ as

(1.3) $m=ck\log(n/k)\qquad\text{and}\qquad\Delta=cd\log(n/k)$

for some suitably chosen constants $c,d\geq 0$.

Let $\partial x$ denote the set of tests that individual $x$ appears in and $\partial a$ the set of individuals assigned to test $a$. The resulting (non-constant) collection of test degrees will be denoted by the vector $\bm{\Gamma}=(\bm{\Gamma}_{a})_{a\in[m]}$. Further, let

(1.4) $\Gamma_{\min}=\min_{a\in[m]}\Gamma_{a}\qquad\text{and}\qquad\Gamma_{\max}=\max_{a\in[m]}\Gamma_{a}.$

Throughout, $\bm{G}=\bm{G}(n,m,\Delta)$ describes the random bipartite factor graph from this construction.
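To make the construction concrete, the following minimal Python sketch (function and variable names are ours, assuming NumPy is available) samples such a design; each column of the resulting matrix has weight exactly $\Delta$, while the test degrees $\bm{\Gamma}_{a}$ fluctuate.

```python
import numpy as np

def sample_constant_column_design(n, m, Delta, seed=None):
    """Sample the test design G(n, m, Delta): each of the n individuals
    is assigned to exactly Delta of the m tests, chosen uniformly at
    random without replacement (independently across individuals)."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n), dtype=bool)  # A[a, x] = True iff x takes part in test a
    for x in range(n):
        A[rng.choice(m, size=Delta, replace=False), x] = True
    return A
```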

Now consider the outcome of the tests. Recall from above that a standard noiseless group test $a$ gives a positive result if and only if there is at least one defective item contained in the pool, or equivalently if $\sum_{x\in\partial a}\bm{\sigma}(x)\geq 1$. Even in the noisy case, this sum is a useful object to consider. Writing $\bm{1}$ for the indicator function, we define

(1.5) $\bm{\sigma}^{*}(a)=\bm{1}\left\{\sum_{x\in\partial a}\bm{\sigma}(x)\geq 1\right\}$

to be the outcome we would observe in the noiseless case using the test matrix corresponding to $\bm{G}$. We will say that test $a$ is truly positive if $\bm{\sigma}^{*}(a)=1$ and truly negative otherwise.

However, we do not observe the values of $\bm{\sigma}^{*}(a)$ directly, but rather see what we will refer to as the displayed test outcomes $\hat{\bm{\sigma}}(a)$ – the outcomes of sending the true outcomes $\bm{\sigma}^{*}(a)$ independently through the $p$-$q$ channel of Figure 1. Since in this model a truly positive test remains positive with probability $1-q$ and a truly negative test is displayed as positive with probability $p$, we can write

(1.6) $\hat{\bm{\sigma}}(a)=\bm{1}\left\{{\rm Be}(p)=1\right\}\left(1-\bm{\sigma}^{*}(a)\right)+\bm{1}\left\{{\rm Be}(1-q)=1\right\}\bm{\sigma}^{*}(a)$

where ${\rm Be}(r)$ denotes a Bernoulli random variable with parameter $r$ independent of all other randomness in the model. For models with binary outputs, this is the most general channel satisfying the noisy defective channel property of [7, Definition 3.3], though more general models are possible under the only defects matter property [7, Definition 3.2], where the probability of a test being positive depends on the number of infected individuals it contains.
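As an illustration of (1.5) and (1.6), the following sketch (helper names are ours, continuing the NumPy setup above) computes the true outcomes and passes them through the $p$-$q$ channel.

```python
import numpy as np

def displayed_outcomes(A, sigma, p, q, seed=None):
    """Compute sigma*(a) per (1.5) and the displayed outcomes per (1.6):
    a truly negative test shows positive with probability p, a truly
    positive test shows positive with probability 1 - q."""
    rng = np.random.default_rng(seed)
    truly_positive = (A.astype(int) @ np.asarray(sigma)) >= 1
    u = rng.random(A.shape[0])  # one uniform per test drives the flip
    # Be(p) for truly negative tests, Be(1 - q) for truly positive ones
    return np.where(truly_positive, u < 1 - q, u < p)
```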

Note that if $p+q>1$, we can preprocess the outputs from (1.6) by flipping them, i.e. setting ${\widetilde{p}}=1-p$ and ${\widetilde{q}}=1-q$, where ${\widetilde{p}}+{\widetilde{q}}<1$. Hence without loss of generality we will assume throughout that $p+q<1$. In the case $p+q=1$, the test outcomes are independent of the inputs, and we cannot hope to find the infected individuals – see Corollary 2.3.

With $\bm{m}_{0}$ being the number of truly negative tests, let $\bm{m}_{0}^{f}$ be the number of truly negative tests that are flipped to display a positive test result and $\bm{m}_{0}^{u}$ be the number of truly negative tests that are unflipped. Similarly, define $\bm{m}_{1}$ as the number of truly positive tests, of which $\bm{m}_{1}^{f}$ are flipped to a negative test result and of which $\bm{m}_{1}^{u}$ are unflipped. For reference, for $t\in\{0,1\}$ we write

$\bm{m}_{t}=\left|\left\{a:\bm{\sigma}^{*}(a)=t\right\}\right|,\qquad \bm{m}_{t}^{f}=\left|\left\{a:\bm{\sigma}^{*}(a)=t,\ \hat{\bm{\sigma}}(a)\neq t\right\}\right| \qquad\text{and}\qquad \bm{m}_{t}^{u}=\left|\left\{a:\bm{\sigma}^{*}(a)=t,\ \hat{\bm{\sigma}}(a)=t\right\}\right|.$

Here we use bold letters to indicate random variables. Throughout the paper, we use the standard Landau notation $o(\cdot),O(\cdot),\Theta(\cdot),\Omega(\cdot),\omega(\cdot)$ and define $0\log 0=0$. Furthermore, we say that a property $\mathcal{P}$ holds with high probability (w.h.p.) if $\mathbb{P}\left(\mathcal{P}\right)\to 1$ as $n\to\infty$. In order to quantify the performance of our algorithms, for any $0<r\neq s<1$, we write

(1.7) $D_{\mathrm{KL}}\left(r\|s\right):=r\log\left(\frac{r}{s}\right)+(1-r)\log\left(\frac{1-r}{1-s}\right),$

for the relative entropy of a Bernoulli random variable with parameter $r$ to a Bernoulli random variable with parameter $s$, commonly referred to as the Kullback–Leibler divergence. Here and throughout the paper we use $\log$ to denote the natural logarithm. For $r$ or $s$ equal to $0$ or $1$ we define the value of $D_{\mathrm{KL}}\left(\cdot\|\cdot\right)$ (possibly infinite) on grounds of continuity, so for example $D_{\mathrm{KL}}\left(0\|s\right)=-\log(1-s)$.
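A direct transcription of (1.7) with these continuity conventions, as a small Python helper (our own naming) that the later numerical illustrations can reuse:

```python
import math

def d_kl(r, s):
    """Kullback-Leibler divergence D_KL(r || s) of Be(r) from Be(s),
    in nats, per (1.7), with 0 log 0 = 0; the value may be infinite,
    e.g. d_kl(0, s) returns -log(1 - s) and d_kl(r, 0) returns inf."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf
        return a * math.log(a / b)
    return term(r, s) + term(1.0 - r, 1.0 - s)
```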

2. Main results

With the test design and notation in place, we are now in a position to state our main results. Theorems 2.1 and 2.2 are the centerpiece of this paper, featuring improved bounds for the noisy group testing problem for the general $p$-$q$ model. We follow up in Section 2.2 with a discussion of the combinatorics underlying both algorithms, and provide a converse bound in Section 2.3. Subsequently, in Section 2.4 we show how the bounds simplify when we consider the special cases of the Z, the reverse Z and Binary Symmetric Channel. Finally, in Section 2.5 we derive sufficient conditions under which DD requires fewer tests than the COMP algorithm and compare the bounds of our constant-column design against the Bernoulli design employed in prior literature.

2.1. Bounds for Noisy Group Testing

We will consider two well-known algorithms from the noiseless setting to identify infected individuals in this paper. First, we study a noisy variant of the COMP algorithm, originally introduced in [10].

1. Declare every individual that appears in $\alpha\Delta$ or more displayed negative tests as healthy.
2. Declare all remaining individuals as infected.
Algorithm 1: The noisy COMP algorithm

Note that for $\alpha\Delta=1$ the formulation of Algorithm 1 coincides with the standard ${\tt COMP}$ algorithm, where an individual is classified as healthy if it appears in at least one displayed negative test, which is a sufficient condition in the noiseless case. We now state the first main result of this paper.

Theorem 2.1 (Noisy COMP).

Let $p,q\geq 0$, $p+q<1$, $d\in(0,\infty)$, $\alpha\in\left(q,\ e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)$. Suppose that $0<\theta<1$ and let

$m_{\text{COMP}}=m_{\text{COMP}}(n,\theta,p,q)=\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}k\log(n/k)$
where $b_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|q\right)}$
and $b_{2}(\alpha,d)=\frac{1}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)}.$

If $m\geq(1+\varepsilon)m_{\text{COMP}}$ for some $\varepsilon>0$, noisy COMP will recover $\bm{\sigma}$ w.h.p. given test design $\bm{G}$ and test results $\hat{\bm{\sigma}}$.

The noisy variant of the DD algorithm of [5] was introduced in [48] and reads as follows:

1. Declare every individual that appears in $\alpha\Delta$ or more displayed negative tests as healthy and remove such individuals from every assigned test.
2. Declare every yet unclassified individual who is now the only unclassified individual in $\beta\Delta$ or more displayed positive tests as infected.
3. Declare all remaining individuals as healthy.
Algorithm 2: The noisy DD algorithm [48]

Note that the formulation of Algorithm 2 reduces to the noiseless version of DD introduced in [5] by taking $\alpha\Delta=\beta\Delta=1$. This is because in the noiseless setting a single negative test, or a single positive test containing only individuals already classified as uninfected, is decisive. Furthermore, note that for $\beta=0$ noisy DD and noisy COMP are the same; from now on we assume $\beta>0$. The proof of Theorem 2.1 can be found in Appendix B. We now state the second main result of the paper.

Theorem 2.2 (Noisy DD).

Let $p,q\geq 0$, $p+q<1$, $d\in(0,\infty)$, $\alpha\in\left(q,\ e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)$ and $\beta\in\left(0,\ e^{-d}(1-q)\right)$, and define $w=e^{-d}p+(1-e^{-d})(1-q)$. Suppose that $0<\theta<1$ and let

$m_{\text{DD}}=m_{\text{DD}}(n,\theta,p,q)=\min_{\alpha,\beta,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d),c_{4}(\alpha,\beta,d)\right\}k\log(n/k)$
where $c_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|q\right)}$
and $c_{2}(\alpha,d)=\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|1-w\right)}$
and $c_{3}(\beta,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\beta\|(1-q)e^{-d}\right)}$
and $c_{4}(\alpha,\beta,d)=\max_{1-\alpha\leq z\leq 1}\left\{\frac{1}{1-\theta}\frac{1}{d\left(D_{\mathrm{KL}}\left(z\|w\right)+\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}zD_{\mathrm{KL}}\left(\frac{\beta}{z}\,\middle\|\,\frac{e^{-d}p}{w}\right)\right)}\right\}.$

If $m\geq(1+\varepsilon)m_{\text{DD}}$ for some $\varepsilon>0$, then noisy DD will recover $\bm{\sigma}$ w.h.p. given test design $\bm{G}$ and test results $\hat{\bm{\sigma}}$.

The proof of Theorem 2.2 can be found in Appendix C. While the bounds appear cumbersome at first glance, the optimization is of finite dimension and for every specific value of $p$ and $q$ it can be efficiently solved to arbitrary precision, yielding explicit values for $m_{\text{COMP}}$ and $m_{\text{DD}}$. For illustration purposes, we will calculate those bounds for several values of $p,q$ and $\theta$.
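As a sketch of how such an optimization might be carried out numerically (a crude grid search, reusing the d_kl helper from Section 1.4; all names are ours, and a finer grid or a proper optimizer would give sharper constants), consider:

```python
import numpy as np

def comp_prefactor(theta, p, q, grid=400):
    """Approximate min_{alpha,d} max{b1, b2} from Theorem 2.1,
    i.e. the constant in front of k log(n/k) in m_COMP."""
    best = np.inf
    for d in np.linspace(0.05, 3.0, grid):
        upper = np.exp(-d) * (1 - p) + (1 - np.exp(-d)) * q  # right end of alpha range
        for alpha in np.linspace(q, upper, grid + 2)[1:-1]:  # interior points only
            b1 = theta / (1 - theta) / (d * d_kl(alpha, q))
            b2 = 1 / (1 - theta) / (d * d_kl(alpha, upper))
            best = min(best, max(b1, b2))
    return best
```

The analogous search over $(\alpha,\beta,d)$, with an inner maximization over $z$, evaluates the $m_{\text{DD}}$ prefactor of Theorem 2.2.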

2.2. The combinatorics of the noisy group testing algorithms

In the following, we outline the combinatorial structures that Algorithms 1 and 2 take advantage of.
We start by defining the types of tests that are relevant for the classification of an individual $x_{i}$ under COMP and DD. In the first stage we find

  • Type DN: Displayed negative tests

  • Type DP: Displayed positive tests

Note that the only available information during the first stage of the algorithms is the test result and the pooling structure – no information about the individuals’ infection status is available. We give an illustration on the left hand side of Figure 2. After this step COMP terminates by declaring all remaining individuals as infected.

The DD algorithm continues with a second step which considers just the displayed positive tests, together with the estimate of the set of uninfected individuals obtained in the first step. Now distinguish the following two types, illustrated on the right hand side of Figure 2:

  • Type Displayed-Positive-Single (DP-S): Displayed positive tests in which all other individuals are already declared as uninfected.

  • Type Displayed-Positive-Multiple (DP-M): Displayed positive tests with at least one other individual that is not contained in the estimated set of uninfected individuals.

2.2.1. The noisy COMP algorithm

To get started, let us shed light on the combinatorics of noisy COMP (Algorithm 1). For the noiseless case, the COMP algorithm classifies each individual that appears in at least one negative test as healthy and all other individuals as infected, since the participation in a negative test is a sufficient condition for the individual to be healthy.

For the noisy case, the situation is not as straightforward, since an infected individual might appear in displayed negative tests that were flipped when sent through the noisy channel. Thus, a single negative test is not definitive evidence that an individual is healthy. Yet, we can use the number of negative tests to tell the infected individuals apart from the healthy individuals.

Clearly, noisy COMP (Algorithm 1) using a threshold $\alpha\Delta$ succeeds if no healthy individual appears in fewer than $\alpha\Delta$ displayed negative tests and no infected individual appears in $\alpha\Delta$ or more displayed negative tests. To this end, we define

(2.1) $\bm{N}_{x}=\left|\left\{a\in\partial x:\hat{\bm{\sigma}}(a)=0\right\}\right|$

for the number of displayed negative tests that item $x$ appears in. In terms of Figure 2, the algorithm determines the infection status by counting the number of tests of Type DN.
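A compact sketch of this counting rule (names are ours, using a design matrix A and displayed outcomes as in Section 1.4):

```python
import numpy as np

def noisy_comp(A, outcomes, alpha, Delta):
    """Noisy COMP (Algorithm 1): count, for every individual x, the
    number N_x of displayed negative tests containing x, per (2.1),
    and declare x infected iff N_x < alpha * Delta."""
    N = A[~outcomes].sum(axis=0)   # N_x: count of Type-DN tests per individual
    return N < alpha * Delta       # True = declared infected
```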

Figure 2. The relevant neighborhood structures for the analysis of the algorithms: on the left for the first stage and on the right for the second step. Rectangles represent tests (displayed positive in red, displayed negative in blue). Blue circles represent individuals that have been classified as healthy in the first step of DD (or by COMP). White circles represent individuals that are unclassified in the current stage. We refer to displayed negative tests as Type DN, displayed positive tests as Type DP, displayed positive tests with a single unclassified individual as Type DP-S and displayed positive tests with multiple unclassified individuals as Type DP-M.

2.2.2. The noisy DD algorithm

As in the prior section, let us first consider the noiseless DD algorithm. The first step is identical to COMP, classifying all individuals that are contained in at least one negative test as healthy. In a second step, the algorithm checks each individual to see if it is contained in a positive test as the only remaining unclassified individual after the first step, in which case it must be infected.

Again, the situation is more intricate when we add noise, since neither a single negative test gives us confidence that an individual is healthy, nor does a positive test in which the individual is the single remaining unclassified individual after the first step inform us that this individual must be infected. Instead we count and compare the number of such tests. The first step of the noisy DD algorithm is identical to noisy COMP, but we are not required to identify all healthy individuals in the first step (we are able to keep some unclassified for the second round). Thus, after the first step, we are left with all infected individuals $V_{1}$ (as the algorithm did not try to classify any individual as infected in the first step) and a set of yet unclassified healthy individuals (as some of them might exhibit a first neighbourhood that is not sufficient for a clear first round classification), which we will denote by $V_{0,\text{PD}}$. These are healthy individuals who did not appear in sufficiently many displayed negative tests to be declared healthy with confidence in the first step. (The thresholds are chosen such that no infected individual is classified as uninfected in the first round.) In symbols, for some $\alpha\in(0,1)$

$V_{0,\text{PD}}=\left\{x\in V_{0}:\bm{N}_{x}<\alpha\Delta\right\}.$

To tell $V_{1}$ and $V_{0,\text{PD}}$ apart, we consider the number of displayed positive tests $\bm{P}_{x}$ in which the individual $x$ appears on its own after removing the individuals $V_{0}\setminus V_{0,\text{PD}}$ that were already declared healthy in the first step, i.e.

(2.2) $\bm{P}_{x}=\left|\left\{a\in\partial x:\hat{\bm{\sigma}}(a)=1\text{ and }\partial a\setminus\left\{x\right\}\subset V_{0}\setminus V_{0,\text{PD}}\right\}\right|$

Referring to Figure 2, the second step of the algorithm is based on counting tests of Type DP-S; tests of Type DP-M contain another individual from $V_{0,\text{PD}}\cup V_{1}$ that remains unclassified after the first step. The noisy DD algorithm takes advantage of the fact that it is less likely for an individual $x\in V_{0,\text{PD}}$ to appear as the only yet unclassified individual in a displayed positive test than it is for an individual $x\in V_{1}$. For $x\in V_{0,\text{PD}}$ such a test would be truly negative and would have to have been flipped (which occurs with probability $p$) to display a positive test result. Conversely, an individual $x\in V_{1}$ renders any of its tests truly positive, so the only requirements are that the test otherwise contains only individuals already declared healthy and that it is not flipped (the latter occurring with probability $1-q$). For this reason, we will see that the distribution of $\bm{P}_{x}$ differs between $x\in V_{1}$ and $x\in V_{0,\text{PD}}$, with the gap $(1-q)-p>0$ determining the size of this difference.
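The full two-stage procedure can be sketched along the same lines (again with our own helper names); the second stage computes $\bm{P}_{x}$ of (2.2) by restricting displayed positive tests to their unclassified members:

```python
import numpy as np

def noisy_dd(A, outcomes, alpha, beta, Delta):
    """Noisy DD (Algorithm 2). Stage 1: declare x healthy iff
    N_x >= alpha * Delta. Stage 2: count Type-DP-S tests P_x per (2.2)
    and declare x infected iff P_x >= beta * Delta."""
    N = A[~outcomes].sum(axis=0)
    unclassified = N < alpha * Delta            # V_1 together with V_{0,PD}
    pos = A[outcomes] & unclassified            # unclassified members of DP tests
    dp_s = pos[pos.sum(axis=1) == 1]            # DP tests with one such member
    P = dp_s.sum(axis=0)                        # P_x for every individual
    return unclassified & (P >= beta * Delta)   # final estimate of infected set
```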

2.3. The Channel Perspective of noisy group testing

Motivated by (1.1), we can describe the bounds in terms of rate, in a Shannon-theoretic sense. That is, we follow the common notion to define the rate (bits learned per test) of an algorithm in this setting (for instance as in [9]) to be

$R:=\frac{\log\binom{n}{k}}{m\log 2}\sim\frac{k\log(n/k)}{m\log 2}.$

(Recall that we take logarithms to base $e$ throughout this paper.) For example, the fact that Theorems 2.1 and 2.2 show that noisy COMP and DD respectively can succeed w.h.p. with $m\geq(1+\epsilon)ck\log(n/k)$ tests for some $c$ is equivalent to the fact that $R=1/(c\log 2)$ is an achievable rate in a Shannon-theoretic sense.

We now give a counterpart to these two theorems by stating a universal converse for the $p$-$q$ channel below, improving on the universal counting bound from (1.1). The starting observation (see [7, Theorem 3.1]) is that no group testing algorithm can succeed w.h.p. with rate greater than $C_{\text{Chan}}$, the Shannon capacity of the corresponding noisy communication channel. Thus, we cannot hope to succeed w.h.p. with $m<(1-\epsilon)ck\log(n/k)$ tests where $c=1/(C_{\text{Chan}}\log 2)$. Hence, as a direct consequence of the value of the channel capacity of the $p$-$q$ channel, we deduce the following statement.

Corollary 2.3.

Let $p,q\geq 0$, $p+q<1$ and $\epsilon>0$, write $h(\cdot)$ for the binary entropy in nats (logarithms taken to base $e$) and $\phi=\phi(p,q)=(h(p)-h(q))/(1-p-q)$. If we define

$m_{\text{COUNT}}=\left(\frac{1}{D_{\mathrm{KL}}\left(q\|1/(1+e^{\phi})\right)}\right)k\log(n/k),$

then for $m\leq(1-\epsilon)m_{\text{COUNT}}$ no algorithm can recover $\bm{\sigma}$ w.h.p. for any matrix design.

Remark 2.4.

This result follows from Lemma F.1 derived in Appendix F below. As discussed there, this derivation (combined with the fact that each test is negative with probability $e^{-d}$) suggests a choice of density for the matrix:

$d=d^{*}_{{\rm ch}}=\log(1-p-q)-\log\left(\frac{1}{1+e^{\phi}}-q\right).$

While a choice of $\Delta=c\cdot d^{*}_{{\rm ch}}\cdot\log(n/k)$ is not necessarily optimal, it may be regarded as a sensible heuristic that provides good rates for a range of $p$ and $q$ values.
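For a concrete pair $(p,q)$, the converse prefactor and the heuristic density can be evaluated directly (a sketch reusing d_kl from Section 1.4; names are ours, and we assume, as in the formulas above, that $1/(1+e^{\phi})>q$):

```python
import math

def counting_bound(p, q):
    """Evaluate phi(p, q), the prefactor of k log(n/k) in m_COUNT from
    Corollary 2.3, and the heuristic density d_ch* from Remark 2.4."""
    def h(x):  # binary entropy in nats
        return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)
    phi = (h(p) - h(q)) / (1 - p - q)
    s = 1.0 / (1.0 + math.exp(phi))   # output law entering the divergence
    prefactor = 1.0 / d_kl(q, s)
    d_ch = math.log(1 - p - q) - math.log(s - q)
    return prefactor, d_ch
```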

2.4. Applying the results to standard channels

With Theorem 2.1 and Theorem 2.2 we derived achievable rates for the generalized $p$-$q$ model (see Figure 1). Prior research considered the $Z$ channel where $p=0$ and $q>0$, the reverse $Z$ channel where $p>0$ and $q=0$, and the Binary Symmetric Channel with $p=q>0$. These channels are common models in coding theory [41], but are also often considered in medical applications [30, 31] concerned with taking imperfect sensitivity ($q>0$), specificity ($p>0$) or both ($p>0$ and $q>0$) into account. As a consequence, we also compare our results with the most recent results of Scarlett and Johnson [48]. In the following section we demonstrate how performance guarantees for these channels can be obtained directly from our main theorems.

2.4.1. Recovery of the noiseless model

Note that the bounds in Corollary 2.5 and Corollary 2.6 are already known [10, 26]. We would like to give the reader an idea of how our cumbersome looking bounds relate to the more accessible bounds given for the noiseless case. First, we show that the noiseless bounds can simply be recovered by letting $p,q\rightarrow 0$. In the noiseless setting, it is sufficient, by definition of the algorithm, to set both $\alpha\Delta=1$ and $\beta\Delta=1$. To see why, observe that in the absence of noise a single negative test is sufficient evidence that an individual is healthy. Conversely, a single positive test where the individual only appears with individuals already declared healthy implies that this particular individual must surely be infected. As shown in [13], the optimal choice of the density parameter $d$ in the constant-column design in the noiseless setting is $\log(2)$. Applying these values to Theorem 2.1 we recover the noiseless bound for COMP. These bounds were first stated in [10].

Corollary 2.5 (COMP in the noiseless setting).

Let $p,q\rightarrow 0$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt COMP},\text{noiseless}}=\frac{1}{(1-\theta)\log^{2}2}k\log(n/k).$

Furthermore, let $m_{\tt COMP}(n,\theta,p,q)$ be defined as in Theorem 2.1. Then we find

$m_{\tt COMP}(n,\theta,p,q)\underset{p,q\rightarrow 0}{\longrightarrow}m_{{\tt COMP},\text{noiseless}}.$
Proof.

We start by taking the bounds $b_{1}(\alpha,d)$ and $b_{2}(\alpha,d)$. To see how this boils down to $m_{{\tt COMP},\text{noiseless}}$, we use the well-known fact that within the near-constant column design $d=\log(2)$ is the optimal choice [13]. Now, taking both $p,q\rightarrow 0$, one sees that $b_{1}(\alpha,\log(2))$ vanishes, since $\log(q)\rightarrow-\infty$ as $q\rightarrow 0$. Turning our focus to the second bound, we see that it boils down to

$b_{2}(\alpha,\log(2))=\frac{1}{(1-\theta)\log(2)}\cdot\frac{1}{\log(2)+\alpha\log(\alpha)+(1-\alpha)\log(1-\alpha)}.$

On the one hand, we realize that $\alpha\log(\alpha)+(1-\alpha)\log(1-\alpha)$ is negative for all $\alpha\in(0,1)$. This leads to

$b_{2}(\alpha,\log(2))>b_{2}(0,\log(2)).$

On the other hand, we realize that in the noiseless case a single negative test is sufficient for a classification as uninfected. Therefore we may choose $\alpha>0$ sufficiently small. Indeed, for each $\alpha$ we can choose $\varepsilon:=\varepsilon(\alpha)>0$ appropriately, such that the bounds given in Theorem 2.1 recover the noiseless case. ∎

We also recover the noiseless bounds for the DD algorithm as stated in [26].

Corollary 2.6 (DD in the noiseless setting).

Let $p,q\rightarrow 0$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt DD},\text{noiseless}}=\max\left\{1,\frac{\theta}{1-\theta}\right\}\frac{1}{\log^{2}2}k\log(n/k).$

Furthermore, let $m_{\tt DD}(n,\theta,p,q)$ be defined as in Theorem 2.2. Then we find

$m_{\tt DD}(n,\theta,p,q)\underset{p,q\rightarrow 0}{\longrightarrow}m_{{\tt DD},\text{noiseless}}.$
Proof.

We start by taking $c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d)$ and $c_{4}(\alpha,\beta,d)$ as defined in Theorem 2.2. First consider $c_{4}(\alpha,\beta,d)$. By assumption we have $\beta>0$, and therefore the indicator equals 1 as soon as we let $p\rightarrow 0$. Furthermore, for $p\rightarrow 0$ we get $-\log(p)\rightarrow\infty$ and find $c_{4}\rightarrow 0$. Next consider $c_{1}(\alpha,d)$. By a similar argument, $c_{1}(\alpha,d)\rightarrow 0$ for $q\rightarrow 0$, as in this case $-\log(q)\rightarrow\infty$. Therefore we are left with $c_{2}(\alpha,d)$ and $c_{3}(\beta,d)$. Again, we use the well-known fact that in the noiseless case $d=\log(2)$ is the optimal choice. Therefore, with $p,q\rightarrow 0$, the two remaining bounds read as follows:

$c_{2}(\alpha,\log(2))=\frac{1}{\log(2)\left(\log(2)+\alpha\log(\alpha)+(1-\alpha)\log(1-\alpha)\right)}$
$c_{3}(\beta,\log(2))=\frac{\theta}{1-\theta}\cdot\frac{1}{\log(2)\left(\log(2)+\beta\log(\beta)+(1-\beta)\log(1-\beta)\right)}$

Again we see that $x\log(x)+(1-x)\log(1-x)$ is negative for $x\in(0,1)$. Therefore we find

$c_{2}(\alpha,\log(2))>c_{2}(0,\log(2))$
$c_{3}(\beta,\log(2))>c_{3}(0,\log(2))$

Now, as before, a single negative test, as well as a single positive test containing only individuals already classified as uninfected, is sufficient in the noiseless case. Therefore we can choose $\alpha,\beta>0$ sufficiently small. Indeed, for each $\alpha,\beta>0$ one can choose $\varepsilon:=\varepsilon(\alpha,\beta)$ appropriately such that the bounds of Theorem 2.2 recover the noiseless case. ∎

2.4.2. The Z channel

In the $Z$ channel, we have $p=0$ and $q>0$, i.e. no truly negative test displays a positive test result. Thus, in this case a single positive test with only one unclassified individual is a clear indication of infection; therefore we can again choose $\beta>0$ sufficiently small and remain agnostic about $\alpha$ and $d$. The bounds for COMP and DD thus read as follows.

Corollary 2.7 (Noisy COMP for the Z channel).

Let $p\rightarrow 0$, $0<q<1$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt COMP},Z}=\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}k\log(n/k)$
with $b_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|q\right)}$ and $b_{2}(\alpha,d)=\frac{1}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}+\left(1-e^{-d}\right)q\right)}.$

If $m>(1+\varepsilon)m_{{\tt COMP},Z}$, noisy COMP will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Corollary 2.8 (Noisy DD for the Z channel).

Let $p\rightarrow 0$, $0<q<1$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt DD},Z}=\min_{\alpha,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(d)\right\}k\log(n/k)$
with $c_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|q\right)}$ and $c_{2}(\alpha,d)=\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}+\left(1-e^{-d}\right)q\right)}$
and $c_{3}(d)=\frac{\theta}{1-\theta}\frac{1}{-d\log\left(1-e^{-d}(1-q)\right)}.$

If $m>(1+\varepsilon)m_{{\tt DD},Z}$, noisy DD will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Proof.

The bounds $c_{1}$ and $c_{2}$ follow directly from Theorem 2.2 by letting $p\rightarrow 0$. An immediate consequence of $p\rightarrow 0$ is that $-\log(p)\rightarrow\infty$, so $c_{4}\rightarrow 0$ and the bound is trivial in this case. For $c_{3}$ we use the fact that for $\beta>0$ sufficiently small we find $D_{\mathrm{KL}}\left(\beta\|e^{-d}(1-q)\right)=-\log\left(1-e^{-d}(1-q)\right)-\delta(\beta)$ for some $\delta(\beta)>0$. Note that by definition of the noise model we may choose an arbitrary $\beta_{\min}$ very close to zero and, as a consequence, $\beta=\beta_{\min}$, leading to $\delta(\beta)\rightarrow\delta_{\min}$. The assertion follows as for each $\beta$ we may choose $\varepsilon:=\varepsilon(\beta)>0$ such that $(1+\varepsilon)>\left(1+\varepsilon\left(\beta_{\min}\right)\right)$. ∎

An illustration of the bounds from Corollary 2.7 and 2.8 for sample values of $q$ is shown in Figure 5.

2.4.3. Reverse Z channel

In the reverse $Z$ channel, we have $q=0$ and $p>0$, i.e. no truly positive test displays a negative test result. Thus, we may choose $\alpha>0$ sufficiently small and remain agnostic about $\beta$ and $d$. The bounds for noisy COMP and DD thus read as follows.

Corollary 2.9 (Noisy COMP for the Reverse Z channel).

Let $0<p<1$, $q\rightarrow 0$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt COMP},\text{rev Z}}=\frac{1}{1-\theta}\min_{d}\left\{\frac{1}{-d\log\left(1-e^{-d}(1-p)\right)}\right\}k\log(n/k).$

If $m>(1+\varepsilon)m_{{\tt COMP},\text{rev Z}}$, noisy COMP will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Proof.

The corollary follows from Theorem 2.1 and the fact that for $q\rightarrow 0$ one finds that $D_{\mathrm{KL}}\left(\alpha\|0\right)$ diverges, so that $b_{1}\rightarrow 0$ gives only a trivial bound in this case. Furthermore, for sufficiently small $\alpha>0$ we get $D_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)\right)\rightarrow-\log\left(1-e^{-d}(1-p)\right)-\delta(\alpha)$. Due to the noise assumption, we may choose an arbitrary $\alpha_{\min}$ very close to zero and $\alpha=\alpha_{\min}$, which leads to $\delta(\alpha)\rightarrow\delta\left(\alpha_{\min}\right)$. The assertion follows by choosing $\varepsilon:=\varepsilon(\alpha)>0$ such that $(1+\varepsilon)>\left(1+\varepsilon\left(\alpha_{\min}\right)\right)$. ∎

Note that Corollary 2.9 does not yield an immediate closed form expression for the optimal value of $d$.

Corollary 2.10 (Noisy DD in the Reverse Z channel).

Let $0<p<1$, $q\rightarrow 0$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt DD},\text{rev Z}}=\min_{\beta,d}\max\left\{c_{2}(d),c_{3}(\beta,d),c_{4}(\beta,d)\right\}k\log(n/k)$
with $c_{2}(d)=\frac{1}{-d\log\left(1-e^{-d}(1-p)\right)}$ and $c_{3}(\beta,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\beta\|e^{-d}\right)}$
and $c_{4}(\beta,d)=\frac{1}{1-\theta}\frac{1}{d\left(-\log\left(1-e^{-d}(1-p)\right)+D_{\mathrm{KL}}\left(\beta\,\middle\|\,\frac{e^{-d}p}{e^{-d}p+\left(1-e^{-d}\right)}\right)\right)}.$

If $m>(1+\varepsilon)m_{{\tt DD},\text{rev Z}}$, noisy DD will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Proof.

First of all, we let $q\rightarrow 0$ and find $c_{1}\rightarrow 0$, as $-\log(q)\rightarrow\infty$. The bounds $c_{2},c_{3}$ follow from Theorem 2.2 and the same manipulations as above. For $c_{4}$, we again see that by definition of the noise model we may choose $\alpha>0$ as close to zero as we like. Therefore $(1-\alpha)$ is close to 1, which leads to $z\rightarrow 1$. The assertion follows as for each $\alpha$ we can choose $\varepsilon:=\varepsilon(\alpha)>0$ such that $(1+\varepsilon)>\left(1+\varepsilon\left(\alpha_{\min}\right)\right)$. ∎

An illustration of the bounds of Corollary 2.9 and 2.10 for sample values of $p$ is shown in Figure 6.

2.4.4. Binary Symmetric Channel

In the Binary Symmetric Channel (BSC), we set $p=q>0$. Even though information-theoretic arguments would suggest setting $d=\log 2$, we formulate the expression below for general $d$. We also keep the threshold parameters $\alpha$ and $\beta$. The bounds for the noisy DD and COMP only simplify slightly.

Corollary 2.11 (Noisy COMP in the Binary Symmetric Channel).

Let $0<p=q<1/2$, $0<\theta<1$ and $\varepsilon>0$. Further, let

$m_{{\tt COMP},\text{BSC}}=\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}k\log(n/k)$
with $b_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|p\right)}$ and $b_{2}(\alpha,d)=\frac{1}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}+p-2e^{-d}p\right)}.$

If $m>(1+\varepsilon)m_{{\tt COMP},\text{BSC}}$, noisy COMP will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Corollary 2.12 (Noisy DD in the Binary Symmetric Channel).

Let $0<p=q<1/2$, $0<\theta<1$ and $\varepsilon>0$, and define $v=1-e^{-d}-p+2e^{-d}p$. Further, let

$m_{{\tt DD},\text{BSC}}=\min_{\alpha,\beta,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d),c_{4}(\alpha,\beta,d)\right\}k\log(n/k)$
with $c_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|p\right)}$ and $c_{2}(\alpha,d)=\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}+p-2e^{-d}p\right)}$
and $c_{3}(\beta,d)=\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\beta\|(1-p)e^{-d}\right)}$
and $c_{4}(\alpha,\beta,d)=\max_{1-\alpha\leq z\leq 1}\left\{\frac{1}{1-\theta}\frac{1}{d\left(D_{\mathrm{KL}}\left(z\|v\right)+\bm{1}\left\{\beta>\frac{ze^{-d}p}{v}\right\}zD_{\mathrm{KL}}\left(\frac{\beta}{z}\,\middle\|\,\frac{e^{-d}p}{v}\right)\right)}\right\}.$

If $m>(1+\varepsilon)m_{{\tt DD},\text{BSC}}$, noisy DD will recover $\bm{\sigma}$ w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

An illustration of the bounds of Corollary 2.11 and 2.12 is shown in Figure 7.

2.5. Comparison of noisy COMP and DD

An obvious next question is to find conditions under which the noisy DD algorithm requires fewer tests than noisy COMP. For the noiseless setting, it can easily be shown that DD provably outperforms COMP for all $\theta\in(0,1)$. For the noisy case, matters are slightly more complicated.

Recall that noisy COMP classifies all individuals appearing in fewer than $\alpha\Delta$ displayed negative tests as infected, while noisy DD additionally requires such individuals to appear in $\beta\Delta$ or more displayed positive tests as the only yet unclassified individual. Thus, it might well be that an infected individual is classified correctly by noisy COMP, while it is missed by the noisy DD algorithm.

That being said, our simulations indicate that noisy DD generally requires fewer tests than noisy COMP, but for the reason mentioned above we can only prove that for the reverse Z channel while remaining agnostic about the Z channel and the Binary Symmetric Channel, as the next proposition evinces.

Proposition 2.13.

For all $p,q\geq 0$ with $p+q<1$ there exists a $d^{*}\in(0,\infty)$ such that $m_{\text{COMP}}\geq m_{\text{DD}}$ as long as $e^{-d^{*}}p\geq q$.

In terms of the common noise channels Proposition 2.13 gives the following corollary.

Corollary 2.14.

In the reverse $Z$ channel, $m_{\text{COMP}}\geq m_{\text{DD}}$.

The proof can be found in Appendix D. Our simulations suggest that this superior performance of noisy DD holds as well for the Z channel and Binary Symmetric Channel. Please refer to Figure 3 for an illustration.

Figure 3. Comparison of the bounds for noisy DD and noisy COMP in the $Z$ channel and the Binary Symmetric Channel for different noise levels. (Note for black and white prints: the lines in the diagram are in the same order as given in the legend, from top to bottom.)

2.6. Relation to Bernoulli testing

In [48], sufficient bounds for noisy group testing were derived under a Bernoulli test design, where each individual joins every test independently with some fixed probability. Under such a design the variable degrees fluctuate, and we end up with some individuals assigned to only a few tests. In contrast, in this paper we work under a model where each individual joins an equal number $\Delta$ of tests chosen uniformly at random without replacement. For the noiseless case, it is by now clear that the near-constant-column design better facilitates inference than the Bernoulli test design [13, 26]. We find that the same holds true for the noisy variant of the COMP algorithm. Let us denote by $m_{\text{COMP}}^{\text{Ber}}$ the number of tests required for noisy COMP to succeed under a Bernoulli test design.

Proposition 2.15.

For all $p+q<1$, we have

$m_{\text{COMP}}^{\text{Ber}}\geq m_{\text{COMP}}.$

We see the same effect for the noisy variant of the DD algorithm in all simulations, but for technical reasons we only prove it for the $Z$ channel.

Proposition 2.16.

For the $Z$ channel where $p=0$ and $0<q<1$, we have

$m_{\text{DD}}^{\text{Ber}}>m_{\text{DD}}.$

For an illustration of the magnitude of the difference, we refer to Figure 4 and Figure 8.

Figure 4. Comparison of DD bounds under a Bernoulli test design ([48]) and the constant column test design (present paper) for the reverse $Z$ and Binary Symmetric Channel. (Note for black and white prints: the solid lines as well as the dashed lines in the diagram are in the same order as given in the legend, from top to bottom.)

Appendix

The core of the technical sections is the proof of Theorems 2.1 and 2.2. Some groundwork with standard concentration bounds and group testing properties can be found in Section A. We continue with the proofs of Theorems 2.1 and 2.2 in Sections B and C, respectively. The structure of the proofs follows a similar logic. First, we derive the distributions of the number of displayed positive and negative tests for infected and healthy individuals. Second, we threshold these distributions using sharp Chernoff concentration bounds to deduce the bounds stated in Theorem 2.1 and Theorem 2.2. Thereafter, we proceed to the proof of Proposition 2.13 in Section D, while the proofs of Propositions 2.15 and 2.16 follow in Section E. The proof of Corollary 2.3 can be found in Section F. Additional illustrations of our results for the different channels can be found in Section G.

Appendix A Groundwork

For starters, let us recall the Chernoff bound for binomial and hypergeometric distributions.

Lemma A.1 (Chernoff bound for the binomial distribution [25]).

Let $p<q<r\in(0,1)$ and let $\bm{X}\sim{\rm Bin}(n,q)$ be a binomially distributed random variable. Then

$\mathbb{P}\left(\bm{X}\leq\lceil pn\rceil\right)=\exp\left(-\left(1+n^{-\Omega(1)}\right)nD_{\mathrm{KL}}\left(p\|q\right)\right)$
$\mathbb{P}\left(\bm{X}\geq\lceil rn\rceil\right)=\exp\left(-\left(1+n^{-\Omega(1)}\right)nD_{\mathrm{KL}}\left(r\|q\right)\right)$
Lemma A.2 (Chernoff bound for the hypergeometric distribution [23]).

Let $p<q<r\in(0,1)$ and let $\bm{Y}\sim H(N,Q,n)$ be a hypergeometrically distributed random variable. Further, let $q=Q/N$. Then

$\mathbb{P}\left(\bm{Y}\leq\lceil pn\rceil\right)=\exp\left(-\left(1+n^{-\Omega(1)}\right)nD_{\mathrm{KL}}\left(p\|q\right)\right)$
$\mathbb{P}\left(\bm{Y}\geq\lceil rn\rceil\right)=\exp\left(-\left(1+n^{-\Omega(1)}\right)nD_{\mathrm{KL}}\left(r\|q\right)\right)$
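To make the role of these bounds concrete, the following short numeric check compares the exact binomial upper tail (computed with SciPy) against the leading-order Chernoff term $\exp(-nD_{\mathrm{KL}}(r\|q))$ from Lemma A.1. This is only a sketch with illustrative parameters; the helper name kl_bernoulli is our own.

```python
# Sketch: compare the exact upper tail of Bin(n, q) with the Chernoff/KL
# exponent of Lemma A.1. All parameter values are illustrative.
import math
from scipy.stats import binom

def kl_bernoulli(a, b):
    """D_KL(Ber(a) || Ber(b)) in nats; assumes 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n, q, r = 2000, 0.3, 0.4
exact = binom.sf(math.ceil(r * n) - 1, n, q)   # P(X >= ceil(rn))
bound = math.exp(-n * kl_bernoulli(r, q))      # leading-order Chernoff term
print(f"exact tail {exact:.3e}, Chernoff {bound:.3e}")
# The two exponents agree to leading order; the polynomial prefactor is
# absorbed by the (1 + n^{-Omega(1)}) correction in the lemma.
```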

The next lemma shows that the test degrees, as defined in (1.4) above, are tightly concentrated. Recall from (1.3) that the number of tests is $m=ck\log(n/k)$ and each item appears in $\Delta=cd\log(n/k)$ tests.

Lemma A.3.

With probability $1-o(n^{-2})$ we have

$dn/k-\sqrt{dn/k}\log n\leq\bm{\Gamma}_{\min}\leq\bm{\Gamma}_{\max}\leq dn/k+\sqrt{dn/k}\log n$
Proof.

The probability that an individual $x$ is assigned to test $a$ is given by

(A.1) $\mathbb{P}\left(x\in\partial a\right)=1-\mathbb{P}\left(x\notin\partial a\right)=1-\binom{m-1}{\Delta}\binom{m}{\Delta}^{-1}=\Delta/m=d/k$

Since each individual is assigned to its tests independently of the others, the total number of individuals in a given test follows the binomial distribution ${\rm Bin}(n,d/k)$. The assertion now follows from applying the Chernoff bound for this binomial distribution around its expectation (Lemma A.1). ∎
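A quick simulation (with illustrative sizes of our own choosing, not tied to any concrete regime from the paper) shows the test degrees clustering around $dn/k=n\Delta/m$, in line with Lemma A.3.

```python
# Sketch: empirical test degrees under the constant-column design.
# n, m, Delta are illustrative; the mean degree is n*Delta/m = dn/k.
import numpy as np

rng = np.random.default_rng(0)
n, m, Delta = 10_000, 3_000, 30
deg = np.zeros(m, dtype=int)
for x in range(n):                       # each item picks Delta tests
    deg[rng.choice(m, size=Delta, replace=False)] += 1
print(deg.mean(), deg.min(), deg.max())  # mean 100; fluctuations of order sqrt(100)*log n
```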

Next, we show that the number of truly negative tests $\bm{m}_{0}$ (and thus the number of truly positive tests $\bm{m}_{1}$) is tightly concentrated.

Lemma A.4.

With probability $1-o(n^{-2})$ we have $\bm{m}_{0}=e^{-d}m+O(\sqrt{m}\log^{3}n)$.

Proof.

Recall from (A.1) that

$\mathbb{P}\left(x\in\partial a\right)=d/k$

Since infected individuals are assigned to tests mutually independently, we find for a test $a$ that

$\mathbb{P}\left(V_{1}\cap\partial a=\emptyset\right)=\mathbb{P}\left({\rm Bin}\left(k,d/k\right)=0\right)=\left(1-d/k\right)^{k}=\left(1+n^{-\Omega(1)}\right)e^{-d}.$

Consequently, $\mathbb{E}\left[\bm{m}_{0}\right]=\left(1+n^{-\Omega(1)}\right)e^{-d}m$. Finally, changing the set of tests for a specific infected individual shifts the total number of negative tests by at most $\Delta$. Therefore, the McDiarmid inequality (Lemma 1.2 in [34]) yields

$\mathbb{P}\left(\left|\bm{m}_{0}-\mathbb{E}\left[\bm{m}_{0}\right]\right|\geq t\right)\leq 2\exp\left(-\frac{t^{2}}{4k\Delta^{2}}\right).$

The lemma follows from setting $t=O\left(\sqrt{m}\log^{3}n\right)$. ∎

With the concentration of $\bm{m}_{0}$ and $\bm{m}_{1}$ at hand, we readily obtain estimates for $\bm{m}_{0}^{f},\bm{m}_{0}^{u},\bm{m}_{1}^{f}$ and $\bm{m}_{1}^{u}$. We remind ourselves that these are the numbers of flipped and unflipped negative tests and of flipped and unflipped positive tests, as defined in Section 1.4.

Corollary A.5.

With probability $1-o(n^{-2})$ we have

  • (i) $\bm{m}_{0}^{f}=e^{-d}pm+O\left(\sqrt{m}\log^{3}n\right)$

  • (ii) $\bm{m}_{0}^{u}=e^{-d}(1-p)m+O\left(\sqrt{m}\log^{3}n\right)$

  • (iii) $\bm{m}_{1}^{f}=(1-e^{-d})qm+O\left(\sqrt{m}\log^{3}n\right)$

  • (iv) $\bm{m}_{1}^{u}=(1-e^{-d})(1-q)m+O\left(\sqrt{m}\log^{3}n\right)$

Proof.

Since each truly negative test is flipped with probability $p$ and each truly positive test with probability $q$, independently, the claims follow from Lemma A.4 and the Chernoff bound for the binomial distribution (Lemma A.1). ∎

In the following, let $\mathcal{E}$ be the event that the bounds from Lemma A.4 and Corollary A.5 hold. Note that $\mathcal{E}$ holds with high probability.

Appendix B Proof of COMP bound, Theorem 2.1

Recall from (2.1) that we write $\bm{N}_{x}$ for the number of displayed negative tests that item $x$ appears in (as illustrated by the right branch of Fig. 2). The proof of Theorem 2.1 rests on two pillars. First, Lemmas B.1 and B.2 provide the distribution of $\bm{N}_{x}$ for infected and healthy individuals, respectively. We will see that these distributions differ according to the infection status of the individual. Second, we derive a suitable threshold $\alpha\Delta$ via Lemmas B.3 and B.4 to tell healthy and infected individuals apart w.h.p. We start by analysing individuals in the infected set $V_{1}$. Throughout the section, we assume $\alpha\in(q,e^{-d}(1-p)+\left(1-e^{-d}\right)q)$.
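For concreteness, here is a minimal simulation sketch of this thresholding rule: an individual is declared infected iff it appears in fewer than $\alpha\Delta$ displayed negative tests. All concrete parameter values, variable names and helper code below are our own illustrative choices and are not part of the formal proof; exact recovery is only guaranteed in the asymptotic regime of Theorem 2.1.

```python
# Sketch of the noisy COMP rule analysed in this appendix: declare x infected
# iff its number N_x of displayed negative tests is below alpha*Delta.
import numpy as np

rng = np.random.default_rng(0)
n, k, m, Delta, p, q, alpha = 10_000, 100, 3_000, 30, 0.05, 0.05, 0.3

infected = rng.choice(n, size=k, replace=False)
A = np.zeros((m, n), dtype=bool)                 # pooling matrix
for x in range(n):                               # Delta tests per item, no replacement
    A[rng.choice(m, size=Delta, replace=False), x] = True
truly_pos = A[:, infected].any(axis=1)
flip = np.where(truly_pos, rng.random(m) < q, rng.random(m) < p)
displayed_pos = truly_pos ^ flip                 # noisy test outcomes
N = A[~displayed_pos].sum(axis=0)                # displayed negatives per item
estimate = set(np.flatnonzero(N < alpha * Delta))
print(len(estimate ^ set(infected)), "misclassified items")
```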

Lemma B.1.

Given $x\in V_{1}$, its number of displayed negative tests $\bm{N}_{x}$ is distributed as ${\rm Bin}(\Delta,q)$.

Proof.

Any test containing an infected individual is truly positive because of the presence of the infected individual. Since an infected individual is assigned to Δ\Delta different tests and each such test is flipped with probability qq independently, the lemma follows immediately. ∎

Next, we consider the distribution for healthy individuals. Recall that $\mathcal{E}$ denotes the event that the bounds from Lemma A.4 and Corollary A.5 hold.

Lemma B.2.

Given $x\in V_{0}$ and conditioned on $\mathcal{E}$, the total variation distance between the distribution of $\bm{N}_{x}$ and that of $\bm{T}_{h}\sim H\left(m,m\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right),\Delta\right)$ tends to zero with $n$; that is,

$d_{TV}(\bm{N}_{x},\bm{T}_{h})=n^{-\Omega(1)}$
Proof.

Since $x$ is healthy, the outcome of all the tests remains the same if it is removed from consideration (if we perform group testing with $n-1$ items and the corresponding reduced matrix).

Thus, given $\mathcal{E}$, we find that with $x$ removed the quantities $\bm{m}_{0}^{f},\bm{m}_{0}^{u},\bm{m}_{1}^{f},\bm{m}_{1}^{u}$ still satisfy the bounds from Corollary A.5. As a result, the number of displayed negative tests (which consist of unflipped truly negative tests and flipped truly positive tests) is given by

(B.1) $\bm{m}_{0}^{u}+\bm{m}_{1}^{f}=\left(e^{-d}(1-p)+(1-e^{-d})q\right)m+O\left(\sqrt{m}\log^{3}n\right)$

Now we add $x$ back into consideration: $x\in V_{0}$ chooses $\Delta$ tests without replacement, independently of the above. Hence, given that the random quantity $\bm{m}_{0}^{u}+\bm{m}_{1}^{f}=\ell$, the number $\bm{N}_{x}$ of displayed negative tests that item $x$ appears in is distributed as $H(m,\ell,\Delta)$. Hence, a conditioning argument shows that the linear combination of distribution functions

$\sum_{\ell}\mathbb{P}\left(\bm{m}_{0}^{u}+\bm{m}_{1}^{f}=\ell\right)\mathbb{P}(H\left(m,\ell,\Delta\right)\leq x)$

tends to the distribution function of $H\left(m,m\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right),\Delta\right)$ in total variation distance, due to the concentration of $\bm{m}_{0}^{u}+\bm{m}_{1}^{f}$ obtained in Corollary A.5. ∎

Moving to the second pillar of the proof, we need to demonstrate that no infected individual is assigned to more than $\alpha\Delta$ displayed negative tests, as shown by the following lemma.

Lemma B.3.

If $c>(1+\eta)\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|q\right)}$ for some small $\eta>0$, then $\bm{N}_{x}<\alpha\Delta$ for all $x\in V_{1}$ w.h.p.

Proof.

We have to ensure that $\mathbb{P}(\exists x\in V_{1}:\bm{N}_{x}\geq\alpha\Delta)=o(1)$. By Lemma B.1 and the union bound, we thus need

$o(1)=k\cdot\mathbb{P}\left(\bm{N}_{x}\geq\alpha\Delta:x\in V_{1}\right)=k\cdot\mathbb{P}\left({\rm Bin}(\Delta,q)\geq\alpha\Delta\right)=k\cdot\exp\left(-\left(1+\Delta^{-\Omega(1)}\right)\Delta D_{\mathrm{KL}}\left(\alpha\|q\right)\right),$

by the Chernoff bound for the binomial distribution (Lemma A.1). Since $k\sim n^{\theta}$ and $\Delta=cd(1-\theta)\log n$, the following must hold:

$\theta-cd(1-\theta)D_{\mathrm{KL}}\left(\alpha\|q\right)<0$

The lemma follows from rearranging terms and the fact that if we choose the number of tests slightly above the required number (larger by a factor of $1+\eta$ for $\eta>0$), the assertion holds w.h.p. as $n\rightarrow\infty$. ∎

We proceed to show that no healthy individual is assigned to fewer than $\alpha\Delta$ displayed negative tests.

Lemma B.4.

If $c>(1+\eta)\frac{1}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)}$ for some small $\eta>0$, then $\bm{N}_{x}>\alpha\Delta$ for all $x\in V_{0}$ w.h.p.

Proof.

We need to ensure that $\mathbb{P}(\exists x\in V_{0}:\bm{N}_{x}<\alpha\Delta)=o(1)$. Since $\mathcal{E}$ occurs w.h.p. by Lemma A.4 and Corollary A.5, we need to have, by Lemma B.2 and the union bound,

(B.2) $(n-k)\cdot\mathbb{P}\left(\bm{N}_{x}\leq\alpha\Delta|x\in V_{0},\mathcal{E}\right)\leq n\cdot\mathbb{P}\left(\bm{T}_{h}\leq\alpha\Delta\right)=o(1).$

We remind ourselves that $\bm{T}_{h}\sim H\left(m,m\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right),\Delta\right)$; together with the Chernoff bound for the hypergeometric distribution (Lemma A.2), this leads to the condition (note that the additive rule of the logarithm allows us to move the error term from inside the KL divergence to outside)

$1-cd(1-\theta)D_{\mathrm{KL}}\left(\alpha\|(1-p)e^{-d}+(1-e^{-d})q\right)<0$

in a similar way to the proof of Lemma B.3. The lemma follows from rearranging terms and the fact that if we choose the number of tests slightly above the required number (larger by a factor of $1+\eta$ for $\eta>0$), the assertion holds w.h.p. as $n\rightarrow\infty$. ∎

Proof of Theorem 2.1.

The theorem is now an immediate consequence of Lemmas B.3 and B.4, which guarantee that w.h.p. classifying individuals according to the threshold $\alpha\Delta$ on displayed negative tests recovers $\bm{\sigma}$, and the fact that the choice of $\alpha$ and $d$ is at our disposal. ∎
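The resulting constant can be evaluated numerically. The following rough grid search is our own sketch: it minimises over $(\alpha,d)$ the maximum of the two conditions from Lemmas B.3 and B.4; the grid ranges, resolution and parameter values are arbitrary choices.

```python
# Sketch: grid search for the COMP constant c = min_{alpha,d} max{b1, b2},
# with b1, b2 as in Lemmas B.3 and B.4; then m_COMP ~ c * k * log(n/k).
import math
import numpy as np

def kl(a, b):
    """D_KL(Ber(a) || Ber(b)) in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def c_comp(theta, p, q, grid=200):
    best = math.inf
    for d in np.linspace(0.1, 5.0, grid):
        hi = math.exp(-d) * (1 - p) + (1 - math.exp(-d)) * q
        for alpha in np.linspace(q + 1e-4, hi - 1e-4, grid):
            b1 = theta / (1 - theta) / (d * kl(alpha, q))    # Lemma B.3
            b2 = 1 / (1 - theta) / (d * kl(alpha, hi))       # Lemma B.4
            best = min(best, max(b1, b2))
    return best

print(c_comp(theta=0.5, p=0.05, q=0.05))
```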

Appendix C Proof of DD bound, Theorem 2.2

The proof of Theorem 2.2 follows a similar two-step approach to the proof of Theorem 2.1: we first find the distribution of $\bm{P}_{x}$, the number of displayed positive tests in which individual $x$ appears on its own after removing the individuals already declared healthy, $V_{0}\setminus V_{0,\text{PD}}$ (illustrated by DP-S in Fig. 2). We then threshold the distributions for healthy and infected individuals. To get started, we revise the second bound from Theorem 2.1 to allow $kn^{-\Omega(1)}$ healthy individuals to remain unclassified after the first step of DD. Recall that we assume $\alpha\in(q,e^{-d}(1-p)+\left(1-e^{-d}\right)q)$ and $\beta\in(0,e^{-d}(1-q))$.

Lemma C.1.

If

$c>(1+\eta)\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)}$

for some small $\eta>0$, we have $\left|\bm{V}_{0,\text{PD}}\right|=kn^{-\Omega(1)}$ w.h.p.

Proof.

The lemma follows immediately by replacing the r.h.s. of (B.2) with $kn^{-\delta}$ for some small $\delta=\delta(\eta)$, rearranging terms and applying Markov’s inequality. ∎

For the next lemmas, we introduce the auxiliary quantity $\bm{m}_{0,\text{nd}}$, the number of tests that only contain individuals from $V_{0}\setminus V_{0,\text{PD}}$. In symbols,

$\bm{m}_{0,\text{nd}}=\left|\left\{a\in F:\partial a\subset V_{0}\setminus V_{0,\text{PD}}\right\}\right|.$
Lemma C.2.

If

$c>(1+\eta)\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)}$

for some small $\eta>0$, we have $\bm{m}_{0,\text{nd}}=\left(1-n^{-\Omega(1)}\right)e^{-d}m$ with probability $1-o(n^{-2})$.

Proof.

As in the proof of Lemma B.2 above, we consider the graph in two rounds: in the first round we consider the tests containing infected individuals. Since a healthy individual $x\in V_{0}$ does not impact the number of positive and negative tests, we know by Lemma A.4 that with probability $1-o(n^{-2})$ the number of truly negative tests after the first round is $\bm{m}_{0}=e^{-d}m+O\left(\sqrt{m}\log^{4}n\right)$. Furthermore, the presence of a healthy individual has no impact on the number of displayed negative tests, as unflipped negative tests remain unflipped and flipped positive tests remain flipped. In the second round, we consider the effect of adding healthy individuals into the tests. Knowing the number of negative tests w.h.p., we can think of the participation of individuals $x\in V_{0,\text{PD}}$ in these tests as a balls-into-bins experiment. Starting with the number of truly negative tests $\bm{m}_{0}$ (given by the first round), we conduct a worst-case analysis to see how many of those tests may include one of the $x\in V_{0,\text{PD}}$. Consider some particular truly negative test $a$. We are interested in the probability that it contains no element of $V_{0,\text{PD}}$. The probability that a given individual $x\in V_{0,\text{PD}}$ (knowing that it participates in $N_{x}\leq\alpha\Delta$ displayed negative tests, which is of lower order than $m$) is assigned to this test is given below. (We refer the reader to [20] for two results used in obtaining (C.3) and (C.4): apply Claim 7.3 to the binomial coefficients for (C.3), and Claim 7.4, an error-corrected version of Bernoulli’s inequality, for (C.4). These bounds hold in particular for $\Delta=\Theta(\log(n))$ and $k\sim n^{\theta}$.)

(C.1) $\mathbb{P}\left(x\in\partial a\,|\,x\in V_{0,\text{PD}}\right)=1-\mathbb{P}\left(x\notin\partial a\,|\,x\in V_{0,\text{PD}}\right)$
(C.2) $=1-\sum_{i=0}^{\alpha\Delta}\mathbb{P}\left(\bm{N}_{x}=i|x\in V_{0,\text{PD}}\right)\binom{m-1}{\Delta-i}\binom{m}{\Delta-i}^{-1}$
(C.3) $\leq 1-\big(1+n^{-\Omega(1)}\big)\sum_{i=0}^{\alpha\Delta}\mathbb{P}\left(\bm{N}_{x}=i|x\in V_{0,\text{PD}}\right)\left(1-\frac{1}{m}\right)^{\Delta-i}$
(C.4) $\leq 1-\big(1+n^{-\Omega(1)}\big)\sum_{i=0}^{\alpha\Delta}\mathbb{P}\left(\bm{N}_{x}=i|x\in V_{0,\text{PD}}\right)\left(1-\frac{1}{m}\right)^{\Delta}=\big(1+n^{-\Omega(1)}\big)\left(\frac{\Delta}{m}+O(k^{-2})\right)=\frac{d}{k}+O(k^{-2})$

We can now calculate the probability that no individual $x\in V_{0,\text{PD}}$ is assigned to $a$, bearing in mind that the size of $V_{0,\text{PD}}$ is random and that each such individual is assigned to tests mutually independently. Using (C.4), and decomposing the sum into two parts, this probability is given by (for a given $V$)

$\mathbb{P}\left(V_{0,\text{PD}}\cap\partial a=\emptyset\right)=\sum_{j=0}^{n}\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|=j\right)\mathbb{P}\left(V_{0,\text{PD}}\cap\partial a=\emptyset\,\Big|\,\left|\bm{V}_{0,\text{PD}}\right|=j\right)$
$=\sum_{j=0}^{V}\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|=j\right)\left(1-\frac{d}{k}+O\left(k^{-2}\right)\right)^{j}+\sum_{j=V+1}^{n}\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|=j\right)\left(1-\frac{d}{k}+O\left(k^{-2}\right)\right)^{j}$
$\geq\sum_{j=0}^{V}\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|=j\right)\left(1-\frac{d}{k}+O\left(k^{-2}\right)\right)^{V}=\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|\leq V\right)\left(1-\frac{d}{k}+O\left(k^{-2}\right)\right)^{V}$

By Lemma C.1, we can choose $V=kn^{-\Omega(1)}$ such that $\mathbb{P}\left(\left|\bm{V}_{0,\text{PD}}\right|\leq V\right)$ is arbitrarily close to 1, and knowing that $\left(1-\frac{d}{k}+O\left(k^{-2}\right)\right)^{V}\simeq\exp(-dV/k)=\exp(-dn^{-\Omega(1)})$ we find

$\mathbb{P}\left(V_{0,\text{PD}}\cap\partial a=\emptyset\right)=1-n^{-\Omega(1)}.$

Combining this with the findings of Lemma A.4, we find $\mathbb{E}\left[\bm{m}_{0,\text{nd}}\right]=\left(1-n^{-\Omega(1)}\right)e^{-d}m$. The lemma follows by a similar application of the McDiarmid inequality to the one used in the proof of Lemma A.4: changing the set of tests for a specific individual $x\in V_{1}\cup V_{0,\text{PD}}$ shifts $\bm{m}_{0,\text{nd}}$ by at most $\Delta$, so such an individual choosing its tests does not affect the order of $\bm{m}_{0,\text{nd}}$. ∎

Let $\mathcal{F}$ be the event that $\bm{m}_{0,\text{nd}}=\left(1-n^{-\Omega(1)}\right)e^{-d}m$. By Lemma C.2, $\mathbb{P}\left(\mathcal{F}\right)=1-o(n^{-2})$ if

$c>(1+\eta)\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)}$

for some small $\eta>0$. With Lemma C.2 at hand, we are in a position to describe the distribution of $\bm{P}_{x}$ for healthy and infected individuals (recall the definition of $\bm{P}_{x}$ in (2.2)). Let us start with infected individuals.

Lemma C.3.

Given $x\in V_{1}$ and conditioned on $\mathcal{F}$, the total variation distance between $\bm{P}_{x}$ and $\bm{Q}_{H}$, a random variable with hypergeometric distribution $H\left(m,me^{-d}(1-q),\Delta\right)$, tends to zero with $n$; that is,

$d_{TV}\left(\bm{P}_{x},\bm{Q}_{H}\right)=n^{-\Omega(1)}.$
Proof.

We are interested in the neighborhood structure of one given infected individual $x\in V_{1}$, and we check how the remaining individuals influence the test types. In particular, we are interested in the number of tests $a\in F$ with $\partial a\subset V_{0}\setminus V_{0,\text{PD}}$ that lie in the neighborhood of the infected individual $x$. Knowing the total number of tests $m$ and the fixed degree $\Delta$, for a given value of the random quantity $\bm{m}_{0,\text{nd}}=\ell$, this quantity of interest follows an $H\left(m,\ell,\Delta\right)$ distribution. Given $\mathcal{F}$, Lemma C.2 yields that $\bm{m}_{0,\text{nd}}$ is highly concentrated,

$\bm{m}_{0,\text{nd}}=\left(1-n^{-\Omega(1)}\right)e^{-d}m$

with high probability. Hence a conditioning argument, similar to Lemma B.2, shows that the linear combination of distribution functions

$\sum_{\ell}\mathbb{P}(\bm{m}_{0,\text{nd}}=\ell)\mathbb{P}(H\left(m,\ell,\Delta\right)\leq x)$

tends to the distribution function of $H\left(m,me^{-d},\Delta\right)$ in total variation distance, due to the concentration result obtained in Lemma C.2. Since each test featuring $x$ is truly positive (as $x$ is infected) and is displayed positive with probability $1-q$ independently, the lemma follows immediately. ∎

To describe the distribution of $\bm{P}_{x}$ for healthy individuals, let us introduce the random variable $\bm{P}_{x}(P)$, which is $\bm{P}_{x}$ conditioned on the individual appearing in $P$ displayed positive tests, as follows:

$\mathbb{P}\left(\bm{P}_{x}(P)=t\right)=\mathbb{P}\left(\bm{P}_{x}=t|\bm{N}_{x}=\Delta-P\right)$

Then, we find for healthy individuals the following conditional distribution.

Lemma C.4.

Given $x\in V_{0}$, conditioned on $\mathcal{E}$ and $\mathcal{F}$, the total variation distance between $\bm{P}_{x}(P)$ and $\bm{B}_{h}\sim H\left(m\left(e^{-d}p+(1-e^{-d})(1-q)\right),m\left(e^{-d}p\right),P\right)$ tends to zero with $n$; that is,

$d_{TV}(\bm{P}_{x}(P),\bm{B}_{h})=n^{-\Omega(1)}.$
Proof.

We proceed with the same exposition and reasoning as in the proof of Lemma C.3. Since $x$ is healthy, we can remove it without affecting the test results, and we can therefore analyse its neighborhood structure induced by the pooling graph while excluding it. Since by assumption individual $x\in V_{0}$ is assigned to exactly $P$ displayed positive tests and the total number of displayed positive tests is $\bm{m}_{0}^{f}+\bm{m}_{1}^{u}$, we see that $\bm{P}_{x}(P)$ is $H\left(\bm{m}_{0}^{f}+\bm{m}_{1}^{u},\bm{m}_{0,\text{nd}},P\right)$-distributed. Because the event $\mathcal{E}$ pinpoints the number of displayed positive and negative tests, we can derive the distribution of neighbors the individual may choose from. Recalling the results of Corollary A.5, we see that w.h.p.

$\bm{m}_{0}^{f}=e^{-d}pm+O\left(\sqrt{m}\log^{3}n\right),$
$\text{and }\bm{m}_{1}^{u}=(1-e^{-d})(1-q)m+O\left(\sqrt{m}\log^{3}n\right).$

Furthermore, we get from Lemma C.2 that w.h.p.

$\bm{m}_{0,\text{nd}}=\left(1-n^{-\Omega(1)}\right)e^{-d}m.$

Now we apply the concentration results obtained in Corollary A.5 and Lemma C.2 to obtain a linear combination of distribution functions

$\sum_{\ell,v}\mathbb{P}(\bm{m}_{0,\text{nd}}=\ell,\bm{m}_{0}^{f}+\bm{m}_{1}^{u}=v)\cdot\mathbb{P}(H\left(v,\ell,P\right)\leq x)$

that tends to $H\left(m\left(e^{-d}p+(1-e^{-d})(1-q)\right),me^{-d},P\right)$. The lemma follows since truly negative tests are flipped independently with probability $p$. ∎

Having derived the distribution of $\bm{P}_{x}$ for $x\in V_{1}$ and of $\bm{P}_{x}(P)$ for $x\in V_{0}$, we can now determine a threshold $\beta\Delta$ on the number of displayed positive tests in which the individual appears only with individuals from the set $V_{0}\setminus V_{0,\text{PD}}$, such that we can tell $V_{1}$ and $V_{0,\text{PD}}$ apart and thus recover $\bm{\sigma}$. Let us start with infected individuals.

Lemma C.5.

As long as

$c>(1+\eta)\max\left\{\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)},\frac{\theta}{1-\theta}\frac{1}{dD_{\mathrm{KL}}\left(\beta\|(1-q)e^{-d}\right)}\right\}$

for some small $\eta>0$, we have $\bm{P}_{x}>\beta\Delta$ for all $x\in V_{1}$ w.h.p.

Proof.

We need to ensure that $\mathbb{P}(\exists x\in V_{1}:\bm{P}_{x}<\beta\Delta)=o(1)$. For the bound on $c$ from the lemma, we know that $\mathcal{F}$ occurs w.h.p. by Lemma C.2. In combination with Lemma C.3 and the union bound, we need to ensure that

(C.5) $k\cdot\mathbb{P}\left(\bm{P}_{x}\leq\beta\Delta|x\in V_{1},\mathcal{F}\right)=k\cdot\mathbb{P}(\bm{Q}_{H}\leq\beta\Delta)+kn^{-\Omega(1)}=o(1),$

where as before $\bm{Q}_{H}$ is a random variable with hypergeometric distribution $H\left(m,me^{-d}(1-q),\Delta\right)$. Using the Chernoff bound for the hypergeometric distribution (Lemma A.2), the following condition for (C.5) to hold arises:

(C.6) $\theta-cd(1-\theta)D_{\mathrm{KL}}\left(\beta\|(1-q)e^{-d}\right)<0$

The lemma follows from rearranging terms in (C.6) and the fact that if we choose the number of tests slightly above the required number (larger by a factor of $1+\eta$ for $\eta>0$), the assertion holds w.h.p. as $n\rightarrow\infty$. ∎

We proceed with the set of individuals $V_{0,\text{PD}}$.

Lemma C.6.

As long as

$c>(1+\eta)\max\Bigg\{\frac{1}{dD_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)},\max_{1-\alpha\leq z\leq 1}\left\{\frac{1}{1-\theta}\frac{1}{d\left(D_{\mathrm{KL}}\left(z\|e^{-d}p+(1-e^{-d})(1-q)\right)+zD_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{e^{-d}p+(1-e^{-d})(1-q)}\right)\right)}\right\}\Bigg\}$

for some small $\eta>0$, we have $\bm{P}_{x}<\beta\Delta$ for all $x\in V_{0,\text{PD}}$ w.h.p.

Proof.

We need to ensure that $\mathbb{P}(\exists x\in V_{0,\text{PD}}:\bm{P}_{x}>\beta\Delta)=o(1)$. For the bound on $c$ from the lemma, we know that $\mathcal{F}$ occurs w.h.p. by Lemma C.2. Moreover, $\mathcal{E}$ occurs w.h.p. by Lemma A.4 and Corollary A.5. We write $w=e^{-d}p+\left(1-e^{-d}\right)(1-q)$ for brevity. Combining these facts with Lemmas B.2 and C.4, we need to ensure

(C.7) $(n-k)\sum_{P=(1-\alpha)\Delta}^{\Delta}\mathbb{P}\left(\bm{N}_{x}=\Delta-P|x\in V_{0},\mathcal{E}\right)\mathbb{P}\left(\bm{P}_{x}(P)\geq\beta\Delta|x\in V_{0},\mathcal{F}\right)$
(C.8) $=\left(1-n^{-\Omega(1)}\right)n\sum_{P=(1-\alpha)\Delta}^{\Delta}\mathbb{P}\left(\bm{T}_{h}=P\right)\cdot\mathbb{P}\left(\bm{B}_{h}\geq\beta\Delta\right)=o(1)$

We remind ourselves that

$\bm{T}_{h}\sim H\left(m,m\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right),\Delta\right)$
$\text{and}\quad\bm{B}_{h}\sim H\left(m\left(e^{-d}p+(1-e^{-d})(1-q)\right),m\left(e^{-d}p\right),P\right).$

Now, by the Chernoff bound for the hypergeometric distribution (Lemma A.2) and setting $z=P/\Delta$, we establish the following two bounds for the probability terms:

(C.9) $\mathbb{P}\left(H\left(m,m\left(w+n^{-\Omega(1)}\right),\Delta\right)=P\right)=\exp\left(-(1+n^{-\Omega(1)})\Delta D_{\mathrm{KL}}\left(z\|w\right)\right)$
(C.10) $\mathbb{P}\left(H\left(m\left(w+n^{-\Omega(1)}\right),m\left(e^{-d}p+n^{-\Omega(1)}\right),P\right)\geq\beta\Delta\right)=\exp\left(-\left(1+n^{-\Omega(1)}\right)z\Delta\,\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}D_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{w}\right)\right)$

(The indicator in (C.10) appears due to the condition given by Lemma A.2.) We reformulate the left-hand side of (C.8) to

$n\sum_{P=(1-\alpha)\Delta}^{\Delta}\exp\left(-(1+o(1))\Delta\left(D_{\mathrm{KL}}\left(z\|w\right)+\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}zD_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{w}\right)\right)\right)$
$=\left(1+n^{-\Omega(1)}\right)n\max_{1-\alpha\leq z\leq 1}\left\{\exp\left(-(1+o(1))\Delta\left(D_{\mathrm{KL}}\left(z\|w\right)+\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}zD_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{w}\right)\right)\right)\right\}$

where the second equality follows since the sum consists of $\Theta(\Delta)=\Theta(\log n)$ many summands. Since $\mathbb{P}\left(\mathcal{F}\right)=1-n^{-\Omega(1)}$ for our choice of $c$ by Lemma C.2, rearranging terms readily yields that the expression in (C.7) is indeed of order $o(1)$.

To see this, we remind ourselves that by definition $\Delta=cd\log\left(\frac{n}{k}\right)=(1-\theta)cd\log(n)$. Furthermore, we plug in the definition of $w=e^{-d}p+(1-e^{-d})(1-q)$. In the end, we have to ensure that

$1<(1-\theta)cd\left(D_{\mathrm{KL}}\left(z\|w\right)+\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}zD_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{w}\right)\right)$

We solve this inequality for $c$. As we are only interested in a worst-case bound, the assertion follows from the non-negativity of $D_{\mathrm{KL}}\left(\cdot\|\cdot\right)$. ∎

Proof of Theorem 2.2.

The theorem is now immediate from Lemmas B.3, C.1, C.5 and C.6, and the fact that the choice of $\alpha,\beta$ and $d$ is at our disposal. ∎
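As with COMP, the two-step rule can be written out directly. The sketch below uses illustrative parameters and naming of our own and serves only as an empirical companion to the proof: it removes items with at least $\alpha\Delta$ displayed negative tests and then declares $x$ infected iff it appears as the sole remaining item in more than $\beta\Delta$ displayed positive tests.

```python
# Sketch of noisy DD: COMP-style first step, then threshold the count P_x of
# displayed positive tests in which x is the only surviving member.
# Concrete values are illustrative; guarantees are asymptotic (Theorem 2.2).
import numpy as np

rng = np.random.default_rng(1)
n, k, m, Delta, p, q = 10_000, 100, 3_000, 30, 0.05, 0.05
alpha, beta = 0.3, 0.1

infected = rng.choice(n, size=k, replace=False)
A = np.zeros((m, n), dtype=bool)
for x in range(n):
    A[rng.choice(m, size=Delta, replace=False), x] = True
truly_pos = A[:, infected].any(axis=1)
flip = np.where(truly_pos, rng.random(m) < q, rng.random(m) < p)
displayed_pos = truly_pos ^ flip

N = A[~displayed_pos].sum(axis=0)            # displayed negatives per item
remaining = N < alpha * Delta                # survivors of the first step
P = np.zeros(n, dtype=int)
for a in np.flatnonzero(displayed_pos):      # count "alone" appearances
    members = np.flatnonzero(A[a] & remaining)
    if members.size == 1:
        P[members[0]] += 1
estimate = set(np.flatnonzero(remaining & (P > beta * Delta)))
print(len(estimate ^ set(infected)), "misclassified items")
```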

Appendix D Comparison of the noisy DD and COMP bounds

This section provides sufficient conditions under which the DD algorithm attains reliable performance while requiring fewer tests than COMP. These conditions are not necessary, however, and DD might (and in all performed simulations does) require fewer tests than COMP in even wider settings.

Proof of Proposition 2.13.

In order to prove the proposition, we need to find conditions under which

$\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}\geq\min_{\alpha,\beta,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d),c_{4}(\alpha,\beta,d)\right\}$

We write $\alpha^{*}$ and $d^{*}$ for the values that minimise the maximum of the two terms on the l.h.s., at which point we know that $b_{1}(\alpha^{*},d^{*})=b_{2}(\alpha^{*},d^{*})$. Then it is sufficient to show that there exists $\beta^{*}$ such that

$b_{1}(\alpha^{*},d^{*})=b_{2}(\alpha^{*},d^{*})\geq\max\left\{c_{1}(\alpha^{*},d^{*}),c_{2}(\alpha^{*},d^{*}),c_{3}(\beta^{*},d^{*}),c_{4}(\alpha^{*},\beta^{*},d^{*})\right\}$

By inspection, for any $\alpha$ and $d$ we have $b_{1}(\alpha,d)=c_{1}(\alpha,d)$ and $b_{2}(\alpha,d)\geq c_{2}(\alpha,d)$, since $\theta\in(0,1)$.

Next, we show that $b_{2}(\alpha,d)\geq c_{4}(\alpha,\beta,d)$ for any $\alpha,\beta$ in the respective bounds and $d\in(0,\infty)$. Writing $w=e^{-d}p+(1-e^{-d})(1-q)$, and recalling that by assumption $\alpha\leq 1-w$ (or $w\leq 1-\alpha$), we readily find that

(D.1) $D_{\mathrm{KL}}\left(\alpha\|1-w\right)=\min_{1-\alpha\leq z\leq 1}\left(D_{\mathrm{KL}}\left(z\|w\right)\right)\leq\min_{1-\alpha\leq z\leq 1}\left(D_{\mathrm{KL}}\left(z\|w\right)+z\bm{1}\left\{\beta>\frac{ze^{-d}p}{w}\right\}D_{\mathrm{KL}}\left(\frac{\beta}{z}\Big\|\frac{e^{-d}p}{w}\right)\right)$

where the first equality follows since $D_{\mathrm{KL}}\left(\alpha\|1-w\right)=D_{\mathrm{KL}}\left(1-\alpha\|w\right)$ and $D_{\mathrm{KL}}\left(z\|w\right)>D_{\mathrm{KL}}\left(1-\alpha\|w\right)$ for any $z>1-\alpha$. The bound follows. Note that (D.1) indeed holds for any choice of $\alpha,\beta$ and $d$ in the respective bounds stated in the theorem.

Finally, we need to demonstrate that $c_{3}(\beta^{*},d^{*})\leq b_{2}(\alpha^{*},d^{*})$. Since $\beta$ is not an optimisation parameter in $b_{2}(\alpha^{*},d^{*})$ and the bound in (D.1) holds for any value of $\beta$, we can simply set it to the value that minimises $c_{3}(\beta,d^{*})$, which is $\beta=1/\Delta$, and for which we find

$c_{3}(\beta^{*},d^{*})=-\frac{\theta}{1-\theta}\frac{1}{d^{*}\log\left(1-e^{-d^{*}}(1-q)\right)}.$

Thus, to obtain the desired inequality we need to ensure that for the optimal choice $\alpha^{*}$ from COMP

$\theta D_{\mathrm{KL}}\left(\alpha^{*}\|e^{-d^{*}}(1-p)+\left(1-e^{-d^{*}}\right)q\right)\leq-\log\left(1-e^{-d^{*}}(1-q)\right)$

Using the bound

$\theta D_{\mathrm{KL}}\left(\alpha\|e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)\leq-\theta\log\left(1-\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)\right)\leq-\log\left(1-\left(e^{-d}(1-p)+\left(1-e^{-d}\right)q\right)\right)$

which is obtained by setting $\alpha=1/\Delta$, we find that $c_{3}(\beta^{*},d^{*})\leq b_{2}(\alpha^{*},d^{*})$ if

$-\log\left(1-e^{-d^{*}}(1-q)\right)\geq-\log\left(1-\left(e^{-d^{*}}(1-p)+\left(1-e^{-d^{*}}\right)q\right)\right)\Leftrightarrow e^{-d^{*}}p\geq q$

As mentioned before, due to the bounding of $b_{2}(\alpha^{*},d^{*})$ the result is not sharp. However, one immediate consequence of Proposition 2.13 is that DD is guaranteed to require fewer tests than COMP for the reverse Z channel. ∎

Appendix E Relation to Bernoulli testing

In the noiseless case, [26] shows that the constant-column weight design (where each individual joins exactly $\Delta$ different tests) requires fewer tests to recover $\bm{\sigma}$ than the i.i.d. (Bernoulli pooling) design (where each individual is included in each test independently with a certain probability). In this section we show that in the noisy case, the COMP algorithm requires fewer tests under the constant-column weight design than under the i.i.d. design, and we derive sufficient conditions under which the same is true for the noisy DD algorithm.

To get started, let us state the relevant bounds for the Bernoulli design, taken from [48, Theorem 5] and rephrased in our notation.

Proposition E.1 (Noisy COMP under Bernoulli).

Let $p,q\geq 0$, $p+q<1$, $d\in(0,\infty)$, $\alpha\in(q,e^{-d}(1-p)+\left(1-e^{-d}\right)q)$. Suppose that $0<\theta<1$ and $\varepsilon>0$ and let

$m_{\text{COMP}}^{\text{Ber}}=m_{\text{COMP}}^{\text{Ber}}(n,\theta,p,q)=\min_{\alpha,d}\max\left\{b_{1}(\alpha,d),b_{2}(\alpha,d)\right\}k\log(n/k)$
$\text{where}\quad b_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\alpha d/k\|qd/k\right)}$
$\text{and}\quad b_{2}(\alpha,d)=\frac{1}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\alpha d/k\|(e^{-d}(1-p)+(1-e^{-d})q)d/k\right)}$

If $m>(1+\varepsilon)m_{\text{COMP}}^{\text{Ber}}$, COMP will recover $\bm{\sigma}$ under the Bernoulli test design w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

Proposition E.2 (Noisy DD under Bernoulli).

Let $p,q\geq 0$, $p+q<1$, $d\in(0,\infty)$, $\alpha\in(q,e^{-d}(1-p)+\left(1-e^{-d}\right)q)$ and $\beta\in(e^{-d}p,e^{-d}(1-q))$. Suppose that $0<\theta<1$, $\zeta\in(0,\theta)$ and $\varepsilon>0$ and let

$m_{\text{DD}}^{\text{Ber}}=m_{\text{DD}}^{\text{Ber}}(n,\theta,p,q)=\min_{\alpha,\beta,d}\max\left\{c_{1}(\alpha,d),c_{2}(\alpha,d),c_{3}(\beta,d),c_{4}(\beta,d)\right\}k\log(n/k)$
$\text{where}\quad c_{1}(\alpha,d)=\frac{\theta}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\alpha d/k\|qd/k\right)}$
$\text{and}\quad c_{2}(\alpha,d)=\frac{1-\zeta}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\alpha d/k\|(e^{-d}(1-p)+(1-e^{-d})q)d/k\right)}$
$\text{and}\quad c_{3}(\beta,d)=\frac{\theta}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\beta d/k\|e^{-d}(1-q)d/k\right)}$
$\text{and}\quad c_{4}(\beta,d)=\frac{\zeta}{1-\theta}\frac{1}{kD_{\mathrm{KL}}\left(\beta d/k\|e^{-d}pd/k\right)}$

If $m>(1+\varepsilon)m_{\text{DD}}^{\text{Ber}}$, DD will recover $\bm{\sigma}$ under the Bernoulli test design w.h.p. given $\bm{G},\hat{\bm{\sigma}}$.

To compare the bounds of the Bernoulli and constant-column test designs, we employ the following handy observation.

Lemma E.3.

Let $0<x,y<1$ and $d>0$ be constants independent of $k$. As $k\to\infty$,

$kD_{\mathrm{KL}}\left(\frac{xd}{k}\Big\|\frac{yd}{k}\right)=d\left(D_{\mathrm{KL}}\left(x\|y\right)+v(x,y)\right)+o(1/k)$

with

(E.1) $v(x,y)=y-x+(1-x)\log\left(\frac{1-y}{1-x}\right)\leq 0$
Proof.

Applying the definition of the Kullback-Leibler divergence and Taylor expanding the logarithm, we obtain

$kD_{\mathrm{KL}}\left(\frac{xd}{k}\Big\|\frac{yd}{k}\right)=xd\log\left(\frac{x}{y}\right)+(k-xd)\left(\log\left(1-\frac{xd}{k}\right)-\log\left(1-\frac{yd}{k}\right)\right)$
$=xd\log\left(\frac{x}{y}\right)+(k-xd)\left(-\frac{xd}{k}+\frac{yd}{k}+o\left(\frac{1}{k^{2}}\right)\right)$
$=d\left(x\log\left(\frac{x}{y}\right)-x+y\right)+o(1/k)$
$=d\left(D_{\mathrm{KL}}\left(x\|y\right)+y-x-(1-x)\log\left(\frac{1-x}{1-y}\right)\right)+o(1/k).$

We can bound $v(x,y)$ from above by writing the final term as $(1-x)\log\left(1+\frac{x-y}{1-x}\right)\leq(1-x)\frac{x-y}{1-x}=x-y$, using the standard linearisation of the logarithm. ∎
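A direct numerical check of this expansion (our own sketch; the values of $x$, $y$ and $d$ are arbitrary) illustrates both the limit and the sign of the correction $v(x,y)$.

```python
# Sketch: check k*D_KL(xd/k || yd/k) -> d*(D_KL(x||y) + v(x,y)) as k grows,
# with v(x,y) = y - x + (1-x)*log((1-y)/(1-x)) <= 0 as in (E.1).
import math

def kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

x, y, d = 0.3, 0.5, 2.0
v = y - x + (1 - x) * math.log((1 - y) / (1 - x))
limit = d * (kl(x, y) + v)
for k in (10**2, 10**4, 10**6):
    lhs = k * kl(x * d / k, y * d / k)
    print(k, lhs, limit)   # lhs approaches the limit; v is nonpositive
```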

We are now in a position to prove Propositions 2.15 and 2.16.

Proof of Proposition 2.15.

The proposition follows by comparing the bounds from Theorem 2.1 and Proposition E.1 and applying Lemma E.3. ∎

Proof of Proposition 2.16.

As evident from Corollary 2.8, the fourth bound $c_{4}(\alpha,\beta,d)$ vanishes under the Z channel. Comparing the bounds from Theorem 2.2 and Proposition E.2, observing that $(1-\zeta)/(1-\theta)>1$ for $\zeta<\theta$, and applying Lemma E.3 immediately implies the proposition. ∎

Appendix F Notes on Corollary 2.3

Lemma F.1.

If $p+q<1$, the Shannon capacity of the $p$-$q$ channel of Figure 1, measured in nats, is

(F.1) $C_{\text{Chan}}=D_{\mathrm{KL}}\left(q\Big\|\frac{1}{1+e^{\phi}}\right)=D_{\mathrm{KL}}\left(p\Big\|\frac{1}{1+e^{-\phi}}\right),$

where $\phi=(h(p)-h(q))/(1-p-q)$. This is achieved by taking

(F.2) $\mathbb{P}(X=0)=\frac{1}{1-p-q}\left(\frac{1}{1+e^{\phi}}-q\right).$

Please note that this result may be standard for readers from some research communities, but less so for others. We therefore state the proof here to save the interested (but unfamiliar) reader a long textbook search.

Proof.

Write $\mathbb{P}(X=0)=\gamma$ and $\mathbb{P}(Y=0)=T(\gamma):=(1-p)\gamma+q(1-\gamma)$. Then, since the mutual information is

(F.3) $I(X;Y)=h(Y)-h(Y|X)=h\left(T(\gamma)\right)-\left(\gamma h(p)+(1-\gamma)h(q)\right),$

we can find the optimal $T$ by solving

$0=\frac{\partial}{\partial\gamma}I(X;Y)=(1-p-q)\log\left(\frac{1-T(\gamma)}{T(\gamma)}\right)-\left(h(p)-h(q)\right),$

which implies that the optimal value is $T^{*}=1/(1+e^{\phi})$. We can solve this for $\gamma^{*}=(T^{*}-q)/(1-p-q)$ to find the expression above. As $\frac{\partial^{2}}{\partial\gamma^{2}}I(X;Y)<0$, it is indeed a maximum. Substituting this in (F.3), we obtain that the capacity is given by

$h(T^{*})-\left(\gamma^{*}h(p)+(1-\gamma^{*})h(q)\right)=h\left(\frac{1}{1+e^{\phi}}\right)-\left((T^{*}-q)\phi+h(q)\right)=\log(1+e^{\phi})-\phi(1-q)-h(q)=D_{\mathrm{KL}}\left(q\|1/(1+e^{\phi})\right)$

as claimed in the first expression in (F.1) above. We can see that the second expression in (F.1) matches the first by writing the corresponding expression as $D_{\mathrm{KL}}\left(1-p\|1/(1+e^{\phi})\right)=\log(1+e^{\phi})-\phi p-h(p)$, which is equal to the display above by the definition of $\phi$. ∎

Note that this result suggests a choice of density for the matrix: since each test is negative with probability $e^{-d}$, equating this with (F.2) suggests that we take

$d=d^{*}_{\rm ch}=\log(1-p-q)-\log\left(\frac{1}{1+e^{\phi}}-q\right).$

This is unlikely to be optimal in a group testing sense, since we make different inferences from positive and negative tests, but it gives a closed-form expression that may perform well in practice. For the noiseless and BSC cases, observe that $\phi=0$, so that we obtain $d^{*}_{\rm ch}=\log 2$.
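These closed-form expressions are easy to evaluate numerically; the snippet below (a sketch, with arbitrary example noise levels of our own choosing) computes $\phi$, the capacity of Lemma F.1, and the suggested density $d^{*}_{\rm ch}$.

```python
# Sketch: evaluate phi, the channel capacity of Lemma F.1 (in nats), and the
# capacity-suggested density d_ch. The noise levels p, q are illustrative.
import math

def h(a):                                  # binary entropy in nats
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

def kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

p, q = 0.1, 0.2
phi = (h(p) - h(q)) / (1 - p - q)
capacity = kl(q, 1 / (1 + math.exp(phi)))  # equals kl(p, 1/(1+exp(-phi)))
d_ch = math.log(1 - p - q) - math.log(1 / (1 + math.exp(phi)) - q)
print(phi, capacity, d_ch)                 # p = q would give d_ch = log 2
```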

Appendix G Illustration of bounds for Z, reverse Z channel and the BSC

Figure 5. Illustration of achievability bounds for noisy COMP and DD under the Z channel. (Note for black and white prints: the solid lines as well as the dashed lines in the diagram are in the same order as given in the legend from top to bottom.)

Figure 6. Illustration of achievability bounds for noisy COMP and DD under the reverse Z channel. (Note for black and white prints: the solid lines as well as the dashed lines in the diagram are in the same order as given in the legend from top to bottom.)

Figure 7. Illustration of achievability bounds for noisy COMP and DD under the Binary Symmetric Channel. (Note for black and white prints: the solid lines as well as the dashed lines in the diagram are in the same order as given in the legend from top to bottom.)

Figure 8. Comparison of the noisy DD rates under Bernoulli pooling ([48]) with the DD bounds under the constant-column design provided in the paper at hand, for the Z channel. (Note for black and white prints: the solid lines as well as the dashed lines in the diagram are in the same order as given in the legend from top to bottom.)

Acknowledgment

The authors would like to thank two anonymous referees for their detailed reading of this paper and for the suggestions they made to improve its presentation. Oliver Gebhard and Philipp Loick are supported by DFG CO 646/3.

References

  • [1] E. Abbe, A. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62:471–487, 2016.
  • [2] B. Abdalhamid, C. Bilder, E. McCutchen, S. Hinrichs, S. Koepsell, and P. Iwen. Assessment of specimen pooling to conserve SARS-CoV-2 testing resources. American Journal of Clinical Pathology, 153:715–718, 2020.
  • [3] M. Aldridge. The capacity of Bernoulli nonadaptive group testing. IEEE Transactions on Information Theory, 63:7142–7148, 2017.
  • [4] M. Aldridge. Individual testing is optimal for nonadaptive group testing in the linear regime. IEEE Transactions on Information Theory, 65:2058–2061, 2019.
  • [5] M. Aldridge, L. Baldassini, and O. Johnson. Group testing algorithms: bounds and simulations. IEEE Transactions on Information Theory, 60:3671–3687, 2014.
  • [6] M. Aldridge, O. Johnson, and J. Scarlett. Improved group testing rates with constant column weight designs. Proceedings of 2016 IEEE International Symposium on Information Theory (ISIT’16), pages 1381–1385, 2016.
  • [7] M. Aldridge, O. Johnson, and J. Scarlett. Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory, 15(3–4):196–392, 2019.
  • [8] E. Arıkan. Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels. IEEE Transactions on Information Theory, 55:3051–3073, 2009.
  • [9] L. Baldassini, O. Johnson, and M. Aldridge. The capacity of adaptive group testing. Proceedings of 2013 IEEE International Symposium on Information Theory (ISIT’13), 1:2676–2680, 2013.
  • [10] C. Chan, P. Che, S. Jaggi, and V. Saligrama. Non-adaptive probabilistic group testing with noisy measurements: near-optimal bounds with efficient algorithms. Proceedings of 49th Annual Allerton Conference on Communication, Control, and Computing, 1:1832–1839, 2011.
  • [11] H. Chen and F. Hwang. A survey on nonadaptive group testing algorithms through the angle of decoding. Journal of Combinatorial Optimization, 15:49–59, 2008.
  • [12] I. Cheong. The experience of South Korea with COVID-19. Mitigating the COVID Economic Crisis: Act Fast and Do Whatever It Takes (CEPR Press), pages 113–120, 2020.
  • [13] A. Coja-Oghlan, O. Gebhard, M. Hahn-Klimroth, and P. Loick. Information-theoretic and algorithmic thresholds for group testing. IEEE Transactions on Information Theory, DOI: 10.1109/TIT.2020.3023377, 2020.
  • [14] A. Coja-Oghlan, O. Gebhard, M. Hahn-Klimroth, and P. Loick. Optimal group testing. Proceedings of 33rd Conference on Learning Theory (COLT’20), 2020.
  • [15] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
  • [16] R. Dorfman. The detection of defective members of large populations. Annals of Mathematical Statistics, 14:436–440, 1943.
  • [17] E. Seifried and S. Ciesek. Pool testing of SARS-CoV-02 samples increases worldwide test capacities many times over, 2020. https://www.bionity.com/en/news/1165636/pool-testing-of-sars-cov-02-samples-increases-worldwide-test-capacities-many-times-over.html, last accessed on 2020-11-16.
  • [18] Y. Erlich, A. Gilbert, H. Ngo, A. Rudra, N. Thierry-Mieg, M. Wootters, D. Zielinski, and O. Zuk. Biological screens from linear codes: theory and tools. bioRxiv, page 035352, 2015.
  • [19] European Centre for Disease Prevention and Control. Surveillance and studies in a pandemic in Europe, 2009. https://www.ecdc.europa.eu/en/publications-data/surveillance-and-studies-pandemic-europe (last accessed on 2020-11-16).
  • [20] O. Gebhard, M. Hahn-Klimroth, O. Parczyk, M. Penschuck, M. Rolvien, J. Scarlett, and N. Tan. Near-optimal sparsity-constrained group testing: improved bounds and algorithms. arXiv preprint, 2021.
  • [21] Y. Gefen, M. Szwarcwort-Cohen, and R. Kishony. Pooling method for accelerated testing of COVID-19, 2020. https://www.technion.ac.il/en/2020/03/pooling-method-for-accelerated-testing-of-covid-19/ (last accessed on 2020-11-16).
  • [22] E. Gould. Methods for long-term virus preservation. Mol Biotechnol, 13:57–66, 1999.
  • [23] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:301:13–30, 1963.
  • [24] F. Hwang. A method for detecting all defective members in a population by group testing. Journal of the American Statistical Association, 67:605–608, 1972.
  • [25] S. Janson, T. Luczak, and A. Rucinski. Random Graphs. John Wiley and Sons, 2011.
  • [26] O. Johnson, M. Aldridge, and J. Scarlett. Performance of group testing algorithms with near-constant tests per item. IEEE Transactions on Information Theory, 65:707–723, 2018.
  • [27] O. Johnson and D. Sejdinovic. Note on noisy group testing: Asymptotic bounds and belief propagation reconstruction. Proceedings of 48th Allerton Conference on Communication, Control, and Computing, 2010.
  • [28] E. Knill, A. Schliep, and D. Torney. Interpretation of pooling experiments using the Markov chain Monte Carlo method. Journal of Computational Biology, 3:395–406, 1996.
  • [29] H. Kwang-Ming and D. Ding-Zhu. Pooling designs and nonadaptive group testing: important tools for DNA sequencing. World Scientific, 2006.
  • [30] A. Lalkhen. Clinical tests: sensitivity and specificity. Continuing Education in Anaesthesia Critical Care & Pain, 8, 2008.
  • [31] S. Long, C. Prober, and M. Fischer. Principles and practice of pediatric infectious diseases. Elsevier, 2018.
  • [32] N. Madhav, B. Oppenheim, M. Gallivan, P. Mulembakani, E. Rubin, and N. Wolfe. Pandemics: Risks, impacts and mitigation. The World Bank:Disease control priorities, 9:315–345, 2017.
  • [33] D. M. Malioutov and M. Malyutov. Boolean compressed sensing: Lp relaxation for group testing. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
  • [34] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 1989: Invited Papers at the 12th British Combinatorial Conference, page 148–188, 1989.
  • [35] R. Mourad, Z. Dawy, and F. Morcos. Designing pooling systems for noisy high-throughput protein-protein interaction experiments using Boolean compressed sensing. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10:1478–1490, 2013.
  • [36] L. Mutesa, P. Ndishimye, Y. Butera, J. Souopgui, A. Uwineza, R. Rutayisire, E. Musoni, N. Rujeni, T. Nyatanyi, E. Ntagwabira, M. Semakula, C. Musanabaganwa, D. Nyamwasa, M. Ndashimye, E. Ujeneza, I. Mwikarago, C. Muvunyi, J. Mazarati, S. Nsanzimana, N. Turok, and W. Ndifon. A strategy for finding people infected with SARS-CoV-2: optimizing pooled testing at low prevalence. Nature, 589:276–280, 2021. doi:10.1038/s41586-020-2885-5.
  • [37] H. Ngo and D. Du. A survey on combinatorial group testing algorithms with applications to DNA library screening. Discrete Mathematical Problems with Medical Applications, 7:171–182, 2000.
  • [38] U.S. Department of Health and Human Services. Pandemic influenza plan, 2017. https://www.cdc.gov/flu/pandemic-resources/pdf/pandemic-influenza-implementation.pdf (last accessed on 2020-11-16).
  • [39] World Health Organisation. Global surveillance during an influenza pandemic, 2009. www.who.int/csr/resources/publications/swineflu (last accessed on 2020-11-16).
  • [40] M. Plebani. Diagnostic errors and laboratory medicine – causes and strategies. Electronic Journal of the International Federation of Clinical Chemistry and Laboratory Medicine, 26:7–14, 2015.
  • [41] T. Richardson and R. Urbanke. Modern coding theory. Cambridge University Press, 2007.
  • [42] C. Sammut and G. Webb. Encyclopedia of machine learning. Springer, 2011.
  • [43] J. Scarlett. Noisy adaptive group testing: Bounds and algorithms. IEEE Transactions on Information Theory, 65:3646–3661, 2018.
  • [44] J. Scarlett. An efficient algorithm for capacity-approaching noisy adaptive group testing. Proceedings of 2019 IEEE International Symposium on Information Theory (ISIT’19), pages 2679–2683, 2019.
  • [45] J. Scarlett and V. Cevher. Converse bounds for noisy group testing with arbitrary measurement matrices. Proceedings of 2016 IEEE International Symposium on Information Theory (ISIT’16), pages 2868–2872, 2016.
  • [46] J. Scarlett and V. Cevher. Phase transitions in group testing. Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’16), 1:40–53, 2016.
  • [47] J. Scarlett and V. Cevher. Near-optimal noisy group testing via separate decoding of items. IEEE Journal of Selected Topics in Signal Processing, 2017.
  • [48] J. Scarlett and O. Johnson. Noisy non-adaptive group testing: A (near-)definite defectives approach. IEEE Transactions on Information Theory, 66(6):3775–3797, 2020.
  • [49] N. Thierry-Mieg. A new pooling strategy for high-throughput screening: the shifted transversal design. BMC Bioinformatics, 7:28, 2006.
  • [50] L. Wang, X. Li, Y. Zhang, and K. Zhang. Evolution of scaling emergence in large-scale spatial epidemic spreading. Public Library of Science ONE, 6, 2011.
  • [51] L. Wein and S. Zenios. Pooled testing for HIV screening: Capturing the dilution effect. Operations Research, 44:543–569, 1996.
  • [52] S. Woloshin, N. Patel, and A. Kesselheim. False negative tests for SARS-CoV-2 infection — challenges and implications. New England Journal of Medicine, 2020.