This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Source Detection via Contact Tracing in the Presence of Asymptomatic Patients

Gergely Ódor gergely.odor@epfl.ch 0000-0001-9139-249X EPFLLausanneSwitzerland Jana Vuckovic jana.vuckovic@epfl.ch EPFLLausanneSwitzerland Miguel-Angel Sanchez Ndoye miguel-angel.sanchezndoye@epfl.ch EPFLLausanneSwitzerland  and  Patrick Thiran patrick.thiran@epfl.ch EPFLLausanneSwitzerland
Abstract.

Inferring the source of a diffusion in a large network of agents is a difficult but feasible task, if a few agents act as sensors revealing the time at which they got hit by the diffusion. One of the main limitations of current source detection algorithms is that they assume full knowledge of the contact network, which is rarely the case, especially for epidemics, where the source is called patient zero. Inspired by recent implementations of contact tracing algorithms, we propose a new framework, which we call Source Detection via Contact Tracing Framework (SDCTF). In the SDCTF, the source detection task starts at the time of the first hospitalization, and initially we have no knowledge about the contact network other than the identity of the first hospitalized agent. We may then explore the network by contact queries, and obtain symptom onset times by test queries in an adaptive way, i.e., both contact and test queries can depend on the outcome of previous queries. We also assume that some of the agents may be asymptomatic, and therefore cannot reveal their symptom onset time. Our goal is to find patient zero with as few contact and test queries as possible. We propose two local search algorithms for the SDCTF: the LS algorithm is more data-efficient, but can fail to find the true source if many asymptomatic agents are present, whereas the LS+ algorithm is more robust to asymptomatic agents. By simulations we show that both LS and LS+ outperform state of the art adaptive and non-adaptive source detection algorithms adapted to the SDCTF, even though these baseline algorithms have full access to the contact network. Extending the theory of random exponential trees, we analytically approximate the probability of success of the LS/ LS+ algorithms, and we show that our analytic results match the simulations. Finally, we benchmark our algorithms on the Data-driven COVID-19 Simulator (DCS) developed by Lorch et al., which is the first time source detection algorithms are tested on such a complex dataset.

adaptive source detection, contact tracing, sensor selection, epidemics
copyright: none

1. Introduction

During the COVID-19 pandemic, we have seen a revolution of the contact tracing technology, which helped track and contain the epidemic (Braithwaite et al., 2020; Kretzschmar et al., 2020). Some contact tracing programs were conducted by governmental/health agencies (Park et al., 2020), while others relied on decentralized approaches (Troncoso et al., 2020). Most contact tracing approaches work by notifying people who could have received the infection from known infectious patients, i.e., they trace “forward” in time. However, some advocate that a “bidirectional” tracing, where the past history of the infection is also tracked, can be more effective (Bradshaw et al., 2021; Endo et al., 2020; Kojaku et al., 2021). In this paper we focus on the “backward” direction of the problem; the task of identifying the first patient who carried the disease, also called patient zero, or the source of the epidemic. The identification of patient zero can either be limited to a smaller population cluster, in which case it can be a first step towards “bidirectional” tracing, or it can be more ambitious; finding the first patient who developed the mutation of a certain disease can help understanding how the mutation occurred, which can help us prevent, or better prepare for future epidemics.

Surprisingly, given the importance of the problem and the relatively large literature on the topic, we are not aware of any instance where source detection algorithms have been applied in real situations, including during the COVID-19 pandemic. Our goal in this paper is to examine the applicability of the source detection models in the literature (which we call frameworks from now on), and then propose a new framework, which improves them in several aspects. Originally, source detection was introduced in the context of rumor spreading instead of epidemics by Zaman and Shah in their pioneering Sigmetrics paper (Shah and Zaman, 2010, 2011). Translating to the language of epidemics for clarity, in the framework of (Shah and Zaman, 2011), an epidemic spreads over a network of agents that is completely known to us, and we observe a snapshot of the network, which means that every agent reveals if they are infected or not at some given time (not too early, because then the problem is trivial, nor too late, because then the problem is impossible). Shortly after (Shah and Zaman, 2011), Pinto et al. proposed a different framework, in which agents (also called sensors) reveal, in addition to their state, the time when they became infected, but where only a few of them do so and act as sensors (Pinto et al., 2012); indeed, the problem is trivial if all agents are sensors. This framework is better tailored to epidemics, as it is reasonable that obtaining any information from all the agents is much harder than asking one more question about the starting time of the symptoms of the disease to only some of them. Pinto et al. found that in their framework, if the sensors are already selected, the maximum likelihood estimator of the source has a closed form solution when the underlying network is a tree, and the time it takes for an agent to infect one of its susceptible contacts follows a Gaussian distribution. For general graphs, it is difficult to find an algorithm with any theoretical guarantees, although we note that many heuristics have been developed (Hu et al., 2018; Li et al., 2019; Paluch et al., 2018; Paluch et al., 2020b; Shen et al., 2016; Tang et al., 2018; Xu et al., 2019; Zhu et al., 2016). The only exception is on very simple contact networks (Lecomte et al., 2020), or when the epidemic spreads deterministically between the agents (Zejnilovic et al., 2013), which is not a realistic assumption for epidemics, but at least the estimation algorithm is trivial, and more emphasis can be put on the question of how the sensors should be selected for good performance (Spinelli et al., 2017c, 2018), which again is studied by heuristics in the general case (Paluch et al., 2020a). For a recent review of source detection algorithms, see (Shelke and Attar, 2019).

One of the main criticisms of original framework of Pinto et al. is that, even though the contact network is fully known, it is very difficult to find the source exactly unless a large fraction (20-50%) of the population act as sensors, which is unrealistic in the case of an epidemics, when the source is searched in a large population. An alternative recently proposed is to compute confidence sets for the source instead of finding it (Dawkins et al., 2021). But if our goal is to locate the source exactly, a promising approach is to allow the sensors to be selected adaptively to previous observations (Zejnilović et al., 2015, 2017), which we call adaptive sensor placement. When the contact network is known, adaptive strategies have been studied by simulations (Spinelli et al., 2017a, b) and by theoretical analysis (Lecomte et al., 2020), and they show a large reduction in the number of required sensors in real networks. In this paper, we will also allow the sensors to be placed adaptively.

We believe that the most problematic assumption that is still present in source detection papers, is the full knowledge of the contact network of agents, which is unrealistic (let alone because of privacy concerns). Due to this lack of data-availability, algorithms in the source detection literature have not been tested on realistic epidemic data. Moreover, while governmental/health agencies might have access to private datasets, such as cellular location data, from which a contact network may be estimated, these networks may be very noisy, and are potentially unfit for the source detection task. We only know of a few papers that study the effect of imperfections in the network data on the source detection task (Mashkaria et al., 2020; Zejnilović et al., 2016), but these papers study epidemics that spread deterministically between the agents. Inspired by adaptive sensor placement, and by the recent implementations of contact tracing algorithms, we propose a new framework for source detection, which we call Source Detection via Contact Tracing Framework (SDCTF). In SDCTF, algorithms can have two types of queries: contact queries, which can be used to explore the network, and sensor (test) queries, after which agents reveal their symptom onset time as before. The goal of the algorithm is to find the source as accurately as possible, while minimizing the number of contact and sensor queries. The SDCTF is a way to formalize the source detection task; it determines the goal of the algorithm and how information can be gained about the epidemic, but it does not specify the underlying epidemic and mobility data models (simulated or real). In this paper, we analyse different algorithms in the SDCTF with various epidemic and mobility models.

Besides specifying the possible queries that algorithms can make, the SDCTF also determines the way the outbreak is detected, which marks the starting time of the source detection task. In sensor-based source detection, the source detection task often starts long after the outbreak, when essentially all agents in the network are infected (Pinto et al., 2012), which can be seen as a limitation of source detection frameworks. The SDCTF is also closely related to contact tracing frameworks, where it is standard to assign a probability that each node spontaneously self-reports after developing symptoms, which triggers the activation of contact tracing algorithms (Kretzschmar et al., 2020; Bradshaw et al., 2021). In the SDCTF, we adopt the idea of self-reporting with a slight modification. We believe that the most interesting time to perform the source detection task is when a new disease (or a new mutation of the disease) appears, and therefore we tie these self-reporting events to hospitalizations, where infections are properly diagnosed by healthcare professionals. In particular, this means that the SDCTF can only be applied to epidemic data (and models) where hospitalizations are well-defined. In this paper, we use the datasets generated by the Data-driven COVID Simulator (DCS) introduced in (Lorch et al., 2020), which is one of the most realistic toolboxes that generate datasets modelling COVID-19, which we are aware of (notably, hospitalizations are part of the model). We also propose synthetic approximations for the epidemic and mobility models in the DCS; the Deterministically Developing Epidemic model and the Household Network Model, which improve the interpretability of our results since they have fewer parameters.

We propose a simple algorithm called LocalSearch (LS), which adaptively traces back the transmission path from the first hospitalized patient to the source. The LS algorithm is quite efficient at finding the source; the number of contact and sensor queries that it uses does not depend on the size of the network, but only on the local neighborhood of the source. Moreover, the LS algorithm provably finds the source with 100% accuracy, because of our assumption that every contact and sensor query is answered without noise. However, it is well-known that data-availability is a major issue in contact tracing (BeidasRinad et al., 2020), either because the agents do not comply with contact tracing efforts, or possibly (and in particular in the current COVID-19 epidemic) because they do not develop symptoms, and are unaware that they have the disease. In this paper, we model the effect of asymptomatic agents. When queried and tested, these agents do not reveal their time of infection, only whether they have or had the disease at some point. We show that the accuracy of the LS algorithm drops in the presence of asymptomatic agents, because the algorithm can get stuck while tracing back the transmission path from the first hospitalized patient to the source. Therefore, we propose an improved version of LS called LS+, which accounts for the presence of asymptomatic agents by placing more sensors. We are not aware of any previous work in the source detection literature that models the effect of asymptomatic patients, but the resulting model can be seen as a mix between the snapshot and the sensor-based models. We mention that non-complying agents or agents who provide noisy observations have been studied by (Altarelli et al., 2014; Hernando et al., 2008; Louni et al., 2015). Non-complying agents could also be included in our framework by treating them as asymptomatic agents (even though in this case we have no information about whether the agent had the disease or not), without jeopardizing the correctness of our algorithms.

We benchmark the LS and LS+ algorithms in both our data-driven and our synthetic epidemic and mobility models, and we compare them to state-of-the-art adaptive (Spinelli et al., 2017b) and non-adaptive (Jiang et al., 2016; Lokhov et al., 2014) algorithms tailored to the SDCTF, whenever possible. We find that both LS and LS+ outperform these baseline algorithms in accuracy (probability of finding the correct source).

While the LS/LS+ are designed to be simple algorithms, their theoretical analysis is quite challenging. Nevertheless, we are able to provide rigorous results about the success probability of both algorithms after a series of simplifications to the epidemic and mobility models, by extending some recent results on the theory of exponential random trees (Feng and Mahmoud, 2018; Mahmoud, 2021), which have previously not been connected to the source detection literature. We present these theoretical results in Section 4, after formally introducing the SDCTF, our models and the LS/LS+ algorithms in Section 3. By simulations, we show that our analytic results approximate the accuracy of the algorithms well, even in the most realistic setting in Section 5. Our analytic results provide additional insight into how the parameters of the epidemic and mobility models affect the performance of the algorithms. We discuss these insights along with some non-rigorous computations that mirror our main proof ideas in Section 2. Reading Section 2 before Sections 3-5 is useful to build intuition, but is not necessary to understand the paper.

2. Warmup Results

2.1. A Simple Network and Epidemic Model and a Simple Algorithm

Refer to caption
Figure 1. (a)-(c) shows the spread of the infection in the model considered in Section 2.1, which is equivalent to the growth of the RERT, with d=2d=2. Dark blue edges show the contacts on day tt, and light blue edges show contacts present on previous days (and thus subfigures). Orange (resp., red; black) nodes mark symptomatic non-hospitalized (resp, asymptomatic; symptomatic hospitalized) nodes. (d)-(f) shows the LS source detection algorithm introduced in Section 2.2, which succeeds in this example because there are no asymptomatic nodes on the transmission path between the first hospitalized node and the source. Black edges show the queried edges, and black stroke marks nodes already discovered by the algorithm. A node with black X marks a negative test result, and red stroked node marks the node currently maintained as source candidate by the LS algorithm.

Let us consider a time-dependent network model, where each agent meets dd new agents each day in such a way that the contact network is an infinite tree (ignoring the label of the edges giving the propagation time along the edge). This network models homogeneous mixing in a very large population; we consider more realistic network models in Section 3. On this network, we consider an epidemic model that starts at t=0t=0 with one infected agent, and then progresses as infected agents infect their dd susceptible contacts each independently with probability pip_{i} each day. Since our goal is to study the epidemic process, it is sufficient to track only the agents who are already infectious (also called internal nodes), and the agents who are in contact with infectious agents at time tt (also called external nodes), as shown in Figure 1 (a)-(c). For d=1d=1, the spread of the infection is then equivalent to the growth a random tree 𝒯t\mathcal{T}_{t} rooted at the source of the infection, known under the name of Random Exponential Recursive Tree (RERT) and recently introduced in (Mahmoud, 2021). Because of the similarities of the models, we refer to the model with general dd as RERT in the remaining of this section. We point out that the standard literature on elementary branching processes such as Galton-Watson trees or random recursive trees (Drmota, 2009) is not applicable in our scenario, because these branching processes have no notion of global time (i.e., a node in such processes becomes infectious immediately after receiving the infection), whereas nodes in diseases commonly go through an exposed, non-infectious period before becoming infectious, which is well captured by the RERT model. We mention that there is literature on more advanced branching processes that do have a notion of global time, e.g. Crump-Mode-Jagers trees (Jagers and Nerman, 1984), however we opt for the RERT because of its simple definition.

After a node (patient) becomes infected, the disease can take three courses (which for now do not affect 𝒯t\mathcal{T}_{t}): with probability pap_{a} the patient is asymptomatic, with probability (1pa)ph(1-p_{a})p_{h} the patient is hospitalized, and with probability (1pa)(1ph)(1-p_{a})(1-p_{h}) the patient recovers without hospitalization. The governmental/health agency learns about the outbreak when the first hospitalization occurs (see Figure 1 (c)) and starts the source detection process right away. It can inquire about the contacts of each agent and it can test the agents. From patients that were symptomatic (at any point in time in the past), the agency learns about their symptom onset time (which, in this simple model, is always one day after the infection time), but from asymptomatic patients it only learns that they had (or have) the disease at some point when they are tested. The framework introduced in this paragraph (including both the detection of the outbreak through the first hospitalization, and the possible actions the agency can take) is a simplified version of the SDCTF (Source Detection via Contact Tracing Framework), introduced in Section 3.3.

The network and epidemic models introduced in this section have four parameters: d,pi,pa,phd,p_{i},p_{a},p_{h}, and it is important to understand how each of them affects the difficulty of source detection in the SDCTF. We distinguish two important factors. First, if the outbreak is not detected rapidly enough, the length of the transmission path to the first hospitalized agent is long, and source detection becomes then difficult, because a lot of information needs to be recovered. Therefore, a low pip_{i}, a low php_{h} and/or a high pap_{a} parameter can hinder source detection (recall that the probability of hospitalization was ph(1pa)p_{h}(1-p_{a})). The second factor is related to the difficulty of recovering information about the transmission path. If pap_{a} is high, then there are a lot of nodes who are asymptotic and therefore do not reveal their symptom onset time, making source detection very difficult. Since pap_{a} affects both the length of the transmission path and the amount of collected information, it is safe to expect that, of all parameters, pap_{a} has the largest effect on the difficulty of source detection. The parameter dd is interesting, because a large dd can reduce the length of the transmission path, but it also makes the information about the transmission path less accessible as more agents need to be tested. Since in this paper we do not set a hard constraint on the total number of available tests, the advantage of a shorter path takes over the drawback of additional tests and a large dd increases the success probability.

To say anything quantitative about source detection in the SDCTF, we must discuss specific algorithms that solve the source detection task. In this paper we propose a simple algorithm called LocalSearch (LS), shown in Figure 1 (d)-(f). The LS algorithm maintains one candidate node scs_{c} at each iteration (initially, the first hospitalized node), which is always symptomatic, and it updates it in a greedy way: at the time of the infection of scs_{c}, all its dd incident edges are queried, and all its dd neighbors are tested. Then the agent with the lowest reported infection time will be the new candidate scs_{c}. The algorithm stops when scs_{c} does not change anymore between two consecutive iterations. For simplicity, we assume that the infection does not spread any further during these iterations, however, this assumption does not affect the ability of the algorithm to find the source or not. Indeed, it is not difficult to see that on tree networks, LS succeeds if and only if there are no asymptomatic nodes on the transmission path from the source to the first hospitalized agent. This observation leads us to enhance the LS algorithm by also searching within the neighbors of asymptomatic nodes; we explore this idea in the LS+ algorithm introduced in Section 3.5. We are not aware of this simple greedy algorithm being studied in the context of source detection, although similar ideas were implemented for non-adaptive source detection to lower the runtime of the algorithms (Paluch et al., 2018).

2.2. Back of the Envelope Calculation

Now, we have all the tools to estimate the probability of success of the LS algorithm. First we condition on the course of the disease in the source. With probability pap_{a}, the source is asymptomatic and LS can never succeed. With probability (1pa)ph(1-p_{a})p_{h}, the source itself becomes hospitalized, and LS always succeeds. Finally, with probability (1pa)(1ph)(1-p_{a})(1-p_{h}) the source is symptomatic but not hospitalized, which we call event 𝒜\mathcal{A}. If event 𝒜\mathcal{A} happens, then LS may or may not succeed depending on whether there are any asymptomatic nodes on the transmission path. More precisely, conditioned on event 𝒜\mathcal{A} and on the transmission path having length ll, the probability of success is (1pa)l1(1-p_{a})^{l-1} (since there are l1l-1 nodes on the path which can be asymptomatic), which implies

(1) 𝐏(success)=(1pa)ph+(1pa)(1ph)(l=1t𝐏(transmission path has length l𝒜)(1pa)l1).\mathbf{P}(\mathrm{success})=(1-p_{a})p_{h}+(1-p_{a})(1-p_{h})\left(\sum_{l=1}^{t}\mathbf{P}\left(\text{transmission path has length $l$}\mid\mathcal{A}\right)(1-p_{a})^{l-1}\right).

The difficult part is to compute the distribution of the transmission path conditioned on event 𝒜\mathcal{A}; indeed we already saw that all four parameters d,pi,pa,phd,p_{i},p_{a},p_{h} affect this distribution in a non-trivial way. Let us perform a back of the envelope computation to get more insight into the effect of these parameters. The exact structure of the infection tree will not matter for this computation, only its profile does. It is denoted by 𝒯t(l)\mathcal{T}_{t}(l) and defined as the number of (internal) nodes at level ll (i.e., at distance ll from the source of the infection). Remember that by definition the RERT has d𝒯t1(l1)d\cdot\mathcal{T}_{t-1}(l-1) external nodes on level ll, and that at time tt each external node is promoted to be internal with probability pip_{i} to form 𝒯t\mathcal{T}_{t}. Consequently, the level of a node hh added at time t>0t>0 has the same distribution (conditioned on the tree 𝒯t1\mathcal{T}_{t-1} at the previous step) as the size (number of internal nodes) of the profile 𝒯t1(l1)\mathcal{T}_{t-1}(l-1), that is,

(2) 𝐏(level(h)=l𝒯t1=Tt1)=Tt1(l1)|Tt1|.\mathbf{P}(\mathrm{level}(h)=l\mid\mathcal{T}_{t-1}=T_{t-1})=\frac{T_{t-1}(l-1)}{|T_{t-1}|}.

Working on the RERT directly can be a daunting task, therefore we propose to approximate the numerator and the denominator of equation (2) by 𝐄[𝒯t1(l1)]\mathbf{E}[\mathcal{T}_{t-1}(l-1)] and 𝐄|[𝒯t1|]\mathbf{E}|[\mathcal{T}_{t-1}|], respectively. It can be shown by a simple inductive argument, or by generating functions as in (Mahmoud, 2021), that for RERTs we have 𝐄[𝒯t(l)]=(tl)(dpi)l\mathbf{E}[\mathcal{T}_{t}(l)]=\binom{t}{l}(dp_{i})^{l} and 𝐄[|𝒯t|]=(1+dpi)t\mathbf{E}[|\mathcal{T}_{t}|]=(1+dp_{i})^{t}, which suggests a binomial distribution for the level of hh. And indeed, we can approximate the distribution of the level of a node hh added at time tt as

𝐏(level(h)=l)\displaystyle\mathbf{P}(\mathrm{level}(h)=l) 𝐄[𝒯t1(l1)]𝐄[|𝒯t1|]\displaystyle\approx\frac{\mathbf{E}[\mathcal{T}_{t-1}(l-1)]}{\mathbf{E}[|\mathcal{T}_{t-1}|]}
=(t1l1)(dpi)l1(1+dpi)t1\displaystyle=\frac{\binom{t-1}{l-1}(dp_{i})^{l-1}}{(1+dp_{i})^{t-1}}
=(t1l1)(dpi1+dpi)l1(1dpi1+dpi)tl\displaystyle=\binom{t-1}{l-1}\left(\frac{dp_{i}}{1+dp_{i}}\right)^{l-1}\left(1-\frac{dp_{i}}{1+dp_{i}}\right)^{t-l}
=𝐏(Bin(t1,q)=l1),\displaystyle=\mathbf{P}(\mathrm{Bin}(t-1,q)=l-1),

with q=dpi/(1+dpi)q=dp_{i}/(1+dp_{i}).

One of the main challenges of this calculation is that we do not know the day of the first hospitalization tt conditioned on event 𝒜\mathcal{A}, we only know that each node is hospitalized with probability (1pa)ph(1-p_{a})p_{h}, which means that the index of the first hospitalized node follows a geometric distribution with mean 1/((1pa)ph)1/((1-p_{a})p_{h}). We approximate t1t-1 by the first time that the expected size of the infection tree (excluding the source since we condition on event 𝒜\mathcal{A}) exceeds the expected index of the first hospitalized node. Therefore we solve

𝐄[|𝒯t1|1]=(1+dpi)t11=1(1pa)ph=𝐄[index of the first hospitalized node]\mathbf{E}[|\mathcal{T}_{t-1}|-1]=(1+dp_{i})^{t-1}-1=\frac{1}{(1-p_{a})p_{h}}=\mathbf{E}[\text{index of the first hospitalized node}]

for tt (relaxing the constraint that tt is an integer), which gives

t1=log(1+1(1pa)ph)log(1+dpi).t-1=\frac{\log\left(1+\frac{1}{(1-p_{a})p_{h}}\right)}{\log(1+dp_{i})}.

Consequently, we approximate 𝐏(transmission path has length l𝒜)\mathbf{P}\left(\text{transmission path has length $l$}\mid\mathcal{A}\right) by 𝐏(Bin(t1,q)=l1)\mathbf{P}(\mathrm{Bin}(t-1,q)=l-1). Continuing equation (1), and using the well-known expression of the probability generating function of the binomial distribution, we get

𝐏(success)\displaystyle\mathbf{P}(\mathrm{success}) (1pa)ph+(1pa)(1ph)(l=1t𝐏(Bin(t1,q)=l1)(1pa)l1)\displaystyle\approx(1-p_{a})p_{h}+(1-p_{a})(1-p_{h})\left(\sum_{l=1}^{t}\mathbf{P}\left(\mathrm{Bin}\left(t-1,q\right)=l-1\right)(1-p_{a})^{l-1}\right)
(3) =(1pa)(ph+(1ph)((1pa)dpi1+dpi+1dpi1+dpi)log(1+1(1pa)ph)log(1+dpi)).\displaystyle=(1-p_{a})\left(p_{h}+(1-p_{h})\left((1-p_{a})\frac{dp_{i}}{1+dp_{i}}+1-\frac{dp_{i}}{1+dp_{i}}\right)^{\frac{\log\left(1+\frac{1}{(1-p_{a})p_{h}}\right)}{\log(1+dp_{i})}}\right).

One can check that this expression agrees with our qualitative intuition. However, it is not at all clear whether it is valid because of the strong approximations made in some steps of the above computation. In Section 4, we prove a rigorous upper bound on the success probability, and we also provide much more careful approximations by proving exact theorems about the simplified models that we use. Then, in Section 5 we compare our results with simulation results on synthetic data, as well as with data generated by the DCS model.

3. Models, Methods, Algorithms

3.1. Epidemic Models

3.1.1. The DCS Model

We call DCS the model implemented by (Lorch et al., 2020). The DCS model is fairly complex, and we only give a brief overview.

Each agent in the agent set VV can be in one of 8 states: susceptible, exposed, asymptomatic infectious, pre-symptomatic infectious, symptomatic infectious, hospitalized, recovered or dead. Transitions between different states are characterized by counting processes described by stochastic differential equations with jumps. The most important, and also most complicated of these counting processes is the exposure counting process Ni(t)N_{i}(t), which is modeled by a Hawkes process for each agent ii. Hawkes processes are point processes with a time-dependent, self-exciting conditional intensity function λi(t)\lambda^{*}_{i}(t).

(4) λi(t)=βjV\{i}tδtKi,j(τ)γeγ(tτ)𝑑τ\hskip-5.0pt\lambda^{*}_{i}(t)=\beta\sum_{j\in V\backslash\{i\}}\int_{t-\delta}^{t}K_{i,j}(\tau)~\gamma e^{-\gamma(t-\tau)}\,d\tau

where the kernel Ki,j(τ)K_{i,j}(\tau) indicates whether jj has been at time τ\tau at the same site where ii is at time tt, and whether jj is in the infectious state. Parameters γ\gamma and δ\delta are the decay of infectiousness at sites and the non-contact contamination window, respectively, and they account for the fact that jj can infect ii even if they are never at the same site, as jj can leave some pathogens behind (airborne for instance). Parameter β\beta is the transmission rate for symptomatic and asymptomatic individuals, and it comes in two versions: βc\beta_{c} accounts for infections outside the household and βh\beta_{h} accounts for infection in the household. Parameters βc\beta_{c} and βh\beta_{h} are fitted to the COVID-19 infection data of Tubingen from 12/03/2020 to 03/05/2020 using Bayesian Optimization. The model also has a parameter for the relative asymptomatic transmission rate built into the function Ki,j(τ)K_{i,j}(\tau), which scales down the infectiousness of asymptomatic agents (to 55% of the infectiousness of symptomatic agents by default).

Once a susceptible agent becomes infected, the disease can take three possible courses (see Figure 2 (a)). With probability pap_{a}, the agent becomes asymptomatic infectious after time TET_{E}, and then recovers after time TIT_{I}. With probability 1pa1-p_{a}, the agent becomes pre-symptomatic infectious after time TET_{E}, next symptomatic infectious after time TPT_{P}, and then recovers with probability 1ph1-p_{h} after time TITPT_{I}-T_{P}, or becomes hospitalized with probability php_{h} after time THT_{H}. Agents in the DCS are also assigned age values based on demographic data, and the hospitalization probability php_{h} of each agent is determined based on its age (following COVID-19 infection data). The times TE,TP,TIT_{E},T_{P},T_{I} and THT_{H} are drawn from an appropriately parametrized (using values from the COVID-19 literature) lognormal distribution as shown in Table 3.

3.1.2. The DDE Model

We start by taking the DCS model (Lorch et al., 2020), which we simplify to enable its theoretical analysis. In the Deterministically Developing Epidemic (DDE) model, continuous time (used in DCS) is replaced by discrete time-steps: we refer to one time-step in the DDE as one day. Instead of modelling the infection propagation as a Hawkes process, an infectious agent (symptomatic or asymptomatic) can infect its susceptible neighbor with probability pip_{i} each day. Thereafter, the disease progresses the same way as in the DCS, except that in the DDE model the transition times are deterministic (the infection events and the severity of the disease (i.e., the (a)symptomatic and hospitalized states) are still determined randomly), and we have a single parameter php_{h} for the hospitalization probability (agents in this model do not have an age parameter). We discuss how we set the parameters of the DDE model in Section 3.4.

Refer to caption
Figure 2. (a) The flow diagram of the DCS and DDE epidemic models. (b) A possible epidemic outbreak in the Tubingen mobility model, and (c) the Household network model. The large grey circles mark households, and the purple nodes mark places, otherwise we use the same coloring as in (a). In both cases (b) and (c), the transmission paths are (v2,v4,v5,v8)(v_{2},v_{4},v_{5},v_{8}).

3.2. Simulating Mobility

3.2.1. Tubingen Mobility Model

We briefly review the mobility model introduced in (Lorch et al., 2020), and illustrated in Figure 2 (b). The population is partitioned into households of possibly varying size (usually between 1 and 5). The households are assigned a location, and we also place some external sites (shops, offices, schools, transport stations, recreating sites) on the map, which the agents may visit. The location of the households and the number of agents in them is sampled randomly based on demographic datasets. Initially, each agent is assigned a few favorite sites (randomly based on distance), and will only visit these throughout the simulation. Each agent decides to leave home after some exponentially distributed time, visits one of its (randomly chosen) favorite sites, and comes back home after another (usually much shorter) exponentially distributed time. If two agents visit the same site at the same time, or within some time δ\delta, we record them as a contact, which gives an opportunity for the infection to propagate. We denote the Tubingen mobility model as TU, and the DCS epidemic model that runs on the TU mobility model as DCS+TU.

3.2.2. Household Network Model

The Household network model (HNM) was inspired by (Lorch et al., 2020), however we note that similar models have been studied in the theoretical community by (Ball et al., 2009). As in the Tubingen mobility model, in HNM NN nodes are assigned into households, but of constant size dh+1d_{h}+1. Every pair of nodes in the same household are connected by an edge, forming therefore cliques of size dh+1d_{h}+1. Additionally, each node is assigned dcd_{c} half edges, which are paired uniformly at random with other half-edges in the beginning. Some half-edge pairings can result in self-loops or multi-edges, which are discarded. This construction defines a random graph generated by a configuration model, which shares a lot of similarities with Random Regular Graphs (RRG) (Wormald et al., 1999). In fact, if we join nodes in the same household into a single node in the HNM (which we refer to as the network of households of the HNM), then the resulting graph is equivalent to the pairing model of RRGs with degree dc(dh+1)d_{c}(d_{h}+1). It is well-known that in the pairing model of RGGs of degree dd, the local neighborhood (of constant radius, as the number of nodes tends to infinity) of a uniformly randomly chosen vertex is a dd-regular tree (with probability tending to 1), which implies that locally there are asymptotically almost surely no self-loops, multi-edges or any cycles in the graph. This result has various names; in random graph theory the result is usually proved by subgraph counting (Wormald et al., 1999), in probability theory it is the basis of branching process approximations (Ball et al., 2009), and in graph limit theory it is called the local convergence to the infinite dd-regular tree (Benjamini and Schramm, 2011). In our theoretical analysis, this result motivates the approximation of the neighborhood of the source in the network of households of the HNM by an infinite dc(dh+1)d_{c}(d_{h}+1)-regular tree. The HNM itself is then approximated by replacing each (household) node of the infinite dc(dh+1)d_{c}(d_{h}+1)-regular tree of households by a (dh+1)(d_{h}+1)-clique, and by setting the edges so that each (individual) node has degree exactly dc+dhd_{c}+d_{h}, while keeping the connection between cliques unchanged (see Figure 2 (c) for a visualization).

Since the HNM is a time-independent graph, we adopt the standard notations from graph theory. Formally, the HNM is given by the set of nodes and edges G=(V,E)G=(V,E). Let us denote by H(v)H(v) the set of nodes that are in the same household as node vv. The distance between two nodes u,vVu,v\in V (denoted by d(u,v)d(u,v)) is defined as a number of edges of the shortest path between uu and vv. We denote the DDE epidemic model that runs on the HNM network as DDE+HNM.

3.3. The Source Detection via Contact Tracing Framework

We present the Source Detection via Contact Tracing Framework (SDCTF), which can be applied to both epidemic and mobility models presented so far. The framework determines how the government/health agency, which conducts the source detection task, learns about the outbreak, and how it can gather further information to locate the source. In the SDCTF, as in Section 2.1, the agency learns about the outbreak when the first hospitalization occurs, and it also learns the identity of nodes when they become hospitalized (including the identity of the first hospitalized node).

After the outbreak is detected, the agency can make three types of queries. The first type of query, the household query with parameter vv, reveals the agents that live in the same household as vv. The household query works the same way in both the TU and the HNM models, and we do not limit the number of times it can be called (these queries are considered as cheap in the SDCTF). The second type of query, the contact query, works differently in the TU and the HNM models. For the TU model, a contact query has two parameters: an agent vv and a time window [t1,t2][t_{1},t_{2}]. As a result, all agents that have been in contact with vv (and therefore could have infected vv or could have been infected by vv) at an external site between t1t_{1} and t2t_{2} are revealed. In the HNM, no time window is needed for the contact query (which we also call edge query), and all neighbors of vv in graph GG are revealed. Contact (and edge) queries are considered expensive in the SDCTF. While in this paper we do not limit the number of available queries, we track the number of contacts and edges that are revealed as the algorithm runs. Note that in the TU model if two agents v1v_{1} and v2v_{2} have been in contact during the time window [t1,t2][t_{1},t_{2}] and also during a different time window [t3,t4][t_{3},t_{4}], then those are counted as separate contacts, whereas in the HNM an edge between v1v_{1} and v2v_{2} is only counted once. Although contact queries are considered expensive, both household and contact queries are answered instantly in the SDCTF.

The third kind of query is the test query with parameter vv, which reveals information about the course of the disease in the queried agent (see Figure 2 (a)). Symptomatic patients reveal the time of their symptom onset (which exactly determines their time of infection in the DDE due to the deterministic transition times) if they are past the pre-symptomatic state (i.e., if they are either infectious or recovered). Asymptomatic and pre-symptomatic patients do not reveal any information about their infection time; they just reveal that they have the disease or had the disease at some point and have recovered. For all algorithms we assume that asymptomatic patients do not reveal whether they have the infection at the time they are queried. Finally, agents who have not been exposed, or are still in their exposed state, give a negative test result. Test queries are again considered expensive in the SDCTF, we even limit the population that can be tested on any given day to at most 1% of the total population, due to the capacity of testing facilities. However, since in this paper we do not limit the number of days that the algorithm can use to locate the source, the limit on the number of tests does not play an important role. As opposed to household and contact queries (and the model in Section 2.1), tests results are only answered the next day in the SDCTF, which means that the algorithms must operate in “real-time”, while the epidemic keeps propagating.

3.4. Parameters

Refer to caption
Figure 3. Default values for the infection parameters in the DCS+TU and the DDE+HNM models.

The DCS+TU model has many parameters, most of which are fitted to COVID-19 datasets of Tubingen from 12/03/2020 to 03/05/2020 by (Lorch et al., 2020) (we show the most relevant parameters in Table 3). We determined the parameters of the DDE+HNM model so that they fit the parameters of the DCS+TU as closely as possible (see the precise values in Table 3). We determine the values of TE,TP,TIT_{E},T_{P},T_{I} in the DDE+HNM by rounding the expected value of the corresponding distribution in the DCS+TU to the nearest integer. Since pap_{a} is simply a constant in both models, we keep the same numerical value in the DDE+HNM. The parameter php_{h} is more complicated, because in the DCS+TU model there is a different hospitalization probability for each age group. We take the average hospitalization probability across the population to be php_{h}. The most complicated parameter to fit is pip_{i}, because in the DCS+TU model, infections are modelled by a Hawkes process, which depends on many parameters, including whether the infectious agent is symptomatic or asymptomatic, the length of the visit, the site where the infection happens, etc (see equation (4)). We empirically observe the probability of infection in every contact in several simulations, and we find that an agent has on average 15 contacts outside the household each day, and that the average probability of infection during such a contact is around 0.02. However, since we use smaller networks for the DDE+HNM (N=400N=400 or 10001000, because running the baselines on larger networks is not feasible) than the DCS (N=9054N=9054), setting dcd_{c} to be as high as 15 would violate the assumption that the network of households of the HNM can be locally approximated by a tree (see Section 3.2.2). Therefore we chose dc=3d_{c}=3 for the HNM and we scale pip_{i} so that dcpid_{c}p_{i} (the expected number of external infections caused by a single agent each day) is the same in the DCS+TU and the DDE+HNM models. Finally, we choose dhd_{h} in the DDE+HNM by rounding the average household connections in the DCS+TU. Note that the average number of household connections is not the same as the average number of household members, because the number of connections grows quadratically in the size of the households, and thus fitting to the number of connections results in a higher dcd_{c} (due to the Quadratic Mean-Arithmetic Mean inequality).

Finding the default values for the parameters is useful to create a realistic model. However, we also interested in the effect of each of the parameters on the performance of our algorithms. Therefore, in the DDE+HNM, we vary the parameters pa,ph,pi,dhp_{a},p_{h},p_{i},d_{h} and dcd_{c}, while keeping the other ones unchanged. For the DCS+TU model, we also keep the mobility model fixed and we focus on varying the parameters pa,php_{a},p_{h} and pip_{i}. As noted above, there is no single parameter php_{h} or pip_{i} in the DCS+TU model, therefore we change all hospitalization probabilities and all intensities of the Hawkes processes so that the hospitalization probability averaged across the population and the infection probability averaged across contacts equal the desired values.

3.5. The LocalSearch Algorithms LS and LS+

The LS algorithm finds patient zero by local greedy search. It keeps track of a candidate node, which is always the node with the earliest reported symptom onset time. We denote the candidate of the algorithm at iteration i>0i>0 by sc,is_{c,i}. We think of scs_{c} as a list, which is updated in each iteration of the algorithm, and we use the notation sc,1s_{c,-1} for the last element of the list (i.e., the current candidate). In each iteration of the algorithm, we compute a new candidate denoted by scs_{c}^{\prime}, and we append it at the end of the list scs_{c} at the beginning of the next iteration, unless sc=sc,1s_{c}^{\prime}=s_{c,-1}, in which case the algorithm terminates.

Refer to caption
Figure 4. Pseudocode and graphical explanation for the LS and LS+ algorithms. We use the same coloring as in Figure 2 (a). Black edges show the queried edges, a node with black X marks a negative test result, and red stroked node marks the node currently maintained as source candidate by the LS algorithm. We denote by tvt_{v} the symptom onset time of symptomatic node vv and by H(v)H(v) the household of a node vv similarly to the main text.

Since we consider the SDCTF, the outbreak is detected when the first hospitalized case is reported. At that time, scs_{c}^{\prime} is initialized to be the hospitalized patient, the test queue is initialized to be empty, and the algorithm is started. In the beginning of an iteration, if the test queue is empty, the household members and the “backward” contacts of the current candidate sc,1s_{c,-1} are queried and are added to the test queue (see Figure 4 (a)). We define “backward” contacts as the set of nodes that have been in contact with sc,1s_{c,-1} in the interval [tsc,1(TE+TP)(σE+σP),tsc,1(TE+TP)+(σE+σP)][t_{s_{c,-1}}-(T_{E}+T_{P})-(\sigma_{E}+\sigma_{P}),t_{s_{c,-1}}-(T_{E}+T_{P})+(\sigma_{E}+\sigma_{P})], where tsc,1t_{s_{c,-1}} is the symptom onset time of current candidate sc,1s_{c,-1}. The terms σE\sigma_{E} and σP\sigma_{P} model the standard deviation of the transition times, and they are set to zero for the DDE and to σE=2\sigma_{E}=2 and σP=1\sigma_{P}=1 for the DCS based on Table 3. We note that the notion of “backward” contacts is only meaningful in the case of time-dependent network models; for the HNM, all neighbors are counted as backward contacts.

After the test queue is initialized, the agents inside the queue are tested (see Figure 4 (b)). Not all nodes can be tested on the same day because of the limitation on the number of tests available per day in the SDCTF, however, this has little effect because we do not proceed to the next iteration until the test queue becomes empty. Once the test results come back to the agency, if any of the (symptomatic) nodes vv reports an earlier symptom onset time than the current candidate sc,1s_{c,-1}, then we update our next candidate scs_{c}^{\prime} to be vv (see Figure 4 (c)). We note that the iteration does not stop immediately after scs_{c}^{\prime} is first updated; the iteration runs until the test queue becomes empty, and until then, scs_{c}^{\prime} can be updated multiple times. This is important in the theoretical results to prevent the algorithm from getting sidetracked (see Figure 9). We also experimented with a version of the LS and LS+ algorithms where the iteration stops immediately once scs_{c}^{\prime} is updated; we call these algorithms LSv2 and LS+v2.

The main drawback of the LS algorithm is that is gets stuck very easily if there is even one asymptomatic node on the transmission path. For this reason, we introduce the LS+ algorithm, in which we enter the backward contacts of the asymptomatic household members of sc,1s_{c,-1}, and the household members of any asymptomatic node into the testing queue (see Figure 4 (d)-(f)). Since the symptom onset times of asymptomatic nodes vv are not revealed, we define backward contact in this case as any contact in the time window [tsc,1(TP+2TE+TI),tsc,1(TP+2TE)][t_{s_{c,-1}}-(T_{P}+2T_{E}+T_{I}),t_{s_{c,-1}}-(T_{P}+2T_{E})], where tsc,1t_{s_{c,-1}} is still the symptom onset time of the current candidate sc,1s_{c,-1}. Indeed, in the DDE model, since sc,1s_{c,-1} was infected at tsc,1(TP+TE)t_{s_{c,-1}}-(T_{P}+T_{E}), if vv infected sc,1s_{c,-1}, agent vv must have been infectious at that time, which implies that vv could not have been infected later than tsc,1(TP+2TE)t_{s_{c,-1}}-(T_{P}+2T_{E}) or earlier than tsc,1(TP+2TE+TI)t_{s_{c,-1}}-(T_{P}+2T_{E}+T_{I}). In the DCS model, the terms σE\sigma_{E} and σP\sigma_{P} can be subtracted and added to the two ends of the queried time window to account for the randomness in the transition times.

Both algorithms stop if the testing queue becomes empty before a node with an earlier symptom onset time than sc,1s_{c,-1} is discovered, and both algorithm return sc,1s_{c,-1} as their inferred source. The high level pseudocode and an illustration of the LS and LS+ algorithms are given in Figure 4.

4. Theoretical Results

In this section we present theoretical results for the LS and LS+ algorithms described in Section 3.5. We follow a similar approach as in the non-rigorous computation in Section 2.2, which useful but not necessary for understanding this section. All the statements are rigorously established, and whenever we reach a point where the computations would become intractable, we propose a simpler approximate model to study. One of the main contributions of this paper is to identify which computations can be done on more general models, and which computations need more simplified ones (see Figure 5 for an overview of the different models used for the computations in this section).

We compute the success probability of the LS and LS+ algorithms in two steps. We first assume the length of the transmission path known in Section 4.1 . This computation is then made possible by a tree approximation of the HNM, called the Red-Blue (RB) tree (defined in Section 4.1.1), and a slightly modified version of the DDE model called DDENR\mathrm{DDE}_{\mathrm{NR}} (defined in Section 4.1.2). The RB tree preserves some of the household structure in the HNM, and therefore allows us gain insight into the difference between the LS and LS+ algorithms, which would be difficult to obtain if we had worked on trees without taking the household structure in account.

For the second step, we would need to compute the distribution of the transmission path on the RB tree. However, finding a closed form expression is intractable. Instead, we combine the network and epidemic models into a growing random tree model, and we consider a dd-ary Random Exponential Tree (RET). The dd-ary RET model has only been studied for d=2d=2 (Feng and Mahmoud, 2018); we extend the results on their expected profile for general dd in Section 4.2.1. Nevertheless, working on dd-ary RETs still remains difficult, and therefore, in our last modeling step, we introduce a Deterministic Exponential Tree (DET) model, whose profile is close to the expected profile of the RET, and we compute the distribution of the transmission path on this model in Section 4.2.2.

(a)
Refer to caption
(b)
Refer to caption
Figure 5. The different approximation methods (a) and the distribution of the length of transmission path in the different models (b) proposed in Section 4. Panel (b) also shows the length of the transmission path in the DCS model on the TU dynamics, to highlight the fit of our model.

To summarize all models considered in this paper, we have a data-driven and a synthetic model for simulations (DCS+TU and HNM+DDE), an analytically tractable model (RB-tree+DDENR\mathrm{DDE}_{\mathrm{NR}}) where we can compute the success probability if the length of the transmission path is known. In a second stage, we compute the distribution of a transmission path on a deterministic tree (DET), which has a similar profile as a random tree (RET) that approximates our analytically tractable model. We visualize these five different models in Figure 5 (a), and we show by simulations in Figure 5 (b) that the distribution of the transmission path is similar in all of the considered models with appropriately scaled parameters. We compare our analytic results on the success probabilities of the LS and LS+ algorithms with our simulation results in Section 5.3 in Figure 7.

4.1. Success Probability of LS and LS+ Algorithms on the RB Tree

In this section we introduce the Red-Blue (RB) tree model (which is a tree approximation to the HNM), and we calculate the exact probability that the LS and LS+ algorithms succeed, if the length of the transmission path is known.

4.1.1. Red-Blue tree models

In short, a RB tree is a two-type branching process with a deterministic offspring distribution that depends on dhd_{h} and dcd_{c}. The lack of randomness in this distribution makes us adopt the formalism of deterministic rooted trees.

Definition 4.1.

Let a rooted tree, denoted by G(s)G(s), be a tree graph with a distinguished node root node ss. Let uu and vv be two nodes connected by an edge in G(s)G(s). If d(u,s)<d(v,s)d(u,s)<d(v,s), we say that uu is a parent of vv, otherwise uu is a child of vv. Moreover, if d(s,v)=ld(s,v)=l we say that vv is on level ll. An RB tree with parameters (dc,dh)(d_{c},d_{h}) is an infinite rooted tree, such that the nodes also have an additional color property. The root is always colored red and the rest of the nodes are colored red or blue. The root has dcd_{c} red and dhd_{h} blue children. Every other red node has dc1d_{c}-1 red and dhd_{h} blue children, and every blue node has dcd_{c} red children and no blue children. Red nodes and their dhd_{h} blue children partition the nodes of the RB tree G(s)G(s) into subsets of size dh+1d_{h}+1, which we call households.

Remark 4.1.

In the RB tree, each blue node has degree dc+1d_{c}+1, and each red node has degree dc+dhd_{c}+d_{h}, including the root of the tree ss (which is the source of the epidemic, when the RB tree is combined with an epidemic model).

The RB tree can be seen as a local tree approximation of the HNM. Let G=(V,E)G=(V,E) be an HNM with parameters (dc,dh)(d_{c},d_{h}), and let sVs\in V be the distinguished source node. In Section 3.2.2 we noted that the HNM can be approximated locally around the source node by replacing each node of an infinite dc(dh+1)d_{c}(d_{h}+1)-regular tree by a (dh+1)(d_{h}+1)-clique, and setting the edges so that each node has degree exactly dc+dhd_{c}+d_{h}, while keeping the connection between cliques unchanged. Let us call this infinite graph GG^{*}. Although GG^{*} is not a tree, all cycles in GG^{*} must be contained entirely inside the households, which implies that in each household there exists exactly one node that has the minimal distance to the source. We will refer to these nodes with minimal distance to the source as the red nodes, and we color the rest of the nodes blue. In other words, the red nodes will be the first ones in their households to be infected. Let us now delete the edges between the blue nodes in GG^{*} to obtain graph GG^{\prime}. We claim that GG^{\prime} is isomorphic to the RB tree G(s)G(s) rooted at the source ss. Indeed, since the edges between blue nodes have been deleted in GG^{*} to form GG^{\prime}, each blue node has dc+1d_{c}+1 red neighbors and no blue neighbor, and since the edges incident to red nodes have been unchanged, each red node has dcd_{c} red and dhd_{h} blue neighbors, exactly as in the definition of RB tree above.

Note that a household in GG^{*} is completely characterized by only specifying the colors of the nodes: a household always consists of one red node and of its dhd_{h} blue children. We use this characterization as a definition for households in the RB tree GG^{\prime}, because it does not depend on the edges from GG that are deleted in GG^{*}, whereas this deletion makes the original definition of a household as a clique in GG unusable.

Next, we make some important observations the behaviour of the LS and the LS+ algorithms on RB trees, which we prove in Appendix A.1. We start by formalizing the notion of transmission path.

Definition 4.2.

Let hh be the first hospitalized node and ss be the source. We call the path (s=v0,v1,vl=h)(s=v_{0},v_{1},...v_{l}=h), where viv_{i} is the infector of vi+1v_{i+1} for 0i<l0\leq i<l, the transmission path. Also we call the path (vl,vl1,v1)(v_{l},v_{l-1},...v_{1}) the reverse transmission path.

Remark 4.2.

Note that in an RB tree, each household traversed by a transmission path shares one (the red node in the household) or two (the red node of the household and one of its dhd_{h} children in the household) nodes with this path. Moreover, the red node of a household traversed by a transmission path is followed by another red node on the path (in another household) if it is the only node of that household on the transmission path, whereas it is followed by a blue node (in the same household) if two nodes of that household are on the transmission path.

Lemma 4.3.

In the RB tree network, the LS algorithm succeeds if and only if all nodes on the transmission path are symptomatic, and the LS+ algorithm succeeds if among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic.

Remark 4.3.

We note that the statement for LS+ in Lemma 4.3 cannot be reversed, i.e., it is possible that LS+ succeeds even if among the nodes of the transmission path, there is a household with no symptomatic node (see Figure 9 (a)). Also, the proof of Lemma 4.3 does not hold if the LS+ algorithm proceeds to the next iteration at the first time scs_{c}^{\prime} is updated (see Figure 9 (b)). Finally, in the proof of Lemma 4.3, we do not make any assumptions about asymptomatic patients having had the disease previously or not, which implies that we could treat non-complying agents as asymptomatic patients without jeopardizing the correctness of the algorithms.

4.1.2. The DDENR\mathrm{DDE}_{\mathrm{NR}} Model

Focusing on tree networks is an important step towards making our models tractable for theoretical analysis, but it will not be enough; we will make two minor simplifications to the DDE model as well: we eliminate (i) the pre-symptomatic state and (ii) the recovered state, and we call the new model DDENR\mathrm{DDE}_{\mathrm{NR}} (where NR stands for No Recovery). (i) The first assumption can be made without loss of generality, because the pre-symptomatic state does not have any effect on the disease propagation, nor on the success of the source detection algorithm. Indeed, according to Lemma 4.3, the success of the LS and LS+ algorithms depends only on the information gained about the transmission path, and by the time of the first hospitalization, every node on the transmission path must have left the pre-symptomatic state (since we always have TP<TE+THT_{P}<T_{E}+T_{H}), even if we include it in the model. (ii) The second assumption on the absence of recovery states amounts to take TIT_{I}\rightarrow\infty, which does have a small effect on the disease propagation, however, this effect is minimal because TI=14T_{I}=14 is already quite large, and because only the very early phase of the infection is interesting for computing the success probabilities of the algorithms. Finally, this last assumption has no effect on the information gained by the algorithm since we assumed that recovered patients (who were symptomatic) can remember and reveal their symptom onset time in the same way as symptomatic infectious patients.

4.1.3. Success Probability of LS

Assuming that the distribution of length of the transmission path is provided for us (we give an approximation in Section 4.2), the success probability of LS can be computed succinctly. We need a short definition before stating our result.

Definition 4.4.

Let pp be the probability that a node is asymptomatic conditioned on the event that it is not hospitalized.

A simple computation shows that

(5) p=𝐏(v is asyv is not hosp)=papa+(1pa)(1ph).p=\mathbf{P}(v\text{ is asy}\mid v\text{ is not hosp})=\frac{p_{a}}{p_{a}+(1-p_{a})(1-p_{h})}.
Lemma 4.5.

For the DDENR\mathrm{DDE}_{\mathrm{NR}} epidemic model with parameters (pi,pa,ph)(p_{i},p_{a},p_{h}) on the RB tree with parameters (dc,dh)(d_{c},d_{h}), and with pp computed in equation (5), we have

(6) 𝐏(LS succeeds)=n=0(1p)n𝐏(d(s,h)=n).\mathbf{P}(LS\textrm{ succeeds})=\sum_{n=0}^{\infty}\left(1-p\right)^{n}\mathbf{P}(d(s,h)=n).
Proof.

Let us reveal the randomness that generates the epidemic in a slightly modified way than in the definition (Sections 3.1.2 and 4.1.2). As before, at the beginning only the source is infectious, and depending on course of the disease, the source can be symptomatic and hospitalized, symptomatic but not hospitalized, or asymptomatic with probabilities (1pa)ph,(1pa)(1ph),pa(1-p_{a})p_{h},(1-p_{a})(1-p_{h}),p_{a}, respectively. In each moment, each infectious node infects each of its susceptible neighbors with probability pip_{i}. If a node is infected, we reveal the information whether it will become hospitalized (which happens with the probability (1pa)ph(1-p_{a})p_{h}), but if it does not become hospitalized, we do not reveal whether the node is asymptomatic or symptomatic yet. Indeed, this information is not necessary for continuing the simulation of the epidemic since we assumed that there is no difference between the infection probabilities of symptomatic and asymptomatic nodes. Thereafter, when the first hospitalized case occurs, we reveal for each infected node vv on the transmission path (except the last node, which we know is hospitalised; see Definition 4.2) whether it is asymptomatic or not. The only information we have about these nodes is that they are not hospitalized, which implies that the probability that a node is revealed to be asymptomatic on the transmission path is exactly the probability pp from Definition 4.4 computed in (5).

By Lemma 4.3, LS succeeds if and only if each node on the transmission path is symptomatic. Conditioning on the length of the transmission path, we can compute the probability of each node being symptomatic by equation (5) as

(7) 𝐏(LS suceeds|d(s,h)=n)=(1𝐏(v is asy|v is not hosp))n=(1p)n,\mathbf{P}(LS\textrm{ suceeds}|d(s,h)=n)=\left(1-\mathbf{P}(v\textrm{ is asy}|v\textrm{ is not hosp})\right)^{n}=\left(1-p\right)^{n},

from which (6) follows immediately. ∎

4.1.4. Success Probability of LS+

Computing the success probability of the LS+ algorithm is far more challenging compared to the LS algorithm, even if the distribution of the length of the transmission path is provided to us. Indeed, since the LS+ algorithm does further testing on the contacts and household members of asymptomatic nodes, it is essential to have additional information about the number of households on the transmission path. We give our main result on the LS+ in the next theorem, which we prove in Appendix A.2.

Theorem 4.6.

Let pp be as in (5) and let 𝒮(n,α,β)\mathcal{S}(n,\alpha,\beta) be the set of kk integer values such that kk and nn have different parity and n+12(α+β)k2(α+β)n+1-2(\alpha+\beta)\geq k\geq 2-(\alpha+\beta). Then, for the DDENR\mathrm{DDE}_{\mathrm{NR}} epidemic model with parameters (pi,pa,ph)(p_{i},p_{a},p_{h}) on the RB tree with parameters (dc,dh)(d_{c},d_{h}), we have

𝐏(LS+ succeeds)𝐏(d(s,h)=0)+(1p)𝐏(d(s,h)=1)+\displaystyle\mathbf{P}(LS+\textrm{ succeeds})\geq\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+
(8) n=2α,β{0,1}k𝒮(n,α,β)(n+k32k2+α+β)(dh(1p))n+k12(dc(1+p))nk+12αβdc(dc1)k+α+β2λ1(dc1+D2)n+λ2(dc1D2)n𝐏(d(s,h)=n),\displaystyle\sum_{n=2}^{\infty}\sum_{\begin{subarray}{c}\alpha,\beta\in\{0,1\}\\ k\in\mathcal{S}(n,\alpha,\beta)\end{subarray}}\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}\frac{(d_{h}(1-p))^{\frac{n+k-1}{2}}(d_{c}(1+p))^{\frac{n-k+1}{2}-\alpha-\beta}d_{c}(d_{c}-1)^{k+\alpha+\beta-2}}{\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}}\mathbf{P}(d(s,h)=n),

where

(9) D\displaystyle D =(dc1)2+4dcdh\displaystyle=\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}
(10) λ1\displaystyle\lambda_{1} =(dc+1+D)(2dh+dc1+D)2D(dc1+D)\displaystyle=\frac{(d_{c}+1+D)(2d_{h}+d_{c}-1+D)}{2D(d_{c}-1+D)}
(11) λ2\displaystyle\lambda_{2} =(Ddc1)(2dh+dc1D)2D(dc1D).\displaystyle=\frac{(D-d_{c}-1)(2d_{h}+d_{c}-1-D)}{2D(d_{c}-1-D)}.

4.2. Approximating the Depth of the Path to the First Hospitalized Node

Section 4.1 was dedicated to the success probability of the LS and LS+ algorithms, however, in these results, we are still missing the distribution of the transmission path length. In this subsection we address this problem by introducing simpler approximate models.

4.2.1. (dr,d)(d_{r},d)-ary Random Exponential Tree

When we introduced the DDENR\mathrm{DDE}_{\mathrm{NR}} model in Section 4.1.2, we removed both parameters TPT_{P} and TIT_{I} from the DDE model (by removing the presymptomatic and the recovered states, respectively), but we kept the parameter TET_{E}. In this step we will rescale the time parameter to make TE=1T_{E}^{\prime}=1 by changing pip_{i}^{\prime} to be 1(1pi)TE1-(1-p_{i})^{T_{E}}. Since we had TE=3T_{E}=3 by default, using TET_{E}^{\prime} and pip_{i}^{\prime} instead of TET_{E} and pip_{i} means that we choose 3 days to be our time unit, and the probability of infection is scaled to be the probability that the infection is passed in at least one of three days (since the RB tree is time-independent, if two nodes are connected, the infection can spread on it every day). We drop the prime from pip_{i}^{\prime} and TET_{E}^{\prime} for ease of notation. As a second approximation, instead of keeping track of two types of nodes (red and blue) as it is done in the RB tree, we propose to change our network model to an infinite dd-regular tree, where dd is set to be the average degree of an RB tree.

By making these two changes (tracking time at a coarser scale and simplifying the network topology to a dd-regular tree), the growth of the epidemic becomes equivalent to a known model, the dd-ary Random Exponential Tree (dd-RET). Binary RETs have been introduced in (Feng and Mahmoud, 2018) . We give the definition below for completeness.

Definition 4.7.

A dd-ary Random Exponential Tree (dd-RET) with parameters d,pid,p_{i} at time day tt, denoted by Gt(s)G_{t}(s), is a random tree rooted at node ss. At day 0, the tree Gt(s)G_{t}(s) only has its root node ss. Let G¯t(s)\bar{G}_{t}(s) be the closure of Gt(s)G_{t}(s), which is obtained by attaching external nodes to Gt(s)G_{t}(s) until every internal node (a node that was already present in Gt(s)G_{t}(s)) has degree exactly dd in the graph G¯t(s)\bar{G}_{t}(s). Then, Gt+1(s)G_{t+1}(s) is obtained from G¯t(s)\bar{G}_{t}(s) by retaining each external node with probability pip_{i}, and dropping the remaining external nodes.

Indeed, each node of a dd-RET infects a new node with probability pip_{i} each day, and after a sufficiently long time, the dd-RET becomes close to a large dd-ary tree. Of course, we do not want to let the dd-RET grow for a very long time, we only want it to grow until the first hospitalization occurs. So far we have not talked about the course of the disease of the nodes in the dd-RET model because we could define the spread of the infection without it. Since we still need to do one final simplification to compute the distribution of the transmission path, we defer the discussion about hospitalizations, and how the parameters pap_{a} and php_{h} are part of the model, to Section 4.2.2. Note that by considering the dd-RET, we deviate from the idea of separating the epidemic and the network models; we only have a randomly growing tree, which is stopped at some time, when the tree is still almost surely finite.

So far we only did simplifications to the model, which resulted in further and further deviations from the original version. Now we will make a small modification that brings our model back closer to the RB tree, without complicating the computations too much. We still make almost all maximum degrees of the RET uniform dd, but we make an exception with the root, which will have maximum degree dr=dc+dhd_{r}=d_{c}+d_{h}. This makes the maximum degree of the root the same as the degree of the root of the RB tree. We call the resulting model a (dr,d)(d_{r},d)-RET with parameter pip_{i}. Since the close neighborhood of the source has a high impact on the success probability, we found that this solution gives the best results while keeping the computations tractable.

In our computations, only the profile the infection tree will be important, which motivates the next definition.

Definition 4.8.

In the (dr,d)(d_{r},d)-RET model with parameter pip_{i}, let At,lA_{t,l} be the number of nodes during day tt at level ll, and let at,l=𝔼[At,l]a_{t,l}=\mathbb{E}[A_{t,l}]. Moreover, we define the random variable

(12) At=t=0+At,l\displaystyle A_{t}=\sum_{t=0}^{+\infty}A_{t,l}

with A1,l=0A_{-1,l}=0 for all ll, and its expectation at=𝔼[At]a_{t}=\mathbb{E}[A_{t}].

As noted earlier, the dd-RET model has only been analysed for d=2d=2 to this date. We provide the expected number at,la_{t,l} of nodes at level ll in day tt for the general case in the next theorem and corollary, which we prove in Appendices  A.3 and A.4.

Theorem 4.9.

In the (dr,d)(d_{r},d)-RET with parameter pip_{i}, let at,la_{t,l} be as in Definition 4.8. Then

(13) at,0\displaystyle a_{t,0} =1\displaystyle=1
(14) at,l\displaystyle a_{t,l} =drpim=l1t1(ml1)(1pi)ml+1dl1pil1, for tl1\displaystyle=d_{r}p_{i}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}\textrm{, for }t\geq l\geq 1
(15) at,l\displaystyle a_{t,l} =0, for l ¿ t.\displaystyle=0\textrm{, for l > t}.
Corollary 4.10.

In the RET(pi,dr,d)(p_{i},d_{r},d), let ata_{t} be the expectation of (12), as in Definition 4.8. For t0t\geq 0,

(16) at=1+dr(1pi+dpi)t1d1.a_{t}=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.

4.2.2. Deterministic Exponential Tree with Parameters pa,php_{a},p_{h} and (ct,l)t,l(c_{t,l})_{t,l\in\mathbb{N}}

In the (dr,d)(d_{r},d)-RET model it is still complicated to calculate the distribution of the depth of the first hospitalized node. For this reason, we approximate the RET model by a deterministic time-dependent tree with a prescribed profile.

Definition 4.11.

Let (ct,l)t{1},l(c_{t,l})_{t\in\mathbb{N}\bigcup\{-1\},l\in\mathbb{N}} be a two-dimensional array with ct,l=0c_{t,l}=0 for t{1,0}t\in\{-1,0\} and ll\in\mathbb{N}, except for c0,0=1c_{0,0}=1, and with ct,lct,l1c_{t,l}\geq c_{t,l-1} for any tt and any l1l\geq 1. Additionally, if we define ct=lct,lc_{t}=\sum_{l}c_{t,l}, then the array (ct,l)(c_{t,l}) must satisfy ct>ct1c_{t}>c_{t-1} for t0t\geq 0. Then, we define the Deterministic Exponential Tree (DET) with parameter (ct,l)t{1},l(c_{t,l})_{t\in\mathbb{N}\bigcup\{-1\},l\in\mathbb{N}}, as a time-dependent rooted tree, that has exactly ct,lc_{t,l} nodes on level ll at time tt. The edges between the adjacent levels are drawn arbitrarily so that the tree structure is preserved.

The formal assumptions on the array (ct,l)(c_{t,l}) are simply made to ensure that the DET starts with a single node at t=0t=0, that it never shrinks on any level (ct,lct,l1c_{t,l}\geq c_{t,l-1}), and that it grows by at least one node in each time step (ct>ct1c_{t}>c_{t-1}).

We have defined the DET at any given time tt, however, to determine the length of the transmission path, we are not interested in the DET at any given time, but only when the first hospitalization occurs. To compute the distribution of the first hospitalized node, we would like to have an absolute order on the times when the nodes are added, which we do by randomization. We say that on day tt, nodes are added one by one to the DET, their order given by a uniformly random permutation, and each node is hospitalized with probability (1pa)ph(1-p_{a})p_{h} (as in the original DDE model). When the first hospitalization occurs, we stop growing the tree, and we call the resulting (now random) model a stopped DET with parameters (ct,l),pa,ph(c_{t,l}),p_{a},p_{h}. We find the transmission path length distribution on the stopped DET in the next lemma, which we prove in Appendix A.5.

Lemma 4.12.

Let us consider the stopped DET model with parameters (ct,l),pa,ph(c_{t,l}),p_{a},p_{h}, and let hh denote the first hospitalized node. Then

(17) 𝐏(d(s,h)=l)=t=0+ct,lct1,lctct1(1(1pa)ph)ct1(1(1(1pa)ph)ctct1).\displaystyle\mathbf{P}(d(s,h)=l)=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).

We would like to set ct,lc_{t,l} so that the DET is close to the RET described in Section 4.2.1. For equation (17) to make sense, we should substitute integer values for ct,lc_{t,l}, however, for an approximation the equation can also be evaluated for fractional values as well.

Remark 4.4.

If we substitute ct,l=at,lc_{t,l}=a_{t,l} and ct=atc_{t}=a_{t} in equation (17), where at,la_{t,l} is given in Theorem 4.9 and ata_{t} is computed in Corollary 4.10, then we get the expression

(18) drpil1dl1t=l+(t1l1)(1pi)tl(1pi+dpi)t1(1(1pa)ph)1+dr(1pi+dpi)t11d1(1(1(1pa)ph)dr(1pi+dpi)t1)),\displaystyle d_{r}p_{i}^{l-1}d^{l-1}\sum_{t=l}^{+\infty}\frac{\binom{t-1}{l-1}(1-p_{i})^{t-l}}{(1-p_{i}+dp_{i})^{t-1}}(1-(1-p_{a})p_{h})^{1+d_{r}\frac{(1-p_{i}+dp_{i})^{t-1}-1}{d-1}}\left(1-(1-(1-p_{a})p_{h})^{d_{r}(1-p_{i}+dp_{i})^{t-1})}\right),

which approximates the distribution of the transmission path length in the (dr,d)(d_{r},d)-ary RET stopped at the first hospitalization.

5. Simulation Results

5.1. Baseline Algorithms

5.1.1. Non-adaptive Baseline: Dynamic Message Passing

There are few sensor-based source detection algorithms that are compatible with time-varying networks in the literature (Huang, 2017; Jiang et al., 2016; Fan et al., 2020). The most promising one among these algorithms (Jiang et al., 2016) has a close resemblance to the a previous work of (Lokhov et al., 2014) on Dynamic Message Passing (DMP) algorithms. Given the initial conditions on the identity of the source node and its time of infection, the DMP algorithm approximates the marginal distribution of the outcome of an epidemic at some later time tt. The algorithm is exact on tree networks, and it computes a good approximation when there are not too many short cycles in the network. Therefore, the DMP algorithm can be used to approximate the likelihood of the observed symptom onset times for any (source,time) pair. Due to its flexibility, we were able to adapt the DMP algorithm to the SDCTF (see Appendix B for more details).

Originally, the DMP was applied to the source detection problem by computing the likelihood values for all possible (source,time) pairs, and then choosing the source node from the most likely pair as the estimate (Lokhov et al., 2014). However, testing all (source,time) pairs increases the time complexity of the algorithms potentially by a factor of N2N^{2}, which makes the algorithm intractable in many applications. Jiang et. al. (Jiang et al., 2016) proposed a very similar algorithm to the DMP equations (which is unfortunately not exact even on trees), and solved the issue of intractability by a heuristic preprocessing step to the DMP algorithm. This preprocessing step, identifies a few candidate (source,time) pairs, by spreading the disease backward from the observations in a deterministic way (called reverse dissemination). Since we already approximate our data-driven model (DCS) by an epidemic model with deterministic transition times (DDE), it is natural for us to also implement the deterministic preprocessing step proposed by (Jiang et al., 2016). We produce 5 (source,time) pairs which are feasible for the 5 earliest symptom onset time observations (see Appendix B.3 for more details). It would have been ideal to run the algorithms for more than 5 pairs, but this was made impossible by the runtimes becoming very high. We run therefore our implementation of the DMP algorithm with the previously computed feasible (source,time) pairs as initial conditions to find the most likely source candidate.

The source estimation algorithms developed using the DMP algorithm do not specify how the sensors should be selected, and therefore place these non-adaptive sensors randomly. We refer to the resulting algorithm as random+DMP. The number of sensors is set so that it always exceeds the number that LS/LS+ would use. The simulation results are shown in Figure 6 for the DDE+HNM model. Importantly, the deterministic preprocessing step of (Jiang et al., 2016) is compatible with time-varying networks, which allows us to run the algorithm for the DCS+TU model as well (see Figure 8).

5.1.2. Adaptive Baseline: Size-Gain

The Size-Gain (SG) algorithm was developed for epidemics which spread deterministically (Zejnilović et al., 2015), and has been later extended to stochastic epidemics (Spinelli et al., 2017b). It works by narrowing a candidate set based on a deterministic constraint. If v1,v2v_{1},v_{2} are symptomatic observations, then scs_{c} is in the candidate set of SG if and only if

(19) |(tv2tv1)(d(v2,sc)d(v1,sc))|<σ(d(v2,sc)+d(v1,sc)),|(t_{v_{2}}-t_{v_{1}})-(d(v_{2},s_{c})-d(v_{1},s_{c}))|<\sigma(d(v_{2},s_{c})+d(v_{1},s_{c})),

where σ\sigma is the standard deviation of the infection time of a susceptible contact. If one of the observations, say v2v_{2}, is negative, then SG uses a condition almost identical to equation (19), except that the absolute value is dropped, since a negative observation at time tv2t_{v_{2}} is only a lower bound on the true symptom onset time of v2v_{2}. These deterministic conditions are checked for every symptomatic-symptomatic or symptomatic-negative pair (v1,v2)(v_{1},v_{2}) to determine if scs_{c} can be part of the candidate set. Next, SG places the next sensor adaptively at the node which reduces the candidate set by the largest amount in expectation (assuming a uniform prior on the source and its infection time), and it terminates when the candidate set shrinks to a single node. Note that the SG algorithm can fail if at least one of the deterministic conditions in equation (19) is violated for some (v1,v2)(v_{1},v_{2}) because of the randomness of the epidemic.

We use the existing implementation of the SG algorithm by (Spinelli et al., 2017b), and adapt it to the SDCTF. We incorporate asymptomatic-symptomatic and asymptomatic-negative observations (v1,v2)(v_{1},v_{2}) the same way as symptomatic-negative are incorporated; we drop the absolute value sign in equation (19), because an asymptomatic observation at time tv1t_{v_{1}} is only an upper bound on the true symptom onset time of v1v_{1}. We impose the same daily limit to the number of sensors that can be placed by the SG algorithm in a single day as for the LS/LS+ algorithm, and if the candidate set size does not shrink to one on the day when both LS and LS+ have already provided their estimates, then the SG algorithm must make a uniformly random choice from the current candidate set as its source estimate. The simulation results are shown in Figure 6 for the DDE+HNM model. We do not implement the SG algorithm for the DCS+TU model, because its runtime is too high, and because it is not clear how it should be implemented for time-varying networks.

5.2. Comparison with Baselines

Refer to caption
Figure 6. The performance of the algorithms LS, LS+, R and SG if the metric is the probability of finding the source (solid curves) or the first symptomatic patient (dashed curves). The simulations were computed on a population of n=400n=400 individuals in the DDE model on the HNM, and each datapoint is the average of 48004800 independent realizations except for the SG algorithm, which was run with 192192 independent realizations. The confidence intervals for the success probabilities are computed using the Wilson score interval method, and for the tests and the queries using the Student’s t-distribution.

We show our simulation results comparing the random+DMP, SG, LS and LS+ algorithms in Figure 6. In the first row of Figure 6, we show the accuracy of the algorithms with solid curves. Since the LS/LS+ algorithms cannot detect an asymptomatic source, we also show what the accuracy would look like if the goal of the SDCTF was to detect the first symptomatic agent with dashed lines. It is clear that in both metrics and across a wide range of parameters, the LS+ algorithm performs best, followed by LS, next random+DMP, and finally SG. The only exception is for high values of pip_{i}, where SG performs best. The good performance of SG for these parameters is expected, because SG was originally developed for deterministically spreading epidemics (i.e., pi=1p_{i}=1).

In the second row of Figure 6, we show the number of test/sensor queries used by the algorithms. LS uses the fewest tests, followed by LS+. The random+DMP and SG algorithms always use more tests than LS/LS+, except for large values of pip_{i}. Finally, in the last row of Figure 6 we show the number of contact (or in this case edge) queries used by the algorithms. Again, LS uses fewer queries than LS+, while both the random+DMP and SG algorithms query essentially the entire network.

Figure 6 shows that the LS/LS+ algorithms are fairly robust to changes in the parameters of the model, except for the parameter pap_{a}. Indeed, if there are many asymptomatic nodes in the network, then source detection becomes very challenging. It may be surprising that as pap_{a} grows, the number of tests that LS uses decreases, contrary to LS+. This is because as pap_{a} grows, the LS algorithm gets stuck more rapidly, while the LS+ algorithm compensates for the presence of asymptomatic nodes by using more test/sensor queries.

5.3. Comparison of Simulations and Theoretical Results

Refer to caption
Figure 7. The probability of success of the LS and LS+ algorithms (solid curves) and their theoretical estimate (dash-dotted curves) with the success probabilities computed in Lemma 4.5 and Theorem 4.6, while the transmission path distribution computed in equation (18). The simulation results were generated using the DDE model on HNM networks of size n=1000n=1000 with 48004800 independent samples. The 95%95\% confidence intervals are computed using the Wilson score interval method.

The analytic results from Section 4 are in good agreement with the simulation results in Figure 7. We also experiment with changing the parameters dhd_{h}, dcd_{c} while keeping all the parameters fixed, and with changing dcd_{c} while keeping the product dcpid_{c}p_{i} fixed. We observe that LS is not affected by the parameter dhd_{h}, whereas LS+ performs better with a higher dcd_{c}, which is expected because LS+ leverages the household structure of the network to improve over LS. Somewhat surprisingly, we also observe that a higher dcd_{c} also improves the performance of both algorithms. This can be explained by the fact that a larger dcd_{c} implies that there are more nodes in the close neighborhood of the source, which results in shorter transmission paths, making source detection less challenging. Finally, if we increase dcd_{c} but keep dcpid_{c}p_{i} fixed, the performance of the algorithms does not change as much, which confirms the intuition that it is the number of infections caused by an infectious node in a single day that matters the most (as we discussed in Section 2).

5.4. Simulations on the DCS Model

Refer to caption
Figure 8. The performance of the algorithms LS, LS+ and random+DMP on the DCS model with the Tubingen dynamics if the metric is the probability of finding the source (solid curves) or the first symptomatic patient (dashed curves), together with the theoretical results (dash-dotted lines), as shown in Figure 7. The simulations were computed on a population of n=9054n=9054 individuals, and each datapoint is the average of 24002400 independent realizations for the LS/LS+/LSv2/LS+v2 algorithms, and 4848 independent realizations for the random+DMP algorithm. The default population and infection parameters were selected to match the population and COVID-19 infection datasets of Tubingen. The confidence intervals for the success probabilities are computed using the Wilson score interval method, and for the tests and the queries using the Student’s t-distribution.

We show our simulation results on our most realistic DCS+TU model in Figure 8. We make very similar observations on this model as the ones that we have made on the DDE+HNM model in Sections 5.2 and 5.3, which shows that the LS/LS+ algorithms and our analysis of their performance is robust to changes in the epidemic and network models.

In the DCS+TU model, we used a fixed limit on the number of sensors that the random+DMP model selects, instead of setting the limit based on the LS+ algorithm. As a result, for a few parameters the LS+ algorithm used more tests than the random+DMP model. However, we note that by updating the candidate node immediately after an earlier symptom onset time is revealed (see Section 3.5), we can essentially cut the number of required tests for the LS+ algorithm by half (LSv2 and LS+v2), without sacrificing the performance of the algorithms.

6. Discussion

We have introduced the LS and LS+ algorithms in the SDCTF, and we have used a sequence of models on which we can compute their accuracy (probability of finding the correct source) rigorously. We find that both LS and LS+ outperform baseline algorithms, even though the baselines essentially query all contacts on a transmission path between agents, while LS and LS+ query only a small neighborhood of the source. One could argue that LS and LS+ beat the baseline algorithms only because we benchmark them in our own framework, which is different from the framework for which the baseline algorithms were developed. However, we argue that the LS/LS+ algorithms are robust to changes in the framework due to their simplicity, and we support our argument by simulation results. The runtimes of the LS/LS+ algorithms are also much lower than the baselines and do not depend on the network size since they are local algorithms - as opposed to the baselines, which have quadratic or even larger dependence on the network size. The “low-tech” approach in the design of the LS/LS+ algorithms increases their potential to be implemented in real-world scenarios, possibly even in a decentralized way, similarly to contact tracing smart phone applications (Troncoso et al., 2020), which is an interesting direction for future work.

References

  • (1)
  • Altarelli et al. (2014) Fabrizio Altarelli, Alfredo Braunstein, Luca Dall’Asta, Alessandro Ingrosso, and Riccardo Zecchina. 2014. The patient-zero problem with noisy observations. Journal of Statistical Mechanics: Theory and Experiment 2014, 10 (2014), P10016.
  • Ball et al. (2009) Frank Ball, David Sirl, and Pieter Trapman. 2009. Threshold behaviour and final outcome of an epidemic on a random network with household structure. Advances in Applied Probability 41, 3 (2009), 765–796.
  • BeidasRinad et al. (2020) S BeidasRinad, M ButtenheimAlison, S KilaruAustin, A AschDavid, G VolppKevin, G LawmanHannah, C CannuscioCarolyn, et al. 2020. Optimizing and implementing contact tracing through behavioral economics. NEJM Catalyst Innovations in Care Delivery (2020).
  • Benjamini and Schramm (2011) Itai Benjamini and Oded Schramm. 2011. Recurrence of distributional limits of finite planar graphs. In Selected Works of Oded Schramm. Springer, 533–545.
  • Bradshaw et al. (2021) William J Bradshaw, Ethan C Alley, Jonathan H Huggins, Alun L Lloyd, and Kevin M Esvelt. 2021. Bidirectional contact tracing could dramatically improve COVID-19 control. Nature communications 12, 1 (2021), 1–9.
  • Braithwaite et al. (2020) Isobel Braithwaite, Thomas Callender, Miriam Bullock, and Robert W Aldridge. 2020. Automated and partly automated contact tracing: a systematic review to inform the control of COVID-19. The Lancet Digital Health (2020).
  • Dawkins et al. (2021) Quinlan Dawkins, Tianxi Li, and Haifeng Xu. 2021. Diffusion Source Identification on Networks with Statistical Confidence. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139). PMLR, 2500–2509.
  • Drmota (2009) Michael Drmota. 2009. Random trees: an interplay between combinatorics and probability. Springer Science & Business Media.
  • Endo et al. (2020) Akira Endo et al. 2020. Implication of backward contact tracing in the presence of overdispersed transmission in COVID-19 outbreaks. Wellcome open research 5 (2020).
  • Fan et al. (2020) Lilin Fan, Bingjie Li, Dong Liu, Huanhuan Dai, and Yan Ru. 2020. Identifying Propagation Source in Temporal Networks Based on Label Propagation. In International Conference of Pioneering Computer Scientists, Engineers and Educators. Springer, 72–88.
  • Feng and Mahmoud (2018) Yarong Feng and Hosam Mahmoud. 2018. Profile of random exponential binary trees. Methodology and Computing in Applied Probability 20, 2 (2018), 575–587.
  • Hernando et al. (2008) Carmen Hernando, Mercé Mora, Peter J Slater, and David R Wood. 2008. Fault-tolerant metric dimension of graphs. Convexity in discrete structures 5 (2008), 81–85.
  • Hu et al. (2018) Zhao-Long Hu, Zhesi Shen, Chang-Bing Tang, Bin-Bin Xie, and Jian-Feng Lu. 2018. Localization of diffusion sources in complex networks with sparse observations. Physics Letters A 382, 14 (2018), 931–937.
  • Huang (2017) Qiangjuan Huang. 2017. Source locating of spreading dynamics in temporal networks. In Proceedings of the 26th International Conference on World Wide Web Companion. 723–727.
  • Jagers and Nerman (1984) Peter Jagers and Olle Nerman. 1984. The growth and composition of branching populations. Advances in applied probability 16, 2 (1984), 221–259.
  • Jiang et al. (2016) Jiaojiao Jiang, Sheng Wen, Shui Yu, Yang Xiang, and Wanlei Zhou. 2016. Rumor source identification in social networks with time-varying topology. IEEE Transactions on Dependable and Secure Computing 15, 1 (2016), 166–179.
  • Karrer and Newman (2010) Brian Karrer and M. E. J. Newman. 2010. Message passing approach for general epidemic models. Phys. Rev. E 82 (Jul 2010), 016101. Issue 1. https://doi.org/10.1103/PhysRevE.82.016101
  • Kojaku et al. (2021) Sadamori Kojaku, Laurent Hébert-Dufresne, Enys Mones, Sune Lehmann, and Yong-Yeol Ahn. 2021. The effectiveness of backward contact tracing in networks. Nature Physics (2021), 1–7.
  • Kretzschmar et al. (2020) Mirjam E Kretzschmar, Ganna Rozhnova, Martin CJ Bootsma, Michiel van Boven, Janneke HHM van de Wijgert, and Marc JM Bonten. 2020. Impact of delays on effectiveness of contact tracing strategies for COVID-19: a modelling study. The Lancet Public Health 5, 8 (2020), e452–e459.
  • Lecomte et al. (2020) Victor Lecomte, Gergely Ódor, and Patrick Thiran. 2020. The power of adaptivity in source location on the path. arXiv preprint arXiv:2002.07336 (2020).
  • Li et al. (2019) Xiang Li, Xiaojie Wang, Chengli Zhao, Xue Zhang, and Dongyun Yi. 2019. Locating the source of diffusion in complex networks via Gaussian-based localization and deduction. Applied Sciences 9, 18 (2019), 3758.
  • Lokhov et al. (2014) Andrey Y Lokhov, Marc Mézard, Hiroki Ohta, and Lenka Zdeborová. 2014. Inferring the origin of an epidemic with a dynamic message-passing algorithm. Physical Review E 90, 1 (2014), 012801.
  • Lorch et al. (2020) Lars Lorch, Heiner Kremer, William Trouleau, Stratis Tsirtsis, Aron Szanto, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. 2020. Quantifying the effects of contact tracing, testing, and containment measures in the presence of infection hotspots. arXiv preprint arXiv:2004.07641 (2020).
  • Louni et al. (2015) Alireza Louni, Anand Santhanakrishnan, and KP Subbalakshmi. 2015. Identification of source of rumors in social networks with incomplete information. arXiv preprint arXiv:1509.00557 (2015).
  • Mahmoud (2021) Hosam Mahmoud. 2021. Profile of Random Exponential Recursive Trees. Methodology and Computing in Applied Probability (2021), 1–17.
  • Mashkaria et al. (2020) Satvik Mashkaria, Gergely Ódor, and Patrick Thiran. 2020. On the robustness of the metric dimension to adding a single edge. arXiv preprint arXiv:2010.11023 (2020).
  • Paluch et al. (2020a) Robert Paluch, Łukasz G Gajewski, Janusz A Hołyst, and Boleslaw K Szymanski. 2020a. Optimizing sensors placement in complex networks for localization of hidden signal source: A review. Future Generation Computer Systems 112 (2020), 1070–1092.
  • Paluch et al. (2018) Robert Paluch, Xiaoyan Lu, Krzysztof Suchecki, Bolesław K Szymański, and Janusz A Hołyst. 2018. Fast and accurate detection of spread source in large complex networks. Scientific reports 8, 1 (2018), 1–10.
  • Paluch et al. (2020b) Robert Paluch, Krzysztof Suchecki, and Janusz A Hołyst. 2020b. Locating the source of interacting signal in complex networks. arXiv preprint arXiv:2012.02039 (2020).
  • Park et al. (2020) Young Joon Park, Young June Choe, Ok Park, Shin Young Park, Young-Man Kim, Jieun Kim, Sanghui Kweon, Yeonhee Woo, Jin Gwack, Seong Sun Kim, et al. 2020. Contact tracing during coronavirus disease outbreak, South Korea, 2020. Emerging infectious diseases 26, 10 (2020), 2465.
  • Pinto et al. (2012) P. Pinto, P. Thiran, and M. Vetterli. 2012. Locating the source of diffusion in large-scale networks. Physical Review Letters 109 (2012).
  • Shah and Zaman (2010) Devavrat Shah and Tauhid Zaman. 2010. Detecting Sources of Computer Viruses in Networks: Theory and Experiment. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, New York, USA) (SIGMETRICS ’10). ACM, New York, NY, USA, 203–214.
  • Shah and Zaman (2011) Devavrat Shah and Tauhid Zaman. 2011. Rumors in a network: Who’s the culprit? IEEE Transactions on information theory 57, 8 (2011), 5163–5181.
  • Shelke and Attar (2019) Sushila Shelke and Vahida Attar. 2019. Source detection of rumor in social network–a review. Online Social Networks and Media 9 (2019), 30–42.
  • Shen et al. (2016) Zhesi Shen, Shinan Cao, Wen-Xu Wang, Zengru Di, and H Eugene Stanley. 2016. Locating the source of diffusion in complex networks by time-reversal backward spreading. Physical Review E 93, 3 (2016), 032301.
  • Spinelli et al. (2017a) Brunella Spinelli, Elisa Celis, and Patrick Thiran. 2017a. A General Framework for Sensor Placement in Source Localization. IEEE Transactions on Network Science and Engineering (2017).
  • Spinelli et al. (2017b) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2017b. Back to the source: An online approach for sensor placement and source localization. In Proceedings of the 26th International Conference on World Wide Web. 1151–1160.
  • Spinelli et al. (2017c) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2017c. The effect of transmission variance on observer placement for source-localization. Applied network science 2, 1 (2017), 20.
  • Spinelli et al. (2018) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2018. How Many Sensors to Localize the Source? The Double Metric Dimension of Random Networks. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 1036–1043.
  • Tang et al. (2018) Wenchang Tang, Feng Ji, and Wee Peng Tay. 2018. Estimating infection sources in networks using partial timestamps. IEEE Transactions on Information Forensics and Security 13, 12 (2018), 3035–3049.
  • Troncoso et al. (2020) Carmela Troncoso, Mathias Payer, Jean-Pierre Hubaux, Marcel Salathé, James Larus, Edouard Bugnion, Wouter Lueks, Theresa Stadler, Apostolos Pyrgelis, Daniele Antonioli, et al. 2020. Decentralized privacy-preserving proximity tracing. arXiv preprint arXiv:2005.12273 (2020).
  • Wormald et al. (1999) Nicholas C Wormald et al. 1999. Models of random regular graphs. London Mathematical Society Lecture Note Series (1999), 239–298.
  • Xu et al. (2019) Shuaishuai Xu, Cong Teng, Yinzuo Zhou, Junhao Peng, Yicheng Zhang, and Zi-Ke Zhang. 2019. Identifying the diffusion source in complex networks with limited observers. Physica A: Statistical Mechanics and its Applications 527 (2019), 121267.
  • Zejnilovic et al. (2013) Sabina Zejnilovic, Joao Gomes, and Bruno Sinopoli. 2013. Network observability and localization of the source of diffusion based on a subset of nodes. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on. IEEE, 847–852.
  • Zejnilović et al. (2015) Sabina Zejnilović, João Gomes, and Bruno Sinopoli. 2015. Sequential observer selection for source localization. In Signal and Information Processing (GlobalSIP), 2015 IEEE Global Conference on. IEEE, 1220–1224.
  • Zejnilović et al. (2017) Sabina Zejnilović, Joño Gomes, and Bruno Sinopoli. 2017. Sequential source localization on graphs: A case study of cholera outbreak. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1010–1014.
  • Zejnilović et al. (2016) Sabina Zejnilović, Dieter Mitsche, João Gomes, and Bruno Sinopoli. 2016. Extending the metric dimension to graphs with missing edges. Theoretical Computer Science 609 (2016), 384–394.
  • Zhu et al. (2016) Kai Zhu, Zhen Chen, and Lei Ying. 2016. Locating the contagion source in networks with partial timestamps. Data Mining and Knowledge Discovery 30, 5 (2016), 1217–1248.

Appendix A Additional Proofs

Refer to caption
Figure 9. Illustration for Lemma 4.3 using the same coloring as Figure 2 (a). (a) An example for an epidemic where among the nodes of the transmission path (v1,v2,v3,v5)(v_{1},v_{2},v_{3},v_{5}), the middle household contains no symptomatic node (only the asymptomatic node v3v_{3}), but the LS+ algorithm still succeeds. Indeed, at iteration 0 we set sc,0=v5s_{c,0}=v_{5}, after which we find that v3v_{3} is asymptomatic, and next that v2v_{2} is asymptomatic and v4v_{4} is symptomatic, with a lower symptom onset time then v5v_{5}. Hence, in iteration 1 we set sc,1=v4s_{c,1}=v_{4}, and we find that v3,v2v_{3},v_{2} are asymptomatic and v1v_{1} is is symptomatic, with a lower symptom onset time then v4v_{4}. Finally, in iteration 2 we set sc,2=v1s_{c,2}=v_{1}, and we find sc=v1=sc,2s_{c}^{\prime}=v_{1}=s_{c,2}, which implies that the algorithm stops, and returns the correct source v1v_{1}. (b) An example for an epidemic where the LS+ algorithm would fail if we would update the candidate before the test queue becomes empty. Similarly to subfigure (a), in iteration 0 of the algorithm first learns about asymptomatic node v3v_{3} and next about asymptomatic node v2v_{2} and symptomatic node v4v_{4}. If the algorithm updates the candidate to v4v_{4} and continues further, instead of scheduling the tests of the household members of v2v_{2}, then it is not hard to check that v4v_{4} will be the final estimate and the algorithm fails. However, if the algorithm waits until the test queue becomes empty and tests the household members of v2v_{2}, then v1v_{1} becomes the next candidate and the algorithm succeeds.

A.1. Proof of Lemma 4.3

We start by restating the lemma here for convenience.

Lemma A.1.

In the RB tree network, the LS algorithm succeeds if and only if all nodes on the transmission path are symptomatic, and the LS+ algorithm succeeds if among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic.

Proof.

Throughout the proof we assume that there is no limitation on the number available tests. We can make this assumption because in the SDCTF there is only a daily limit on the number tests, there is no limitation on the number of days, and neither the LS nor the LS+ algorithms proceed in an iteration until the test queue becomes empty, which implies that all nodes that enter the test queue get eventually tested.

Suppose that the LS algorithm succeeds. Then the list of candidate nodes scs_{c} at different iterations forms a path that consists entirely of symptomatic nodes between the source and the first hospitalized node. In tree networks, the transmission path is the only path between the source and the first hospitalized node, which yields the ”only if” part of the statement on the LS algorithm.

Next, suppose that all nodes on the transmission path are symptomatic. Then, we claim that the candidate node sc,is_{c,i} computed in the ithi^{th} iteration of the LS algorithm is vliv_{l-i}, the ithi^{th} node of the reverse transmission path. Our claim is definitely true for i=0i=0, because sc,0s_{c,0} is initialized to be the first hospitalized node vlv_{l}. Then, the proof proceeds by induction. By the induction hypothesis, in the ithi^{th} step, sc,i=vlis_{c,i}=v_{l-i}, and since we are on a tree, the symptom onset time of vl(i+1)v_{l-(i+1)} (which is revealed because all nodes on the transmission path are symptomatic by assumption) is the only symptom onset time among the neighbors of sc,is_{c,i} that have a lower symptom onset time than sc,is_{c,i} itself. Therefore sc=vl(i+1)s_{c}^{\prime}=v_{l-(i+1)}, and sc,i+1s_{c,i+1} is updated to be vl(i+1)v_{l-(i+1)} in the beginning of the next iteration, which proves that the induction hypothesis holds until the source is reached.

Finally, suppose that among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic. Let us denote by wiw_{i} the ithi^{th} symptomatic node of the reverse transmission path. Then, we claim that the candidate list sc,is_{c,i} computed in the ithi^{th} iteration of the LS+ algorithm equals wiw_{i}. Similarly to the case of the LS algorithm, the i=0i=0 case holds by definition, and we proceed by induction. Suppose that sc,i=wis_{c,i}=w_{i}. It will also be useful to define the index of wiw_{i} on the forward transmission path (without skipping asymptomatic nodes). Let jj be this index, for which therefore wi=vjw_{i}=v_{j}. Now we distinguish 3 cases: (i) vj1=wi+1v_{j-1}=w_{i+1} is symptomatic, (ii) vj1v_{j-1} is asymptomatic and vj2=wi+1v_{j-2}=w_{i+1} is symptomatic, and (iii) vj1v_{j-1} and vj2v_{j-2} are asymptomatic and vj3=wi+1v_{j-3}=w_{i+1} is symptomatic. We claim that there are no more cases, and that in all three cases wi+1w_{i+1} is tested in the ithi^{th} iteration of the LS+ algorithm. Case (i) is immediate because all neighbors of sc,is_{c,i} are tested. Case (ii) is only possible if either vj1H(sc,i)v_{j-1}\in H(s_{c,i}) or vj2H(vj1)v_{j-2}\in H(v_{j-1}), otherwise vj1v_{j-1} would be a lone asymptomatic node in a household, which contradicts the assumption that there is a symptomatic node in each household. Since all the contacts of asymptomatic nodes in H(sc,i)H(s_{c,i}) (see Figure 4 (d)) and all nodes in the household of asymptomatic nodes are tested in the LS+ algorithm (see Figure 4 (e)), vj2v_{j-2} must be tested too. Finally, case (iii) is possible only if vj1H(sc,i)v_{j-1}\in H(s_{c,i}) and vj3H(vj2)v_{j-3}\in H(v_{j-2}) both hold, otherwise vj1v_{j-1} or vj2v_{j-2} would be a lone asymptomatic node in a household. Similarly to the previous case, vj3v_{j-3} must be tested (see Figure 4 (f)). There are no more cases because, by Remark 4.2, on the RB tree a transmission path can only have two nodes in each household, and we assumed that there exists a symptomatic node in each household among the nodes of the transmission path.

After we proved that wi+1w_{i+1} is tested in the ithi^{th} iteration of the LS+ algorithm, we must still show that it will be the next candidate sc,i+1s_{c,i+1} for the induction hypothesis to hold. This is true because once the symptom onset time of wi+1w_{i+1} is revealed, none of its neighbors are scheduled for testing, and therefore all tested nodes have wi+1w_{i+1} on their path to the source, which means that wi+1w_{i+1} must have the lowest revealed symptom onset time, and therefore that it will be the next candidate sc,i+1s_{c,i+1}. ∎

A.2. Proof of Theorem 4.6

We are going to need prove a few intermediate results before proving Theorem 4.6. A first step is to count all the possible paths from the source with a given length.

Definition A.2.

Let G(s)G(s) be the RB tree with parameters (dc,dh)(d_{c},d_{h}), and let ss be the source. A Red-Blue (RB) path of length nn is any path of nodes in (s=v0,v1,vn)(s=v_{0},v_{1},...v_{n}) such that (vi,vi+1)E(v_{i},v_{i+1})\in E^{\prime} for 0i<n0\leq i<n. Let 𝒞n\mathcal{C}_{n} be the set of RB paths of length nn.

Lemma A.3.

In the RB tree with parameters (dc,dh)(d_{c},d_{h}), |C0|=1|C_{0}|=1, while for n1n\geq 1,

(20) |𝒞n|=λ1(dc1+D2)n+λ2(dc1D2)n|\mathcal{C}_{n}|=\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}

where

(21) D\displaystyle D =(dc1)2+4dcdh\displaystyle=\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}
(22) λ1\displaystyle\lambda_{1} =(dc+1+D)(2dh+dc1+D)2D(dc1+D)\displaystyle=\frac{(d_{c}+1+D)(2d_{h}+d_{c}-1+D)}{2D(d_{c}-1+D)}
(23) λ2\displaystyle\lambda_{2} =(Ddc1)(2dh+dc1D)2D(dc1D).\displaystyle=\frac{(D-d_{c}-1)(2d_{h}+d_{c}-1-D)}{2D(d_{c}-1-D)}.
Proof.

Let us keep track of the number of RB paths of length nn depending on the color of the last node in the path. Let rnr_{n} and bnb_{n} be the numbers of RB paths of length nn such that the last node is red and blue, respectively. A RB path of length 0 consists only of the source, which implies that r0=1r_{0}=1 and b0=0b_{0}=0. The source has dcd_{c} red and dhd_{h} blue neighbours, which implies that r1=dcr_{1}=d_{c} and b1=dhb_{1}=d_{h}.

Suppose that PP is an RB path of length n2n\geq 2. If the last node of PP is red, then the node before the last node can be both blue or red. Red nodes other than the source have dc1d_{c}-1 red children, while blue nodes have dcd_{c} red children, yielding

(24) rn=(dc1)rn1+dcbn1, for n2.\displaystyle r_{n}=(d_{c}-1)r_{n-1}+d_{c}b_{n-1},\textrm{ for }n\geq 2.

If the last node of PP is blue, then the node before has to be red. Since every red node, including the source, has dhd_{h} blue children, we have

(25) bn=dhrn1, for n1.\displaystyle b_{n}=d_{h}r_{n-1},\textrm{ for }n\geq 1.

By substituting equation (25) into equation (24), we obtain the recurrence

(26) rn=(dc1)rn1+dcdhrn2, for n2.\displaystyle r_{n}=(d_{c}-1)r_{n-1}+d_{c}d_{h}r_{n-2},\textrm{ for }n\geq 2.

We solve this recurrence equation by calculating the characteristic equation

(27) t2(dc1)tdcdh=0,t^{2}-(d_{c}-1)t-d_{c}d_{h}=0,

whose roots are

(28) t1\displaystyle t_{1} =dc1+(dc1)2+4dcdh2=dc1+D2\displaystyle=\frac{d_{c}-1+\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}{2}=\frac{d_{c}-1+D}{2}
(29) t2\displaystyle t_{2} =dc1(dc1)2+4dcdh2=dc1D2\displaystyle=\frac{d_{c}-1-\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}{2}=\frac{d_{c}-1-D}{2}

yielding the the general solution

(30) rn=c1t1n+c2t2n,r_{n}=c_{1}t_{1}^{n}+c_{2}t_{2}^{n},

where c1,c2c_{1},c_{2} are given by the initial conditions for n=0,1n=0,1

(31) c1+c2=r0=1\displaystyle c_{1}+c_{2}=r_{0}=1
(32) c1t1+c2t2=r1=dc,\displaystyle c_{1}t_{1}+c_{2}t_{2}=r_{1}=d_{c},

which are

(33) c1\displaystyle c_{1} =12+dc+12(dc1)2+4dcdh=12+dc+12D\displaystyle=\frac{1}{2}+\frac{d_{c}+1}{2\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}=\frac{1}{2}+\frac{d_{c}+1}{2D}
(34) c2\displaystyle c_{2} =12dc+12(dc1)2+4dcdh=12dc+12D.\displaystyle=\frac{1}{2}-\frac{d_{c}+1}{2\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}=\frac{1}{2}-\frac{d_{c}+1}{2D}.

From equations (24) and (25) we conclude that for n1n\geq 1,

(35) bn=dh(c1t1n1+c2t2n1)b_{n}=d_{h}(c_{1}t_{1}^{n-1}+c_{2}t_{2}^{n-1})

and therefore

(36) |Cn|=rn+bn=λ1t1n+λ2t2n,\displaystyle|C_{n}|=r_{n}+b_{n}=\lambda_{1}t_{1}^{n}+\lambda_{2}t_{2}^{n},

where

(37) λ1\displaystyle\lambda_{1} =c1(1+dht1)\displaystyle=c_{1}\left(1+\frac{d_{h}}{t_{1}}\right)
(38) λ2\displaystyle\lambda_{2} =c2(1+dht2).\displaystyle=c_{2}\left(1+\frac{d_{h}}{t_{2}}\right).

Inserting the values for t1,t2,c1,c2t_{1},t_{2},c_{1},c_{2} we obtain the desired result. ∎

Since LS+ improves on LS by making use of the household structure of the network, we need further information about the household structure of the transmission paths. Recall that by Remark 4.2, households on transmission paths on an RB tree were characterized either by a single red node (that is followed by a red node), or a pair of consecutive red and blue nodes. The following definition and lemma refine our previous result on counting the number of RB paths by taking the household structure into account.

Definition A.4.

Let P={s=v0,v1,,vn=h}P=\{s=v_{0},v_{1},\ldots,v_{n}=h\} be a RB path of length nn. We say that a node vv on the path PP is in a PP-single-household if no other node from PP is in the same household as vv. Otherwise, we say vv is in a PP-multi-household. Given a path PP, let Ms:𝒞n{0,1}M_{s}:\mathcal{C}_{n}\rightarrow\{0,1\} be the indicator function that the source is in a PP-multi-household. Similarly, let Ml:𝒞n{0,1}M_{l}:\mathcal{C}_{n}\rightarrow\{0,1\} be the indicator function that the last node of path PP is in a PP-multi-household. Finally, for 0kn+10\leq k\leq n+1 and α,β{0,1}\alpha,\beta\in\{0,1\}, let

(39) Cn,k,α,β={P𝒞n:(there are exactly k nodes in Psingle-households)(Ms(P)=α)(Ml(P)=β)}.C_{n,k,\alpha,\beta}=\{P\in\mathcal{C}_{n}:(\textrm{there are exactly }k\textrm{ nodes in }P-\textrm{single-households})\wedge(M_{s}(P)=\alpha)\wedge(M_{l}(P)=\beta)\}.

The set Cn,k,α,βC_{n,k,\alpha,\beta} depends on 4 parameters, but only some combinations of these parameters make it non-empty. The following definition will be useful in this regard.

Condition 1.

Let α,β{0,1}\alpha,\beta\in\{0,1\} and n2n\geq 2. We say kk\in\mathbb{N} satisfies Condition 1 if and only if kk and nn have different parity and n+12(α+β)k2(α+β)n+1-2(\alpha+\beta)\geq k\geq 2-(\alpha+\beta).

Lemma A.5.

It holds that |C0,1,0,0|=1|C_{0,1,0,0}|=1, |C1,0,1,1|=dh|C_{1,0,1,1}|=d_{h} and |C1,2,0,0|=dc|C_{1,2,0,0}|=d_{c}. Let α,β{0,1}\alpha,\beta\in\{0,1\}, let n2n\geq 2 and let kk\in\mathbb{N} satisfy Condition 1. Then

(40) |𝒞n,k,α,β|=(n+k32k2+α+β)dhnk+12dcnk+32βα(dc1)k+α+β2.|\mathcal{C}_{n,k,\alpha,\beta}|=\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}d_{h}^{\frac{n-k+1}{2}}d_{c}^{\frac{n-k+3}{2}-\beta-\alpha}(d_{c}-1)^{k+\alpha+\beta-2}.

In all other cases |Cn,k,α,β|=0|C_{n,k,\alpha,\beta}|=0.

Proof.

Since there are n+1n+1 nodes on path PP, with kk in PP-single households and thus n+1kn+1-k of them in PP-multi-households, we must have

k+n+1k2=n+k+12k+\frac{n+1-k}{2}=\frac{n+k+1}{2}

households along path PP in total. Clearly, the numbers nn and kk cannot be of the same parity for any RB path PP, which is thus assumed for the rest of the proof (this assumption is also part of Condition 1).

If n=0n=0, then the source is also the first hospitalized node, and it is in a PP-single-household, which implies that |C0,1,0,0|=1|C_{0,1,0,0}|=1. If n=1n=1, then there are two cases: either the source is in the same PP-multi-household with the first hospitalized node, or both of them are in PP-single-households. The former case is possible via dhd_{h} edges from the source, which gives |C1,0,1,1|=dh|C_{1,0,1,1}|=d_{h}, while the latter case is possible via dcd_{c} edges, and gives |C1,0,1,1|=dc|C_{1,0,1,1}|=d_{c}. Since these are the only possible RB paths of length n1n\leq 1, we must have |C0,k,α,β|=|C1,k,α,β|=0|C_{0,k,\alpha,\beta}|=|C_{1,k,\alpha,\beta}|=0 for any other choice of parameters k,αk,\alpha and β\beta.

Let us assume that n2n\geq 2. Then, the source and the first hospitalized node are not in the same household. Let us denote the household of the source by HsH_{s} and the household of the first hospitalized node by HhH_{h}. Note that (1α)(1-\alpha) and (1β)(1-\beta) are the indicators of HsH_{s} and HhH_{h} being PP-single-households, and therefore k(1α)+(1β)k\geq(1-\alpha)+(1-\beta). If this inequality (which is also part of Condition 1) does not hold, then clearly |𝒞n,k,α,β|=0|\mathcal{C}_{n,k,\alpha,\beta}|=0. Similarly, the number of PP-multi-households is nk+12\frac{n-k+1}{2} and we must have nk+12α+β\frac{n-k+1}{2}\geq\alpha+\beta for |𝒞n,k,α,β|>0|\mathcal{C}_{n,k,\alpha,\beta}|>0, which implies the inequality n+12α2βkn+1-2\alpha-2\beta\geq k. Therefore 𝒞n,k,α,β\mathcal{C}_{n,k,\alpha,\beta} is empty if Condition 1 does not hold. For the rest of the proof we assume that Condition 1 does hold.

There are n+k32\frac{n+k-3}{2} households along path PP, excluding HsH_{s} and HhH_{h}. Among them, there are k(1α)(1β)k-(1-\alpha)-(1-\beta) PP-single-households, which can be chosen in (n+k32k2+α+β)\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta} ways. Once we know the color of each node along the path, the number of RB paths can be computed by multiplying the numbers of children with the appropriate color of each node. PP-single-households have no blue nodes, and PP-multi-households have exactly one, which implies that there are nk+12\frac{n-k+1}{2} blue nodes. Since blue nodes are preceded by red nodes that have dhd_{h} blue children, they give the multiplicative factor dhnk+12d_{h}^{\frac{n-k+1}{2}}. Blue nodes, except from the first hospitalized node (if it is blue), have dcd_{c} red children. So far we have accounted for all of the nodes in PP-multi-households and none of the nodes in PP-single-households. If the source is in a PP-single-household, then we must count its red children, whose number is dcd_{c}. This implies that there exist nk+12β+(1α)\frac{n-k+1}{2}-\beta+(1-\alpha) nodes with dcd_{c} red children. Finally, each PP-single-household, except HsH_{s} and/or HhH_{h} in case they are PP-single households, has dc1d_{c}-1 red children. There are k(1α)(1β)k-(1-\alpha)-(1-\beta) such PP-single-households, which gives the final term in equation (40). ∎

The sets 𝒞n,k,α,β\mathcal{C}_{n,k,\alpha,\beta} define equivalence classes on the transmission paths based on their household structure. In the next lemma we show that once we know which equivalence class we are in, it is possible compute the success probability of the LS+ algorithm.

Lemma A.6.

Let PP be the transmission path in the DDENR\mathrm{DDE}_{\mathrm{NR}} epidemic model with parameters (pi,pa,ph)(p_{i},p_{a},p_{h}) on the RB tree with parameters (dc,dh)(d_{c},d_{h}), and let pp be as computed in (5). Then, it holds that

𝐏(LS+ succeeds|P𝒞0,1,0,0)=1\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{0,1,0,0})=1

and

𝐏(LS+ succeeds|P𝒞1,0,1,1)=𝐏(LS+ succeeds|𝒞1,2,0,0)=1p.\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{1,0,1,1})=\mathbf{P}(LS+\textrm{ succeeds}|\mathcal{C}_{1,2,0,0})=1-p.

Let α,β{0,1}\alpha,\beta\in\{0,1\}, let n2n\geq 2 and let kk\in\mathbb{N} satisfy Condition 1. Then, it holds that

(41) 𝐏(LS+ succeeds|P𝒞n,k,α,β)(1p)n+k12(1+p)nk+12αβ.\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta})\geq(1-p)^{\frac{n+k-1}{2}}(1+p)^{\frac{n-k+1}{2}-\alpha-\beta}.

In all other cases 𝐏(LS+ succeeds|P𝒞n,k,α,β)\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta}) is not defined.

Proof.

If n=0n=0, then k=1k=1 and α=β=0\alpha=\beta=0. In that case, the source is the first hospitalized node and LS+ always succeeds. If n=1n=1, then the first hospitalized node is in the neighbourhood of the source, and LS+ succeeds if and only if the source is symptomatic, which happens with probability 1p1-p.

Let us assume that n2n\geq 2 and that kk satisfies Condition 1 (otherwise |𝒞n,k,α,β|=0|\mathcal{C}_{n,k,\alpha,\beta}|=0 and 𝐏(LS+ succeeds|P𝒞n,k,α,β)\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta}) is not defined). By Lemma 4.3 the LS+ algorithm succeeds in the DDENR\mathrm{DDE}_{\mathrm{NR}} model on the RB tree if, among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic, which means that we can prove a lower bound on the success probability of LS+. Let us assume that the source is indeed symptomatic. Since the first hospitalized node is symptomatic by definition, the households of the source and of the first hospitalized node cannot make the LS+ algorithm fail. Let us denote these two households by HsH_{s} and HhH_{h}, respectively. Also, let MM and SS be the sets of all PP-multi- and PP-single-households, respectively, excluding HsH_{s} and HhH_{h}. Then, LS+ succeeds if all nodes in the households of SS are symptomatic, and if at least one node in the households of MM is symptomatic, which has probability 1p1-p and 1p21-p^{2} for each type of household, respectively, by equation (5). These observations yield that

𝐏(LS+ succeeds|P𝒞n,k,α,β)\displaystyle\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta}) 𝐏(source is sym)(1p)|S|(1p2)|M|\displaystyle\geq\mathbf{P}(\textrm{source is sym})(1-p)^{|S|}(1-p^{2})^{|M|}
=(1p)(1p)k2+α+β(1p2)nk+12αβ\displaystyle=(1-p)(1-p)^{k-2+\alpha+\beta}(1-p^{2})^{\frac{n-k+1}{2}-\alpha-\beta}
(42) =(1p)k1+α+β(1p2)nk+12αβ.\displaystyle=(1-p)^{k-1+\alpha+\beta}(1-p^{2})^{\frac{n-k+1}{2}-\alpha-\beta}.

Finally, we are ready to state and prove Theorem 4.6 on the success probability of LS+, which we restate here for convenience.

Theorem A.7.

Let pp be as in (5) and let 𝒮(n,α,β)\mathcal{S}(n,\alpha,\beta) be the set of kk values that satisfy Condition 1. Then, for the DDENR\mathrm{DDE}_{\mathrm{NR}} epidemic model with parameters (pi,pa,ph)(p_{i},p_{a},p_{h}) on the RB tree with parameters (dc,dh)(d_{c},d_{h}) we have

𝐏(LS+ succeeds)𝐏(d(s,h)=0)+(1p)𝐏(d(s,h)=1)+\displaystyle\mathbf{P}(LS+\textrm{ succeeds})\geq\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+
(43) n=2α,β{0,1}k𝒮(n,α,β)(n+k32k2+α+β)(dh(1p))n+k12(dc(1+p))nk+12αβdc(dc1)k+α+β2λ1(dc1+D2)n+λ2(dc1D2)n𝐏(d(s,h)=n),\displaystyle\sum_{n=2}^{\infty}\sum_{\begin{subarray}{c}\alpha,\beta\in\{0,1\}\\ k\in\mathcal{S}(n,\alpha,\beta)\end{subarray}}\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}\frac{(d_{h}(1-p))^{\frac{n+k-1}{2}}(d_{c}(1+p))^{\frac{n-k+1}{2}-\alpha-\beta}d_{c}(d_{c}-1)^{k+\alpha+\beta-2}}{\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}}\mathbf{P}(d(s,h)=n),

where D,λ1D,\lambda_{1} and λ2\lambda_{2} are terms depending on parameters dcd_{c} and dhd_{h} and are computed explicitly in Lemma A.3.

Proof.

Let us extend the domain of 𝐏(LS+ succeeds|PCn,k,α,β)\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta}) by function gg defined as g:××{0,1}×{0,1}[0,1]g:\mathbb{N}\times\mathbb{N}\times\{0,1\}\times\{0,1\}\rightarrow[0,1] such that

(44) g(n,k,α,β)={𝐏(LS+ succeeds|PCn,k,α,β) if k𝒮(n,α,β)0 if k𝒮(n,α,β).g(n,k,\alpha,\beta)=\begin{cases}\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})&\text{ if }k\in\mathcal{S}(n,\alpha,\beta)\\ 0&\text{ if }k\not\in\mathcal{S}(n,\alpha,\beta).\end{cases}

Unlike 𝐏(LS+ succeeds|PCn,k,α,β)\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta}), gg is defined for every 4-tuple of parameters (n,k,α,β)××{0,1}×{0,1}(n,k,\alpha,\beta)\in\mathbb{N}\times\mathbb{N}\times\{0,1\}\times\{0,1\}. By the law of total probability we expand the success probability by conditioning on the path PP being of length nn as

𝐏(LS+ succeeds)=\displaystyle\mathbf{P}(LS+\textrm{ succeeds})= n=0k=0α,β{0,1}g(n,k,α,β)𝐏(PCn,k,α,β)\displaystyle\sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}g(n,k,\alpha,\beta)\mathbf{P}(P\in C_{n,k,\alpha,\beta})
(45) =\displaystyle= n=0k=0α,β{0,1}g(n,k,α,β)𝐏(PCn,k,α,β|PCn)𝐏(d(s,h)=n).\displaystyle\sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}g(n,k,\alpha,\beta)\mathbf{P}(P\in C_{n,k,\alpha,\beta}|P\in C_{n})\mathbf{P}(d(s,h)=n).

Next, we exchange the sums over α,β\alpha,\beta and kk. This allows us to sum over only those kk values that satisfy Condition 1, which implies that 𝐏(LS+ succeeds|PCn,k,α,β)\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta}) is well-defined. As in Lemma A.5, we need to treat the n=0n=0 and n=1n=1 cases separately. Continuing equation (A.2), we arrive to

𝐏(LS+ succeeds)=\displaystyle\mathbf{P}(LS+\textrm{ succeeds})= 𝐏(d(s,h)=0)+(1p)𝐏(d(s,h)=1)+\displaystyle\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+
(46) n=2α,β{0,1}k𝒮(n,α,β)𝐏(LS+ succeeds|PCn,k,α,β)|Cn,k,α,β||Cn|𝐏(d(s,h)=n)\displaystyle\sum_{n=2}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}\sum_{k\in\mathcal{S}(n,\alpha,\beta)}\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})\frac{|C_{n,k,\alpha,\beta}|}{|C_{n}|}\mathbf{P}(d(s,h)=n)

Substituting in the results from Lemmas A.3, A.5 and A.6 into equation (A.2) gives the desired result. ∎

A.3. Proof of Theorem 4.9

We start by restating Theorem 4.9 for convenience.

Theorem A.8.

In the (dr,d)(d_{r},d)-RET with parameters pi,pa,php_{i},p_{a},p_{h}, let at,la_{t,l} be as in Definition 4.8. Then

(47) at,0\displaystyle a_{t,0} =1\displaystyle=1
(48) at,l\displaystyle a_{t,l} =drpim=l1t1(ml1)(1pi)ml+1dl1pil1, for tl1\displaystyle=d_{r}p_{i}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}\textrm{, for }t\geq l\geq 1
(49) at,l\displaystyle a_{t,l} =0, for l ¿ t.\displaystyle=0\textrm{, for l > t}.
Proof.

Similarly to (Feng and Mahmoud, 2018; Mahmoud, 2021), the proof relies on generating functions. We start by addressing the boundary cases. For all t0t\geq 0, it holds that At,0=1A_{t,0}=1, and therefore at,0=1a_{t,0}=1. Similarly, for all l,tl,t such that l>tl>t, it holds that At,l=0A_{t,l}=0, and therefore at,l=0a_{t,l}=0. Suppose that tl=1t\geq l=1. During day t1t-1, on the first level, there are At1,1A_{t-1,1} infected (internal) nodes and drAt1,1d_{r}-A_{t-1,1} (external) nodes that may be infected with probability pip_{i} during day tt. Thus,

(50) At,1\displaystyle A_{t,1} =At1,1+Bin(drAt1,1;pi).\displaystyle=A_{t-1,1}+\mathrm{Bin}(d_{r}-A_{t-1,1};p_{i}).

Taking the expectation of both sides in equation (50) yields

(51) at,1=at1,1(1pi)+drpi, for t1.\displaystyle a_{t,1}=a_{t-1,1}(1-p_{i})+d_{r}p_{i},\textrm{ for }t\geq 1.

By subtracting the appropriate recurrence equations for at,1a_{t,1} and at1,1a_{t-1,1} for t2t\geq 2 we obtain the homogeneous recurrence equation

(52) at,1at1,1(2pi)+(1pi)at2,1=0, for t2\displaystyle a_{t,1}-a_{t-1,1}(2-p_{i})+(1-p_{i})a_{t-2,1}=0,\textrm{ for }t\geq 2

and boundary conditions a0,1=0a_{0,1}=0 and a1,1=drpia_{1,1}=d_{r}p_{i}. We solve for at,1a_{t,1} using the same methods as in the proof of Lemma A.3 and obtain

(53) at,1=dr(1(1pi)t), for t0.\displaystyle a_{t,1}=d_{r}\left(1-(1-p_{i})^{t}\right),\textrm{ for }t\geq 0.

Next, let us consider the general case tl>1t\geq l>1. On day t1t-1, there are At1,l1A_{t-1,l-1} nodes on level l1l-1. Since, each node on level l1l-1 has dd children, there are dAt1,l1dA_{t-1,l-1} nodes on level ll that have an infectious parent on level l1l-1. However, At1,lA_{t-1,l} of them are already infected. Therefore dAt1,l1At1,ldA_{t-1,l-1}-A_{t-1,l} nodes of level ll may be infected on day tt, each with probability pip_{i}, which implies

(54) At,l\displaystyle A_{t,l} =At1,l+Bin(dAt1,l1At1;l,pi), for tl2.\displaystyle=A_{t-1,l}+\mathrm{Bin}(dA_{t-1,l-1}-A_{t-1;l},p_{i}),\textrm{ for }t\geq l\geq 2.

Taking the expectation of both sides in equation (54) yields

at,l\displaystyle a_{t,l} =at1,l+(dat1,l1at1,l)pi\displaystyle=a_{t-1,l}+(da_{t-1,l-1}-a_{t-1,l})p_{i}
(55) =at1,l(1pi)+dpiat1,l1, for tl2.\displaystyle=a_{t-1,l}(1-p_{i})+dp_{i}a_{t-1,l-1},\textrm{ for }t\geq l\geq 2.

For convenience, let us introduce λ=1pi\lambda=1-p_{i} and μ=dpi\mu=dp_{i}, and also let

(56) f(x,y)=t=1l=1at,lxtyl=t=1l=1tat,lxtyl\displaystyle f(x,y)=\sum_{t=1}^{\infty}\sum_{l=1}^{\infty}a_{t,l}x^{t}y^{l}=\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}

be the generating function for at,la_{t,l} with t,l1t,l\geq 1. By multiplying (A.3) by xtylx^{t}y^{l} and summing it over t,l2t,l\geq 2 we obtain

t=2l=2tat,lxtyl\displaystyle\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l} =λt=2l=2tat1,lxtyl+μt=2l=2tat1,l1xtyl\displaystyle=\lambda\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t-1,l}x^{t}y^{l}+\mu\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t-1,l-1}x^{t}y^{l}
(57) =λxt=1l=2tat,lxtyl+μxyt=1l=1tat,lxtyl.\displaystyle=\lambda x\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}+\mu xy\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}.

Since a1,l=0a_{1,l}=0 for l2l\geq 2,

(58) t=1l=2tat,lxtyl=t=2l=2tat,lxtyl,\displaystyle\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}=\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l},

and by inserting (58) into (A.3), we obtain

(59) (1λx)t=1l=2tat,lxtyl\displaystyle(1-\lambda x)\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l} =μxyt=1l=1tat,lxtyl=(56)μxyf(x,y).\displaystyle=\mu xy\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}\stackrel{{\scriptstyle\eqref{eq:f(x,y)}}}{{=}}\mu xyf(x,y).

Now, we can also decompose the sum (58) using geometric series as

t=1l=2tat,lxtyl\displaystyle\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l} =t=1l=1tat,lxtylt=1at,1xty\displaystyle=\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}-\sum_{t=1}^{\infty}a_{t,1}x^{t}y
=(53)f(x,y)dryt=1(1λt)xt\displaystyle\stackrel{{\scriptstyle\eqref{eq:a_t,1}}}{{=}}f(x,y)-d_{r}y\sum_{t=1}^{\infty}(1-\lambda^{t})x^{t}
(60) =f(x,y)drxy(11xλ1λx).\displaystyle=f(x,y)-d_{r}xy\left(\frac{1}{1-x}-\frac{\lambda}{1-\lambda x}\right).

By plugging (A.3) into (59), we obtain the expression

(61) f(x,y)\displaystyle f(x,y) =dr(1λ)xy11x11λxμxy.\displaystyle=d_{r}(1-\lambda)xy\frac{1}{1-x}\frac{1}{1-\lambda x-\mu xy}.

Then, we expand the fractions in (61) into a power series and we next apply the binomial theorem, we arrive to

f(x,y)\displaystyle f(x,y) =dr(1λ)xyn=0xnm=0xm(λ+μy)m\displaystyle=d_{r}(1-\lambda)xy\sum_{n=0}^{\infty}x^{n}\sum_{m=0}^{\infty}x^{m}(\lambda+\mu y)^{m}
=dr(1λ)xyn=0xnm=0xmj=0m(mj)λmj(μy)j\displaystyle=d_{r}(1-\lambda)xy\sum_{n=0}^{\infty}x^{n}\sum_{m=0}^{\infty}x^{m}\sum_{j=0}^{m}\binom{m}{j}\lambda^{m-j}(\mu y)^{j}
(62) =dr(1λ)n=0m=0j=0m(mj)λmjμjx1+n+myj+1.\displaystyle=d_{r}(1-\lambda)\sum_{n=0}^{\infty}\sum_{m=0}^{\infty}\sum_{j=0}^{m}\binom{m}{j}\lambda^{m-j}\mu^{j}x^{1+n+m}y^{j+1}.

Let t=1+n+mt=1+n+m and l=j+1l=j+1. In order to obtain an expression for at,la_{t,l}, we must change the variables in the sums of equation (A.3) from (n,m,k)(n,m,k) to (t,m,l)(t,m,l). Changing the inner sum from variable jj to ll is simple. Changing the variables in the two outer sums is more challenging because t,nt,n and mm depend on each other in a nontrivial way. More precisely, since m,n0m,n\geq 0 we have t1t\geq 1 and also mt1m\leq t-1, which means that we have to set the lower limit of tt and the upper limit of mm accordingly. As for the remaining limits, variable tt can be arbitrary large, and mm can take any integer value starting from 0 independently of tt, which yields the expression

(63) f(x,y)\displaystyle f(x,y) =dr(1λ)t=1m=0t1l=1m+1(ml1)λml+1μl1xtyl.\displaystyle=d_{r}(1-\lambda)\sum_{t=1}^{\infty}\sum_{m=0}^{t-1}\sum_{l=1}^{m+1}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}.

For the values of ll with lm+1l\geq m+1, the binomial coefficient (ml1)\binom{m}{l-1} is 0, which implies that we can increase the upper limit of the inner sum from m+1m+1 to tt in equation (63). Then,

f(x,y)\displaystyle f(x,y) =dr(1λ)t=1m=0t1l=1t(ml1)λml+1μl1xtyl\displaystyle=d_{r}(1-\lambda)\sum_{t=1}^{\infty}\sum_{m=0}^{t-1}\sum_{l=1}^{t}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}
(64) =t=1l=1tdr(1λ)m=0t1(ml1)λml+1μl1xtyl.\displaystyle=\sum_{t=1}^{\infty}\sum_{l=1}^{t}d_{r}(1-\lambda)\sum_{m=0}^{t-1}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}.

Finally we can read off the value of at,la_{t,l} from equation (A.3) as

(65) at,l=dr(1λ)m=0t1(ml1)μl1λml+1=drpim=0t1(ml1)(dpi)l1(1pi)ml+1.\displaystyle a_{t,l}=d_{r}(1-\lambda)\sum_{m=0}^{t-1}\binom{m}{l-1}\mu^{l-1}\lambda^{m-l+1}=d_{r}p_{i}\sum_{m=0}^{t-1}\binom{m}{l-1}(dp_{i})^{l-1}(1-p_{i})^{m-l+1}.

A.4. Proof of Corollary 4.10

We start by restating Corollary 4.10 for convenience.

Corollary A.9.

In the RET(pi,dr,d)(p_{i},d_{r},d), let ata_{t} be the expectation of (12), as in Definition 4.8. For t0t\geq 0,

(66) at=1+dr(1pi+dpi)t1d1.a_{t}=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.
Proof.

By using linearity of expectation, equation (12) and Theorem 4.9 we obtain:

at\displaystyle a_{t} =l=0+at,l\displaystyle=\sum_{l=0}^{+\infty}a_{t,l}
=1+l=1+at,l\displaystyle=1+\sum_{l=1}^{+\infty}a_{t,l}
(67) =1+drpil=1tm=l1t1(ml1)(1pi)ml+1dl1pil1\displaystyle=1+d_{r}p_{i}\sum_{l=1}^{t}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}

Before we use binomial theorem, we need to swap the sums. Boundaries from (A.4) are equivalent to t1ml10t-1\geq m\geq l-1\geq 0, so we can rewrite this as 2 conditions: m+1l1m+1\geq l\geq 1 and tm0t\geq m\geq 0.

at,l\displaystyle a_{t,l} =1+drpim=0t1l=1m+1(ml1)(1pi)ml+1dl1pil1\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}\sum_{l=1}^{m+1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}
(68) =1+drpim=0t1l=0m(ml)(1pi)mldlpil\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}\sum_{l=0}^{m}\binom{m}{l}(1-p_{i})^{m-l}d^{l}p_{i}^{l}

Finally, by applying the binomial theorem and summing the geometric series, we obtain the desired equation:

at,l\displaystyle a_{t,l} =1+drpim=0t1(1pi+dpi)m\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}(1-p_{i}+dp_{i})^{m}
(69) =1+dr(1pi+dpi)t1d1.\displaystyle=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.

A.5. Proof or Lemma 4.12

We restate Lemma 4.12 here for convenience.

Lemma A.10.

Let us consider the stopped DET model with parameters (ct,l),pa,ph(c_{t,l}),p_{a},p_{h}, and let hh denote the first hospitalized node. Then

(70) 𝐏(d(s,h)=l)=t=0+ct,lct1,lctct1(1(1pa)ph)ct1(1(1(1pa)ph)ctct1).\displaystyle\mathbf{P}(d(s,h)=l)=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).
Proof.

Recall that a node added at day tt is uniformly distributed among the ctct1>0c_{t}-c_{t-1}>0 nodes added that day, and that the number of nodes added to level ll is ct,lct1,lc_{t,l}-c_{t-1,l} on day tt. If we condition on the time of the first hospitalized case, denoted by TIhTI_{h}, then

𝐏(d(s,h)=l)\displaystyle\mathbf{P}(d(s,h)=l) =t=0+𝐏(d(s,h)=l|TIh=t)𝐏(TIh=t)\displaystyle=\sum_{t=0}^{+\infty}\mathbf{P}(d(s,h)=l|TI_{h}=t)\mathbf{P}(TI_{h}=t)
=t=0+ct,lct1,lctct1𝐏(node is not hosp)ct1(1𝐏(node is not hosp )ctct1)\displaystyle=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}\mathbf{P}(\textrm{node is not hosp})^{c_{t-1}}(1-\mathbf{P}(\textrm{node is not hosp })^{c_{t}-c_{t-1}})
(71) =t=0+ct,lct1,lctct1(1(1pa)ph)ct1(1(1(1pa)ph)ctct1).\displaystyle=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).

Appendix B Dynamic Message Passing for the DDE model

In this section, we explain how we derived and implemented the DMP equations for the DDE+HNM model. We start by reviewing the previous work on the DMP equations for the SIR model in Appendix B.1, and then we proceed to our derivations in Appendix B.2. In Appendix B.3, we explain how we find candidate (node,time) pairs for the DMP equations, and in Appendix B.4 we conclude by combining Appendices B.2 and B.3 into a source-detection algorithm.

B.1. DMP Equations for the SIR Model

The DMP equations were first derived by (Lokhov et al., 2014) for the SIR model in the context of source detection. Their goal is to compute the marginal probabilities that node ii is in a given state at time tt (denoted by PSi(t),PIi(t)P_{S}^{i}(t),P_{I}^{i}(t) and PRi(t)P_{R}^{i}(t) for the susceptible, infected and recovered states, respectively), given initial conditions PSi(t0),PIi(t0)P_{S}^{i}(t_{0}),P_{I}^{i}(t_{0}) and PRi(t0)P_{R}^{i}(t_{0}) at some initial time t0t_{0}. To solve this problem in tree networks, we may consider a dynamic programming approach, where we delete a node ii, we compute the marginal probabilities of PSj(t1)P_{S}^{j}(t-1) for all neighbors jj of ii in the remaining subtrees, and use this information to compute PSi(t)P_{S}^{i}(t) (as the marginals are independent in each of the subtrees conditioned on the state of ii). The DMP equations make the dynamic programming intuition explicit. Originally, the DMP equations were developed for static networks, but since the generalization to time-varying networks is straightforward, and has already been foreshadowed in a similar heuristic algorithm (Jiang et al., 2016), we include it in this preliminary section. For time-varying networks, we define Ni(t)N_{i}(t) as the set of neighbors of node ii in the time-window [t,t+1)[t,t+1).

To formalize the dynamic programming approach, (Lokhov et al., 2014) introduces some new notation. Let λ\lambda be the probability that an infectious node infects a susceptible neighbor, and let μ\mu be the probability that an infectious node recovers. Let DiD_{i} be the auxiliary dynamics, where node ii receives infection signals, but ignores them, and thus remains in the SS state at all times. Let PSji(t)P_{S}^{j\rightarrow i}(t) be the probability that node jj is in the state SS at time tt in the dynamics DiD_{i}, and let θki(t)\theta^{k\rightarrow i}(t) be the probability that the infection signal has not been passed from node kk to node ii up to time tt in the dynamics DiD_{i}. Finally, let ϕki(t)\phi^{k\rightarrow i}(t) be the probability that the infection signal has not been passed from node kk to node ii up to time tt, and that node kk is in the state II at time tt, in the dynamics DiD_{i}. With these definitions, the dynamic programming approach is formalized by the following equations for tt0t\geq t_{0}:

(72) PSij(t+1)\displaystyle P_{S}^{i\rightarrow j}(t+1) =PSi(t0)kNi(t)\jθki(t+1),\displaystyle=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)\backslash j}\theta^{k\rightarrow i}(t+1),
(73) θki(t+1)θki(t)\displaystyle\theta^{k\rightarrow i}(t+1)-\theta^{k\rightarrow i}(t) =λϕki(t),\displaystyle=-\lambda\phi^{k\rightarrow i}(t),
(74) ϕki(t)\displaystyle\phi^{k\rightarrow i}(t) =(1λ)(1μ)ϕki(t1)+(PSki(t1)PSki(t)).\displaystyle=(1-\lambda)(1-\mu)\phi^{k\rightarrow i}(t-1)+\left(P_{S}^{k\rightarrow i}(t-1)-P_{S}^{k\rightarrow i}(t)\right).

The marginal probabilities that node ii is in a given state at time tt are then given by

(75) PSi(t+1)=PSi(t0)kNi(t)θki(t+1),\displaystyle P_{S}^{i}(t+1)=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)}\theta^{k\rightarrow i}(t+1)\,,
(76) PRi(t+1)=PRi(t)+μPIi(t),\displaystyle P_{R}^{i}(t+1)=P_{R}^{i}(t)+\mu P_{I}^{i}(t)\,,
(77) PIi(t+1)=1PSi(t+1)PRi(t+1).\displaystyle P_{I}^{i}(t+1)=1-P_{S}^{i}(t+1)-P_{R}^{i}(t+1)\,.

These equations are only exact on trees, but they can also be applied to networks with cycles as a heuristic approach. The heuristic gives good approximations to the true marginals if the network is at least locally tree-like (Karrer and Newman, 2010).

B.2. DMP Equations for the DDE+HNM Model

There are several differences between the SIR model on locally tree-like networks and the DDE+HNM model (see Figure 2 (a)). First, the DDE model has additional compartments (exposed nodes, asymptomatic nodes), which motivates the introduction of several new variables. Let λ(a)\lambda_{(a)} (resp., λ(s)\lambda_{(s)}) be the probability that an asymptomatic (resp., symptomatic) node infects a susceptible node. Let ϕki(t)(a)\phi^{k\rightarrow i}(t)^{(a)} (resp., ϕki(t)(s)\phi^{k\rightarrow i}(t)^{(s)}) be the probability that the infection signal has not been passed from node kk to node ii up to time tt, and that node kk is asymptomatic (resp., symptomatic) infectious at time tt, in the dynamics DiD_{i}.

The second important difference is that in the DDE model, the transition times between different compartments are deterministic instead of following a geometric distribution as in the standard SIR model. While deterministic transition times sound simpler at first, it turns out that they make the DMP equations more complex, because the Markovian property that each marginal probability depends only on the previous timestep is lost if the transition times are larger than 11. Recall that the times for the transitions EIE\rightarrow I and IRI\rightarrow R (with their default values) are TE=3T_{E}=3 and TI=14T_{I}=14.

Let us incorporate these two differences into equations (72)–(74) to derive the DMP equations for the DDE model. Equation (78) is essentially a copy of (72). Equation (79) follows equation (73), but we incorporate the two different variants of infected (asymptomatic and symptomatic) patients with their respective infection probabilities λ(a)\lambda_{(a)} and λ(s)\lambda_{(s)}. Equation (80) is a new equation, which is necessary because recovery times are no longer geometric random variables; instead we need to check the probabilities of infection TE+TIT_{E}+T_{I} timesteps earlier than the current time tt. Finally, equation (81) (resp., (82)) is the asymptomatic (resp., symptomatic) version of equation (74), while also incorporating the deterministic time for the transition EIE\rightarrow I. For tt0t\geq t_{0}, this yields equations

(78) PSij(t+1)\displaystyle P_{S}^{i\rightarrow j}(t+1) =PSi(t0)kNi(t)\jθki(t+1)=PSi(t0)PSi(t+1)θji(t+1),\displaystyle=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)\backslash j}\theta^{k\rightarrow i}(t+1)=P_{S}^{i}(t_{0})\frac{P_{S}^{i}(t+1)}{\theta^{j\rightarrow i}(t+1)},
(79) θki(t+1)θki(t)\displaystyle\theta^{k\rightarrow i}(t+1)-\theta^{k\rightarrow i}(t) =λ(a)ϕ(a)ki(t)λ(s)ϕ(s)ki(t),\displaystyle=-\lambda_{(a)}\phi^{k\rightarrow i}_{(a)}(t)-\lambda_{(s)}\phi^{k\rightarrow i}_{(s)}(t),
(80) PRki(t)\displaystyle P_{R}^{k\rightarrow i}(t) =PSki(tTETI1)PSki(tTETI)\displaystyle=P_{S}^{k\rightarrow i}(t-T_{E}-T_{I}-1)-P_{S}^{k\rightarrow i}(t-T_{E}-T_{I})
ϕ(a)ki(t)\displaystyle\phi^{k\rightarrow i}_{(a)}(t) =(1λ(a))(1PRki(t))ϕ(a)ki(t1)\displaystyle=(1-\lambda_{(a)})(1-P_{R}^{k\rightarrow i}(t))\phi^{k\rightarrow i}_{(a)}(t-1)
(81) +pa[PSki(tTE1)PSki(tTE)].\displaystyle\qquad\qquad+p_{a}[P_{S}^{k\rightarrow i}(t-T_{E}-1)-P_{S}^{k\rightarrow i}(t-T_{E})].
ϕ(s)ki(t)\displaystyle\phi^{k\rightarrow i}_{(s)}(t) =(1λ(s))(1PRki(t))ϕ(s)ki(t1)\displaystyle=(1-\lambda_{(s)})(1-P_{R}^{k\rightarrow i}(t))\phi^{k\rightarrow i}_{(s)}(t-1)
(82) +(1pa)[PSki(tTE1)PSki(tTE)].\displaystyle\qquad\qquad+(1-p_{a})[P_{S}^{k\rightarrow i}(t-T_{E}-1)-P_{S}^{k\rightarrow i}(t-T_{E})].

We note that for early values of tt, equations (80)–(82) depend on PSkiP_{S}^{k\rightarrow i} before t0t_{0}, which we initialize to be 1 (all nodes are susceptible before the first node develops the infection). The marginal probability that node ii is susceptible at time tt is still computed by equation (75) as before. Equations (76)–(77) do not apply anymore; we explain it in Appendix B.4 how to take into account observations for nodes in the infectious compartments.

The third difference between the the SIR model on locally tree-like networks and the DDE+HNM model is that the HNM model contains many short cycles inside the households. Short cycles can cause unwanted feedback loops in the DMP equations where, loosely speaking, nodes are treated as if they could reinfect themselves. We solve this issue by modifying the underlying graph to be locally tree-like (only for the computation of the DMP equations). Specifically, we introduce a new central household-node for each household, and we replace the cliques inside the households by a star graph centered at this new household-node node. Introducing such a central household-node does of course alter epidemic process, in particular it makes household infections less independent and slower (all household infections need to pass through an extra node). To mitigate this issue, we assume that central household-nodes have TE=1T_{E}=1 and that they are infected with probability 1 by any node in the same household. We tested the validity of the resulting DMP equations against simulations of the epidemic progressions and we found the results to be quite accurate, in particular, more accurate than the version without the introduction of these central household-nodes.

Note that we derived the DMP equations for the DDE+HNM model, however, since (i) the compartments are the same, (ii) the equations support temporal networks, and (iii) we have separate infection probabilities λ(a)\lambda_{(a)} and λ(s)\lambda_{(s)} for asymptomatic and symptomatic nodes, our equations can also be applied to the DCS+TU model after a discretizing (rounding) the time observations.

Finally, we touch upon the computational complexity of computing the DMP equations. In principle, we need to update O(dN)O(dN) equations (for each edge) over tmaxt_{\mathrm{max}} timesteps, where tmaxt_{\mathrm{max}} is the maximum time during which the marginals can still change, which can be as large as O(N)O(N). However, since we are only interested in computing the likelihood of the 5 earliest observations, tmaxt_{\mathrm{max}} is typically quite low. Moreover, since we assume to be in an early stage of the epidemic, most of the equations remain unchanged. For better computational scalability, we only compute PSi(t)P_{S}^{i}(t) and θki(t)\theta^{k\rightarrow i}(t) for nodes k,ik,i that have PIki(t)>0.01P_{I}^{k\rightarrow i}(t)>0.01, i.e., we only update nodes that are at least somewhat likely to have received the infection. Otherwise, we set PIki(t)=PIki(t1)P_{I}^{k\rightarrow i}(t)=P_{I}^{k\rightarrow i}(t-1), θki(t)=θki(t1)\theta^{k\rightarrow i}(t)=\theta^{k\rightarrow i}(t-1), and in the implementation we can perform these assignments implicitly using appropriate data structures. With these adjustments, the time-complexity of the algorithm becomes independent of NN, but remains dependent on the network parameters, the epidemic parameters and the number of sensors in a non-trivial way.

B.3. Feasible Source-time Pairs for Source Detection

In this section we explain how we implemented the feasible source identification algorithm, which was suggested as a preprocessing step for a method very similar to the DMP equations by (Jiang et al., 2016). Let us define the directed graph G2G_{2} on (node,infecton_time) pairs (we use “nodes” for the nodes of the original graph GG and “pairs” for the nodes of G2G_{2}), and draw an edge between two pairs (v1,t1)(v2,t2)(v_{1},t_{1})\rightarrow(v_{2},t_{2}) if v1v_{1} and v2v_{2} are in contact at t2t_{2}, and t2t_{2} is in the interval [t1+TE,t1+TE+TI][t_{1}+T_{E},t_{1}+T_{E}+T_{I}]. Observe that in the DDE model there is an edge (v1,t1)(v2,t2)(v_{1},t_{1})\rightarrow(v_{2},t_{2}) if and only if v1v_{1} becoming infected at time t1t_{1} can infect v2v_{2} at time t2t_{2}. The definition of G2G_{2} is applicable to the DCS model as well after discretization (rounding), however, since the infection times are not deterministic anymore, not all possible infections (v1,t1)(v2,t2)(v_{1},t_{1})\rightarrow(v_{2},t_{2}) have a corresponding edge in G2G_{2}.

Then, we perform a breadth-first search backwards on the directed edges of G2G_{2}, starting from each pair (vi,tiTETP)(v_{i},t_{i}-T_{E}-T_{P}), where viv_{i} is a symptomatic sensor node, and tit_{i} is the symptom onset time of viv_{i} (for the DCS model, we start from integer times in the tiTETP±(σE+σP)t_{i}-T_{E}-T_{P}\pm(\sigma_{E}+\sigma_{P}) interval to account for the randomness of the transition times). To limit the time complexity of the algorithm, we only consider the k1k_{1} earliest observations, which means that we start k1k_{1} breadth-first searches. With this construction, each pair (v,t)(v,t) discovered by a breadth-first search started from (vi,tiTETP)(v_{i},t_{i}-T_{E}-T_{P}) could have caused the infection in viv_{i}; we say that (v,t)(v,t) is an explanation for observation ii. We perform the breath-first searches until we find k2k_{2} pairs that explain all of the k1k_{1} earliest observations. See the pseudocode in Algorithm B.1.

Claim 1.

In the DDE model, Algorithm B.1 with σE=σP=0\sigma_{E}=\sigma_{P}=0 finds the k2k_{2} feasible explanations with the latest starting time of the k1k_{1} earliest symptomatic nodes.

Proof.

By construction, a source node vv that becomes infectious at time tt can cause an observation (vi,ti)(v_{i},t_{i}) if and only if there is a directed path from (v,t)(v,t) to (vi,tiTETP)(v_{i},t_{i}-T_{E}-T_{P}). Therefore, the breadth-first search algorithm finds all of the closest feasible sources in time. ∎

Input:
  • The mean exposed time TET_{E} the mean pre-infectious time TPT_{P}, the mean infectious time TIT_{I},

    the std of the exposed time σE\sigma_{E} and the std of the pre-infectious time σP\sigma_{P}

  • F(v)minF(v)_{min} and F(v)maxF(v)_{max} returns the minimum and maximum times when vv could have been exposed based on all of its (possibly asymptomatic or negative) test results

  • S(v)S(v) returns the time of symptom onset for a node vv tested positive symtomatic.

  • N(v,[tmin,tmax])N(v,[t_{min},t_{max}]) returns the set of neighbors of node vv in the interval [tmin,tmax][t_{min},t_{max}]

  • A lower estimate of the time the source became infectious tmint_{min}

  • Integers k1,k2k_{1},k_{2}

Output: A list of at most k2k_{2} tuples of node and time pairs that can explain the first k1k_{1} symptomatic nodes
l{}l\leftarrow\{\}; // if the list l[t]l[t] contains the tuple (v,w)(v,w), then the infection started at ww at time tt can explain vv
D{}D\leftarrow\{\}; // if the list D[w,t]D[w,t] contains the node vv, then the infection started at ww at time tt can explain vv
doneList[]doneList\leftarrow[];
for vSortIncreasingByValues(S)[0:k1]v\in SortIncreasingByValues(S)[0:k_{1}] do
 tminS(v)(TE+TP)(σE+σP)t^{\prime}_{min}\leftarrow S(v)-(T_{E}+T_{P})-(\sigma_{E}+\sigma_{P});
 tmaxS(v)(TE+TP)+(σE+σP)t^{\prime}_{max}\leftarrow S(v)-(T_{E}+T_{P})+(\sigma_{E}+\sigma_{P});
 for ttmint^{\prime}\leftarrow t^{\prime}_{min} to tmaxt^{\prime}_{max} do
    Append((v,v),l[t])Append((v,v),l[t^{\prime}]);
    Append(v,D[v,t])Append(v,D[v,t^{\prime}]);
    if Length(D[v,t])=k1Length(D[v,t^{\prime}])=k_{1} then
       Append((v,t),doneList)Append((v,t^{\prime}),doneList)
    
 
tSortIncreasingByValues(S)[k11]t\leftarrow SortIncreasingByValues(S)[k_{1}-1];
stopConditionFalsestopCondition\leftarrow False ;
while  not stopConditionstopCondition and t>tmint>t_{min} do
 for v,wl[t]v,w\in l[t] do
    for uN(w,[t,t1])u\in N(w,[t,t-1]) do
       tminmax(F(u)min,tTETI)t^{\prime}_{min}\leftarrow\max(F(u)_{min},t-T_{E}-T_{I});
       tmaxmin(F(u)max,tTE)t^{\prime}_{max}\leftarrow\min(F(u)_{max},t-T_{E});
       for ttmaxt^{\prime}\leftarrow t^{\prime}_{max} to tmint^{\prime}_{min} do
          Append((v,u),l[t])Append((v,u),l[t^{\prime}]);
          Append(v,D[(u,t)])Append(v,D[(u,t^{\prime})]);
          if Length(D[u,t])=k1Length(D[u,t^{\prime}])=k_{1} then
             Append((u,t),doneList)Append((u,t^{\prime}),doneList)
          
       
    
 doneListSortBySecondElement(doneList)doneList\leftarrow SortBySecondElement(doneList);
 if Length(doneListk2Length(doneList\geq k2) and tTEdoneList[k2][1]t-T_{E}\leq doneList[k2][1] then
    stopConditionTruestopCondition\leftarrow True;
    
 else
    tt1t\leftarrow t-1;
    
 
return doneListdoneList
Algorithm B.1 Feasible source identification (reverse dissemination (Jiang et al., 2016))

B.4. Source Detection via Feasible Source Identification and DMP

In this section we explain how to combine Algorithm B.1 with the DMP equations derived in Appendix B.2. See the pseudocode in Algorithm B.2.

We start by computing the DMP equations (78)-(82) and (72) for the k2k_{2} tuples of node and time pairs that can explain the first k1k_{1} symptomatic observations returned by Algorithm B.1. Next, our goal is to use these DMP equations to compute the likelihood of each of the k2k_{2} tuples using the k1k_{1} observations. Similarly to (Lokhov et al., 2014), we make the assumption that the first k1k_{1} observations are independent, and we can compute the likelihood by multiplying their respective marginals together. For symptomatic observed nodes vv, we know the time of symptom onset, which we denote by S(v)S(v). Then, the marginal probability of vv developing symptoms exactly at time tt can be computed by taking the difference of PSv(S(v)TPTE1)P_{S}^{v}(S(v)-T_{P}-T_{E}-1) and PSv(S(v)TPTE)P_{S}^{v}(S(v)-T_{P}-T_{E}) and multiplying the difference by (1pa)(1-p_{a}). In Algorithm B.2 we drop the multiplicative factor (1pa)(1-p_{a}) because it is present for all of the tuples, and it does not change the final order of their scores. For asymptomatic (resp., negative) observations, we only know that at the time of testing, denoted by A(v)A(v) (resp., NE(v)NE(v)), at least a time interval of length TET_{E} has passed (resp., TET_{E} has not passed) since the time of infection. Therefore, dropping the pap_{a} factor similarly to the symptomatic case, we compute the marginal of asymptomatic observations as 1PSv(A(v)TE)1-P_{S}^{v}(A(v)-T_{E}), and we compute the marginal of negative observations as PSv(NE(v)TE)P_{S}^{v}(NE(v)-T_{E}). Finally, the contributions of the observations are multiplied together for each of the k2k_{2} tuples returned by Algorithm B.1, and the scores approximating the likelihoods are returned.

Input:
  • The mean exposed time TET_{E}, the mean pre-infectious time TPT_{P}, the mean infectious time TIT_{I}

  • S(v)S(v) returns the time of symptom onset for a node vv tested positive symtomatic.

  • A(v)A(v) and NE(v)NE(v) return the time of asymptomatic and negative test results, respectively

Output: A dictionary LL of k2k_{2} elements, which contains a score for each (v,t)(v,t) pair that explains the first k1k_{1} observations. Higher scores signify higher confidence of being the source.
L{}L\leftarrow\{\};
doneListAlgorihtmB.1(k1,k2)doneList\leftarrow\mathrm{Algorihtm\ \ref{alg:feasible}}(k_{1},k_{2});
for v,t0doneListv,t_{0}\in doneList do
 PSP_{S}\leftarrow eq. (72) based on DMP eq. (78)-(82) with PSv(t0)=0P_{S}^{v}(t_{0})=0, and PSw(t0)=1P_{S}^{w}(t_{0})=1 for all wvw\neq v;
 L[v,t0]1L[v,t_{0}]\leftarrow 1;
 for wSw\in S do
    L[v,t0]L[v,t0](PSv(S(v)TPTE1)PSv(S(v)TPTE))L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot(P_{S}^{v}(S(v)-T_{P}-T_{E}-1)-P_{S}^{v}(S(v)-T_{P}-T_{E}));
    
 if wAw\in A  then
    L[v,t0]L[v,t0](1PSv(A(v)TE))L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot(1-P_{S}^{v}(A(v)-T_{E}));
    
 for wNEw\in NE do
    L[v,t0]L[v,t0]PSv(NE(v)TE)L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot P_{S}^{v}(NE(v)-T_{E});
    
 
return LL
Algorithm B.2 Source detection via DMP