Source Detection via Contact Tracing in the Presence of Asymptomatic Patients

Gergely Ódor gergely.odor@epfl.ch 0000-0001-9139-249X EPFLLausanneSwitzerland , Jana Vuckovic jana.vuckovic@epfl.ch EPFLLausanneSwitzerland , Miguel-Angel Sanchez Ndoye miguel-angel.sanchezndoye@epfl.ch EPFLLausanneSwitzerland and Patrick Thiran patrick.thiran@epfl.ch EPFLLausanneSwitzerland

Abstract.

Inferring the source of a diffusion in a large network of agents is a difficult but feasible task, if a few agents act as sensors revealing the time at which they got hit by the diffusion. One of the main limitations of current source detection algorithms is that they assume full knowledge of the contact network, which is rarely the case, especially for epidemics, where the source is called patient zero. Inspired by recent implementations of contact tracing algorithms, we propose a new framework, which we call Source Detection via Contact Tracing Framework (SDCTF). In the SDCTF, the source detection task starts at the time of the first hospitalization, and initially we have no knowledge about the contact network other than the identity of the first hospitalized agent. We may then explore the network by contact queries, and obtain symptom onset times by test queries in an adaptive way, i.e., both contact and test queries can depend on the outcome of previous queries. We also assume that some of the agents may be asymptomatic, and therefore cannot reveal their symptom onset time. Our goal is to find patient zero with as few contact and test queries as possible. We propose two local search algorithms for the SDCTF: the LS algorithm is more data-efficient, but can fail to find the true source if many asymptomatic agents are present, whereas the LS+ algorithm is more robust to asymptomatic agents. By simulations we show that both LS and LS+ outperform state of the art adaptive and non-adaptive source detection algorithms adapted to the SDCTF, even though these baseline algorithms have full access to the contact network. Extending the theory of random exponential trees, we analytically approximate the probability of success of the LS/ LS+ algorithms, and we show that our analytic results match the simulations. Finally, we benchmark our algorithms on the Data-driven COVID-19 Simulator (DCS) developed by Lorch et al., which is the first time source detection algorithms are tested on such a complex dataset.

adaptive source detection, contact tracing, sensor selection, epidemics

^†^†copyright: none

1. Introduction

During the COVID-19 pandemic, we have seen a revolution of the contact tracing technology, which helped track and contain the epidemic (Braithwaite et al., 2020; Kretzschmar et al., 2020). Some contact tracing programs were conducted by governmental/health agencies (Park et al., 2020), while others relied on decentralized approaches (Troncoso et al., 2020). Most contact tracing approaches work by notifying people who could have received the infection from known infectious patients, i.e., they trace “forward” in time. However, some advocate that a “bidirectional” tracing, where the past history of the infection is also tracked, can be more effective (Bradshaw et al., 2021; Endo et al., 2020; Kojaku et al., 2021). In this paper we focus on the “backward” direction of the problem; the task of identifying the first patient who carried the disease, also called patient zero, or the source of the epidemic. The identification of patient zero can either be limited to a smaller population cluster, in which case it can be a first step towards “bidirectional” tracing, or it can be more ambitious; finding the first patient who developed the mutation of a certain disease can help understanding how the mutation occurred, which can help us prevent, or better prepare for future epidemics.

Surprisingly, given the importance of the problem and the relatively large literature on the topic, we are not aware of any instance where source detection algorithms have been applied in real situations, including during the COVID-19 pandemic. Our goal in this paper is to examine the applicability of the source detection models in the literature (which we call frameworks from now on), and then propose a new framework, which improves them in several aspects. Originally, source detection was introduced in the context of rumor spreading instead of epidemics by Zaman and Shah in their pioneering Sigmetrics paper (Shah and Zaman, 2010, 2011). Translating to the language of epidemics for clarity, in the framework of (Shah and Zaman, 2011), an epidemic spreads over a network of agents that is completely known to us, and we observe a snapshot of the network, which means that every agent reveals if they are infected or not at some given time (not too early, because then the problem is trivial, nor too late, because then the problem is impossible). Shortly after (Shah and Zaman, 2011), Pinto et al. proposed a different framework, in which agents (also called sensors) reveal, in addition to their state, the time when they became infected, but where only a few of them do so and act as sensors (Pinto et al., 2012); indeed, the problem is trivial if all agents are sensors. This framework is better tailored to epidemics, as it is reasonable that obtaining any information from all the agents is much harder than asking one more question about the starting time of the symptoms of the disease to only some of them. Pinto et al. found that in their framework, if the sensors are already selected, the maximum likelihood estimator of the source has a closed form solution when the underlying network is a tree, and the time it takes for an agent to infect one of its susceptible contacts follows a Gaussian distribution. For general graphs, it is difficult to find an algorithm with any theoretical guarantees, although we note that many heuristics have been developed (Hu et al., 2018; Li et al., 2019; Paluch et al., 2018; Paluch et al., 2020b; Shen et al., 2016; Tang et al., 2018; Xu et al., 2019; Zhu et al., 2016). The only exception is on very simple contact networks (Lecomte et al., 2020), or when the epidemic spreads deterministically between the agents (Zejnilovic et al., 2013), which is not a realistic assumption for epidemics, but at least the estimation algorithm is trivial, and more emphasis can be put on the question of how the sensors should be selected for good performance (Spinelli et al., 2017c, 2018), which again is studied by heuristics in the general case (Paluch et al., 2020a). For a recent review of source detection algorithms, see (Shelke and Attar, 2019).

One of the main criticisms of original framework of Pinto et al. is that, even though the contact network is fully known, it is very difficult to find the source exactly unless a large fraction (20-50%) of the population act as sensors, which is unrealistic in the case of an epidemics, when the source is searched in a large population. An alternative recently proposed is to compute confidence sets for the source instead of finding it (Dawkins et al., 2021). But if our goal is to locate the source exactly, a promising approach is to allow the sensors to be selected adaptively to previous observations (Zejnilović et al., 2015, 2017), which we call adaptive sensor placement. When the contact network is known, adaptive strategies have been studied by simulations (Spinelli et al., 2017a, b) and by theoretical analysis (Lecomte et al., 2020), and they show a large reduction in the number of required sensors in real networks. In this paper, we will also allow the sensors to be placed adaptively.

We believe that the most problematic assumption that is still present in source detection papers, is the full knowledge of the contact network of agents, which is unrealistic (let alone because of privacy concerns). Due to this lack of data-availability, algorithms in the source detection literature have not been tested on realistic epidemic data. Moreover, while governmental/health agencies might have access to private datasets, such as cellular location data, from which a contact network may be estimated, these networks may be very noisy, and are potentially unfit for the source detection task. We only know of a few papers that study the effect of imperfections in the network data on the source detection task (Mashkaria et al., 2020; Zejnilović et al., 2016), but these papers study epidemics that spread deterministically between the agents. Inspired by adaptive sensor placement, and by the recent implementations of contact tracing algorithms, we propose a new framework for source detection, which we call Source Detection via Contact Tracing Framework (SDCTF). In SDCTF, algorithms can have two types of queries: contact queries, which can be used to explore the network, and sensor (test) queries, after which agents reveal their symptom onset time as before. The goal of the algorithm is to find the source as accurately as possible, while minimizing the number of contact and sensor queries. The SDCTF is a way to formalize the source detection task; it determines the goal of the algorithm and how information can be gained about the epidemic, but it does not specify the underlying epidemic and mobility data models (simulated or real). In this paper, we analyse different algorithms in the SDCTF with various epidemic and mobility models.

Besides specifying the possible queries that algorithms can make, the SDCTF also determines the way the outbreak is detected, which marks the starting time of the source detection task. In sensor-based source detection, the source detection task often starts long after the outbreak, when essentially all agents in the network are infected (Pinto et al., 2012), which can be seen as a limitation of source detection frameworks. The SDCTF is also closely related to contact tracing frameworks, where it is standard to assign a probability that each node spontaneously self-reports after developing symptoms, which triggers the activation of contact tracing algorithms (Kretzschmar et al., 2020; Bradshaw et al., 2021). In the SDCTF, we adopt the idea of self-reporting with a slight modification. We believe that the most interesting time to perform the source detection task is when a new disease (or a new mutation of the disease) appears, and therefore we tie these self-reporting events to hospitalizations, where infections are properly diagnosed by healthcare professionals. In particular, this means that the SDCTF can only be applied to epidemic data (and models) where hospitalizations are well-defined. In this paper, we use the datasets generated by the Data-driven COVID Simulator (DCS) introduced in (Lorch et al., 2020), which is one of the most realistic toolboxes that generate datasets modelling COVID-19, which we are aware of (notably, hospitalizations are part of the model). We also propose synthetic approximations for the epidemic and mobility models in the DCS; the Deterministically Developing Epidemic model and the Household Network Model, which improve the interpretability of our results since they have fewer parameters.

We propose a simple algorithm called LocalSearch (LS), which adaptively traces back the transmission path from the first hospitalized patient to the source. The LS algorithm is quite efficient at finding the source; the number of contact and sensor queries that it uses does not depend on the size of the network, but only on the local neighborhood of the source. Moreover, the LS algorithm provably finds the source with 100% accuracy, because of our assumption that every contact and sensor query is answered without noise. However, it is well-known that data-availability is a major issue in contact tracing (BeidasRinad et al., 2020), either because the agents do not comply with contact tracing efforts, or possibly (and in particular in the current COVID-19 epidemic) because they do not develop symptoms, and are unaware that they have the disease. In this paper, we model the effect of asymptomatic agents. When queried and tested, these agents do not reveal their time of infection, only whether they have or had the disease at some point. We show that the accuracy of the LS algorithm drops in the presence of asymptomatic agents, because the algorithm can get stuck while tracing back the transmission path from the first hospitalized patient to the source. Therefore, we propose an improved version of LS called LS+, which accounts for the presence of asymptomatic agents by placing more sensors. We are not aware of any previous work in the source detection literature that models the effect of asymptomatic patients, but the resulting model can be seen as a mix between the snapshot and the sensor-based models. We mention that non-complying agents or agents who provide noisy observations have been studied by (Altarelli et al., 2014; Hernando et al., 2008; Louni et al., 2015). Non-complying agents could also be included in our framework by treating them as asymptomatic agents (even though in this case we have no information about whether the agent had the disease or not), without jeopardizing the correctness of our algorithms.

We benchmark the LS and LS+ algorithms in both our data-driven and our synthetic epidemic and mobility models, and we compare them to state-of-the-art adaptive (Spinelli et al., 2017b) and non-adaptive (Jiang et al., 2016; Lokhov et al., 2014) algorithms tailored to the SDCTF, whenever possible. We find that both LS and LS+ outperform these baseline algorithms in accuracy (probability of finding the correct source).

While the LS/LS+ are designed to be simple algorithms, their theoretical analysis is quite challenging. Nevertheless, we are able to provide rigorous results about the success probability of both algorithms after a series of simplifications to the epidemic and mobility models, by extending some recent results on the theory of exponential random trees (Feng and Mahmoud, 2018; Mahmoud, 2021), which have previously not been connected to the source detection literature. We present these theoretical results in Section 4, after formally introducing the SDCTF, our models and the LS/LS+ algorithms in Section 3. By simulations, we show that our analytic results approximate the accuracy of the algorithms well, even in the most realistic setting in Section 5. Our analytic results provide additional insight into how the parameters of the epidemic and mobility models affect the performance of the algorithms. We discuss these insights along with some non-rigorous computations that mirror our main proof ideas in Section 2. Reading Section 2 before Sections 3-5 is useful to build intuition, but is not necessary to understand the paper.

2. Warmup Results

2.1. A Simple Network and Epidemic Model and a Simple Algorithm

Refer to caption — Figure 1. (a)-(c) shows the spread of the infection in the model considered in Section 2.1, which is equivalent to the growth of the RERT, with $d=2$ . Dark blue edges show the contacts on day $t$ , and light blue edges show contacts present on previous days (and thus subfigures). Orange (resp., red; black) nodes mark symptomatic non-hospitalized (resp, asymptomatic; symptomatic hospitalized) nodes. (d)-(f) shows the LS source detection algorithm introduced in Section 2.2, which succeeds in this example because there are no asymptomatic nodes on the transmission path between the first hospitalized node and the source. Black edges show the queried edges, and black stroke marks nodes already discovered by the algorithm. A node with black X marks a negative test result, and red stroked node marks the node currently maintained as source candidate by the LS algorithm.

Let us consider a time-dependent network model, where each agent meets $d$ new agents each day in such a way that the contact network is an infinite tree (ignoring the label of the edges giving the propagation time along the edge). This network models homogeneous mixing in a very large population; we consider more realistic network models in Section 3. On this network, we consider an epidemic model that starts at $t=0$ with one infected agent, and then progresses as infected agents infect their $d$ susceptible contacts each independently with probability $p_{i}$ each day. Since our goal is to study the epidemic process, it is sufficient to track only the agents who are already infectious (also called internal nodes), and the agents who are in contact with infectious agents at time $t$ (also called external nodes), as shown in Figure 1 (a)-(c). For $d=1$ , the spread of the infection is then equivalent to the growth a random tree $\mathcal{T}_{t}$ rooted at the source of the infection, known under the name of Random Exponential Recursive Tree (RERT) and recently introduced in (Mahmoud, 2021). Because of the similarities of the models, we refer to the model with general $d$ as RERT in the remaining of this section. We point out that the standard literature on elementary branching processes such as Galton-Watson trees or random recursive trees (Drmota, 2009) is not applicable in our scenario, because these branching processes have no notion of global time (i.e., a node in such processes becomes infectious immediately after receiving the infection), whereas nodes in diseases commonly go through an exposed, non-infectious period before becoming infectious, which is well captured by the RERT model. We mention that there is literature on more advanced branching processes that do have a notion of global time, e.g. Crump-Mode-Jagers trees (Jagers and Nerman, 1984), however we opt for the RERT because of its simple definition.

After a node (patient) becomes infected, the disease can take three courses (which for now do not affect $\mathcal{T}_{t}$ ): with probability $p_{a}$ the patient is asymptomatic, with probability $(1-p_{a})p_{h}$ the patient is hospitalized, and with probability $(1-p_{a})(1-p_{h})$ the patient recovers without hospitalization. The governmental/health agency learns about the outbreak when the first hospitalization occurs (see Figure 1 (c)) and starts the source detection process right away. It can inquire about the contacts of each agent and it can test the agents. From patients that were symptomatic (at any point in time in the past), the agency learns about their symptom onset time (which, in this simple model, is always one day after the infection time), but from asymptomatic patients it only learns that they had (or have) the disease at some point when they are tested. The framework introduced in this paragraph (including both the detection of the outbreak through the first hospitalization, and the possible actions the agency can take) is a simplified version of the SDCTF (Source Detection via Contact Tracing Framework), introduced in Section 3.3.

The network and epidemic models introduced in this section have four parameters: $d,p_{i},p_{a},p_{h}$ , and it is important to understand how each of them affects the difficulty of source detection in the SDCTF. We distinguish two important factors. First, if the outbreak is not detected rapidly enough, the length of the transmission path to the first hospitalized agent is long, and source detection becomes then difficult, because a lot of information needs to be recovered. Therefore, a low $p_{i}$ , a low $p_{h}$ and/or a high $p_{a}$ parameter can hinder source detection (recall that the probability of hospitalization was $p_{h}(1-p_{a})$ ). The second factor is related to the difficulty of recovering information about the transmission path. If $p_{a}$ is high, then there are a lot of nodes who are asymptotic and therefore do not reveal their symptom onset time, making source detection very difficult. Since $p_{a}$ affects both the length of the transmission path and the amount of collected information, it is safe to expect that, of all parameters, $p_{a}$ has the largest effect on the difficulty of source detection. The parameter $d$ is interesting, because a large $d$ can reduce the length of the transmission path, but it also makes the information about the transmission path less accessible as more agents need to be tested. Since in this paper we do not set a hard constraint on the total number of available tests, the advantage of a shorter path takes over the drawback of additional tests and a large $d$ increases the success probability.

To say anything quantitative about source detection in the SDCTF, we must discuss specific algorithms that solve the source detection task. In this paper we propose a simple algorithm called LocalSearch (LS), shown in Figure 1 (d)-(f). The LS algorithm maintains one candidate node $s_{c}$ at each iteration (initially, the first hospitalized node), which is always symptomatic, and it updates it in a greedy way: at the time of the infection of $s_{c}$ , all its $d$ incident edges are queried, and all its $d$ neighbors are tested. Then the agent with the lowest reported infection time will be the new candidate $s_{c}$ . The algorithm stops when $s_{c}$ does not change anymore between two consecutive iterations. For simplicity, we assume that the infection does not spread any further during these iterations, however, this assumption does not affect the ability of the algorithm to find the source or not. Indeed, it is not difficult to see that on tree networks, LS succeeds if and only if there are no asymptomatic nodes on the transmission path from the source to the first hospitalized agent. This observation leads us to enhance the LS algorithm by also searching within the neighbors of asymptomatic nodes; we explore this idea in the LS+ algorithm introduced in Section 3.5. We are not aware of this simple greedy algorithm being studied in the context of source detection, although similar ideas were implemented for non-adaptive source detection to lower the runtime of the algorithms (Paluch et al., 2018).

2.2. Back of the Envelope Calculation

Now, we have all the tools to estimate the probability of success of the LS algorithm. First we condition on the course of the disease in the source. With probability $p_{a}$ , the source is asymptomatic and LS can never succeed. With probability $(1-p_{a})p_{h}$ , the source itself becomes hospitalized, and LS always succeeds. Finally, with probability $(1-p_{a})(1-p_{h})$ the source is symptomatic but not hospitalized, which we call event $\mathcal{A}$ . If event $\mathcal{A}$ happens, then LS may or may not succeed depending on whether there are any asymptomatic nodes on the transmission path. More precisely, conditioned on event $\mathcal{A}$ and on the transmission path having length $l$ , the probability of success is $(1-p_{a})^{l-1}$ (since there are $l-1$ nodes on the path which can be asymptomatic), which implies

(1)

\mathbf{P}(\mathrm{success})=(1-p_{a})p_{h}+(1-p_{a})(1-p_{h})\left(\sum_{l=1}^{t}\mathbf{P}\left(\text{transmission path has length $l$}\mid\mathcal{A}\right)(1-p_{a})^{l-1}\right).

The difficult part is to compute the distribution of the transmission path conditioned on event $\mathcal{A}$ ; indeed we already saw that all four parameters $d,p_{i},p_{a},p_{h}$ affect this distribution in a non-trivial way. Let us perform a back of the envelope computation to get more insight into the effect of these parameters. The exact structure of the infection tree will not matter for this computation, only its profile does. It is denoted by $\mathcal{T}_{t}(l)$ and defined as the number of (internal) nodes at level $l$ (i.e., at distance $l$ from the source of the infection). Remember that by definition the RERT has $d\cdot\mathcal{T}_{t-1}(l-1)$ external nodes on level $l$ , and that at time $t$ each external node is promoted to be internal with probability $p_{i}$ to form $\mathcal{T}_{t}$ . Consequently, the level of a node $h$ added at time $t>0$ has the same distribution (conditioned on the tree $\mathcal{T}_{t-1}$ at the previous step) as the size (number of internal nodes) of the profile $\mathcal{T}_{t-1}(l-1)$ , that is,

(2)

\mathbf{P}(\mathrm{level}(h)=l\mid\mathcal{T}_{t-1}=T_{t-1})=\frac{T_{t-1}(l-1)}{|T_{t-1}|}.

Working on the RERT directly can be a daunting task, therefore we propose to approximate the numerator and the denominator of equation (2) by $\mathbf{E}[\mathcal{T}_{t-1}(l-1)]$ and $\mathbf{E}|[\mathcal{T}_{t-1}|]$ , respectively. It can be shown by a simple inductive argument, or by generating functions as in (Mahmoud, 2021), that for RERTs we have $\mathbf{E}[\mathcal{T}_{t}(l)]=\binom{t}{l}(dp_{i})^{l}$ and $\mathbf{E}[|\mathcal{T}_{t}|]=(1+dp_{i})^{t}$ , which suggests a binomial distribution for the level of $h$ . And indeed, we can approximate the distribution of the level of a node $h$ added at time $t$ as

	$\displaystyle\mathbf{P}(\mathrm{level}(h)=l)$	$\displaystyle\approx\frac{\mathbf{E}[\mathcal{T}_{t-1}(l-1)]}{\mathbf{E}[\|\mathcal{T}_{t-1}\|]}$
		$\displaystyle=\frac{\binom{t-1}{l-1}(dp_{i})^{l-1}}{(1+dp_{i})^{t-1}}$
		$\displaystyle=\binom{t-1}{l-1}\left(\frac{dp_{i}}{1+dp_{i}}\right)^{l-1}\left(1-\frac{dp_{i}}{1+dp_{i}}\right)^{t-l}$
		$\displaystyle=\mathbf{P}(\mathrm{Bin}(t-1,q)=l-1),$

with $q=dp_{i}/(1+dp_{i})$ .

One of the main challenges of this calculation is that we do not know the day of the first hospitalization $t$ conditioned on event $\mathcal{A}$ , we only know that each node is hospitalized with probability $(1-p_{a})p_{h}$ , which means that the index of the first hospitalized node follows a geometric distribution with mean $1/((1-p_{a})p_{h})$ . We approximate $t-1$ by the first time that the expected size of the infection tree (excluding the source since we condition on event $\mathcal{A}$ ) exceeds the expected index of the first hospitalized node. Therefore we solve

\mathbf{E}[|\mathcal{T}_{t-1}|-1]=(1+dp_{i})^{t-1}-1=\frac{1}{(1-p_{a})p_{h}}=\mathbf{E}[\text{index of the first hospitalized node}]

for $t$ (relaxing the constraint that $t$ is an integer), which gives

t-1=\frac{\log\left(1+\frac{1}{(1-p_{a})p_{h}}\right)}{\log(1+dp_{i})}.

Consequently, we approximate $\mathbf{P}\left(\text{transmission path has length $l$}\mid\mathcal{A}\right)$ by $\mathbf{P}(\mathrm{Bin}(t-1,q)=l-1)$ . Continuing equation (1), and using the well-known expression of the probability generating function of the binomial distribution, we get

	$\displaystyle\mathbf{P}(\mathrm{success})$	$\displaystyle\approx(1-p_{a})p_{h}+(1-p_{a})(1-p_{h})\left(\sum_{l=1}^{t}\mathbf{P}\left(\mathrm{Bin}\left(t-1,q\right)=l-1\right)(1-p_{a})^{l-1}\right)$
(3)			$\displaystyle=(1-p_{a})\left(p_{h}+(1-p_{h})\left((1-p_{a})\frac{dp_{i}}{1+dp_{i}}+1-\frac{dp_{i}}{1+dp_{i}}\right)^{\frac{\log\left(1+\frac{1}{(1-p_{a})p_{h}}\right)}{\log(1+dp_{i})}}\right).$

One can check that this expression agrees with our qualitative intuition. However, it is not at all clear whether it is valid because of the strong approximations made in some steps of the above computation. In Section 4, we prove a rigorous upper bound on the success probability, and we also provide much more careful approximations by proving exact theorems about the simplified models that we use. Then, in Section 5 we compare our results with simulation results on synthetic data, as well as with data generated by the DCS model.

3. Models, Methods, Algorithms

3.1. Epidemic Models

3.1.1. The DCS Model

We call DCS the model implemented by (Lorch et al., 2020). The DCS model is fairly complex, and we only give a brief overview.

Each agent in the agent set $V$ can be in one of 8 states: susceptible, exposed, asymptomatic infectious, pre-symptomatic infectious, symptomatic infectious, hospitalized, recovered or dead. Transitions between different states are characterized by counting processes described by stochastic differential equations with jumps. The most important, and also most complicated of these counting processes is the exposure counting process $N_{i}(t)$ , which is modeled by a Hawkes process for each agent $i$ . Hawkes processes are point processes with a time-dependent, self-exciting conditional intensity function $\lambda^{*}_{i}(t)$ .

(4)

\hskip-5.0pt\lambda^{*}_{i}(t)=\beta\sum_{j\in V\backslash\{i\}}\int_{t-\delta}^{t}K_{i,j}(\tau)~\gamma e^{-\gamma(t-\tau)}\,d\tau

where the kernel $K_{i,j}(\tau)$ indicates whether $j$ has been at time $\tau$ at the same site where $i$ is at time $t$ , and whether $j$ is in the infectious state. Parameters $\gamma$ and $\delta$ are the decay of infectiousness at sites and the non-contact contamination window, respectively, and they account for the fact that $j$ can infect $i$ even if they are never at the same site, as $j$ can leave some pathogens behind (airborne for instance). Parameter $\beta$ is the transmission rate for symptomatic and asymptomatic individuals, and it comes in two versions: $\beta_{c}$ accounts for infections outside the household and $\beta_{h}$ accounts for infection in the household. Parameters $\beta_{c}$ and $\beta_{h}$ are fitted to the COVID-19 infection data of Tubingen from 12/03/2020 to 03/05/2020 using Bayesian Optimization. The model also has a parameter for the relative asymptomatic transmission rate built into the function $K_{i,j}(\tau)$ , which scales down the infectiousness of asymptomatic agents (to 55% of the infectiousness of symptomatic agents by default).

Once a susceptible agent becomes infected, the disease can take three possible courses (see Figure 2 (a)). With probability $p_{a}$ , the agent becomes asymptomatic infectious after time $T_{E}$ , and then recovers after time $T_{I}$ . With probability $1-p_{a}$ , the agent becomes pre-symptomatic infectious after time $T_{E}$ , next symptomatic infectious after time $T_{P}$ , and then recovers with probability $1-p_{h}$ after time $T_{I}-T_{P}$ , or becomes hospitalized with probability $p_{h}$ after time $T_{H}$ . Agents in the DCS are also assigned age values based on demographic data, and the hospitalization probability $p_{h}$ of each agent is determined based on its age (following COVID-19 infection data). The times $T_{E},T_{P},T_{I}$ and $T_{H}$ are drawn from an appropriately parametrized (using values from the COVID-19 literature) lognormal distribution as shown in Table 3.

3.1.2. The DDE Model

We start by taking the DCS model (Lorch et al., 2020), which we simplify to enable its theoretical analysis. In the Deterministically Developing Epidemic (DDE) model, continuous time (used in DCS) is replaced by discrete time-steps: we refer to one time-step in the DDE as one day. Instead of modelling the infection propagation as a Hawkes process, an infectious agent (symptomatic or asymptomatic) can infect its susceptible neighbor with probability $p_{i}$ each day. Thereafter, the disease progresses the same way as in the DCS, except that in the DDE model the transition times are deterministic (the infection events and the severity of the disease (i.e., the (a)symptomatic and hospitalized states) are still determined randomly), and we have a single parameter $p_{h}$ for the hospitalization probability (agents in this model do not have an age parameter). We discuss how we set the parameters of the DDE model in Section 3.4.

3.2. Simulating Mobility

3.2.1. Tubingen Mobility Model

We briefly review the mobility model introduced in (Lorch et al., 2020), and illustrated in Figure 2 (b). The population is partitioned into households of possibly varying size (usually between 1 and 5). The households are assigned a location, and we also place some external sites (shops, offices, schools, transport stations, recreating sites) on the map, which the agents may visit. The location of the households and the number of agents in them is sampled randomly based on demographic datasets. Initially, each agent is assigned a few favorite sites (randomly based on distance), and will only visit these throughout the simulation. Each agent decides to leave home after some exponentially distributed time, visits one of its (randomly chosen) favorite sites, and comes back home after another (usually much shorter) exponentially distributed time. If two agents visit the same site at the same time, or within some time $\delta$ , we record them as a contact, which gives an opportunity for the infection to propagate. We denote the Tubingen mobility model as TU, and the DCS epidemic model that runs on the TU mobility model as DCS+TU.

3.2.2. Household Network Model

The Household network model (HNM) was inspired by (Lorch et al., 2020), however we note that similar models have been studied in the theoretical community by (Ball et al., 2009). As in the Tubingen mobility model, in HNM $N$ nodes are assigned into households, but of constant size $d_{h}+1$ . Every pair of nodes in the same household are connected by an edge, forming therefore cliques of size $d_{h}+1$ . Additionally, each node is assigned $d_{c}$ half edges, which are paired uniformly at random with other half-edges in the beginning. Some half-edge pairings can result in self-loops or multi-edges, which are discarded. This construction defines a random graph generated by a configuration model, which shares a lot of similarities with Random Regular Graphs (RRG) (Wormald et al., 1999). In fact, if we join nodes in the same household into a single node in the HNM (which we refer to as the network of households of the HNM), then the resulting graph is equivalent to the pairing model of RRGs with degree $d_{c}(d_{h}+1)$ . It is well-known that in the pairing model of RGGs of degree $d$ , the local neighborhood (of constant radius, as the number of nodes tends to infinity) of a uniformly randomly chosen vertex is a $d$ -regular tree (with probability tending to 1), which implies that locally there are asymptotically almost surely no self-loops, multi-edges or any cycles in the graph. This result has various names; in random graph theory the result is usually proved by subgraph counting (Wormald et al., 1999), in probability theory it is the basis of branching process approximations (Ball et al., 2009), and in graph limit theory it is called the local convergence to the infinite $d$ -regular tree (Benjamini and Schramm, 2011). In our theoretical analysis, this result motivates the approximation of the neighborhood of the source in the network of households of the HNM by an infinite $d_{c}(d_{h}+1)$ -regular tree. The HNM itself is then approximated by replacing each (household) node of the infinite $d_{c}(d_{h}+1)$ -regular tree of households by a $(d_{h}+1)$ -clique, and by setting the edges so that each (individual) node has degree exactly $d_{c}+d_{h}$ , while keeping the connection between cliques unchanged (see Figure 2 (c) for a visualization).

Since the HNM is a time-independent graph, we adopt the standard notations from graph theory. Formally, the HNM is given by the set of nodes and edges $G=(V,E)$ . Let us denote by $H(v)$ the set of nodes that are in the same household as node $v$ . The distance between two nodes $u,v\in V$ (denoted by $d(u,v)$ ) is defined as a number of edges of the shortest path between $u$ and $v$ . We denote the DDE epidemic model that runs on the HNM network as DDE+HNM.

3.3. The Source Detection via Contact Tracing Framework

We present the Source Detection via Contact Tracing Framework (SDCTF), which can be applied to both epidemic and mobility models presented so far. The framework determines how the government/health agency, which conducts the source detection task, learns about the outbreak, and how it can gather further information to locate the source. In the SDCTF, as in Section 2.1, the agency learns about the outbreak when the first hospitalization occurs, and it also learns the identity of nodes when they become hospitalized (including the identity of the first hospitalized node).

After the outbreak is detected, the agency can make three types of queries. The first type of query, the household query with parameter $v$ , reveals the agents that live in the same household as $v$ . The household query works the same way in both the TU and the HNM models, and we do not limit the number of times it can be called (these queries are considered as cheap in the SDCTF). The second type of query, the contact query, works differently in the TU and the HNM models. For the TU model, a contact query has two parameters: an agent $v$ and a time window $[t_{1},t_{2}]$ . As a result, all agents that have been in contact with $v$ (and therefore could have infected $v$ or could have been infected by $v$ ) at an external site between $t_{1}$ and $t_{2}$ are revealed. In the HNM, no time window is needed for the contact query (which we also call edge query), and all neighbors of $v$ in graph $G$ are revealed. Contact (and edge) queries are considered expensive in the SDCTF. While in this paper we do not limit the number of available queries, we track the number of contacts and edges that are revealed as the algorithm runs. Note that in the TU model if two agents $v_{1}$ and $v_{2}$ have been in contact during the time window $[t_{1},t_{2}]$ and also during a different time window $[t_{3},t_{4}]$ , then those are counted as separate contacts, whereas in the HNM an edge between $v_{1}$ and $v_{2}$ is only counted once. Although contact queries are considered expensive, both household and contact queries are answered instantly in the SDCTF.

The third kind of query is the test query with parameter $v$ , which reveals information about the course of the disease in the queried agent (see Figure 2 (a)). Symptomatic patients reveal the time of their symptom onset (which exactly determines their time of infection in the DDE due to the deterministic transition times) if they are past the pre-symptomatic state (i.e., if they are either infectious or recovered). Asymptomatic and pre-symptomatic patients do not reveal any information about their infection time; they just reveal that they have the disease or had the disease at some point and have recovered. For all algorithms we assume that asymptomatic patients do not reveal whether they have the infection at the time they are queried. Finally, agents who have not been exposed, or are still in their exposed state, give a negative test result. Test queries are again considered expensive in the SDCTF, we even limit the population that can be tested on any given day to at most 1% of the total population, due to the capacity of testing facilities. However, since in this paper we do not limit the number of days that the algorithm can use to locate the source, the limit on the number of tests does not play an important role. As opposed to household and contact queries (and the model in Section 2.1), tests results are only answered the next day in the SDCTF, which means that the algorithms must operate in “real-time”, while the epidemic keeps propagating.

3.4. Parameters

The DCS+TU model has many parameters, most of which are fitted to COVID-19 datasets of Tubingen from 12/03/2020 to 03/05/2020 by (Lorch et al., 2020) (we show the most relevant parameters in Table 3). We determined the parameters of the DDE+HNM model so that they fit the parameters of the DCS+TU as closely as possible (see the precise values in Table 3). We determine the values of $T_{E},T_{P},T_{I}$ in the DDE+HNM by rounding the expected value of the corresponding distribution in the DCS+TU to the nearest integer. Since $p_{a}$ is simply a constant in both models, we keep the same numerical value in the DDE+HNM. The parameter $p_{h}$ is more complicated, because in the DCS+TU model there is a different hospitalization probability for each age group. We take the average hospitalization probability across the population to be $p_{h}$ . The most complicated parameter to fit is $p_{i}$ , because in the DCS+TU model, infections are modelled by a Hawkes process, which depends on many parameters, including whether the infectious agent is symptomatic or asymptomatic, the length of the visit, the site where the infection happens, etc (see equation (4)). We empirically observe the probability of infection in every contact in several simulations, and we find that an agent has on average 15 contacts outside the household each day, and that the average probability of infection during such a contact is around 0.02. However, since we use smaller networks for the DDE+HNM ( $N=400$ or $1000$ , because running the baselines on larger networks is not feasible) than the DCS ( $N=9054$ ), setting $d_{c}$ to be as high as 15 would violate the assumption that the network of households of the HNM can be locally approximated by a tree (see Section 3.2.2). Therefore we chose $d_{c}=3$ for the HNM and we scale $p_{i}$ so that $d_{c}p_{i}$ (the expected number of external infections caused by a single agent each day) is the same in the DCS+TU and the DDE+HNM models. Finally, we choose $d_{h}$ in the DDE+HNM by rounding the average household connections in the DCS+TU. Note that the average number of household connections is not the same as the average number of household members, because the number of connections grows quadratically in the size of the households, and thus fitting to the number of connections results in a higher $d_{c}$ (due to the Quadratic Mean-Arithmetic Mean inequality).

Finding the default values for the parameters is useful to create a realistic model. However, we also interested in the effect of each of the parameters on the performance of our algorithms. Therefore, in the DDE+HNM, we vary the parameters $p_{a},p_{h},p_{i},d_{h}$ and $d_{c}$ , while keeping the other ones unchanged. For the DCS+TU model, we also keep the mobility model fixed and we focus on varying the parameters $p_{a},p_{h}$ and $p_{i}$ . As noted above, there is no single parameter $p_{h}$ or $p_{i}$ in the DCS+TU model, therefore we change all hospitalization probabilities and all intensities of the Hawkes processes so that the hospitalization probability averaged across the population and the infection probability averaged across contacts equal the desired values.

3.5. The LocalSearch Algorithms LS and LS+

The LS algorithm finds patient zero by local greedy search. It keeps track of a candidate node, which is always the node with the earliest reported symptom onset time. We denote the candidate of the algorithm at iteration $i>0$ by $s_{c,i}$ . We think of $s_{c}$ as a list, which is updated in each iteration of the algorithm, and we use the notation $s_{c,-1}$ for the last element of the list (i.e., the current candidate). In each iteration of the algorithm, we compute a new candidate denoted by $s_{c}^{\prime}$ , and we append it at the end of the list $s_{c}$ at the beginning of the next iteration, unless $s_{c}^{\prime}=s_{c,-1}$ , in which case the algorithm terminates.

Since we consider the SDCTF, the outbreak is detected when the first hospitalized case is reported. At that time, $s_{c}^{\prime}$ is initialized to be the hospitalized patient, the test queue is initialized to be empty, and the algorithm is started. In the beginning of an iteration, if the test queue is empty, the household members and the “backward” contacts of the current candidate $s_{c,-1}$ are queried and are added to the test queue (see Figure 4 (a)). We define “backward” contacts as the set of nodes that have been in contact with $s_{c,-1}$ in the interval $[t_{s_{c,-1}}-(T_{E}+T_{P})-(\sigma_{E}+\sigma_{P}),t_{s_{c,-1}}-(T_{E}+T_{P})+(\sigma_{E}+\sigma_{P})]$ , where $t_{s_{c,-1}}$ is the symptom onset time of current candidate $s_{c,-1}$ . The terms $\sigma_{E}$ and $\sigma_{P}$ model the standard deviation of the transition times, and they are set to zero for the DDE and to $\sigma_{E}=2$ and $\sigma_{P}=1$ for the DCS based on Table 3. We note that the notion of “backward” contacts is only meaningful in the case of time-dependent network models; for the HNM, all neighbors are counted as backward contacts.

After the test queue is initialized, the agents inside the queue are tested (see Figure 4 (b)). Not all nodes can be tested on the same day because of the limitation on the number of tests available per day in the SDCTF, however, this has little effect because we do not proceed to the next iteration until the test queue becomes empty. Once the test results come back to the agency, if any of the (symptomatic) nodes $v$ reports an earlier symptom onset time than the current candidate $s_{c,-1}$ , then we update our next candidate $s_{c}^{\prime}$ to be $v$ (see Figure 4 (c)). We note that the iteration does not stop immediately after $s_{c}^{\prime}$ is first updated; the iteration runs until the test queue becomes empty, and until then, $s_{c}^{\prime}$ can be updated multiple times. This is important in the theoretical results to prevent the algorithm from getting sidetracked (see Figure 9). We also experimented with a version of the LS and LS+ algorithms where the iteration stops immediately once $s_{c}^{\prime}$ is updated; we call these algorithms LSv2 and LS+v2.

The main drawback of the LS algorithm is that is gets stuck very easily if there is even one asymptomatic node on the transmission path. For this reason, we introduce the LS+ algorithm, in which we enter the backward contacts of the asymptomatic household members of $s_{c,-1}$ , and the household members of any asymptomatic node into the testing queue (see Figure 4 (d)-(f)). Since the symptom onset times of asymptomatic nodes $v$ are not revealed, we define backward contact in this case as any contact in the time window $[t_{s_{c,-1}}-(T_{P}+2T_{E}+T_{I}),t_{s_{c,-1}}-(T_{P}+2T_{E})]$ , where $t_{s_{c,-1}}$ is still the symptom onset time of the current candidate $s_{c,-1}$ . Indeed, in the DDE model, since $s_{c,-1}$ was infected at $t_{s_{c,-1}}-(T_{P}+T_{E})$ , if $v$ infected $s_{c,-1}$ , agent $v$ must have been infectious at that time, which implies that $v$ could not have been infected later than $t_{s_{c,-1}}-(T_{P}+2T_{E})$ or earlier than $t_{s_{c,-1}}-(T_{P}+2T_{E}+T_{I})$ . In the DCS model, the terms $\sigma_{E}$ and $\sigma_{P}$ can be subtracted and added to the two ends of the queried time window to account for the randomness in the transition times.

Both algorithms stop if the testing queue becomes empty before a node with an earlier symptom onset time than $s_{c,-1}$ is discovered, and both algorithm return $s_{c,-1}$ as their inferred source. The high level pseudocode and an illustration of the LS and LS+ algorithms are given in Figure 4.

4. Theoretical Results

In this section we present theoretical results for the LS and LS+ algorithms described in Section 3.5. We follow a similar approach as in the non-rigorous computation in Section 2.2, which useful but not necessary for understanding this section. All the statements are rigorously established, and whenever we reach a point where the computations would become intractable, we propose a simpler approximate model to study. One of the main contributions of this paper is to identify which computations can be done on more general models, and which computations need more simplified ones (see Figure 5 for an overview of the different models used for the computations in this section).

We compute the success probability of the LS and LS+ algorithms in two steps. We first assume the length of the transmission path known in Section 4.1 . This computation is then made possible by a tree approximation of the HNM, called the Red-Blue (RB) tree (defined in Section 4.1.1), and a slightly modified version of the DDE model called $\mathrm{DDE}_{\mathrm{NR}}$ (defined in Section 4.1.2). The RB tree preserves some of the household structure in the HNM, and therefore allows us gain insight into the difference between the LS and LS+ algorithms, which would be difficult to obtain if we had worked on trees without taking the household structure in account.

For the second step, we would need to compute the distribution of the transmission path on the RB tree. However, finding a closed form expression is intractable. Instead, we combine the network and epidemic models into a growing random tree model, and we consider a $d$ -ary Random Exponential Tree (RET). The $d$ -ary RET model has only been studied for $d=2$ (Feng and Mahmoud, 2018); we extend the results on their expected profile for general $d$ in Section 4.2.1. Nevertheless, working on $d$ -ary RETs still remains difficult, and therefore, in our last modeling step, we introduce a Deterministic Exponential Tree (DET) model, whose profile is close to the expected profile of the RET, and we compute the distribution of the transmission path on this model in Section 4.2.2.

To summarize all models considered in this paper, we have a data-driven and a synthetic model for simulations (DCS+TU and HNM+DDE), an analytically tractable model (RB-tree+ $\mathrm{DDE}_{\mathrm{NR}}$ ) where we can compute the success probability if the length of the transmission path is known. In a second stage, we compute the distribution of a transmission path on a deterministic tree (DET), which has a similar profile as a random tree (RET) that approximates our analytically tractable model. We visualize these five different models in Figure 5 (a), and we show by simulations in Figure 5 (b) that the distribution of the transmission path is similar in all of the considered models with appropriately scaled parameters. We compare our analytic results on the success probabilities of the LS and LS+ algorithms with our simulation results in Section 5.3 in Figure 7.

4.1. Success Probability of LS and LS+ Algorithms on the RB Tree

In this section we introduce the Red-Blue (RB) tree model (which is a tree approximation to the HNM), and we calculate the exact probability that the LS and LS+ algorithms succeed, if the length of the transmission path is known.

4.1.1. Red-Blue tree models

In short, a RB tree is a two-type branching process with a deterministic offspring distribution that depends on $d_{h}$ and $d_{c}$ . The lack of randomness in this distribution makes us adopt the formalism of deterministic rooted trees.

Definition 4.1.

Let a rooted tree, denoted by $G(s)$ , be a tree graph with a distinguished node root node $s$ . Let $u$ and $v$ be two nodes connected by an edge in $G(s)$ . If $d(u,s)<d(v,s)$ , we say that $u$ is a parent of $v$ , otherwise $u$ is a child of $v$ . Moreover, if $d(s,v)=l$ we say that $v$ is on level $l$ . An RB tree with parameters $(d_{c},d_{h})$ is an infinite rooted tree, such that the nodes also have an additional color property. The root is always colored red and the rest of the nodes are colored red or blue. The root has $d_{c}$ red and $d_{h}$ blue children. Every other red node has $d_{c}-1$ red and $d_{h}$ blue children, and every blue node has $d_{c}$ red children and no blue children. Red nodes and their $d_{h}$ blue children partition the nodes of the RB tree $G(s)$ into subsets of size $d_{h}+1$ , which we call households.

Remark 4.1.

In the RB tree, each blue node has degree $d_{c}+1$ , and each red node has degree $d_{c}+d_{h}$ , including the root of the tree $s$ (which is the source of the epidemic, when the RB tree is combined with an epidemic model).

The RB tree can be seen as a local tree approximation of the HNM. Let $G=(V,E)$ be an HNM with parameters $(d_{c},d_{h})$ , and let $s\in V$ be the distinguished source node. In Section 3.2.2 we noted that the HNM can be approximated locally around the source node by replacing each node of an infinite $d_{c}(d_{h}+1)$ -regular tree by a $(d_{h}+1)$ -clique, and setting the edges so that each node has degree exactly $d_{c}+d_{h}$ , while keeping the connection between cliques unchanged. Let us call this infinite graph $G^{*}$ . Although $G^{*}$ is not a tree, all cycles in $G^{*}$ must be contained entirely inside the households, which implies that in each household there exists exactly one node that has the minimal distance to the source. We will refer to these nodes with minimal distance to the source as the red nodes, and we color the rest of the nodes blue. In other words, the red nodes will be the first ones in their households to be infected. Let us now delete the edges between the blue nodes in $G^{*}$ to obtain graph $G^{\prime}$ . We claim that $G^{\prime}$ is isomorphic to the RB tree $G(s)$ rooted at the source $s$ . Indeed, since the edges between blue nodes have been deleted in $G^{*}$ to form $G^{\prime}$ , each blue node has $d_{c}+1$ red neighbors and no blue neighbor, and since the edges incident to red nodes have been unchanged, each red node has $d_{c}$ red and $d_{h}$ blue neighbors, exactly as in the definition of RB tree above.

Note that a household in $G^{*}$ is completely characterized by only specifying the colors of the nodes: a household always consists of one red node and of its $d_{h}$ blue children. We use this characterization as a definition for households in the RB tree $G^{\prime}$ , because it does not depend on the edges from $G$ that are deleted in $G^{*}$ , whereas this deletion makes the original definition of a household as a clique in $G$ unusable.

Next, we make some important observations the behaviour of the LS and the LS+ algorithms on RB trees, which we prove in Appendix A.1. We start by formalizing the notion of transmission path.

Definition 4.2.

Let $h$ be the first hospitalized node and $s$ be the source. We call the path $(s=v_{0},v_{1},...v_{l}=h)$ , where $v_{i}$ is the infector of $v_{i+1}$ for $0\leq i<l$ , the transmission path. Also we call the path $(v_{l},v_{l-1},...v_{1})$ the reverse transmission path.

Remark 4.2.

Note that in an RB tree, each household traversed by a transmission path shares one (the red node in the household) or two (the red node of the household and one of its $d_{h}$ children in the household) nodes with this path. Moreover, the red node of a household traversed by a transmission path is followed by another red node on the path (in another household) if it is the only node of that household on the transmission path, whereas it is followed by a blue node (in the same household) if two nodes of that household are on the transmission path.

Lemma 4.3.

In the RB tree network, the LS algorithm succeeds if and only if all nodes on the transmission path are symptomatic, and the LS+ algorithm succeeds if among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic.

Remark 4.3.

We note that the statement for LS+ in Lemma 4.3 cannot be reversed, i.e., it is possible that LS+ succeeds even if among the nodes of the transmission path, there is a household with no symptomatic node (see Figure 9 (a)). Also, the proof of Lemma 4.3 does not hold if the LS+ algorithm proceeds to the next iteration at the first time $s_{c}^{\prime}$ is updated (see Figure 9 (b)). Finally, in the proof of Lemma 4.3, we do not make any assumptions about asymptomatic patients having had the disease previously or not, which implies that we could treat non-complying agents as asymptomatic patients without jeopardizing the correctness of the algorithms.

4.1.2. The $\mathrm{DDE}_{\mathrm{NR}}$ Model

Focusing on tree networks is an important step towards making our models tractable for theoretical analysis, but it will not be enough; we will make two minor simplifications to the DDE model as well: we eliminate (i) the pre-symptomatic state and (ii) the recovered state, and we call the new model $\mathrm{DDE}_{\mathrm{NR}}$ (where NR stands for No Recovery). (i) The first assumption can be made without loss of generality, because the pre-symptomatic state does not have any effect on the disease propagation, nor on the success of the source detection algorithm. Indeed, according to Lemma 4.3, the success of the LS and LS+ algorithms depends only on the information gained about the transmission path, and by the time of the first hospitalization, every node on the transmission path must have left the pre-symptomatic state (since we always have $T_{P}<T_{E}+T_{H}$ ), even if we include it in the model. (ii) The second assumption on the absence of recovery states amounts to take $T_{I}\rightarrow\infty$ , which does have a small effect on the disease propagation, however, this effect is minimal because $T_{I}=14$ is already quite large, and because only the very early phase of the infection is interesting for computing the success probabilities of the algorithms. Finally, this last assumption has no effect on the information gained by the algorithm since we assumed that recovered patients (who were symptomatic) can remember and reveal their symptom onset time in the same way as symptomatic infectious patients.

4.1.3. Success Probability of LS

Assuming that the distribution of length of the transmission path is provided for us (we give an approximation in Section 4.2), the success probability of LS can be computed succinctly. We need a short definition before stating our result.

Definition 4.4.

Let $p$ be the probability that a node is asymptomatic conditioned on the event that it is not hospitalized.

A simple computation shows that

(5)

p=\mathbf{P}(v\text{ is asy}\mid v\text{ is not hosp})=\frac{p_{a}}{p_{a}+(1-p_{a})(1-p_{h})}.

Lemma 4.5.

For the $\mathrm{DDE}_{\mathrm{NR}}$ epidemic model with parameters $(p_{i},p_{a},p_{h})$ on the RB tree with parameters $(d_{c},d_{h})$ , and with $p$ computed in equation (5), we have

(6)

\mathbf{P}(LS\textrm{ succeeds})=\sum_{n=0}^{\infty}\left(1-p\right)^{n}\mathbf{P}(d(s,h)=n).

Proof.

Let us reveal the randomness that generates the epidemic in a slightly modified way than in the definition (Sections 3.1.2 and 4.1.2). As before, at the beginning only the source is infectious, and depending on course of the disease, the source can be symptomatic and hospitalized, symptomatic but not hospitalized, or asymptomatic with probabilities $(1-p_{a})p_{h},(1-p_{a})(1-p_{h}),p_{a}$ , respectively. In each moment, each infectious node infects each of its susceptible neighbors with probability $p_{i}$ . If a node is infected, we reveal the information whether it will become hospitalized (which happens with the probability $(1-p_{a})p_{h}$ ), but if it does not become hospitalized, we do not reveal whether the node is asymptomatic or symptomatic yet. Indeed, this information is not necessary for continuing the simulation of the epidemic since we assumed that there is no difference between the infection probabilities of symptomatic and asymptomatic nodes. Thereafter, when the first hospitalized case occurs, we reveal for each infected node $v$ on the transmission path (except the last node, which we know is hospitalised; see Definition 4.2) whether it is asymptomatic or not. The only information we have about these nodes is that they are not hospitalized, which implies that the probability that a node is revealed to be asymptomatic on the transmission path is exactly the probability $p$ from Definition 4.4 computed in (5).

By Lemma 4.3, LS succeeds if and only if each node on the transmission path is symptomatic. Conditioning on the length of the transmission path, we can compute the probability of each node being symptomatic by equation (5) as

(7)

\mathbf{P}(LS\textrm{ suceeds}|d(s,h)=n)=\left(1-\mathbf{P}(v\textrm{ is asy}|v\textrm{ is not hosp})\right)^{n}=\left(1-p\right)^{n},

from which (6) follows immediately. ∎

4.1.4. Success Probability of LS+

Computing the success probability of the LS+ algorithm is far more challenging compared to the LS algorithm, even if the distribution of the length of the transmission path is provided to us. Indeed, since the LS+ algorithm does further testing on the contacts and household members of asymptomatic nodes, it is essential to have additional information about the number of households on the transmission path. We give our main result on the LS+ in the next theorem, which we prove in Appendix A.2.

Theorem 4.6.

Let $p$ be as in (5) and let $\mathcal{S}(n,\alpha,\beta)$ be the set of $k$ integer values such that $k$ and $n$ have different parity and $n+1-2(\alpha+\beta)\geq k\geq 2-(\alpha+\beta)$ . Then, for the $\mathrm{DDE}_{\mathrm{NR}}$ epidemic model with parameters $(p_{i},p_{a},p_{h})$ on the RB tree with parameters $(d_{c},d_{h})$ , we have

	$\displaystyle\mathbf{P}(LS+\textrm{ succeeds})\geq\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+$
(8)		$\displaystyle\sum_{n=2}^{\infty}\sum_{\begin{subarray}{c}\alpha,\beta\in\{0,1\}\\ k\in\mathcal{S}(n,\alpha,\beta)\end{subarray}}\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}\frac{(d_{h}(1-p))^{\frac{n+k-1}{2}}(d_{c}(1+p))^{\frac{n-k+1}{2}-\alpha-\beta}d_{c}(d_{c}-1)^{k+\alpha+\beta-2}}{\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}}\mathbf{P}(d(s,h)=n),$

where

(9)	$\displaystyle D$	$\displaystyle=\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}$
(10)	$\displaystyle\lambda_{1}$	$\displaystyle=\frac{(d_{c}+1+D)(2d_{h}+d_{c}-1+D)}{2D(d_{c}-1+D)}$
(11)	$\displaystyle\lambda_{2}$	$\displaystyle=\frac{(D-d_{c}-1)(2d_{h}+d_{c}-1-D)}{2D(d_{c}-1-D)}.$

4.2. Approximating the Depth of the Path to the First Hospitalized Node

Section 4.1 was dedicated to the success probability of the LS and LS+ algorithms, however, in these results, we are still missing the distribution of the transmission path length. In this subsection we address this problem by introducing simpler approximate models.

4.2.1. $(d_{r},d)$ -ary Random Exponential Tree

When we introduced the $\mathrm{DDE}_{\mathrm{NR}}$ model in Section 4.1.2, we removed both parameters $T_{P}$ and $T_{I}$ from the DDE model (by removing the presymptomatic and the recovered states, respectively), but we kept the parameter $T_{E}$ . In this step we will rescale the time parameter to make $T_{E}^{\prime}=1$ by changing $p_{i}^{\prime}$ to be $1-(1-p_{i})^{T_{E}}$ . Since we had $T_{E}=3$ by default, using $T_{E}^{\prime}$ and $p_{i}^{\prime}$ instead of $T_{E}$ and $p_{i}$ means that we choose 3 days to be our time unit, and the probability of infection is scaled to be the probability that the infection is passed in at least one of three days (since the RB tree is time-independent, if two nodes are connected, the infection can spread on it every day). We drop the prime from $p_{i}^{\prime}$ and $T_{E}^{\prime}$ for ease of notation. As a second approximation, instead of keeping track of two types of nodes (red and blue) as it is done in the RB tree, we propose to change our network model to an infinite $d$ -regular tree, where $d$ is set to be the average degree of an RB tree.

By making these two changes (tracking time at a coarser scale and simplifying the network topology to a $d$ -regular tree), the growth of the epidemic becomes equivalent to a known model, the $d$ -ary Random Exponential Tree ( $d$ -RET). Binary RETs have been introduced in (Feng and Mahmoud, 2018) . We give the definition below for completeness.

Definition 4.7.

A $d$ -ary Random Exponential Tree ( $d$ -RET) with parameters $d,p_{i}$ at time day $t$ , denoted by $G_{t}(s)$ , is a random tree rooted at node $s$ . At day $0$ , the tree $G_{t}(s)$ only has its root node $s$ . Let $\bar{G}_{t}(s)$ be the closure of $G_{t}(s)$ , which is obtained by attaching external nodes to $G_{t}(s)$ until every internal node (a node that was already present in $G_{t}(s)$ ) has degree exactly $d$ in the graph $\bar{G}_{t}(s)$ . Then, $G_{t+1}(s)$ is obtained from $\bar{G}_{t}(s)$ by retaining each external node with probability $p_{i}$ , and dropping the remaining external nodes.

Indeed, each node of a $d$ -RET infects a new node with probability $p_{i}$ each day, and after a sufficiently long time, the $d$ -RET becomes close to a large $d$ -ary tree. Of course, we do not want to let the $d$ -RET grow for a very long time, we only want it to grow until the first hospitalization occurs. So far we have not talked about the course of the disease of the nodes in the $d$ -RET model because we could define the spread of the infection without it. Since we still need to do one final simplification to compute the distribution of the transmission path, we defer the discussion about hospitalizations, and how the parameters $p_{a}$ and $p_{h}$ are part of the model, to Section 4.2.2. Note that by considering the $d$ -RET, we deviate from the idea of separating the epidemic and the network models; we only have a randomly growing tree, which is stopped at some time, when the tree is still almost surely finite.

So far we only did simplifications to the model, which resulted in further and further deviations from the original version. Now we will make a small modification that brings our model back closer to the RB tree, without complicating the computations too much. We still make almost all maximum degrees of the RET uniform $d$ , but we make an exception with the root, which will have maximum degree $d_{r}=d_{c}+d_{h}$ . This makes the maximum degree of the root the same as the degree of the root of the RB tree. We call the resulting model a $(d_{r},d)$ -RET with parameter $p_{i}$ . Since the close neighborhood of the source has a high impact on the success probability, we found that this solution gives the best results while keeping the computations tractable.

In our computations, only the profile the infection tree will be important, which motivates the next definition.

Definition 4.8.

In the $(d_{r},d)$ -RET model with parameter $p_{i}$ , let $A_{t,l}$ be the number of nodes during day $t$ at level $l$ , and let $a_{t,l}=\mathbb{E}[A_{t,l}]$ . Moreover, we define the random variable

(12)

\displaystyle A_{t}=\sum_{t=0}^{+\infty}A_{t,l}

with $A_{-1,l}=0$ for all $l$ , and its expectation $a_{t}=\mathbb{E}[A_{t}]$ .

As noted earlier, the $d$ -RET model has only been analysed for $d=2$ to this date. We provide the expected number $a_{t,l}$ of nodes at level $l$ in day $t$ for the general case in the next theorem and corollary, which we prove in Appendices A.3 and A.4.

Theorem 4.9.

In the $(d_{r},d)$ -RET with parameter $p_{i}$ , let $a_{t,l}$ be as in Definition 4.8. Then

(13)	$\displaystyle a_{t,0}$	$\displaystyle=1$
(14)	$\displaystyle a_{t,l}$	$\displaystyle=d_{r}p_{i}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}\textrm{, for }t\geq l\geq 1$
(15)	$\displaystyle a_{t,l}$	$\displaystyle=0\textrm{, for l > t}.$

Corollary 4.10.

In the RET $(p_{i},d_{r},d)$ , let $a_{t}$ be the expectation of (12), as in Definition 4.8. For $t\geq 0$ ,

(16)

a_{t}=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.

4.2.2. Deterministic Exponential Tree with Parameters $p_{a},p_{h}$ and $(c_{t,l})_{t,l\in\mathbb{N}}$

In the $(d_{r},d)$ -RET model it is still complicated to calculate the distribution of the depth of the first hospitalized node. For this reason, we approximate the RET model by a deterministic time-dependent tree with a prescribed profile.

Definition 4.11.

Let $(c_{t,l})_{t\in\mathbb{N}\bigcup\{-1\},l\in\mathbb{N}}$ be a two-dimensional array with $c_{t,l}=0$ for $t\in\{-1,0\}$ and $l\in\mathbb{N}$ , except for $c_{0,0}=1$ , and with $c_{t,l}\geq c_{t,l-1}$ for any $t$ and any $l\geq 1$ . Additionally, if we define $c_{t}=\sum_{l}c_{t,l}$ , then the array $(c_{t,l})$ must satisfy $c_{t}>c_{t-1}$ for $t\geq 0$ . Then, we define the Deterministic Exponential Tree (DET) with parameter $(c_{t,l})_{t\in\mathbb{N}\bigcup\{-1\},l\in\mathbb{N}}$ , as a time-dependent rooted tree, that has exactly $c_{t,l}$ nodes on level $l$ at time $t$ . The edges between the adjacent levels are drawn arbitrarily so that the tree structure is preserved.

The formal assumptions on the array $(c_{t,l})$ are simply made to ensure that the DET starts with a single node at $t=0$ , that it never shrinks on any level ( $c_{t,l}\geq c_{t,l-1}$ ), and that it grows by at least one node in each time step ( $c_{t}>c_{t-1}$ ).

We have defined the DET at any given time $t$ , however, to determine the length of the transmission path, we are not interested in the DET at any given time, but only when the first hospitalization occurs. To compute the distribution of the first hospitalized node, we would like to have an absolute order on the times when the nodes are added, which we do by randomization. We say that on day $t$ , nodes are added one by one to the DET, their order given by a uniformly random permutation, and each node is hospitalized with probability $(1-p_{a})p_{h}$ (as in the original DDE model). When the first hospitalization occurs, we stop growing the tree, and we call the resulting (now random) model a stopped DET with parameters $(c_{t,l}),p_{a},p_{h}$ . We find the transmission path length distribution on the stopped DET in the next lemma, which we prove in Appendix A.5.

Lemma 4.12.

Let us consider the stopped DET model with parameters $(c_{t,l}),p_{a},p_{h}$ , and let $h$ denote the first hospitalized node. Then

(17)

\displaystyle\mathbf{P}(d(s,h)=l)=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).

We would like to set $c_{t,l}$ so that the DET is close to the RET described in Section 4.2.1. For equation (17) to make sense, we should substitute integer values for $c_{t,l}$ , however, for an approximation the equation can also be evaluated for fractional values as well.

Remark 4.4.

If we substitute $c_{t,l}=a_{t,l}$ and $c_{t}=a_{t}$ in equation (17), where $a_{t,l}$ is given in Theorem 4.9 and $a_{t}$ is computed in Corollary 4.10, then we get the expression

(18)

\displaystyle d_{r}p_{i}^{l-1}d^{l-1}\sum_{t=l}^{+\infty}\frac{\binom{t-1}{l-1}(1-p_{i})^{t-l}}{(1-p_{i}+dp_{i})^{t-1}}(1-(1-p_{a})p_{h})^{1+d_{r}\frac{(1-p_{i}+dp_{i})^{t-1}-1}{d-1}}\left(1-(1-(1-p_{a})p_{h})^{d_{r}(1-p_{i}+dp_{i})^{t-1})}\right),

which approximates the distribution of the transmission path length in the $(d_{r},d)$ -ary RET stopped at the first hospitalization.

5. Simulation Results

5.1. Baseline Algorithms

5.1.1. Non-adaptive Baseline: Dynamic Message Passing

There are few sensor-based source detection algorithms that are compatible with time-varying networks in the literature (Huang, 2017; Jiang et al., 2016; Fan et al., 2020). The most promising one among these algorithms (Jiang et al., 2016) has a close resemblance to the a previous work of (Lokhov et al., 2014) on Dynamic Message Passing (DMP) algorithms. Given the initial conditions on the identity of the source node and its time of infection, the DMP algorithm approximates the marginal distribution of the outcome of an epidemic at some later time $t$ . The algorithm is exact on tree networks, and it computes a good approximation when there are not too many short cycles in the network. Therefore, the DMP algorithm can be used to approximate the likelihood of the observed symptom onset times for any (source,time) pair. Due to its flexibility, we were able to adapt the DMP algorithm to the SDCTF (see Appendix B for more details).

Originally, the DMP was applied to the source detection problem by computing the likelihood values for all possible (source,time) pairs, and then choosing the source node from the most likely pair as the estimate (Lokhov et al., 2014). However, testing all (source,time) pairs increases the time complexity of the algorithms potentially by a factor of $N^{2}$ , which makes the algorithm intractable in many applications. Jiang et. al. (Jiang et al., 2016) proposed a very similar algorithm to the DMP equations (which is unfortunately not exact even on trees), and solved the issue of intractability by a heuristic preprocessing step to the DMP algorithm. This preprocessing step, identifies a few candidate (source,time) pairs, by spreading the disease backward from the observations in a deterministic way (called reverse dissemination). Since we already approximate our data-driven model (DCS) by an epidemic model with deterministic transition times (DDE), it is natural for us to also implement the deterministic preprocessing step proposed by (Jiang et al., 2016). We produce 5 (source,time) pairs which are feasible for the 5 earliest symptom onset time observations (see Appendix B.3 for more details). It would have been ideal to run the algorithms for more than 5 pairs, but this was made impossible by the runtimes becoming very high. We run therefore our implementation of the DMP algorithm with the previously computed feasible (source,time) pairs as initial conditions to find the most likely source candidate.

The source estimation algorithms developed using the DMP algorithm do not specify how the sensors should be selected, and therefore place these non-adaptive sensors randomly. We refer to the resulting algorithm as random+DMP. The number of sensors is set so that it always exceeds the number that LS/LS+ would use. The simulation results are shown in Figure 6 for the DDE+HNM model. Importantly, the deterministic preprocessing step of (Jiang et al., 2016) is compatible with time-varying networks, which allows us to run the algorithm for the DCS+TU model as well (see Figure 8).

5.1.2. Adaptive Baseline: Size-Gain

The Size-Gain (SG) algorithm was developed for epidemics which spread deterministically (Zejnilović et al., 2015), and has been later extended to stochastic epidemics (Spinelli et al., 2017b). It works by narrowing a candidate set based on a deterministic constraint. If $v_{1},v_{2}$ are symptomatic observations, then $s_{c}$ is in the candidate set of SG if and only if

(19)

|(t_{v_{2}}-t_{v_{1}})-(d(v_{2},s_{c})-d(v_{1},s_{c}))|<\sigma(d(v_{2},s_{c})+d(v_{1},s_{c})),

where $\sigma$ is the standard deviation of the infection time of a susceptible contact. If one of the observations, say $v_{2}$ , is negative, then SG uses a condition almost identical to equation (19), except that the absolute value is dropped, since a negative observation at time $t_{v_{2}}$ is only a lower bound on the true symptom onset time of $v_{2}$ . These deterministic conditions are checked for every symptomatic-symptomatic or symptomatic-negative pair $(v_{1},v_{2})$ to determine if $s_{c}$ can be part of the candidate set. Next, SG places the next sensor adaptively at the node which reduces the candidate set by the largest amount in expectation (assuming a uniform prior on the source and its infection time), and it terminates when the candidate set shrinks to a single node. Note that the SG algorithm can fail if at least one of the deterministic conditions in equation (19) is violated for some $(v_{1},v_{2})$ because of the randomness of the epidemic.

We use the existing implementation of the SG algorithm by (Spinelli et al., 2017b), and adapt it to the SDCTF. We incorporate asymptomatic-symptomatic and asymptomatic-negative observations $(v_{1},v_{2})$ the same way as symptomatic-negative are incorporated; we drop the absolute value sign in equation (19), because an asymptomatic observation at time $t_{v_{1}}$ is only an upper bound on the true symptom onset time of $v_{1}$ . We impose the same daily limit to the number of sensors that can be placed by the SG algorithm in a single day as for the LS/LS+ algorithm, and if the candidate set size does not shrink to one on the day when both LS and LS+ have already provided their estimates, then the SG algorithm must make a uniformly random choice from the current candidate set as its source estimate. The simulation results are shown in Figure 6 for the DDE+HNM model. We do not implement the SG algorithm for the DCS+TU model, because its runtime is too high, and because it is not clear how it should be implemented for time-varying networks.

5.2. Comparison with Baselines

We show our simulation results comparing the random+DMP, SG, LS and LS+ algorithms in Figure 6. In the first row of Figure 6, we show the accuracy of the algorithms with solid curves. Since the LS/LS+ algorithms cannot detect an asymptomatic source, we also show what the accuracy would look like if the goal of the SDCTF was to detect the first symptomatic agent with dashed lines. It is clear that in both metrics and across a wide range of parameters, the LS+ algorithm performs best, followed by LS, next random+DMP, and finally SG. The only exception is for high values of $p_{i}$ , where SG performs best. The good performance of SG for these parameters is expected, because SG was originally developed for deterministically spreading epidemics (i.e., $p_{i}=1$ ).

In the second row of Figure 6, we show the number of test/sensor queries used by the algorithms. LS uses the fewest tests, followed by LS+. The random+DMP and SG algorithms always use more tests than LS/LS+, except for large values of $p_{i}$ . Finally, in the last row of Figure 6 we show the number of contact (or in this case edge) queries used by the algorithms. Again, LS uses fewer queries than LS+, while both the random+DMP and SG algorithms query essentially the entire network.

Figure 6 shows that the LS/LS+ algorithms are fairly robust to changes in the parameters of the model, except for the parameter $p_{a}$ . Indeed, if there are many asymptomatic nodes in the network, then source detection becomes very challenging. It may be surprising that as $p_{a}$ grows, the number of tests that LS uses decreases, contrary to LS+. This is because as $p_{a}$ grows, the LS algorithm gets stuck more rapidly, while the LS+ algorithm compensates for the presence of asymptomatic nodes by using more test/sensor queries.

5.3. Comparison of Simulations and Theoretical Results

The analytic results from Section 4 are in good agreement with the simulation results in Figure 7. We also experiment with changing the parameters $d_{h}$ , $d_{c}$ while keeping all the parameters fixed, and with changing $d_{c}$ while keeping the product $d_{c}p_{i}$ fixed. We observe that LS is not affected by the parameter $d_{h}$ , whereas LS+ performs better with a higher $d_{c}$ , which is expected because LS+ leverages the household structure of the network to improve over LS. Somewhat surprisingly, we also observe that a higher $d_{c}$ also improves the performance of both algorithms. This can be explained by the fact that a larger $d_{c}$ implies that there are more nodes in the close neighborhood of the source, which results in shorter transmission paths, making source detection less challenging. Finally, if we increase $d_{c}$ but keep $d_{c}p_{i}$ fixed, the performance of the algorithms does not change as much, which confirms the intuition that it is the number of infections caused by an infectious node in a single day that matters the most (as we discussed in Section 2).

5.4. Simulations on the DCS Model

We show our simulation results on our most realistic DCS+TU model in Figure 8. We make very similar observations on this model as the ones that we have made on the DDE+HNM model in Sections 5.2 and 5.3, which shows that the LS/LS+ algorithms and our analysis of their performance is robust to changes in the epidemic and network models.

In the DCS+TU model, we used a fixed limit on the number of sensors that the random+DMP model selects, instead of setting the limit based on the LS+ algorithm. As a result, for a few parameters the LS+ algorithm used more tests than the random+DMP model. However, we note that by updating the candidate node immediately after an earlier symptom onset time is revealed (see Section 3.5), we can essentially cut the number of required tests for the LS+ algorithm by half (LSv2 and LS+v2), without sacrificing the performance of the algorithms.

6. Discussion

We have introduced the LS and LS+ algorithms in the SDCTF, and we have used a sequence of models on which we can compute their accuracy (probability of finding the correct source) rigorously. We find that both LS and LS+ outperform baseline algorithms, even though the baselines essentially query all contacts on a transmission path between agents, while LS and LS+ query only a small neighborhood of the source. One could argue that LS and LS+ beat the baseline algorithms only because we benchmark them in our own framework, which is different from the framework for which the baseline algorithms were developed. However, we argue that the LS/LS+ algorithms are robust to changes in the framework due to their simplicity, and we support our argument by simulation results. The runtimes of the LS/LS+ algorithms are also much lower than the baselines and do not depend on the network size since they are local algorithms - as opposed to the baselines, which have quadratic or even larger dependence on the network size. The “low-tech” approach in the design of the LS/LS+ algorithms increases their potential to be implemented in real-world scenarios, possibly even in a decentralized way, similarly to contact tracing smart phone applications (Troncoso et al., 2020), which is an interesting direction for future work.

References

(1)
Altarelli et al. (2014) Fabrizio Altarelli, Alfredo Braunstein, Luca Dall’Asta, Alessandro Ingrosso, and Riccardo Zecchina. 2014. The patient-zero problem with noisy observations. Journal of Statistical Mechanics: Theory and Experiment 2014, 10 (2014), P10016.
Ball et al. (2009) Frank Ball, David Sirl, and Pieter Trapman. 2009. Threshold behaviour and final outcome of an epidemic on a random network with household structure. Advances in Applied Probability 41, 3 (2009), 765–796.
BeidasRinad et al. (2020) S BeidasRinad, M ButtenheimAlison, S KilaruAustin, A AschDavid, G VolppKevin, G LawmanHannah, C CannuscioCarolyn, et al. 2020. Optimizing and implementing contact tracing through behavioral economics. NEJM Catalyst Innovations in Care Delivery (2020).
Benjamini and Schramm (2011) Itai Benjamini and Oded Schramm. 2011. Recurrence of distributional limits of finite planar graphs. In Selected Works of Oded Schramm. Springer, 533–545.
Bradshaw et al. (2021) William J Bradshaw, Ethan C Alley, Jonathan H Huggins, Alun L Lloyd, and Kevin M Esvelt. 2021. Bidirectional contact tracing could dramatically improve COVID-19 control. Nature communications 12, 1 (2021), 1–9.
Braithwaite et al. (2020) Isobel Braithwaite, Thomas Callender, Miriam Bullock, and Robert W Aldridge. 2020. Automated and partly automated contact tracing: a systematic review to inform the control of COVID-19. The Lancet Digital Health (2020).
Dawkins et al. (2021) Quinlan Dawkins, Tianxi Li, and Haifeng Xu. 2021. Diffusion Source Identification on Networks with Statistical Confidence. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139). PMLR, 2500–2509.
Drmota (2009) Michael Drmota. 2009. Random trees: an interplay between combinatorics and probability. Springer Science & Business Media.
Endo et al. (2020) Akira Endo et al. 2020. Implication of backward contact tracing in the presence of overdispersed transmission in COVID-19 outbreaks. Wellcome open research 5 (2020).
Fan et al. (2020) Lilin Fan, Bingjie Li, Dong Liu, Huanhuan Dai, and Yan Ru. 2020. Identifying Propagation Source in Temporal Networks Based on Label Propagation. In International Conference of Pioneering Computer Scientists, Engineers and Educators. Springer, 72–88.
Feng and Mahmoud (2018) Yarong Feng and Hosam Mahmoud. 2018. Profile of random exponential binary trees. Methodology and Computing in Applied Probability 20, 2 (2018), 575–587.
Hernando et al. (2008) Carmen Hernando, Mercé Mora, Peter J Slater, and David R Wood. 2008. Fault-tolerant metric dimension of graphs. Convexity in discrete structures 5 (2008), 81–85.
Hu et al. (2018) Zhao-Long Hu, Zhesi Shen, Chang-Bing Tang, Bin-Bin Xie, and Jian-Feng Lu. 2018. Localization of diffusion sources in complex networks with sparse observations. Physics Letters A 382, 14 (2018), 931–937.
Huang (2017) Qiangjuan Huang. 2017. Source locating of spreading dynamics in temporal networks. In Proceedings of the 26th International Conference on World Wide Web Companion. 723–727.
Jagers and Nerman (1984) Peter Jagers and Olle Nerman. 1984. The growth and composition of branching populations. Advances in applied probability 16, 2 (1984), 221–259.
Jiang et al. (2016) Jiaojiao Jiang, Sheng Wen, Shui Yu, Yang Xiang, and Wanlei Zhou. 2016. Rumor source identification in social networks with time-varying topology. IEEE Transactions on Dependable and Secure Computing 15, 1 (2016), 166–179.
Karrer and Newman (2010) Brian Karrer and M. E. J. Newman. 2010. Message passing approach for general epidemic models. Phys. Rev. E 82 (Jul 2010), 016101. Issue 1. https://doi.org/10.1103/PhysRevE.82.016101
Kojaku et al. (2021) Sadamori Kojaku, Laurent Hébert-Dufresne, Enys Mones, Sune Lehmann, and Yong-Yeol Ahn. 2021. The effectiveness of backward contact tracing in networks. Nature Physics (2021), 1–7.
Kretzschmar et al. (2020) Mirjam E Kretzschmar, Ganna Rozhnova, Martin CJ Bootsma, Michiel van Boven, Janneke HHM van de Wijgert, and Marc JM Bonten. 2020. Impact of delays on effectiveness of contact tracing strategies for COVID-19: a modelling study. The Lancet Public Health 5, 8 (2020), e452–e459.
Lecomte et al. (2020) Victor Lecomte, Gergely Ódor, and Patrick Thiran. 2020. The power of adaptivity in source location on the path. arXiv preprint arXiv:2002.07336 (2020).
Li et al. (2019) Xiang Li, Xiaojie Wang, Chengli Zhao, Xue Zhang, and Dongyun Yi. 2019. Locating the source of diffusion in complex networks via Gaussian-based localization and deduction. Applied Sciences 9, 18 (2019), 3758.
Lokhov et al. (2014) Andrey Y Lokhov, Marc Mézard, Hiroki Ohta, and Lenka Zdeborová. 2014. Inferring the origin of an epidemic with a dynamic message-passing algorithm. Physical Review E 90, 1 (2014), 012801.
Lorch et al. (2020) Lars Lorch, Heiner Kremer, William Trouleau, Stratis Tsirtsis, Aron Szanto, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. 2020. Quantifying the effects of contact tracing, testing, and containment measures in the presence of infection hotspots. arXiv preprint arXiv:2004.07641 (2020).
Louni et al. (2015) Alireza Louni, Anand Santhanakrishnan, and KP Subbalakshmi. 2015. Identification of source of rumors in social networks with incomplete information. arXiv preprint arXiv:1509.00557 (2015).
Mahmoud (2021) Hosam Mahmoud. 2021. Profile of Random Exponential Recursive Trees. Methodology and Computing in Applied Probability (2021), 1–17.
Mashkaria et al. (2020) Satvik Mashkaria, Gergely Ódor, and Patrick Thiran. 2020. On the robustness of the metric dimension to adding a single edge. arXiv preprint arXiv:2010.11023 (2020).
Paluch et al. (2020a) Robert Paluch, Łukasz G Gajewski, Janusz A Hołyst, and Boleslaw K Szymanski. 2020a. Optimizing sensors placement in complex networks for localization of hidden signal source: A review. Future Generation Computer Systems 112 (2020), 1070–1092.
Paluch et al. (2018) Robert Paluch, Xiaoyan Lu, Krzysztof Suchecki, Bolesław K Szymański, and Janusz A Hołyst. 2018. Fast and accurate detection of spread source in large complex networks. Scientific reports 8, 1 (2018), 1–10.
Paluch et al. (2020b) Robert Paluch, Krzysztof Suchecki, and Janusz A Hołyst. 2020b. Locating the source of interacting signal in complex networks. arXiv preprint arXiv:2012.02039 (2020).
Park et al. (2020) Young Joon Park, Young June Choe, Ok Park, Shin Young Park, Young-Man Kim, Jieun Kim, Sanghui Kweon, Yeonhee Woo, Jin Gwack, Seong Sun Kim, et al. 2020. Contact tracing during coronavirus disease outbreak, South Korea, 2020. Emerging infectious diseases 26, 10 (2020), 2465.
Pinto et al. (2012) P. Pinto, P. Thiran, and M. Vetterli. 2012. Locating the source of diffusion in large-scale networks. Physical Review Letters 109 (2012).
Shah and Zaman (2010) Devavrat Shah and Tauhid Zaman. 2010. Detecting Sources of Computer Viruses in Networks: Theory and Experiment. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, New York, USA) (SIGMETRICS ’10). ACM, New York, NY, USA, 203–214.
Shah and Zaman (2011) Devavrat Shah and Tauhid Zaman. 2011. Rumors in a network: Who’s the culprit? IEEE Transactions on information theory 57, 8 (2011), 5163–5181.
Shelke and Attar (2019) Sushila Shelke and Vahida Attar. 2019. Source detection of rumor in social network–a review. Online Social Networks and Media 9 (2019), 30–42.
Shen et al. (2016) Zhesi Shen, Shinan Cao, Wen-Xu Wang, Zengru Di, and H Eugene Stanley. 2016. Locating the source of diffusion in complex networks by time-reversal backward spreading. Physical Review E 93, 3 (2016), 032301.
Spinelli et al. (2017a) Brunella Spinelli, Elisa Celis, and Patrick Thiran. 2017a. A General Framework for Sensor Placement in Source Localization. IEEE Transactions on Network Science and Engineering (2017).
Spinelli et al. (2017b) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2017b. Back to the source: An online approach for sensor placement and source localization. In Proceedings of the 26th International Conference on World Wide Web. 1151–1160.
Spinelli et al. (2017c) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2017c. The effect of transmission variance on observer placement for source-localization. Applied network science 2, 1 (2017), 20.
Spinelli et al. (2018) Brunella Spinelli, L Elisa Celis, and Patrick Thiran. 2018. How Many Sensors to Localize the Source? The Double Metric Dimension of Random Networks. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 1036–1043.
Tang et al. (2018) Wenchang Tang, Feng Ji, and Wee Peng Tay. 2018. Estimating infection sources in networks using partial timestamps. IEEE Transactions on Information Forensics and Security 13, 12 (2018), 3035–3049.
Troncoso et al. (2020) Carmela Troncoso, Mathias Payer, Jean-Pierre Hubaux, Marcel Salathé, James Larus, Edouard Bugnion, Wouter Lueks, Theresa Stadler, Apostolos Pyrgelis, Daniele Antonioli, et al. 2020. Decentralized privacy-preserving proximity tracing. arXiv preprint arXiv:2005.12273 (2020).
Wormald et al. (1999) Nicholas C Wormald et al. 1999. Models of random regular graphs. London Mathematical Society Lecture Note Series (1999), 239–298.
Xu et al. (2019) Shuaishuai Xu, Cong Teng, Yinzuo Zhou, Junhao Peng, Yicheng Zhang, and Zi-Ke Zhang. 2019. Identifying the diffusion source in complex networks with limited observers. Physica A: Statistical Mechanics and its Applications 527 (2019), 121267.
Zejnilovic et al. (2013) Sabina Zejnilovic, Joao Gomes, and Bruno Sinopoli. 2013. Network observability and localization of the source of diffusion based on a subset of nodes. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on. IEEE, 847–852.
Zejnilović et al. (2015) Sabina Zejnilović, João Gomes, and Bruno Sinopoli. 2015. Sequential observer selection for source localization. In Signal and Information Processing (GlobalSIP), 2015 IEEE Global Conference on. IEEE, 1220–1224.
Zejnilović et al. (2017) Sabina Zejnilović, Joño Gomes, and Bruno Sinopoli. 2017. Sequential source localization on graphs: A case study of cholera outbreak. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1010–1014.
Zejnilović et al. (2016) Sabina Zejnilović, Dieter Mitsche, João Gomes, and Bruno Sinopoli. 2016. Extending the metric dimension to graphs with missing edges. Theoretical Computer Science 609 (2016), 384–394.
Zhu et al. (2016) Kai Zhu, Zhen Chen, and Lei Ying. 2016. Locating the contagion source in networks with partial timestamps. Data Mining and Knowledge Discovery 30, 5 (2016), 1217–1248.

Appendix A Additional Proofs

A.1. Proof of Lemma 4.3

We start by restating the lemma here for convenience.

Lemma A.1.

Proof.

Throughout the proof we assume that there is no limitation on the number available tests. We can make this assumption because in the SDCTF there is only a daily limit on the number tests, there is no limitation on the number of days, and neither the LS nor the LS+ algorithms proceed in an iteration until the test queue becomes empty, which implies that all nodes that enter the test queue get eventually tested.

Suppose that the LS algorithm succeeds. Then the list of candidate nodes $s_{c}$ at different iterations forms a path that consists entirely of symptomatic nodes between the source and the first hospitalized node. In tree networks, the transmission path is the only path between the source and the first hospitalized node, which yields the ”only if” part of the statement on the LS algorithm.

Next, suppose that all nodes on the transmission path are symptomatic. Then, we claim that the candidate node $s_{c,i}$ computed in the $i^{th}$ iteration of the LS algorithm is $v_{l-i}$ , the $i^{th}$ node of the reverse transmission path. Our claim is definitely true for $i=0$ , because $s_{c,0}$ is initialized to be the first hospitalized node $v_{l}$ . Then, the proof proceeds by induction. By the induction hypothesis, in the $i^{th}$ step, $s_{c,i}=v_{l-i}$ , and since we are on a tree, the symptom onset time of $v_{l-(i+1)}$ (which is revealed because all nodes on the transmission path are symptomatic by assumption) is the only symptom onset time among the neighbors of $s_{c,i}$ that have a lower symptom onset time than $s_{c,i}$ itself. Therefore $s_{c}^{\prime}=v_{l-(i+1)}$ , and $s_{c,i+1}$ is updated to be $v_{l-(i+1)}$ in the beginning of the next iteration, which proves that the induction hypothesis holds until the source is reached.

Finally, suppose that among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic. Let us denote by $w_{i}$ the $i^{th}$ symptomatic node of the reverse transmission path. Then, we claim that the candidate list $s_{c,i}$ computed in the $i^{th}$ iteration of the LS+ algorithm equals $w_{i}$ . Similarly to the case of the LS algorithm, the $i=0$ case holds by definition, and we proceed by induction. Suppose that $s_{c,i}=w_{i}$ . It will also be useful to define the index of $w_{i}$ on the forward transmission path (without skipping asymptomatic nodes). Let $j$ be this index, for which therefore $w_{i}=v_{j}$ . Now we distinguish 3 cases: (i) $v_{j-1}=w_{i+1}$ is symptomatic, (ii) $v_{j-1}$ is asymptomatic and $v_{j-2}=w_{i+1}$ is symptomatic, and (iii) $v_{j-1}$ and $v_{j-2}$ are asymptomatic and $v_{j-3}=w_{i+1}$ is symptomatic. We claim that there are no more cases, and that in all three cases $w_{i+1}$ is tested in the $i^{th}$ iteration of the LS+ algorithm. Case (i) is immediate because all neighbors of $s_{c,i}$ are tested. Case (ii) is only possible if either $v_{j-1}\in H(s_{c,i})$ or $v_{j-2}\in H(v_{j-1})$ , otherwise $v_{j-1}$ would be a lone asymptomatic node in a household, which contradicts the assumption that there is a symptomatic node in each household. Since all the contacts of asymptomatic nodes in $H(s_{c,i})$ (see Figure 4 (d)) and all nodes in the household of asymptomatic nodes are tested in the LS+ algorithm (see Figure 4 (e)), $v_{j-2}$ must be tested too. Finally, case (iii) is possible only if $v_{j-1}\in H(s_{c,i})$ and $v_{j-3}\in H(v_{j-2})$ both hold, otherwise $v_{j-1}$ or $v_{j-2}$ would be a lone asymptomatic node in a household. Similarly to the previous case, $v_{j-3}$ must be tested (see Figure 4 (f)). There are no more cases because, by Remark 4.2, on the RB tree a transmission path can only have two nodes in each household, and we assumed that there exists a symptomatic node in each household among the nodes of the transmission path.

After we proved that $w_{i+1}$ is tested in the $i^{th}$ iteration of the LS+ algorithm, we must still show that it will be the next candidate $s_{c,i+1}$ for the induction hypothesis to hold. This is true because once the symptom onset time of $w_{i+1}$ is revealed, none of its neighbors are scheduled for testing, and therefore all tested nodes have $w_{i+1}$ on their path to the source, which means that $w_{i+1}$ must have the lowest revealed symptom onset time, and therefore that it will be the next candidate $s_{c,i+1}$ . ∎

A.2. Proof of Theorem 4.6

We are going to need prove a few intermediate results before proving Theorem 4.6. A first step is to count all the possible paths from the source with a given length.

Definition A.2.

Let $G(s)$ be the RB tree with parameters $(d_{c},d_{h})$ , and let $s$ be the source. A Red-Blue (RB) path of length $n$ is any path of nodes in $(s=v_{0},v_{1},...v_{n})$ such that $(v_{i},v_{i+1})\in E^{\prime}$ for $0\leq i<n$ . Let $\mathcal{C}_{n}$ be the set of RB paths of length $n$ .

Lemma A.3.

In the RB tree with parameters $(d_{c},d_{h})$ , $|C_{0}|=1$ , while for $n\geq 1$ ,

(20)

|\mathcal{C}_{n}|=\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}

where

(21)	$\displaystyle D$	$\displaystyle=\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}$
(22)	$\displaystyle\lambda_{1}$	$\displaystyle=\frac{(d_{c}+1+D)(2d_{h}+d_{c}-1+D)}{2D(d_{c}-1+D)}$
(23)	$\displaystyle\lambda_{2}$	$\displaystyle=\frac{(D-d_{c}-1)(2d_{h}+d_{c}-1-D)}{2D(d_{c}-1-D)}.$

Proof.

Let us keep track of the number of RB paths of length $n$ depending on the color of the last node in the path. Let $r_{n}$ and $b_{n}$ be the numbers of RB paths of length $n$ such that the last node is red and blue, respectively. A RB path of length 0 consists only of the source, which implies that $r_{0}=1$ and $b_{0}=0$ . The source has $d_{c}$ red and $d_{h}$ blue neighbours, which implies that $r_{1}=d_{c}$ and $b_{1}=d_{h}$ .

Suppose that $P$ is an RB path of length $n\geq 2$ . If the last node of $P$ is red, then the node before the last node can be both blue or red. Red nodes other than the source have $d_{c}-1$ red children, while blue nodes have $d_{c}$ red children, yielding

(24)

\displaystyle r_{n}=(d_{c}-1)r_{n-1}+d_{c}b_{n-1},\textrm{ for }n\geq 2.

If the last node of $P$ is blue, then the node before has to be red. Since every red node, including the source, has $d_{h}$ blue children, we have

(25)

\displaystyle b_{n}=d_{h}r_{n-1},\textrm{ for }n\geq 1.

By substituting equation (25) into equation (24), we obtain the recurrence

(26)

\displaystyle r_{n}=(d_{c}-1)r_{n-1}+d_{c}d_{h}r_{n-2},\textrm{ for }n\geq 2.

We solve this recurrence equation by calculating the characteristic equation

(27)

t^{2}-(d_{c}-1)t-d_{c}d_{h}=0,

whose roots are

(28)		$\displaystyle t_{1}$	$\displaystyle=\frac{d_{c}-1+\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}{2}=\frac{d_{c}-1+D}{2}$
(29)		$\displaystyle t_{2}$	$\displaystyle=\frac{d_{c}-1-\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}{2}=\frac{d_{c}-1-D}{2}$

yielding the the general solution

(30)

r_{n}=c_{1}t_{1}^{n}+c_{2}t_{2}^{n},

where $c_{1},c_{2}$ are given by the initial conditions for $n=0,1$

(31)		$\displaystyle c_{1}+c_{2}=r_{0}=1$
(32)		$\displaystyle c_{1}t_{1}+c_{2}t_{2}=r_{1}=d_{c},$

which are

(33)		$\displaystyle c_{1}$	$\displaystyle=\frac{1}{2}+\frac{d_{c}+1}{2\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}=\frac{1}{2}+\frac{d_{c}+1}{2D}$
(34)		$\displaystyle c_{2}$	$\displaystyle=\frac{1}{2}-\frac{d_{c}+1}{2\sqrt{(d_{c}-1)^{2}+4d_{c}d_{h}}}=\frac{1}{2}-\frac{d_{c}+1}{2D}.$

From equations (24) and (25) we conclude that for $n\geq 1$ ,

(35)

b_{n}=d_{h}(c_{1}t_{1}^{n-1}+c_{2}t_{2}^{n-1})

and therefore

(36)

\displaystyle|C_{n}|=r_{n}+b_{n}=\lambda_{1}t_{1}^{n}+\lambda_{2}t_{2}^{n},

where

(37)		$\displaystyle\lambda_{1}$	$\displaystyle=c_{1}\left(1+\frac{d_{h}}{t_{1}}\right)$
(38)		$\displaystyle\lambda_{2}$	$\displaystyle=c_{2}\left(1+\frac{d_{h}}{t_{2}}\right).$

Inserting the values for $t_{1},t_{2},c_{1},c_{2}$ we obtain the desired result. ∎

Since LS+ improves on LS by making use of the household structure of the network, we need further information about the household structure of the transmission paths. Recall that by Remark 4.2, households on transmission paths on an RB tree were characterized either by a single red node (that is followed by a red node), or a pair of consecutive red and blue nodes. The following definition and lemma refine our previous result on counting the number of RB paths by taking the household structure into account.

Definition A.4.

Let $P=\{s=v_{0},v_{1},\ldots,v_{n}=h\}$ be a RB path of length $n$ . We say that a node $v$ on the path $P$ is in a $P$ -single-household if no other node from $P$ is in the same household as $v$ . Otherwise, we say $v$ is in a $P$ -multi-household. Given a path $P$ , let $M_{s}:\mathcal{C}_{n}\rightarrow\{0,1\}$ be the indicator function that the source is in a $P$ -multi-household. Similarly, let $M_{l}:\mathcal{C}_{n}\rightarrow\{0,1\}$ be the indicator function that the last node of path $P$ is in a $P$ -multi-household. Finally, for $0\leq k\leq n+1$ and $\alpha,\beta\in\{0,1\}$ , let

(39)

C_{n,k,\alpha,\beta}=\{P\in\mathcal{C}_{n}:(\textrm{there are exactly }k\textrm{ nodes in }P-\textrm{single-households})\wedge(M_{s}(P)=\alpha)\wedge(M_{l}(P)=\beta)\}.

The set $C_{n,k,\alpha,\beta}$ depends on 4 parameters, but only some combinations of these parameters make it non-empty. The following definition will be useful in this regard.

Condition 1.

Let $\alpha,\beta\in\{0,1\}$ and $n\geq 2$ . We say $k\in\mathbb{N}$ satisfies Condition 1 if and only if $k$ and $n$ have different parity and $n+1-2(\alpha+\beta)\geq k\geq 2-(\alpha+\beta)$ .

Lemma A.5.

It holds that $|C_{0,1,0,0}|=1$ , $|C_{1,0,1,1}|=d_{h}$ and $|C_{1,2,0,0}|=d_{c}$ . Let $\alpha,\beta\in\{0,1\}$ , let $n\geq 2$ and let $k\in\mathbb{N}$ satisfy Condition 1. Then

(40)

|\mathcal{C}_{n,k,\alpha,\beta}|=\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}d_{h}^{\frac{n-k+1}{2}}d_{c}^{\frac{n-k+3}{2}-\beta-\alpha}(d_{c}-1)^{k+\alpha+\beta-2}.

In all other cases $|C_{n,k,\alpha,\beta}|=0$ .

Proof.

Since there are $n+1$ nodes on path $P$ , with $k$ in $P$ -single households and thus $n+1-k$ of them in $P$ -multi-households, we must have

k+\frac{n+1-k}{2}=\frac{n+k+1}{2}

households along path $P$ in total. Clearly, the numbers $n$ and $k$ cannot be of the same parity for any RB path $P$ , which is thus assumed for the rest of the proof (this assumption is also part of Condition 1).

If $n=0$ , then the source is also the first hospitalized node, and it is in a $P$ -single-household, which implies that $|C_{0,1,0,0}|=1$ . If $n=1$ , then there are two cases: either the source is in the same $P$ -multi-household with the first hospitalized node, or both of them are in $P$ -single-households. The former case is possible via $d_{h}$ edges from the source, which gives $|C_{1,0,1,1}|=d_{h}$ , while the latter case is possible via $d_{c}$ edges, and gives $|C_{1,0,1,1}|=d_{c}$ . Since these are the only possible RB paths of length $n\leq 1$ , we must have $|C_{0,k,\alpha,\beta}|=|C_{1,k,\alpha,\beta}|=0$ for any other choice of parameters $k,\alpha$ and $\beta$ .

Let us assume that $n\geq 2$ . Then, the source and the first hospitalized node are not in the same household. Let us denote the household of the source by $H_{s}$ and the household of the first hospitalized node by $H_{h}$ . Note that $(1-\alpha)$ and $(1-\beta)$ are the indicators of $H_{s}$ and $H_{h}$ being $P$ -single-households, and therefore $k\geq(1-\alpha)+(1-\beta)$ . If this inequality (which is also part of Condition 1) does not hold, then clearly $|\mathcal{C}_{n,k,\alpha,\beta}|=0$ . Similarly, the number of $P$ -multi-households is $\frac{n-k+1}{2}$ and we must have $\frac{n-k+1}{2}\geq\alpha+\beta$ for $|\mathcal{C}_{n,k,\alpha,\beta}|>0$ , which implies the inequality $n+1-2\alpha-2\beta\geq k$ . Therefore $\mathcal{C}_{n,k,\alpha,\beta}$ is empty if Condition 1 does not hold. For the rest of the proof we assume that Condition 1 does hold.

There are $\frac{n+k-3}{2}$ households along path $P$ , excluding $H_{s}$ and $H_{h}$ . Among them, there are $k-(1-\alpha)-(1-\beta)$ $P$ -single-households, which can be chosen in $\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}$ ways. Once we know the color of each node along the path, the number of RB paths can be computed by multiplying the numbers of children with the appropriate color of each node. $P$ -single-households have no blue nodes, and $P$ -multi-households have exactly one, which implies that there are $\frac{n-k+1}{2}$ blue nodes. Since blue nodes are preceded by red nodes that have $d_{h}$ blue children, they give the multiplicative factor $d_{h}^{\frac{n-k+1}{2}}$ . Blue nodes, except from the first hospitalized node (if it is blue), have $d_{c}$ red children. So far we have accounted for all of the nodes in $P$ -multi-households and none of the nodes in $P$ -single-households. If the source is in a $P$ -single-household, then we must count its red children, whose number is $d_{c}$ . This implies that there exist $\frac{n-k+1}{2}-\beta+(1-\alpha)$ nodes with $d_{c}$ red children. Finally, each $P$ -single-household, except $H_{s}$ and/or $H_{h}$ in case they are $P$ -single households, has $d_{c}-1$ red children. There are $k-(1-\alpha)-(1-\beta)$ such $P$ -single-households, which gives the final term in equation (40). ∎

The sets $\mathcal{C}_{n,k,\alpha,\beta}$ define equivalence classes on the transmission paths based on their household structure. In the next lemma we show that once we know which equivalence class we are in, it is possible compute the success probability of the LS+ algorithm.

Lemma A.6.

Let $P$ be the transmission path in the $\mathrm{DDE}_{\mathrm{NR}}$ epidemic model with parameters $(p_{i},p_{a},p_{h})$ on the RB tree with parameters $(d_{c},d_{h})$ , and let $p$ be as computed in (5). Then, it holds that

\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{0,1,0,0})=1

and

\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{1,0,1,1})=\mathbf{P}(LS+\textrm{ succeeds}|\mathcal{C}_{1,2,0,0})=1-p.

Let $\alpha,\beta\in\{0,1\}$ , let $n\geq 2$ and let $k\in\mathbb{N}$ satisfy Condition 1. Then, it holds that

(41)

\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta})\geq(1-p)^{\frac{n+k-1}{2}}(1+p)^{\frac{n-k+1}{2}-\alpha-\beta}.

In all other cases $\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta})$ is not defined.

Proof.

If $n=0$ , then $k=1$ and $\alpha=\beta=0$ . In that case, the source is the first hospitalized node and LS+ always succeeds. If $n=1$ , then the first hospitalized node is in the neighbourhood of the source, and LS+ succeeds if and only if the source is symptomatic, which happens with probability $1-p$ .

Let us assume that $n\geq 2$ and that $k$ satisfies Condition 1 (otherwise $|\mathcal{C}_{n,k,\alpha,\beta}|=0$ and $\mathbf{P}(LS+\textrm{ succeeds}|P\in\mathcal{C}_{n,k,\alpha,\beta})$ is not defined). By Lemma 4.3 the LS+ algorithm succeeds in the $\mathrm{DDE}_{\mathrm{NR}}$ model on the RB tree if, among the nodes of the transmission path, there exists a symptomatic node in each household, and the source is symptomatic, which means that we can prove a lower bound on the success probability of LS+. Let us assume that the source is indeed symptomatic. Since the first hospitalized node is symptomatic by definition, the households of the source and of the first hospitalized node cannot make the LS+ algorithm fail. Let us denote these two households by $H_{s}$ and $H_{h}$ , respectively. Also, let $M$ and $S$ be the sets of all $P$ -multi- and $P$ -single-households, respectively, excluding $H_{s}$ and $H_{h}$ . Then, LS+ succeeds if all nodes in the households of $S$ are symptomatic, and if at least one node in the households of $M$ is symptomatic, which has probability $1-p$ and $1-p^{2}$ for each type of household, respectively, by equation (5). These observations yield that

	$\displaystyle\mathbf{P}(LS+\textrm{ succeeds}\|P\in\mathcal{C}_{n,k,\alpha,\beta})$	$\displaystyle\geq\mathbf{P}(\textrm{source is sym})(1-p)^{\|S\|}(1-p^{2})^{\|M\|}$
		$\displaystyle=(1-p)(1-p)^{k-2+\alpha+\beta}(1-p^{2})^{\frac{n-k+1}{2}-\alpha-\beta}$
(42)			$\displaystyle=(1-p)^{k-1+\alpha+\beta}(1-p^{2})^{\frac{n-k+1}{2}-\alpha-\beta}.$

∎

Finally, we are ready to state and prove Theorem 4.6 on the success probability of LS+, which we restate here for convenience.

Theorem A.7.

Let $p$ be as in (5) and let $\mathcal{S}(n,\alpha,\beta)$ be the set of $k$ values that satisfy Condition 1. Then, for the $\mathrm{DDE}_{\mathrm{NR}}$ epidemic model with parameters $(p_{i},p_{a},p_{h})$ on the RB tree with parameters $(d_{c},d_{h})$ we have

	$\displaystyle\mathbf{P}(LS+\textrm{ succeeds})\geq\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+$
(43)		$\displaystyle\sum_{n=2}^{\infty}\sum_{\begin{subarray}{c}\alpha,\beta\in\{0,1\}\\ k\in\mathcal{S}(n,\alpha,\beta)\end{subarray}}\binom{\frac{n+k-3}{2}}{k-2+\alpha+\beta}\frac{(d_{h}(1-p))^{\frac{n+k-1}{2}}(d_{c}(1+p))^{\frac{n-k+1}{2}-\alpha-\beta}d_{c}(d_{c}-1)^{k+\alpha+\beta-2}}{\lambda_{1}\left(\frac{d_{c}-1+D}{2}\right)^{n}+\lambda_{2}\left(\frac{d_{c}-1-D}{2}\right)^{n}}\mathbf{P}(d(s,h)=n),$

where $D,\lambda_{1}$ and $\lambda_{2}$ are terms depending on parameters $d_{c}$ and $d_{h}$ and are computed explicitly in Lemma A.3.

Proof.

Let us extend the domain of $\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})$ by function $g$ defined as $g:\mathbb{N}\times\mathbb{N}\times\{0,1\}\times\{0,1\}\rightarrow[0,1]$ such that

(44)

g(n,k,\alpha,\beta)=\begin{cases}\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})&\text{ if }k\in\mathcal{S}(n,\alpha,\beta)\\ 0&\text{ if }k\not\in\mathcal{S}(n,\alpha,\beta).\end{cases}

Unlike $\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})$ , $g$ is defined for every 4-tuple of parameters $(n,k,\alpha,\beta)\in\mathbb{N}\times\mathbb{N}\times\{0,1\}\times\{0,1\}$ . By the law of total probability we expand the success probability by conditioning on the path $P$ being of length $n$ as

	$\displaystyle\mathbf{P}(LS+\textrm{ succeeds})=$	$\displaystyle\sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}g(n,k,\alpha,\beta)\mathbf{P}(P\in C_{n,k,\alpha,\beta})$
(45)		$\displaystyle=$	$\displaystyle\sum_{n=0}^{\infty}\sum_{k=0}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}g(n,k,\alpha,\beta)\mathbf{P}(P\in C_{n,k,\alpha,\beta}\|P\in C_{n})\mathbf{P}(d(s,h)=n).$

Next, we exchange the sums over $\alpha,\beta$ and $k$ . This allows us to sum over only those $k$ values that satisfy Condition 1, which implies that $\mathbf{P}(LS+\textrm{ succeeds}|P\in C_{n,k,\alpha,\beta})$ is well-defined. As in Lemma A.5, we need to treat the $n=0$ and $n=1$ cases separately. Continuing equation (A.2), we arrive to

	$\displaystyle\mathbf{P}(LS+\textrm{ succeeds})=$	$\displaystyle\mathbf{P}(d(s,h)=0)+(1-p)\mathbf{P}(d(s,h)=1)+$
(46)			$\displaystyle\sum_{n=2}^{\infty}\sum_{\alpha,\beta\in\{0,1\}}\sum_{k\in\mathcal{S}(n,\alpha,\beta)}\mathbf{P}(LS+\textrm{ succeeds}\|P\in C_{n,k,\alpha,\beta})\frac{\|C_{n,k,\alpha,\beta}\|}{\|C_{n}\|}\mathbf{P}(d(s,h)=n)$

Substituting in the results from Lemmas A.3, A.5 and A.6 into equation (A.2) gives the desired result. ∎

A.3. Proof of Theorem 4.9

We start by restating Theorem 4.9 for convenience.

Theorem A.8.

In the $(d_{r},d)$ -RET with parameters $p_{i},p_{a},p_{h}$ , let $a_{t,l}$ be as in Definition 4.8. Then

(47)	$\displaystyle a_{t,0}$	$\displaystyle=1$
(48)	$\displaystyle a_{t,l}$	$\displaystyle=d_{r}p_{i}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}\textrm{, for }t\geq l\geq 1$
(49)	$\displaystyle a_{t,l}$	$\displaystyle=0\textrm{, for l > t}.$

Proof.

Similarly to (Feng and Mahmoud, 2018; Mahmoud, 2021), the proof relies on generating functions. We start by addressing the boundary cases. For all $t\geq 0$ , it holds that $A_{t,0}=1$ , and therefore $a_{t,0}=1$ . Similarly, for all $l,t$ such that $l>t$ , it holds that $A_{t,l}=0$ , and therefore $a_{t,l}=0$ . Suppose that $t\geq l=1$ . During day $t-1$ , on the first level, there are $A_{t-1,1}$ infected (internal) nodes and $d_{r}-A_{t-1,1}$ (external) nodes that may be infected with probability $p_{i}$ during day $t$ . Thus,

(50)

\displaystyle A_{t,1}

\displaystyle=A_{t-1,1}+\mathrm{Bin}(d_{r}-A_{t-1,1};p_{i}).

Taking the expectation of both sides in equation (50) yields

(51)

\displaystyle a_{t,1}=a_{t-1,1}(1-p_{i})+d_{r}p_{i},\textrm{ for }t\geq 1.

By subtracting the appropriate recurrence equations for $a_{t,1}$ and $a_{t-1,1}$ for $t\geq 2$ we obtain the homogeneous recurrence equation

(52)

\displaystyle a_{t,1}-a_{t-1,1}(2-p_{i})+(1-p_{i})a_{t-2,1}=0,\textrm{ for }t\geq 2

and boundary conditions $a_{0,1}=0$ and $a_{1,1}=d_{r}p_{i}$ . We solve for $a_{t,1}$ using the same methods as in the proof of Lemma A.3 and obtain

(53)

\displaystyle a_{t,1}=d_{r}\left(1-(1-p_{i})^{t}\right),\textrm{ for }t\geq 0.

Next, let us consider the general case $t\geq l>1$ . On day $t-1$ , there are $A_{t-1,l-1}$ nodes on level $l-1$ . Since, each node on level $l-1$ has $d$ children, there are $dA_{t-1,l-1}$ nodes on level $l$ that have an infectious parent on level $l-1$ . However, $A_{t-1,l}$ of them are already infected. Therefore $dA_{t-1,l-1}-A_{t-1,l}$ nodes of level $l$ may be infected on day $t$ , each with probability $p_{i}$ , which implies

(54)

\displaystyle A_{t,l}

\displaystyle=A_{t-1,l}+\mathrm{Bin}(dA_{t-1,l-1}-A_{t-1;l},p_{i}),\textrm{ for }t\geq l\geq 2.

Taking the expectation of both sides in equation (54) yields

	$\displaystyle a_{t,l}$	$\displaystyle=a_{t-1,l}+(da_{t-1,l-1}-a_{t-1,l})p_{i}$
(55)			$\displaystyle=a_{t-1,l}(1-p_{i})+dp_{i}a_{t-1,l-1},\textrm{ for }t\geq l\geq 2.$

For convenience, let us introduce $\lambda=1-p_{i}$ and $\mu=dp_{i}$ , and also let

(56)

\displaystyle f(x,y)=\sum_{t=1}^{\infty}\sum_{l=1}^{\infty}a_{t,l}x^{t}y^{l}=\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}

be the generating function for $a_{t,l}$ with $t,l\geq 1$ . By multiplying (A.3) by $x^{t}y^{l}$ and summing it over $t,l\geq 2$ we obtain

	$\displaystyle\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}$	$\displaystyle=\lambda\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t-1,l}x^{t}y^{l}+\mu\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t-1,l-1}x^{t}y^{l}$
(57)			$\displaystyle=\lambda x\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}+\mu xy\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}.$

Since $a_{1,l}=0$ for $l\geq 2$ ,

(58)

\displaystyle\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}=\sum_{t=2}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l},

and by inserting (58) into (A.3), we obtain

(59)

\displaystyle(1-\lambda x)\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}

\displaystyle=\mu xy\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}\stackrel{{\scriptstyle\eqref{eq:f(x,y)}}}{{=}}\mu xyf(x,y).

Now, we can also decompose the sum (58) using geometric series as

	$\displaystyle\sum_{t=1}^{\infty}\sum_{l=2}^{t}a_{t,l}x^{t}y^{l}$	$\displaystyle=\sum_{t=1}^{\infty}\sum_{l=1}^{t}a_{t,l}x^{t}y^{l}-\sum_{t=1}^{\infty}a_{t,1}x^{t}y$
		$\displaystyle\stackrel{{\scriptstyle\eqref{eq:a_t,1}}}{{=}}f(x,y)-d_{r}y\sum_{t=1}^{\infty}(1-\lambda^{t})x^{t}$
(60)			$\displaystyle=f(x,y)-d_{r}xy\left(\frac{1}{1-x}-\frac{\lambda}{1-\lambda x}\right).$

By plugging (A.3) into (59), we obtain the expression

(61)

\displaystyle f(x,y)

\displaystyle=d_{r}(1-\lambda)xy\frac{1}{1-x}\frac{1}{1-\lambda x-\mu xy}.

Then, we expand the fractions in (61) into a power series and we next apply the binomial theorem, we arrive to

	$\displaystyle f(x,y)$	$\displaystyle=d_{r}(1-\lambda)xy\sum_{n=0}^{\infty}x^{n}\sum_{m=0}^{\infty}x^{m}(\lambda+\mu y)^{m}$
		$\displaystyle=d_{r}(1-\lambda)xy\sum_{n=0}^{\infty}x^{n}\sum_{m=0}^{\infty}x^{m}\sum_{j=0}^{m}\binom{m}{j}\lambda^{m-j}(\mu y)^{j}$
(62)			$\displaystyle=d_{r}(1-\lambda)\sum_{n=0}^{\infty}\sum_{m=0}^{\infty}\sum_{j=0}^{m}\binom{m}{j}\lambda^{m-j}\mu^{j}x^{1+n+m}y^{j+1}.$

Let $t=1+n+m$ and $l=j+1$ . In order to obtain an expression for $a_{t,l}$ , we must change the variables in the sums of equation (A.3) from $(n,m,k)$ to $(t,m,l)$ . Changing the inner sum from variable $j$ to $l$ is simple. Changing the variables in the two outer sums is more challenging because $t,n$ and $m$ depend on each other in a nontrivial way. More precisely, since $m,n\geq 0$ we have $t\geq 1$ and also $m\leq t-1$ , which means that we have to set the lower limit of $t$ and the upper limit of $m$ accordingly. As for the remaining limits, variable $t$ can be arbitrary large, and $m$ can take any integer value starting from $0$ independently of $t$ , which yields the expression

(63)

\displaystyle f(x,y)

\displaystyle=d_{r}(1-\lambda)\sum_{t=1}^{\infty}\sum_{m=0}^{t-1}\sum_{l=1}^{m+1}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}.

For the values of $l$ with $l\geq m+1$ , the binomial coefficient $\binom{m}{l-1}$ is $0$ , which implies that we can increase the upper limit of the inner sum from $m+1$ to $t$ in equation (63). Then,

	$\displaystyle f(x,y)$	$\displaystyle=d_{r}(1-\lambda)\sum_{t=1}^{\infty}\sum_{m=0}^{t-1}\sum_{l=1}^{t}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}$
(64)			$\displaystyle=\sum_{t=1}^{\infty}\sum_{l=1}^{t}d_{r}(1-\lambda)\sum_{m=0}^{t-1}\binom{m}{l-1}\lambda^{m-l+1}\mu^{l-1}x^{t}y^{l}.$

Finally we can read off the value of $a_{t,l}$ from equation (A.3) as

(65)

\displaystyle a_{t,l}=d_{r}(1-\lambda)\sum_{m=0}^{t-1}\binom{m}{l-1}\mu^{l-1}\lambda^{m-l+1}=d_{r}p_{i}\sum_{m=0}^{t-1}\binom{m}{l-1}(dp_{i})^{l-1}(1-p_{i})^{m-l+1}.

∎

A.4. Proof of Corollary 4.10

We start by restating Corollary 4.10 for convenience.

Corollary A.9.

In the RET $(p_{i},d_{r},d)$ , let $a_{t}$ be the expectation of (12), as in Definition 4.8. For $t\geq 0$ ,

(66)

a_{t}=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.

Proof.

By using linearity of expectation, equation (12) and Theorem 4.9 we obtain:

	$\displaystyle a_{t}$	$\displaystyle=\sum_{l=0}^{+\infty}a_{t,l}$
		$\displaystyle=1+\sum_{l=1}^{+\infty}a_{t,l}$
(67)			$\displaystyle=1+d_{r}p_{i}\sum_{l=1}^{t}\sum_{m=l-1}^{t-1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}$

Before we use binomial theorem, we need to swap the sums. Boundaries from (A.4) are equivalent to $t-1\geq m\geq l-1\geq 0$ , so we can rewrite this as 2 conditions: $m+1\geq l\geq 1$ and $t\geq m\geq 0$ .

	$\displaystyle a_{t,l}$	$\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}\sum_{l=1}^{m+1}\binom{m}{l-1}(1-p_{i})^{m-l+1}d^{l-1}p_{i}^{l-1}$
(68)			$\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}\sum_{l=0}^{m}\binom{m}{l}(1-p_{i})^{m-l}d^{l}p_{i}^{l}$

Finally, by applying the binomial theorem and summing the geometric series, we obtain the desired equation:

	$\displaystyle a_{t,l}$	$\displaystyle=1+d_{r}p_{i}\sum_{m=0}^{t-1}(1-p_{i}+dp_{i})^{m}$
(69)			$\displaystyle=1+d_{r}\frac{(1-p_{i}+dp_{i})^{t}-1}{d-1}.$

∎

A.5. Proof or Lemma 4.12

We restate Lemma 4.12 here for convenience.

Lemma A.10.

Let us consider the stopped DET model with parameters $(c_{t,l}),p_{a},p_{h}$ , and let $h$ denote the first hospitalized node. Then

(70)

\displaystyle\mathbf{P}(d(s,h)=l)=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).

Proof.

Recall that a node added at day $t$ is uniformly distributed among the $c_{t}-c_{t-1}>0$ nodes added that day, and that the number of nodes added to level $l$ is $c_{t,l}-c_{t-1,l}$ on day $t$ . If we condition on the time of the first hospitalized case, denoted by $TI_{h}$ , then

	$\displaystyle\mathbf{P}(d(s,h)=l)$	$\displaystyle=\sum_{t=0}^{+\infty}\mathbf{P}(d(s,h)=l\|TI_{h}=t)\mathbf{P}(TI_{h}=t)$
		$\displaystyle=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}\mathbf{P}(\textrm{node is not hosp})^{c_{t-1}}(1-\mathbf{P}(\textrm{node is not hosp })^{c_{t}-c_{t-1}})$
(71)			$\displaystyle=\sum_{t=0}^{+\infty}\frac{c_{t,l}-c_{t-1,l}}{c_{t}-c_{t-1}}(1-(1-p_{a})p_{h})^{c_{t-1}}\left(1-(1-(1-p_{a})p_{h})^{c_{t}-c_{t-1}}\right).$

∎

Appendix B Dynamic Message Passing for the DDE model

In this section, we explain how we derived and implemented the DMP equations for the DDE+HNM model. We start by reviewing the previous work on the DMP equations for the SIR model in Appendix B.1, and then we proceed to our derivations in Appendix B.2. In Appendix B.3, we explain how we find candidate (node,time) pairs for the DMP equations, and in Appendix B.4 we conclude by combining Appendices B.2 and B.3 into a source-detection algorithm.

B.1. DMP Equations for the SIR Model

The DMP equations were first derived by (Lokhov et al., 2014) for the SIR model in the context of source detection. Their goal is to compute the marginal probabilities that node $i$ is in a given state at time $t$ (denoted by $P_{S}^{i}(t),P_{I}^{i}(t)$ and $P_{R}^{i}(t)$ for the susceptible, infected and recovered states, respectively), given initial conditions $P_{S}^{i}(t_{0}),P_{I}^{i}(t_{0})$ and $P_{R}^{i}(t_{0})$ at some initial time $t_{0}$ . To solve this problem in tree networks, we may consider a dynamic programming approach, where we delete a node $i$ , we compute the marginal probabilities of $P_{S}^{j}(t-1)$ for all neighbors $j$ of $i$ in the remaining subtrees, and use this information to compute $P_{S}^{i}(t)$ (as the marginals are independent in each of the subtrees conditioned on the state of $i$ ). The DMP equations make the dynamic programming intuition explicit. Originally, the DMP equations were developed for static networks, but since the generalization to time-varying networks is straightforward, and has already been foreshadowed in a similar heuristic algorithm (Jiang et al., 2016), we include it in this preliminary section. For time-varying networks, we define $N_{i}(t)$ as the set of neighbors of node $i$ in the time-window $[t,t+1)$ .

To formalize the dynamic programming approach, (Lokhov et al., 2014) introduces some new notation. Let $\lambda$ be the probability that an infectious node infects a susceptible neighbor, and let $\mu$ be the probability that an infectious node recovers. Let $D_{i}$ be the auxiliary dynamics, where node $i$ receives infection signals, but ignores them, and thus remains in the $S$ state at all times. Let $P_{S}^{j\rightarrow i}(t)$ be the probability that node $j$ is in the state $S$ at time $t$ in the dynamics $D_{i}$ , and let $\theta^{k\rightarrow i}(t)$ be the probability that the infection signal has not been passed from node $k$ to node $i$ up to time $t$ in the dynamics $D_{i}$ . Finally, let $\phi^{k\rightarrow i}(t)$ be the probability that the infection signal has not been passed from node $k$ to node $i$ up to time $t$ , and that node $k$ is in the state $I$ at time $t$ , in the dynamics $D_{i}$ . With these definitions, the dynamic programming approach is formalized by the following equations for $t\geq t_{0}$ :

(72)	$\displaystyle P_{S}^{i\rightarrow j}(t+1)$	$\displaystyle=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)\backslash j}\theta^{k\rightarrow i}(t+1),$
(73)	$\displaystyle\theta^{k\rightarrow i}(t+1)-\theta^{k\rightarrow i}(t)$	$\displaystyle=-\lambda\phi^{k\rightarrow i}(t),$
(74)	$\displaystyle\phi^{k\rightarrow i}(t)$	$\displaystyle=(1-\lambda)(1-\mu)\phi^{k\rightarrow i}(t-1)+\left(P_{S}^{k\rightarrow i}(t-1)-P_{S}^{k\rightarrow i}(t)\right).$

The marginal probabilities that node $i$ is in a given state at time $t$ are then given by

(75)		$\displaystyle P_{S}^{i}(t+1)=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)}\theta^{k\rightarrow i}(t+1)\,,$
(76)		$\displaystyle P_{R}^{i}(t+1)=P_{R}^{i}(t)+\mu P_{I}^{i}(t)\,,$
(77)		$\displaystyle P_{I}^{i}(t+1)=1-P_{S}^{i}(t+1)-P_{R}^{i}(t+1)\,.$

These equations are only exact on trees, but they can also be applied to networks with cycles as a heuristic approach. The heuristic gives good approximations to the true marginals if the network is at least locally tree-like (Karrer and Newman, 2010).

B.2. DMP Equations for the DDE+HNM Model

There are several differences between the SIR model on locally tree-like networks and the DDE+HNM model (see Figure 2 (a)). First, the DDE model has additional compartments (exposed nodes, asymptomatic nodes), which motivates the introduction of several new variables. Let $\lambda_{(a)}$ (resp., $\lambda_{(s)}$ ) be the probability that an asymptomatic (resp., symptomatic) node infects a susceptible node. Let $\phi^{k\rightarrow i}(t)^{(a)}$ (resp., $\phi^{k\rightarrow i}(t)^{(s)}$ ) be the probability that the infection signal has not been passed from node $k$ to node $i$ up to time $t$ , and that node $k$ is asymptomatic (resp., symptomatic) infectious at time $t$ , in the dynamics $D_{i}$ .

The second important difference is that in the DDE model, the transition times between different compartments are deterministic instead of following a geometric distribution as in the standard SIR model. While deterministic transition times sound simpler at first, it turns out that they make the DMP equations more complex, because the Markovian property that each marginal probability depends only on the previous timestep is lost if the transition times are larger than $1$ . Recall that the times for the transitions $E\rightarrow I$ and $I\rightarrow R$ (with their default values) are $T_{E}=3$ and $T_{I}=14$ .

Let us incorporate these two differences into equations (72)–(74) to derive the DMP equations for the DDE model. Equation (78) is essentially a copy of (72). Equation (79) follows equation (73), but we incorporate the two different variants of infected (asymptomatic and symptomatic) patients with their respective infection probabilities $\lambda_{(a)}$ and $\lambda_{(s)}$ . Equation (80) is a new equation, which is necessary because recovery times are no longer geometric random variables; instead we need to check the probabilities of infection $T_{E}+T_{I}$ timesteps earlier than the current time $t$ . Finally, equation (81) (resp., (82)) is the asymptomatic (resp., symptomatic) version of equation (74), while also incorporating the deterministic time for the transition $E\rightarrow I$ . For $t\geq t_{0}$ , this yields equations

(78)		$\displaystyle P_{S}^{i\rightarrow j}(t+1)$	$\displaystyle=P_{S}^{i}(t_{0})\prod_{k\in N_{i}(t)\backslash j}\theta^{k\rightarrow i}(t+1)=P_{S}^{i}(t_{0})\frac{P_{S}^{i}(t+1)}{\theta^{j\rightarrow i}(t+1)},$
(79)		$\displaystyle\theta^{k\rightarrow i}(t+1)-\theta^{k\rightarrow i}(t)$	$\displaystyle=-\lambda_{(a)}\phi^{k\rightarrow i}_{(a)}(t)-\lambda_{(s)}\phi^{k\rightarrow i}_{(s)}(t),$
(80)		$\displaystyle P_{R}^{k\rightarrow i}(t)$	$\displaystyle=P_{S}^{k\rightarrow i}(t-T_{E}-T_{I}-1)-P_{S}^{k\rightarrow i}(t-T_{E}-T_{I})$
	$\displaystyle\phi^{k\rightarrow i}_{(a)}(t)$	$\displaystyle=(1-\lambda_{(a)})(1-P_{R}^{k\rightarrow i}(t))\phi^{k\rightarrow i}_{(a)}(t-1)$
(81)			$\displaystyle\qquad\qquad+p_{a}[P_{S}^{k\rightarrow i}(t-T_{E}-1)-P_{S}^{k\rightarrow i}(t-T_{E})].$
	$\displaystyle\phi^{k\rightarrow i}_{(s)}(t)$	$\displaystyle=(1-\lambda_{(s)})(1-P_{R}^{k\rightarrow i}(t))\phi^{k\rightarrow i}_{(s)}(t-1)$
(82)			$\displaystyle\qquad\qquad+(1-p_{a})[P_{S}^{k\rightarrow i}(t-T_{E}-1)-P_{S}^{k\rightarrow i}(t-T_{E})].$

We note that for early values of $t$ , equations (80)–(82) depend on $P_{S}^{k\rightarrow i}$ before $t_{0}$ , which we initialize to be 1 (all nodes are susceptible before the first node develops the infection). The marginal probability that node $i$ is susceptible at time $t$ is still computed by equation (75) as before. Equations (76)–(77) do not apply anymore; we explain it in Appendix B.4 how to take into account observations for nodes in the infectious compartments.

The third difference between the the SIR model on locally tree-like networks and the DDE+HNM model is that the HNM model contains many short cycles inside the households. Short cycles can cause unwanted feedback loops in the DMP equations where, loosely speaking, nodes are treated as if they could reinfect themselves. We solve this issue by modifying the underlying graph to be locally tree-like (only for the computation of the DMP equations). Specifically, we introduce a new central household-node for each household, and we replace the cliques inside the households by a star graph centered at this new household-node node. Introducing such a central household-node does of course alter epidemic process, in particular it makes household infections less independent and slower (all household infections need to pass through an extra node). To mitigate this issue, we assume that central household-nodes have $T_{E}=1$ and that they are infected with probability 1 by any node in the same household. We tested the validity of the resulting DMP equations against simulations of the epidemic progressions and we found the results to be quite accurate, in particular, more accurate than the version without the introduction of these central household-nodes.

Note that we derived the DMP equations for the DDE+HNM model, however, since (i) the compartments are the same, (ii) the equations support temporal networks, and (iii) we have separate infection probabilities $\lambda_{(a)}$ and $\lambda_{(s)}$ for asymptomatic and symptomatic nodes, our equations can also be applied to the DCS+TU model after a discretizing (rounding) the time observations.

Finally, we touch upon the computational complexity of computing the DMP equations. In principle, we need to update $O(dN)$ equations (for each edge) over $t_{\mathrm{max}}$ timesteps, where $t_{\mathrm{max}}$ is the maximum time during which the marginals can still change, which can be as large as $O(N)$ . However, since we are only interested in computing the likelihood of the 5 earliest observations, $t_{\mathrm{max}}$ is typically quite low. Moreover, since we assume to be in an early stage of the epidemic, most of the equations remain unchanged. For better computational scalability, we only compute $P_{S}^{i}(t)$ and $\theta^{k\rightarrow i}(t)$ for nodes $k,i$ that have $P_{I}^{k\rightarrow i}(t)>0.01$ , i.e., we only update nodes that are at least somewhat likely to have received the infection. Otherwise, we set $P_{I}^{k\rightarrow i}(t)=P_{I}^{k\rightarrow i}(t-1)$ , $\theta^{k\rightarrow i}(t)=\theta^{k\rightarrow i}(t-1)$ , and in the implementation we can perform these assignments implicitly using appropriate data structures. With these adjustments, the time-complexity of the algorithm becomes independent of $N$ , but remains dependent on the network parameters, the epidemic parameters and the number of sensors in a non-trivial way.

B.3. Feasible Source-time Pairs for Source Detection

In this section we explain how we implemented the feasible source identification algorithm, which was suggested as a preprocessing step for a method very similar to the DMP equations by (Jiang et al., 2016). Let us define the directed graph $G_{2}$ on (node,infecton_time) pairs (we use “nodes” for the nodes of the original graph $G$ and “pairs” for the nodes of $G_{2}$ ), and draw an edge between two pairs $(v_{1},t_{1})\rightarrow(v_{2},t_{2})$ if $v_{1}$ and $v_{2}$ are in contact at $t_{2}$ , and $t_{2}$ is in the interval $[t_{1}+T_{E},t_{1}+T_{E}+T_{I}]$ . Observe that in the DDE model there is an edge $(v_{1},t_{1})\rightarrow(v_{2},t_{2})$ if and only if $v_{1}$ becoming infected at time $t_{1}$ can infect $v_{2}$ at time $t_{2}$ . The definition of $G_{2}$ is applicable to the DCS model as well after discretization (rounding), however, since the infection times are not deterministic anymore, not all possible infections $(v_{1},t_{1})\rightarrow(v_{2},t_{2})$ have a corresponding edge in $G_{2}$ .

Then, we perform a breadth-first search backwards on the directed edges of $G_{2}$ , starting from each pair $(v_{i},t_{i}-T_{E}-T_{P})$ , where $v_{i}$ is a symptomatic sensor node, and $t_{i}$ is the symptom onset time of $v_{i}$ (for the DCS model, we start from integer times in the $t_{i}-T_{E}-T_{P}\pm(\sigma_{E}+\sigma_{P})$ interval to account for the randomness of the transition times). To limit the time complexity of the algorithm, we only consider the $k_{1}$ earliest observations, which means that we start $k_{1}$ breadth-first searches. With this construction, each pair $(v,t)$ discovered by a breadth-first search started from $(v_{i},t_{i}-T_{E}-T_{P})$ could have caused the infection in $v_{i}$ ; we say that $(v,t)$ is an explanation for observation $i$ . We perform the breath-first searches until we find $k_{2}$ pairs that explain all of the $k_{1}$ earliest observations. See the pseudocode in Algorithm B.1.

Claim 1.

In the DDE model, Algorithm B.1 with $\sigma_{E}=\sigma_{P}=0$ finds the $k_{2}$ feasible explanations with the latest starting time of the $k_{1}$ earliest symptomatic nodes.

Proof.

By construction, a source node $v$ that becomes infectious at time $t$ can cause an observation $(v_{i},t_{i})$ if and only if there is a directed path from $(v,t)$ to $(v_{i},t_{i}-T_{E}-T_{P})$ . Therefore, the breadth-first search algorithm finds all of the closest feasible sources in time. ∎

Input:

•

The mean exposed time $T_{E}$ the mean pre-infectious time $T_{P}$ , the mean infectious time $T_{I}$ ,

the std of the exposed time $\sigma_{E}$ and the std of the pre-infectious time $\sigma_{P}$
•

$F(v)_{min}$ and $F(v)_{max}$ returns the minimum and maximum times when $v$ could have been exposed based on all of its (possibly asymptomatic or negative) test results
•

$S(v)$ returns the time of symptom onset for a node $v$ tested positive symtomatic.
•

$N(v,[t_{min},t_{max}])$ returns the set of neighbors of node $v$ in the interval $[t_{min},t_{max}]$
•

A lower estimate of the time the source became infectious $t_{min}$
•

Integers $k_{1},k_{2}$

Output: A list of at most

k_{2}

tuples of node and time pairs that can explain the first

k_{1}

symptomatic nodes

l\leftarrow\{\}

; // if the list

l[t]

contains the tuple

(v,w)

, then the infection started at

w

at time

t

can explain

v

D\leftarrow\{\}

; // if the list

D[w,t]

contains the node

v

, then the infection started at

w

at time

t

can explain

v

doneList\leftarrow[]

;

for $v\in SortIncreasingByValues(S)[0:k_{1}]$ do

t^{\prime}_{min}\leftarrow S(v)-(T_{E}+T_{P})-(\sigma_{E}+\sigma_{P})

;

t^{\prime}_{max}\leftarrow S(v)-(T_{E}+T_{P})+(\sigma_{E}+\sigma_{P})

;

for $t^{\prime}\leftarrow t^{\prime}_{min}$ to $t^{\prime}_{max}$ do

Append((v,v),l[t^{\prime}])

;

Append(v,D[v,t^{\prime}])

;

if $Length(D[v,t^{\prime}])=k_{1}$ then

Append((v,t^{\prime}),doneList)

t\leftarrow SortIncreasingByValues(S)[k_{1}-1]

;

stopCondition\leftarrow False

;

while not $stopCondition$ and $t>t_{min}$ do

for $v,w\in l[t]$ do

for $u\in N(w,[t,t-1])$ do

t^{\prime}_{min}\leftarrow\max(F(u)_{min},t-T_{E}-T_{I})

;

t^{\prime}_{max}\leftarrow\min(F(u)_{max},t-T_{E})

;

for $t^{\prime}\leftarrow t^{\prime}_{max}$ to $t^{\prime}_{min}$ do

Append((v,u),l[t^{\prime}])

;

Append(v,D[(u,t^{\prime})])

;

if $Length(D[u,t^{\prime}])=k_{1}$ then

Append((u,t^{\prime}),doneList)

doneList\leftarrow SortBySecondElement(doneList)

;

if $Length(doneList\geq k2$ ) and $t-T_{E}\leq doneList[k2][1]$ then

stopCondition\leftarrow True

;

else

t\leftarrow t-1

;

return $doneList$

Algorithm B.1 Feasible source identification (reverse dissemination (Jiang et al., 2016))

B.4. Source Detection via Feasible Source Identification and DMP

In this section we explain how to combine Algorithm B.1 with the DMP equations derived in Appendix B.2. See the pseudocode in Algorithm B.2.

We start by computing the DMP equations (78)-(82) and (72) for the $k_{2}$ tuples of node and time pairs that can explain the first $k_{1}$ symptomatic observations returned by Algorithm B.1. Next, our goal is to use these DMP equations to compute the likelihood of each of the $k_{2}$ tuples using the $k_{1}$ observations. Similarly to (Lokhov et al., 2014), we make the assumption that the first $k_{1}$ observations are independent, and we can compute the likelihood by multiplying their respective marginals together. For symptomatic observed nodes $v$ , we know the time of symptom onset, which we denote by $S(v)$ . Then, the marginal probability of $v$ developing symptoms exactly at time $t$ can be computed by taking the difference of $P_{S}^{v}(S(v)-T_{P}-T_{E}-1)$ and $P_{S}^{v}(S(v)-T_{P}-T_{E})$ and multiplying the difference by $(1-p_{a})$ . In Algorithm B.2 we drop the multiplicative factor $(1-p_{a})$ because it is present for all of the tuples, and it does not change the final order of their scores. For asymptomatic (resp., negative) observations, we only know that at the time of testing, denoted by $A(v)$ (resp., $NE(v)$ ), at least a time interval of length $T_{E}$ has passed (resp., $T_{E}$ has not passed) since the time of infection. Therefore, dropping the $p_{a}$ factor similarly to the symptomatic case, we compute the marginal of asymptomatic observations as $1-P_{S}^{v}(A(v)-T_{E})$ , and we compute the marginal of negative observations as $P_{S}^{v}(NE(v)-T_{E})$ . Finally, the contributions of the observations are multiplied together for each of the $k_{2}$ tuples returned by Algorithm B.1, and the scores approximating the likelihoods are returned.

Input:

•

The mean exposed time $T_{E}$ , the mean pre-infectious time $T_{P}$ , the mean infectious time $T_{I}$
•

$S(v)$ returns the time of symptom onset for a node $v$ tested positive symtomatic.
•

$A(v)$ and $NE(v)$ return the time of asymptomatic and negative test results, respectively

Output: A dictionary

L

k_{2}

elements, which contains a score for each

(v,t)

pair that explains the first

k_{1}

observations. Higher scores signify higher confidence of being the source.

L\leftarrow\{\}

;

doneList\leftarrow\mathrm{Algorihtm\ \ref{alg:feasible}}(k_{1},k_{2})

;

for $v,t_{0}\in doneList$ do

P_{S}\leftarrow

eq. (72) based on DMP eq. (78)-(82) with

P_{S}^{v}(t_{0})=0

, and

P_{S}^{w}(t_{0})=1

for all

w\neq v

;

L[v,t_{0}]\leftarrow 1

;

for $w\in S$ do

L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot(P_{S}^{v}(S(v)-T_{P}-T_{E}-1)-P_{S}^{v}(S(v)-T_{P}-T_{E}))

;

if $w\in A$ then

L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot(1-P_{S}^{v}(A(v)-T_{E}))

;

for $w\in NE$ do

L[v,t_{0}]\leftarrow L[v,t_{0}]\cdot P_{S}^{v}(NE(v)-T_{E})

;

return $L$

Algorithm B.2 Source detection via DMP

Source Detection via Contact Tracing in the Presence of Asymptomatic Patients

Abstract.

1. Introduction

2. Warmup Results

2.1. A Simple Network and Epidemic Model and a Simple Algorithm

2.2. Back of the Envelope Calculation

3. Models, Methods, Algorithms

3.1. Epidemic Models

3.1.1. The DCS Model

3.1.2. The DDE Model

3.2. Simulating Mobility

3.2.1. Tubingen Mobility Model

3.2.2. Household Network Model

3.3. The Source Detection via Contact Tracing Framework

3.4. Parameters

3.5. The LocalSearch Algorithms LS and LS+

4. Theoretical Results

4.1. Success Probability of LS and LS+ Algorithms on the RB Tree

4.1.1. Red-Blue tree models

Definition 4.1.

Remark 4.1.

Definition 4.2.

Remark 4.2.

Lemma 4.3.

Remark 4.3.

4.1.2. The DDENR\mathrm{DDE}_{\mathrm{NR}} Model

4.1.3. Success Probability of LS

Definition 4.4.

Lemma 4.5.

Proof.

4.1.4. Success Probability of LS+

Theorem 4.6.

4.2. Approximating the Depth of the Path to the First Hospitalized Node

4.2.1. (dr,d)(d_{r},d)-ary Random Exponential Tree

Definition 4.7.

Definition 4.8.

Theorem 4.9.

Corollary 4.10.

4.2.2. Deterministic Exponential Tree with Parameters pa,php_{a},p_{h} and (ct,l)t,l∈ℕ(c_{t,l})_{t,l\in\mathbb{N}}

Definition 4.11.

Lemma 4.12.

Remark 4.4.

5. Simulation Results

5.1. Baseline Algorithms

5.1.1. Non-adaptive Baseline: Dynamic Message Passing

5.1.2. Adaptive Baseline: Size-Gain

5.2. Comparison with Baselines

5.3. Comparison of Simulations and Theoretical Results

5.4. Simulations on the DCS Model

6. Discussion

References

Appendix A Additional Proofs

A.1. Proof of Lemma 4.3

Lemma A.1.

Proof.

A.2. Proof of Theorem 4.6

Definition A.2.

Lemma A.3.

Proof.

Definition A.4.

Condition 1.

Lemma A.5.

Proof.

Lemma A.6.

Proof.

Theorem A.7.

Proof.

A.3. Proof of Theorem 4.9

Theorem A.8.

Proof.

A.4. Proof of Corollary 4.10

Corollary A.9.

Proof.

A.5. Proof or Lemma 4.12

Lemma A.10.

Proof.

Appendix B Dynamic Message Passing for the DDE model

B.1. DMP Equations for the SIR Model

B.2. DMP Equations for the DDE+HNM Model

B.3. Feasible Source-time Pairs for Source Detection

4.1.2. The $\mathrm{DDE}_{\mathrm{NR}}$ Model

4.2.1. $(d_{r},d)$ -ary Random Exponential Tree

4.2.2. Deterministic Exponential Tree with Parameters $p_{a},p_{h}$ and $(c_{t,l})_{t,l\in\mathbb{N}}$