
Sequential Monte Carlo algorithms for agent-based models of disease transmission

Nianqiao Ju (corresponding author: nju@g.harvard.edu), Department of Statistics, Harvard University, USA; Jeremy Heng, ESSEC Business School, Singapore; Pierre E. Jacob, Department of Statistics, Harvard University, USA
Abstract

Agent-based models of disease transmission involve stochastic rules that specify how a number of individuals infect one another, recover or are removed from the population. Common yet stringent assumptions stipulate that agents are interchangeable and that all pairwise contacts are equally likely. Under these assumptions, the population can be summarized by the numbers of susceptible and infected individuals, which greatly facilitates statistical inference. We consider the task of inference without such simplifying assumptions, in which case the population cannot be summarized by low-dimensional counts. We design improved particle filters, where each particle corresponds to a specific configuration of the population of agents, that take either the next or all future observations into account when proposing population configurations. Using simulated data sets, we illustrate that orders of magnitude improvements over bootstrap particle filters are possible. We also provide theoretical support for the approximations employed to make the algorithms practical.

1 Introduction

1.1 Statistical inference for agent-based models

Agent-based models, also called individual-based models, are used in many fields, such as social sciences (Epstein, 2006), demographics (Hooten et al., 2020), ecology (DeAngelis and Gross, 2018) and macroeconomics (Turrell, 2016). These models describe the time evolution of a population according to a Markov process. The population comprises $N\in\mathbb{N}$ agents who interact with one another through a probabilistic mechanism. The popularity of these models can be attributed to the ease of model building and interpretation, while allowing the representation of complex phenomena. Specialized software is available to facilitate their simulation and the visualization of their output (e.g. Tisue and Wilensky, 2004).

The primary use of these models seems to involve “forward simulations” under various parameters corresponding to hypothetical scenarios. In comparison, relatively few works have tackled the question of statistical inference for such models, i.e. the estimation of model parameters given available observations, also known as model calibration. Broad articles on the question of fitting agent-based models include Grazzini et al. (2017); Hooten et al. (2020); Hazelbag et al. (2020), and this is a very active research area. The question of estimation is well-posed: by viewing agent-based models as a subclass of hidden Markov models, we can define the associated likelihood function, from which maximum likelihood and Bayesian procedures can be envisioned.

Gibbs sampling or “data augmentation” Markov chain Monte Carlo (MCMC) methods, which alternate between updates of the parameters and of the latent agent states, are generically applicable; see Fintzi et al. (2017); Bu et al. (2020) in the context of disease transmission. However, the mixing properties of these chains can be problematic, as we illustrate in Appendix C. Likelihood-based inference that avoids data augmentation requires one to marginalize out the latent population process. This is computationally challenging as the number of possible configurations of the population of agents grows exponentially in $N$. Our contribution is to design improved particle filters to estimate the likelihood function of agent-based models, which can then be used for parameter inference procedures such as simulated maximum likelihood and particle MCMC (Andrieu et al., 2010). Particle MCMC methods, based on more standard particle filters, have been used for agent-based models before; see e.g. Kattwinkel and Reichert (2017). For indirect inference and approximate Bayesian computation applied to agent-based models we refer to Platt (2020); van der Vaart et al. (2016); Chen et al. (2017); Sirén et al. (2018).

1.2 Statistical inference for agent-based models of disease transmission

We limit the scope of this article to some agent-based models of disease transmission. We next introduce a specific model which will be used to draw clear connections between agent-based models and more tractable but restrictive formulations of disease outbreak models.

The state of agent $n$ at time $t$, denoted by $X_{t}^{n}$, takes the value 0 or 1, corresponding to a “susceptible” or “infected” status in the context of disease transmission. We consider a closed population of size $N$, $T\in\mathbb{N}$ time steps and a stepsize of $h>0$. Initial agent states follow independent Bernoulli distributions, i.e.

X_{0}^{n}\sim\text{Ber}(\alpha_{0}), (1)

for $n\in[1:N]=\{1,\ldots,N\}$, where we assume for the time being that each agent has the same probability $\alpha_{0}\in(0,1)$ of being infected. The state of the population $X_{t}=(X_{t}^{n})_{n\in[1:N]}\in\{0,1\}^{N}$ evolves according to a Markov process that depends on the interactions between the agents. The Markov transition specifies that, given the previous states $x_{t-h}\in\{0,1\}^{N}$, the next agent states $(X_{t}^{n})_{n\in[1:N]}$ are conditionally independent Bernoulli variables with probabilities $(\alpha^{n}(x_{t-h}))_{n\in[1:N]}$ given by

\alpha^{n}(x_{t-h})=\begin{cases}h\lambda N^{-1}\sum_{m=1}^{N}x_{t-h}^{m},&\text{if }x_{t-h}^{n}=0,\\ 1-h\gamma,&\text{if }x_{t-h}^{n}=1.\end{cases} (2)

Here $\lambda\in(0,1)$ and $\gamma\in(0,1)$ represent infection and recovery rate parameters respectively. An uninfected agent becomes infected with probability proportional to $\lambda$ and to the proportion of infected individuals. An infected agent recovers with probability proportional to $\gamma$. The ratio $R_{0}=\lambda/\gamma$ is known as the basic reproductive number, which characterizes the transmission potential of a disease (Lipsitch et al., 2003; Britton, 2010). The latent population process $(X_{t})$ can be related to aggregated observations $(Y_{t})$ of the entire population. For example, the number of infections reported at each time could be modelled as a Binomial distribution, i.e.

Y_{t}\mid X_{t}=x_{t}\sim\text{Bin}(I(x_{t}),\rho), (3)

where $I(x_{t})=\sum_{n=1}^{N}x_{t}^{n}\in[0:N]$ counts the number of infected individuals and $\rho\in(0,1)$ represents a reporting rate. Equations (1), (2) and (3) define a susceptible-infected-susceptible (SIS) hidden Markov model.
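The SIS hidden Markov model defined by (1)-(3) is straightforward to simulate forward in time. The following sketch (Python with NumPy; the parameter values are illustrative choices of ours, not taken from the paper) draws one latent trajectory together with the corresponding observations:

```python
import numpy as np

def sis_step(x, lam, gamma, h, rng):
    """One transition of the homogeneous dynamics (2): each agent's next
    state is Bernoulli with probability alpha^n(x_{t-h})."""
    prop_infected = x.mean()                     # N^{-1} sum_m x_{t-h}^m
    alpha = np.where(x == 0,
                     h * lam * prop_infected,    # susceptible -> infected
                     1.0 - h * gamma)            # infected remains infected
    return (rng.uniform(size=x.size) < alpha).astype(int)

rng = np.random.default_rng(1)
N, h, lam, gamma, rho = 1000, 0.5, 0.8, 0.3, 0.9
x = (rng.uniform(size=N) < 0.05).astype(int)     # X_0^n ~ Ber(alpha_0), alpha_0 = 0.05
ys = []
for _ in range(20):
    x = sis_step(x, lam, gamma, h, rng)
    ys.append(int(rng.binomial(x.sum(), rho)))   # Y_t | X_t ~ Bin(I(x_t), rho), as in (3)
```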

As the latent state space has cardinality $2^{N}$, evaluating the likelihood function with the forward algorithm would be impractical. Statistical inference is still feasible because the latent process admits low-dimensional summaries. Since (2) assumes that all agents are interchangeable and all pairwise contacts are equally likely, the dynamics depend only on the number of infected individuals $I_{t}=I(X_{t})$ and susceptible individuals $S_{t}=N-I_{t}$. More precisely, the Markov transition for each $t>0$ can be summarized as

S_{t}=S_{t-h}+N^{S}_{t}-N^{I}_{t},\qquad I_{t}=I_{t-h}-N^{S}_{t}+N^{I}_{t}, (4)

where $N^{S}_{t}\sim\text{Bin}(I_{t-h},h\gamma)$ and $N^{I}_{t}\sim\text{Bin}(S_{t-h},h\lambda N^{-1}I_{t-h})$ denote the numbers of new recoveries and new infections, respectively. The initialization in (1) corresponds to having $I_{0}\sim\text{Bin}(N,\alpha_{0})$ and $S_{0}=N-I_{0}$. The process $(S_{t},I_{t})$ is a Markov chain on the lower-dimensional state space $\{(s,i)\in[0:N]^{2}:s+i=N\}$, rather than on $\{0,1\}^{N}$. When combined with (3), the resulting hidden Markov model can be estimated using particle filtering strategies; see e.g. Ionides et al. (2006); Endo et al. (2019); Whiteley and Rimella (2020).
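Under interchangeability, one can thus simulate the population without tracking individual agents, using only the counts in (4); a minimal sketch (illustrative parameter values of ours):

```python
import numpy as np

def sis_counts_step(S, I, lam, gamma, h, N, rng):
    """One transition of the count process (4): sample the numbers of new
    recoveries N_t^S and new infections N_t^I, then update the counts."""
    new_recov = rng.binomial(I, h * gamma)           # N_t^S ~ Bin(I_{t-h}, h*gamma)
    new_infec = rng.binomial(S, h * lam * I / N)     # N_t^I ~ Bin(S_{t-h}, h*lam*N^{-1}*I_{t-h})
    return S + new_recov - new_infec, I - new_recov + new_infec

rng = np.random.default_rng(0)
N = 1000
I = int(rng.binomial(N, 0.05))                       # I_0 ~ Bin(N, alpha_0)
S = N - I
for _ in range(20):
    S, I = sis_counts_step(S, I, lam=0.8, gamma=0.3, h=0.5, N=N, rng=rng)
```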

For large population sizes $N$, one can exploit the asymptotic properties of the dynamics (4) to further simplify inference procedures. As $N\rightarrow\infty$, the finite population proportions $\bar{S}_{t}^{N}=S_{t}/N$ and $\bar{I}_{t}^{N}=I_{t}/N$ admit deterministic limits $\bar{S}_{t}$ and $\bar{I}_{t}$, defined by the recursion

\bar{S}_{t}=\bar{S}_{t-h}+h\gamma\bar{I}_{t-h}-h\lambda\bar{S}_{t-h}\bar{I}_{t-h},\qquad \bar{I}_{t}=\bar{I}_{t-h}-h\gamma\bar{I}_{t-h}+h\lambda\bar{S}_{t-h}\bar{I}_{t-h}, (5)

with initial condition $(\bar{S}_{0},\bar{I}_{0})=(1-\alpha_{0},\alpha_{0})$. If fine time resolutions are desired, one can also consider the limit $h\rightarrow 0$, in which case $(\bar{S}_{t},\bar{I}_{t})$ converges to a deterministic continuous-time process $(\bar{S}(t),\bar{I}(t))$, defined by the following system of ordinary differential equations

\frac{d\bar{S}(t)}{dt}=\gamma\bar{I}(t)-\lambda\bar{S}(t)\bar{I}(t),\qquad \frac{d\bar{I}(t)}{dt}=-\gamma\bar{I}(t)+\lambda\bar{S}(t)\bar{I}(t), (6)

with $(\bar{S}(0),\bar{I}(0))=(1-\alpha_{0},\alpha_{0})$. We refer to Allen and Burgin (2000) and Allen et al. (2008, Chapter 3) for formal links between stochastic and deterministic SIS models, in the form of limit theorems as $N\to\infty$. The dynamics in (5) or (6) combined with an observation model such as (3) again lead to a fairly tractable statistical model, which can be estimated by maximum likelihood or in the Bayesian framework (e.g. Osthus et al., 2019).
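The deterministic recursion (5) is trivial to iterate; the sketch below (with illustrative parameter values of ours) also shows that, when $\lambda>\gamma$, the iterates approach the endemic fixed point $\bar{I}^{*}=1-\gamma/\lambda$ of both (5) and (6):

```python
def sis_deterministic(alpha0, lam, gamma, h, T):
    """Iterate the deterministic recursion (5) for the proportions,
    starting from (S_bar_0, I_bar_0) = (1 - alpha0, alpha0)."""
    S, I = 1.0 - alpha0, alpha0
    for _ in range(int(T / h)):
        # increments cancel, so S + I = 1 is preserved exactly
        S, I = (S + h * gamma * I - h * lam * S * I,
                I - h * gamma * I + h * lam * S * I)
    return S, I

S, I = sis_deterministic(alpha0=0.05, lam=0.8, gamma=0.3, h=0.1, T=200)
```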

The tractability of (4) and its limits (5)-(6) hinges critically on the assumption that all agents are interchangeable and equally likely to be in contact. These assumptions are strong: it would be desirable to let the infection and recovery rates depend on individual characteristics or features, and to let the structure of interactions be specified by an undirected network. Adopting any one of these generalizations results in an agent-based model that cannot be summarized by low-dimensional counts. This article is concerned with the task of inference when these assumptions are relaxed. Our approach is to return to the hidden Markov model formulation with latent space $\{0,1\}^{N}$, and to seek novel sampling algorithms whose costs are polynomial rather than exponential in $N$. The guiding principle of our contribution is that we can improve the performance of Monte Carlo algorithms, sometimes dramatically, by tailoring them to the models at hand and by using the available observations.

The rest of this article is organized as follows. In Section 2, we consider a “static” case with a single time step to illustrate some of the challenges and proposed solutions in a simple setting. We describe how the cost of evaluating the likelihood and of sampling the posterior of agent states given observations can be reduced from exponential to polynomial in $N$. In Section 3, we consider the dynamic setting with multiple time steps and design auxiliary particle filters and controlled SMC algorithms tailored to heterogeneous SIS agent-based models. We also provide theoretical support for the approximations employed to make the algorithms practical. We extend our methodology to susceptible-infected-recovered (SIR) models in Section 4 and highlight the associated difficulties. Section 5 summarizes our findings and open questions. This manuscript represents preliminary work and further numerical experiments will be added in a later version. The code is available at https://github.com/nianqiaoju/agents. The proofs and further details are in the appendices.

2 Static agent-based model

For simplicity we begin with the static model $X^{n}\sim\text{Ber}(\alpha^{n})$ independently for $n\in[1:N]$, and allow each agent to be unique by modeling $\alpha^{n}=(1+\exp(-\beta^{\top}w^{n}))^{-1}$, where $\beta\in\mathbb{R}^{d}$ are parameters and $w^{n}\in\mathbb{R}^{d}$ are the covariates of agent $n$. This allows individual factors to drive the probabilities of infection $\alpha=(\alpha^{n})_{n\in[1:N]}$. Given an observation $y\in[0:N]$ modelled as in (3), our inferential objectives might include estimating the unknown parameters $\theta=(\beta,\rho)\in\Theta$ and/or the latent states $X=(X^{n})_{n\in[1:N]}\in\{0,1\}^{N}$. The complete data likelihood is $p_{\theta}(x,y)=f_{\theta}(x)g_{\theta}(y|x)$, where

f_{\theta}(x)=\prod_{n=1}^{N}\text{Ber}(x^{n};\alpha^{n}),\quad g_{\theta}(y|x)=\text{Bin}(y;I(x),\rho)\mathbbm{1}(I(x)\geq y). (7)

Here $\text{Ber}(x^{n};\alpha^{n})$ and $\text{Bin}(y;I(x),\rho)$ denote the corresponding probability mass functions (PMFs). Marginalizing over the latent states yields the likelihood $p_{\theta}(y)$. The agent states given the observation and a parameter follow the posterior distribution $p_{\theta}(x|y)$. In Sections 2.1 and 2.2, we examine the cost of exactly computing $p_{\theta}(y)$ and of sampling from $p_{\theta}(x|y)$, respectively, and describe cheaper approximations. The gains are assessed numerically in Section 2.3.

2.1 Marginal likelihood computations

A naive approach to computing the marginal likelihood $p_{\theta}(y)$ is to sum over all possible population configurations

p_{\theta}(y)=\sum_{x\in\{0,1\}^{N}}p_{\theta}(x,y)=\sum_{x\in\{0,1\}^{N}}f_{\theta}(x)g_{\theta}(y|x). (8)

This requires $\mathcal{O}(2^{N})$ operations. A simple Monte Carlo approach involves sampling $X^{(p)}=(X^{(p),n})_{n\in[1:N]}\sim f_{\theta}$ independently for $p\in[1:P]$, which represents $P$ possible configurations of the population, and returning the Monte Carlo estimator $P^{-1}\sum_{p=1}^{P}g_{\theta}(y|X^{(p)})$ that weights each configuration according to the observation density. The estimator is unbiased and only costs $\mathcal{O}(NP)$ to compute, but its variance might be prohibitively large for practical values of $P$, depending on the observation model $g_{\theta}(y|x)$. Another issue with this estimator is that it can collapse to zero whenever $I(X^{(p)})<y$ for all $p\in[1:P]$. Following Del Moral et al. (2015), this can be circumvented by repeatedly drawing samples $X^{(p)}\sim f_{\theta}$ independently until there are $P$ configurations that satisfy $I(X^{(p)})\geq y$, and returning the estimator $(R-1)^{-1}\sum_{p=1}^{R-1}g_{\theta}(y|X^{(p)})$, where $R\geq P$ denotes the number of repetitions needed. The resulting estimator can be shown to be unbiased and has a random cost of $\mathcal{O}(N\,\mathbb{E}[R])$ that depends on the value of $y$. An obvious shortcoming of the above estimators is that the agents are sampled from $f_{\theta}$ without using the available observation $y$.
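The plain Monte Carlo estimator above can be sketched as follows (the helper `binom_pmf` and all parameter values are ours, for illustration only):

```python
import math
import numpy as np

def binom_pmf(y, n, p):
    """Bin(y; n, p) mass function, returning zero when y > n."""
    if y > n:
        return 0.0
    return math.comb(n, y) * p**y * (1.0 - p)**(n - y)

def marginal_likelihood_mc(y, alpha, rho, P, rng):
    """Unbiased plain Monte Carlo estimate of p_theta(y) in (8):
    average g_theta(y | X^(p)) over P draws X^(p) ~ f_theta."""
    N = len(alpha)
    total = 0.0
    for _ in range(P):
        x = rng.uniform(size=N) < alpha          # X^(p),n ~ Ber(alpha^n)
        total += binom_pmf(y, int(x.sum()), rho)
    return total / P

rng = np.random.default_rng(3)
alpha = rng.uniform(0.2, 0.6, size=50)
est = marginal_likelihood_mc(y=15, alpha=alpha, rho=0.8, P=5000, rng=rng)
```

Note that the indicator in (7) is handled implicitly: configurations with $I(x)<y$ contribute zero mass, which is exactly how the estimator can collapse when $y$ is large.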

We can in fact reduce the cost of computing (8) without resorting to Monte Carlo approximations. The starting observation is that $I=I(X)$ under $X\sim f_{\theta}$ is the sum of $N$ independent Bernoulli random variables with non-identical success probabilities, and follows a distribution called “Poisson Binomial”. We will refer to this distribution as $\text{PoiBin}(\alpha)$ and write its PMF as

\text{PoiBin}(i;\alpha)=\sum_{x\in\{0,1\}^{N}}\mathbbm{1}\left(\sum_{n=1}^{N}x^{n}=i\right)\prod_{n=1}^{N}(\alpha^{n})^{x^{n}}(1-\alpha^{n})^{1-x^{n}},\quad i\in[0:N]. (9)

Exact evaluation of (9) has been considered in Barlow and Heidtmann (1984); Chen et al. (1994); Chen and Liu (1997); Hong (2013) using different approaches. Defining $q(i,n)=\text{PoiBin}(i;\alpha^{n:N})$ for $i\in[0:N]$ and $n\in[1:N]$, we will employ the following recursion

q(i,n)=\alpha^{n}q(i-1,n+1)+(1-\alpha^{n})q(i,n+1),\quad i\in[1:N],\ n\in[1:N-1], (10)

with initial conditions $q(0,n)=\prod_{m=n}^{N}(1-\alpha^{m})$ for $n\in[1:N]$, $q(1,N)=\alpha^{N}$ and $q(i,N)=0$ for $i\in[2:N]$. The desired PMF $q(i,1)=\text{PoiBin}(i;\alpha)$ for $i\in[0:N]$ can thus be computed in $\mathcal{O}(N^{2})$ operations; see Appendix A.1 for a derivation of (10).
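A direct implementation of the recursion (10) can be sketched as follows (ours, for illustration; agents are 0-indexed, so column $n$ of the table corresponds to $\alpha^{n+1:N}$ in the paper's notation):

```python
import numpy as np

def poisson_binomial_pmf(alpha):
    """PMF of I = sum_n X^n with X^n ~ Ber(alpha^n) independent,
    computed via the backward recursion (10) in O(N^2) operations."""
    alpha = np.asarray(alpha, dtype=float)
    N = len(alpha)
    # q[i, n] = PoiBin(i; alpha[n:]) for the agents n, n+1, ..., N-1
    q = np.zeros((N + 1, N))
    q[0, N - 1] = 1.0 - alpha[N - 1]     # last agent alone: count 0 or 1
    q[1, N - 1] = alpha[N - 1]
    for n in range(N - 2, -1, -1):       # fill columns backward in n
        q[0, n] = (1.0 - alpha[n]) * q[0, n + 1]
        for i in range(1, N + 1):
            q[i, n] = alpha[n] * q[i - 1, n + 1] + (1.0 - alpha[n]) * q[i, n + 1]
    return q[:, 0]                       # PoiBin(i; alpha) for i = 0, ..., N
```

When all $\alpha^{n}$ are equal, the Poisson Binomial reduces to a Binomial, which provides a convenient sanity check.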

Using the above observation, we can rewrite the marginal likelihood as

p_{\theta}(y)=\sum_{i=0}^{N}\text{PoiBin}(i;\alpha)\,\text{Bin}(y;i,\rho)\mathbbm{1}(i\geq y), (11)

which costs $\mathcal{O}(N^{2})$ to compute. Using a thinning argument detailed in Appendix A.2, the above sum is in fact equal to $p_{\theta}(y)=\text{PoiBin}(y;\rho\,\alpha)$. Although the marginal likelihood will not admit such tractability in the general setup considered in Section 3, the preceding observations will guide our choice of approximations.

One can also rely on approximations of the Poisson Binomial distribution to further reduce the cost of computing (11) to $\mathcal{O}(N)$. Choices include the Poisson approximation (Hodges and Le Cam, 1960; Barbour and Hall, 1984; Wang, 1993), the Normal approximation (Volkova, 1996) and the translated Poisson approximation (Barbour and Xia, 1999; Cekanavicius and Vaitkus, 2001; Barbour and Čekanavičius, 2002). We will focus on the translated Poisson approximation, which matches the mean of a Poisson Binomial distribution exactly and its variance approximately. Let $\text{Poi}(\lambda)$ denote a Poisson distribution with rate $\lambda>0$, and write the mean and variance of $\text{PoiBin}(\alpha)$ as $\mu=\sum_{n=1}^{N}\alpha^{n}$ and $\sigma^{2}=\sum_{n=1}^{N}\alpha^{n}(1-\alpha^{n})$, respectively. The translated Poisson approximation of (9) is given by

\text{TransPoi}(i;\mu,\sigma^{2})=\begin{cases}0,&i\in[0:\lfloor\mu-\sigma^{2}\rfloor-1],\\ \text{Poi}(i-\lfloor\mu-\sigma^{2}\rfloor;\sigma^{2}+\{\mu-\sigma^{2}\}),&i\in[\lfloor\mu-\sigma^{2}\rfloor:N],\end{cases} (12)

where $\lfloor\mu-\sigma^{2}\rfloor$ and $\{\mu-\sigma^{2}\}$ denote the integer and fractional parts of $\mu-\sigma^{2}$, respectively. Since $\mu$ and $\sigma^{2}$ can be computed in $\mathcal{O}(N)$ operations, the translated Poisson approximation (12) and the resulting approximation of (11) only require $\mathcal{O}(N)$ operations. Hence this can be appealing for large population sizes $N$, at the expense of an approximation error that is well-studied. Indeed, results in Cekanavicius and Vaitkus (2001, Theorem 2.1 & Corollary 2.1) and Barbour and Čekanavičius (2002, Theorem 3.1) imply that the error, measured in total variation distance, decays at the rate $N^{-1/2}$ as $N\rightarrow\infty$.
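A sketch of the translated Poisson PMF (12), evaluated on the log scale for numerical stability (that implementation choice is ours, not dictated by the paper):

```python
import math
import numpy as np

def trans_poisson_pmf(i, alpha):
    """Translated Poisson approximation (12) of PoiBin(i; alpha): a Poisson
    shifted by floor(mu - sigma^2), with rate sigma^2 + frac(mu - sigma^2)."""
    alpha = np.asarray(alpha, dtype=float)
    mu = float(alpha.sum())                          # mean of PoiBin(alpha)
    sigma2 = float((alpha * (1.0 - alpha)).sum())    # variance of PoiBin(alpha)
    shift = math.floor(mu - sigma2)
    if i < shift:
        return 0.0                                   # no mass below the shift
    rate = sigma2 + (mu - sigma2 - shift)            # sigma^2 + {mu - sigma^2}
    k = i - shift
    # Poisson(k; rate) via logs: exp(-rate + k log rate - log k!)
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))
```

By construction the mean of the approximation, $\lfloor\mu-\sigma^{2}\rfloor+\sigma^{2}+\{\mu-\sigma^{2}\}$, equals $\mu$ exactly.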

2.2 Posterior sampling of agent states

Sampling from the posterior distribution $p_{\theta}(x|y)$ by naively enumerating all $2^{N}$ configurations is computationally impractical. A key observation is that the observation density in (7) depends on the high-dimensional latent state $X\in\{0,1\}^{N}$ only through the one-dimensional summary $I(X)\in[0:N]$. This prompts the inclusion of $I=I(X)$ as an auxiliary variable. The joint posterior distribution can thus be decomposed as

p_{\theta}(x,i|y)=p_{\theta}(i|y)p_{\theta}(x|i). (13)

The dominant cost of sampling the posterior distribution of the summary

p_{\theta}(i|y)=\frac{p_{\theta}(i)p_{\theta}(y|i)}{p_{\theta}(y)}=\frac{\text{PoiBin}(i;\alpha)\text{Bin}(y;i,\rho)\mathbbm{1}(i\geq y)}{p_{\theta}(y)},\quad i\in[0:N], (14)

is the evaluation of the Poisson Binomial PMF (9). Recall from Section 2.1 that this can be done exactly in $\mathcal{O}(N^{2})$ operations, and in $\mathcal{O}(N)$ using a translated Poisson approximation. The conditional distribution of the latent state given the summary

p_{\theta}(x|i)=\frac{p_{\theta}(x)p_{\theta}(i|x)}{p_{\theta}(i)}=\frac{\prod_{n=1}^{N}(\alpha^{n})^{x^{n}}(1-\alpha^{n})^{1-x^{n}}\mathbbm{1}(I(x)=i)}{\text{PoiBin}(i;\alpha)} (15)

is known as a conditional Bernoulli distribution, which we will write as $p_{\theta}(x|i)=\text{CondBer}(x;\alpha,i)$. Various sampling schemes have been proposed (Fan et al., 1962; Chen et al., 1994); see Chen and Liu (1997, Section 4) for an overview. We will rely on the sequential decomposition

p_{\theta}(x|i)=p_{\theta}(x^{1}|i)\prod_{n=2}^{N}p_{\theta}(x^{n}|x^{1:n-1},i)=\prod_{n=1}^{N-1}\frac{(\alpha^{n})^{x^{n}}(1-\alpha^{n})^{1-x^{n}}q(i-i_{n-1}-x^{n},n+1)}{q(i-i_{n-1},n)}\mathbbm{1}(x^{N}=i-i_{N-1}), (16)

where $i_{0}=0$ and $i_{n}=\sum_{m=1}^{n}x^{m}$ for $n\in[1:N-1]$. A derivation of (16) can be found in Appendix A of Heng et al. (2020b). The values of $q(j,n)$ for $j\in[0:i]$, $n\in[1:N]$ needed in (16) can be precomputed using the same recursion as (10) at $\mathcal{O}(iN)$ cost. This precomputation is unnecessary if these values are stored when computing Poisson Binomial probabilities.
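The sequential decomposition (16) yields an exact sampler for $\text{CondBer}(\alpha,i)$: scan the agents in order, and at each step set $x^{n}=1$ with the conditional probability implied by the precomputed $q$ table. A sketch with 0-based indexing (our implementation of the scheme described above):

```python
import numpy as np

def cond_bernoulli_sample(alpha, i, rng):
    """Draw x ~ CondBer(alpha, i): independent Ber(alpha^n) conditioned
    on sum_n x^n = i, via the sequential decomposition (16)."""
    alpha = np.asarray(alpha, dtype=float)
    N = len(alpha)
    # q[j, n] = PoiBin(j; alpha[n:]), built with recursion (10); only
    # counts j <= i are needed, giving the O(iN) precomputation cost
    q = np.zeros((i + 2, N + 1))
    q[0, N] = 1.0                        # empty product: count 0 w.p. 1
    for n in range(N - 1, -1, -1):
        for j in range(i + 1):
            q[j, n] = (1.0 - alpha[n]) * q[j, n + 1]
            if j > 0:
                q[j, n] += alpha[n] * q[j - 1, n + 1]
    x = np.zeros(N, dtype=int)
    remaining = i                        # ones still to be placed
    for n in range(N):
        if remaining == 0:
            break                        # all further agents must be 0
        # P(x^n = 1 | x^{1:n-1}, i) = alpha^n q(remaining-1, n+1) / q(remaining, n)
        if rng.uniform() < alpha[n] * q[remaining - 1, n + 1] / q[remaining, n]:
            x[n] = 1
            remaining -= 1
    return x
```

Every draw satisfies $I(x)=i$ by construction: when the remaining count equals the number of remaining agents, the conditional probability of a one becomes exactly one.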

To reduce the cost we can employ a Markov chain Monte Carlo (MCMC) method to sample from the conditional Bernoulli distribution $\text{CondBer}(\alpha,i)$. This incurs a cost of $\mathcal{O}(1)$ per iteration and converges in $\mathcal{O}(N\log N)$ iterations, under some mild assumptions on $\alpha$, as shown in Heng et al. (2020b). On a related note, we can design MCMC algorithms to target $p_{\theta}(x|y)$ directly. These might, for example, update the states of a few agents by sampling from their conditional distributions given all other variables, and alternately propose to swap zeros and ones in the vector $X\in\{0,1\}^{N}$. Each of these steps can be done at a cost independent of $N$, but the number of iterations needed for convergence is expected to grow at least linearly with $N$.
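A minimal sketch of one such MCMC move targeting $\text{CondBer}(\alpha,i)$: propose swapping a randomly chosen 1-entry with a randomly chosen 0-entry, which preserves $I(x)$ by construction, and accept with the Metropolis ratio (this particular kernel is our illustration, not necessarily the one analyzed in Heng et al. (2020b)):

```python
import numpy as np

def condber_swap_kernel(x, alpha, n_iters, rng):
    """Metropolis kernel on {x : I(x) = i} targeting CondBer(alpha, i).
    The swap proposal is symmetric, so the acceptance probability is
    the ratio of unnormalized target probabilities."""
    x = x.copy()
    for _ in range(n_iters):
        ones = np.flatnonzero(x == 1)
        zeros = np.flatnonzero(x == 0)
        if len(ones) == 0 or len(zeros) == 0:
            break                        # i = 0 or i = N: nothing to swap
        j, k = rng.choice(ones), rng.choice(zeros)
        # moving the 1 from position j to position k multiplies the target by
        # (alpha^k / (1 - alpha^k)) * ((1 - alpha^j) / alpha^j)
        ratio = (alpha[k] / (1.0 - alpha[k])) * ((1.0 - alpha[j]) / alpha[j])
        if rng.uniform() < ratio:
            x[j], x[k] = 0, 1
    return x
```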

2.3 Numerical illustration

We set up numerical experiments as follows. We generate covariates $(w^{n})$ from $\mathcal{N}(4,1)$ for $N=1000$ individuals independently, where $\mathcal{N}(\mu,\sigma^{2})$ denotes a Normal distribution with mean $\mu\in\mathbb{R}$ and variance $\sigma^{2}>0$. Individual-specific infection probabilities are computed as $\alpha^{n}=(1+\exp(-\beta w^{n}))^{-1}$ with $\beta=0.3$. We then simulate $X^{n}\sim\text{Ber}(\alpha^{n})$ independently for all $n$, and sample $Y\sim\text{Bin}(\sum_{n=1}^{N}X^{n},\rho)$ with $\rho=0.8$. We adopt a Bayesian approach and assign independent priors: $\mathcal{N}(0,1)$ on $\beta$ and $\text{Uniform}(0,1)$ on $\rho$. We focus on the task of sampling from the posterior distribution of $\theta=(\beta,\rho)$ given $y$ and the covariates $(w^{n})$.

We consider random walk Metropolis–Hastings with exact likelihood calculation (“MH-exact”), associated with a quadratic cost in $N$. We also consider the same algorithm with the likelihood approximated by a translated Poisson (“MH-tp”), with a cost linear in $N$. Finally, we consider a pseudo-marginal approach (Andrieu and Roberts, 2009; Andrieu et al., 2010) with the likelihood estimated with $P=20$ particles sampled from $f_{\theta}$ (“PMMH”). Note that $P=20$ samples resulted in a variance of the log-likelihood estimates of approximately $0.3$ at the data generating parameters (DGP). These samplers employ the same random walk, based on Normal proposals, independently on $\beta$ and $\log(\rho/(1-\rho))$ with a standard deviation of $0.2$. As a baseline we also consider a Gibbs sampler that alternates between updates of $\theta$ given $X$, employing the same proposal as above, and updates of $X$ given $\theta$. The latter employ an equally weighted mixture of kernels, performing either $N$ random swap updates or a systematic Gibbs scan of the $N$ components of $X$; thus the cost per iteration is linear in $N$.

We first run “MH-exact” for $100{,}000$ MCMC iterations (after a burn-in of $5{,}000$ iterations) to obtain estimates of the posterior means of $\beta$ and $\rho$. Using these posterior estimates as ground truth, we compute the mean squared error (MSE) of the posterior approximations obtained using each method with $20{,}000$ MCMC iterations (excluding a burn-in of $5{,}000$) and 50 independent repetitions. Table 1 displays the MSE, as well as the relative wall-clock time to obtain each estimate. Comparing the results of MH-exact and MH-tp shows that considerable effort can be saved, at the expense of small differences in the parameter estimates, using a translated Poisson approximation. The appeal of the PMMH approach over exact likelihood calculations is also clear. On the other hand, Gibbs samplers that alternate between updates of $\theta$ and $X$ do not appear competitive in this example.

Method      $\mathbb{E}[\beta|y]$        $\mathbb{E}[\rho|y]$         Relative cost
            Bias$^2$    Variance     Bias$^2$    Variance
MH-exact    25          93.3         0.74        6.39         128
MH-tp       22          52.3         0.32        2.83         1
PMMH        18          79.2         0.50        4.67         8
Gibbs       1040        113          91.1        2.85         42
Table 1: Squared bias and variance when approximating the posterior means of $\beta$ and $\rho$ with each inference method (in units of $10^{-4}$). The rightmost column displays the average cost of each method, measured in run-time relative to MH-tp.

3 Susceptible-Infected-Susceptible model

We now extend the agent-based SIS model of Section 1.2. To allow for individual-specific attributes, for agent $n\in[1:N]$, we model the initial infection probability $\alpha_{0}^{n}$, infection rate $\lambda^{n}$ and recovery rate $\gamma^{n}$ as

\alpha_{0}^{n}=(1+\exp(-\beta_{0}^{\top}w^{n}))^{-1},\quad \lambda^{n}=(1+\exp(-\beta_{\lambda}^{\top}w^{n}))^{-1},\quad \gamma^{n}=(1+\exp(-\beta_{\gamma}^{\top}w^{n}))^{-1}. (17)

Here $\beta_{0},\beta_{\lambda},\beta_{\gamma}\in\mathbb{R}^{d}$ are parameters and $w^{n}\in\mathbb{R}^{d}$ are the covariates of agent $n$. The interactions between agents are assumed to be known and specified by an undirected network; inference of the network structure and extensions to the time-varying case could be considered in future work. We write $\mathcal{D}(n)$ and $\mathcal{N}(n)$ for the degree and the set of neighbours of agent $n\in[1:N]$.

For ease of presentation, we consider time steps of size $h=1$. The time evolution of the population is given by

X_{0}\sim\mu_{\theta},\quad X_{t}|X_{t-1}=x_{t-1}\sim f_{\theta}(\cdot|x_{t-1}),\quad t\in[1:T]. (18)

The initial distribution $\mu_{\theta}(x_{0})=\prod_{n=1}^{N}\text{Ber}(x_{0}^{n};\alpha_{0}^{n})$ corresponds to the static model in Section 2 with infection probabilities $\alpha_{0}=(\alpha_{0}^{n})_{n\in[1:N]}$. The Markov transition $f_{\theta}(x_{t}|x_{t-1})=\prod_{n=1}^{N}\text{Ber}(x_{t}^{n};\alpha^{n}(x_{t-1}))$ has conditional probabilities $\alpha(x_{t-1})=(\alpha^{n}(x_{t-1}))_{n\in[1:N]}$ given by

\alpha^{n}(x_{t-1})=\begin{cases}\lambda^{n}\mathcal{D}(n)^{-1}\sum_{m\in\mathcal{N}(n)}x_{t-1}^{m},&\text{if }x_{t-1}^{n}=0,\\ 1-\gamma^{n},&\text{if }x_{t-1}^{n}=1,\end{cases} (19)

for n[1:N]n\in[1:N]. We will assume that the cost of evaluating α\alpha is 𝒪(N)\mathcal{O}(N). We refer readers to McVinish and Pollett (2012) for the asymptotic behaviour of this process as NN\to\infty, in the case of a fully connected network. In (19) we see the proportion of infected neighbours appearing in the infection probability at each time step. As an individual can have covariates wnw^{n} that include measures of contact frequency, which would affect αn\alpha^{n} via λn\lambda^{n}, it is possible for an individual with many neighbours (large 𝒟(n)\mathcal{D}(n)) to have low infection probability if the frequency of contact is low.

Equations (17)-(19) and the observation model in (3) define a hidden Markov model on $\{0,1\}^{N}$, with unknown parameters $\theta=(\beta_{0},\beta_{\lambda},\beta_{\gamma},\rho)\in\Theta=\mathbb{R}^{3d}\times(0,1)$; see Figure 1 for a graphical model representation in the case of a fully connected network. Given an observation sequence $y_{0:T}\in[0:N]^{T+1}$, the complete data likelihood is given by

p_{\theta}(x_{0:T},y_{0:T})=p_{\theta}(x_{0:T})p_{\theta}(y_{0:T}|x_{0:T})=\mu_{\theta}(x_{0})\prod_{t=1}^{T}f_{\theta}(x_{t}|x_{t-1})\prod_{t=0}^{T}g_{\theta}(y_{t}|x_{t}). (20)

Parameter inference will require marginalizing over the latent process to obtain the marginal likelihood $p_{\theta}(y_{0:T})$, and estimation of agent states will involve sampling from the smoothing distribution $p_{\theta}(x_{0:T}|y_{0:T})$. Exact computation of the marginal likelihood and of the marginals of the smoothing distribution using the forward algorithm and the forward-backward algorithm, respectively, requires $\mathcal{O}(2^{2N}T)$ operations. As this is computationally prohibitive for large $N$, we will rely on sequential Monte Carlo (SMC) approximations.

In Section 3.1, we describe how SMC methods can be used to approximate the marginal likelihood and the smoothing distribution. Like many Monte Carlo schemes, the efficiency of SMC relies crucially on the choice of proposal distributions. The bootstrap particle filter (BPF) (Gordon et al., 1993), which corresponds to taking the joint distribution of the latent process (18) as proposal, can perform poorly in practice when the observations are informative. Building on the insights from Section 2, we will show how the fully adapted auxiliary particle filter (APF) (Pitt and Shephard, 1999; Carpenter et al., 1999) can be implemented. As the APF constructs a proposal transition at each time step that takes the next observation into account, it often, although not always, performs better than the BPF (Johansen and Doucet, 2008). In Section 3.2, we adapt the ideas of Guarniero et al. (2017); Heng et al. (2020a) to our setting and present a novel controlled SMC (cSMC) method that can significantly outperform the APF. Central to our approach is to take the entire observation sequence $y_{0:T}$ into account by constructing proposal distributions that approximate the smoothing distribution $p_{\theta}(x_{0:T}|y_{0:T})$. Using a simulated dataset, in Section 3.3 we show that cSMC provides orders of magnitude improvements over the BPF and APF in terms of estimating the marginal likelihood. In Section 3.4, we illustrate the behaviour of the marginal likelihood as the number of observations increases, and perform parameter inference and predictions. We consider MCMC strategies as alternatives to SMC-based methods in Appendix C.

Figure 1: Graphical model representation of the agent-based SIS model in Section 3 for $T=4$ time steps and a fully connected network with $N=3$ agents.

3.1 Sequential Monte Carlo

SMC methods (Liu and Chen, 1998; Doucet et al., 2001), also known as particle filters, provide approximations of p_{\theta}(y_{0:T}) and p_{\theta}(x_{0:T}|y_{0:T}) by simulating an interacting particle system of size P\in\mathbb{N}. In the following, we give a generic description of SMC that includes several algorithms in a common framework.

At the initial time, one samples P configurations of the population from a proposal distribution q_{0} on \{0,1\}^{N}, i.e. X_{0}^{(p)}=(X_{0}^{(p),n})_{n\in[1:N]}\sim q_{0}(\cdot|\theta) independently for p\in[1:P]. Each sampled configuration is then assigned a weight W_{0}^{(p)}\propto w_{0}(X_{0}^{(p)}), normalized to sum to one. To focus computation on the more likely configurations, we perform an operation known as resampling, which discards some configurations and duplicates others according to their weights. Each resampling scheme involves sampling ancestor indices (A_{0}^{(p)})_{p\in[1:P]}\in[1:P]^{P} from a distribution r(\cdot|W_{0}^{(1:P)}) on [1:P]^{P}. The simplest scheme is multinomial resampling (Gordon et al., 1993), which samples (A_{0}^{(p)})_{p\in[1:P]} independently from the categorical distribution on [1:P] with probabilities W_{0}^{(1:P)}; other lower-variance and adaptive resampling schemes can also be employed (Fearnhead and Clifford, 2003; Gerber et al., 2019). Subsequently, for each time step t\in[1:T], one propagates each resampled configuration according to a proposal transition q_{t} on \{0,1\}^{N}, i.e. X_{t}^{(p)}=(X_{t}^{(p),n})_{n\in[1:N]}\sim q_{t}(\cdot|X_{t-1}^{(A_{t-1}^{(p)})},\theta) independently for p\in[1:P]. As before, we weight each new configuration according to W_{t}^{(p)}\propto w_{t}(X_{t-1}^{(A_{t-1}^{(p)})},X_{t}^{(p)}), and, for t<T, resample by drawing the ancestor indices (A_{t}^{(p)})_{p\in[1:P]}\sim r(\cdot|W_{t}^{(1:P)}). For notational simplicity, we suppress the dependence of the weight functions (w_{t})_{t\in[0:T]} on the parameter \theta.
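As an illustration, the multinomial resampling step can be sketched in a few lines of Python; the function name and toy weights below are ours and only serve to make the sampling of ancestor indices concrete.

```python
import random

def multinomial_resampling(weights, rng=random):
    """Draw ancestor indices independently from the categorical
    distribution on [0:P-1] with the given normalized weights."""
    P = len(weights)
    return rng.choices(range(P), weights=weights, k=P)

random.seed(0)
W = [0.7, 0.1, 0.1, 0.1]             # toy normalized weights W_0^{(1:P)}
ancestors = multinomial_resampling(W)
```

A degenerate weight vector sends every ancestor index to the same particle, which is exactly how resampling discards unlikely configurations.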
To approximate the desired quantities p_{\theta}(y_{0:T}) and p_{\theta}(x_{0:T}|y_{0:T}), these weight functions have to satisfy

\displaystyle w_{0}(x_{0})\prod_{t=1}^{T}w_{t}(x_{t-1},x_{t})=\frac{\mu_{\theta}(x_{0})\prod_{t=1}^{T}f_{\theta}(x_{t}|x_{t-1})\prod_{t=0}^{T}g_{\theta}(y_{t}|x_{t})}{q_{0}(x_{0}|\theta)\prod_{t=1}^{T}q_{t}(x_{t}|x_{t-1},\theta)}. (21)

Given the output of the above simulation, an unbiased estimator of the marginal likelihood p_{\theta}(y_{0:T}) is

\displaystyle\hat{p}_{\theta}(y_{0:T})=\left\{\frac{1}{P}\sum_{p=1}^{P}w_{0}(X_{0}^{(p)})\right\}\left\{\prod_{t=1}^{T}\frac{1}{P}\sum_{p=1}^{P}w_{t}(X_{t-1}^{(A_{t-1}^{(p)})},X_{t}^{(p)})\right\}, (22)

and a particle approximation of the smoothing distribution is given by

\displaystyle\hat{p}_{\theta}(x_{0:T}|y_{0:T})=\sum_{p=1}^{P}W_{T}^{(p)}\delta_{X_{0:T}^{(p)}}(x_{0:T}). (23)

In the latter approximation, each trajectory X_{0:T}^{(p)} is formed by tracing the ancestral lineage of X_{T}^{(p)}, i.e. X_{0:T}^{(p)}=(X_{t}^{(B_{t}^{(p)})})_{t\in[0:T]} with B_{T}^{(p)}=p and B_{t}^{(p)}=A_{t}^{(B_{t+1}^{(p)})} for t\in[0:T-1]. Convergence properties of these approximations as the size of the particle system P\rightarrow\infty are well-studied; see for example Del Moral (2004). However, the quality of these approximations depends crucially on the choice of proposals (q_{t})_{t\in[0:T]} and the corresponding weight functions (w_{t})_{t\in[0:T]} that satisfy (21).
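The backward tracing of ancestral lineages can be made concrete with a short sketch; the helper below and its toy ancestor indices are ours.

```python
def trace_lineages(ancestors, T, P):
    """Recover B_t^{(p)}, the time-t ancestor of particle p at time T,
    from the resampled indices A_t^{(p)} for t in [0:T-1]."""
    B = [[0] * P for _ in range(T + 1)]
    B[T] = list(range(P))                                     # B_T^{(p)} = p
    for t in range(T - 1, -1, -1):
        B[t] = [ancestors[t][B[t + 1][p]] for p in range(P)]  # B_t = A_t^{(B_{t+1})}
    return B

A = [[0, 0, 2], [1, 2, 2]]   # toy ancestor indices for T = 2 resampling steps, P = 3
B = trace_lineages(A, T=2, P=3)
```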

The BPF of Gordon et al. (1993) can be recovered by employing the proposals q_{0}(x_{0}|\theta)=\mu_{\theta}(x_{0}), q_{t}(x_{t}|x_{t-1},\theta)=f_{\theta}(x_{t}|x_{t-1}) for t\in[1:T] and the weight functions w_{t}(x_{t})=g_{\theta}(y_{t}|x_{t}) for t\in[0:T]. Although the BPF costs only \mathcal{O}(NTP) to implement and has convergence guarantees as P\rightarrow\infty, the variance of its marginal likelihood estimator can be too large to deploy within particle MCMC schemes for practical values of P (see Section 3.3). Another issue with its marginal likelihood estimator, for this particular choice of observation equation, is that it can collapse to zero if all proposed configurations have fewer infections than the observed value, i.e. there exists t\in[0:T] such that I(X_{t}^{(p)})<y_{t} for all p\in[1:P]. With increased cost, this issue can be circumvented using the alive particle filter of Del Moral et al. (2015), which repeatedly draws samples at each time step until there are P configurations whose numbers of infections are larger than or equal to the observed value.
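To make the collapse phenomenon concrete, the following self-contained sketch runs a BPF on a simplified SIS model with homogeneous rates on a fully connected network; all rates, sizes and function names are our own toy choices rather than the paper's data-generating setup.

```python
import math
import random

random.seed(7)
N, T, P = 50, 20, 200
lam, gam, rho = 0.6, 0.2, 0.8   # toy homogeneous rates and reporting probability

def step(x, rng):
    # one SIS transition: a susceptible gets infected w.p. lam * I(x)/N,
    # an infected stays infected w.p. 1 - gam
    i = sum(x)
    return [int(rng.random() < ((1 - gam) if xn else lam * i / N)) for xn in x]

def binom_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y) if 0 <= y <= n else 0.0

# simulate observations y_t ~ Bin(I(x_t), rho) from the same model
x = [int(random.random() < 0.2) for _ in range(N)]
ys = []
for _ in range(T + 1):
    ys.append(sum(1 for n in range(N) if x[n] and random.random() < rho))
    x = step(x, random)

# bootstrap particle filter: propose from the prior, weight by g(y_t | x_t)
particles = [[int(random.random() < 0.2) for _ in range(N)] for _ in range(P)]
log_lik = 0.0
for t, y in enumerate(ys):
    if t > 0:
        particles = [step(xp, random) for xp in particles]
    w = [binom_pmf(y, sum(xp), rho) for xp in particles]
    if sum(w) == 0.0:   # all particles have I(x_t) < y_t: the estimator collapses to zero
        log_lik = float('-inf')
        break
    log_lik += math.log(sum(w) / P)
    particles = [particles[a] for a in random.choices(range(P), weights=w, k=P)]
```

Replacing an observation by an implausibly large count makes the collapse branch fire with high probability, which is the failure mode that the alive particle filter and the adapted proposals are designed to avoid.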

Alternatively, one can construct better proposals with supports that respect these observational constraints. One such option is the fully adapted APF (Pitt and Shephard, 1999; Carpenter et al., 1999), which corresponds to having the proposals q_{0}(x_{0}|\theta)=p_{\theta}(x_{0}|y_{0}), q_{t}(x_{t}|x_{t-1},\theta)=p_{\theta}(x_{t}|x_{t-1},y_{t}) for t\in[1:T] and the weight functions w_{0}(x_{0})=p_{\theta}(y_{0}) and w_{t}(x_{t-1})=p_{\theta}(y_{t}|x_{t-1}) for t\in[1:T]. At the initial time, computing the marginal likelihood p_{\theta}(y_{0}) and sampling from the posterior of agent states p_{\theta}(x_{0}|y_{0}) can be done exactly (or approximately) as described in Sections 2.1 and 2.2 respectively for the static model. More precisely, we compute

\displaystyle p_{\theta}(y_{0})=\sum_{i_{0}=0}^{N}\text{PoiBin}(i_{0};\alpha_{0})\text{Bin}(y_{0};i_{0},\rho)\mathbbm{1}(i_{0}\geq y_{0}) (24)

and sample from

\displaystyle p_{\theta}(x_{0},i_{0}|y_{0})=p_{\theta}(i_{0}|y_{0})p_{\theta}(x_{0}|i_{0})=\frac{\text{PoiBin}(i_{0};\alpha_{0})\text{Bin}(y_{0};i_{0},\rho)\mathbbm{1}(i_{0}\geq y_{0})}{p_{\theta}(y_{0})}\text{CondBer}(x_{0};\alpha_{0},i_{0}), (25)

which admits p_{\theta}(x_{0}|y_{0}) as its marginal distribution. For time step t\in[1:T], by conditioning on the previous configuration x_{t-1}\in\{0,1\}^{N}, the same ideas can be used to compute the predictive likelihood p_{\theta}(y_{t}|x_{t-1}) and sample from the transition p_{\theta}(x_{t}|x_{t-1},y_{t}), i.e. we compute

\displaystyle p_{\theta}(y_{t}|x_{t-1})=\sum_{i_{t}=0}^{N}\text{PoiBin}(i_{t};\alpha(x_{t-1}))\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t}) (26)

and sample from

\displaystyle p_{\theta}(x_{t},i_{t}|x_{t-1},y_{t})=p_{\theta}(i_{t}|x_{t-1},y_{t})p_{\theta}(x_{t}|x_{t-1},i_{t})=\frac{\text{PoiBin}(i_{t};\alpha(x_{t-1}))\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})}{p_{\theta}(y_{t}|x_{t-1})}\text{CondBer}(x_{t};\alpha(x_{t-1}),i_{t}), (27)

which admits p_{\theta}(x_{t}|x_{t-1},y_{t}) as its marginal transition.

An algorithmic description of the resulting APF is detailed in Algorithm 1, where the notation \text{Cat}([0:N],V^{(0:N)}) refers to the categorical distribution on [0:N] with probabilities V^{(0:N)}=(V^{(i)})_{i\in[0:N]}. As the weights of the fully adapted APF at time t\in[1:T] only depend on the configuration at time t-1, note that we have interchanged the order of sampling and resampling to promote sample diversity. The cost of running the APF exactly is \mathcal{O}(N^{2}TP). To reduce the computational cost to \mathcal{O}(N\log(N)TP), one can approximate the above Poisson binomial PMFs with the translated Poisson approximation (12), and employ MCMC to sample from the above conditioned Bernoulli distributions.
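For concreteness, the exact \mathcal{O}(N^{2}) computation of the Poisson binomial PMF by iterative convolution, together with the initial-step quantities (24)-(25) used in Algorithm 1, can be sketched as follows; the function names and the probabilities in alpha0 are our own toy choices. Note that the indicator \mathbbm{1}(i\geq y_{0}) is implicit, since \text{Bin}(y_{0};i,\rho)=0 whenever i<y_{0}.

```python
import math
import random

def poibin_pmf(probs):
    # exact Poisson binomial PMF by iterative convolution, O(N^2) overall
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for i, m in enumerate(pmf):
            new[i] += m * (1 - p)
            new[i + 1] += m * p
        pmf = new
    return pmf

def binom_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y) if 0 <= y <= n else 0.0

random.seed(1)
N, rho, y0 = 50, 0.8, 12
alpha0 = [random.uniform(0.1, 0.5) for _ in range(N)]  # toy initial infection probabilities

poibin = poibin_pmf(alpha0)
v = [poibin[i] * binom_pmf(y0, i, rho) for i in range(N + 1)]  # v_0^{(i)} in Algorithm 1
w0 = sum(v)                  # marginal likelihood p_theta(y_0) as in (24)
V = [vi / w0 for vi in v]    # categorical probabilities for sampling I_0 as in (25)
```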

Input: Parameters \theta\in\Theta and number of particles P\in\mathbb{N}
compute v_{0}^{(i)}=\text{PoiBin}(i;\alpha_{0})\text{Bin}(y_{0};i,\rho)\mathbbm{1}(i\geq y_{0}) for i\in[0:N]
set w_{0}=\sum_{i=0}^{N}v_{0}^{(i)}
normalize V_{0}^{(i)}=v_{0}^{(i)}/w_{0} for i\in[0:N]
sample I_{0}^{(p)}\sim\text{Cat}([0:N],V_{0}^{(0:N)}) and X_{0}^{(p)}|I_{0}^{(p)}\sim\text{CondBer}(\alpha_{0},I_{0}^{(p)}) for p\in[1:P]
for t=1,\cdots,T and p=1,\cdots,P do
   compute v_{t}^{(i,p)}=\text{PoiBin}(i;\alpha(X_{t-1}^{(p)}))\text{Bin}(y_{t};i,\rho)\mathbbm{1}(i\geq y_{t}) for i\in[0:N]
   set w_{t}^{(p)}=\sum_{i=0}^{N}v_{t}^{(i,p)}
   normalize V_{t}^{(i,p)}=v_{t}^{(i,p)}/w_{t}^{(p)} for i\in[0:N]
   normalize W_{t}^{(p)}=w_{t}^{(p)}/\sum_{k=1}^{P}w_{t}^{(k)}
   sample A_{t-1}^{(p)}\sim r(\cdot|W_{t}^{(1:P)})
   sample I_{t}^{(p)}\sim\text{Cat}([0:N],V_{t}^{(0:N,A_{t-1}^{(p)})}) and X_{t}^{(p)}|I_{t}^{(p)}\sim\text{CondBer}(\alpha(X_{t-1}^{(A_{t-1}^{(p)})}),I_{t}^{(p)})
end for
Output: Marginal likelihood estimator \hat{p}_{\theta}(y_{0:T})=w_{0}\prod_{t=1}^{T}P^{-1}\sum_{p=1}^{P}w_{t}^{(p)}, states (X_{t}^{(p)})_{(t,p)\in[0:T]\times[1:P]} and ancestors (A_{t}^{(p)})_{(t,p)\in[0:T-1]\times[1:P]}
Algorithm 1: Auxiliary particle filter for the SIS model

3.2 Controlled sequential Monte Carlo

To obtain better performance than the APF, we can construct proposals that take not just the next but all future observations into account. We can sequentially decompose the smoothing distribution as

\displaystyle p_{\theta}(x_{0:T}|y_{0:T})=p_{\theta}(x_{0}|y_{0:T})\prod_{t=1}^{T}p_{\theta}(x_{t}|x_{t-1},y_{t:T}), (28)

and it follows that the optimal proposal is q_{0}^{\star}(x_{0}|\theta)=p_{\theta}(x_{0}|y_{0:T}) and q_{t}^{\star}(x_{t}|x_{t-1},\theta)=p_{\theta}(x_{t}|x_{t-1},y_{t:T}) for t\in[1:T], as this gives exact samples from the smoothing distribution. The resulting SMC marginal likelihood estimator in (22) would have zero variance for any choice of weight functions satisfying (21). To design approximations of the optimal proposal, it will be instructive to rewrite it as

\displaystyle p_{\theta}(x_{0}|y_{0:T})=\frac{\mu_{\theta}(x_{0})\psi_{0}^{\star}(x_{0})}{\mu_{\theta}(\psi_{0}^{\star})},\quad p_{\theta}(x_{t}|x_{t-1},y_{t:T})=\frac{f_{\theta}(x_{t}|x_{t-1})\psi_{t}^{\star}(x_{t})}{f_{\theta}(\psi_{t}^{\star}|x_{t-1})},\quad t\in[1:T], (29)

where \psi_{t}^{\star}(x_{t})=p(y_{t:T}|x_{t}) for t\in[0:T], \mu_{\theta}(\psi_{0}^{\star})=\sum_{x_{0}\in\{0,1\}^{N}}\mu_{\theta}(x_{0})\psi_{0}^{\star}(x_{0}) denotes the expectation of \psi_{0}^{\star} with respect to \mu_{\theta}, and f_{\theta}(\psi_{t}^{\star}|x_{t-1})=\sum_{x_{t}\in\{0,1\}^{N}}f_{\theta}(x_{t}|x_{t-1})\psi_{t}^{\star}(x_{t}) denotes the conditional expectation of \psi_{t}^{\star} under f_{\theta}. Equation (29) shows how the latent process (18), defined by \mu_{\theta} and f_{\theta}, should be modified to obtain the optimal proposal. The functions (\psi_{t}^{\star})_{t\in[0:T]}, known as the backward information filter (BIF) (Bresler, 1986; Briers et al., 2010), can be defined using the backward recursion

\displaystyle\psi_{T}^{\star}(x_{T})=g_{\theta}(y_{T}|x_{T}),\quad\psi_{t}^{\star}(x_{t})=g_{\theta}(y_{t}|x_{t})f_{\theta}(\psi_{t+1}^{\star}|x_{t}),\quad t\in[0:T-1], (30)

which shows how information from future observations is propagated backwards over time. As the costs of computing and storing the BIF using the recursion (30) are \mathcal{O}(2^{2N}T) and \mathcal{O}(2^{N}T), respectively, approximations are necessary when N is large. In contrast to Guarniero et al. (2017); Heng et al. (2020a), who rely on regression techniques to approximate the BIF, our approach is based on dimensionality reduction by coarse-graining the agent-based model.

At the terminal time T, the function \psi_{T}^{\star}(x_{T})=\text{Bin}(y_{T};I(x_{T}),\rho)\mathbbm{1}(I(x_{T})\geq y_{T}) only depends on the agent states x_{T}\in\{0,1\}^{N} through the one-dimensional summary I(x_{T})\in[0:N]. Therefore it suffices to compute and store \psi_{T}(i_{T})=\text{Bin}(y_{T};i_{T},\rho)\mathbbm{1}(i_{T}\geq y_{T}) for all i_{T}\in[0:N]. Note that \psi_{T}(I(x_{T})) should be seen as a function of the agent states. To iterate the backward recursion (30), we have to compute the conditional expectation f_{\theta}(\psi_{T}|x_{T-1})=\sum_{i_{T}=0}^{N}\text{PoiBin}(i_{T};\alpha(x_{T-1}))\psi_{T}(i_{T}) for all x_{T-1}\in\{0,1\}^{N}. By a thinning argument (Appendix A.2), this is equal to \text{PoiBin}(y_{T};\rho\,\alpha(x_{T-1})). Hence it is clear that all 2^{N} possible configurations of the population have to be considered to iterate the recursion, and an approximation of \alpha(x_{T-1}) is necessary at this point.
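The thinning identity invoked above can be checked numerically with a short sketch; the helper functions and the toy probabilities below are ours.

```python
import math
import random

def poibin_pmf(probs):
    # exact Poisson binomial PMF by iterative convolution
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for i, m in enumerate(pmf):
            new[i] += m * (1 - p)
            new[i + 1] += m * p
        pmf = new
    return pmf

def binom_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y) if 0 <= y <= n else 0.0

random.seed(3)
alpha = [random.random() for _ in range(20)]   # toy success probabilities alpha(x_{T-1})
rho, y = 0.8, 5

pm = poibin_pmf(alpha)
lhs = sum(pm[i] * binom_pmf(y, i, rho) for i in range(y, len(pm)))
rhs = poibin_pmf([rho * a for a in alpha])[y]  # PoiBin(y; rho * alpha), the thinned form
```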

To reduce dimension, we consider

\bar{\alpha}^{n}(x_{T-1})=\begin{cases}\bar{\lambda}N^{-1}I(x_{T-1}),\quad&\mbox{if }x_{T-1}^{n}=0,\\ 1-\bar{\gamma},\quad&\mbox{if }x_{T-1}^{n}=1.\end{cases} (31)

This amounts to approximating the proportion of infected neighbours of each agent by the population proportion of infections, and replacing the individual infection and recovery rates in (19) by their population averages, i.e. \lambda^{n}\approx\bar{\lambda}=N^{-1}\sum_{n=1}^{N}\lambda^{n} and \gamma^{n}\approx\bar{\gamma}=N^{-1}\sum_{n=1}^{N}\gamma^{n}. Writing \bar{f}_{\theta}(x_{T}|x_{T-1})=\prod_{n=1}^{N}\text{Ber}(x_{T}^{n};\bar{\alpha}^{n}(x_{T-1})) for the Markov transition associated to (31), we approximate the conditional expectation f_{\theta}(\psi_{T}|x_{T-1}) by

\displaystyle\bar{f}_{\theta}(\psi_{T}|I(x_{T-1}))=\sum_{i_{T}=0}^{N}\text{SumBin}(i_{T};N-I(x_{T-1}),\bar{\lambda}N^{-1}I(x_{T-1}),I(x_{T-1}),1-\bar{\gamma})\psi_{T}(i_{T}), (32)

where \text{SumBin}(N_{1},p_{1},N_{2},p_{2}) denotes the distribution of the sum of two independent \text{Bin}(N_{1},p_{1}) and \text{Bin}(N_{2},p_{2}) random variables. This follows as a Poisson binomial distribution with homogeneous probabilities (31) reduces to the SumBin distribution in (32), which is not analytically tractable but can be computed exactly at \mathcal{O}(N^{2}) cost using a naive implementation of the convolution \text{SumBin}(i;N_{1},p_{1},N_{2},p_{2})=\sum_{j=0}^{i}\text{Bin}(j;N_{1},p_{1})\text{Bin}(i-j;N_{2},p_{2}) for i\in[0:N_{1}+N_{2}]. In the large N regime, we advocate approximating the SumBin distribution with the translated Poisson approximation (12), which reduces the cost to \mathcal{O}(N) at the price of an approximation error. Moreover, the latter bias only affects the quality of our proposal distributions, and not the consistency properties of the SMC approximations. The resulting approximation of \psi_{T-1}^{\star}(x_{T-1}) can be computed using

\displaystyle\psi_{T-1}(i_{T-1})=\text{Bin}(y_{T-1};i_{T-1},\rho)\mathbbm{1}(i_{T-1}\geq y_{T-1})\bar{f}_{\theta}(\psi_{T}|i_{T-1}) (33)

for all i_{T-1}\in[0:N]. This costs \mathcal{O}(N^{3}) if convolutions are implemented naively and \mathcal{O}(N^{2}) if translated Poisson approximations of (32) are employed. We then continue in the same manner to approximate the recursion (30) until the initial time. Algorithm 2 summarizes our approximation of the BIF (\psi_{t})_{t\in[0:T]}, which costs \mathcal{O}(N^{3}T) or \mathcal{O}(N^{2}T) to compute and \mathcal{O}(NT) in storage.
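The exact SumBin convolution at the heart of Algorithm 2 can be sketched as follows; the values of N, i_t, lambda-bar and gamma-bar are toy choices and the function names are ours.

```python
import math

def binom_pmf_vec(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def sumbin_pmf(n1, p1, n2, p2):
    # PMF of Bin(n1, p1) + Bin(n2, p2) by direct convolution, O(n1 * n2) cost
    a, b = binom_pmf_vec(n1, p1), binom_pmf_vec(n2, p2)
    pmf = [0.0] * (n1 + n2 + 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            pmf[j + k] += aj * bk
    return pmf

# coarse-grained transition from i_t = 30 infections in a population of N = 100
N, i, lam_bar, gam_bar = 100, 30, 0.6, 0.2
pmf = sumbin_pmf(N - i, lam_bar * i / N, i, 1 - gam_bar)
mean = sum(k * pk for k, pk in enumerate(pmf))
```

The mean of the resulting distribution, (N - i) * lam_bar * i / N + i * (1 - gam_bar), matches the expected number of new infections plus non-recoveries under the coarse-grained transition (31).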

Our corresponding approximation of the optimal proposal (29) is

\displaystyle q_{0}(x_{0}|\theta)=\frac{\mu_{\theta}(x_{0})\psi_{0}(I(x_{0}))}{\mu_{\theta}(\psi_{0})},\quad q_{t}(x_{t}|x_{t-1},\theta)=\frac{f_{\theta}(x_{t}|x_{t-1})\psi_{t}(I(x_{t}))}{f_{\theta}(\psi_{t}|x_{t-1})},\quad t\in[1:T]. (34)

To employ these proposals within SMC, the appropriate weight functions (Scharth and Kohn, 2016) satisfying (21) are

\displaystyle w_{0}(x_{0})=\frac{\mu_{\theta}(\psi_{0})g_{\theta}(y_{0}|x_{0})f_{\theta}(\psi_{1}|x_{0})}{\psi_{0}(I(x_{0}))},\quad w_{t}(x_{t})=\frac{g_{\theta}(y_{t}|x_{t})f_{\theta}(\psi_{t+1}|x_{t})}{\psi_{t}(I(x_{t}))},\quad t\in[1:T-1], (35)

and w_{T}(x_{T})=1. To evaluate the weights, note that the expectations can be computed as

\displaystyle\mu_{\theta}(\psi_{0})=\sum_{i_{0}=0}^{N}\text{PoiBin}(i_{0};\alpha_{0})\psi_{0}(i_{0}),\quad f_{\theta}(\psi_{t}|x_{t-1})=\sum_{i_{t}=0}^{N}\text{PoiBin}(i_{t};\alpha(x_{t-1}))\psi_{t}(i_{t}),\quad t\in[1:T]. (36)

Sampling from the proposals in (34) can be performed in a similar manner as the APF. To initialize, we sample from

\displaystyle q_{0}(x_{0},i_{0}|\theta)=q_{0}(i_{0}|\theta)q_{0}(x_{0}|i_{0},\theta)=\frac{\text{PoiBin}(i_{0};\alpha_{0})\psi_{0}(i_{0})}{\mu_{\theta}(\psi_{0})}\text{CondBer}(x_{0};\alpha_{0},i_{0}) (37)

which admits q_{0}(x_{0}|\theta) as its marginal distribution, and for time t\in[1:T]

\displaystyle q_{t}(x_{t},i_{t}|x_{t-1},\theta)=q_{t}(i_{t}|x_{t-1},\theta)q_{t}(x_{t}|x_{t-1},i_{t},\theta)=\frac{\text{PoiBin}(i_{t};\alpha(x_{t-1}))\psi_{t}(i_{t})}{f_{\theta}(\psi_{t}|x_{t-1})}\text{CondBer}(x_{t};\alpha(x_{t-1}),i_{t}) (38)

which admits q_{t}(x_{t}|x_{t-1},\theta) as its marginal transition. Algorithm 3 gives an algorithmic summary of the resulting SMC method, which we shall refer to as controlled SMC (cSMC), following the terminology of Heng et al. (2020a). This has the same cost as the APF, and one can also employ translated Poisson approximations (12) and MCMC to reduce the computational cost.

To study the performance of cSMC, we consider the Kullback–Leibler (KL) divergence from our proposal distribution q_{\theta}(x_{0:T})=q_{0}(x_{0}|\theta)\prod_{t=1}^{T}q_{t}(x_{t}|x_{t-1},\theta) to the smoothing distribution p_{\theta}(x_{0:T}|y_{0:T}), denoted as \text{KL}(p_{\theta}(x_{0:T}|y_{0:T})\mid q_{\theta}(x_{0:T})), which characterizes the quality of our importance proposal (Chatterjee and Diaconis, 2018). The following result provides a decomposition of this KL divergence in terms of logarithmic differences between the BIF (\psi_{t}^{\star})_{t\in[0:T]} and our approximation (\psi_{t})_{t\in[0:T]} under the marginal distributions of the smoothing distribution and our proposal distribution, denoted as \eta_{t}^{\star}(x_{t}|\theta) and \eta_{t}(x_{t}|\theta) respectively for each time t\in[0:T]. Given a function \varphi:\{0,1\}^{N}\rightarrow\mathbb{R}, we will write \eta_{t}^{\star}(\varphi|\theta) and \eta_{t}(\varphi|\theta) to denote expectations under these marginal distributions, and the corresponding L^{2}-norms as \|\varphi\|_{L^{2}(\eta_{t}^{\star})}=\eta_{t}^{\star}(\varphi^{2}|\theta)^{1/2} and \|\varphi\|_{L^{2}(\eta_{t})}=\eta_{t}(\varphi^{2}|\theta)^{1/2}.

Proposition 3.1.

The Kullback–Leibler divergence from q_{\theta}(x_{0:T}) to p_{\theta}(x_{0:T}|y_{0:T}) satisfies

\displaystyle\text{KL}\left(p_{\theta}(x_{0:T}|y_{0:T})\mid q_{\theta}(x_{0:T})\right)\leq\sum_{t=0}^{T}\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\psi_{t})|\theta)+M_{t-1}\eta_{t}(\log(\psi_{t}/\psi_{t}^{\star})|\theta), (39)

where M_{-1}=1 and M_{t}=\max_{x_{t}\in\{0,1\}^{N}}\eta_{t}^{\star}(x_{t}|\theta)/\eta_{t}(x_{t}|\theta) for t\in[0:T-1].

The proof is given in Appendix B.1. We show in Appendix B.2 that the constants (M_{t})_{t\in[0:T-1]} are finite by upper bounding the weight functions in (35). The next result characterizes the error of our BIF approximation measured in terms of the KL upper bound in (39).

Proposition 3.2.

For each time t\in[0:T], the BIF approximation in Algorithm 2 satisfies:

\displaystyle\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\psi_{t})|\theta)\leq\sum_{k=t}^{T-1}c_{\theta}^{\star}(\psi_{k+1})\|\phi_{\theta}^{\star}\|_{L^{2}(\eta_{k}^{\star})}\|\Delta_{\theta}\|_{L^{2}(\eta_{k}^{\star})}, (40)
\displaystyle\eta_{t}(\log(\psi_{t}/\psi_{t}^{\star})|\theta)\leq\sum_{k=t}^{T-1}c_{\theta}(\psi_{k+1})\|\phi_{\theta}\|_{L^{2}(\eta_{k})}\|\Delta_{\theta}\|_{L^{2}(\eta_{k})}, (41)

with constants

\displaystyle c_{\theta}^{\star}(\psi_{k})=\frac{1}{\min_{x_{k-1}\in\{0,1\}^{N}\setminus(0,\ldots,0)}f_{\theta}(\psi_{k}|x_{k-1})}<\infty,\quad c_{\theta}(\psi_{k})=\frac{1}{\min_{i_{k-1}\in[1:N]}\bar{f}_{\theta}(\psi_{k}|i_{k-1})}<\infty

for k\in[1:T], functions \Delta_{\theta}(x)=\sum_{n=1}^{N}|\bar{\alpha}^{n}(x)-\alpha^{n}(x)|,

\displaystyle\phi_{\theta}^{\star}(x)=\left\{\sum_{i\in[0:N]}\frac{\text{PoiBin}(i;\alpha(x))^{2}}{\text{PoiBin}(i;\bar{\alpha}(x))^{2}}\right\}^{1/2},\quad\phi_{\theta}(x)=\left\{\sum_{i\in[0:N]}\frac{\text{PoiBin}(i;\bar{\alpha}(x))^{2}}{\text{PoiBin}(i;\alpha(x))^{2}}\right\}^{1/2} (42)

for x\in\{0,1\}^{N}, and the convention that \sum_{k=t}^{p}\{\cdot\}=0 for p<t.

The proof of Proposition 3.2 in Appendix B.3 derives recursive bounds on the approximation errors. The crux of our arguments is to upper bound the error incurred by taking conditional expectations under the homogeneous probabilities (31). This relies on an upper bound on the Kullback–Leibler divergence between two Poisson binomial distributions that is established in Appendix A.3, which may be of independent interest. If conditional expectations are further approximated using translated Poisson approximations, one can also employ the results of Čekanavičius and Vaitkus (2001); Barbour and Čekanavičius (2002) to incorporate these errors in our analysis. The error bounds in (40) and (41) show how the accuracy of the BIF approximation depends on our approximation of the success probability \alpha(x) via the term \Delta_{\theta}(x). If we decompose \Delta_{\theta}(x)\leq\Delta_{\theta}^{\mathcal{G}}(x)+\Delta_{\theta}^{\lambda}+\Delta_{\theta}^{\gamma}, where

\displaystyle\Delta_{\theta}^{\mathcal{G}}(x)=\sum_{n=1}^{N}\left|N^{-1}I(x)-\mathcal{D}(n)^{-1}\sum_{m\in\mathcal{N}(n)}x^{m}\right|,\quad\Delta_{\theta}^{\lambda}=\sum_{n=1}^{N}|\bar{\lambda}-\lambda^{n}|,\quad\Delta_{\theta}^{\gamma}=\sum_{n=1}^{N}|\bar{\gamma}-\gamma^{n}|, (43)

we see the effect of coarse-graining the agent-based model in (31). In Appendix B.4, we discuss how to reduce the errors \Delta_{\theta}^{\lambda} and \Delta_{\theta}^{\gamma}. By adopting a more fine-grained approximation based on clustering of the infection and recovery rates, one can obtain more accurate approximations at the expense of increased computational cost.
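A toy sketch of the clustering idea, with our own illustrative numbers: when infection rates fall into two well-separated groups, replacing the single population average by within-cluster averages can drive the error \Delta_{\theta}^{\lambda} essentially to zero.

```python
lam = [0.1] * 5 + [0.9] * 5   # hypothetical infection rates in two well-separated groups

def l1_error(rates, groups):
    # sum over agents of |group average - individual rate|, a Delta^lambda-type term
    err = 0.0
    for g in groups:
        m = sum(rates[n] for n in g) / len(g)
        err += sum(abs(m - rates[n]) for n in g)
    return err

coarse = l1_error(lam, [list(range(10))])                    # one population-wide average
fine = l1_error(lam, [list(range(5)), list(range(5, 10))])   # one average per cluster
```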

Input: Parameters \theta\in\Theta
compute the average infection and recovery rates \bar{\lambda}=N^{-1}\sum_{n=1}^{N}\lambda^{n}, \bar{\gamma}=N^{-1}\sum_{n=1}^{N}\gamma^{n}
compute \psi_{T}(i_{T})=\text{Bin}(y_{T};i_{T},\rho)\mathbbm{1}(i_{T}\geq y_{T}) for i_{T}\in[0:N]
for t=T-1,\cdots,0 and i_{t}=0,\cdots,N do
   compute \bar{f}_{\theta}(\psi_{t+1}|i_{t})=\sum_{i_{t+1}=0}^{N}\text{SumBin}(i_{t+1};N-i_{t},\bar{\lambda}N^{-1}i_{t},i_{t},1-\bar{\gamma})\psi_{t+1}(i_{t+1})
   compute \psi_{t}(i_{t})=\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})\bar{f}_{\theta}(\psi_{t+1}|i_{t})
end for
Output: Approximate BIF (\psi_{t})_{t\in[0:T]}
Algorithm 2: Backward information filter approximation for the SIS model
Input: Parameters \theta\in\Theta, approximate BIF (\psi_{t})_{t\in[0:T]} and number of particles P\in\mathbb{N}
compute probabilities v_{0}^{(i)}=\text{PoiBin}(i;\alpha_{0})\psi_{0}(i) for i\in[0:N]
set E_{0}=\sum_{i=0}^{N}v_{0}^{(i)} and normalize probabilities V_{0}^{(i)}=v_{0}^{(i)}/E_{0} for i\in[0:N]
sample I_{0}^{(p)}\sim\text{Cat}([0:N],V_{0}^{(0:N)}) and X_{0}^{(p)}|I_{0}^{(p)}\sim\text{CondBer}(\alpha_{0},I_{0}^{(p)}) for p\in[1:P]
compute weights w_{0}^{(p)}=w_{0}(X_{0}^{(p)}) using (35) for p\in[1:P]
for t=1,\cdots,T and p=1,\cdots,P do
   normalize weights W_{t-1}^{(p)}=w_{t-1}^{(p)}/\sum_{k=1}^{P}w_{t-1}^{(k)}
   sample ancestors A_{t-1}^{(p)}\sim r(\cdot|W_{t-1}^{(1:P)})
   compute probabilities v_{t}^{(i,p)}=\text{PoiBin}(i;\alpha(X_{t-1}^{(A_{t-1}^{(p)})}))\psi_{t}(i) for i\in[0:N]
   set E_{t}^{(p)}=\sum_{i=0}^{N}v_{t}^{(i,p)} and normalize probabilities V_{t}^{(i,p)}=v_{t}^{(i,p)}/E_{t}^{(p)} for i\in[0:N]
   sample I_{t}^{(p)}\sim\text{Cat}([0:N],V_{t}^{(0:N,p)}) and X_{t}^{(p)}|I_{t}^{(p)}\sim\text{CondBer}(\alpha(X_{t-1}^{(A_{t-1}^{(p)})}),I_{t}^{(p)})
   compute weights w_{t}^{(p)}=w_{t}(X_{t}^{(p)}) using (35)
end for
Output: Marginal likelihood estimator \hat{p}_{\theta}(y_{0:T})=\prod_{t=0}^{T}P^{-1}\sum_{p=1}^{P}w_{t}^{(p)}, states (X_{t}^{(p)})_{(t,p)\in[0:T]\times[1:P]} and ancestors (A_{t}^{(p)})_{(t,p)\in[0:T-1]\times[1:P]}
Algorithm 3: Controlled sequential Monte Carlo for the SIS model

3.3 Numerical illustration of sequential Monte Carlo methods

We now illustrate the behaviour of the above SMC methods on simulated data. We consider a population of N=100 agents on a fully connected network, observed over T=90 time steps. The agent covariates w^{n}=(w_{1}^{n},w_{2}^{n}) are taken as w_{1}^{n}=1 and sampled from w_{2}^{n}\sim\mathcal{N}(0,1) independently for n\in[1:N]. Given these covariates, we simulate data from model (17)-(19) and (3) with the parameters \beta_{0}=(-\log(N-1),0), \beta_{\lambda}=(-1,2), \beta_{\gamma}=(-1,-1) and \rho=0.8.

The top panel of Figure 2 illustrates the performance of these SMC methods at the data-generating parameter (DGP), measured in terms of the effective sample size (ESS) criterion (Kong et al., 1994). The ESS at time t\in[0:T], defined in terms of the normalized weights (W_{t}^{(p)})_{p\in[1:P]} as 1/\sum_{p=1}^{P}(W_{t}^{(p)})^{2}, measures the adequacy of the importance sampling approximation at each step. We consider two implementations of the controlled SMC in Algorithm 3: cSMC1 employs proposal distributions defined by the BIF approximation in Algorithm 2, while cSMC2 relies on translated Poisson approximations of the SumBin distributions in Algorithm 2 to lower the computational cost. This lowers the run-time of the BIF approximation from 0.35 to 0.12 seconds, which is insignificant relative to the cost of cSMC for a population size of N=100. As expected, the APF performs better than the BPF by taking the next observation into account, and cSMC does better than the APF by incorporating information from the entire observation sequence. Furthermore, the faster BIF approximation does not result in a noticeable loss of cSMC performance, owing to the accuracy of the translated Poisson approximations with N=100 agents. Although the performance of the BPF seems adequate in this simulated data setting, the middle and bottom panels of Figure 2 reveal that its particle approximation can collapse when observation counts are unusually small or large. In contrast, the performance of APF and cSMC appears to be more robust to such informative observations.
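The ESS criterion itself is a one-liner; a minimal sketch with toy weights (the function name is ours):

```python
def effective_sample_size(weights):
    # ESS = 1 / sum_p (W_p)^2, computed from unnormalized weights
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)
```

Equal weights give the maximal value ESS = P, while a degenerate weight vector gives the minimal value ESS = 1, signalling that the importance sampling approximation rests on a single particle.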

Figure 2: Effective sample size of various SMC methods with P=512 particles when considering simulated observations (top), when observations at t\in\{25,50,75\} are replaced by \lfloor y_{t}/2\rfloor (middle), or by \min(2y_{t},N) (bottom).

Next we examine the performance of these SMC methods in terms of marginal likelihood estimation. Table 2 displays the variance of the log-marginal likelihood estimator at two parameter sets, and its average cost measured as run-time, estimated using 100 independent repetitions of each method. At the DGP, it is apparent that the cSMC estimators reach the asymptotic regime of P\rightarrow\infty earlier than BPF and APF, which seem to require at least P=512 particles. Based on the largest value of P that we considered, the asymptotic variances of APF, cSMC1 and cSMC2 were found to be 29, 155 and 115 times smaller than that of BPF, respectively. As the cost of APF and cSMC was approximately 19 times that of BPF in our implementation, we see that APF, cSMC1 and cSMC2 are respectively 1.5, 8 and 6 times more efficient than BPF at the DGP. We can expect these efficiency gains to be more significant as we move away from the DGP. To illustrate this, we consider another parameter set with \beta_{\lambda}=(-3,0) and all other parameters kept at the DGP. Although this is a less likely set of parameters, as the log-marginal likelihood is approximately 232 lower than at the DGP, adequate marginal likelihood estimation is crucial when employing SMC methods within particle MCMC algorithms for parameter inference. In this case, we found that the BPF marginal likelihood estimates could collapse to zero for the values of P considered in Table 2. In contrast, APF and cSMC do not suffer from this issue by construction. By increasing the number of BPF particles to P=262,144 and comparing its performance to APF, cSMC1 and cSMC2 with P=2048 particles, we find that BPF is respectively 9, 76 and 42 times less efficient at this parameter set. Lastly, Figure 3 compares the SMC methods as the parameter \rho varies with all other parameters fixed at the DGP.

        BPF             APF                      cSMC1             cSMC2             cSMC
 P      DGP    Cost     DGP     Non-DGP  Cost    DGP     Non-DGP   DGP     Non-DGP   Cost
        Var    (sec)    Var     Var      (sec)   Var     Var       Var     Var       (sec)
 64     4.32   0.09     0.281   71.88    1.49    0.0696  13.52     0.0779  18.44     1.46
 128    2.39   0.17     0.154   39.52    2.95    0.0285  8.62      0.0382  8.48      2.88
 256    1.67   0.33     0.110   26.86    5.85    0.0164  4.11      0.0190  6.88      5.72
 512    0.88   0.63     0.056   18.98    11.72   0.0087  3.57      0.0105  5.03      11.41
 1024   0.55   1.25     0.026   13.38    23.46   0.0049  2.05      0.0046  3.41      22.83
 2048   0.31   2.49     0.011   9.93     47.48   0.0020  1.15      0.0027  2.07      45.97
Table 2: Variance of the log-marginal likelihood estimator and its average cost (in seconds) using different numbers of particles $P$ and various SMC methods. The parameter sets considered are the data generating parameters (DGP) and a modification of the DGP with $\beta_{\lambda}=(-3,0)$ (labelled as non-DGP). The costs of cSMC1 and cSMC2 are comparable, and the run-times to compute the BIF approximations ($0.35$ and $0.12$ seconds for cSMC1 and cSMC2 respectively) are not reported in this table.
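The efficiency comparisons above combine estimator variance with computational cost. As a quick sanity check of the arithmetic quoted in the text (the function name below is ours), the relative efficiency of a method over BPF is its variance reduction divided by its cost inflation:

```python
def relative_efficiency(var_ratio, cost_ratio):
    """Efficiency gain over a baseline: (variance reduction factor) divided by
    (cost inflation factor), so lower variance can be offset by higher cost."""
    return var_ratio / cost_ratio

# Ratios quoted in the text at the DGP, with BPF as the baseline (cost ratio ~19):
print(round(relative_efficiency(29.0, 19.0), 1))   # APF: 1.5
print(round(relative_efficiency(155.0, 19.0)))     # cSMC1: 8
print(round(relative_efficiency(115.0, 19.0)))     # cSMC2: 6
```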
Figure 3: Variance of the log-marginal likelihood estimator (left) and relative efficiency (right) of various SMC methods compared to the BPF with $P=512$ particles, as $\rho$ varies with the other parameters fixed at the DGP.

3.4 Parameter and state inference

We first concern ourselves with the behaviour of the marginal likelihood and the maximum likelihood estimator (MLE) $\arg\max_{\theta\in\Theta}p_{\theta}(y_{0:T})$ as the number of observations $T\rightarrow\infty$. This is illustrated with our running simulated dataset from Section 3.3. Figure 4 plots the log-likelihood as a function of $\beta_{\lambda}=(\beta_{\lambda}^{1},\beta_{\lambda}^{2})$ or $(\beta_{\lambda}^{2},\beta_{\gamma}^{2})$ with the other parameters fixed at their data generating values, estimated using cSMC with $P=64$ particles. These plots reveal the complex behaviour of the likelihood functions induced by agent-based SIS models. Furthermore, we see that the likelihood concentrates more around the DGP as $T$ increases, and that the MLE can recover the DGP when $T$ is sufficiently large.

Figure 4: Estimated log-likelihood as a function of $\beta_{\lambda}=(\beta_{\lambda}^{1},\beta_{\lambda}^{2})$ (first row) or $(\beta_{\lambda}^{2},\beta_{\gamma}^{2})$ (second row) with the other parameters fixed at the DGP, given $t\in\{10,30,90\}$ observations. For ease of plotting, the log-likelihood values were translated so that the MLE (red dot) attains a maximum of zero. The data generating parameters $\beta_{\lambda}=(-1,2)$ or $(\beta_{\lambda}^{2},\beta_{\gamma}^{2})=(2,-1)$ are shown as a black dot.

By building on the SMC methods in Sections 3.1 and 3.2, one can construct a stochastic gradient ascent scheme or an expectation-maximization algorithm to approximate the MLE. We refer readers to Kantas et al. (2015) for a comprehensive review of such approaches. In the Bayesian framework, our interest is in the posterior distribution

p(\theta,x_{0:T}|y_{0:T})=p(\theta|y_{0:T})\,p_{\theta}(x_{0:T}|y_{0:T})\propto p(\theta)\,p_{\theta}(x_{0:T},y_{0:T}),\qquad(44)

where $p(d\theta)=p(\theta)d\theta$ is a given prior distribution on the parameter space $\Theta$. We employ particle marginal Metropolis–Hastings (PMMH) to sample from the posterior distribution. Following the discussion in Section 3.3, we choose cSMC to construct a more efficient PMMH chain.

We now illustrate our inference method on the simulated data setup of Section 3.3. We adopt a prior distribution under which the parameters are independent with $\beta_{0},\beta_{\lambda},\beta_{\gamma}\sim\mathcal{N}(0,9)$ and $\rho\sim\text{Uniform}(0,1)$. All MH parameter updates employ a Normal random walk proposal on the $(\beta_{0},\beta_{\lambda},\beta_{\gamma},\log(\rho/(1-\rho)))$-space, with a standard deviation of $0.2$ chosen to achieve suitable acceptance probabilities. We use $P=128$ particles in the cSMC algorithm within PMMH, and run the PMMH algorithm for $100{,}000$ iterations after a burn-in of $5000$ iterations. Using these posterior samples, we infer the ratios $R_{0}^{n}=\lambda^{n}/\gamma^{n}$, which can be understood as the reproductive number of agent $n\in[1:N]$. In the left panel of Figure 5, we display the estimated posterior medians and $95\%$ credible sets, as well as the data generating values. Although the posterior median estimates are similar across agents, there is large posterior uncertainty for agents with small or large data generating ratios. To visualize how $(R_{0}^{n})_{n\in[1:N]}$ is distributed in the population, we estimate histograms that take parameter uncertainty into account, and illustrate the results in the right panel of Figure 5. The posterior median estimates yield a histogram that is more concentrated than its data generating counterpart.
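The PMMH update described above can be sketched in a few lines; this is our own simplification for a single scalar parameter, with a placeholder log-likelihood standing in for the cSMC estimator. Any non-negative unbiased likelihood estimator leaves the posterior invariant, which is why the noisy cSMC estimate can be used in the acceptance ratio:

```python
import math
import random

def pmmh(loglik_hat, log_prior, theta0, n_iters, step=0.2, seed=1):
    """Particle marginal Metropolis-Hastings on a scalar parameter: a Normal
    random-walk proposal accepted using an *estimated* log-likelihood."""
    rng = random.Random(seed)
    theta, ll = theta0, loglik_hat(theta0)
    chain = [theta]
    for _ in range(n_iters):
        prop = theta + rng.gauss(0.0, step)  # random walk on an unconstrained space,
        ll_prop = loglik_hat(prop)           # e.g. log(rho/(1-rho)) for rho in (0,1)
        log_alpha = ll_prop + log_prior(prop) - ll - log_prior(theta)
        if math.log(rng.random()) < log_alpha:
            theta, ll = prop, ll_prop
        chain.append(theta)
    return chain

# Toy check with an exact Gaussian log-likelihood, so PMMH reduces to plain MH:
# likelihood N(0, 1) as a function of theta, prior N(0, 9) as in the text.
chain = pmmh(lambda th: -0.5 * th * th, lambda th: -th * th / 18.0,
             theta0=2.0, n_iters=5000)
```

In practice `loglik_hat` would be the log of the cSMC marginal likelihood estimate at the proposed parameters, recomputed at each iteration.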

Figure 5: Posterior estimates of the reproductive number $R_{0}^{n}=\lambda^{n}/\gamma^{n}$ of each agent $n\in[1:N]$ (left) and its distribution in the population (right). Posterior medians are shown in blue and the $95\%$ posterior credible sets in light blue. Data generating values are illustrated in black.

Lastly, we examine the predictive performance of the model when relatively few observations are available. As illustrated in Figure 6, we assume access to the first $t=30$ observations (black dots) and predict the rest of the time series up to time $T=90$ (grey dots) using the posterior predictive distribution $p(y_{t+1:T}|y_{0:t})=\int_{\Theta}\sum_{x_{t}\in\{0,1\}^{N}}p_{\theta}(y_{t+1:T}|x_{t})\,p(\theta,x_{t}|y_{0:t})\,d\theta$. By simulating trajectories from the posterior predictive (grey lines), we obtain the model predictions and uncertainty estimates in Figure 6.
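Simulating one posterior-predictive trajectory can be sketched as follows. This is a simplified stand-in that assumes SIS-style dynamics with agent-specific rates on a complete graph (in practice one would reuse the fitted model's simulator and posterior draws of $(\theta,x_{t})$), and thins the infected count with the binomial reporting model (3); the function name is ours:

```python
import random

def predictive_path(x_t, lam, gam, rho, horizon, rng=None):
    """Forward-simulate one observation path from a posterior draw (theta, x_t):
    propagate agent states with SIS-style dynamics (susceptible agent n is
    infected w.p. lam[n]*I/N, infected agent n recovers w.p. gam[n]) and report
    each infected count through binomial thinning with probability rho."""
    rng = rng or random.Random(0)
    N, x, ys = len(x_t), list(x_t), []
    for _ in range(horizon):
        I = sum(x)  # current number of infected agents
        x = [(1 if rng.random() < lam[n] * I / N else 0) if x[n] == 0
             else (0 if rng.random() < gam[n] else 1)
             for n in range(N)]
        ys.append(sum(1 for _ in range(sum(x)) if rng.random() < rho))
    return ys

# One predictive trajectory of length 60 from a draw with 5 infected agents:
ys = predictive_path([1] * 5 + [0] * 95, lam=[0.4] * 100, gam=[0.2] * 100,
                     rho=0.8, horizon=60)
```

Repeating this over many posterior draws and taking pointwise quantiles yields the predictive bands shown in Figure 6.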

Figure 6: Observation sequence plotted as dots that are coloured initially in black, followed by grey from time $t=30$ to $T=90$. Some sampled trajectories from the posterior predictive distribution $p(y_{t+1:T}|y_{0:t})$ are depicted as grey lines. The blue dashed line shows the medians over time; the lower and upper light blue dashed lines correspond to the $2.5\%$ and $97.5\%$ quantiles respectively.

4 Susceptible-Infected-Recovered model

We consider a susceptible-infected-recovered (SIR) model, where agents become immune to a pathogen after they recover from an infection. The “recovered” status of an agent shall be encoded by a state of $2$. Given a population configuration $x=(x^{n})_{n\in[1:N]}\in\{0,1,2\}^{N}$, we will write $S^{n}(x)=\mathbbm{1}(x^{n}=0)$, $I^{n}(x)=\mathbbm{1}(x^{n}=1)$, $R^{n}(x)=\mathbbm{1}(x^{n}=2)$ to indicate the status of agent $n\in[1:N]$, and $S(x)=\sum_{n=1}^{N}S^{n}(x)$, $I(x)=\sum_{n=1}^{N}I^{n}(x)$, $R(x)=\sum_{n=1}^{N}R^{n}(x)$ to count the number of agents in each state. Under the assumption of a closed population, we have $S(x)+I(x)+R(x)=N$.

The time evolution of the population $(X_{t})_{t\in[0:T]}$ is now modelled by a Markov chain on $\{0,1,2\}^{N}$, i.e. the specification in (18) with

\mu_{\theta}(x_{0})=\prod_{n=1}^{N}\text{Cat}(x_{0}^{n};[0:2],\alpha_{0}^{n}),\quad f_{\theta}(x_{t}|x_{t-1})=\prod_{n=1}^{N}\text{Cat}(x_{t}^{n};[0:2],\alpha^{n}(x_{t-1})),\quad t\in[1:T].\qquad(45)

The above probabilities are given by $\alpha_{0}^{n}=(\alpha_{0,S}^{n},\alpha_{0,I}^{n},\alpha_{0,R}^{n})=(1-\alpha_{0,I}^{n},\alpha_{0,I}^{n},0)$ and $\alpha^{n}(x_{t-1})=(\alpha_{S}^{n}(x_{t-1}),\alpha_{I}^{n}(x_{t-1}),\alpha_{R}^{n}(x_{t-1}))$ where

\alpha_{S}^{n}(x_{t-1})=S^{n}(x_{t-1})\left(1-\lambda^{n}\mathcal{D}(n)^{-1}\sum_{m\in\mathcal{N}(n)}I^{m}(x_{t-1})\right),
\alpha_{I}^{n}(x_{t-1})=S^{n}(x_{t-1})\lambda^{n}\mathcal{D}(n)^{-1}\sum_{m\in\mathcal{N}(n)}I^{m}(x_{t-1})+I^{n}(x_{t-1})(1-\gamma^{n}),\qquad(46)
\alpha_{R}^{n}(x_{t-1})=I^{n}(x_{t-1})\gamma^{n}+R^{n}(x_{t-1}),

which satisfy $\alpha_{S}^{n}(x_{t-1})+\alpha_{I}^{n}(x_{t-1})+\alpha_{R}^{n}(x_{t-1})=1$. Note that $\alpha_{R}^{n}(x_{t-1})=1$ if $R^{n}(x_{t-1})=1$, which reflects the above-mentioned immunity. As in the SIS model, the agents' initial infection probabilities, infection rates and recovery rates $(\alpha_{0,I}^{n},\lambda^{n},\gamma^{n})_{n\in[1:N]}$ are specified by (17), and the observation model for the number of infections reported over time $(Y_{t})_{t\in[0:T]}$ is (3).
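To make the transition mechanism concrete, here is a minimal simulation of one step of (45)–(46); the function name is ours, and for simplicity the interaction graph is taken to be complete, so the infection pressure $\mathcal{D}(n)^{-1}\sum_{m\in\mathcal{N}(n)}I^{m}(x)$ is approximated by $I(x)/N$:

```python
import random

def sir_step(x, lam, gam, rng=None):
    """One transition of the agent-based SIR model: each agent n draws its next
    state from Cat([0,1,2], alpha^n(x)) with the probabilities in (46),
    here under a complete interaction graph."""
    rng = rng or random.Random(0)
    N = len(x)
    I = sum(1 for s in x if s == 1)  # infected count I(x)
    nxt = []
    for n in range(N):
        if x[n] == 0:    # susceptible: becomes infected w.p. lam^n * I(x)/N
            nxt.append(1 if rng.random() < lam[n] * I / N else 0)
        elif x[n] == 1:  # infected: recovers w.p. gam^n
            nxt.append(2 if rng.random() < gam[n] else 1)
        else:            # recovered: immune, alpha_R^n = 1
            nxt.append(2)
    return nxt

# 10 agents, 2 initially infected, homogeneous rates:
x0 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
x1 = sir_step(x0, lam=[0.5] * 10, gam=[0.2] * 10)
```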

We consider again SMC approximations of the marginal likelihood and smoothing distribution of the resulting hidden Markov model on $\{0,1,2\}^{N}$. The BPF can be readily implemented at $\mathcal{O}(NTP)$ cost, but suffers from the difficulties discussed in Section 3.1. To obtain better performance, we discuss how to implement the fully adapted APF in Section 4.1 and adapt our cSMC construction in Section 4.2. Alternative MCMC approaches, such as those detailed in Appendix C, could also be considered.

4.1 Auxiliary particle filter

We recall that implementing the fully adapted APF requires one to sample from the proposals $q_{0}(x_{0}|\theta)=p_{\theta}(x_{0}|y_{0})$ and $q_{t}(x_{t}|x_{t-1},\theta)=p_{\theta}(x_{t}|x_{t-1},y_{t})$ for $t\in[1:T]$, and to evaluate the weight functions $w_{0}(x_{0})=p_{\theta}(y_{0})$ and $w_{t}(x_{t-1})=p_{\theta}(y_{t}|x_{t-1})$ for $t\in[1:T]$. At the initial time, we compute the marginal likelihood $p_{\theta}(y_{0})$ as

p_{\theta}(y_{0})=\sum_{i_{0}=0}^{N}\text{PoiBin}(i_{0};\alpha_{0,I})\text{Bin}(y_{0};i_{0},\rho)\mathbbm{1}(i_{0}\geq y_{0})\qquad(47)

where $\alpha_{0,I}=(\alpha_{0,I}^{n})_{n\in[1:N]}$, and sample from

p_{\theta}(x_{0},i_{0},i_{0}^{1:N}|y_{0})=p_{\theta}(i_{0}|y_{0})\,p_{\theta}(i_{0}^{1:N}|i_{0})\,p_{\theta}(x_{0}|i_{0}^{1:N})\qquad(48)
=\frac{\text{PoiBin}(i_{0};\alpha_{0,I})\text{Bin}(y_{0};i_{0},\rho)\mathbbm{1}(i_{0}\geq y_{0})}{p_{\theta}(y_{0})}\,\text{CondBer}(i_{0}^{1:N};\alpha_{0,I},i_{0})\prod_{n=1}^{N}\text{Cat}(x_{0}^{n};[0:2],\nu_{0}(i_{0}^{n}))

with $\nu_{0}(i_{0}^{n})=(1-i_{0}^{n},i_{0}^{n},0)$, which admits the posterior of agent states $p_{\theta}(x_{0}|y_{0})$ as its marginal distribution. Similarly, for time step $t\in[1:T]$, we compute the predictive likelihood $p_{\theta}(y_{t}|x_{t-1})$ as

p_{\theta}(y_{t}|x_{t-1})=\sum_{i_{t}=0}^{N}\text{PoiBin}(i_{t};\alpha_{I}(x_{t-1}))\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})\qquad(49)

where $\alpha_{I}(x_{t-1})=(\alpha_{I}^{n}(x_{t-1}))_{n\in[1:N]}$, and sample from

p_{\theta}(x_{t},i_{t},i_{t}^{1:N}|x_{t-1},y_{t})=p_{\theta}(i_{t}|x_{t-1},y_{t})\,p_{\theta}(i_{t}^{1:N}|x_{t-1},i_{t})\,p_{\theta}(x_{t}|x_{t-1},i_{t}^{1:N})\qquad(50)
=\frac{\text{PoiBin}(i_{t};\alpha_{I}(x_{t-1}))\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})}{p_{\theta}(y_{t}|x_{t-1})}\,\text{CondBer}(i_{t}^{1:N};\alpha_{I}(x_{t-1}),i_{t})\prod_{n=1}^{N}\text{Cat}(x_{t}^{n};[0:2],\nu(x_{t-1},i_{t}^{n}))

with $\nu(x_{t-1},i_{t}^{n})=((1-i_{t}^{n})S^{n}(x_{t-1}),\,i_{t}^{n},\,(1-i_{t}^{n})\{I^{n}(x_{t-1})+R^{n}(x_{t-1})\})$, which admits $p_{\theta}(x_{t}|x_{t-1},y_{t})$ as its marginal transition.

By augmenting with the agents' infection statuses $I^{1:N}(x_{t})=(I^{n}(x_{t}))_{n\in[1:N]}$, the first two steps in (48) and (50) are analogous to (25) and (27) for the SIS model, and the last step identifies an agent's state given her augmented current infection status and previous state. We point out that $\nu_{0}(i_{0}^{n}),\nu(x_{t-1},i_{t}^{n})\in\{(1,0,0),(0,1,0),(0,0,1)\}$. These expressions for $\nu_{0}(i_{0}^{n})$ and $\nu(x_{t-1},i_{t}^{n})$ exploit the following facts: ($i$) susceptible agents either remain susceptible or become infected; ($ii$) infected agents either remain infected or recover; ($iii$) agents who have recovered enjoy immunity. Algorithm 4 provides an algorithmic description of the resulting APF, which has the same cost as the APF for the SIS model.

Input: Parameters $\theta\in\Theta$ and number of particles $P\in\mathbb{N}$
compute $v_{0}^{(i)}=\text{PoiBin}(i;\alpha_{0,I})\text{Bin}(y_{0};i,\rho)\mathbbm{1}(i\geq y_{0})$ for $i\in[0:N]$
set $w_{0}=\sum_{i=0}^{N}v_{0}^{(i)}$
normalize $V_{0}^{(i)}=v_{0}^{(i)}/w_{0}$ for $i\in[0:N]$
sample $I_{0}^{(p)}\sim\text{Cat}([0:N],V_{0}^{(0:N)})$, $I_{0}^{1:N,(p)}|I_{0}^{(p)}\sim\text{CondBer}(\alpha_{0,I},I_{0}^{(p)})$ for $p\in[1:P]$
sample $X_{0}^{n,(p)}|I_{0}^{n,(p)}\sim\text{Cat}([0:2],\nu_{0}(I_{0}^{n,(p)}))$ for $p\in[1:P]$ and $n\in[1:N]$
for $t=1,\cdots,T$ and $p=1,\cdots,P$ do
   compute $v_{t}^{(i,p)}=\text{PoiBin}(i;\alpha_{I}(X_{t-1}^{(p)}))\text{Bin}(y_{t};i,\rho)\mathbbm{1}(i\geq y_{t})$ for $i\in[0:N]$
   set $w_{t}^{(p)}=\sum_{i=0}^{N}v_{t}^{(i,p)}$
   normalize $V_{t}^{(i,p)}=v_{t}^{(i,p)}/w_{t}^{(p)}$ for $i\in[0:N]$
   normalize $W_{t}^{(p)}=w_{t}^{(p)}/\sum_{k=1}^{P}w_{t}^{(k)}$
   sample $A_{t-1}^{(p)}\sim r(\cdot|W_{t}^{(1:P)})$
   sample $I_{t}^{(p)}\sim\text{Cat}([0:N],V_{t}^{(0:N,A_{t-1}^{(p)})})$ and $I_{t}^{1:N,(p)}|I_{t}^{(p)}\sim\text{CondBer}(\alpha_{I}(X_{t-1}^{(A_{t-1}^{(p)})}),I_{t}^{(p)})$
   sample $X_{t}^{n,(p)}\mid X_{t-1}^{(A_{t-1}^{(p)})},I_{t}^{n,(p)}\sim\text{Cat}([0:2],\nu(X_{t-1}^{(A_{t-1}^{(p)})},I_{t}^{n,(p)}))$ for $n\in[1:N]$
end for
Output: Marginal likelihood estimator $\hat{p}_{\theta}(y_{0:T})=w_{0}\prod_{t=1}^{T}P^{-1}\sum_{p=1}^{P}w_{t}^{(p)}$, states $(X_{t}^{(p)})_{(t,p)\in[0:T]\times[1:P]}$ and ancestors $(A_{t}^{(p)})_{(t,p)\in[0:T-1]\times[1:P]}$
Algorithm 4 Auxiliary particle filter for SIR model
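The two distributional primitives used by Algorithm 4, the Poisson-Binomial pmf and conditioned-Bernoulli sampling, can be sketched as follows. Function names are ours: the pmf uses the standard $O(N^{2})$ convolution recursion (in the spirit of Barlow and Heidtmann, 1984), and the conditional sampler uses the sequential decomposition with suffix pmfs, assuming the conditioning total is feasible:

```python
import random

def poibin_pmf(probs):
    """Poisson-Binomial pmf of sum_n Bernoulli(p_n), by convolving in one
    Bernoulli at a time: O(N^2) time, exact up to floating point."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, m in enumerate(pmf):
            nxt[k] += m * (1.0 - p)      # this Bernoulli is 0
            nxt[k + 1] += m * p          # this Bernoulli is 1
        pmf = nxt
    return pmf

def condber_sample(probs, total, rng=None):
    """Draw (b_1,...,b_N), b_n ~ Bernoulli(p_n), conditioned on sum = total:
    sequentially, P(b_n=1 | remainder r) = p_n PoiBin_{n+1:N}(r-1) / PoiBin_{n:N}(r)."""
    rng = rng or random.Random(0)
    N = len(probs)
    suffix = [[1.0]]                     # built backwards, then reversed so that
    for p in reversed(probs):            # suffix[n] = pmf of sum over probs[n:]
        prev = suffix[-1]
        nxt = [0.0] * (len(prev) + 1)
        for k, m in enumerate(prev):
            nxt[k] += m * (1.0 - p)
            nxt[k + 1] += m * p
        suffix.append(nxt)
    suffix.reverse()
    sample, r = [], total
    for n in range(N):
        num = probs[n] * suffix[n + 1][r - 1] if 1 <= r <= len(suffix[n + 1]) else 0.0
        sample.append(1 if rng.random() < num / suffix[n][r] else 0)
        r -= sample[-1]
    return sample
```

Both routines also underlie the cSMC proposals of Section 4.2, which reuse the same augmentation of the infected counts.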

4.2 Controlled sequential Monte Carlo

We now consider approximation of the BIF (30) to construct a proposal distribution approximating (29). At the terminal time $T$, it suffices to compute $\psi_{T}(i_{T})=\text{Bin}(y_{T};i_{T},\rho)\mathbbm{1}(i_{T}\geq y_{T})$ for all $i_{T}\in[0:N]$ to represent the function $\psi_{T}^{\star}(x_{T})$. As before, the next iterate $\psi_{T-1}^{\star}(x_{T-1})$ requires an approximation of the conditional expectation $f_{\theta}(\psi_{T}|x_{T-1})=\sum_{i_{T}=0}^{N}\text{PoiBin}(i_{T};\alpha_{I}(x_{T-1}))\psi_{T}(i_{T})$. Following the arguments in (31), we approximate $\alpha(x_{T-1})$ by

\bar{\alpha}^{n}(x_{T-1})=(\bar{\alpha}_{S}^{n}(x_{T-1}),\bar{\alpha}_{I}^{n}(x_{T-1}),\bar{\alpha}_{R}^{n}(x_{T-1})),

defined as

\bar{\alpha}_{S}^{n}(x_{T-1})=S^{n}(x_{T-1})\left(1-\bar{\lambda}N^{-1}I(x_{T-1})\right),
\bar{\alpha}_{I}^{n}(x_{T-1})=S^{n}(x_{T-1})\bar{\lambda}N^{-1}I(x_{T-1})+I^{n}(x_{T-1})(1-\bar{\gamma}),\qquad(51)
\bar{\alpha}_{R}^{n}(x_{T-1})=I^{n}(x_{T-1})\bar{\gamma}+R^{n}(x_{T-1}),

which satisfies $\bar{\alpha}_{S}^{n}(x_{T-1})+\bar{\alpha}_{I}^{n}(x_{T-1})+\bar{\alpha}_{R}^{n}(x_{T-1})=1$. Writing the corresponding Markov transition as $\bar{f}_{\theta}(x_{T}|x_{T-1})=\prod_{n=1}^{N}\text{Cat}(x_{T}^{n};[0:2],\bar{\alpha}^{n}(x_{T-1}))$, we approximate the conditional expectation $f_{\theta}(\psi_{T}|x_{T-1})$ by

\bar{f}_{\theta}(\psi_{T}|S(x_{T-1}),I(x_{T-1}))=\sum_{i_{T}=0}^{N}\text{SumBin}(i_{T};S(x_{T-1}),\bar{\lambda}N^{-1}I(x_{T-1}),I(x_{T-1}),1-\bar{\gamma})\,\psi_{T}(i_{T}).\qquad(52)

Although (52) is analogous to (32) for the SIS model, the number of susceptible agents $S(x_{T-1})$ cannot be determined by just knowing the number of infections $I(x_{T-1})$ in the SIR model. Therefore it is necessary to account for both $S(x_{T-1})$ and $I(x_{T-1})$ in our approximation, i.e. we compute

\psi_{T-1}(s_{T-1},i_{T-1})=\text{Bin}(y_{T-1};i_{T-1},\rho)\mathbbm{1}(i_{T-1}\geq y_{T-1})\,\bar{f}_{\theta}(\psi_{T}|s_{T-1},i_{T-1})\qquad(53)

for all $(s_{T-1},i_{T-1})\in[0:N]\times[0:(N-s_{T-1})]$. Since the number of agents in the population is fixed, it is also possible to work with the variables $I(x_{T-1})$ and $R(x_{T-1})$ instead. Subsequently, for $t\in[0:T-2]$, we approximate (30) by

\psi_{t}(s_{t},i_{t})=\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})\,\bar{f}_{\theta}(\psi_{t+1}|s_{t},i_{t}),\quad(s_{t},i_{t})\in[0:N]\times[0:(N-s_{t})],\qquad(54)

where $\bar{f}_{\theta}(\psi_{t+1}|s_{t},i_{t})=\sum_{s_{t+1}=0}^{N}\sum_{i_{t+1}=0}^{N-s_{t+1}}\bar{f}_{\theta}(s_{t+1},i_{t+1}|s_{t},i_{t})\,\psi_{t+1}(s_{t+1},i_{t+1})$,

\bar{f}_{\theta}(s_{t+1},i_{t+1}|s_{t},i_{t})=\text{Bin}(s_{t+1};s_{t},1-\bar{\lambda}N^{-1}i_{t})\,\text{Bin}(i_{t}-i_{t+1}+s_{t}-s_{t+1};i_{t},\bar{\gamma}),\qquad(55)

for $(s_{t+1},i_{t+1})\in[0:s_{t}]\times[(s_{t}-s_{t+1}):(i_{t}+s_{t}-s_{t+1})]$, and zero otherwise. The above expression follows from the SIR model structure under the homogeneous probabilities (51). Algorithm 5 summarizes our approximation of the BIF $(\psi_{t})_{t\in[0:T]}$, which costs $\mathcal{O}(N^{4}T)$ to compute and $\mathcal{O}(N^{2}T)$ in storage.

Input: Parameters $\theta\in\Theta$
compute $\psi_{T}(i_{T})=\text{Bin}(y_{T};i_{T},\rho)\mathbbm{1}(i_{T}\geq y_{T})$ for $i_{T}\in[0:N]$
for $t=T-1,\cdots,0$, $s_{t}=0,\cdots,N$ and $i_{t}=0,\cdots,N-s_{t}$ do
   if $t=T-1$ then
      compute $\bar{f}_{\theta}(\psi_{t+1}|s_{t},i_{t})=\sum_{i_{t+1}=0}^{N}\text{SumBin}(i_{t+1};s_{t},\bar{\lambda}N^{-1}i_{t},i_{t},1-\bar{\gamma})\,\psi_{t+1}(i_{t+1})$
   else
      compute $\bar{f}_{\theta}(\psi_{t+1}|s_{t},i_{t})=\sum_{s_{t+1}=0}^{N}\sum_{i_{t+1}=0}^{N-s_{t+1}}\bar{f}_{\theta}(s_{t+1},i_{t+1}|s_{t},i_{t})\,\psi_{t+1}(s_{t+1},i_{t+1})$
   end if
   compute $\psi_{t}(s_{t},i_{t})=\text{Bin}(y_{t};i_{t},\rho)\mathbbm{1}(i_{t}\geq y_{t})\,\bar{f}_{\theta}(\psi_{t+1}|s_{t},i_{t})$
end for
Output: Approximate BIF $(\psi_{t})_{t\in[0:T]}$
Algorithm 5 Backward information filter approximation for SIR model
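The ingredients of Algorithm 5 can be sketched directly: a binomial pmf, the SumBin convolution in (52), and the homogenised transition (55). Function names are ours:

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial pmf Bin(k; n, p), zero outside the support {0, ..., n}."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def sumbin_pmf(i, n1, p1, n2, p2):
    """Pmf of Bin(n1, p1) + Bin(n2, p2) -- the SumBin distribution in (52) --
    by direct convolution over the first summand."""
    return sum(binom_pmf(j, n1, p1) * binom_pmf(i - j, n2, p2)
               for j in range(min(i, n1) + 1))

def fbar(s_next, i_next, s, i, lam_bar, gam_bar, N):
    """Homogenised SIR transition (55): each of the s susceptibles stays
    susceptible w.p. 1 - lam_bar*i/N, and i - i_next + s - s_next of the
    i infected agents recover w.p. gam_bar each."""
    return (binom_pmf(s_next, s, 1.0 - lam_bar * i / N)
            * binom_pmf(i - i_next + s - s_next, i, gam_bar))
```

A useful check is that, for fixed $(s_{t},i_{t})$, summing (55) over its support $(s_{t+1},i_{t+1})\in[0:s_{t}]\times[(s_{t}-s_{t+1}):(i_{t}+s_{t}-s_{t+1})]$ returns one, since the susceptible and recovery draws are independent binomials.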

We can define our proposal distribution and SMC weight functions in the same manner as (34) and (35), respectively. The expectations appearing in these SMC weights can be computed as

\mu_{\theta}(\psi_{0})=\sum_{i_{0}=0}^{N}\text{PoiBin}(i_{0};\alpha_{0,I})\,\psi_{0}(N-i_{0},i_{0}),\quad f_{\theta}(\psi_{t}|x_{t-1})=\sum_{s_{t}=0}^{N}\sum_{i_{t}=0}^{N-s_{t}}f_{\theta}(s_{t},i_{t}|x_{t-1})\,\psi_{t}(s_{t},i_{t}),\qquad(56)

for $t\in[1:T-1]$ and $f_{\theta}(\psi_{T}|x_{T-1})=\sum_{i_{T}=0}^{N}\text{PoiBin}(i_{T};\alpha_{I}(x_{T-1}))\psi_{T}(i_{T})$, where

f_{\theta}(s_{t},i_{t}|x_{t-1})=\text{PoiBin}(s_{t};\alpha_{S}(x_{t-1}))\,\text{PoiBin}(I(x_{t-1})-i_{t}+S(x_{t-1})-s_{t};(I^{n}(x_{t-1})\gamma^{n})_{n\in[1:N]}),\qquad(57)

for $(s_{t},i_{t})\in[0:S(x_{t-1})]\times[(S(x_{t-1})-s_{t}):(I(x_{t-1})+S(x_{t-1})-s_{t})]$, and zero otherwise. The above expression should be understood as the analogue of (55) under the heterogeneous probabilities (46). Sampling from the proposals can be done in a similar manner as for the APF in Section 4.1. At the initial time, we sample from

q_{0}(x_{0},i_{0},i_{0}^{1:N}|\theta)=q_{0}(i_{0}|\theta)\,q_{0}(i_{0}^{1:N}|i_{0},\theta)\,q_{0}(x_{0}|i_{0}^{1:N},\theta)\qquad(58)
=\frac{\text{PoiBin}(i_{0};\alpha_{0,I})\,\psi_{0}(N-i_{0},i_{0})}{\mu_{\theta}(\psi_{0})}\,\text{CondBer}(i_{0}^{1:N};\alpha_{0,I},i_{0})\prod_{n=1}^{N}\text{Cat}(x_{0}^{n};[0:2],\nu_{0}(i_{0}^{n})),

which admits $q_{0}(x_{0}|\theta)$ as its marginal distribution. For time $t\in[1:T-1]$, we sample from

q_{t}(x_{t},i_{t},i_{t}^{1:N}|x_{t-1},\theta)=q_{t}(i_{t}|x_{t-1},\theta)\,q_{t}(i_{t}^{1:N}|x_{t-1},i_{t},\theta)\,q_{t}(x_{t}|x_{t-1},i_{t}^{1:N},\theta)\qquad(59)
=\frac{\sum_{s_{t}=0}^{N-i_{t}}f_{\theta}(s_{t},i_{t}|x_{t-1})\,\psi_{t}(s_{t},i_{t})}{f_{\theta}(\psi_{t}|x_{t-1})}\,\text{CondBer}(i_{t}^{1:N};\alpha_{I}(x_{t-1}),i_{t})\prod_{n=1}^{N}\text{Cat}(x_{t}^{n};[0:2],\nu(x_{t-1},i_{t}^{n})),

and

q_{T}(x_{T},i_{T},i_{T}^{1:N}|x_{T-1},\theta)=q_{T}(i_{T}|x_{T-1},\theta)\,q_{T}(i_{T}^{1:N}|x_{T-1},i_{T},\theta)\,q_{T}(x_{T}|x_{T-1},i_{T}^{1:N},\theta)\qquad(60)
=\frac{\text{PoiBin}(i_{T};\alpha_{I}(x_{T-1}))\,\psi_{T}(i_{T})}{f_{\theta}(\psi_{T}|x_{T-1})}\,\text{CondBer}(i_{T}^{1:N};\alpha_{I}(x_{T-1}),i_{T})\prod_{n=1}^{N}\text{Cat}(x_{T}^{n};[0:2],\nu(x_{T-1},i_{T}^{n})),

which admits $q_{t}(x_{t}|x_{t-1},\theta)$ as its marginal transition for all $t\in[1:T]$. Algorithm 6 details the resulting cSMC method, whose cost is the same as that of cSMC for the SIS model.

Input: Parameters $\theta\in\Theta$, approximate BIF $(\psi_{t})_{t\in[0:T]}$ and number of particles $P\in\mathbb{N}$
compute $v_{0}^{(i)}=\text{PoiBin}(i;\alpha_{0,I})\,\psi_{0}(N-i,i)$ for $i\in[0:N]$
set $E_{0}=\sum_{i=0}^{N}v_{0}^{(i)}$
normalize $V_{0}^{(i)}=v_{0}^{(i)}/E_{0}$ for $i\in[0:N]$
sample $I_{0}^{(p)}\sim\text{Cat}([0:N],V_{0}^{(0:N)})$ and $I_{0}^{1:N,(p)}|I_{0}^{(p)}\sim\text{CondBer}(\alpha_{0,I},I_{0}^{(p)})$ for $p\in[1:P]$
sample $X_{0}^{n,(p)}|I_{0}^{n,(p)}\sim\text{Cat}([0:2],\nu_{0}(I_{0}^{n,(p)}))$ for $p\in[1:P]$ and $n\in[1:N]$
compute $w_{0}^{(p)}=w_{0}(X_{0}^{(p)})$ using (35) for $p\in[1:P]$
for $t=1,\cdots,T$ and $p=1,\cdots,P$ do
   normalize $W_{t-1}^{(p)}=w_{t-1}^{(p)}/\sum_{k=1}^{P}w_{t-1}^{(k)}$
   sample $A_{t-1}^{(p)}\sim r(\cdot|W_{t-1}^{(1:P)})$
   if $t<T$ then
      compute $v_{t}^{(i,p)}=\sum_{s=0}^{N-i}f_{\theta}(s,i|X_{t-1}^{(A_{t-1}^{(p)})})\,\psi_{t}(s,i)$ for $i\in[0:N]$
   else
      compute $v_{t}^{(i,p)}=\text{PoiBin}(i;\alpha_{I}(X_{t-1}^{(A_{t-1}^{(p)})}))\,\psi_{t}(i)$ for $i\in[0:N]$
   end if
   set $E_{t}^{(p)}=\sum_{i=0}^{N}v_{t}^{(i,p)}$
   normalize $V_{t}^{(i,p)}=v_{t}^{(i,p)}/E_{t}^{(p)}$ for $i\in[0:N]$
   sample $I_{t}^{(p)}\sim\text{Cat}([0:N],V_{t}^{(0:N,p)})$ and $I_{t}^{1:N,(p)}|I_{t}^{(p)}\sim\text{CondBer}(\alpha_{I}(X_{t-1}^{(A_{t-1}^{(p)})}),I_{t}^{(p)})$
   sample $X_{t}^{n,(p)}|X_{t-1}^{(A_{t-1}^{(p)})},I_{t}^{n,(p)}\sim\text{Cat}([0:2],\nu(X_{t-1}^{(A_{t-1}^{(p)})},I_{t}^{n,(p)}))$ for $n\in[1:N]$
   compute $w_{t}^{(p)}=w_{t}(X_{t}^{(p)})$ using (35)
end for
Output: Marginal likelihood estimator $\hat{p}_{\theta}(y_{0:T})=\prod_{t=0}^{T}P^{-1}\sum_{p=1}^{P}w_{t}^{(p)}$, states $(X_{t}^{(p)})_{(t,p)\in[0:T]\times[1:P]}$ and ancestors $(A_{t}^{(p)})_{(t,p)\in[0:T-1]\times[1:P]}$
Algorithm 6 Controlled sequential Monte Carlo for SIR model

Our proposal distribution $q_{\theta}(x_{0:T})$ also satisfies the Kullback–Leibler upper bound in Proposition 3.1, with appropriate notational extensions to the state space $\{0,1,2\}^{N}$. The following result is analogous to Proposition 3.2 for the SIS model.

Proposition 4.1.

For each time $t\in[0:T]$, the BIF approximation in Algorithm 5 satisfies:

\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\psi_{t})|\theta)\leq\sum_{k=t}^{T-1}c^{\star}_{\theta}(\psi_{k+1})\{\|\phi^{\star}_{\theta}\|_{L^{2}(\eta_{k}^{\star})}\|\Delta_{\theta,S}\|_{L^{2}(\eta_{k}^{\star})}+\|\xi^{\star}_{\theta}\|_{L^{2}(\eta_{k}^{\star})}\|\Delta_{\theta,R}\|_{L^{2}(\eta_{k}^{\star})}\},\qquad(61)
\eta_{t}(\log(\psi_{t}/\psi_{t}^{\star})|\theta)\leq\sum_{k=t}^{T-1}c_{\theta}(\psi_{k+1})\{\|\phi_{\theta}\|_{L^{2}(\eta_{k})}\|\Delta_{\theta,S}\|_{L^{2}(\eta_{k})}+\|\xi_{\theta}\|_{L^{2}(\eta_{k})}\|\Delta_{\theta,R}\|_{L^{2}(\eta_{k})}\}.\qquad(62)

The constants are $c^{\star}_{\theta}(\psi_{k})=\{\min_{x_{k-1}\in\{0,1,2\}^{N}\setminus(0,\ldots,0)}f_{\theta}(\psi_{k}|x_{k-1})\}^{-1}<\infty$ and $c_{\theta}(\psi_{k})=\{\min_{(s_{k-1},i_{k-1})\in[0:N-i_{k-1}]\times[1:N]}\bar{f}_{\theta}(\psi_{k}|s_{k-1},i_{k-1})\}^{-1}<\infty$ for $k\in[1:T]$, and the functions are $\Delta_{\theta,S}(x)=\sum_{n=1}^{N}|\bar{\alpha}_{S}^{n}(x)-\alpha_{S}^{n}(x)|$, $\Delta_{\theta,R}(x)=\sum_{n=1}^{N}|\bar{\alpha}_{R}^{n}(x)-\alpha_{R}^{n}(x)|$,

\phi^{\star}_{\theta}(x)=\left\{\sum_{s=0}^{S(x)}\frac{\text{PoiBin}(s;\alpha_{S}(x))^{2}}{\text{PoiBin}(s;\bar{\alpha}_{S}(x))^{2}}\right\}^{1/2},\quad\xi^{\star}_{\theta}(x)=\left\{\sum_{r=0}^{I(x)}\frac{\text{PoiBin}(r;(I^{n}(x)\gamma^{n})_{n\in[1:N]})^{2}}{\text{PoiBin}(r;(I^{n}(x)\bar{\gamma})_{n\in[1:N]})^{2}}\right\}^{1/2},\qquad(63)
\phi_{\theta}(x)=\left\{\sum_{s=0}^{S(x)}\frac{\text{PoiBin}(s;\bar{\alpha}_{S}(x))^{2}}{\text{PoiBin}(s;\alpha_{S}(x))^{2}}\right\}^{1/2},\quad\xi_{\theta}(x)=\left\{\sum_{r=0}^{I(x)}\frac{\text{PoiBin}(r;(I^{n}(x)\bar{\gamma})_{n\in[1:N]})^{2}}{\text{PoiBin}(r;(I^{n}(x)\gamma^{n})_{n\in[1:N]})^{2}}\right\}^{1/2},\qquad(64)

for $x\in\{0,1,2\}^{N}$.

The arguments in the proof of Proposition 4.1 in Appendix B.3 are similar to those for Proposition 3.2, with some modifications tailored to the SIR model. This result shows the dependence of the BIF approximation error on our approximation of the transition probability $\alpha(x)$ via the terms $\Delta_{\theta,S}(x)$ and $\Delta_{\theta,R}(x)$. As before, we can decompose $\Delta_{\theta,S}(x)\leq\Delta_{\theta}^{\mathcal{G}}(x)+\Delta_{\theta}^{\lambda}$ and $\Delta_{\theta,R}(x)\leq\Delta_{\theta}^{\gamma}$ into the elements defined in (43). One can also obtain more fine-grained approximations using clustering, in the spirit of Appendix B.4.

5 Discussion

Although agent-based models have been widely used as a simulation paradigm, statistical inference for these models has not received as much attention, due in part to the computational challenges involved. We have focused on agent-based models of disease transmission, and presented new SMC methods that can estimate their likelihood more efficiently than the standard BPF.

Our proposed methodology can be extended in various directions. Instead of the binomial model in (3), other observation models such as a negative binomial distribution could be considered. We could also adapt the controlled SMC methodology to handle the case where observations are only available at a collection of time instances, and we can expect further relative gains in such settings. Future work could consider settings where the difference in infection counts between successive time steps is observed.

We view our contribution as a step towards inference for larger classes of agent-based models, and we believe that some of our contributions might be useful beyond the models considered here. We hope to motivate further work on alleviating the computational burden of inference in agent-based models, and that the removal of some of the computational bottlenecks might encourage further investigation of the statistical properties of these models.

Acknowledgments

This work was funded by CY Initiative of Excellence (grant “Investissements d’Avenir” ANR-16-IDEX-0008). Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695.

References

  • Allen and Burgin [2000] Linda JS Allen and Amy M Burgin. Comparison of deterministic and stochastic SIS and SIR models in discrete time. Mathematical biosciences, 163(1):1–33, 2000.
  • Allen et al. [2008] Linda JS Allen, Fred Brauer, Pauline Van den Driessche, and Jianhong Wu. Mathematical epidemiology, volume 1945. Springer, 2008.
  • Andrieu and Roberts [2009] Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
  • Andrieu et al. [2010] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Barbour and Ćekanavićius [2002] Andrew D Barbour and V Ćekanavićius. Total variation asymptotics for sums of independent integer random variables. The Annals of Probability, 30(2):509–545, 2002.
  • Barbour and Hall [1984] Andrew D Barbour and Peter Hall. On the rate of Poisson convergence. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 95, pages 473–480. Cambridge University Press, 1984.
  • Barbour and Xia [1999] Andrew D Barbour and Aihua Xia. Poisson perturbations. ESAIM: Probability and Statistics, 3:131–150, 1999.
  • Barlow and Heidtmann [1984] Richard E Barlow and Klaus D Heidtmann. Computing k-out-of-n system reliability. IEEE Transactions on Reliability, 33(4):322–323, 1984.
  • Bresler [1986] Yoram Bresler. Two-filter formulae for discrete-time non-linear Bayesian smoothing. International Journal of Control, 43(2):629–641, 1986.
  • Briers et al. [2010] Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state–space models. Annals of the Institute of Statistical Mathematics, 62(1):61, 2010.
  • Britton [2010] Tom Britton. Stochastic epidemic models: A survey. Mathematical Biosciences, 225(1):24–35, 2010.
  • Bu et al. [2020] Fan Bu, Allison E Aiello, Jason Xu, and Alexander Volfovsky. Likelihood-based inference for partially observed epidemics on dynamic networks. Journal of the American Statistical Association, pages 1–17, 2020.
  • Carpenter et al. [1999] James Carpenter, Peter Clifford, and Paul Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings-Radar, Sonar and Navigation, 146(1):2–7, 1999.
  • Čekanavičius and Vaitkus [2001] V Čekanavičius and P Vaitkus. Centered Poisson approximation via Stein’s method. Lithuanian Mathematical Journal, 41(4):319–329, 2001.
  • Chatterjee and Diaconis [2018] Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.
  • Chen et al. [2017] CC-M Chen, Christopher C Drovandi, Jonathan M Keith, Ken Anthony, M Julian Caley, and KL Mengersen. Bayesian semi-individual based model with approximate Bayesian computation for parameters calibration: Modelling crown-of-thorns populations on the Great Barrier Reef. Ecological Modelling, 364:113–123, 2017.
  • Chen and Liu [1997] Sean X Chen and Jun S Liu. Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions. Statistica Sinica, pages 875–892, 1997.
  • Chen et al. [1994] Xiang-Hui Chen, Arthur P Dempster, and Jun S Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81(3):457–469, 1994.
  • DeAngelis and Gross [2018] Donald Lee DeAngelis and Louis J Gross. Individual-based models and approaches in ecology: Populations, communities and ecosystems. CRC Press, 2018.
  • Del Moral [2004] Pierre Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag New York, 2004.
  • Del Moral et al. [2015] Pierre Del Moral, Ajay Jasra, Anthony Lee, Christopher Yau, and Xiaole Zhang. The alive particle filter and its use in particle Markov chain Monte Carlo. Stochastic Analysis and Applications, 33(6):943–974, 2015.
  • Doucet et al. [2001] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer, 2001.
  • Endo et al. [2019] Akira Endo, Edwin van Leeuwen, and Marc Baguelin. Introduction to particle Markov chain Monte Carlo for disease dynamics modellers. Epidemics, 29:100363, 2019.
  • Epstein [2006] Joshua M Epstein. Generative social science: Studies in agent-based computational modeling, volume 13. Princeton University Press, 2006.
  • Fan et al. [1962] CT Fan, Mervin E Muller, and Ivan Rezucha. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57(298):387–402, 1962.
  • Fearnhead and Clifford [2003] Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003.
  • Fintzi et al. [2017] Jonathan Fintzi, Xiang Cui, Jon Wakefield, and Vladimir N Minin. Efficient data augmentation for fitting stochastic epidemic models to prevalence data. Journal of Computational and Graphical Statistics, 26(4):918–929, 2017.
  • Gerber et al. [2019] Mathieu Gerber, Nicolas Chopin, and Nick Whiteley. Negative association, ordering and convergence of resampling methods. Annals of Statistics, 47(4):2236–2260, 2019.
  • Ghahramani and Jordan [1996] Zoubin Ghahramani and Michael I Jordan. Factorial hidden Markov models. In Advances in Neural Information Processing Systems, pages 472–478, 1996.
  • Gordon et al. [1993] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.
  • Grazzini et al. [2017] Jakob Grazzini, Matteo G Richiardi, and Mike Tsionas. Bayesian estimation of agent-based models. Journal of Economic Dynamics and Control, 77:26–47, 2017.
  • Guarniero et al. [2017] Pieralberto Guarniero, Adam M Johansen, and Anthony Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
  • Hazelbag et al. [2020] C Marijn Hazelbag, Jonathan Dushoff, Emanuel M Dominic, Zinhle E Mthombothi, and Wim Delva. Calibration of individual-based models to epidemiological data: A systematic review. PLoS computational biology, 16(5):e1007893, 2020.
  • Heng et al. [2020a] Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. Annals of Statistics, 48(5):2904–2929, 2020a.
  • Heng et al. [2020b] Jeremy Heng, Pierre E. Jacob, and Nianqiao Ju. A simple Markov chain for independent Bernoulli variables conditioned on their sum. arXiv preprint arXiv:2012.03103, 2020b.
  • Hodges and Le Cam [1960] Joseph L Hodges and Lucien Le Cam. The Poisson approximation to the Poisson binomial distribution. The Annals of Mathematical Statistics, 31(3):737–740, 1960.
  • Hong [2013] Yili Hong. On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59:41–51, 2013.
  • Hooten et al. [2020] Mevin Hooten, Christopher Wikle, and Michael Schwob. Statistical implementations of agent-based demographic models. International Statistical Review, 88(2):441–461, 2020.
  • Ionides et al. [2006] Edward L Ionides, Carles Bretó, and Aaron A King. Inference for nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 103(49):18438–18443, 2006.
  • Johansen and Doucet [2008] Adam M Johansen and Arnaud Doucet. A note on auxiliary particle filters. Statistics & Probability Letters, 78(12):1498–1504, 2008.
  • Kantas et al. [2015] Nikolas Kantas, Arnaud Doucet, Sumeetpal S Singh, Jan Maciejowski, and Nicolas Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 2015.
  • Kattwinkel and Reichert [2017] Mira Kattwinkel and Peter Reichert. Bayesian parameter inference for individual-based models using a Particle Markov Chain Monte Carlo method. Environmental Modelling & Software, 87:110–119, 2017.
  • Kong et al. [1994] Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994.
  • Lipsitch et al. [2003] Marc Lipsitch, Ted Cohen, Ben Cooper, James M Robins, Stefan Ma, Lyn James, Gowri Gopalakrishna, Suok Kai Chew, Chorh Chuan Tan, Matthew H Samore, et al. Transmission dynamics and control of severe acute respiratory syndrome. Science, 300(5627):1966–1970, 2003.
  • Liu and Chen [1998] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
  • McVinish and Pollett [2012] R. McVinish and P. K. Pollett. A central limit theorem for a discrete-time SIS model with individual variation. Journal of Applied Probability, 49(2):521–530, 2012. doi: 10.1239/jap/1339878802.
  • Osthus et al. [2019] Dave Osthus, James Gattiker, Reid Priedhorsky, and Sara Y Del Valle. Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy (with discussion). Bayesian Analysis, 14(1):261–312, 2019.
  • Pitt and Shephard [1999] Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
  • Platt [2020] Donovan Platt. A comparison of economic agent-based model calibration methods. Journal of Economic Dynamics and Control, 113:103859, 2020.
  • Scharth and Kohn [2016] Marcel Scharth and Robert Kohn. Particle efficient importance sampling. Journal of Econometrics, 190(1):133–147, 2016.
  • Sirén et al. [2018] Jukka Sirén, Luc Lens, Laurence Cousseau, and Otso Ovaskainen. Assessing the dynamics of natural populations by fitting individual-based models with approximate Bayesian computation. Methods in Ecology and Evolution, 9(5):1286–1295, 2018.
  • Tisue and Wilensky [2004] Seth Tisue and Uri Wilensky. NetLogo: A simple environment for modeling complexity. In International conference on complex systems, volume 21, pages 16–21. Boston, MA, 2004.
  • Turrell [2016] Arthur Turrell. Agent-based models: Understanding the economy from the bottom up. Bank of England Quarterly Bulletin, page Q4, 2016.
  • van der Vaart et al. [2016] Elske van der Vaart, Alice SA Johnston, and Richard M Sibly. Predicting how many animals will be where: How to build, calibrate and evaluate individual-based models. Ecological Modelling, 326:113–123, 2016.
  • Volkova [1996] A Yu Volkova. A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability & Its Applications, 40(4):791–794, 1996.
  • Wang [1993] Yuan H Wang. On the number of successes in independent trials. Statistica Sinica, pages 295–312, 1993.
  • Whiteley and Rimella [2020] Nick Whiteley and Lorenzo Rimella. Inference in stochastic epidemic models via multinomial approximations. arXiv preprint arXiv:2006.13700, 2020.

Appendix A Poisson binomial distributions

A.1 Recursive definition of Poisson binomial probabilities

The following provides a derivation of the recursion (10) for the function q(i,n)=PoiBin(i;αn:N)=(m=nNXm=i)q(i,n)=\text{PoiBin}(i;\alpha^{n:N})=\mathbb{P}(\sum_{m=n}^{N}X^{m}=i) with XmBer(αm)X^{m}\sim\text{Ber}(\alpha^{m}) independently. The recursion allows the computation of Poisson binomial probabilities in 𝒪(N2)\mathcal{O}(N^{2}) operations. Under XnBer(αn)X^{n}\sim\text{Ber}(\alpha^{n}) independently for n[1:N]n\in[1:N], the initial conditions are given by

q(0,n)=(m=nNXm=0)=m=nN(Xm=0)=m=nN(1αm),n[1:N],q(0,n)=\mathbb{P}\left(\sum_{m=n}^{N}X^{m}=0\right)=\prod_{m=n}^{N}\mathbb{P}(X^{m}=0)=\prod_{m=n}^{N}(1-\alpha^{m}),\quad n\in[1:N], (65)

in the case of no success, q(1,N)=(XN=1)=αNq(1,N)=\mathbb{P}(X^{N}=1)=\alpha^{N}, where the sum reduces to a single Bernoulli variable, and q(i,n)=0q(i,n)=0 for i>Nn+1i>N-n+1 because a sum of Nn+1N-n+1 Bernoulli variables cannot exceed Nn+1N-n+1; in particular, q(i,N)=0q(i,N)=0 for all i2i\geq 2. For i[1:N]i\in[1:N] and n[1:N1]n\in[1:N-1], by conditioning on the value of Xn{0,1}X^{n}\in\{0,1\}, the law of total probability gives (10):

q(i,n)\displaystyle q(i,n) =(Xn=1)(m=nNXm=iXn=1)+(Xn=0)(m=nNXm=iXn=0)\displaystyle=\mathbb{P}(X^{n}=1)\leavevmode\nobreak\ \mathbb{P}\left(\sum_{m=n}^{N}X^{m}=i\mid X^{n}=1\right)\leavevmode\nobreak\ +\leavevmode\nobreak\ \mathbb{P}(X^{n}=0)\leavevmode\nobreak\ \mathbb{P}\left(\sum_{m=n}^{N}X^{m}=i\mid X^{n}=0\right)
=αn(m=n+1NXm=i1)+(1αn)(m=n+1NXm=i)\displaystyle=\alpha^{n}\leavevmode\nobreak\ \mathbb{P}\left(\sum_{m=n+1}^{N}X^{m}=i-1\right)\leavevmode\nobreak\ +\leavevmode\nobreak\ (1-\alpha^{n})\leavevmode\nobreak\ \mathbb{P}\left(\sum_{m=n+1}^{N}X^{m}=i\right) (66)
=αnq(i1,n+1)+(1αn)q(i,n+1).\displaystyle=\alpha^{n}q(i-1,n+1)+(1-\alpha^{n})q(i,n+1).
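For concreteness, the recursion can be implemented in a few lines; the function below (its name and interface are our own illustration, not part of any package) computes the full PMF in 𝒪(N²) operations by iterating n from N down to 1.

```python
def poisson_binomial_pmf(alpha):
    """PMF of PoiBin(alpha) for success probabilities alpha[0], ..., alpha[N-1],
    via the recursion q(i, n) = alpha^n q(i-1, n+1) + (1 - alpha^n) q(i, n+1).
    Returns the vector (P(X^1 + ... + X^N = i))_{i in [0:N]}."""
    N = len(alpha)
    q = [0.0] * (N + 1)
    # initial conditions at n = N: the sum reduces to a single Bernoulli variable
    q[0], q[1] = 1.0 - alpha[N - 1], alpha[N - 1]
    for n in range(N - 2, -1, -1):  # n = N-1, ..., 1 in the paper's 1-based indexing
        a = alpha[n]
        # update q(., n+1) into q(., n); iterate i downwards so that
        # q[i - 1] still holds the value from stage n + 1
        for i in range(N - n, 0, -1):
            q[i] = a * q[i - 1] + (1.0 - a) * q[i]
        q[0] *= 1.0 - a
    return q
```

When all entries of alpha are equal, the output coincides with a binomial PMF, which provides a quick sanity check.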

A.2 A thinning result

We show that under the static model XnBer(αn)X^{n}\sim\text{Ber}(\alpha^{n}) independently for n[1:N]n\in[1:N] and the observation model YX=xBin(I(x),ρ)Y\mid X=x\sim\text{Bin}(I(x),\rho), we have YPoiBin(ρα)Y\sim\text{PoiBin}(\rho\,\alpha) marginally. We first note that the characteristic function of I(X)=n=1NXnPoiBin(α)I(X)=\sum_{n=1}^{N}X^{n}\sim\text{PoiBin}(\alpha) is given by

𝔼[exp(iωI(X))]=n=1N𝔼[exp(iωXn)]=n=1N{αnexp(iω)+(1αn)}\mathbb{E}\left[\exp(i\omega I(X))\right]=\prod_{n=1}^{N}\mathbb{E}\left[\exp(i\omega X^{n})\right]=\prod_{n=1}^{N}\{\alpha^{n}\exp(i\omega)+(1-\alpha^{n})\} (67)

for ω\omega\in\mathbb{R}. Consider the representation Y=n=1IZnY=\sum_{n=1}^{I}Z^{n} where ZnBer(ρ)Z^{n}\sim\text{Ber}(\rho) independently. Note that the characteristic function of each Bernoulli random variable with success probability ρ\rho is φZ(ω)=𝔼[exp(iωZn)]=ρexp(iω)+(1ρ)\varphi_{Z}(\omega)=\mathbb{E}\left[\exp(i\omega Z^{n})\right]=\rho\exp(i\omega)+(1-\rho). By the law of iterated expectations, the characteristic function of YY is

𝔼[exp(iωY)]\displaystyle\mathbb{E}\left[\exp(i\omega Y)\right] =𝔼[𝔼[exp(iωn=1IZn)I]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\exp\left(i\omega\sum_{n=1}^{I}Z^{n}\right)\mid I\right]\right]
=𝔼[φZ(ω)I]\displaystyle=\mathbb{E}\left[\varphi_{Z}(\omega)^{I}\right]
=n=1N𝔼[φZ(ω)Xn]\displaystyle=\prod_{n=1}^{N}\mathbb{E}\left[\varphi_{Z}(\omega)^{X^{n}}\right] (68)
=n=1N{αnφZ(ω)+(1αn)}\displaystyle=\prod_{n=1}^{N}\{\alpha^{n}\varphi_{Z}(\omega)+(1-\alpha^{n})\}
=n=1N{ραnexp(iω)+(1ραn)}\displaystyle=\prod_{n=1}^{N}\{\rho\alpha^{n}\exp(i\omega)+(1-\rho\alpha^{n})\}

for ω\omega\in\mathbb{R}. By comparing this characteristic function with (67), we can conclude that YPoiBin(ρα)Y\sim\text{PoiBin}(\rho\,\alpha).
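This thinning result can be checked numerically by exact enumeration for a small population; the values of alpha and rho below are arbitrary illustrative choices, not taken from the paper.

```python
from itertools import product
from math import comb, prod, isclose

def pmf_sum_bernoulli(p):
    """Exact PMF of a sum of independent Ber(p[n]) variables,
    by enumeration (tractable for small N only)."""
    N = len(p)
    pmf = [0.0] * (N + 1)
    for x in product([0, 1], repeat=N):
        pmf[sum(x)] += prod(p[n] if x[n] else 1 - p[n] for n in range(N))
    return pmf

alpha = [0.2, 0.5, 0.7]  # illustrative success probabilities
rho = 0.6                # illustrative observation probability
N = len(alpha)

# Marginal PMF of Y, where I(X) ~ PoiBin(alpha) and Y | X ~ Bin(I(X), rho)
pmf_I = pmf_sum_bernoulli(alpha)
pmf_Y = [sum(pmf_I[i] * comb(i, y) * rho**y * (1 - rho)**(i - y)
             for i in range(y, N + 1)) for y in range(N + 1)]

# Thinning result: the marginal of Y coincides with PoiBin(rho * alpha)
pmf_thinned = pmf_sum_bernoulli([rho * a for a in alpha])
assert all(isclose(a, b) for a, b in zip(pmf_Y, pmf_thinned))
```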

A.3 Comparing two Poisson binomial distributions

The following result will be of use in Appendix B.

Lemma A.1.

Let PoiBin(α)\text{PoiBin}(\alpha) and PoiBin(α¯)\text{PoiBin}(\bar{\alpha}) denote two Poisson binomial distributions with probabilities α=(αn)n[1:N]\alpha=(\alpha^{n})_{n\in[1:N]} and α¯=(α¯n)n[1:N]\bar{\alpha}=(\bar{\alpha}^{n})_{n\in[1:N]}, respectively. The 2\ell^{2}-norm between these PMFs satisfies

i[0:N]{PoiBin(i;α¯)PoiBin(i;α)}2(n=1N|α¯nαn|)2.\displaystyle\sum_{i\in[0:N]}\left\{\text{PoiBin}(i;\bar{\alpha})-\text{PoiBin}(i;\alpha)\right\}^{2}\leq\left(\sum_{n=1}^{N}|\bar{\alpha}^{n}-\alpha^{n}|\right)^{2}. (69)

The Kullback–Leibler divergence from PoiBin(α)\text{PoiBin}(\alpha) to PoiBin(α¯)\text{PoiBin}(\bar{\alpha}), defined as

KL(PoiBin(α¯)PoiBin(α))=i[0:N]PoiBin(i;α¯)log(PoiBin(i;α¯)PoiBin(i;α)),\displaystyle\text{KL}\left(\text{PoiBin}(\bar{\alpha})\mid\text{PoiBin}(\alpha)\right)=\sum_{i\in[0:N]}\text{PoiBin}(i;\bar{\alpha})\log\left(\frac{\text{PoiBin}(i;\bar{\alpha})}{\text{PoiBin}(i;\alpha)}\right),

is upper bounded by

KL(PoiBin(α¯)PoiBin(α))(i[0:N]PoiBin(i;α¯)2PoiBin(i;α)2)1/2(n=1N|α¯nαn|).\displaystyle\text{KL}\left(\text{PoiBin}(\bar{\alpha})\mid\text{PoiBin}(\alpha)\right)\leq\left(\sum_{i\in[0:N]}\frac{\text{PoiBin}(i;\bar{\alpha})^{2}}{\text{PoiBin}(i;\alpha)^{2}}\right)^{1/2}\left(\sum_{n=1}^{N}|\bar{\alpha}^{n}-\alpha^{n}|\right). (70)
Proof.

To bound the 2\ell^{2}-norm between two Poisson binomial PMFs, we rely on Parseval’s identity

i[0:N]{PoiBin(i;α¯)PoiBin(i;α)}2=12πππ{φ¯(ω;α¯)φ(ω;α)}2dω,\displaystyle\sum_{i\in[0:N]}\left\{\text{PoiBin}(i;\bar{\alpha})-\text{PoiBin}(i;\alpha)\right\}^{2}=\frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\bar{\varphi}(\omega;\bar{\alpha})-\varphi(\omega;\alpha)\right\}^{2}d\omega, (71)

where

φ¯(ω;α¯)=n=1N{α¯nexp(iω)+(1α¯n)},φ(ω;α)=n=1N{αnexp(iω)+(1αn)},\bar{\varphi}(\omega;\bar{\alpha})=\prod_{n=1}^{N}\{\bar{\alpha}^{n}\exp(i\omega)+(1-\bar{\alpha}^{n})\},\quad\varphi(\omega;\alpha)=\prod_{n=1}^{N}\{\alpha^{n}\exp(i\omega)+(1-\alpha^{n})\}, (72)

denote the characteristic functions of PoiBin(α¯)\text{PoiBin}(\bar{\alpha}) and PoiBin(α)\text{PoiBin}(\alpha), respectively (as in (67)). Noting that each factor of the products in (72) has modulus at most one, by repeated applications of the triangle inequality to the decomposition

φ¯(ω;α¯)φ(ω;α)\displaystyle\bar{\varphi}(\omega;\bar{\alpha})-\varphi(\omega;\alpha) (73)
=n=1N(α¯nαn)(exp(iω)1)m=1n1{αmexp(iω)+(1αm)}p=n+1N{α¯pexp(iω)+(1α¯p)}\displaystyle=\sum_{n=1}^{N}(\bar{\alpha}^{n}-\alpha^{n})(\exp(i\omega)-1)\prod_{m=1}^{n-1}\{\alpha^{m}\exp(i\omega)+(1-\alpha^{m})\}\prod_{p=n+1}^{N}\{\bar{\alpha}^{p}\exp(i\omega)+(1-\bar{\alpha}^{p})\}

(with the convention n=mp=1\prod_{n=m}^{p}=1 for p<mp<m), we have

{12πππ{φ¯(ω;α¯)φ(ω;α)}2𝑑ω}1/2n=1N{12πππ(α¯nαn)2(exp(iω)1)2𝑑ω}1/2=n=1N|α¯nαn|.\displaystyle\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\bar{\varphi}(\omega;\bar{\alpha})-\varphi(\omega;\alpha)\right\}^{2}d\omega\right\}^{1/2}\leq\sum_{n=1}^{N}\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi}(\bar{\alpha}^{n}-\alpha^{n})^{2}(\exp(i\omega)-1)^{2}d\omega\right\}^{1/2}=\sum_{n=1}^{N}|\bar{\alpha}^{n}-\alpha^{n}|.

Hence (69) follows by squaring both sides and applying the identity in (71).

To bound the Kullback–Leibler divergence, we apply the inequality log(x)x1\log(x)\leq x-1 for x>0x>0 and the Cauchy–Schwarz inequality

KL(PoiBin(α¯)PoiBin(α))i[0:N]{PoiBin(i;α¯)PoiBin(i;α)}{PoiBin(i;α¯)PoiBin(i;α)}\displaystyle\text{KL}\left(\text{PoiBin}(\bar{\alpha})\mid\text{PoiBin}(\alpha)\right)\leq\sum_{i\in[0:N]}\left\{\frac{\text{PoiBin}(i;\bar{\alpha})}{\text{PoiBin}(i;\alpha)}\right\}\left\{\text{PoiBin}(i;\bar{\alpha})-\text{PoiBin}(i;\alpha)\right\}
(i[0:N]PoiBin(i;α¯)2PoiBin(i;α)2)1/2(i[0:N]{PoiBin(i;α¯)PoiBin(i;α)}2)1/2.\displaystyle\leq\left(\sum_{i\in[0:N]}\frac{\text{PoiBin}(i;\bar{\alpha})^{2}}{\text{PoiBin}(i;\alpha)^{2}}\right)^{1/2}\left(\sum_{i\in[0:N]}\left\{\text{PoiBin}(i;\bar{\alpha})-\text{PoiBin}(i;\alpha)\right\}^{2}\right)^{1/2}. (74)

Applying the inequality in (69) leads to (70).
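For small N, inequality (69) can be verified directly by enumeration; the two probability vectors below are arbitrary illustrative values.

```python
from itertools import product
from math import prod

def poibin_pmf(p):
    """Exact Poisson binomial PMF by enumeration (small N only)."""
    N = len(p)
    pmf = [0.0] * (N + 1)
    for x in product([0, 1], repeat=N):
        pmf[sum(x)] += prod(p[n] if x[n] else 1 - p[n] for n in range(N))
    return pmf

alpha = [0.1, 0.4, 0.8]       # illustrative probabilities
alpha_bar = [0.2, 0.35, 0.7]

# Left- and right-hand sides of inequality (69)
lhs = sum((a - b) ** 2 for a, b in zip(poibin_pmf(alpha_bar), poibin_pmf(alpha)))
rhs = sum(abs(a - b) for a, b in zip(alpha_bar, alpha)) ** 2
assert lhs <= rhs
```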

Appendix B Controlled sequential Monte Carlo

B.1 Performance of controlled sequential Monte Carlo

Proof of Proposition 3.1.

Using the form of the smoothing distribution pθ(x0:T|y0:T)p_{\theta}(x_{0:T}|y_{0:T}) in (28) and (29), and the definition of the proposal distribution qθ(x0:T)q_{\theta}(x_{0:T}) in (34), we have

log(pθ(x0:T|y0:T)qθ(x0:T))=log(μθ(ψ0)μθ(ψ0))+t=0Tlog(ψt(xt)ψt(I(xt)))+t=1Tlog(fθ(ψt|xt1)fθ(ψt|xt1)).\displaystyle\log\left(\frac{p_{\theta}(x_{0:T}|y_{0:T})}{q_{\theta}(x_{0:T})}\right)=\log\left(\frac{\mu_{\theta}(\psi_{0})}{\mu_{\theta}(\psi_{0}^{\star})}\right)+\sum_{t=0}^{T}\log\left(\frac{\psi_{t}^{\star}(x_{t})}{\psi_{t}(I(x_{t}))}\right)+\sum_{t=1}^{T}\log\left(\frac{f_{\theta}(\psi_{t}|x_{t-1})}{f_{\theta}(\psi_{t}^{\star}|x_{t-1})}\right). (75)

By the log-sum inequality

log(μθ(ψ0)μθ(ψ0))\displaystyle\log\left(\frac{\mu_{\theta}(\psi_{0})}{\mu_{\theta}(\psi_{0}^{\star})}\right) μθ(ψ0)1x0{0,1}Nμθ(x0)ψ0(x0)log(ψ0(I(x0))ψ0(x0))η0(log(ψ0/ψ0)|θ)\displaystyle\leq\mu_{\theta}(\psi_{0})^{-1}\sum_{x_{0}\in\{0,1\}^{N}}\mu_{\theta}(x_{0})\psi_{0}(x_{0})\log\left(\frac{\psi_{0}(I(x_{0}))}{\psi_{0}^{\star}(x_{0})}\right)\leq\eta_{0}(\log(\psi_{0}/\psi_{0}^{\star})|\theta) (76)

and

log(fθ(ψt|xt1)fθ(ψt|xt1))\displaystyle\log\left(\frac{f_{\theta}(\psi_{t}|x_{t-1})}{f_{\theta}(\psi_{t}^{\star}|x_{t-1})}\right) fθ(ψt|xt1)1xt{0,1}Nfθ(xt|xt1)ψt(I(xt))log(ψt(I(xt))ψt(xt))\displaystyle\leq f_{\theta}(\psi_{t}|x_{t-1})^{-1}\sum_{x_{t}\in\{0,1\}^{N}}f_{\theta}(x_{t}|x_{t-1})\psi_{t}(I(x_{t}))\log\left(\frac{\psi_{t}(I(x_{t}))}{\psi_{t}^{\star}(x_{t})}\right)
qt(log(ψt/ψt)|xt1,θ).\displaystyle\leq q_{t}(\log(\psi_{t}/\psi_{t}^{\star})|x_{t-1},\theta). (77)

Using the expression in (75) and the inequalities (76) and (77), the Kullback–Leibler divergence from qθ(x0:T)q_{\theta}(x_{0:T}) to pθ(x0:T|y0:T)p_{\theta}(x_{0:T}|y_{0:T}) satisfies

KL(pθ(x0:T|y0:T)qθ(x0:T))=x0:T{0,1}N×(T+1)pθ(x0:T|y0:T)log(pθ(x0:T|y0:T)qθ(x0:T))\displaystyle\text{KL}\left(p_{\theta}(x_{0:T}|y_{0:T})\mid q_{\theta}(x_{0:T})\right)=\sum_{x_{0:T}\in\{0,1\}^{N\times(T+1)}}p_{\theta}(x_{0:T}|y_{0:T})\log\left(\frac{p_{\theta}(x_{0:T}|y_{0:T})}{q_{\theta}(x_{0:T})}\right) (78)
η0(log(ψ0/ψ0)|θ)+t=0Tηt(log(ψt/ψt)|θ)+t=1Txt1{0,1}Nηt1(xt1|θ)qt(log(ψt/ψt)|xt1,θ).\displaystyle\leq\eta_{0}(\log(\psi_{0}/\psi_{0}^{\star})|\theta)+\sum_{t=0}^{T}\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\psi_{t})|\theta)+\sum_{t=1}^{T}\sum_{x_{t-1}\in\{0,1\}^{N}}\eta_{t-1}^{\star}(x_{t-1}|\theta)q_{t}(\log(\psi_{t}/\psi_{t}^{\star})|x_{t-1},\theta).

Equation (39) follows by employing the upper bound Mt=maxxt{0,1}Nηt(xt|θ)/ηt(xt|θ)M_{t}=\max_{x_{t}\in\{0,1\}^{N}}\eta_{t}^{\star}(x_{t}|\theta)/\eta_{t}(x_{t}|\theta) and noting that

xt1{0,1}Nηt1(xt1|θ)qt(log(ψt/ψt)|xt1,θ)=ηt(log(ψt/ψt)|θ).\displaystyle\sum_{x_{t-1}\in\{0,1\}^{N}}\eta_{t-1}(x_{t-1}|\theta)q_{t}(\log(\psi_{t}/\psi_{t}^{\star})|x_{t-1},\theta)=\eta_{t}(\log(\psi_{t}/\psi_{t}^{\star})|\theta). (79)

B.2 Marginal distributions of smoothing distribution and proposal distribution

To show that the constants Mt=maxxt{0,1}Nηt(xt|θ)/ηt(xt|θ)M_{t}=\max_{x_{t}\in\{0,1\}^{N}}\eta_{t}^{\star}(x_{t}|\theta)/\eta_{t}(x_{t}|\theta) in Proposition 3.1 are finite for all t[0:T1]t\in[0:T-1], it suffices to upper bound the ratio pθ(x0:T|y0:T)/qθ(x0:T)p_{\theta}(x_{0:T}|y_{0:T})/q_{\theta}(x_{0:T}) by a constant for all x0:T{0,1}N(T+1)x_{0:T}\in\{0,1\}^{N(T+1)}. Under the requirement (21), we have

pθ(x0:T|y0:T)qθ(x0:T)=pθ(y0:T)1t=0T1wt(xt),\displaystyle\frac{p_{\theta}(x_{0:T}|y_{0:T})}{q_{\theta}(x_{0:T})}=p_{\theta}(y_{0:T})^{-1}\prod_{t=0}^{T-1}w_{t}(x_{t}), (80)

hence we will argue that each of the above weight functions can be upper bounded by some constant. From (35), the weight functions can be rewritten as w0(x0)=μθ(ψ0)fθ(ψ1|x0)/f¯θ(ψ0|I(x0))w_{0}(x_{0})=\mu_{\theta}(\psi_{0})f_{\theta}(\psi_{1}|x_{0})/\bar{f}_{\theta}(\psi_{0}|I(x_{0})) and wt(xt)=fθ(ψt+1|xt)/f¯θ(ψt+1|I(xt))w_{t}(x_{t})=f_{\theta}(\psi_{t+1}|x_{t})/\bar{f}_{\theta}(\psi_{t+1}|I(x_{t})) for t[1:T1]t\in[1:T-1]. Noting that the ratio fθ(ψt+1|xt)/f¯θ(ψt+1|I(xt))=1f_{\theta}(\psi_{t+1}|x_{t})/\bar{f}_{\theta}(\psi_{t+1}|I(x_{t}))=1 when xtn=0x_{t}^{n}=0 for all n[1:N]n\in[1:N] and t[0:T1]t\in[0:T-1], we do not have to consider this case.

By induction, the BIF approximations (ψt)t[0:T](\psi_{t})_{t\in[0:T]} are upper bounded by one, hence μθ(ψ0)1\mu_{\theta}(\psi_{0})\leq 1 and fθ(ψt|xt1)1f_{\theta}(\psi_{t}|x_{t-1})\leq 1 for all xt1{0,1}Nx_{t-1}\in\{0,1\}^{N} and t[1:T]t\in[1:T]. It remains to show that f¯θ(ψt+1|I(xt))\bar{f}_{\theta}(\psi_{t+1}|I(x_{t})) is lower bounded by some strictly positive constant. We notice that the conditional expectation in (32) can be lower bounded by

f¯θ(ψt+1|I(xt))(λ¯N1I(xt))NI(xt)(1γ¯)I(xt)ψt+1(N).\displaystyle\bar{f}_{\theta}(\psi_{t+1}|I(x_{t}))\geq\left(\bar{\lambda}N^{-1}I(x_{t})\right)^{N-I(x_{t})}\left(1-\bar{\gamma}\right)^{I(x_{t})}\psi_{t+1}(N). (81)

Define the constant c=mini[1:N](λ¯N1i)Ni(1γ¯)i>0c=\min_{i\in[1:N]}\left(\bar{\lambda}N^{-1}i\right)^{N-i}\left(1-\bar{\gamma}\right)^{i}>0. At the terminal time, we have ψT(N)=Bin(yT;N,ρ)>0\psi_{T}(N)=\text{Bin}(y_{T};N,\rho)>0. By iterating the backward recursion, we have

ψt(N)=Bin(yt;N,ρ)f¯θ(ψt+1|N)Bin(yt;N,ρ)cψt+1(N),\displaystyle\psi_{t}(N)=\text{Bin}(y_{t};N,\rho)\bar{f}_{\theta}(\psi_{t+1}|N)\geq\text{Bin}(y_{t};N,\rho)\,c\,\psi_{t+1}(N), (82)

for each t[0:T1]t\in[0:T-1]. By induction, we have f¯θ(ψt+1|I(xt))cTtk=t+1TBin(yk;N,ρ)\bar{f}_{\theta}(\psi_{t+1}|I(x_{t}))\geq c^{T-t}\prod_{k=t+1}^{T}\text{Bin}(y_{k};N,\rho) for all t[0:T1]t\in[0:T-1] and xt{0,1}Nx_{t}\in\{0,1\}^{N}. Combining the above observations allows us to conclude.

B.3 Backward information filter approximation

Proof of Proposition 3.2.

We will establish (40); the proof of (41) follows using similar arguments. Define et=ηt(log(ψt/ψt)|θ)e_{t}=\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\psi_{t})|\theta) for each time t[0:T]t\in[0:T]. At the terminal time TT, we have eT=0e_{T}=0. For time t[0:T1]t\in[0:T-1], we consider the decomposition

et=ηt(log(ψt/ψ~t)|θ)+ηt(log(ψ~t/ψt)|θ)\displaystyle e_{t}=\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\tilde{\psi}_{t})|\theta)+\eta_{t}^{\star}(\log(\tilde{\psi}_{t}/\psi_{t})|\theta) (83)

where

ψ~t(xt)=Bin(yt;I(xt),ρ)𝟙(I(xt)yt)fθ(ψt+1|xt).\displaystyle\tilde{\psi}_{t}(x_{t})=\text{Bin}(y_{t};I(x_{t}),\rho)\mathbbm{1}(I(x_{t})\geq y_{t})f_{\theta}(\psi_{t+1}|x_{t}). (84)

Using the log-sum inequality and the decomposition of the smoothing distribution given in (28) and (29), the first term of (83) is bounded by

ηt(log(ψt/ψ~t)|θ)=xt{0,1}Nlog(fθ(ψt+1|xt)fθ(ψt+1|xt))ηt(xt|θ)\displaystyle\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\tilde{\psi}_{t})|\theta)=\sum_{x_{t}\in\{0,1\}^{N}}\log\left(\frac{f_{\theta}(\psi_{t+1}^{\star}|x_{t})}{f_{\theta}(\psi_{t+1}|x_{t})}\right)\eta_{t}^{\star}(x_{t}|\theta)
xt{0,1}Nfθ(ψt+1|xt)1xt+1{0,1}Nfθ(xt+1|xt)ψt+1(xt+1)log(ψt+1(xt+1)ψt+1(I(xt+1)))ηt(xt|θ)\displaystyle\leq\sum_{x_{t}\in\{0,1\}^{N}}f_{\theta}(\psi_{t+1}^{\star}|x_{t})^{-1}\sum_{x_{t+1}\in\{0,1\}^{N}}f_{\theta}(x_{t+1}|x_{t})\psi_{t+1}^{\star}(x_{t+1})\log\left(\frac{\psi_{t+1}^{\star}(x_{t+1})}{\psi_{t+1}(I(x_{t+1}))}\right)\eta_{t}^{\star}(x_{t}|\theta)
xt{0,1}Nxt+1{0,1}Nlog(ψt+1(xt+1)ψt+1(I(xt+1)))ηt(xt|θ)pθ(xt+1|xt,yt+1:T)=et+1.\displaystyle\leq\sum_{x_{t}\in\{0,1\}^{N}}\sum_{x_{t+1}\in\{0,1\}^{N}}\log\left(\frac{\psi_{t+1}^{\star}(x_{t+1})}{\psi_{t+1}(I(x_{t+1}))}\right)\eta_{t}^{\star}(x_{t}|\theta)p_{\theta}(x_{t+1}|x_{t},y_{t+1:T})=e_{t+1}. (85)

By the log-sum inequality, the second term of (83) is bounded by

ηt(log(ψ~t/ψt)|θ)=xt{0,1}N(0,,0)log(fθ(ψt+1|xt)f¯θ(ψt+1|I(xt)))ηt(xt|θ)\displaystyle\eta_{t}^{\star}(\log(\tilde{\psi}_{t}/\psi_{t})|\theta)=\sum_{x_{t}\in\{0,1\}^{N}\setminus(0,\ldots,0)}\log\left(\frac{f_{\theta}(\psi_{t+1}|x_{t})}{\bar{f}_{\theta}(\psi_{t+1}|I(x_{t}))}\right)\eta_{t}^{\star}(x_{t}|\theta)
xt{0,1}N(0,,0)fθ(ψt+1|xt)1it+1=0NPoiBin(it+1;α(xt))ψt+1(it+1)log(PoiBin(it+1;α(xt))PoiBin(it+1;α¯(xt)))ηt(xt|θ)\displaystyle\leq\sum_{x_{t}\in\{0,1\}^{N}\setminus(0,\ldots,0)}f_{\theta}(\psi_{t+1}|x_{t})^{-1}\sum_{i_{t+1}=0}^{N}\text{PoiBin}(i_{t+1};\alpha(x_{t}))\psi_{t+1}(i_{t+1})\log\left(\frac{\text{PoiBin}(i_{t+1};\alpha(x_{t}))}{\text{PoiBin}(i_{t+1};\bar{\alpha}(x_{t}))}\right)\eta_{t}^{\star}(x_{t}|\theta)
cθ(ψt+1)xt{0,1}NKL(PoiBin(α(xt))PoiBin(α¯(xt)))ηt(xt|θ).\displaystyle\leq c_{\theta}^{\star}(\psi_{t+1})\sum_{x_{t}\in\{0,1\}^{N}}\text{KL}\left(\text{PoiBin}(\alpha(x_{t}))\mid\text{PoiBin}(\bar{\alpha}(x_{t}))\right)\eta_{t}^{\star}(x_{t}|\theta). (86)

The above repeatedly uses the fact that in the case of zero infections, i.e. xtn=0x_{t}^{n}=0 for all n[1:N]n\in[1:N], we have αn(xt)=α¯n(xt)=0\alpha^{n}(x_{t})=\bar{\alpha}^{n}(x_{t})=0 for all n[1:N]n\in[1:N]. The constant cθ(ψt+1)c_{\theta}^{\star}(\psi_{t+1}) can be shown to be finite using the arguments in Appendix B.2. The last inequality also uses the fact that maxit+1[0:N]|ψt+1(it+1)|1\max_{i_{t+1}\in[0:N]}|\psi_{t+1}(i_{t+1})|\leq 1, which follows by induction. Applying (70) of Lemma A.1 and the Cauchy–Schwarz inequality gives

ηt(log(ψ~t/ψt)|θ)cθ(ψt+1)ϕθL2(ηt)ΔθL2(ηt).\displaystyle\eta_{t}^{\star}(\log(\tilde{\psi}_{t}/\psi_{t})|\theta)\leq c_{\theta}^{\star}(\psi_{t+1})\|\phi_{\theta}^{\star}\|_{L^{2}(\eta_{t}^{\star})}\|\Delta_{\theta}\|_{L^{2}(\eta_{t}^{\star})}. (87)

By combining (85) and (87), we obtain the following recursion

etet+1+cθ(ψt+1)ϕθL2(ηt)ΔθL2(ηt).\displaystyle e_{t}\leq e_{t+1}+c_{\theta}^{\star}(\psi_{t+1})\|\phi_{\theta}^{\star}\|_{L^{2}(\eta_{t}^{\star})}\|\Delta_{\theta}\|_{L^{2}(\eta_{t}^{\star})}. (88)

The claim in (40) follows by induction. ∎

Proof of Proposition 4.1.

The arguments are similar to those in the proof of Proposition 3.2, with some modifications outlined below. As before, we consider the quantity et=ηt(log(ψt/ψt)|θ)e_{t}=\eta_{t}^{\star}(\log(\psi^{\star}_{t}/\psi_{t})|\theta) for t[0:T]t\in[0:T] to establish (61), as (62) follows using similar arguments. By the same arguments as in (85)

ηt(log(ψt/ψ~t)|θ)et+1.\displaystyle\eta_{t}^{\star}(\log(\psi_{t}^{\star}/\tilde{\psi}_{t})|\theta)\leq e_{t+1}. (89)

Instead of (B.3), we have

ηt(log(ψ~t/ψt)|θ)cθ(ψt+1)xt{0,1,2}NKL(fθ(st+1,it+1|xt)f¯θ(st+1,it+1|S(xt),I(xt)))ηt(xt|θ).\displaystyle\eta_{t}^{\star}(\log(\tilde{\psi}_{t}/\psi_{t})|\theta)\leq c_{\theta}^{\star}(\psi_{t+1})\sum_{x_{t}\in\{0,1,2\}^{N}}\text{KL}\left({f}_{\theta}(s_{t+1},i_{t+1}|x_{t})\mid\bar{f}_{\theta}(s_{t+1},i_{t+1}|S(x_{t}),I(x_{t}))\right)\eta_{t}^{\star}(x_{t}|\theta). (90)

The Kullback–Leibler divergence from f¯θ(st+1,it+1|S(xt),I(xt))\bar{f}_{\theta}(s_{t+1},i_{t+1}|S(x_{t}),I(x_{t})) to fθ(st+1,it+1|xt){f}_{\theta}(s_{t+1},i_{t+1}|x_{t}) can be decomposed as

KL(fθ(st+1,it+1|xt)f¯θ(st+1,it+1|S(xt),I(xt)))\displaystyle\text{KL}\left({f}_{\theta}(s_{t+1},i_{t+1}|x_{t})\mid\bar{f}_{\theta}(s_{t+1},i_{t+1}|S(x_{t}),I(x_{t}))\right) (91)
=KL(fθ(st+1|xt)f¯θ(st+1|S(xt),I(xt)))\displaystyle=\text{KL}\left({f}_{\theta}(s_{t+1}|x_{t})\mid\bar{f}_{\theta}(s_{t+1}|S(x_{t}),I(x_{t}))\right)
+st+1=0Nfθ(st+1|xt)KL(fθ(it+1|xt,st+1)f¯θ(it+1|S(xt),I(xt),st+1))\displaystyle\quad\quad+\sum_{s_{t+1}=0}^{N}{f}_{\theta}(s_{t+1}|x_{t})\text{KL}\left({f}_{\theta}(i_{t+1}|x_{t},s_{t+1})\mid\bar{f}_{\theta}(i_{t+1}|S(x_{t}),I(x_{t}),s_{t+1})\right)
KL(PoiBin(αS(xt))PoiBin(α¯S(xt)))+KL(PoiBin((In(xt)γ)n[1:N])PoiBin((In(xt)γ¯n)n[1:N])).\displaystyle\leq\text{KL}\left(\text{PoiBin}({\alpha}_{S}(x_{t}))\mid\text{PoiBin}(\bar{\alpha}_{S}(x_{t}))\right)+\text{KL}\left(\text{PoiBin}((I^{n}(x_{t}){\gamma})_{n\in[1:N]})\mid\text{PoiBin}((I^{n}(x_{t})\bar{\gamma}^{n})_{n\in[1:N]})\right).

Applying (70) of Lemma A.1 gives

KL(fθ(st+1,it+1|xt)f¯θ(st+1,it+1|S(xt),I(xt)))ϕθ(xt)Δθ,S(xt)+ξθ(xt)Δθ,R(xt).\displaystyle\text{KL}\left({f}_{\theta}(s_{t+1},i_{t+1}|x_{t})\mid\bar{f}_{\theta}(s_{t+1},i_{t+1}|S(x_{t}),I(x_{t}))\right)\leq\phi^{\star}_{\theta}(x_{t})\Delta_{\theta,S}(x_{t})+\xi^{\star}_{\theta}(x_{t})\Delta_{\theta,R}(x_{t}). (92)

Hence, by the Cauchy–Schwarz inequality, we have

ηt(log(ψ~t/ψt)|θ)cθ(ψt+1){ϕθL2(ηt)Δθ,SL2(ηt)+ξθL2(ηt)Δθ,RL2(ηt)}.\displaystyle\eta_{t}^{\star}(\log(\tilde{\psi}_{t}/\psi_{t})|\theta)\leq c_{\theta}^{\star}(\psi_{t+1})\{\|\phi^{\star}_{\theta}\|_{L^{2}(\eta_{t}^{\star})}\|\Delta_{\theta,S}\|_{L^{2}(\eta_{t}^{\star})}+\|\xi^{\star}_{\theta}\|_{L^{2}(\eta_{t}^{\star})}\|\Delta_{\theta,R}\|_{L^{2}(\eta_{t}^{\star})}\}. (93)

B.4 Finer-grained approximations

We discuss finer approximations of the BIF and provide algorithmic details of the resulting cSMC. Instead of simply approximating individual infection and recovery rates by their population averages, we consider a clustering of agents based on their infection and recovery rates, and approximate these rates by their within-cluster average. More precisely, let K[1:N]K\in[1:N] denote the desired number of clusters, let (Ck)k[1:K](C_k)_{k\in[1:K]} denote a partition of [1:N][1:N] that represents our clustering, and let Nk=|Ck|N_{k}=|C_{k}| denote the number of agents in cluster k[1:K]k\in[1:K]; we then approximate λnλ¯k=Nk1nCkλn\lambda^{n}\approx\bar{\lambda}^{k}=N_{k}^{-1}\sum_{n\in C_{k}}\lambda^{n} and γnγ¯k=Nk1nCkγn\gamma^{n}\approx\bar{\gamma}^{k}=N_{k}^{-1}\sum_{n\in C_{k}}\gamma^{n}. For each state of the population x{0,1}Nx\in\{0,1\}^{N}, we define the summary Ik(x)=nCkxnI^{k}(x)=\sum_{n\in C_{k}}x^{n}, which counts the number of infected agents in cluster k[1:K]k\in[1:K]. We write xCk=(xn)nCkx^{C_{k}}=(x^{n})_{n\in C_{k}} and I1:K(x)=(Ik(x))k[1:K]I^{1:K}(x)=(I^{k}(x))_{k\in[1:K]}. Note that k=1KIk(x)=I(x)\sum_{k=1}^{K}I^{k}(x)=I(x). The corresponding approximation of α(x)\alpha(x) is given by

α¯n(x)={λ¯c(n)N1I(x),if xn=0,1γ¯c(n),if xn=1,\bar{\alpha}^{n}(x)=\begin{cases}\bar{\lambda}^{c(n)}N^{-1}I(x),\quad&\mbox{if }x^{n}=0,\\ 1-\bar{\gamma}^{c(n)},\quad&\mbox{if }x^{n}=1,\end{cases} (94)

where c:[1:N][1:K]c:[1:N]\rightarrow[1:K] maps agent labels to cluster membership. We write the associated Markov transition as f¯θ(xt|xt1)=n=1NBer(xtn;α¯n(xt1))\bar{f}_{\theta}(x_{t}|x_{t-1})=\prod_{n=1}^{N}\text{Ber}(x_{t}^{n};\bar{\alpha}^{n}(x_{t-1})). The approximation in (94) recovers (31) in the case of a single cluster (K=1K=1), in which case I1(x)=I(x)[0:N]I^{1}(x)=I(x)\in[0:N]. Larger values of KK can be seen as finer-grained approximations, at the expense of the increased dimensionality of I1:K(x)k=1K[0:Nk]I^{1:K}(x)\in\prod_{k=1}^{K}[0:N_{k}]. Having K=NK=N clusters offers no dimensionality reduction since I1:N(x)=x{0,1}NI^{1:N}(x)=x\in\{0,1\}^{N}.
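A minimal sketch of the cluster-averaged approximation (94), assuming 0-based agent and cluster labels; the function and variable names are ours, purely for illustration. With a single cluster, it reduces to the population-averaged approximation (31).

```python
def cluster_rates(lam, gam, clusters):
    """Within-cluster averages of infection rates lam and recovery rates gam.
    clusters[n] is the (0-based) cluster label c(n) of agent n."""
    K = max(clusters) + 1
    sizes = [clusters.count(k) for k in range(K)]
    lam_bar = [sum(l for l, c in zip(lam, clusters) if c == k) / sizes[k]
               for k in range(K)]
    gam_bar = [sum(g for g, c in zip(gam, clusters) if c == k) / sizes[k]
               for k in range(K)]
    return lam_bar, gam_bar

def alpha_bar(x, lam_bar, gam_bar, clusters):
    """Approximation (94): infection probabilities with cluster-averaged rates."""
    N, I = len(x), sum(x)  # I = I(x), the number of infected agents
    return [1 - gam_bar[clusters[n]] if x[n] == 1
            else lam_bar[clusters[n]] * I / N
            for n in range(N)]
```

Setting clusters = [0] * N (one cluster) yields the K = 1 case discussed above.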

At the terminal time TT, since we have ψT(xT)=Bin(yT;I(xT),ρ)𝟙(I(xT)yT)\psi_{T}^{\star}(x_{T})=\text{Bin}(y_{T};I(x_{T}),\rho)\mathbbm{1}(I(x_{T})\geq y_{T}), it suffices to compute ψT(iT1:K)=Bin(yT;k=1KiTk,ρ)𝟙(k=1KiTkyT)\psi_{T}(i_{T}^{1:K})=\text{Bin}(y_{T};\sum_{k=1}^{K}i_{T}^{k},\rho)\mathbbm{1}(\sum_{k=1}^{K}i_{T}^{k}\geq y_{T}) for all iT1:Kk=1K[0:Nk]i_{T}^{1:K}\in\prod_{k=1}^{K}[0:N_{k}]. To approximate the backward recursion (30) for t[0:T1]t\in[0:T-1], we proceed inductively by assuming an approximation of the form ψt+1(I1:K(xt+1))\psi_{t+1}(I^{1:K}(x_{t+1})) at time t+1t+1. By plugging in the approximation ψt+1(I1:K(xt+1))ψt+1(xt+1)\psi_{t+1}(I^{1:K}(x_{t+1}))\approx\psi_{t+1}^{\star}(x_{t+1}), we consider the iterate

Bin(yt;I(xt),ρ)𝟙(I(xt)yt)fθ(ψt+1|xt)\displaystyle\text{Bin}(y_{t};I(x_{t}),\rho)\mathbbm{1}(I(x_{t})\geq y_{t})f_{\theta}(\psi_{t+1}|x_{t}) (95)

to form an approximation of ψt(xt)\psi_{t}^{\star}(x_{t}). We approximate the conditional expectation fθ(ψt+1|xt)f_{\theta}(\psi_{t+1}|x_{t}) by

f¯θ(ψt+1|I1:K(xt))=it+11:Kk=1K[0:Nk]k=1KSumBin(it+1k;NkIk(xt),λ¯kN1I(xt),Ik(xt),1γ¯k)ψt+1(it+11:K).\displaystyle\bar{f}_{\theta}(\psi_{t+1}|I^{1:K}(x_{t}))=\sum_{i_{t+1}^{1:K}\in\prod_{k=1}^{K}[0:N_{k}]}\prod_{k=1}^{K}\text{SumBin}(i_{t+1}^{k};N_{k}-I^{k}(x_{t}),\bar{\lambda}^{k}N^{-1}I(x_{t}),I^{k}(x_{t}),1-\bar{\gamma}^{k})\psi_{t+1}(i_{t+1}^{1:K}). (96)

The resulting approximation of ψt(xt)\psi_{t}^{\star}(x_{t}) is computed using

ψt(it1:K)=Bin(yt;k=1Kitk,ρ)𝟙(k=1Kitkyt)f¯θ(ψt+1|it1:K)\displaystyle\psi_{t}(i_{t}^{1:K})=\text{Bin}\left(y_{t};\sum_{k=1}^{K}i_{t}^{k},\rho\right)\mathbbm{1}\left(\sum_{k=1}^{K}i_{t}^{k}\geq y_{t}\right)\bar{f}_{\theta}(\psi_{t+1}|i_{t}^{1:K}) (97)

for all it1:Kk=1K[0:Nk]i_{t}^{1:K}\in\prod_{k=1}^{K}[0:N_{k}] in 𝒪(k=1K(1+Nk)2)\mathcal{O}\left(\prod_{k=1}^{K}(1+N_{k})^{2}\right) cost. Hence our overall approximation of the BIF (ψt)t[0:T](\psi_{t})_{t\in[0:T]} costs 𝒪(k=1K(1+Nk)2T)\mathcal{O}(\prod_{k=1}^{K}(1+N_{k})^{2}T) to compute and 𝒪(k=1K(1+Nk)T)\mathcal{O}(\prod_{k=1}^{K}(1+N_{k})T) in storage. It is straightforward to extend Proposition 3.2 to characterize the error of (97); we omit this for the sake of brevity.
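To make the recursion concrete, the following sketch implements (97) in the special case K=1K=1, where ψt\psi_{t} is a vector indexed by the infected count and the SumBin pmf is evaluated by convolving the two binomial pmfs. Function names and the plain-Python binomial pmf are illustrative choices under these assumptions, not the paper's code.

```python
import numpy as np
from math import comb

def binom_pmf_vec(n, p):
    # full pmf of Binomial(n, p) over {0, ..., n}
    ks = np.arange(n + 1)
    return np.array([comb(n, k) for k in ks]) * p**ks * (1.0 - p)**(n - ks)

def backward_info_filter(y, N, rho, lam_bar, gam_bar):
    """Approximate BIF (psi_t)_{t=0..T} of (97) with K = 1 cluster.

    Returns a list psi where psi[t][i] approximates psi_t at infected count i.
    """
    T = len(y) - 1
    i_grid = np.arange(N + 1)

    def obs(t):
        # Bin(y_t; i, rho) * 1(i >= y_t), as a vector over i
        return np.array([comb(i, y[t]) * rho**y[t] * (1.0 - rho)**(i - y[t])
                         if i >= y[t] else 0.0 for i in i_grid])

    psi = [None] * (T + 1)
    psi[T] = obs(T)
    for t in range(T - 1, -1, -1):
        fpsi = np.empty(N + 1)
        for i in i_grid:
            # SumBin(.; N - i, lam_bar*i/N, i, 1 - gam_bar) by convolution
            pmf = np.convolve(binom_pmf_vec(N - i, lam_bar * i / N),
                              binom_pmf_vec(i, 1.0 - gam_bar))
            fpsi[i] = pmf @ psi[t + 1]
        psi[t] = obs(t) * fpsi
    return psi
```

For general KK the same recursion runs over the product grid k=1K[0:Nk]\prod_{k=1}^{K}[0:N_{k}], at the per-step cost stated above.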

The corresponding approximation of the optimal proposal (29) is

q0(x0|θ)=μθ(x0)ψ0(I1:K(x0))μθ(ψ0),qt(xt|xt1,θ)=fθ(xt|xt1)ψt(I1:K(xt))fθ(ψt|xt1),t[1:T].\displaystyle q_{0}(x_{0}|\theta)=\frac{\mu_{\theta}(x_{0})\psi_{0}(I^{1:K}(x_{0}))}{\mu_{\theta}(\psi_{0})},\quad q_{t}(x_{t}|x_{t-1},\theta)=\frac{f_{\theta}(x_{t}|x_{t-1})\psi_{t}(I^{1:K}(x_{t}))}{f_{\theta}(\psi_{t}|x_{t-1})},\quad t\in[1:T]. (98)

Analogous to (35), the appropriate SMC weight functions are

w0(x0)=μθ(ψ0)gθ(y0|x0)fθ(ψ1|x0)ψ0(I1:K(x0)),\displaystyle w_{0}(x_{0})=\frac{\mu_{\theta}(\psi_{0})g_{\theta}(y_{0}|x_{0})f_{\theta}(\psi_{1}|x_{0})}{\psi_{0}(I^{1:K}(x_{0}))}, wt(xt)=gθ(yt|xt)fθ(ψt+1|xt)ψt(I1:K(xt)),t[1:T1],\displaystyle w_{t}(x_{t})=\frac{g_{\theta}(y_{t}|x_{t})f_{\theta}(\psi_{t+1}|x_{t})}{\psi_{t}(I^{1:K}(x_{t}))},\quad t\in[1:T-1], (99)

and wT(xT)=1w_{T}(x_{T})=1. The above expectations are computed as

μθ(ψ0)\displaystyle\mu_{\theta}(\psi_{0}) =i01:Kk=1K[0:Nk]k=1KPoiBin(i0k;α0Ck)ψ0(i01:K),\displaystyle=\sum_{i_{0}^{1:K}\in\prod_{k=1}^{K}[0:N_{k}]}\prod_{k=1}^{K}\text{PoiBin}(i_{0}^{k};\alpha_{0}^{C_{k}})\psi_{0}(i_{0}^{1:K}), (100)
fθ(ψt|xt1)\displaystyle f_{\theta}(\psi_{t}|x_{t-1}) =it1:Kk=1K[0:Nk]k=1KPoiBin(itk;αCk(xt1))ψt(it1:K),t[1:T],=\sum_{i_{t}^{1:K}\in\prod_{k=1}^{K}[0:N_{k}]}\prod_{k=1}^{K}\text{PoiBin}(i_{t}^{k};\alpha^{C_{k}}(x_{t-1}))\psi_{t}(i_{t}^{1:K}),\quad t\in[1:T], (101)

where α0Ck=(α0n)nCk\alpha_{0}^{C_{k}}=(\alpha_{0}^{n})_{n\in C_{k}} and αCk(xt1)=(αn(xt1))nCk\alpha^{C_{k}}(x_{t-1})=(\alpha^{n}(x_{t-1}))_{n\in C_{k}}. Sampling from the proposals (98) involves initializing from

q0(x0,i01:K|θ)=q0(i01:K|θ)k=1Kq0(x0Ck|i0k,θ)=k=1KPoiBin(i0k;α0Ck)ψ0(i01:K)μθ(ψ0)k=1KCondBer(x0Ck;α0Ck,i0k)\displaystyle q_{0}(x_{0},i_{0}^{1:K}|\theta)=q_{0}(i_{0}^{1:K}|\theta)\prod_{k=1}^{K}q_{0}(x_{0}^{C_{k}}|i_{0}^{k},\theta)=\frac{\prod_{k=1}^{K}\text{PoiBin}(i_{0}^{k};\alpha_{0}^{C_{k}})\psi_{0}(i_{0}^{1:K})}{\mu_{\theta}(\psi_{0})}\prod_{k=1}^{K}\text{CondBer}(x_{0}^{C_{k}};\alpha_{0}^{C_{k}},i_{0}^{k}) (102)

which admits q0(x0|θ)q_{0}(x_{0}|\theta) as its marginal distribution, and for time t[1:T]t\in[1:T] sampling from

qt(xt,it1:K|xt1,θ)\displaystyle q_{t}(x_{t},i_{t}^{1:K}|x_{t-1},\theta) =qt(it1:K|xt1,θ)k=1Kqt(xtCk|xt1,itk,θ)\displaystyle=q_{t}(i_{t}^{1:K}|x_{t-1},\theta)\prod_{k=1}^{K}q_{t}(x_{t}^{C_{k}}|x_{t-1},i_{t}^{k},\theta) (103)
=k=1KPoiBin(itk;αCk(xt1))ψt(it1:K)fθ(ψt|xt1)k=1KCondBer(xtCk;αCk(xt1),itk)\displaystyle=\frac{\prod_{k=1}^{K}\text{PoiBin}(i_{t}^{k};\alpha^{C_{k}}(x_{t-1}))\psi_{t}(i_{t}^{1:K})}{f_{\theta}(\psi_{t}|x_{t-1})}\prod_{k=1}^{K}\text{CondBer}(x_{t}^{C_{k}};\alpha^{C_{k}}(x_{t-1}),i_{t}^{k})

which admits qt(xt|xt1,θ)q_{t}(x_{t}|x_{t-1},\theta) as its marginal transition. The overall cost of implementing cSMC is

𝒪((k=1KNk2+k=1KNk)TP),\displaystyle\mathcal{O}\left(\left(\sum_{k=1}^{K}N_{k}^{2}+\prod_{k=1}^{K}N_{k}\right)TP\right), (104)

or

𝒪((k=1KNklog(Nk)+k=1KNk)TP),\displaystyle\mathcal{O}\left(\left(\sum_{k=1}^{K}N_{k}\log(N_{k})+\prod_{k=1}^{K}N_{k}\right)TP\right), (105)

if one employs translated Poisson approximations (12) and the MCMC in Heng et al. [2020b] to sample from conditional Bernoulli distributions. As alluded to earlier, the choice of the number of clusters KK allows one to trade off the quality of our proposal against computational complexity.
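For reference, the PoiBin pmfs in (100)–(101) can be computed by repeated convolution, and the CondBer distributions can be sampled exactly by a standard sequential scheme using Poisson-Binomial suffix pmfs. The sketch below is an illustrative alternative to the translated Poisson approximations and the MCMC of Heng et al. [2020b] mentioned above; all names are our own.

```python
import numpy as np

def poibin_pmf(alpha):
    """Poisson-Binomial pmf of a sum of independent Bernoulli(alpha[j])
    variables, over {0, ..., len(alpha)}, via repeated convolution."""
    pmf = np.array([1.0])
    for p in alpha:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

def condber_sample(alpha, i, rng):
    """Exact draw from CondBer(alpha, i): independent Bernoulli(alpha[j])
    conditioned on the total equalling i, by sequential sampling."""
    M = len(alpha)
    # suffix[j] = Poisson-Binomial pmf of the sum over positions j..M-1
    suffix = [np.array([1.0])] * (M + 1)
    for j in range(M - 1, -1, -1):
        suffix[j] = np.convolve(suffix[j + 1], [1.0 - alpha[j], alpha[j]])
    x = np.zeros(M, dtype=int)
    r = i  # successes still to be placed among the remaining positions
    for j in range(M):
        if r == 0:
            break
        # P(x_j = 1 | x_j + ... + x_{M-1} = r)
        p1 = alpha[j] * suffix[j + 1][r - 1] / suffix[j][r]
        if rng.random() < p1:
            x[j] = 1
            r -= 1
    return x
```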

Appendix C Posterior sampling of agent states using MCMC

We now describe various MCMC algorithms that can be used to sample from the smoothing distribution pθ(x0:T|y0:T)p_{\theta}(x_{0:T}|y_{0:T}) of the agent-based SIS model in Section 3. These Gibbs sampling strategies can be seen as alternatives to the SMC approach in Sections 3.1 and 3.2. The same ideas can also be applied to the SIR model in Section 4 with some modifications.

A simple option is a single-site update that samples the state of one agent at one time from the full conditional distribution pθ(xtn|x0:t1,xtn,xt+1:T,y0:T)p_{\theta}(x_{t}^{n}|x_{0:t-1},x_{t}^{-n},x_{t+1:T},y_{0:T}), where xtn=(xtm)m[1:N]nx_{t}^{-n}=(x_{t}^{m})_{m\in[1:N]\setminus n}; the indices t[0:T]t\in[0:T] and n[1:N]n\in[1:N] can be chosen deterministically in a systematic Gibbs scan, or uniformly at random in a random Gibbs scan. By the conditional independence structure of the model, for t[1:T1]t\in[1:T-1], this full conditional is a Bernoulli distribution with success probability proportional to

αn(xt1)Bin(yt;I(x~t),ρ)m𝒩(n)Ber(xt+1m;αm(x~t)),\displaystyle\alpha^{n}(x_{t-1})\text{Bin}(y_{t};I(\tilde{x}_{t}),\rho)\prod_{m\in\mathcal{N}(n)}\text{Ber}(x_{t+1}^{m};\alpha^{m}(\tilde{x}_{t})), (106)

where x~t=(xtm)m[1:N]\tilde{x}_{t}=(x_{t}^{m})_{m\in[1:N]} with xtn=1x_{t}^{n}=1. This expression also holds for the case t=0t=0 by replacing αn(xt1)\alpha^{n}(x_{t-1}) with α0n\alpha_{0}^{n}, and t=Tt=T by removing the last product. As each update costs 𝒪(𝒟(n))\mathcal{O}(\mathcal{D}(n)), the overall cost of a Gibbs scan is 𝒪(Tn=1N𝒟(n))\mathcal{O}(T\sum_{n=1}^{N}\mathcal{D}(n)).
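As an illustration, the success probability proportional to (106) can be computed as follows for the fully mixed model, in which every agent is a neighbor of every other and αn(x)=λnI(x)/N\alpha^{n}(x)=\lambda^{n}I(x)/N for susceptible agents and 1γn1-\gamma^{n} for infected ones. This is a sketch under those assumptions, with all names our own.

```python
import numpy as np
from math import comb

def binom_pmf(y, n, p):
    # Bin(y; n, p), returning 0 outside the support
    return comb(n, y) * p**y * (1.0 - p)**(n - y) if 0 <= y <= n else 0.0

def single_site_prob(n, x_prev, x_t, x_next, y_t, lam, gam, rho):
    """P(x_t^n = 1 | rest) for the fully mixed SIS model, 1 <= t <= T-1."""
    N = len(x_t)

    def alpha(x):  # per-agent transition probabilities alpha^m(x)
        I = x.sum()
        return np.where(x == 0, lam * I / N, 1.0 - gam)

    a_prev = alpha(x_prev)
    weights = []
    for v in (0, 1):
        x_tilde = x_t.copy()
        x_tilde[n] = v
        w = a_prev[n] if v == 1 else 1.0 - a_prev[n]  # Ber(v; alpha^n(x_{t-1}))
        w *= binom_pmf(y_t, x_tilde.sum(), rho)       # observation density
        am = alpha(x_tilde)                           # next-step transitions
        w *= np.prod(np.where(x_next == 1, am, 1.0 - am))
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])
```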

As discussed in Ghahramani and Jordan [1996] for a related class of models, one can consider block updates that jointly sample the trajectory of a few agents. For a subset b[1:N]b\subseteq[1:N] of size BB, this Gibbs update involves sampling from pθ(x0:Tb|x0:Tb,y0:T)p_{\theta}(x_{0:T}^{b}|x_{0:T}^{-b},y_{0:T}), where x0:Tb=(xtn)t[0:T],nbx_{0:T}^{b}=(x_{t}^{n})_{t\in[0:T],n\in b} and x0:Tb=(xtn)t[0:T],n[1:N]bx_{0:T}^{-b}=(x_{t}^{n})_{t\in[0:T],n\in[1:N]\setminus b}. This can be done using the forward-backward algorithm in 𝒪((22B+2BN)T)\mathcal{O}((2^{2B}+2^{B}N)T) cost. Hence the overall cost of a systematic Gibbs scan over the entire population is 𝒪((22B+2BN)TN/B)\mathcal{O}((2^{2B}+2^{B}N)TN/B). As this cost grows exponentially with BB, one has to consider reasonably small block sizes.

We can also employ a swap update that leaves the full conditional distribution pθ(xt|x0:t1,xt+1:T,y0:T)p_{\theta}(x_{t}|x_{0:t-1},x_{t+1:T},y_{0:T}) invariant for each t[0:T]t\in[0:T] (with x0:1=xT+1:T=x_{0:-1}=x_{T+1:T}=\emptyset). Let xtx_{t} denote the current state of the population at time tt. We will choose n0n_{0} and n1n_{1} uniformly from the sets {i:xti=0}\{i:x_{t}^{i}=0\} and {i:xti=1}\{i:x_{t}^{i}=1\} respectively. The proposed state x~t\tilde{x}_{t} is such that x~tn0=1\tilde{x}_{t}^{n_{0}}=1, x~tn1=0\tilde{x}_{t}^{n_{1}}=0 and x~tn=xtn\tilde{x}_{t}^{n}=x_{t}^{n} for n[1:N]{n0,n1}n\in[1:N]\setminus\{n_{0},n_{1}\}. For t[1:T1]t\in[1:T-1], the MH acceptance probability is

min{1,αn0(xt1)(1αn1(xt1))n=1NBer(xt+1n;αn(x~t))αn1(xt1)(1αn0(xt1))n=1NBer(xt+1n;αn(xt))}.\displaystyle\min\left\{1,\frac{\alpha^{n_{0}}(x_{t-1})(1-\alpha^{n_{1}}(x_{t-1}))\prod_{n=1}^{N}\text{Ber}(x_{t+1}^{n};\alpha^{n}(\tilde{x}_{t}))}{\alpha^{n_{1}}(x_{t-1})(1-\alpha^{n_{0}}(x_{t-1}))\prod_{n=1}^{N}\text{Ber}(x_{t+1}^{n};\alpha^{n}(x_{t}))}\right\}. (107)

The case t=0t=0 can be accommodated by replacing α(xt1)\alpha(x_{t-1}) with α0\alpha_{0}, and t=Tt=T by removing the ratios of Bernoulli PMFs. The cost of each swap update ranges between 𝒪(1)\mathcal{O}(1) and 𝒪(N)\mathcal{O}(N), depending on the structure of the network. As swap updates on their own do not change the number of infected agents, we propose to employ them within a mixture kernel that also relies on the above-mentioned forward-backward scans.

As alternatives to PMMH, we can consider Gibbs samplers that alternate between sampling the parameters from p(θ|x0:T,y0:T)p(\theta|x_{0:T},y_{0:T}) using an MH algorithm, and the agent states from pθ(x0:T|y0:T)p_{\theta}(x_{0:T}|y_{0:T}) using the above MCMC algorithms. In particular, we employ an equally weighted mixture of kernels, performing either NN random swap updates for each time step, or a systematic Gibbs scan with single-site updates (“single-site”) or block updates for B=5B=5 agents (“block5”). We compare these Gibbs samplers to PMMH on the simulated dataset of Section 3.3. We set the MH Normal proposal standard deviation within the Gibbs samplers to 0.080.08 to achieve suitable acceptance probabilities. We initialize the parameters from either the DGP or the prior distribution, and run all algorithms for the same number of iterations. Figure 7 displays the resulting approximations of the marginal posterior distribution of two parameters. Compared to the Gibbs samplers, the PMMH approximations seem to be more stable between runs with different initializations. These plots also suggest that the Gibbs samplers provide a poorer exploration of the tails of the posterior distribution, which may be attributed to the dependence between the agent states and parameters in this setting.

Figure 7: Histograms illustrating approximations of the marginal posterior distributions of β01\beta_{0}^{1} and ρ\rho, obtained using PMMH chains and Gibbs samplers with either single-site or block updates, and initialization from either the DGP or the prior distribution. Black vertical lines depict the DGP.