On backward smoothing algorithms

Hai-Dang Dau & Nicolas Chopin CREST-ENSAE, Institut Polytechnique de Paris nicolas.chopin@ensae.fr

Abstract.

In the context of state-space models, skeleton-based smoothing algorithms rely on a backward sampling step which by default has a $\mathcal{O}(N^{2})$ complexity (where $N$ is the number of particles). Existing improvements in the literature are unsatisfactory: a popular rejection sampling– based approach, as we shall show, might lead to badly behaved execution time; another rejection sampler with stopping lacks complexity analysis; yet another MCMC-inspired algorithm comes with no stability guarantee. We provide several results that close these gaps. In particular, we prove a novel non-asymptotic stability theorem, thus enabling smoothing with truly linear complexity and adequate theoretical justification. We propose a general framework which unites most skeleton-based smoothing algorithms in the literature and allows to simultaneously prove their convergence and stability, both in online and offline contexts. Furthermore, we derive, as a special case of that framework, a new coupling-based smoothing algorithm applicable to models with intractable transition densities. We elaborate practical recommendations and confirm those with numerical experiments.

1. Introduction

1.1. Background

A state-space model is composed of an unobserved Markov process $X_{0},\ldots,X_{T}$ and observed data $Y_{0},\ldots,Y_{T}$ . Given $X_{0},\ldots,X_{T}$ , the data $Y_{0},\ldots,Y_{T}$ are independent and generated through some specified emission distribution $Y_{t}|X_{t}\sim\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t})$ . These models have wide-ranging applications (e.g. in biology, economics and engineering). Two important inference tasks related to state-space models are filtering (computing the distribution of $X_{t}$ given $Y_{0},\ldots,Y_{t}$ ) and smoothing (computing the distribution of the whole trajectory $(X_{0},\ldots,X_{t})$ , again given all data until time $t$ ). Filtering is usually carried out through a particle filter, that is, a sequential Monte Carlo algorithm that propagates $N$ weighted particles (realisations) through Markov and importance sampling steps; see Chopin and Papaspiliopoulos, (2020) for a general introduction to state-space models (Chapter 2) and particle filters (Chapter 10).

This paper is concerned with skeleton-based smoothing algorithms, i.e. algorithms that approximate the smoothing distributions with empirical distributions based on the output of a particle filter (i.e. the locations and weights of the $N$ particles at each time step). A simple example is genealogy tracking (empirically proposed in Kitagawa,, 1996 and theoretically analysed in Del Moral and Miclo,, 2001) which keeps track of the ancestry (past states) of each particles. This smoother suffers from degeneracy: for $t$ large enough, all the particles have the same ancestor at time $0$ .

The forward filtering backward smoothing (FFBS) algorithm (Godsill et al.,, 2004) has been proposed as a solution to this problem. Starting from the filtering approximation at time $t$ , the algorithm samples successively particles at times $t-1$ , $t-2$ , etc. using backward kernels. Its theoretical properties, in particular the stability as $t\to\infty$ , have been studied by Del Moral et al., (2010); Douc et al., (2011).

In many applications, one is mainly interested in approximating smoothing expectations of additive functions of the form

\mathbb{E}\left[\left.{\psi_{0}(X_{0})+\psi_{1}(X_{0},X_{1})+\cdots+\psi_{t}(X_{t-1},X_{t})}\right|{Y_{0},\ldots,Y_{t}}\right]

for some functions $\psi_{0},\ldots,\psi_{t}$ . Such expectations can be approximated in an online fashion by a procedure described in Del Moral et al., (2010). Inspired by this, the particle-based, rapid incremental smoother (PaRIS) algorithm of Olsson and Westerborn, (2017) replaces some of the calculations with an additional layer of Monte Carlo approximation.

The backward sampling operation is central to both the FFBS and the PaRIS algorithms. The naive implementation has an $\mathcal{O}(N^{2})$ cost. There have been numerous attempts at alleviating this problem in the literature, but, to our knowledge, they all lack formal support in terms of either computation complexity or stability as $T\to\infty$ .

In the following five paragraphs, we elaborate on this limitation for each of the three major contenders, and we point out two related challenges with current backward sampling algorithms that we try to resolve in this article.

1.2. State of the art

First, Douc et al., (2011) proposed to use rejection sampling for the generation of backward indices in FFBS, and Olsson and Westerborn, (2017) extended this technique to PaRIS. If the model has upper-and-lower bounded transition densities, this sampler has an $\mathcal{O}(N)$ expected execution time (Douc et al., 2011, Proposition 2 and Olsson and Westerborn, 2017, Theorem 10). Unfortunately, most practical state space models (including linear Gaussian ones) violate this assumption, and the behaviour of the algorithm in this case is unclear. Empirically, it has been observed (Taghavi et al., 2013; Bunch and Godsill, 2013; Olsson and Westerborn, 2017, Section 4.3) that in real examples, FFBS-reject and PaRIS-reject frequently suffer from low acceptance rates, in contrary to what users would expect from an algorithm with linear complexity. To cite Bunch and Godsill, (2013), “[a]lthough theoretically elegant, the […] algorithm has been found to suffer from such high rejection rates as to render it consistently slower than direct sampling implementation on problems with more than one state dimension”. To the best of our knowledge, no theoretical result has been put forward to formalise or quantify this bad behaviour.

Second, Taghavi et al., (2013) and Olsson and Westerborn, (2017, Section 4.3) suggest putting a threshold on the number of rejection sampling trials to get more stable execution times. The thresholds are either chosen adaptively using a Kalman filter in Taghavi et al., (2013) or fixed at $\sqrt{N}$ in Olsson and Westerborn, (2017, Section 4.3). Although improvements are empirically observed, to the best of our knowledge, no theoretical analysis of the complexity of the resulting algorithm or formal justification of the proposed threshold is available.

Third, Bunch and Godsill, (2013) use MCMC moves starting from the filtering ancestor instead of a full backward sampling step. Empirically, this algorithm seems to prevent the degeneracy associated with the genealogy tracking smoother using a very small number of MCMC steps (e.g. less than five). Unfortunately, as far as we know, this stability property is not proved anywhere in the literature, which deters the adoption of the algorithm. Using MCMC moves provides a procedure with truly linear and deterministic run time, and a stability result is the only missing piece of the puzzle to completely resolve the $\mathcal{O}(N^{2})$ problem. We believe one reason for the current state of affair is that the stability proof techniques employed by Douc et al., (2011) and Olsson and Westerborn, (2017) are difficult to extend to the MCMC case.

Fourth, and this is related to the third point, the stability of the PaRIS algorithm has only been proved in the asymptotic regime. More specifically, Olsson and Westerborn, (2017) established a central limit theorem as $N\to\infty$ in Corollary 5, then showed that the corresponding asymptotic error remains controlled as $T\to\infty$ in Theorem 8 and Theorem 9. While non-asymptotic stability bounds for the FFBS algorithm are already available in Del Moral et al., (2010); Douc et al., (2011); Dubarry and Le Corff, (2013), we do not think that they can be extended straightforwardly to PaRIS and we are not aware of any such attempt in the literature.

Fifth, all backward samplers mentioned thus far require access to the transition density. Many models have dynamics that can be simulated from but transition densities that are not explicitly calculable. Enabling backward sampling in this scenario is challenging and will certainly require some kind of problem-specific knowledge to extract information from the transition densities, despite not being able to evaluate them exactly.

1.3. Structure and contribution

Section 2 presents a general framework which admits as particular cases a wide variety of existing algorithms (e.g. FFBS, forward-additive smoothing, PaRIS) as well as the novel ones considered later in the paper. It allows to simultaneously prove the consistency as $N\to\infty$ and the stability as $T\to\infty$ for all of them. The main ingredient is the discrete backward kernels, which are essentially random $N\times N$ matrices employed differently in the offline and the online settings. On the technical side, the stability result is proved using a new technique, yielding a non-asymptotic bound that addresses the fourth point in subsection 1.2.

Next, we closely look at the use of rejection sampling and realise that in many models, the resulting execution time may be significantly heavy-tailed; see Section 3. For instance, the run time of PaRIS may have infinite expectation, whereas the run time of FFBS may have infinite variance. (Since it is technically more involved, the material for the FFBS algorithm is delegated to Supplement B.) These results address the first point in subsection 1.2 and we discuss their severe practical implications.

We then derive and analyse hybrid rejection sampling schemes (i.e. schemes that use rejection sampling only up to a certain number of attempts, and then switch to the standard method). We show that they lead to a nearly $\mathcal{O}(N)$ algorithm (up to some $\log$ factor) in Gaussian models; again see Section 3. This stems from the subtle interaction between the tail of Gaussian densities and the finite Feynman-Kac approximation. Outside this class of model, the hybrid algorithm can still escape the $\mathcal{O}(N^{2})$ complexity, although it might not reach the ideal linear run time target. These results shed some light on the second issue mentioned in subsection 1.2.

In Section 4, we look at backward kernels that are more efficient to simulate than the FFBS and the PaRIS ones. Section 4.1 describes backward kernels based on MCMC (Markov chain Monte Carlo) following Bunch and Godsill, (2013) and extends them to the online scenario. We cast this family of algorithms as a particular case of the general framework developed in Section 2, which allows convergence and stability to follow immediately. This solves the long-standing problem described in the third point of subsection 1.2.

MCMC methods require evaluation of the likelihood and thus cannot be applied to models with intractable transition densities. In Section 4.2, we show how the use of forward coupling can replace the role of backward MCMC steps in these scenarios. This makes it possible to obtain stable performance in both on-line and off-line scenarios (with intractable transition densities) and provides a possible solution to the fifth challenge describe in subsection 1.2.

Section 5 illustrates the aforementioned algorithms in both on-line and off-line uses. We highlight how hybrid and MCMC samplers lead to a more user-friendly (i.e. smaller, less random and less model-dependent) execution time than the pure rejection sampler. We also apply our smoother for intractable densities to a continuous-time diffusion process with discretization. We observe that our procedure can indeed prevent degeneracy as $T\to\infty$ , provided that some care is taken to build couplings with good performance. Section 6 concludes the paper with final practical recommendations and further research directions. Table 1 gives an overview of existing and novel algorithms as well as our contributions for each.

1.4. Related work

Proposition 1 of Douc et al., (2011) states that under certain assumptions, the FFBS-reject algorithm has an asymptotic $\mathcal{O}_{\mathbb{P}}(N)$ complexity. This does not contradict our results, which point out the undesirable properties of the non-asymptotic execution time. Clearly, non-asymptotic behaviours are what users really observe in practice. From a technical point of view, the proof of Douc et al., (2011, Prop. 1) is a simple application of Theorem 5 of the same paper. In contrast, non-asymptotic results such as Theorem B.1 and Theorem B.2 require more delicate finite sample analyses.

Figure 1 of Olsson and Westerborn, (2017) and the accompanying discussion provide an excellent intuition on the stability of smoothing algorithms based on the support size of the backward kernels. We formalise this support size condition for the first time by the inequality (13) and construct a novel non-asymptotic stability result based on it. In contrast, Olsson and Westerborn, (2017) depart from their initial intuition and use an entirely different technique to establish stability. Their result is asymptotic in nature.

Gloaguen et al., (2022) briefly mention the use of MCMC in PaRIS algorithm, but their algorithm is fundamentally different to and less efficient than Bunch and Godsill, (2013). Indeed, they do not start the MCMC chains at the ancestors previously obtained during the filtering step. They are thus obliged to perform a large number of MCMC iterations for decorrelation, whereas the algorithms described in our Proposition 4, built upon the ideas of Bunch and Godsill, (2013), only require a single MCMC step to guarantee stability. However, we would like to stress again that Bunch and Godsill, (2013) did not prove this important fact.

Another way to reduce the computation time is to perform the expensive backward sampling steps at certain times $t$ only. For other values of $t$ , the naive genealogy tracking smoother is used instead. This idea has been recently proposed by Mastrototaro et al., (2021), who also provided a practical recipe for deciding at which values of $t$ the backward sampling should take place and derived corresponding theoretical results.

Smoothing in models with intractable transition densities is very challenging. If these densities can be estimated accurately, the algorithms proposed by Gloaguen et al., (2022) permit to attack this problem. A case of particular interest is diffusion models, where unbiased transition density estimators are provided in Beskos et al., (2006); Fearnhead et al., (2008). More recently, Yonekura and Beskos, (2022) use a special bridge path-space construction to overcome the unavailability of transition densities when the diffusion (possibly with jumps) must be discretised.

Our smoother for intractable models are based on a general coupling principle that is not specific to diffusions. We only require users to be able to simulate their dynamics (e.g. using discretisation in the case of diffusions) and to manipulate random numbers in their simulations so that dynamics starting from two different points can meet with some probability. Our method does not directly provide an estimator for the gradient of the transition density with respect to model parameters and thus cannot be used in its current form to perform maximum likelihood estimation (MLE) in intractable models; whereas the aforementioned work have been able to do so in the case of diffusions. However, the main advantage of our approach lies in its generality beyond the diffusion case. Furthermore, modifications allowing to perform MLE are possible and might be explored in further work specifically dedicated to the parameter estimation problem.

The idea of coupling has been incorporated in the smoothing problem in a different manner by Jacob et al., (2019). There, the goal is to provide offline unbiased estimates of the expectation under the smoothing distribution. Coupling and more generally ideas based on correlated random numbers are also useful in the context of partially observed diffusions via the multilevel approach (Jasra et al.,, 2017).

In this work, we consider smoothing algorithms that are based on a unique pass of the particle filter. Offline smoothing can be done using repeated iterations of the conditional particle filter (Andrieu et al.,, 2010). Full trajectories can also be constructed in an online manner if one is willing to accept some lag approximations (Duffield and Singh,, 2022). Another approach to smoothing consists of using an additional information filter (Fearnhead et al.,, 2010), but it is limited to functions depending on one state only. Each of these algorithmic families has their own advantages and disadvantages, of which a detailed discussion is out of the scope of this article (see however Nordh and Antonsson,, 2015).

2. General structure of smoothing algorithms

In this section, we decompose each smoothing algorithm into two separate parts: the backward kernel (which determines its theoretical properties such as the convergence and the stability) and the execution mode (which is either online or offline and determines its implementation). This has two advantages: first, it induces an easy-to-navigate categorization of algorithms (see Table 1); and second, it allows to prove the convergence and the stability for each of them by verifying sufficient conditions on the backward kernel component only.

Mode \ Kernel

FFBS kernel

PaRIS kernel

MCMC kernels

Intract.

Offline

(*) FFBS

(+) Thm. B.1, Thm. B.2

(+) Thm. B.3, Cor. 2

(*) FFBS-MCMC

(+) Prop. 4

(**)

Online

(*) Forward-additive

(*) PaRIS

(+) Thm. 2.2, Prop. 2

(+) Thm. 3.1, Thm. 3.2

(**)

Table 1. Summary of smoothing algorithms considered in this paper (classified by the backward kernel and the execution mode) and our contributions. (*) means an existing algorithm, (+) means a novel theoretical result and (**) means a novel algorithm

2.1. Notations

Measure-kernel-function notations. Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces with respective $\sigma$ -algebras $\mathcal{B}(\mathcal{X})$ and $\mathcal{B}(\mathcal{Y})$ . The following definitions involve integrals and only make sense when they are well-defined. For a measure $\mu$ on $\mathcal{X}$ and a function $f:\mathcal{X}\to\mathbb{R}$ , the notations $\mu f$ and $\mu(f)$ refer to $\int f(x)\mu(\mathrm{d}x)$ . A kernel (resp. Markov kernel) $K$ is a mapping from $\mathcal{X}\times\mathcal{B}(\mathcal{Y})$ to $\mathbb{R}$ (resp. $[0,1]$ ) such that, for $B\in\mathcal{B}(\mathcal{Y})$ fixed, $x\mapsto K(x,B)$ is a measurable function on $\mathcal{X}$ ; and for $x$ fixed, $B\mapsto K(x,B)$ is a measure (resp. probability measure) on $\mathcal{Y}$ . For a real-valued function $g$ defined on $\mathcal{Y}$ , let $Kg:\mathcal{X}\to\mathbb{R}$ be the function $Kg(x):=\int g(y)K(x,\mathrm{d}y)$ . We sometimes write $K(x,g)$ for the same expression. The product of the measure $\mu$ on $\mathcal{X}$ and the kernel $K$ is a measure on $\mathcal{Y}$ , defined by $\mu K(B):=\int K(x,B)\mu(\mathrm{d}x)$ . Other notations. • The notation $X_{0:t}$ is a shorthand for $(X_{0},\ldots,X_{t})$ • We denote by $\mathcal{M}(W^{1:N})$ the multinomial distribution supported on $\left\{1,2,\ldots,N\right\}$ . The respective probabilities are $W_{1},\ldots,W_{N}$ . If they do not sum to $1$ , we implicitly refer to the normalised version obtained by multiplication of the weights with the appropriate constant • The symbol $\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}$ means convergence in probability and $\Rightarrow$ means convergence in distribution • The geometric distribution with parameter $\lambda$ is supported on $\mathbb{Z}_{\geq 1}$ , has probability mass function $f(n)=\lambda(1-\lambda)^{n-1}$ and is noted by $\operatorname{Geo}(\lambda)$ • Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces. Let $\mu$ and $\nu$ be two probability measures on $\mathcal{X}$ and $\mathcal{Y}$ respectively. The o-times product measure $\mu\otimes\nu$ is defined via $(\mu\otimes\nu)(h):=\iint h(x,y)\mu(\mathrm{d}x)\nu(\mathrm{d}y)$ for bounded functions $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ . If $X\sim\mu$ and $Y\sim\nu$ , we sometimes note $\mu\otimes\nu$ by $X\otimes Y$ .

2.2. Feynman-Kac formalism and the bootstrap particle filter

Let $\mathcal{X}_{0:T}$ be a sequence of measurable spaces and $M_{1:T}$ be a sequence of Markov kernels such that $M_{t}$ is a kernel from $\mathcal{X}_{t-1}$ to $\mathcal{X}_{t}$ . Let $X_{0:T}$ be an unobserved inhomogeneous Markov chain with starting distribution $X_{0}\sim\mathbb{M}_{0}(\mathrm{d}x_{0})$ and Markov kernels $M_{1:T}$ ; i.e. $X_{t}|X_{t-1}\sim M_{t}(X_{t-1},\mathrm{d}x_{t})$ for $t\geq 1$ . We aim to study the distribution of $X_{0:T}$ given observed data $Y_{0:T}$ . Conditioned on $X_{0:T}$ , the data $Y_{0},\ldots,Y_{T}$ are independent and $Y_{t}|X_{0:T}\equiv Y_{t}|X_{t}\sim\boldsymbol{f}_{t}(\cdot|X_{t})$ for a certain emission distribution $\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t})$ . Assume that there exists dominating measures $\tilde{\lambda}_{t}$ not depending on $x_{t}$ such that

\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t})=f_{t}(y_{t}|x_{t})\tilde{\lambda}_{t}(\mathrm{d}y_{t}).

The distribution of $X_{0:t}|Y_{0:t}$ is then given by

(1)

\mathbb{Q}_{t}(\mathrm{d}x_{0:t})=\frac{1}{L_{t}}\mathbb{M}_{0}(\mathrm{d}x_{0})\prod_{s=1}^{t}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s})

where $G_{s}(x_{s}):=f(y_{s}|x_{s})$ and $L_{t}>0$ is the normalising constant. Moreover, $\mathbb{Q}_{-1}:=\mathbb{M}_{0}$ and $L_{-1}:=1$ by convention. Equation (1) defines a Feynman-Kac model (Del Moral,, 2004). It does not require $M_{t}$ to admit a transition density, although herein we only consider models where this assumption holds. Let $\lambda_{t}$ be a dominating measure on $\mathcal{X}_{t}$ in the sense that there exists a function $m_{t}$ (not necessarily tractable) such that

(2)

M_{t}(x_{t-1},\mathrm{d}x_{t})=m_{t}(x_{t-1},x_{t})\lambda_{t}(\mathrm{d}x_{t}).

A special case of the current framework are linear Gaussian state space models. They will serve as a running example for the article, and some of the results will be specifically demonstrated for models of this class. The rationale is that many real-world dynamics are partly, or close to, Gaussian. The notations for linear Gaussian models are given in Supplement A.1 and we will refer to them whenever this model class is discussed.

Particle filters are algorithms that sample from $\mathbb{Q}_{t}(\mathrm{d}x_{t})$ in an on-line manner. In this article, we only consider the bootstrap particle filter (Gordon et al.,, 1993) and we detail its notations in Algorithm 1. Many results in the following do apply to the auxiliary filter (Pitt and Shephard,, 1999) as well, and we shall as a rule indicate explicitly when it is not the case.

Input: Feynman-Kac model (1)

Simulate

X_{0}^{1:N}\overset{\text{i.i.d.}}{\sim}\mathbb{M}_{0}

Set

\omega_{0}^{n}\leftarrow G_{0}(X_{0}^{n})

for

n=1,\ldots,N

Set

\ell_{0}^{N}\leftarrow\sum_{n=1}^{N}\omega_{0}^{n}/N

Set

W_{0}^{n}\leftarrow\omega_{0}^{n}/N\ell_{0}^{N}

for

n=1,\ldots,N

for $t\leftarrow 1$ to T do

Resample. Simulate

A_{t}^{1:N}\overset{\text{i.i.d.}}{\sim}\mathcal{M}(W_{t-1}^{1:N})

Move. Simulate

X_{t}^{n}\sim M_{t}(X_{t-1}^{A_{t}^{n}},\mathrm{d}x_{t})

for

n=1,\ldots,N

Reweight. Set

\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n})

for

n=1,2,\ldots,N

Set

\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N

Set

W_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N}

for

n=1,2,\ldots,N

Output: For all

t\geq 0

and function

\varphi:\mathcal{X}_{t}\to\mathbb{R}

, the quantity

\sum_{n=1}^{N}W_{t}^{n}\varphi(X_{t}^{n})

approximates

\int\mathbb{Q}_{t}(\mathrm{d}x_{0:t})\varphi(x_{t})

and the quantity

\ell_{t}^{N}

approximates

L_{t}/L_{t-1}

Algorithm 1 Bootstrap particle filter

We end this subsection with the definition of two sigma-algebras that will be referred to throughout the paper. Using the notations of Algorithm 1, let

(3)

\begin{split}\mathcal{F}_{t}&:=\sigma(X_{0:t}^{1:N},A_{1:t}^{1:N}),\\ \mathcal{F}_{t}^{-}&:=\sigma(X_{0:t}^{1:N}).\end{split}

2.3. Backward kernels and off-line smoothing

In this subsection, we first describe three examples of backward kernels, in which we emphasise both the random measure and the random matrix viewpoints. We then formalise their use by stating a generic off-line smoothing algorithm.

Example 1 (FFBS algorithm, Godsill et al.,, 2004).

Once Algorithm 1 has been run, the FFBS procedure generates a trajectory approximating the smoothing distribution in a backward manner. More precisely, it starts by simulating index $\mathcal{I}_{T}\sim\mathcal{M}(W_{T}^{1:N})$ at time $T$ . Then, recursively for $t=T,\ldots,1$ , given indices $\mathcal{I}_{t:T}$ , it generates $\mathcal{I}_{t-1}\in\left\{1,\ldots,N\right\}$ with probability proportional to $W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}})$ . The smoothing trajectory is returned as $(X_{0}^{\mathcal{I}_{0}},\ldots,X_{T}^{\mathcal{I}_{T}})$ . Formally, given $\mathcal{F}_{T}$ , the indices $\mathcal{I}_{0:T}$ are generated according to the distribution

\mathcal{M}(W_{t}^{1:N})(\mathrm{d}i_{T})\left[B_{T}^{N,\mathrm{FFBS}}(i_{T},\mathrm{d}i_{T-1})B_{T-1}^{N,\mathrm{FFBS}}(i_{T-1},\mathrm{d}i_{T-2})\ldots B_{1}^{N,\mathrm{FFBS}}(i_{1},\mathrm{d}i_{0})\right]

where the (random) backward kernels $B_{t}^{N,\mathrm{FFBS}}$ are defined by

(4)

B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1}):=\sum_{n=1}^{N}\frac{W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{i_{t}})}{\sum_{k=1}^{N}W_{t-1}^{k}m_{t}(X_{t-1}^{k},X_{t}^{i_{t}})}\delta_{n}(\mathrm{d}i_{t-1}).

More simply, we can also look at these random kernels as random $N\times N$ matrices of which entries are given by

(5)

\hat{B}_{t}^{N,\mathrm{FFBS}}[i_{t},i_{t-1}]:=\frac{W_{t-1}^{i_{t-1}}m_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})}{\sum_{k=1}^{N}W_{t-1}^{k}m_{t}(X_{t-1}^{k},X_{t}^{i_{t}})}.

We will need both the kernel viewpoint (4) and the matrix viewpoint (5) in this paper as the better choice depends on the context.

Example 2 (Genealogy tracking, Kitagawa,, 1996; Del Moral and Miclo,, 2001).

It is well known that Algorithm 1 already gives as a by-product an approximation of the smoothing distribution. This information can be extracted from the genealogy, by first simulating index $\mathcal{I}_{T}\sim\mathcal{M}(W_{T}^{1:N})$ at time $T$ , then successively appending ancestors until time $0$ (i.e. setting sequentially $\mathcal{I}_{t-1}\leftarrow A_{t}^{\mathcal{I}_{t}}$ ). The smoothed trajectory is returned as $(X_{0}^{\mathcal{I}_{0}},\ldots,X_{T}^{\mathcal{I}_{T}})$ . More formally, conditioned on $\mathcal{F}_{T}$ , we simulate the indices $\mathcal{I}_{0:T}$ according to

\mathcal{M}(W_{t}^{1:N})(\mathrm{d}i_{T})\left[B_{T}^{N,\mathrm{GT}}(i_{T},\mathrm{d}i_{T-1})B_{T-1}^{N,\mathrm{GT}}(i_{T-1},\mathrm{d}i_{T-2})\ldots B_{1}^{N,\mathrm{GT}}(i_{1},\mathrm{d}i_{0})\right]

where GT stands for “genealogy tracking” and the kernels $B_{t}^{N,\mathrm{GT}}$ are simply

(6)

B_{t}^{N,\mathrm{GT}}(i_{t},\mathrm{d}i_{t-1}):=\delta_{A_{t}^{i_{t}}}(\mathrm{d}i_{t-1}).

Again, it may be more intuitive to view this random kernel as a random $N\times N$ matrix, the elements of which are given by

\hat{B}_{t}^{N,\mathrm{GT}}[i_{t},i_{t-1}]:=\mathbbm{1}\left\{i_{t-1}=A_{t}^{i_{t}}\right\}.

Example 3 (MCMC backward samplers, Bunch and Godsill,, 2013).

In Example 2, the backward variable $\mathcal{I}_{t-1}$ is simply set to $A_{t}^{\mathcal{I}_{t}}$ . On the contrary, in Example 1, we need to launch a simulator for the discrete measure $W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}})$ . Interestingly, the current value of $A_{t}^{\mathcal{I}_{t}}$ is not taken into account in that simulator. Therefore, a natural idea to combine the two previous examples is to apply one (or several) MCMC steps to $A_{t}^{\mathcal{I}_{t}}$ and assign the result to $\mathcal{I}_{t-1}$ . The MCMC algorithm operates on the space $\left\{1,2,\ldots,N\right\}$ and targets the invariant measure $W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}})$ . If only one independent Metropolis-Hastings (MH) step is used and the proposal is $\mathcal{M}(W_{t-1}^{1:N})$ , the corresponding random matrix $\hat{B}_{t}^{N,\mathrm{IMH}}$ has values

\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},i_{t-1}]=W_{t-1}^{i_{t-1}}\min\left(1,{m_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})}/{m_{t}(X_{t-1}^{A_{t}^{i_{t}}},X_{t}^{i_{t}})}\right)

if $i_{t-1}\neq A_{t}^{i_{t}}$ , and

\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},A_{t}^{i_{t}}]=1-\sum_{n\neq A_{t}^{i_{t}}}\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},n].

This third example shows that some elements of the matrix $\hat{B}_{t}^{N,\mathrm{IMH}}$ might be expensive to calculate. If several MCMC steps are performed, all elements of $\hat{B}_{t}^{N,\mathrm{IMH}}$ will have non-trivial expressions. Still, simulating from $B_{t}^{N,\mathrm{IMH}}(i_{t},\mathrm{d}i_{t-1})$ is easy as it amounts to running a standard MCMC algorithm. MCMC backward samples are studied in more details in Section 4.1.

We formalise how off-line smoothing can be done given random matrices $\hat{B}_{1:T}^{N}$ ; see Algorithm 2. Note that in the above examples, our matrices $\hat{B}_{t}^{N}$ are $\mathcal{F}_{t}$ -measurable (i.e. they depend on particles and indices up to time $t$ ), but this is not necessarily the case in general (i.e. they may also depend on additional random variables, see Section 2.5). Furthermore, Algorithm 2 describes how to perform smoothing using the matrices $\hat{B}_{1:T}^{N}$ , but does not say where they come from. At this point, it is useful to keep in mind the above three examples. In Section 2.4, we will give a general recipe for constructing valid matrices $\hat{B}_{t}^{N}$ (i.e. those that give a consistent algorithm).

Input: Filtering results

X_{0:T}^{1:N}

W_{0:T}^{1:N}

, and

A_{1:T}^{1:N}

from Algorithm 1; random matrices

\hat{B}_{1:T}^{N}

(see Section 2.3 for two examples of such matrices and Section 2.4 for a general recipe to construct them)

for $n\leftarrow 1$ to $N$ do

Simulate

\mathcal{I}_{T}^{n}\sim\mathcal{M}(W_{T}^{1:N})

for $t\leftarrow T$ to 1 do

Simulate

\mathcal{I}_{t-1}^{n}\sim B_{t}^{N}(\mathcal{I}_{t}^{n},\mathrm{d}i_{t-1})

(the kernel

B_{t}^{N}(\mathcal{I}_{t},\cdot)

is defined by the

\mathcal{I}_{t}

-th row of the input matrix

\hat{B}_{t}^{N}

)

Output: The

N

smoothed trajectories

(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})

for

n=1,\ldots,N

Algorithm 2 Generic off-line smoother

Algorithm 2 simulates, given $\mathcal{F}_{T}$ and $\hat{B}_{1:T}^{N}$ , $N$ i.i.d. index sequences $\mathcal{I}_{0:T}^{n}$ , each distributed according to

\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\prod_{t=T}^{1}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1}).

Once the indices $\mathcal{I}_{0:T}^{1:N}$ are simulated, the $N$ smoothed trajectories are returned as $(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})$ . Given $\mathcal{F}_{T}$ and $\hat{B}_{1:T}^{N}$ , they are thus conditionally i.i.d. and their conditional distribution is described by the $x_{0:T}$ component of the joint distribution

(7)

\bar{\mathbb{Q}}_{T}^{N}(\mathrm{d}x_{0:T},\mathrm{d}i_{0:T}):=\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\left[\prod_{t=T}^{1}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\right]\left[\prod_{t=T}^{0}\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})\right].

Throughout the paper, the symbol $\bar{\mathbb{Q}}_{T}^{N}$ will refer to this joint distribution, while the symbol $\mathbb{Q}_{T}^{N}$ will refer to the $x_{0:T}$ -marginal of $\bar{\mathbb{Q}}_{T}^{N}$ only. This allows the notation $\mathbb{Q}_{T}^{N}\varphi$ to make sense, where $\varphi=\varphi(x_{0},\ldots,x_{T})$ is a real-valued function defined on the hidden states.

2.4. Validity and convergence

The kernels $B_{t}^{N,\text{FFBS}}$ and $B_{t}^{N,\text{GT}}$ are both valid backward kernels to generate convergent approximation of the smoothing distribution (Del Moral,, 2004; Douc et al.,, 2011). This subsection shows that they are not the only ones and gives a sufficient condition for a backward kernel to be valid. It will prove a necessary tool to build more efficient $B_{t}^{N}$ later in the paper.

Recall that Algorithm 1 outputs particles $X_{0:T}^{1:N}$ , weights $W_{0:T}^{1:N}$ and ancestor variables $A_{1:T}^{1:N}$ . Imagine that the $A_{1:T}^{1:N}$ were discarded after filtering has been done and we wish to simulate them back. We note that, since the $X_{0:T}^{1:N}$ are given, the $T\times N$ variables $A_{1:T}^{1:N}$ are conditionally i.i.d. We can thus simulate them back from

p(a_{t}^{n}|x_{0:T}^{1:N})=p(a_{t}^{n}|x_{t-1}^{1:N},x_{t}^{n})\propto w_{t-1}^{a_{t}^{n}}m_{t}(x_{t-1}^{a_{t}^{n}},x_{t}^{n}).

This is precisely the distribution of $B_{t}^{N,\text{FFBS}}(n,\cdot)$ . It turns out that any other invariant kernel that can be used for simulating back the discarded $A_{1:T}^{1:N}$ will lead to a convergent algorithm as well. For instance, $B_{t}^{N,\mathrm{GT}}(n,\cdot)$ (Example 2) simply returns back the old $A_{t}^{n}$ , unlike $B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ which creates a new version. The kernel $B_{t}^{N,\mathrm{IMH}}(n,\cdot)$ (Example 3) is somewhat an intermediate between the two. We formalise these intuitions in the following theorem. It is stated for the bootstrap particle filter, but as a matter of fact, the proof can be extended straightforwardly to auxiliary particle filters as well.

Assumption 1.

For all $0\leq t\leq T$ , $G_{t}(x_{t})>0$ and $\left\lVert G_{t}\right\rVert_{\infty}<\infty$ .

Theorem 2.1.

We use the same notations as in Algorithms 1 and 2 (in particular, $\hat{B}_{t}^{N}$ denotes the transition matrix that corresponds to the considered kernel $B_{t}^{N}$ ). Assume that for any $1\leq t\leq T$ , the random matrix $\hat{B}_{t}^{N}$ satisfies the following conditions:

•

given $\mathcal{F}_{t-1}$ and $\hat{B}_{1:t-1}^{N}$ , the variables $(X_{t}^{n},A_{t}^{n},\hat{B}_{t}^{N}(n,\cdot))$ for $n=1,\ldots,N$ are i.i.d. and their distribution only depends on $X_{t-1}^{1:N}$ , where $\hat{B}_{t}^{N}(n,\cdot)$ is the $n$ -th row of matrix $\hat{B}_{t}^{N}$ ;
•

if $J_{t}^{n}$ is a random variable such that

$J_{t}^{n}\ |\ X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N}(n,\cdot)\sim B_{t}^{N}(n,\cdot)$

, then $(J_{t}^{n},X_{t}^{n})$ has the same distribution as $(A_{t}^{n},X_{t}^{n})$ given $X_{t-1}^{1:N}$ .

Then under Assumption 1, there exists constants $C_{T}>0$ and $S_{T}<\infty$ such that, for any $\delta>0$ and function $\varphi=\varphi(x_{0},\ldots,x_{T})$ :

(8)

\mathbb{P}\left(\left|\mathbb{Q}_{T}^{N}\varphi-\mathbb{Q}_{T}\varphi\right|\geq\frac{\sqrt{-2\log(\delta/2C_{T})}S_{T}\left\lVert\varphi\right\rVert_{\infty}}{\sqrt{N}}\right)\leq\delta

where $\mathbb{Q}_{T}^{N}$ is defined by (7).

A typical relation between variables defined in the statement of the theorem is illustrated by a graphical model in Figure 1. (See Bishop, 2006, Chapter 8 for the formal definition of graphical models and how to use them.) By “typical”, we mean that Theorem 2.1 technically allows for more complicated relations, but the aforementioned figure captures the most essential cases.

Figure 1. Relation between variables described in Theorem 2.1.

Theorem 2.1 is a generalisation of Douc et al., (2011, Theorem 5). Its proof thus follows the same lines (Supplement E.1). However, in our case the measure $\mathbb{Q}_{T}^{N}(\mathrm{d}x_{0:T})$ is no longer Markovian. This is because the backward kernel $B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})$ does not depend on $X_{t}^{i_{t}}$ alone, but also possibly on its ancestor and extra random variables. This small difference has a big consequence: compared to Douc et al., (2011, Theorem 5), Theorem 2.1 has a much broader applicability and encompasses, for instance, the MCMC-based algorithms presented in Section 4.1 and novel kernels presented in Section 4.2 for intractable densities.

As we have seen in (7), $\mathbb{Q}_{T}^{N}$ is fundamentally a discrete measure of which the support contains $N^{T+1}$ elements. As such, $\mathbb{Q}_{T}^{N}\varphi$ cannot be computed exactly in general and must be approximated using $N$ trajectories $(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})$ simulated via Algorithm 2. Theorem 2.1 is thus completed by the following corollary, which is an immediate consequence of Hoeffding inequality (Supplement E.13).

Corollary 1.

Under the same setting as Theorem 2.1, we have

\mathbb{P}\left(\left|\frac{1}{N}\sum_{n}\varphi(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})-\mathbb{Q}_{T}\varphi\right|\geq\frac{\sqrt{-2\log\left(\frac{\delta}{2(C_{T}+1)}\right)}(S_{T}+1)\left\lVert\varphi\right\rVert_{\infty}}{\sqrt{N}}\right)\leq\delta.

2.5. Generic on-line smoother

As we have seen in Section 2.3 and Section 2.4, in general, the expectation $\mathbb{Q}_{T}^{N}\varphi$ , for a real-valued function $\varphi=\varphi(x_{0},\ldots,x_{T})$ of the hidden states, cannot be computed exactly due to the large support ( $N^{T+1}$ elements) of $\mathbb{Q}_{T}^{N}$ . Moreover, in certain settings we are interested in the quantities $\mathbb{Q}_{t}^{N}\varphi_{t}$ for different functions $\varphi_{t}$ . They cannot be approximated in an on-line manner without more assumptions on the connection between $\varphi_{t-1}$ and $\varphi_{t}$ . If the family $(\varphi_{t})$ is additive, i.e. there exists functions $\psi_{t}$ such that

(9)

\varphi_{t}(x_{0:t}):=\psi_{0}(x_{0})+\psi_{1}(x_{0},x_{1})+\cdots+\psi_{t}(x_{t-1},x_{t})

then we can calculate $\mathbb{Q}_{t}^{N}\varphi_{t}$ both exactly and on-line. The procedure was first described in Del Moral et al., (2010) for the kernel $\mathbb{Q}_{t}^{N,\text{FFBS}}$ (i.e. the measure defined by (7) and the random kernels $B_{t}^{N,\text{FFBS}}$ ), but we will use the idea for other kernels as well. In this subsection, we first explain the principle of the method, then discuss its computational complexity and the link to the PaRIS algorithm (Olsson and Westerborn,, 2017).

Principle

For simplicity, we start with the special case $\varphi_{t}(x_{0:t})=\psi_{0}(x_{0})$ . Equation (7) and the matrix viewpoint of Markov kernels then give

\mathbb{Q}_{t}^{N}\varphi_{t}=\begin{bmatrix}W_{t}^{1}\ldots W_{t}^{N}\end{bmatrix}\hat{B}_{t}^{N}\hat{B}_{t-1}^{N}\ldots\hat{B}_{1}^{N}\begin{bmatrix}\psi_{0}(X_{0}^{1})\\ \vdots\\ \psi_{0}(X_{0}^{N})\end{bmatrix}.

This naturally suggests the following recursion formula to compute $\mathbb{Q}_{t}^{N}\varphi_{t}$ :

\mathbb{Q}_{t}^{N}\varphi_{t}=\begin{bmatrix}W_{t}^{1}\ldots W_{t}^{N}\end{bmatrix}\hat{S}_{t}^{N}

with $\hat{S}_{0}^{N}=[\psi_{0}(X_{0}^{1})\ldots\psi_{0}(X_{0}^{N})]^{\top}$ and

(10)

\hat{S}_{t}^{N}:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}.

In the general case where functions $\varphi_{t}$ are given by (9), simple calculations (Supplement E.2) show that (10) is replaced by

(11)

\hat{S}_{t}^{N}:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}+\operatorname{diag}(\hat{B}_{t}^{N}\hat{\psi}_{t}^{N})

where the $N\times N$ matrix $\hat{\psi}_{t}^{N}$ is defined by

\hat{\psi}_{t}^{N}[i_{t-1},i_{t}]:=\psi_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})

and the operator $\operatorname{diag}:\mathbb{R}^{N\times N}\rightarrow\mathbb{R}^{N}$ extracts the diagonal of a matrix. This is exactly what is done in Algorithm 3.

Input: Particles

X_{t-1}^{1:N}

and weights

W_{t-1}^{1:N}

at time

t-1

; the

N\times 1

vector

\hat{S}_{t-1}^{N}

(see text); additive function (9)

Generate

X^{1:N}_{t}

and

W_{t}^{1:N}

according to the particle filter (Algorithm 1)

Calculate the random matrix

\hat{B}_{t}^{N}

(see Section 2.3 and Section 2.4)

Create the

N\times 1

vector

\hat{S}_{t}^{N}

according to (11). More precisely:

for $i_{t}\leftarrow 1$ to $N$ do

\hat{S}_{t}^{N}[i_{t}]\leftarrow\sum_{i_{t-1}}\hat{B}_{t}^{N}[i_{t},i_{t-1}]\left(\hat{S}_{t-1}^{N}[i_{t-1}]+\psi_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})\right)

Output: Quantity

\sum_{n}W_{t}^{n}\hat{S}_{t}^{N}[n]

which is equal to

\mathbb{Q}_{t}^{N}\varphi_{t}

and is an esimate of

\mathbb{Q}_{t}(\varphi_{t})

; particles

X_{t}^{1:N}

, weights

W_{t}^{1:N}

and vector

S_{t}^{N}

for the next step

Algorithm 3 Generic on-line smoother for additive functions (one step)

Computational complexity and the PaRIS algorithm

Equations (10) and (11) involve a matrix-vector multiplication and thus require, in general, $\mathcal{O}(N^{2})$ operations to be evaluated. When $\hat{B}_{t}^{N}\equiv\hat{B}_{t}^{N,\text{FFBS}}$ , Algorithm 3 becomes the $\mathcal{O}(N^{2})$ on-line smoothing algorithm of Del Moral et al., (2010). The $\mathcal{O}(N^{2})$ complexity can however be lowered to $\mathcal{O}(N)$ if the matrices $\hat{B}_{t}^{N}$ are sparse. This is the idea behind the PaRIS algorithm (Olsson and Westerborn,, 2017), where the full matrix $\hat{B}_{t}^{N,\text{FFBS}}$ is unbiasedly estimated by a sparse matrix $\hat{B}_{t}^{N,\text{PaRIS}}$ . More specifically, for any integer $\tilde{N}>1$ , for any $n\in 1,\ldots,N$ , let $J_{t}^{n,1},\ldots,J_{t}^{n,\tilde{N}}$ be conditionally i.i.d. random variables simulated from $B_{t}^{N,\text{FFBS}}(n,\cdot)$ . The random matrix $\hat{B}_{t}^{N,\text{PaRIS}}$ is then defined as

\hat{B}_{t}^{N,\text{PaRIS}}[n,m]:=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\mathbbm{1}\left\{J_{t}^{n,\tilde{n}}=m\right\}

and the corresponding random kernel is

(12)

B_{t}^{N,\text{PaRIS}}(n,\mathrm{d}m)=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\delta_{J_{t}^{n,\tilde{n}}}(\mathrm{d}m).

The following straightforward proposition establishes the validity of the $B_{t}^{N,\text{PaRIS}}$ kernel. Together with Theorem 2.1, it can be thought of as a reformulation of the consistency of the PaRIS algorithm (Olsson and Westerborn,, 2017, Corollary 2) in the language of our framework.

Proposition 1.

The matrix $\hat{B}_{t}^{N,\mathrm{PaRIS}}$ has only $\mathcal{O}(N\tilde{N})$ non-zero elements out of $N^{2}$ . It is an unbiased estimate of $\hat{B}_{t}^{N,\mathrm{FFBS}}$ in the sense that

\mathbb{E}\left[\left.{\hat{B}_{t}^{N,\mathrm{PaRIS}}}\right|{\mathcal{F}_{t}}\right]=\hat{B}_{t}^{N,\mathrm{FFBS}}.

Moreover, the sequence of matrices $B_{1:T}^{N,\mathrm{PaRIS}}$ satisfies the two conditions of Theorem 2.1.

The proposition also justifies the $\mathcal{O}(N)$ complexity of (10) and (11), as long as $\tilde{N}$ is fixed as $N\to\infty$ . But it is important to remark that the preceding $\mathcal{O}(N)$ complexity does not include the cost of generating the matrices $\hat{B}_{t}^{N,\text{PaRIS}}$ themselves, i.e., the operations required to simulate the indices $J_{t}^{n,\tilde{n}}$ . In Olsson and Westerborn, (2017) it is argued that such simulations have an $\mathcal{O}(N)$ cost using the rejection sampling method whenever the transition density is both upper and lower bounded. Section 3 investigates the claim when this hypothesis is violated.

2.6. Stability

When $\hat{B}_{t}^{N}\equiv\hat{B}_{t}^{N,\text{GT}}$ , Algorithms 2 and 3 reduce to the genealogy tracking smoother (Kitagawa,, 1996). The matrix $\hat{B}_{t}^{N,\text{GT}}$ is indeed sparse, leading to the well-known $\mathcal{O}(N)$ complexity of this on-line procedure. As per Theorem 2.1, smoothing via genealogy tracking is convergent at rate $\mathcal{O}(N^{-1/2})$ if $T$ is fixed. When $T\to\infty$ however, all particles will eventually share the same ancestor at time $0$ (or any fixed time $t$ ). Mathematically, this phenomenon is manifested in two ways: (a) for fixed $t$ and function $\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}$ , the error of estimating $\mathbb{E}[\phi_{t}(X_{t})|Y_{0:T}]$ grows linearly with $T$ ; and (b) the error of estimating $\mathbb{E}\left[\left.{\sum_{t=0}^{T}\psi_{t}(x_{t-1},x_{t})}\right|{Y_{0:T}}\right]$ grows quadratically with $T$ . These correspond respectively to the degeneracy for the fixed marginal smoothing and the additive smoothing problems; see also the introductory section of Olsson and Westerborn, (2017) for a discussion. The random matrices $\hat{B}_{t}^{N,\text{GT}}$ are therefore said to be unstable as $T\to\infty$ , which is not the case for $\hat{B}_{t}^{N,\text{FFBS}}$ or $\hat{B}_{t}^{N,\text{PaRIS}}$ . This subsection gives sufficient conditions to ensure the stability of a general $\hat{B}_{t}^{N}$ .

The essential point behind smoothing stability is simple: the support of $B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ or $B_{t}^{N,\mathrm{PaRIS}}(n,\cdot)$ for $\tilde{N}\geq 2$ contains more than one element, contrary to that of $B_{t}^{N,\mathrm{GT}}(n,\cdot)$ . This property is formalised by (13). To explain the intuitions, we use the notations of Algorithm 2 and consider the estimate

N^{-1}\left(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}})+\cdots+\psi_{0}(X_{0}^{\mathcal{I}_{0}^{N}})\right)

of $\mathbb{E}\left[\left.{\psi_{0}(X_{0})}\right|{Y_{0:T}}\right]$ when $T\to\infty$ . The variance of the quantity above is a sum of $\mathrm{Cov}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{i}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{j}}))$ terms. It can therefore be understood by looking at a pair of trajectories simulated using Algorithm 2.

At final time $t=T$ , $\mathcal{I}_{T}^{1}$ and $\mathcal{I}_{T}^{2}$ both follow the $\mathcal{M}(W_{T}^{1:N})$ distribution. Under regularity conditions (e.g. no extreme weights), they are likely to be different, i.e., $\mathbb{P}(\mathcal{I}_{T}^{1}=\mathcal{I}_{T}^{2})=\mathcal{O}(1/N)$ . This property can be propagated backward: as long as $\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}$ , the two variables $\mathcal{I}_{t-1}^{1}$ and $\mathcal{I}_{t-1}^{2}$ are also likely to be different, with however a small $\mathcal{O}(1/N)$ chance of being equal. Moreover, as long as the two trajectories have not met, they can be simulated independently given $\mathcal{F}_{T}^{-}$ (the sigma algebra defined in (3)). In mathematical terms, under the two hypotheses of Theorem 2.1, given $\mathcal{F}_{T}^{-}$ and $\mathcal{I}_{t:T}^{1,2}$ , it can be proved that the two variables $\mathcal{I}_{t-1}^{1}$ and $\mathcal{I}_{t-1}^{2}$ are independent if $\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}$ (Lemma E.1, Supplement E.3).

Since there is an $\mathcal{O}(1/N)$ chance of meeting at each time step, if $T\gg N$ , it is likely that the two paths will meet at some point $t\gg 0$ . When $\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}$ , the two indices $\mathcal{I}_{t-1}$ and $\mathcal{I}_{t-2}$ are both simulated according to $B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot)$ . In the genealogy tracking algorithm, $B_{t}^{N,\text{GT}}(i,\cdot)$ is a Dirac measure, leading to $\mathcal{I}_{t-1}^{1}=\mathcal{I}_{t-1}^{2}$ almost surely. This spreads until time $0$ , so $\operatorname{Corr}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{2}}))$ is almost $1$ if $T\gg N$ .

Other kernels like $B_{t}^{N,\text{FFBS}}$ or $B_{t}^{N,\text{PaRIS}}$ do not suffer from the same problem. For these, the support size of $B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot)$ is greater than one and thus there is some real chance that $\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}$ . If that does happen, we are again back to the regime where the next states of the two paths can be simulated independently. Note also that the support of $B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot)$ does not need to be large and can contain as few as $2$ elements. Even if $\mathcal{I}_{t-1}^{1}$ might still be equal to $\mathcal{I}_{t-1}^{2}$ with some probability, the two paths will have new chances to diverge at times $t-2$ , $t-3$ and so on. Overall, this makes $\operatorname{Corr}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{2}}))$ quite small (Lemma E.3, Supplement E.3).

We formalise these arguments in the following theorem, whose proof (Supplement E.3) follows them very closely. The price for proof intuitiveness is that the theorem is specific to the bootstrap filter, although numerical evidence (Section 5) suggests that other filters are stable as well.

Assumption 2.

The transition densities $m_{t}$ are upper and lower bounded:

\bar{M}_{\ell}\leq m_{t}(x_{t-1},x_{t})\leq\bar{M}_{h}

for constants $0<\bar{M}_{\ell}<\bar{M}_{h}<\infty$ .

Assumption 3.

The potential functions $G_{t}$ are upper and lower bounded:

\bar{G}_{\ell}\leq G_{t}(x_{t})\leq\bar{G}_{h}

for constants $0<\bar{G}_{\ell}<\bar{G}_{h}<\infty$ .

Remark. Since Assumption 2 implies that the $\mathcal{X}_{t}$ ’s are compact, Assumption 1 automatically implies Assumption 3 as soon as the $G_{t}$ ’s’ are continuous functions.

Theorem 2.2.

We use the notations of Algorithms 1 and 2. Suppose that Assumptions 2 and 3 hold and the random kernels $B_{1:T}^{N}$ satisfy the conditions of Theorem 2.1. If, in addition, for the pair of random variables $(J_{t}^{n,1},J_{t}^{n,2})$ whose distribution given $X_{t-1}^{1:N}$ , $X_{t}^{n}$ and $\hat{B}_{t}^{N}(n,\cdot)$ is defined by $B_{t}^{N}(n,\cdot)\otimes B_{t}^{N}(n,\cdot)$ , we have

(13)

\mathbb{P}\left(\left.{J_{t}^{n,1}\neq J_{t}^{n,2}}\right|{X_{t-1}^{1:N},X_{t}^{n}}\right)\geq\varepsilon_{\mathrm{S}}

for some $\varepsilon_{\mathrm{S}}>0$ and all $t$ , $n$ ; then there exists a constant $C$ not depending on $T$ such that:

•

fixed marginal smoothing is stable, i.e. for $s\in\left\{0,\ldots,T\right\}$ and a real-valued function $\phi_{s}:\mathcal{X}_{s}\to\mathbb{R}$ of the hidden state $X_{s}$ , we have

(14)

\mathbb{E}\left[\left(\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{s})\phi_{s}(x_{s})-\mathbb{E}\left[\left.{\phi_{s}(X_{s})}\right|{Y_{0:T}}\right]\right)^{2}\right]\leq\frac{C\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}{N};

•

additive smoothing is stable, i.e. for $T\geq 2$ and the function $\varphi_{T}$ defined in (9), we have

(15)

\mathbb{E}\left[\left(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right]\leq\frac{C\sum_{t=0}^{T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}}{N}\left(1+\sqrt{\frac{T}{N}}\right)^{2}.

In particular, when $B_{t}^{N}$ is the PaRIS kernel with $\tilde{N}\geq 2$ , Theorem 2.2 implies a novel non-asymptotic bound for the PaRIS algorithm. Olsson and Westerborn, (2017) first established a central limit theorem as $N\to\infty$ and $T$ fixed, then showed that the asymptotic variance is controlled as $T\to\infty$ . In contrast, we follow an original approach (whose intuition is explained at the beginning of this subsection) in order to derive a finite sample size bound.

The main technical difficulty is to prove the fast mixing of the Markov kernel product $B_{t}^{N}B_{t-1}^{N}\ldots B_{t^{\prime}}^{N}$ in terms of $t-t^{\prime}$ . For the original FFBS kernel, the stability proof by Douc et al., (2011) relies on the uniform Doeblin property of each of the term $B_{s}^{N,\mathrm{FFBS}}$ (page 2136, towards the end of their proof of Lemma 10) and from there, deduces the exponentially fast mixing of the product. When $B_{s}^{N,\mathrm{FFBS}}$ is approximated by a sparse matrix $B_{s}^{N}$ (which is the case for PaRIS, but also for certain MCMC-based and coupling-based smoothers that we shall see later), the aforementioned property no longer holds for each individual term $B_{s}^{N}$ . Interestingly however, the good mixing of $B_{t}^{N,\mathrm{FFBS}}\ldots B_{t^{\prime}}^{N,\mathrm{FFBS}}$ is still conserved in the product $B_{t}^{N}\ldots B_{t^{\prime}}^{N}$ . In Lemma E.3, we show that two trajectories generated via the latter kernel have such a small correlation that they are virtually indistinguishable from two independent trajectories generated via the former one.

Theorem 2.2 is stated under strong assumptions (similar to those used in Chopin and Papaspiliopoulos, 2020, Chapter 11.4, and slightly stronger than Douc et al., 2011, Assumption 4). On the other hand, it applies to a large class of backward kernels (rather than only FFBS), including the new ones introduced in the forthcoming sections.

In the proof of this theorem, we proceed in two steps: first, we apply existing bounds (Dubarry and Le Corff,, 2013, Theorem 3.1 and Del Moral,, 2013, Chapter 17) for the error between the $B_{t}^{N,\mathrm{FFBS}}$ -induced distribution and the true target; and second, we use our own techniques to control the error when $B_{t}^{N,\mathrm{FFBS}}$ is replaced by any other kernel $B_{t}^{n}$ satisfying (13). The $(1+\sqrt{T/N})^{2}$ term in (15) comes from the first part and we do not know whether it can be dropped. However, it does not affect the scaling of the algorithm. Indeed, with or without it, the inequality implies that in order to have a constant error in the additive smoothing problem, one only has to take $N=\mathcal{O}(T)$ (instead of $N=\mathcal{O}(T^{2})$ without backward sampling). Moreover, from an asymptotic point of view, we always have $\sigma^{2}(T)=\mathcal{O}(T)$ regardless of the presence of the $(1+\sqrt{T/N})^{2}$ term, where $\sigma^{2}(T):=\lim_{N\to\infty}N\mathbb{E}\left[\left(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right]$ .

3. Sampling from the FFBS Backward Kernels

Sampling from the FFBS backward kernel lies at the heart of both the FFBS algorithm (Example 1) and the PaRIS one (Section 2.5). Indeed, at time $t$ , they require generating random variables distributed according to $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ for $i_{t}$ running from $1$ to $N$ . Since sampling from a discrete measure on $N$ elements requires $\mathcal{O}(N)$ operations (e.g. via CDF inversion), the total computational cost becomes $\mathcal{O}(N^{2})$ . To reduce this, we start by considering the subclass of models satisfying the following assumption, which is much weaker than Assumption 2.

Assumption 4.

The transition density $m_{t}(x_{t-1},x_{t})$ is strictly positive and upper bounded, i.e. there exists $\bar{M}_{h}>0$ such that $0<m_{t}(x_{t-1},x_{t})\leq\bar{M}_{h},\forall\ (x_{t-1},x_{t})$ .

The motivation for the first condition $0<m_{t}(x_{t-1},x_{t})$ will be clear after Assumption 5 is defined. For now, we see that it is possible to sample from $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ using rejection sampling via the proposal distribution $\mathcal{M}(W_{t-1}^{1:N})$ . After an $\mathcal{O}(N)$ -cost initialisation, new draws can be simulated from the proposal in amortised $\mathcal{O}(1)$ time; see Chopin and Papaspiliopoulos, (2020, Python Corner, Chapter 9), see also Douc et al., (2011, Appendix B.1) for an alternative algorithm with an $\mathcal{O}(\log N)$ cost per draw. The resulting procedure is summarised in Algorithm 4. Compared to traditional FFBS or PaRIS implementations, these rejection–based variants have a random execution time that is more difficult to analyse. Under Assumption 2, Douc et al., (2011) and Olsson and Westerborn, (2017) derive an $\mathcal{O}(N\bar{M}_{h}/\bar{M}_{\ell})$ expected complexity. However, the general picture, where the state space is not compact and only Assumption 4 holds, is less clear.

Input: Particles

X_{t-1}^{1:N}

and weights

W_{t-1}^{1:N}

at time

t-1

; particle

X_{t}^{i_{t}}

at time

t

; constant

\bar{M}_{h}

; pre-initialised

\mathcal{O}(1)

sampler for

\mathcal{M}(W_{t-1}^{1:N})

repeat

\mathcal{I}_{t-1}\sim\mathcal{M}(W_{t-1}^{1:N})

using the pre-initialised

\mathcal{O}(1)

sampler

U\sim\operatorname{Unif}[0,1]

until $U\leq m_{t}(X_{t-1}^{\mathcal{I}_{t-1}},X_{t}^{i_{t}})/\bar{M}_{h}$

Output:

\mathcal{I}_{t-1}

, which is distributed according to

B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})

Algorithm 4 Pure rejection sampler for simulating from

B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})

The present subsection intends to fill this gap. Our main focus is the PaRIS algorithm of which the presentation is simpler. Results for the FFBS algorithm can be found in Supplement B. We restrict ourselves to the case where $\mathcal{X}_{t}=\mathbb{R}^{d_{t}}$ , although extensions to other non compact state spaces are possible. Only the bootstrap particle filter is considered, and results from this section do not extend trivially to other filtering algorithms. In Section 5, we shall employ different types of particle filters and see that the performance could change from one type to another, which is an additional weak point of rejection-based algorithms.

Assumption 5.

The hidden state $X_{t}$ is defined on the space $\mathcal{X}_{t}=\mathbb{R}^{d_{t}}$ . The measure $\lambda_{t}(\mathrm{d}x_{t})$ with respect to which the transition density $m_{t}(x_{t-1},x_{t})$ is defined (cf. (2)) is the Lebesgue measure on $\mathbb{R}^{d_{t}}$ .

This assumption together with the condition $m_{t}(x_{t-1},x_{t})>0$ of Assumption 4 ensures that the state space model is “truly non-compact”. Indeed, if $m_{t}(x_{t-1},x_{t})$ is zero whenever $x_{t-1}\notin\mathcal{C}_{t-1}$ or $x_{t}\notin\mathcal{C}_{t}$ , where $\mathcal{C}_{t-1}$ and $\mathcal{C}_{t}$ are respectively two compact subsets of $\mathbb{R}^{d_{t-1}}$ and $\mathbb{R}^{d_{t}}$ , then we are basically reduced to a state space model where $\mathcal{X}_{t-1}=\mathcal{C}_{t-1}$ and $\mathcal{X}_{t}=\mathcal{C}_{t}$ .

3.1. Complexity of PaRIS algorithm with pure rejection sampling

We consider the PaRIS algorithm (i.e. Algorithm 3 using the $B_{t}^{N,\mathrm{PaRIS}}$ kernels). Algorithm 5 provides a concrete description of the resulting procedure, using the bootstrap particle filter. At each time $t$ , let $\tau_{t}^{n,\mathrm{PaRIS}}$ be the number of rejection trials required to sample from $B_{t}^{N,\mathrm{FFBS}}(n,\mathrm{d}m)$ . We then have

(16)

\tau_{t}^{n,\mathrm{PaRIS}}\textrm{ }|\textrm{ }\mathcal{F}_{t-1},X_{t}^{n}\sim\operatorname{Geo}\left(\frac{\sum_{i}W_{t-1}^{i}m_{t}(X_{t-1}^{i},X_{t}^{n})}{\bar{M}_{h}}\right)

with $\bar{M}_{h}$ defined in Assumption 4.

Input: Particles

X_{t-1}^{1:N}

; weights

W_{t-1}^{1:N}

; vector

S_{t-1}^{N}

\mathbb{R}^{N}

; pre-initialised sampler for

\mathcal{M}(W_{t-1}^{1:N})

; function

\psi_{t}

(cf. (9)); user-specified parameter

\tilde{N}

for $n\leftarrow 1$ to $N$ do

A_{t}^{n}\sim\mathcal{M}(W_{t-1}^{1:N})

(\star)

X_{t}^{n}\sim M_{t}(X_{t-1}^{A_{t}^{n}},\mathrm{d}x_{t})

Simulate

J_{t}^{n,1:\tilde{N}}\overset{\textrm{i.i.d.}}{\sim}B_{t}^{N,\mathrm{FFBS}}(n,\mathrm{d}n^{\prime})

using either the pure rejection sampler (Algorithm 4) or the hybrid rejection sampler (Algorithm 6)

S_{t}^{N}[n]\leftarrow\tilde{N}^{-1}\sum_{\tilde{n}=1}^{\tilde{N}}\left\{S_{t-1}^{N}[J_{t}^{n,\tilde{n}}]+\psi_{t}(X_{t-1}^{J_{t}^{n,\tilde{n}}},X_{t}^{n})\right\}

for $n\leftarrow 1$ to $N$ do

W_{t}^{n}\leftarrow G_{t}(X_{t}^{n})/\sum_{i}G_{t}(X_{t}^{i})

\mu_{t}^{N}\leftarrow\sum_{n=1}^{N}W_{t}^{n}S_{t}^{N}(n)

Initialise a sampler for

\mathcal{M}(W_{t}^{1:N})

Output: Estimate

\mu_{t}^{N}

\mathbb{E}\left[\left.{\varphi(X_{0:t})}\right|{Y_{0:t}}\right]

; particles

X_{t}^{1:N}

; weights

W_{t}^{1:N}

; vector

S_{t}^{N}

\mathbb{R}^{N}

and pre-initialised sampler

\mathcal{M}(W_{t}^{1:N})

for the next iteration

Algorithm 5 Concrete implementation of PaRIS algorithm (i.e. Algorithm 3 with the

B_{t}^{N,\mathrm{PaRIS}}

backward kernel) using the bootstrap particle filter

By exchangeability of particles, the expected cost of the PaRIS algorithm at step $t$ is proportional to $N\tilde{N}\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]$ , where $\tilde{N}$ is a fixed user-chosen parameter. Occasionally, $X_{t}^{1}$ falls into an unlikely region of $\mathbb{R}^{d}$ and the acceptance rate becomes low. In other words, $\tau_{t}^{1,\mathrm{PaRIS}}$ is a mixture of geometric distribution, some components of which might have a large expectation. Unfortunately, these inefficiencies add up and produce an unbounded execution time in expectation, as shown in the following proposition.

Proposition 2.

Under Assumptions 4 and 5, the version of Algorithm 5 using the pure rejection sampler satisfies $\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]=\infty$ , where $\tau_{t}^{1,\mathrm{PaRIS}}$ is defined in (16).

Proof.

We have

	$\displaystyle\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]$	$\displaystyle=\bar{M}_{h}\mathbb{E}\left[\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},X_{t}^{1})W_{t-1}^{n}}\right]\quad\textrm{via \eqref{eq:dist_tau_N}}$
		$\displaystyle=\bar{M}_{h}\mathbb{E}\left[\mathbb{E}\left[\left.{\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},X_{t}^{1})W_{t-1}^{n}}}\right\|{\mathcal{F}_{t-1}}\right]\right]$
		$\displaystyle=\bar{M}_{h}\mathbb{E}\left[\int_{\mathcal{X}_{t}}\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},x)W_{t-1}^{n}}\left(\sum m_{t}(X_{t-1}^{n},x)W_{t-1}^{n}\right)\lambda_{t}(\mathrm{d}x)\right]$
		$\displaystyle=\bar{M}_{h}\mathbb{E}\left[\int_{\mathcal{X}_{t}}1\times\lambda_{t}(\mathrm{d}x)\right]=\infty\quad\textrm{by Assumption~{}\ref{asp:space}}.$

∎

In highly parallel computing architectures, each processor only handles one or a small number of particles. As such, the heavy-tailed nature of the execution time means that a few machines might prevent the whole system from moving forward. In all computing architectures, an execution time without expectation is essentially unpredictable. A common practice to estimate execution time is to run a certain algorithm with a small number $N$ of particles, then “extrapolate” to the $N_{\mathrm{final}}$ of the definitive run. However, as $\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]$ is infinite for any $N$ , it is unclear what kind of information we might get from preliminary runs. In Supplement B, besides studying the execution time of rejection-based implementations of the FFBS algorithm, we will delve deeper into the difference between the non-parallel and parallel settings.

From the proof of Proposition 2, it is clear that the quantity $\sum_{n}W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})$ will play a key role in the upcoming developments. We thus define it formally.

Definition 1.

The true predictive density function $r_{t}$ and its approximation $r_{t}^{N}$ are defined as

	$\displaystyle r_{t}(x_{t})$	$\displaystyle:=\frac{(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})}{\lambda_{t}(\mathrm{d}x_{t})}$
	$\displaystyle r_{t}^{N}(x_{t})$	$\displaystyle:=\sum W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})$

where the first equation is understood in the sense of the Radon-Nikodym derivative and the density $m_{t-1}(x_{t-1},x_{t})$ is defined with respect to the dominating measure $\lambda_{t}(\mathrm{d}x_{t})$ on $\mathcal{X}_{t}$ (cf. (2)).

3.2. Hybrid rejection sampling

To solve the aforementioned issues of the pure rejection sampling procedure, we propose a hybrid rejection sampling scheme. The basic observation is that, for a single $m$ , direct simulation (e.g. via CDF inversion) of $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ costs $\mathcal{O}(N)$ . Thus, once $K=\mathcal{O}(N)$ rejection sampling trials have been attempted, one should instead switch to a direct simulation method. In other words, it does not make sense (at least asymptotically) to switch to direct sampling after $K$ trials if $K\ll\mathcal{O}(N)$ or $K\gg\mathcal{O}(N)$ . The validity of this method is established in the following proposition, where we actually allow $K$ to depend on trials drawn so far. The proof, which is not an immediate consequence of the validity of ordinary rejection sampling, is given in Supplement E.4.

Proposition 3.

Let $\mu_{0}(x)$ and $\mu_{1}(x)$ be two probability densities defined on some measurable space $\mathcal{X}$ with respect to a dominating measure $\lambda(\mathrm{d}x)$ . Suppose that there exists $C>0$ such that $\mu_{1}(x)\leq C\mu_{0}(x)$ . Let $(X_{1},U_{1}),(X_{2},U_{2}),\ldots$ be a sequence of i.i.d. random variables distributed according to $\mu_{0}\otimes\operatorname{Unif}[0,1]$ and let $X^{*}\sim\mu_{1}$ be independent of that sequence. Put

K^{*}:=\inf\left\{n\in\mathbb{Z}_{\geq 1}\textrm{ such that }U_{n}\leq\frac{\mu_{1}(X_{n})}{C\mu_{0}(X_{n})}\right\}

and let $K$ be any stopping time with respect to the natural filtration associated with the sequence $\left\{(X_{n},U_{n})\right\}_{n=1}^{\infty}$ . Let $Z$ be defined as $X_{K^{*}}$ if $K^{*}\leq K$ and $X^{*}$ otherwise. Then $Z$ is $\mu_{1}$ -distributed.

Proposition 3 thus allows users to pick $K=\alpha N$ , where $\alpha>0$ might be chosen somehow adaptively from earlier trials. In the following, we only consider the simple rule $K=N$ , which does not induce any loss of generality in terms of the asymptotic behaviour and is easy to implement. The resulting iteration is described in Algorithm 6.

Input: Particles

X_{t-1}^{1:N}

and weights

W_{t-1}^{1:N}

at time

t-1

; particle

X_{t}^{i_{t}}

at time

t

; constant

\bar{M}_{h}

; pre-initialised

\mathcal{O}(1)

sampler for

\mathcal{M}(W_{t-1}^{1:N})

accepted\leftarrow\operatorname{False}

for $i\leftarrow 1$ to $N$ do

\mathcal{I}_{t-1}\sim\mathcal{M}(W_{t-1}^{1:N})

using the pre-initialised

\mathcal{O}(1)

sampler

U\sim\operatorname{Unif}[0,1]

if $U\leq m_{t}(X_{t-1}^{\mathcal{I}_{t-1}},X_{t}^{i_{t}})/\bar{M}_{h}$ then

accepted\leftarrow\operatorname{True}

break

if not $accepted$ then

\mathcal{I}_{t-1}\sim\mathcal{M}(W_{t-1}^{n}m(X_{t-1}^{n},X_{t}^{i_{t}}))

Output:

\mathcal{I}_{t-1}

, which is distributed according to

B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})

Algorithm 6 Hybrid rejection sampler for simulating from

B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})

When applied in the context of Algorithm 5, Algorithm 6 gives a smoother of expected complexity proportional to

N\tilde{N}\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]

at time $t$ , where $\tau_{t}^{1,\mathrm{PaRIS}}$ is defined in (16)). This quantity is no longer infinite, but its growth when $N\to\infty$ might depend on the model. Still, in all cases, it remains strictly larger than $\mathcal{O}(N)$ and strictly smaller than $\mathcal{O}(N^{2})$ . Perhaps more surprisingly, in linear Gaussian models (see Supplement A.1 for detailed notations), the smoother is of near-linear complexity (up to log factors). The following two theorems formalise these claims.

Assumption 6.

The predictive density $r_{t}$ of $X_{t}$ given $Y_{0:t-1}$ and the potential function $G_{t}$ are continuous functions on $\mathbb{R}^{d_{t}}$ . The transition density $m_{t}(x_{t-1},x_{t})$ is a continuous function on $\mathbb{R}^{d_{t-1}}\times\mathbb{R}^{d_{t}}$ .

Theorem 3.1.

Under Assumptions 1, 4, 5 and 6, the version of Algorithm 5 using the hybrid rejection sampler (Algorithm 6) satisfies $\lim_{N\to\infty}\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]=\infty$ and $\lim_{N\to\infty}{\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]}/{N}=0$ , where $\tau_{t}^{1,\mathrm{PaRIS}}$ is defined in (16).

Theorem 3.2.

We assume the same setting as Theorem 3.1. In linear Gaussian state space models (Supplement A.1), we have $\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]=\mathcal{O}((\log N)^{d_{t}/2})$ .

While Proposition 2 shows that $\tau_{t}^{1,\mathrm{PaRIS}}$ has infinite expectation, Theorem 3.2 implies that its $N$ -thresholded version only displays a slowly increasing mean. To give a very rough intuition on the phenomenon, consider $X\sim\mathcal{N}(0,1)$ . Then

\mathbb{E}\left[e^{X^{2}/2}\right]=\int_{\mathbb{R}}e^{x^{2}/2}\frac{e^{-x^{2}/2}}{\sqrt{2\pi}}=+\infty

whereas

(17)

\begin{split}\mathbb{E}\left[\min(e^{X^{2}/2},N)\right]&=\int_{\mathbb{R}}\min(e^{x^{2}/2},N)\frac{1}{\sqrt{2\pi}}e^{-{x^{2}/2}}\mathrm{d}x\\ &=\int_{\left|x\right|\leq\sqrt{2\log N}}\frac{1}{\sqrt{2\pi}}\mathrm{d}x+N\int_{\left|x\right|>\sqrt{2\log N}}\frac{1}{\sqrt{2\pi}}e^{-{x^{2}/2}}\mathrm{d}x\\ &\leq\sqrt{\frac{4\log N}{\pi}}+\frac{1}{\sqrt{\pi\log N}}\end{split}

using the bound $\mathbb{P}(X>x)\leq\frac{e^{-x^{2}/2}}{x\sqrt{2\pi}}$ for $x>0$ . The main technical difficulty of the proof of Theorem 3.2 (see Supplement E.6) is to perform this kind of argument under the error induced by the finite sample size particle approximation. In the language of this oversimplified example, we want (17) to hold when $X$ does not follow $\mathcal{N}(0,1)$ any more, but only an $N$ -dependent approximation of it.

4. Efficient backward kernels

4.1. MCMC Backward Kernels

This subsection analyses and extends the MCMC backward kernel defined in Example 3. As we remarked there, the matrix $\hat{B}_{t}^{N,\mathrm{IMH}}$ is not sparse and even has some expensive-to-evaluate entries. We thus reserve it for use in the off-line smoother (Algorithm 2) whereas in the on-line scenario (Algorithm 3), we use its PaRIS-like counterpart

(18)

\hat{B}_{t}^{N,\mathrm{IMHP}}[i_{t},i_{t-1}]:=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\mathbbm{1}\left\{i_{t-1}=\tilde{J}_{t}^{i_{t},\tilde{n}}\right\}

where $\tilde{J}_{t}^{i_{t},1:\tilde{N}}$ is an independent Metropolis-Hastings chain started at $J_{t}^{i_{t},1}:=A_{t}^{i_{t}}$ , targeting the measure $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ and using the proposal distribution $\mathcal{M}(W_{t-1}^{1:N})$ . Thus, the parameter $\tilde{N}$ signifies that $\tilde{N}-1$ MCMC steps are applied to $A_{t}^{i_{t}}$ , and we shall use the same convention for the kernel $B_{t}^{N,\operatorname{IMH}}$ . In both cases, the complexity of the corresponding algorithms are $\mathcal{O}((\tilde{N}-1)N)$ which is equivalent to $\mathcal{O}(N)$ as long as $\tilde{N}$ remains fixed when $N\to\infty$ .

The validity and the stability of $\hat{B}_{t}^{N,\mathrm{IMH}}$ and $\hat{B}_{t}^{N,\mathrm{IMHP}}$ are established in the following proposition (proved in Supplement E.9). For simplicity, only the case $\tilde{N}=2$ is examined, but as a matter of fact, the proposition remains true for $\tilde{N}\geq 2$ .

Proposition 4.

The kernels $B_{t}^{N,\mathrm{IMH}}$ and $B_{t}^{N,\mathrm{IMHP}}$ with $\tilde{N}=2$ satisfy the hypotheses of Theorem 2.1 and, under Assumptions 2 and 3, those of Theorem 2.2. Hence, their respective uses in Algorithms 2 and 3 guarantee a convergent and stable smoother.

From a theoretical viewpoint, Proposition 4 is the first result establishing the stability for the use of MCMC moves inside backward sampling. It relies on technical innovations that we have explained in Section 2.6, in particular after the statement of Theorem 2.2.

From a practical viewpoint, the advantages of independent Metropolis-Hastings MCMC kernels compared to the rejection samplers of Section 3 are the dispensability of specifying an explicit $\bar{M}_{h}$ and the deterministic $\mathcal{O}(N)$ nature of the execution time. In practice, we observe that the MCMC smoothers are usually 10-20 times faster than the rejection sampling–based counterparts (see e.g. Figure 4) while producing essentially the same sample quality. Finally, it is not hard to imagine situations where some proposal smarter than $\mathcal{M}(W_{t-1}^{1:N})$ would be beneficial. However, we only consider that one here, mainly because it already performs satisfactorily in our numerical examples.

4.2. Dealing with intractable transition densities

4.2.1. Intuition and formulation

The purpose of backward sampling is to re-generate, for each particle, a new ancestor that is different from that of the filtering step. However, backward sampling is infeasible if the transition density $m_{t}(x_{t-1},x_{t})$ cannot be calculated. To get around this, we modify the particle filter so that each particle might, in some sense, have two ancestors right from the forward pass.

Consider the standard PF (Algorithm 1). Among the $N$ resampled particles $X_{t-1}^{A_{t}^{1:N}}$ , let us track two of them, say $x_{t-1}$ and $x_{t-1}^{\prime}$ for simplicity. The move step of Algorithm 1 will push them through $M_{t}$ using independent noises, resulting in $x_{t}$ and $x_{t}^{\prime}$ (that is, given $x_{t-1}$ and $x_{t-1}^{\prime}$ , we have $x_{t}\sim M_{t}(x_{t-1},\cdot)$ and $x_{t}^{\prime}\sim M_{t}(x_{t-1}^{\prime},\cdot)$ such that $x_{t}$ and $x_{t}^{\prime}$ are independent). Thus, for e.g. linear Gaussian models, we have $\mathbb{P}(x_{t}=x_{t}^{\prime})=0$ . However, if the two simulations $x_{t}\sim M_{t}(x_{t-1},\cdot)$ and $x_{t}^{\prime}\sim M_{t}(x_{t-1}^{\prime},\cdot)$ are done with specifically correlated noises, it can happen that $\mathbb{P}(x_{t}=x_{t}^{\prime})>0$ . The joint distribution $(x_{t},x_{t}^{\prime})$ given $(x_{t-1},x_{t-1}^{\prime})$ is called a coupling of $M_{t}(x_{t-1},\cdot)$ and $M_{t}(x_{t-1}^{\prime},\cdot)$ ; the event $x_{t}=x_{t}^{\prime}$ is called the meeting event and we say that the coupling is successful when it occurs. In that case, the particle $x_{t}$ automatically has two ancestors $x_{t-1}$ and $x_{t-1}^{\prime}$ at time $t-1$ without needing any backward sampling.

The precise formulation of the modified forward pass is detailed in Algorithm 7. It consists of building in an on-line manner the backward kernels $B_{t}^{N,\mathrm{ITR}}$ (where ITR stands for “intractable”). The main interest of this algorithm lies in the fact that while the function $m_{t}$ may prove impossible to evaluate, it is usually possible to make $x_{t}$ and $x_{t}^{\prime}$ meet by correlating somehow the random numbers used in their simulations. One typical example which this article focuses on is the coupling of continuous-time processes, but it is useful to keep in mind that Algorithm 7 is conceptually more general than that.

Input: Feynman-Kac model (1), particles

X_{t-1}^{1:N}

and weights

W_{t-1}^{1:N}

that approximate the filtering distribution at time

t-1

for $n\leftarrow 1$ to $N$ do

Resample. Simulate

(A_{t}^{n,1},A_{t}^{n,2})

such that marginally each component is distributed according to

\mathcal{M}(W_{t-1}^{1:N})

Move. Simulate

(X_{t}^{n,1},X_{t}^{n,2})

such that marginally the two components are distributed respectively according to

M_{t}(X_{t-1}^{A_{t}^{n,1}},\mathrm{d}x_{t})

and

M_{t}(X_{t-1}^{A_{t}^{n,2}},\mathrm{d}x_{t})

Choose

L\sim\operatorname{Uniform}(\{1,2\})

Set

X_{t}^{n}\leftarrow X_{t}^{n,L}

Calculate backward kernel.

if $X_{t}^{n,1}=X_{t}^{n,2}$ then

B_{t}^{N,\mathrm{ITR}}(n,\mathrm{d}i_{t-1})\leftarrow\left(\delta\left\{A_{t}^{n,1}\right\}+\delta\left\{A_{t}^{n,2}\right\}\right)/2

else

B_{t}^{N,\mathrm{ITR}}(n,\mathrm{d}i_{t-1})\leftarrow\delta\left\{{A_{t}^{n,L}}\right\}

Reweight. Set

\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n})

for

n=1,2,\ldots,N

Set

\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N

Set

W_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N}

for

n=1,2,\ldots,N

Output: Particles

X_{t}^{1:N}

and weights

W_{t}^{1:N}

that approximate the filtering distribution at time

t

; backward kernel

B_{t}^{N,\mathrm{ITR}}

that can be used in either Algorithm 2 or 3

Algorithm 7 Modified forward pass for smoothing of intractable models (one time step)

4.2.2. Validity and stability

The consistency of Algorithm 7 follows straightforwardly from Theorem 2.1. To produce a stable routine however, some conditions must be imposed on the couplings $(A_{t}^{n,1},A_{t}^{n,2})$ and $(X_{t}^{n,1},X_{t}^{n,2})$ . We want $A_{t}^{n,1}$ to be different from $A_{t}^{n,2}$ as frequently as possible. On the contrary, we aim for a coupling of the two distributions $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(X_{t-1}^{A_{t}^{n,2}},\cdot)$ with high success rate so as to maximise the probability that $X_{t}^{n,1}=X_{t}^{n,2}$ .

Assumption 7.

There exists an $\varepsilon_{A}>0$ such that $\mathbb{P}(A_{t}^{n,1}\neq A_{t}^{n,2}|X_{t-1}^{1:N})\geq\varepsilon_{A}.$

Assumption 8.

There exists an $\varepsilon_{D}>0$ such that

\mathbb{P}(X_{t}^{n,2}=X_{t}^{n,1}|X_{t-1}^{1:N},A_{t}^{n,1},A_{t}^{n,2},X_{t}^{n,1})\geq\varepsilon_{D}\left(1\land\frac{m_{t}(\rm,X_{t}^{n,1})}{m_{t}(X_{t-1}^{A_{t}^{n,1}},X_{t}^{n,1})}\right).

The letters A and D in $\varepsilon_{A}$ and $\varepsilon_{D}$ stand for “ancestors” and “dynamics”. Assumption 8 means that the user-chosen coupling of $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(\rm,\cdot)$ must be at least as $\varepsilon_{D}$ times as efficient as their maximal couplings. For details on this interpretation, see Proposition 10 in the Supplement. In Lemma E.12, we also show that in spite of its appearance, Assumption 8 is actually symmetric with regards to $X_{t}^{n,1}$ and $X_{t}^{n,2}$ .

We are now ready to state the main theorem of this subsection (see Supplement E.11 for a proof).

Theorem 4.1.

The kernels $B_{t}^{N,\mathrm{ITR}}$ generated by Algorithm 7 satisfy the hypotheses of Theorem 2.1. Thus, under Assumption 1, Algorithm 7 provides a consistent smoothing estimate. If, in addition, the Feynman-Kac model (1) satisfies Assumptions 2 and 3 and the user-chosen couplings satisfy Assumptions 7 and 8, the kernels $B_{t}^{N,\mathrm{ITR}}$ also fulfil (13) and the smoothing estimates generated by Algorithm 7 are stable.

4.2.3. Good ancestor couplings

It is notable that Assumption 7 only considers the event $A_{t}^{n,1}\neq A_{t}^{n,2}$ , which is a pure index condition that does not take into account the underlying particles $X_{t-1}^{A_{t}^{n,1}}$ and $X_{t-1}^{A_{t}^{n,2}}$ . Indeed, if smoothing algorithms prevent degeneracy by creating multiple ancestors for a particle, we would expect that their separation (i.e. that they are far away in the state space $\mathcal{X}_{t-1}$ , e.g. $\mathbb{R}^{d}$ ) is critical to the performance. Surprisingly, it is unnecessary: two very close particles (in $\mathbb{R}^{d}$ ) at time $t-1$ may have ancestors far away at time $t-2$ thanks to the mixing of the model.

We advise choosing an ancestor coupling $(A_{t}^{n,1},A_{t}^{n,2})$ such that the distance between $X_{t-1}^{A_{t}^{n,1}}$ and $X_{t-1}^{A_{t}^{n,2}}$ is small. It will then be easier to design a dynamic coupling of $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(X_{t-1}^{A_{t}^{n,2}},\cdot)$ with a high success rate. Furthermore, simulating the dynamic coupling with two close rather than far away starting points can also take less time when, for instance, the dynamic involves multiple intermediate steps, but the two processes couple early. One way to achieve an ancestor coupling with the aforementioned property is to first simulate $A_{t}^{n,1}\sim\mathcal{M}(W_{t-1}^{1:N})$ , then move $A_{t}^{n,1}$ through an MCMC algorithm which keeps invariant $\mathcal{M}(W_{t-1}^{1:N})$ and set the result to $A_{t}^{n,2}$ . It suffices to use a proposal looking at indices whose underlying particles are close (in $\mathbb{R}^{d}$ ) to $X_{t-1}^{A_{t}^{n,1}}$ . Finding nearby particles are efficient if they are first sorted using the Hilbert curve, hashed using locality-sensitive hashing or put in a KD-tree (see Samet,, 2006, for a comprehensive review). In the context of particle filters, such techniques have been studied for different purposes in Gerber and Chopin, (2015), Jacob et al., (2019) and Sen et al., (2018).

4.2.4. Conditionally-correlated version

In Algorithm 7, the ancestor pairs $(A_{t}^{n,1},A_{t}^{n,2})_{n=1}^{N}$ are conditionally independent given $\mathcal{F}_{t}^{-}$ and the same holds for the particles $(X_{t}^{n})_{n=1}^{N}$ . These conditional independences allow easier theoretical analysis, in particular, the casting of Algorithm 7 in the framework of Theorems 2.1 and 2.2. However, they are not optimal for performance in two important ways: (a) they do not allow keeping both $X_{t}^{n,1}$ and $X_{t}^{n,2}$ when the two are not equal, and (b) the set of ancestor variables $(A_{t}^{n,1})_{n=1}^{N}$ is multinomially resampled from $\left\{1,2,\ldots,N\right\}$ with weights $W_{t-1}^{1:N}$ . We know that multinomial resampling is not the ideal scheme, see Supplement C.1 for discussion.

Consequently, in practice, we shall allow ourselves to break free from conditional independence. The resulting procedure is described in Algorithm 9 (Supplement C). Despite a lack of rigorous theoretical support, this is the algorithm that we will use in Section 5 since it enjoys better performance and it constitutes a fair comparison with practical implementations of the standard particle filter, which are mostly based on alternative resampling schemes.

5. Numerical experiments

5.1. Linear Gaussian state-space models

Linear Gaussian models constitute a particular class of state space models. They are characterised by Markov dynamics that are Gaussian and observations that are projection of hidden states plus some Gaussian noises. Supplement A.1 defines, for different components of these models, the notations that we shall use here. In this section, we consider an instance described in Guarniero et al., (2017), where the matrix $F_{X}$ satisfies $F_{X}[i,j]=\alpha^{1+|i-j|}$ for some $\alpha$ . We consider the problem with $\operatorname{dim}_{X}=\operatorname{dim}_{Y}=2$ and the observations are noisy versions of the hidden states with $C_{Y}$ being $\sigma_{Y}^{2}$ times the identity matrix of size $2$ . Unless otherwise specified, we take $\alpha=0.4$ and $\sigma_{Y}^{2}=0.5$ .

In this section, we focus on the performance of different online smoothers based on either genealogy tracking, pure/hybrid rejection sampling or MCMC. Rejection-based online smoothing amounts to the PaRIS algorithm, for which we use $\tilde{N}=2$ for the $B_{t}^{N,\mathrm{PaRIS}}$ kernel. We take $T=3000$ and simulate the data from the model. The benchmark additive function is simply $\varphi_{t}(x_{0:t})=\sum_{s=0}^{t}x_{s}(0)$ where $x_{s}(0)$ is the first coordinate of the $\mathbb{R}^{2}$ vector $x_{s}=[x_{s}(0),x_{s}(1)]$ . For a study of offline smoothers (including FFBS), see Supplement D.1. In all programs here and there, we choose $N=1000$ and use systematic resampling for the forward particle filters (see section C.1). Regarding MCMC smoothers, we employ the kernels $B_{t}^{N,\mathrm{IMH}}$ or $B_{t}^{N,\mathrm{IMHP}}$ consisting of only one MCMC step. All results are based on $150$ independent runs.

Although our theoretical results are only proved for the bootstrap filter, we stress throughout that some of them extend to other filters as well. Therefore, we will also consider guided particle filters in the simulations. An introduction to this topic can be found in Chopin and Papaspiliopoulos, (2020, Chapter 10.3.2), where the expression for the optimal proposal is also provided. In linear Gaussian models, this proposal is fully tractable and is the one we use.

To present efficiently the combination of two different filters (bootstrap and guided) and four different algorithms (naive genealogy tracking, pure/hybrid rejection and MCMC) we use the following abbreviations: “B” for bootstrap, “G” for guided, “N” for naive genealogy tracking, “P” for pure rejection, “H” for hybrid rejection and “M” for MCMC. For instance, the algorithm referred to as “BM” uses the bootstrap filter for the forward pass and the MCMC backward kernels to perform smoothing. Furthermore, the letter “R” will refer to the rejection kernel whenever the distinction between pure rejection and hybrid rejection is not necessary. (Recall that the two rejection methods produce estimators with the same distribution.)

Figure 2 shows the squared interquartile range for the online smoothing estimates $\mathbb{Q}_{t}(\varphi_{t})$ with respect to $t$ . It verifies the rates of Theorem 2.2, although linear Gaussian models are not strongly mixing in the sense of Assumptions 2 and 3: the grid lines hint at a variance growth rate of $\mathcal{O}(T)$ for the MCMC and reject-based smoothers and of $\mathcal{O}(T^{2})$ for the genealogy tracking ones. Unsurprisingly guided filters have better performance than bootstrap.

Refer to caption — Figure 2. Squared interquartile range of the estimators $\mathbb{Q}_{t}(\varphi_{t})$ with respect to $t$ , for different online smoothing algorithms. The model is linear Gaussian with parameters specified in section 5.1. See text for full explanation of the legend. For readability, the curves are down-sampled to $50$ points before being drawn.

Figure 3 show box-plots of the execution time (divided by $NT$ ) for different algorithms over $150$ runs. By execution time, we mean the number of Markov kernel transition density evaluations. We see that the bootstrap particle filter coupled with pure rejection sampling has a very heavy-tailed execution time. This behaviour is expected as per Proposition 2. Using the guided particle filter seems to fare better, but Figure 4 (for the same model but with $\sigma_{Y}^{2}=2$ ) makes it clear that this cannot be relied on either. Overall, these results highlight two fundamental problems with pure rejection sampling: the computational time has heavy tails and depends on the type of forward particle filter being used.

On the other hand, hybrid rejection sampling, despite having random execution time in principle, displays a very consistent number of transition density evaluations over different independent runs. Thus it is safe to say that the algorithm has a virtually deterministic execution time. The catch is that the average computational load (which is around $16$ in Figure 3) cannot be easily calculated beforehand. In any case, it is much larger than the value $1$ of MCMC smoothers (since only $1$ MCMC step is performed in the kernel $B_{t}^{N,\mathrm{IMHP}}$ ); whereas the performance (Figure 2) is comparable.

The bottom line is that MCMC smoothers should be the default option, and one MCMC step seems to be enough. If for some reason one would like to use rejection-based methods, using hybrid rejection is a must.

5.2. Lotka-Volterra SDE

Lotka-Voleterra models (originated in Lotka,, 1925 and Volterra,, 1928) describe the population fluctuation of species due to natural birth and death as well as the consumption of one species by others. The emblematic case of two species is also known as the predator-prey model. In this subsection, we study the stochastic differential equation (SDE) version that appears in Hening and Nguyen, (2018). Let $X_{t}=(X_{t}(0),X_{t}(1))$ represent respectively the populations of the prey and the predator at time $t$ and let us consider the dynamics

(19)

\left\{\begin{aligned} \mathrm{d}X_{t}(0)&=\bigl{[}\beta_{0}X_{t}(0)-\frac{1}{2}\tau_{0}[X_{t}(0)]^{2}&&-\tau_{1}X_{t}(0)X_{t}(1)\bigr{]}\mathrm{d}t+X_{t}(0)\mathrm{d}E_{t}(0)\\ \mathrm{d}X_{t}(1)&=\bigl{[}\qquad\quad-\beta_{1}X_{t}(1)&&+\tau_{1}X_{t}(0)X_{t}(1)\bigr{]}\mathrm{d}t+X_{t}(1)\mathrm{d}E_{t}(1)\end{aligned}\right.

where $E_{t}=\Gamma W_{t}$ with $W_{t}$ being the standard Brownian motion in $\mathbb{R}^{2}$ and $\Gamma$ being some $2\times 2$ matrix. The parameters $\beta_{0}$ and $\beta_{1}$ are the natural birth rate of the prey and death rate of the predator. The predator interacts with (eats) the prey at rate $\tau_{1}$ . The quantity $\tau_{0}$ encodes intra-species competition in the prey population. The $\frac{1}{2}$ in its parametrisation is to line up with the Lotka Volterra jump process in $\mathbb{Z}^{2}$ where the population sizes are integers and the interaction term becomes $\tau_{0}X_{t}(0)[X_{t}(0)-1]/2$ .

The state space model is comprised of the process $X_{t}$ and its noisy observations $Y_{t}$ recorded at integer times. The Markov dynamics cannot be simulated exactly, but can be approximated through (Euler) discretisation. Nevertheless, the Euler transition density $m_{t}^{\mathrm{E}}(x_{t-1},x_{t})$ remains intractable (unless the step size is exactly $1$ ). Thus, the algorithms presented in Subsection 4.2 are useful. The missing bit is a method to efficiently couple $m_{t}^{\mathrm{E}}(x_{t-1},\cdot)$ and $m_{t}^{\mathrm{E}}(x_{t-1}^{\prime},\cdot)$ , which we carefully describe in Supplement D.2.1.

We consider the model with $\tau_{0}=1/800$ , $\tau_{1}=1/400$ , $\beta_{0}=0.3125$ and $\beta_{1}=0.25$ . The matrix $\Gamma$ is such that the covariance matrix of $E_{1}$ is $\begin{bmatrix}1/100&1/200\\ 1/200&1/100\end{bmatrix}$ . The observations are recorded on the log scale with Gaussian error of covariance matrix $\begin{bmatrix}0.04&0.02\\ 0.02&0.04\end{bmatrix}$ . The distribution of $X_{0}$ is two-dimensional normal with mean $[100,100]$ and covariance matrix $\begin{bmatrix}100&50\\ 50&100\end{bmatrix}$ . This choice is motivated by the fact that the preceding parameters give the stationary population vector $[100,100]$ . According to Hening and Nguyen, (2018), they also guarantee that neither animal goes extinct almost surely as $t\to\infty$ .

By discretising (19) with time step $\delta=1$ , one can get some very rough intuition on the dynamics. For instance, per second there are about $31$ preys born. Approximately the same number die (to maintain equilibrium), of which $6$ die due to internal competition and $25$ are eaten by the predator. The duration between two recorded observations corresponds more or less to one-third generation of the prey and one-fourth generation of the predator. The standard deviation of the variation due to environmental noise is about $10$ individuals per observation period, for each animal.

Again, these intuitions are highly approximate. For readers wishing to get more familiar with the model, Supplement D.2.2 contains real plots of the states and the observations; as well as data on the performance of different smoothing algorithms for moderate values of $T$ . We now showcase the results obtained in a large scale problem where $T=3000$ and the data is simulated from the model.

We consider the additive function $\varphi_{t}(x_{0:t}):=\sum_{s=0}^{t}\left[x_{s}(0)-100\right]$ . Figure 5 represents using box plots the distributions of the estimators for $\mathbb{Q}_{T}(\varphi_{T})$ using either the genealogy tracking smoother (with systematic resampling; see Supplement C.1) or Algorithm 9. Our proposed smoother greatly reduces the variance, at a computational cost which is empirically $1.5$ to $2$ times greater than the naive method. Since we used Hilbert curve to design good ancestor couplings (see Section 4.2.3), coupling of the dynamics succeeds $80\%$ of the time. As discussed in the aforementioned section, starting two diffusion dynamics from nearby points make them couple earlier, which reduces the computational load afterwards.

Figure 6 plots with respect to $t$ the squared interquartile range of the two methods for the estimation of $\mathbb{Q}_{t}(\varphi_{t})$ . Grid lines hint at a quadratic growth for the genealogy tracking smoother (as analysed in Olsson and Westerborn,, 2017, Sect. 1) and a linear growth for the kernel $B_{t}^{N,\mathrm{ITRC}}$ (as described in Theorem 2.2).

Finally, Figure 20 (Supplement D.2.2) shows properties of the effective sample size (ESS) ratio for this model. In a nutshell, while being globally stable (between $40\%$ and $70\%$ ), it has a tendency to drift towards near $0$ from time to time due to unusual data points. At these moments, resampling kills most of the particles and aggravates the degeneracy problem for the naive smoother. As we have seen in the above figures, systematic resampling is not enough to mitigate this in the long run.

6. Conclusion

6.1. Practical recommendations

Our first recommendation does not concern the smoothing algorithm per se. It is of paramount importance that the particle filter used in in the preliminary filtering step performs reasonably well, since its output defines the support of the approximations generated by the subsequent smoothing algorithm. (Standard recommendations to obtain good performance from a particle filter are to increase $N$ , or to use better proposal distributions, or both.)

When the transition density is tractable, we recommend the MCMC smoother by default (rather than even the standard, $\mathcal{O}(N^{2})$ approach). It has a deterministic, $\mathcal{O}(N)$ complexity, it does not require the transition density to be bounded, and it seems to perform well even with one or two MCMC steps. If one still wants to use the rejection smoother instead, it is safe to say that there is no reason not to use the hybrid method.

Although the assumptions under which we prove the stability of the smoothing estimates are strong, the general message still holds. The Markov kernel and the potential functions must make the model forget its past in some ways. Otherwise, we get an unstable model for which no smoothing methods can work. The rejection sampling – based smoothing algorithms can therefore serve as the ultimate test. Since they simulate exactly independent trajectories given the skeleton, there is no hope to perform better, unless one switches to another family of smoothing algorithms.

For intractable models, the key issue is to design couplings with high meeting probability. Fortunately, the inherent chaos of the model makes it possible to choose two very close starting points for the dynamics and thus easy to obtain a reasonable meeting probability. If further difficulties persist, there is a practical (and very heuristic) recipe to test whether one coupling of $M_{t}(x,\cdot)$ and $M_{t}(x^{\prime},\cdot)$ is close to optimal. It consists in approximating $M_{t}(x,\cdot)$ and $M_{t}(x^{\prime},\cdot)$ by Gaussian distributions and deduce the optimal coupling rate from their total variation distance. There is no closed formula for the total variation distance between two Gaussian distributions in high dimensions. However, it can be reliably estimated using the geometric interpretation of the total variation distance being one minus the area of the intersection created by the corresponding density graphs. In this way, one can get a very rough idea of to what extent a certain coupling realises the meeting potential that the two distributions have. If the coupling seems good and the trajectories still look degenerate, it can very well be that the model is unstable.

6.2. Further directions

The major limitation of our work is the exclusive theoretical analysis under the bootstrap particle filter. Moreover, we require that the $N$ new particles generated at step $t$ are conditionally independent given previous particles at time $t-1$ . This excludes practical optimisations like systematic resampling and Algorithm 9. Finally, the backward sampling step is also used in other algorithms (in particular Particle Markov Chain Monte Carlo, see Andrieu et al.,, 2010) and it would be interesting to see to what extent our techniques can be applied there.

6.3. Data and code

The code used to run numerical experiments is available at https://github.com/hai-dang-dau/backward-samplers-code. Some of the algorithms are already available in an experimental branch of the particles Python package at https://github.com/nchopin/particles.

Acknowledgements

The first author acknowledges a CREST PhD scholarship via AMX funding, and wish to thank the members of his PhD defense jury (Stéphanie Allassonière, Randal Douc, Arnaud Doucet, Anthony Lee, Pierre del Moral, Christian Robert) for helpful comments on the corresponding chapter in his thesis.

We also thank Adrien Corenflos, Samuel Duffield, the associate editor, and the referees for their comments on a preliminary version of the paper.

References

Andrieu et al., (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(3):269–342.
Beskos et al., (2006) Beskos, A., Papaspiliopoulos, O., Roberts, G. O., and Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382. With discussions and a reply by the authors.
Bishop, (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. Information Science and Statistics. Springer, New York.
Bou-Rabee et al., (2020) Bou-Rabee, N., Eberle, A., and Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3):1209 – 1250.
Bunch and Godsill, (2013) Bunch, P. and Godsill, S. (2013). Improved particle approximations to the joint smoothing distribution using Markov chain Monte Carlo. IEEE Transactions on Signal Processing, 61(4):956–963.
Carpenter et al., (1999) Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proc. Radar, Sonar Navigation, 146(1):2–7.
Chopin, (2004) Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist., 32(6):2385–2411.
Chopin and Papaspiliopoulos, (2020) Chopin, N. and Papaspiliopoulos, O. (2020). An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer.
Corenflos and Särkkä, (2022) Corenflos, A. and Särkkä, S. (2022). The Coupled Rejection Sampler. arXiv preprint arXiv:2201.09585.
Del Moral, (2004) Del Moral, P. (2004). Feynman-Kac formulae. Genealogical and interacting particle systems with applications. Probability and its Applications. Springer Verlag, New York.
Del Moral, (2013) Del Moral, P. (2013). Mean field simulation for Monte Carlo integration, volume 126 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.
Del Moral et al., (2010) Del Moral, P., Doucet, A., and Singh, S. S. (2010). A backward particle interpretation of Feynman-Kac formulae. M2AN Math. Model. Numer. Anal., 44(5):947–975.
Del Moral and Miclo, (2001) Del Moral, P. and Miclo, L. (2001). Genealogies and increasing propagation of chaos for Feynman-Kac and genetic models. Ann. Appl. Probab., 11(4):1166–1198.
Douc et al., (2005) Douc, R., Cappé, O., and Moulines, E. (2005). Comparison of resampling schemes for particle filtering. In ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pages 64–69. IEEE.
Douc et al., (2011) Douc, R., Garivier, A., Moulines, E., and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab., 21(6):2109–2145.
Douc et al., (2018) Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018). Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, Cham.
Dubarry and Le Corff, (2013) Dubarry, C. and Le Corff, S. (2013). Non-asymptotic deviation inequalities for smoothed additive functionals in nonlinear state-space models. Bernoulli, 19(5B):2222–2249.
Duffield and Singh, (2022) Duffield, S. and Singh, S. S. (2022). Online particle smoothing with application to map-matching. IEEE Trans. Signal Process., 70:497–508.
Fearnhead et al., (2008) Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. (2008). Particle filters for partially observed diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(4):755–777.
Fearnhead et al., (2010) Fearnhead, P., Wyncoll, D., and Tawn, J. (2010). A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464.
Gerber and Chopin, (2015) Gerber, M. and Chopin, N. (2015). Sequential quasi Monte Carlo. J. R. Stat. Soc. Ser. B. Stat. Methodol., 77(3):509–579.
Gerber et al., (2019) Gerber, M., Chopin, N., and Whiteley, N. (2019). Negative association, ordering and convergence of resampling methods. Ann. Statist., 47(4):2236–2260.
Gloaguen et al., (2022) Gloaguen, P., Le Corff, S., and Olsson, J. (2022). A pseudo-marginal sequential Monte Carlo online smoothing algorithm. Bernoulli, 28(4):2606–2633.
Godsill et al., (2004) Godsill, S. J., Doucet, A., and West, M. (2004). Monte Carlo smoothing for nonlinear times series. J. Amer. Statist. Assoc., 99(465):156–168.
Gordon et al., (1993) Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Comm., Radar, Signal Proc., 140(2):107–113.
Guarniero et al., (2017) Guarniero, P., Johansen, A. M., and Lee, A. (2017). The iterated auxiliary particle filter. J. Amer. Statist. Assoc., 112(520):1636–1647.
Hening and Nguyen, (2018) Hening, A. and Nguyen, D. H. (2018). Stochastic Lotka-Volterra food chains. J. Math. Biol., 77(1):135–163.
Jacob et al., (2019) Jacob, P. E., Lindsten, F., and Schön, T. B. (2019). Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association.
Jacob et al., (2020) Jacob, P. E., O’Leary, J., and Atchadé, Y. F. (2020). Unbiased Markov chain Monte Carlo methods with couplings. J. R. Stat. Soc. Ser. B. Stat. Methodol., 82(3):543–600.
Janson, (2011) Janson, S. (2011). Probability asymptotics: notes on notation. arXiv preprint arXiv:1108.3924.
Jasra et al., (2017) Jasra, A., Kamatani, K., Law, K. J., and Zhou, Y. (2017). Multilevel particle filters. SIAM Journal on Numerical Analysis, 55(6):3068–3096.
Kalman, (1960) Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45.
Kalman and Bucy, (1961) Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. ASME Ser. D. J. Basic Engrg., 83:95–108.
Kitagawa, (1996) Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist., 5(1):1–25.
Lévy, (1940) Lévy, P. (1940). Sur certains processus stochastiques homogènes. Compositio mathematica, 7:283–339.
Lindvall and Rogers, (1986) Lindvall, T. and Rogers, L. C. G. (1986). Coupling of multidimensional diffusions by reflection. Ann. Probab., 14(3):860–872.
Lotka, (1925) Lotka, A. J. (1925). Elements of physical biology. Williams & Wilkins.
Mastrototaro et al., (2021) Mastrototaro, A., Olsson, J., and Alenlöv, J. (2021). Fast and numerically stable particle-based online additive smoothing: the adasmooth algorithm. arXiv preprint arXiv:2108.00432.
Mörters and Peres, (2010) Mörters, P. and Peres, Y. (2010). Brownian motion, volume 30. Cambridge University Press.
Nordh and Antonsson, (2015) Nordh, J. and Antonsson, J. (2015). A Quantitative Evaluation of Monte Carlo Smoothers. Technical report.
Olsson and Westerborn, (2017) Olsson, J. and Westerborn, J. (2017). Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 23(3):1951–1996.
Olver et al., (2010) Olver, F. W. J., Lozier, D. W., Boisvert, R. F., and Clark, C. W., editors (2010). NIST handbook of mathematical functions. U.S. Department of Commerce, National Institute of Standards and Technology, Washington, DC; Cambridge University Press, Cambridge. With 1 CD-ROM (Windows, Macintosh and UNIX).
Pitt and Shephard, (1999) Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: auxiliary particle filters. J. Amer. Statist. Assoc., 94(446):590–599.
Robert and Casella, (2004) Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer-Verlag, New York.
Roberts and Rosenthal, (2004) Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv., 1:20–71.
Samet, (2006) Samet, H. (2006). Foundations of multidimensional and metric data structures. Morgan Kaufmann.
Sen et al., (2018) Sen, D., Thiery, A. H., and Jasra, A. (2018). On coupling particle filter trajectories. Statistics and Computing, 28(2):461–475.
Taghavi et al., (2013) Taghavi, E., Lindsten, F., Svensson, L., and Schön, T. B. (2013). Adaptive stopping for fast particle smoothing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6293–6297.
Vershynin, (2018) Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Volterra, (1928) Volterra, V. (1928). Variations and fluctuations of the number of individuals in animal species living together. ICES Journal of Marine Science, 3(1):3–51.
Wang et al., (2021) Wang, G., O’Leary, J., and Jacob, P. (2021). Maximal couplings of the Metropolis-Hastings algorithm. In International Conference on Artificial Intelligence and Statistics, pages 1225–1233. PMLR.
Yonekura and Beskos, (2022) Yonekura, S. and Beskos, A. (2022). Online smoothing for diffusion processes observed with noise. Journal of Computational and Graphical Statistics, 0(0):1–17.

Supplement A Additional notations

This section defines new notations that do not appear in the main text (except notations for linear Gaussian models) but are used in the Supplement.

A.1. Linear Gaussian models

Let $\operatorname{dim}_{X}$ and $\operatorname{dim}_{Y}$ be two strictly positive integers and $F_{X}$ and $F_{Y}$ be two full-rank matrices of sizes $\operatorname{dim}_{X}\times\operatorname{dim}_{X}$ and $\operatorname{dim}_{Y}\times\operatorname{dim}_{X}$ respectively. Let $C_{X}$ and $C_{Y}$ be two symmetric positive definite matrices of respective sizes $\operatorname{dim}_{X}\times\operatorname{dim}_{X}$ and $\operatorname{dim}_{Y}\times\operatorname{dim}_{Y}$ . A linear Gaussian state space model has the underlying Markov process defined by

X_{t}|X_{0:t-1}\sim\mathcal{N}(F_{X}X_{t-1},C_{X}),

where $X_{0}$ also follows a Gaussian distribution; and admits the observation process

Y_{t}|X_{t}\sim\mathcal{N}(F_{Y}X_{t},C_{Y}).

The predictive ( $X_{t}$ given $Y_{0:t-1}$ ), filtering ( $X_{t}$ given $Y_{0:t}$ ) and smoothing ( $X_{t}$ given $Y_{0:T}$ ) distributions are all Gaussian and their parameters can be explicitly calculated via recurrence formulas (Kalman,, 1960; Kalman and Bucy,, 1961). We shall denote their respective mean vectors and covariance matrices by $(\mu_{t}^{\mathrm{pred}},\Sigma_{t}^{\mathrm{pred}})$ , $(\mu_{t}^{\mathrm{filt}},\Sigma_{t}^{\mathrm{filt}})$ and $(\mu_{t}^{\mathrm{smth}},\Sigma_{t}^{\mathrm{smth}})$ . In particular, the starting distribution $X_{0}$ is $\mathcal{N}(\mu_{0}^{\mathrm{pred}},\Sigma_{0}^{\mathrm{pred}})$ .

A.2. Total variation distance

Let $\mu$ and $\nu$ be two probability measures on $\mathcal{X}$ . The total variation distance between $\mu$ and $\nu$ , sometimes also denoted $\operatorname{TV}(\mu,\nu)$ , is defined as $\left\lVert\mu-\nu\right\rVert_{\operatorname{TV}}:=\sup_{f:\mathcal{X}\to[0,1]}\left|\mu(f)-\nu(f)\right|$ . The definition remains valid if $f$ is restricted to the class of indicator functions on measurable subsets of $\mathcal{X}$ . It implies in particular that $\left|\mu(f)-\nu(f)\right|\leq\left\lVert f\right\rVert_{\mathrm{osc}}\operatorname{TV}(\mu,\nu)$ .

Next, we state a lemma summarising basic properties of the total variation distance and defining coupling-related notions (see, e.g. Proposition 3 and formula (13) of Roberts and Rosenthal, (2004)). While the last property (covariance bound) is not in the aforementioned reference and does not seem popular in the literature, its proof is straightforward and therefore omitted.

Lemma A.1.

The total variation distance has the following properties:

•

(Alternative expressions.) If $\mu$ and $\nu$ admit densities $f(x)$ and $g(x)$ respectively with reference to a dominating measure $\lambda$ , we have

\operatorname{TV}(\mu,\nu)=\frac{1}{2}\int\left|f(x)-g(x)\right|\lambda(\mathrm{d}x)=1-\int\min(f(x),g(x))\lambda(\mathrm{d}x).

•

(Coupling inequality & maximal coupling.) For any pair of random variables $(M,N)$ such that $M\sim\mu$ and $N\sim\nu$ , we have

$\mathbb{P}(M\neq N)\geq\operatorname{TV}(\mu,\nu).$

There exist pairs $(M^{*},N^{*})$ for which equality holds. They are called maximal couplings of $\mu$ and $\nu$ .
•

(Contraction property.) Let $(X_{n})$ be a Markov chain with invariant measure $\mu^{\star}$ . Then

$\operatorname{TV}(X_{n},\mu^{*})\geq\operatorname{TV}(X_{n+1},\mu^{*}).$

•

(Covariance bound.) For any pair of random variables $(M,N)$ such that $M\sim\mu$ and $N\sim\nu$ and real-valued functions $h_{1}$ and $h_{2}$ , we have

\left|\mathrm{Cov}(h_{1}(M),h_{2}(N))\right|\leq 2\left\lVert h_{1}\right\rVert_{\infty}\left\lVert h_{2}\right\rVert_{\infty}\operatorname{TV}\left((M,N),\mu\otimes\nu\right).

A.3. Cost-to-go function

In the context of the Feynman-Kac model (1), define the associated cost-to-go function $H_{t:T}$ as (see e.g. Chopin and Papaspiliopoulos, (2020, Chapter 5))

(20)

H_{t:T}(x_{t}):=\prod_{s=t+1}^{T}M_{s-1}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

This function bridges $\mathbb{Q}_{t}(\mathrm{d}x_{t})$ and $\mathbb{Q}_{T}(\mathrm{d}x_{t})$ , since $\mathbb{Q}_{T}(\mathrm{d}x_{t})\propto\mathbb{Q}_{t}(\mathrm{d}x_{t})H_{t:T}(x_{t})$ .

A.4. The projection kernel

Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces. The projection kernel $\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}$ is defined by

\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}\left((x,y),\mathrm{d}x^{*}\right):=\delta_{x}(\mathrm{d}x^{*}).

In particular, for any function $g:\mathcal{X}\to\mathbb{R}$ and measure $\mu(\mathrm{d}x,\mathrm{d}y)$ defined on $\mathcal{X}\times\mathcal{Y}$ , we have

	$\displaystyle(\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}g)(x,y)$	$\displaystyle=g(x)$
	$\displaystyle(\mu\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}})(g)$	$\displaystyle=\iint g(x)\mu(\mathrm{d}x,\mathrm{d}y)=\int g(x)\mu(\mathrm{d}x)$

where the second identity shows the marginalising action of $\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}$ on $\mu$ . In the context of state space models, we define the shorthand

\Pi^{0:T}_{t}:=\Pi^{(\mathcal{X}_{0},\ldots,\mathcal{X}_{T})}_{\mathcal{X}_{t}}.

A.5. Other notations

For a real number $x$ , let $\lfloor x\rfloor$ be the largest integer not exceeding $x$ . The mapping $x\mapsto\lfloor x\rfloor$ is called the floor function • The Gamma function $\Gamma(a)$ is defined for $a>0$ and is given by $\Gamma(a):=\int_{\mathbb{R}_{+}}e^{-x}x^{a-1}\mathrm{d}x$ • Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces. Let $K(x,\mathrm{d}y)$ be a (not necessarily probability) kernel from $\mathcal{X}$ to $\mathcal{Y}$ . The norm of $K$ is defined by $\left\lVert K\right\rVert_{\infty}:=\sup_{f:\mathcal{X}\to\mathcal{Y},f\neq 0}\left\lVert Kf\right\rVert_{\infty}/\left\lVert f\right\rVert_{\infty}$ . In particular, for any function $f:\mathcal{X}\to\mathcal{Y}$ , we have $\left\lVert Kf\right\rVert_{\infty}\leq\left\lVert K\right\rVert_{\infty}\left\lVert f\right\rVert_{\infty}$ • Let $X_{n}$ be a sequence of random variables. We say that $X_{n}=\mathcal{O}_{\mathbb{P}}(1)$ if for any $\varepsilon>0$ , there exists $M>0$ and $N_{0}$ , both depending on $\varepsilon$ , such that $\mathbb{P}(|X_{n}|\geq M)\leq\varepsilon$ for all $n\geq N_{0}$ . For a strictly positive deterministic sequence $a_{n}$ , we say that $X_{n}=\mathcal{O}_{\mathbb{P}}(a_{n})$ if $X_{n}/a_{n}=\mathcal{O}_{\mathbb{P}}(1)$ . See Janson, (2011) for discussions • We use the notation $\mathcal{N}(x|\mu,\Sigma)$ to refer to the value at $x$ of the density function of the normal distribution $\mathcal{N}(\mu,\Sigma)$ • Let $f:U\to V$ be a function from some space $U$ to another space $V$ . Let $S$ be a subset of $U$ . The restriction of $f$ to S, written $f|_{S}$ , is the function from $S$ to $V$ defined by $f|_{S}(x)=f(x)$ , $\forall x\in S$ .

Supplement B FFBS complexity for different rejection schemes

B.1. Framework and notations

The FFBS algorithm is a particular instance of Algorithm 2 where $B_{t}^{N,\mathrm{FFBS}}$ kernels are used. If backward simulation is done using pure rejection sampling (Algorithm 4), the computational cost to simulate the $t-1$ -th index of the $n$ -th trajectory has conditional distribution

(21)

\tau_{t}^{n,\mathrm{FFBS}}|\ \mathcal{F}_{T},\mathcal{I}_{t:T}^{n}\sim\operatorname{Geo}\left(\frac{\sum_{i}W_{t-1}^{i}m_{t}(X_{t-1}^{i},X_{t}^{\mathcal{I}_{t}^{n}})}{\bar{M}_{h}}\right).

At this point, it would be useful to compare this formula with (16) of the PaRIS algorithm. The difference is subtle but will drive interesting changes to the way rejection-based FFBS behaves.

If hybrid rejection sampling (Algorithm 6) is to be used instead, we are interested in the distribution of $\min(\tau_{t}^{n,\mathrm{FFBS}},N)$ , for reasons discussed in Subsection 3.2. In a highly parallel setting, it is preferable that the distribution of individual execution times, i.e. $\tau_{t}^{n,\mathrm{FFBS}}$ or $\min(\tau_{t}^{n,\mathrm{FFBS}},N)$ , are not heavy-tailed. In contrast, for non-parallel hardware, only cumulative execution times, i.e. $\sum_{n=1}^{N}\tau_{t}^{n,\mathrm{FFBS}}$ or $\sum_{n=1}^{N}\min(\tau_{t}^{n,\mathrm{FFBS}},N)$ , matter. Even though the individual times might behave badly, the cumulative times could be much more regular thanks to effect of the central limit theorem, whenever applicable. Nevertheless, studying the finiteness of the $k$ -th order moment of $\tau_{t}^{1,\mathrm{FFBS}}$ is still a good way to get information about both types of execution times, since it automatically implies $k$ -th order moment (in)finiteness for both of them.

B.2. Execution time for pure rejection sampling

We show that under certain circumstances, the execution time of the pure rejection procedure has infinite expectation. Proposition 1 in Douc et al., (2011) hints that the cost per trajectory for FFBS-reject might tend to infinity when $N\to\infty$ . In contrast, we show that infinite expectation might very well happen for finite sample sizes. We first give the statement for general state space models, then focus on their implications for Gaussian ones. In particular, while infinite expectations occur only under certain configurations, infinite higher moments happen in all linear Gaussian models with non-degenerate dynamics.

Theorem B.1.

Using the setting and notations of Supplement B.1, under Assumptions 1 and 4 , we have $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\infty$ whenever

\int_{\mathcal{X}_{t}}G_{t}(x_{t})H_{t:T}(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty

where the cost-to-go function $H_{t:T}$ is defined in (20) and the measure $\lambda_{t}$ is defined in (2).

Theorem B.2.

Using the setting and notations of Supplement B.1, we consider linear Gaussian models and their notations defined in Supplement A.1. Then we have $\mathbb{E}[(\tau_{t}^{1,\mathrm{FFBS}})^{k}]=\infty$ whenever $k$ is greater than a certain $k_{0}$ being the smallest eigenvalue of the matrix $\operatorname{Id}+C_{X}^{1/2}\left({(\Sigma^{\mathrm{smth}}_{t})}^{-1}-{(\Sigma^{\mathrm{pred}}_{t})}^{-1}\right)C_{X}^{1/2}$ .

The proofs of the two assertions are given in Supplement E.7. We now look at how they are manifested in concrete examples. The first remark is that for technical reasons, Theorem B.2 gives no information on the finiteness of $E[(\tau_{t}^{1,\mathrm{FFBS}})^{k}]$ for $k=1$ (since $k_{0}$ is already greater than or equal to $1$ by definition). To study the finiteness of $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]$ , we thus turn to Theorem B.1.

Example 4.

In linear Gaussian models, the integral of Theorem B.1 is equal to

\int\mathcal{N}(y_{t}|F_{Y}x_{t},C_{Y})\prod_{s=t+1}^{T}\mathcal{N}(x_{s}|F_{X}x_{s-1},C_{X})\mathcal{N}(y_{s}|F_{Y}x_{s},C_{Y})\mathrm{d}x_{t:T}

where the notation $\mathcal{N}(\mu,\Sigma)$ refers to the density of the normal distribution. The integrand is proportional to $\exp[-0.5(Q(x_{t:T})-R(x_{t:T}))]$ for some quadratic form $Q(x_{t:T})$ and linear form $R(x_{t:T})$ . The integral is finite if and only if $Q$ is positive definite. In our case, this means that there is no non-trivial root for the equation $Q(x_{t:T})=0$ , which is equivalent to

\begin{cases}F_{Y}x_{s}&=0,\forall s=t,\ldots T\\ F_{X}x_{s-1}&=x_{s},\forall s=t+1,\ldots,T.\end{cases}

Put another way, $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]$ is infinite whenever the intersection

\bigcap_{k=0}^{T-t}\operatorname{Ker}(F_{Y}F_{X}^{k})=\bigcap_{k=0}^{T-t}F_{X}^{-k}(\operatorname{Ker}(F_{Y}))

contains other things than the zero vector. A common and particularly troublesome situation is when $F_{X}=c\operatorname{Id}$ for some $c>0$ (but $C_{X}$ can be arbitrary) and the dimension of the states ( $\operatorname{dim}_{X}$ ) is greater than that of the observations ( $\operatorname{dim}_{Y}$ ). Then the above intersection remains non-trivial no matter how big $T-t$ is. Thus, $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]$ has no expectation for any $t$ . In general, the problem is less severe as successive intersections will shrink the space quickly to $\{0\}$ . Consequently, Theorem B.1 only points out infiniteness of $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]$ for $t$ close to $T$ . The bad news however will come from higher moments, as seen in the below example.

We will now focus on a simple but particularly striking example. Our purpose here is to illustrate the concepts as well as to show that their implications are relevant even in small, familiar settings. More advanced scenarios are presented in Section 5 devoted to numerical experiments.

Example 5.

We consider two one-dimensional Gaussian state-space models: they both have $F_{X}=0.5$ , $C_{X}=1$ , $X_{0}\sim\mathcal{N}(0,C_{X}^{2}/(1-F_{X}^{2}))$ and $T=3$ . The only difference between them is that one has $\sigma_{y}^{2}:=C_{Y}=0.5^{2}$ and another has $\sigma_{y}^{2}=3^{2}$ . We are interested in the execution times $\tau_{1}^{n,\mathrm{FFBS}}$ at time $t=1$ (i.e. the rejection-based simulation of indices $\mathcal{I}_{0}^{n}$ at time $t=0$ ). Theorem B.2 then gives $k_{0}\approx 1.14$ for $\sigma_{y}=3$ and $k_{0}\approx 5$ for $\sigma_{y}=0.5$ . The first implication is that in both cases, $\tau_{1}^{n,\mathrm{FFBS}}$ is a heavy-tailed random variable and therefore FFBS-reject is not a viable option in a highly parallel setting. But an interesting phenomenon happens in the sequential hardware scenario where one is rather interested in the cumulative execution time, i.e. $\sum_{n=1}^{N}\tau_{1}^{n,\mathrm{FFBS}}$ , or equivalently, the mean number of trials per particle. In the $\sigma_{y}=3$ case, non-existence of second moment prevents the cumulative regularisation effect of the central limit theorem. This is not the case for $\sigma_{y}=0.5$ , in which the cumulative execution time actually behaves nicely (Figures 7 and 8). However, the most valuable message from this example is perhaps that the performance of FFBS-reject depends in a non-trivial (hard to predict) way on the model parameters.

B.3. Execution time for hybrid rejection sampling

Formula (21) suggests defining the limit distribution $\tau_{t}^{\infty,\mathrm{FFBS}}$ as

\tau_{t}^{\infty,\mathrm{FFBS}}\ |\ X_{t}^{\infty,\mathrm{FFBS}}\sim\operatorname{Geo}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{FFBS}})}{\bar{M}_{h}}\right)

where $X_{t}^{\infty,\mathrm{FFBS}}\sim\mathbb{Q}_{T}(\mathrm{d}x_{t})$ and $r_{t}$ given in Definition 1. These quantities provide the following characterisation of the cumulative execution time for the hybrid FFBS algorithm (proved in Section E.8).

Theorem B.3.

Under Assumptions 1 and 4 and the setting of Section B.1, we have

\frac{\sum_{n=1}^{N}\min(\tau_{t}^{n,\mathrm{FFBS}},N)}{N}=\mathcal{O}_{\mathbb{P}}\left(\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]\right)

where the notation $\mathcal{O}_{\mathbb{P}}$ is defined in Supplement A.

This theorem admits the following corollary for linear Gaussian models (also proved in Section E.8).

Corollary 2.

For linear Gaussian models (Supplement A.1), if smoothing is performed using the hybrid rejection version of the FFBS algorithm, the mean execution time per particle at time step $t$ is $\mathcal{O}_{\mathbb{P}}(\log^{d_{t}/2}N)$ where $d_{t}$ is the dimension of $\mathcal{X}_{t}$ .

The bound $\mathcal{O}_{\mathbb{P}}(\log^{d_{t}/2}N)$ is actually quite conservative. For instance, with either $\sigma_{y}=0.5$ or $\sigma_{y}=3$ , the model considered in Example 5 admits $\mathbb{E}[\tau_{t}^{\infty,\mathrm{FFBS}}]<\infty$ . (Gaussian dynamics can be handled using exact analytic calculations and enables to verify the claim straightforwardly.) Theorem B.3 then gives an execution time per particle of order $\mathcal{O}_{\mathbb{P}}(1)$ for hybrid FFBS, which is better than the $\mathcal{O}_{\mathbb{P}}(\sqrt{\log N})$ predicted by Corollary 2. Yet another unsatisfactory point of the result is its failure to make sense of the spectacular improvement brought by hybrid rejection sampling over the ordinary procedure in the $\sigma_{y}=3$ case (see Figure 7). As explained in Example 5, this is connected to the variance of $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]$ and not merely the expectation; so a study of second order properties of $N^{-1}\sum_{n}\min(\tau_{t}^{n,\mathrm{FFBS}},N)$ would be desirable.

Supplement C Conditionally-correlated versions of particle algorithms

C.1. Alternative resampling schemes

In Algorithm 1, the indices $A_{t}^{1:N}$ are drawn conditionally i.i.d. from the multinomial distribution $\mathcal{M}(W_{t-1}^{1:N})$ . They satisfy

\mathbb{E}\left[\left.{\sum_{j=1}^{N}\mathbb{1}_{A_{t}^{j}=i}}\right|{\mathcal{F}_{t-1}}\right]=NW_{t-1}^{i}

for any $i=1,\ldots,N$ . There are other ways to generate $A_{t}^{1:N}$ from $W_{t-1}^{1:N}$ that still verify this identity. We call them unbiased resampling schemes, and the natural one used in Algorithm 1 multinomial resampling.

The main motivation for alternative resampling schemes is performance. We refer to Chopin, (2004); Douc et al., (2005); Gerber et al., (2019) for more details, but would like to mention that the theoretical studies of particle algorithms using other resampling schemes are more complicated since $X_{t}^{1:N}$ are no longer i.i.d. given $\mathcal{F}_{t-1}$ . We use systematic resampling (Carpenter et al.,, 1999) in our experiments. See Algorithm 8 for a succinct description and Chopin and Papaspiliopoulos, (2020, Chapter 9) for efficient implementations in $\mathcal{O}(N)$ running time.

Input: Weights

W_{t-1}^{1:N}

summing to

1

Generate

U\sim\operatorname{Uniform}[0,1]

for $n\leftarrow 1$ to $N$ do

Set

A_{t}^{n}

to the unique index

k

satisfying

W_{1}+\cdots+W_{k-1}\leq\frac{n-1+u}{N}<W_{1}+\cdots+W_{k}

Output: Resampled indices

A_{t}^{1:N}

Algorithm 8 Systematic resampling

C.2. Conditionally-correlated version of Algorithm 7

In this part, we present an alternative version of Algorithm 7 that does not create conditionally i.i.d. particles at each time step. The procedure is detailed in Algorithm 9. It creates on the fly backward kernels $B_{t}^{N,\mathrm{ITRC}}$ (for “intractable, conditionally correlated”). It involves a resampling step which can be done in principle using any unbiased resampling scheme. Following the intuitions of Subsection 4.2.3 and the notations of Algorithm 9, we want a scheme such that in most cases, $A_{t}^{2k-1}\neq A_{t}^{2k}$ but the Euclidean distance between $X_{t-1}^{A_{t}^{2k-1}}$ and $X_{t-1}^{A_{t}^{2k}}$ is small. Algorithm 10 proposes such a method (which we name the Adjacent Resampler). It can run in $\mathcal{O}(N)$ time using a suitably implemented linked list.

Input: Feynman-Kac model (1), particles

X_{t-1}^{1:N}

and weights

W_{t-1}^{1:N}

that approximate

\mathbb{Q}_{t-1}(\mathrm{d}x_{t-1})

Resample

A_{t}^{1:N}

from

\left\{1,2,\ldots,N\right\}

with weights

W_{t-1}^{1:N}

using any resampling scheme (such as the Adjacent Resampler in Algorithm 10)

for $k\leftarrow 1$ to $N/2$ do

Move. Simulate

X_{t}^{2k-1}

and

X_{t}^{2k}

such that marginally,

X_{t}^{2k-1}\sim M_{t}(X_{t-1}^{A_{t}^{2k-1}},\cdot)

and

X_{t}^{2k}\sim M_{t}(X_{t-1}^{A_{t}^{2k}},\cdot)

Calculate backward kernel.

if $X_{t}^{2k-1}=X_{t}^{2k}$ then

Set

B_{t}^{N,\mathrm{ITRC}}(2k-1,\cdot)\leftarrow\left(\delta\left\{A_{t}^{2k-1}\right\}+\delta\left\{A_{t}^{2k}\right\}\right)/2

Set

B_{t}^{N,\mathrm{ITRC}}(2k,\cdot)\leftarrow\left(\delta\left\{A_{t}^{2k-1}\right\}+\delta\left\{A_{t}^{2k}\right\}\right)/2

else

Set

B_{t}^{N,\mathrm{ITRC}}(2k-1,\cdot)\leftarrow\delta\left\{A_{t}^{2k-1}\right\}

Set

B_{t}^{N,\mathrm{ITRC}}(2k,\cdot)\leftarrow\delta\left\{A_{t}^{2k}\right\}

Reweight. Set

\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n})

for

n=1,2,\ldots,N

Set

\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N

Set

W_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N}

for

n=1,2,\ldots,N

Output: Particles

X_{t}^{1:N}

and weights

W_{t}^{1:N}

that approximate

\mathbb{Q}_{t}(\mathrm{d}x_{t})

; backward kernel

B_{t}^{N,\mathrm{ITRC}}

for use in Algorithms 2 and 3

Algorithm 9 Conditionally-correlated version of Algorithm 7

Input: Particles

X_{t-1}^{1:N}

, weights

W_{t-1}^{1:N}

Sort the particles

X_{t-1}^{1:N}

using the Hilbert curve. Let

s\leftarrow[s_{1}\ldots s_{N}]

be the corresponding indices

Resample from

\left\{1,\ldots,N\right\}

with weights

W_{t-1}^{1:N}

using systematic resampling (Carpenter et al.,, 1999; Gerber et al.,, 2019), then let

f:\left\{1,\ldots,N\right\}\to\mathbb{Z}

be the function defined by

f(i)

being the number of times the index

s_{i}

was resampled. Obviously

\sum_{i=1}^{N}f(i)=N

Initialise

i\leftarrow 1

for $n\leftarrow 1$ to $N$ do

Set

A_{t}^{n}\leftarrow s_{i}

Update

f(i)\leftarrow f(i)-1

Let

\Omega_{1}

be the set

\left\{\min\left\{\ell>i\mid f_{\ell}>0\right\}\right\}

(which has one element if the minimum is well-defined and zero element otherwise)

Let

\Omega_{2}

be the set

\left\{\max\left\{\ell<i\mid f_{\ell}>0\right\}\right\}

(which has one element if the maximum is well-defined and zero element otherwise)

\Omega_{1}\cup\Omega_{2}

is not empty, update

i\leftarrow\operatorname{argmax}f|_{\Omega_{1}\cup\Omega_{2}}

(see section A.5 for the restriction notation). If there is more than one argmax, pick one randomly

Output: Resampled indices

A_{t}^{1:N}

Algorithm 10 The Adjacent Resampler

Supplement D Additional information on numerical experiments

D.1. Offline smoothing in linear Gaussian models

In this section, we study offline smoothing for the linear Gaussian model specified in Section 5.1. Since offline processing requires storing particles at all times $t$ in the memory, we use $T=500$ here instead of $T=3000$ . Apart from that, the algorithmic and benchmark settings remain the same.

Figure 9 plots the squared interquartile range of the estimators $\mathbb{Q}_{T}(\varphi_{t})$ with respect to $t$ , for different algorithms. For small $t$ , the function $\varphi_{t}$ only looks at states close to time $0$ , whereas for bigger $t$ , recent states less affected by degeneracy are also taken into account. In all cases though, we see that MCMC and rejection-based smoothers have superior performance.

Figure 10 shows box plots of the averaged execution times (per particle $N$ per time $t$ ) based on $150$ runs. The observations are comparable to those in Section 5.1. We see a performance difference between the rejection-based smoothers using the bootstrap and the guided filters. Both have an execution time that is much more variable than hybrid rejection algorithms. The latter still need around $10$ times more CPU load than MCMC smoothers, for essentially the same precision.

We now take a closer look at the reason behind the performance difference between the bootstrap filter and the guided one when pure rejection sampling is used. Figure 11 shows the effective sample size (ESS) of both filters as a function of time. We can see that there is an outlier in the data around time $t=40$ . Figure 12 box-plots the execution times divided by $N$ at $t=40$ for the pure rejection sampling algorithm, whereas Figures 13 and 14 do the same for $t=38$ and $t=42$ . The root of the problem is now clear: at most times $t$ there is very few difference between the execution times of the bootstrap and the guided filters. However, if an outlier is present in the data, the guided filter suddenly requires a very high number of transition density evaluation in the rejection sampler. This gives yet another reason to avoid using pure rejection sampling.

D.2. Lotka-Volterra SDE

D.2.1. Coupling of Euler discretisations

Consider the SDE

(22)

\mathrm{d}X_{t}=b(X_{t})\mathrm{d}t+\sigma(X_{t})\mathrm{d}W_{t}

and two starting points $X_{0}^{\mathrm{A}}$ and $X_{0}^{\mathrm{B}}$ in $\mathbb{R}^{d}$ . We wish to simulate $X_{1}^{\mathrm{A}}$ and $X_{1}^{\mathrm{B}}$ such that the transitions from $X_{0}^{\mathrm{A}}$ to $X_{1}^{\mathrm{A}}$ and $X_{0}^{\mathrm{B}}$ to $X_{1}^{\mathrm{B}}$ both follow the Euler-discretised version of the equation, but $X_{1}^{\mathrm{A}}$ and $X_{1}^{\mathrm{B}}$ are correlated in a way that increases, as much as we can, the probability that they are equal. Algorithm 11 makes it clear that it all boils down to the coupling of two Gaussian distributions.

Input: Functions

b:\mathbb{R}^{d}\to\mathbb{R}^{d}

and

\sigma:\mathbb{R}^{d}\to\mathbb{R}^{d\times d}

, two starting points

X_{0}^{\mathrm{A}}

and

X_{0}^{\mathrm{B}}

at time

0

, number of discretisation step

N_{\mathrm{dist}}

Initialise

X^{\mathrm{A}}\leftarrow X_{0}^{\mathrm{A}}

Initialise

X^{\mathrm{B}}\leftarrow X_{0}^{\mathrm{B}}

Set

\delta\leftarrow 1/N_{\mathrm{dist}}

for $i\leftarrow 1$ to $N_{\mathrm{dist}}$ do

Simulate

(\tilde{X}^{\mathrm{A}},\tilde{X}^{\mathrm{B}})

from a coupling of

\mathcal{N}(X^{\mathrm{A}}+\delta b(X^{\mathrm{A}}),\delta\sigma(X^{\mathrm{A}})\sigma(X^{\mathrm{A}})^{\top})

and

\mathcal{N}(X^{\mathrm{B}}+\delta b(X^{\mathrm{B}}),\delta\sigma(X^{\mathrm{B}})\sigma(X^{\mathrm{B}})^{\top}),

such as Algorithm 14

Update

(X^{\mathrm{A}},X^{\mathrm{B}})\leftarrow(\tilde{X}^{\mathrm{A}},\tilde{X}^{\mathrm{B}})

Set

(X_{1}^{\mathrm{A}},X_{1}^{\mathrm{B}})\leftarrow(X^{\mathrm{A}},X^{\mathrm{B}})

Output: Two endpoints

X_{1}^{\mathrm{A}}

and

X_{1}^{\mathrm{B}}

at time

1

, obtained by passing

X_{0}^{\mathrm{A}}

and

X_{0}^{\mathrm{B}}

in a correlated manner through a discretised version of (22)

Algorithm 11 Coupling of two Euler discretisations

Lindvall and Rogers, (1986) propose the following construction: if two diffusions $X_{t}^{\mathrm{A}}$ and $X_{t}^{\mathrm{B}}$ both follow the dynamics of (22), that is,

	$\displaystyle\mathrm{d}X_{t}^{\mathrm{A}}$	$\displaystyle=b(X_{t}^{\mathrm{A}})\mathrm{d}t+\sigma(X_{t}^{\mathrm{A}})\mathrm{d}W_{t}^{\mathrm{A}}$
	$\displaystyle\mathrm{d}X_{t}^{\mathrm{B}}$	$\displaystyle=b(X_{t}^{\mathrm{B}})\mathrm{d}t+\sigma(X_{t}^{\mathrm{B}})\mathrm{d}W_{t}^{\mathrm{B}}$

and the two Brownian motions are correlated via

(23)

\mathrm{d}W_{t}^{\mathrm{B}}=[\operatorname{Id}-2u(X^{\mathrm{A}},X^{\mathrm{B}})u(X^{\mathrm{A}},X^{\mathrm{B}})^{\top}]\mathrm{d}W_{t}^{\mathrm{A}}

where $\operatorname{Id}$ is the identity matrix and the vector $u$ is defined by

u(x,x^{\prime})=\frac{\sigma(x^{\prime})^{-1}(x-x^{\prime})}{\left\lVert\sigma(x^{\prime})^{-1}(x-x^{\prime})\right\rVert_{2}},

then under some regularity conditions, the two diffusions meet almost surely. (Note two special features of (23): it is valid because the term in the square bracket is an orthogonal matrix; and it ceases to be well-defined once the two trajectories have met.) Simulating the meeting time $\tau$ turns out to be very challenging. The Euler discretisation (Algorithm 11 + Algorithm 12) has a fixed step size $\delta$ , and there is zero probability that $\tau$ is of the form $k\delta$ for some integer $k$ . Since the coupling transform is deterministic, the two Euler-simulated trajectories will never meet. Figure 15 depicts this difficulty in the special case of two Brownian motions in dimension $1$ (i.e. $b(x)\equiv 0$ and $\sigma\equiv 1$ ). Under this setting, (23) means that the two Brownian increments are symmetric with respect to the midpoint of the segment connecting their initial states. Note that the two dashed lines do cross at two points, but using them as meeting points is invalid: since they are not part of the discretisation but the result of some heuristic “linear interpolation”, it would change the distribution of the trajectories.

Input: Two vectors

\mu^{\mathrm{A}},\mu^{\mathrm{B}}

\mathbb{R}^{d}

and two

d\times d

matrices

\sigma^{\mathrm{A}}

and

\sigma^{\mathrm{B}}

Calculate

u\leftarrow(\sigma^{\mathrm{B}})^{-1}(\mu^{\mathrm{A}}-\mu^{\mathrm{B}})

Normalise

u\leftarrow u/\left\lVert u\right\rVert_{2}

Simulate

W^{\mathrm{A}}\sim\mathcal{N}(0,\operatorname{Id})

Set

W^{\mathrm{B}}\leftarrow(\operatorname{Id}-2uu^{\top})W^{\mathrm{A}}

Set

X^{\mathrm{A}}\leftarrow\mu^{\mathrm{A}}+\sigma^{\mathrm{A}}W^{\mathrm{A}}

Set

X^{\mathrm{B}}\leftarrow\mu^{\mathrm{B}}+\sigma^{\mathrm{B}}W^{\mathrm{B}}

Output: Two correlated points

X^{\mathrm{A}}

and

X^{\mathrm{B}}

marginally distributed according to

\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top})

and

\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top})

respectively

Algorithm 12 Lindvall-Rogers coupling of two Gaussian distributions

We therefore need some coupling that has a non-zero meeting probability at each $\delta$ -step. This can be achieved by the rejection maximal coupling (Algorithm 13, see also, e.g. Roberts and Rosenthal,, 2004) as well as the recently proposed coupled rejection sampler (Corenflos and Särkkä,, 2022). However, they all make use of rejection sampling in one way or another, which renders the execution time random. We wish to avoid this if possible. The reflection-maximal coupling (Bou-Rabee et al.,, 2020; Jacob et al.,, 2020) has deterministic cost and optimal meeting probability, but is only applicable for two Gaussian distributions of the same covariance matrix, which is not our case.

Input: Two probability distributions

f^{\mathrm{A}}

and

f^{\mathrm{B}}

Simulate

X^{\mathrm{A}}\sim f^{\mathrm{A}}

Simulate

U^{\mathrm{A}}\sim\operatorname{Uniform}[0,f^{\mathrm{A}}(X^{\mathrm{A}})]

if $U^{\mathrm{A}}\leq f^{\mathrm{B}}(X^{\mathrm{A}})$ then

Set

X^{\mathrm{B}}\leftarrow X^{\mathrm{A}}

else

repeat

Simulate

X^{\mathrm{B}}\sim f^{\mathrm{B}}

Simulate

U^{\mathrm{B}}\sim\operatorname{Uniform}[0,f^{\mathrm{B}}(X^{\mathrm{B}})]

until $U^{\mathrm{B}}>f^{\mathrm{A}}(X^{\mathrm{B}})$

Output: Two maximally-coupled realisations

X^{\mathrm{A}}

and

X^{\mathrm{B}}

, marginally

f^{\mathrm{A}}

-distributed and

f^{\mathrm{B}}

-distributed respectively

Algorithm 13 Rejection maximal coupler for two distributions

As suggested by Figure 15, the discretised Lindvall-Rogers coupling (Algorithm 12) is actually great for bringing together two faraway trajectories. Only when they start getting closer that it misses out. At that moment, the two distributions corresponding to the next $\delta$ -step have non-negligible overlap and would preferably be coupled in the style of Algorithm 13. We propose a modified coupling scheme that acts like Algorithm 12 when the two trajectories are at a large distance and behaves as Algorithm 13 otherwise.

The idea is to preliminarily generate a uniform draw in the “overlapping zone” of the two distributions (if they are close enough to make that easy). Next, we perform Algorithm 12 and then, any of the two simulations belonging to the overlapping zone will be replaced by the aforementioned preliminary draw (if it is available). The precise mathematical formulation is given in Algorithm 14 and the proof in Supplement E.12.

Input: Two vectors

\mu^{\mathrm{A}}

and

\mu^{\mathrm{B}}

\mathbb{R}^{d}

, two

d\times d

matrices

\sigma^{\mathrm{A}}

and

\sigma^{\mathrm{B}}

Let

f^{\mathrm{A}}

and

f^{\mathrm{B}}

be respectively the probability densities of

\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top})

and

\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top})

Simulate

X^{\mathrm{A}}

and

X^{\mathrm{B}}

from Algorithm 12

Simulate

U\sim\operatorname{Uniform}[0,1]

Set

U^{\mathrm{A}}\leftarrow Uf^{\mathrm{A}}(X^{\mathrm{A}})

and

U^{\mathrm{B}}\leftarrow Uf^{\mathrm{B}}(X^{\mathrm{B}})

Simulate

Y\sim f^{\mathrm{A}}

and

V\sim\operatorname{Uniform}[0,f^{\mathrm{A}}(Y)]

if $V\leq f^{\mathrm{B}}(Y)$ then

U^{\mathrm{A}}\leq f^{\mathrm{B}}(X^{\mathrm{A}})

then update

(X^{\mathrm{A}},U^{\mathrm{A}})\leftarrow(Y,V)

U^{\mathrm{B}}\leq f^{\mathrm{A}}(X^{\mathrm{B}})

then update

(X^{\mathrm{B}},U^{\mathrm{B}})\leftarrow(Y,V)

Output: Two correlated random vectors

X^{\mathrm{A}}

and

X^{\mathrm{B}}

, distributed marginally according to

\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top})

and

\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top})

Algorithm 14 Modified Lindvall-Rogers (MLR) coupler of two Gaussian distributions

Algorithm 14 has a deterministic execution time, but it does not attain the optimal coupling rate. Yet, as $\delta\to 0$ , we see empirically that it still recovers the oracle coupling time defined by (23) (although we did not try to prove this formally). In Figure 16, we couple two standard Brownian motions starting from $a=0$ and $b=1.5$ using Algorithm 14 with different values of $\delta$ . It is known, by a simple application of the reflection principle (Lévy,, 1940; see also Chapter 2.2 of Mörters and Peres,, 2010), that the reflection coupling (23) succeeds after a $\operatorname{Levy}(0,(b-a)^{2}/4)$ -distributed time. We therefore have to deal with a heavy-tailed distribution and restrict ourselves to the interval $[0,5]$ . We see that the law of the meeting time is stable and convergent as $\delta\to 0$ . Thus, at least empirically, Algorithm 14 does not suffer from the instability problem as $\delta\to 0$ , contrary to a naive path space augmentation approach (see Yonekura and Beskos,, 2022 for a discussion).

D.2.2. Supplementary figures

Figure 17 plots a realisation of the states and data with parameters given in Subsection 5.2, for a relatively small scale dataset ( $T=50$ ). While the periodic trait seen in classical deterministic Lotka-Volterra equations is still visible (with a period of around $20$ ), it is clear that here random perturbations have added considerable chaos to the system. Figures 18 and 19 show respectively the performances of the naive genealogy tracking smoother and ours (Algorithm 9) on the dataset of Figure 17. Our smoother has successfully prevented the degeneracy phenomenon, particularly for times close to $0$ . Figure 20 shows, in two different ways, the properties of effective sample sizes (ESS) in the $T=3000$ scenario (see Section 5.2).

Supplement E Proofs

E.1. Proof of Theorem 2.1 (general convergence theorem)

In line with (7), we define the distribution $\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t})$ for $t<T$ as the $x_{0:t}$ marginal of the joint distribution

(24)

\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t},\mathrm{d}i_{0:t}):=\mathcal{M}(W_{t}^{1:N})(\mathrm{d}i_{t})\left[\prod_{s=t}^{1}B_{s}^{N}(i_{s},\mathrm{d}i_{s-1})\right]\left[\prod_{s=t}^{0}\delta_{X_{s}^{i_{s}}}(\mathrm{d}x_{s})\right].

The proof builds up on an inductive argument which links $\mathbb{Q}_{t}^{N}$ with $\mathbb{Q}_{t-1}^{N}$ through new innovations at time $t$ . More precisely, we have the following fundamental proposition, where $\mathcal{F}_{t}^{+}$ is defined as the smallest $\sigma$ -algebra containing $\mathcal{F}_{t}$ and $\hat{B}_{1:t}^{N}$ .

Proposition 5.

$\mathbb{Q}_{t}^{N}$ is a mixture distribution that admits the representation

(25)

\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t})=(\ell_{t}^{N})^{-1}N^{-1}\sum_{n}G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})

where $\ell_{t}^{N}$ is defined in Algorithm 1 and $K_{t}^{N}(n,\mathrm{d}x_{0:t})$ is a certain probability measure satisfying

(26)

\mathbb{E}\left[\left.{K_{t}^{N}(n,\mathrm{d}x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:t-1})M_{t}(x_{t-1},\mathrm{d}x_{t}).

In other words, for any (possibly random) function $\varphi_{t}^{N}:\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{t}\to\mathbb{R}$ such that $\varphi_{t}^{N}(x_{0:t})$ is $\mathcal{F}_{t-1}^{+}$ -measurable, we have

\mathbb{E}\left[\left.{\int K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\int\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:t-1})M_{t}(x_{t-1},\mathrm{d}x_{t})\varphi_{t}^{N}(x_{0:t}).

Moreover, $\int K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})$ , for $n=1,\ldots,N$ are i.i.d. given $\mathcal{F}_{t-1}^{+}$ .

The proof is postponed until the end of this subsection. This proposition gives the expression (25) for $\mathbb{Q}_{t}^{N}$ , which is easier to manipulate than (24) and which highlights, through (26), its connection to $\mathbb{Q}_{t-1}^{N}$ . To further simplify the notations, let us define, following Douc et al., (2011), the kernel $L_{t_{1}:t_{2}}$ , for $t_{1}\leq t_{2}$ , as

(27)

L_{t_{1}:t_{2}}(x^{\star}_{0:t_{1}},\mathrm{d}x_{0:t_{2}}):=\delta_{x^{\star}_{0:t_{1}}}(\mathrm{d}x_{0:t_{1}})\prod_{s=t_{1}+1}^{t_{2}}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

In other words, for real-valued functions $\varphi_{t_{2}}=\varphi_{t_{2}}(x_{0},\ldots,x_{t_{2}})$ , we have

L_{t_{1}:t_{2}}(x^{\star}_{0:t_{1}},\varphi_{t_{2}})=\int\varphi_{t_{2}}(x_{0}^{\star},\ldots,x_{t_{1}}^{\star},x_{t_{1}+1},\ldots,x_{t_{2}})\prod_{s=t_{1}+1}^{t_{2}}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

The usefulness of these kernels will come from the simple remark $\mathbb{Q}_{t_{2}}\propto\mathbb{Q}_{t_{1}}L_{t_{1}:t_{2}}$ . We also see that

\left\lVert L_{t_{1}:t_{2}}\varphi_{t_{2}}\right\rVert_{\infty}\leq\left\lVert\varphi_{t_{2}}\right\rVert_{\infty}\prod_{s=t_{1}+1}^{t_{2}}\left\lVert G_{s}\right\rVert_{\infty},

which gives $\left\lVert L_{t_{1}:t_{2}}\right\rVert_{\infty}<\infty$ , where the norm of a kernel is defined in Subsection A. We are now in a position to state an importance sampling-like representation of $\mathbb{Q}_{t}^{N}$ .

Corollary 3.

Let $\varphi_{t}^{N}:\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{t}\to\mathbb{R}$ be a (possibly random) function such that $\varphi_{t}^{N}(x_{0:t})$ is $\mathcal{F}_{t-1}^{+}$ -measurable. Suppose that $\varphi_{t}^{N}$ is either uniformly non-negative (i.e. $\varphi_{t}^{N}(x_{0:t})\geq 0$ almost surely) or uniformly bounded (i.e. there exists a deterministic $C$ such that $|\varphi_{t}^{N}(x_{0:t})|\leq C$ almost surely). Then

\mathbb{Q}_{t}^{N}\varphi_{t}^{N}=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})},

where $\tilde{K}_{t}^{N}(n,\cdot)$ is a certain random kernel such that

•

$\mathbb{E}\left[\left.{\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}\right|{\mathcal{F}_{t-1}^{+}}\right]=(\mathbb{Q}_{t-1}^{N}L_{{t-1}:t})\varphi_{t}^{N}$ ;
•

$N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})=\ell_{t}^{N}$ ;
•

$\left(\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\right)_{n=1,\ldots,N}$ are i.i.d. given $\mathcal{F}_{t-1}^{+}$ ;
•

almost surely, $\left|\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\right|\leq\left\lVert\varphi_{t}^{N}\right\rVert_{\infty}\left\lVert G_{t}\right\rVert_{\infty}$ if $\varphi_{t}^{N}$ is uniformly bounded and $\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\geq 0$ if $\varphi_{t}^{N}$ is uniformly non-negative.

These statements are valid for $t=0$ under the convention $\mathbb{Q}_{-1}^{N}L_{-1:0}=\mathbb{Q}_{-1}L_{-1:0}=\mathbb{M}_{0}$ and $\mathcal{F}_{-1}$ being the trivial $\sigma$ -algebra.

Proof.

Put $\tilde{K}_{t}^{N}(n,\varphi_{t}^{N}):=\int G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})$ where $K_{t}^{N}$ is defined in Proposition 5. Then

\begin{split}\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})&=\frac{N^{-1}\sum_{n}\int G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})}{\ell_{t}^{N}}\\ &=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{\ell_{t}^{N}}.\end{split}

Since $\mathbb{Q}_{t}^{N}$ is a probability measure, applying this identity twice yields

\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})=\frac{\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})}{\mathbb{Q}_{t}^{N}(\mathbbm{1})}=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})}.

The remaining points are simple consequences of the definition of $\tilde{K}_{t}^{N}$ and $L_{t-1:t}$ . ∎

The corollary hints at a natural induction proof for Theorem 2.1.

Proof of Theorem 2.1.

The following calculations are valid for all $T\geq 0$ , under the convention defined at the end of Corollary 3. They will prove (8) for $T=0$ and, at the same time, prove it for any $T\geq 1$ under the hypothesis that it already holds true for $T-1$ . Let $\varphi_{T}=\varphi_{T}(x_{0},\ldots,x_{T})$ be a real-valued function on $\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{T}$ . Write

(28)

\sqrt{N}(\mathbb{Q}_{T}^{N}\varphi_{T}-\mathbb{Q}_{T}\varphi_{T})=\sqrt{N}\left(\frac{{N}^{-1}\sum_{n}\tilde{K}_{T}^{N}(n,\varphi_{T})}{{N}^{-1}\sum_{n}\tilde{K}_{T}^{N}(n,\mathbbm{1})}-\frac{\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T}}{\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1}}\right)

where the rewriting of $\mathbb{Q}_{T}\varphi_{T}$ is a consequence of $Q_{T}\propto\mathbb{Q}_{T-1}L_{T-1:T}$ . We will bound this difference by Hoeffding’s inequalities for ratios (see Supplement E.13 for notations, including the definition of sub-Gaussian variables that we shall use below). We have

•

that $\sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\varphi_{T})-\mathbb{Q}_{T-1}^{N}L_{T-1:T}\varphi_{T})$ is $(1,\left\lVert\varphi_{T}\right\rVert_{\infty}\left\lVert G_{T}\right\rVert_{\infty})$ -sub-Gaussian conditioned on $\mathcal{F}_{t-1}^{+}$ because of Theorem E.15 (and thus unconditionally, by the law of total expectation);
•

and that $\sqrt{N}({N}^{-1}\mathbb{Q}_{T-1}^{N}L_{T-1:T}\varphi_{T}-\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T})$ is sub-Gaussian with parameters

$(C_{T-1},S_{T-1}\left\lVert L_{T-1:T}\right\rVert_{\infty}\left\lVert\varphi_{T}\right\rVert_{\infty})$

if $T\geq 1$ by induction hypothesis. The quantity is equal to $0$ if $T=0$ .

This permits to apply Lemma E.16, which results in the sub-Gaussian properties of

•

the quantity $\sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\varphi_{T})-\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T})$ , with parameters $(1+C_{T-1},S_{T-1}^{\prime}\left\lVert\varphi_{T}\right\rVert_{\infty})$ , for a certain constant $S_{T-1}^{\prime}$ ;
•

and the quantity $\sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\mathbbm{1})-\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1})$ , which is a special case of the former one, with parameters $(1+C_{T-1},S_{T-1}^{\prime})$ .

Finally, we invoke Proposition 11 and deduce the sub-Gaussian property of (28) with parameters

\left(2+2C_{T-1},2\frac{S^{\prime}_{T-1}\left\lVert\varphi_{T}\right\rVert_{\infty}}{\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1}}\right)

which finishes the proof. ∎

Proof of Proposition 5.

From (24), we have

\begin{split}\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t})&=\sum_{i_{t}}\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}i_{t})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\\ &=(\ell_{t}^{N})^{-1}N^{-1}\sum_{i_{t}}G_{t}(X_{t}^{i_{t}})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\\ &=(\ell_{t}^{N})^{-1}N^{-1}\sum_{i_{t}}G_{t}(x_{t})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\end{split}

since $\bar{\mathbb{Q}}_{t}^{N}(dx_{0:t}|i_{t})$ has a $\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})$ term. In fact, the identity

\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t},\mathrm{d}i_{t-1}|i_{t})=\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})

follows directly from the backward recursive nature of Algorithm 2, and thus

(29)

\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})=\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})\int_{i_{t-1}}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1}).

The $\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})$ term is $\mathcal{F}_{t-1}^{+}$ -measurable. We shall calculate the expectation of $\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})$ given $\mathcal{F}_{t-1}^{+}$ . The following arguments are necessary for formal verification, but the result (30) is natural in light of the ancestor regeneration intuition explained in Section 2.4.

Let $f_{t}^{N}:\left\{1,\ldots,N\right\}\times\mathcal{X}_{t}\to\mathbb{R}$ be a (possibly random) function such that $f_{t}^{N}(i_{t-1},x_{t})$ is $\mathcal{F}_{t-1}^{+}$ -measurable. Let $J_{t}^{i_{t}}$ be a random variable such that given $\mathcal{F}_{t-1}^{+}$ , $X_{t}^{i_{t}}$ and $\hat{B}_{t}^{N}(i_{t},\cdot)$ , $J_{t}^{i_{t}}$ is $B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})$ -distributed. This automatically makes $J_{t}^{i_{t}}$ satisfy the second hypothesis of Theorem 2.1. Additionally, by virtue of its first hypothesis, the distribution of $(J_{t}^{i_{t}},A_{t}^{i_{t}})$ is the same given either $\mathcal{F}_{t-1}^{+}$ or $X_{t-1}^{1:N}$ (see also Figure 1). We can now write

\begin{split}&\mathbb{E}\left[\left.{\int f_{t}^{N}(i_{t-1},x_{t})\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{\int f_{t}^{N}(i_{t-1},X_{t}^{i_{t}})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{\mathbb{E}\left[\left.{f_{t}^{N}(J_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+},X_{t}^{i_{t}},\hat{B}_{t}^{N}(i_{t},\cdot)}\right]}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{f_{t}^{N}(J_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+}}\right]\text{ by the law of total expectation}\\ =&\mathbb{E}\left[\left.{f_{t}^{N}(A_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+}}\right]\text{ by the second hypothesis of Theorem~{}\ref{thm:convergence_mcmc}}\\ =&\int f_{t}^{N}(i_{t-1},x_{t})\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t}).\end{split}

This equality means that

(30)

\mathbb{E}\left[\left.{\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t}),

Now, put

K^{N}(i_{t},\mathrm{d}x_{0:t}):=\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t}).

From (29) and (30), we have

\begin{split}\mathbb{E}\left[\left.{K^{N}(i_{t},\mathrm{d}x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]&=\int_{i_{t-1}}\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\\ &=M_{t}(x_{t-1},\mathrm{d}x_{t})\int_{i_{t-1}}\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\\ &\text{since }\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\text{ has a }\delta_{X_{t-1}^{i_{t-1}}(\mathrm{d}x_{t-1})}\text{ term}\\ &=M_{t}(x_{t-1},\mathrm{d}x_{t})\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:{t-1}})\end{split}

which finishes the proof. ∎

E.2. Proof of Equation (11) (online smoothing recursion)

Proof.

Using (7) and the matrix notations, the distribution $\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}i_{s})$ can be represented by the $1\times N$ vector

\hat{q}_{s|t}^{N}:=[W_{t}^{1}\ldots W_{t}^{N}]\hat{B}_{t}^{N}\ldots\hat{B}_{s+1}^{N}.

Defining the $N\times N$ matrix $\hat{\psi}_{s}^{N}$ as

\hat{\psi}_{s}^{N}[i_{s-1},i_{s}]:=\psi_{s}(X_{s-1}^{i_{s-1}},X_{s}^{i_{s}}),

we have

	$\displaystyle\mathbb{E}_{\mathbb{Q}_{t}^{N}}[\psi_{s}(X_{s-1},X_{s})]$	$\displaystyle=\sum_{i_{s},i_{s-1}}\hat{q}_{s\|t}^{N}[1,i_{s}]\hat{B}_{s}^{N}[i_{s},i_{s-1}]\hat{\psi}_{s}^{N}[i_{s-1},i_{s}]$
		$\displaystyle=\sum_{i_{s}}\hat{q}_{s\|t}^{N}[1,i_{s}](\hat{B}_{s}^{N}\hat{\psi}_{s}^{N})[i_{s},i_{s}]$
		$\displaystyle=\hat{q}_{s\|t}^{N}\operatorname{diag}(\hat{B}_{s}^{N}\hat{\psi}_{s}^{N}).$

Therefore,

\mathbb{Q}_{t}^{N}\varphi_{t}=\sum_{s=0}^{t}[W_{t}^{1}\ldots W_{t}^{N}]\hat{B}_{t}^{N}\ldots\hat{B}_{s+1}^{N}\operatorname{diag}(\hat{B}_{s}^{N}\hat{\psi}_{s}^{N})

from which follows the recursion

\begin{cases}\mathbb{Q}_{t}^{N}\varphi_{t}&\ =[W_{t}^{1}\ldots W_{t}^{N}]\hat{S}_{t}^{N},\\ \hat{S}_{t}^{N}&:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}+\operatorname{diag}(\hat{B}_{t}^{N}\hat{\psi}_{t}^{N}).\end{cases}

This is exactly (11). ∎

E.3. Proof of Theorem 2.2 (general stability theorem)

The following lemma describes the simultaneous backward construction of two trajectories $\mathcal{I}_{0:T}^{1}$ and $\mathcal{I}_{0:T}^{2}$ given $\mathcal{F}_{T}^{-}$ .

Lemma E.1.

We use the same notations as in Algorithms 1 and 2. Suppose that the hypotheses of Theorem 2.1 are satisfied. Then, given $\mathcal{I}_{t:T}^{1}$ , $\mathcal{I}_{t:T}^{2}$ and $\mathcal{F}_{T}^{-}$ ,

•

if $\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}$ , the two variables $\mathcal{I}_{t-1}^{1}$ and $\mathcal{I}_{t-1}^{2}$ are conditionally independent and their marginal distributions are respectively $B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{1},\cdot)$ and $B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{2},\cdot)$ ;

•

if $\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}$ , under the aforementioned conditioning, the two variables $\mathcal{I}_{t-1}^{1}$ and $\mathcal{I}_{t-1}^{2}$ are both marginally distributed according to $B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{1},\cdot)$ . Moreover, if (13) holds, we have

(31)

\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right|{\mathcal{I}_{t:T}^{1,2},\mathcal{F}_{T}^{-}}\right)\mathbbm{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}\geq\varepsilon_{\mathrm{S}}\ \mathbbm{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}.

In particular, the sequence of variables $(\mathcal{I}_{T-s}^{1},\mathcal{I}_{T-s}^{2})_{s=0}^{T}$ is a Markov chain given $\mathcal{F}_{T}^{-}$ .

Proof.

To simplify the notations, let $\tilde{b}_{t}^{n}$ denote the $\mathbb{R}^{n}$ vector $\hat{B}_{t}^{N}(n,\cdot)$ . The relation between variables generated by Algorithm 2 is depicted as a graphical model in Figure 21. We consider

Figure 21. Directed graph representing the relations between variables generated in Algorithm 2. Only those necessary for the proof of Lemma E.1 are included.

(32)

\begin{split}p(\tilde{b}_{t}^{1:N},i_{t-1}^{1:2}|\mathcal{F}_{T}^{-},i_{t:T}^{1:2})&=p(\tilde{b}_{t}^{1:N}|\mathcal{F}_{T}^{-},i_{t:T}^{1:2})\ p(i_{t-1}^{1:2}|\tilde{b}_{t}^{1:N},\mathcal{F}_{T}^{-},i_{t:T}^{1,2})\\ &=p(\tilde{b}_{t}^{1:N}|x_{t-1}^{1:N},x_{t}^{1:N})\ p(i_{t-1}^{1:2}|\tilde{b}_{t}^{1:N},i_{t}^{1:2})\\ &\textrm{(by properties of graphical models, see Figure~{}\ref{fig:alg2:variables})}\\ &=\left[\prod_{n}p(\tilde{b}_{t}^{n}|x_{t-1}^{1:N},x_{t}^{n})\right]\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1})\tilde{b}_{t}^{i_{t}^{2}}(i_{t-1}^{2}).\end{split}

The distribution of $i_{t-1}^{1}$ given $\mathcal{F}_{T}^{-}$ and $i_{t:T}^{1:2}$ is thus the $i_{t-1}^{1}$ -marginal of

(33)

p(\tilde{b}_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1}),

which is exactly the distribution of $p(j_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})$ where the $J$ ’s are defined in the statement of Theorem 2.1. By the second hypothesis of that theorem, the aforementioned distribution is equal to $p(a_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})$ , which is in turn no other than $B_{t}^{N,\mathrm{FFBS}}(i_{t}^{1},\cdot)$ . Moreover, if $i_{t}^{1}\neq i_{t}^{2}$ , (32) straightforwardly implies the conditional independence of $i_{t-1}^{1}$ and $i_{t-1}^{2}$ . When $i_{t}^{1}=i_{t}^{2}$ , the distribution of $i_{t-1}^{1:2}$ given $\mathcal{F}_{T}^{-}$ and $i_{t:T}^{1:2}$ is the $i_{t-1}^{1:2}$ -marginal of

p(\tilde{b}_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{2}).

Thus, we can apply (13) for $n=i_{t}^{1}$ , where $i_{t-1}^{1:2}$ here plays the role of $J_{t}^{1:2}$ there. Equation (31) is now proved. ∎

As Lemma E.1 describes the distribution of two trajectories, it immediately gives the distribution of a single trajectory.

Corollary 4.

Under the same settings as in Lemma E.1, given $\mathcal{F}_{T}^{-}$ , the distribution of $\mathcal{I}_{0:T}^{1}$ is

\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})B_{T}^{N,\mathrm{FFBS}}(i_{T},\mathrm{d}i_{T-1})\ldots B_{1}^{N,\mathrm{FFBS}}(i_{1},\mathrm{d}i_{0}).

Note that the corollary applies even if the backward kernel used in Algorithm 2 is not the FFBS one. This is due to the conditioning on $\mathcal{F}_{T}^{-}$ and the second hypothesis of Theorem 2.1.

Proof of Theorem 2.2.

First of all, we remark that as per Algorithm 2, using index variables $\mathcal{I}_{0:T}^{1:N}$ adds a level of Monte Carlo approximation to $\mathbb{Q}_{T}^{N}(\mathrm{d}x_{0:T})$ . Therefore

	$\displaystyle\mathbb{E}\left[(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T}))^{2}\right]$	$\displaystyle=\mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right]$
(34)			$\displaystyle=\mathbb{E}\left[(\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T}))^{2}\right]+$
		$\displaystyle\quad+\mathbb{E}\left[\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})}\right\|{\mathcal{F}_{T}^{-}}\right)\right]$

where the ultimate inequality is justified by the law of total expectation and Corollary 4. (Note that $(\mathcal{I}_{0:T}^{n})_{n=1}^{N}$ are identically distributed but not necessarily independent given $\mathcal{F}_{T}^{-}$ .) Using Lemma E.3 (stated and proved below) and putting $\rho:=1-\bar{M}_{\ell}/\bar{M}_{h}$ , we have

(35)			$\displaystyle\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}})}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\frac{1}{N^{2}}\sum_{n,m\leq N}\sum_{s,t\leq T}\operatorname{Cov}\left(\left.\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}}),\psi_{s}(X_{s-1}^{\mathcal{I}_{s-1}^{m}},X_{s}^{\mathcal{I}_{s}^{m}})\right\|\mathcal{F}_{T}^{-}\right)$
			$\displaystyle\leq\frac{2}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n=m\end{subarray}}\sum_{s,t\leq T}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}+$
			$\displaystyle\quad+\frac{4}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n\neq m\end{subarray}}\sum_{s,t\leq T}\frac{\tilde{C}}{N}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}$
			$\displaystyle=\left(\sum_{s,t\leq T}2\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}\right)\frac{(2\tilde{C}+1)N-2\tilde{C}}{N^{2}}$
			$\displaystyle\leq\left[\sum_{s,t\leq T}\left(\left\lVert\psi_{t}\right\rVert_{\infty}^{2}+\left\lVert\psi_{s}\right\rVert_{\infty}^{2}\right)\rho^{\left\|t-s\right\|-1}\right]\frac{2\tilde{C}+1}{N}\leq\frac{4(2\tilde{C}+1)}{N\rho(1-\rho)}\sum\left\lVert\psi_{t}\right\rVert_{\infty}^{2}.$

We now look at the first term of (34). In the fixed marginal smoothing case, for any $s\in\mathbb{Z}_{+}$ , $s\leq T$ and any function $\phi_{s}:\mathcal{X}_{s}\to\mathbb{R}$ , Douc et al., (2011) proved that

\mathbb{P}\left(\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|\geq\varepsilon\right)\leq B^{\prime}e^{-C^{\prime}N\varepsilon^{2}/\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}

for $\varphi_{T}(x_{0:T})=\phi_{s}(x_{s})$ and constants $B^{\prime}$ and $C^{\prime}$ not depending on $T$ . Using $\mathbb{E}[\Delta^{2}]=\int_{0}^{\infty}\mathbb{P}(\Delta^{2}\geq t)\mathrm{d}t$ , the inequality implies

(36)

\mathbb{E}\left[\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|^{2}\right]\leq\frac{B^{\prime}\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}{C^{\prime}N}

for $\varphi_{T}(x_{0:T})=\phi_{s}(x_{s})$ . In the additive smoothing case, Dubarry and Le Corff, (2013) proved that, for $T\geq 2$ ,

(37)

\mathbb{E}\left[\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|^{2}\right]\leq\frac{C^{\prime}}{N}\left(\sum_{t=0}^{T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}\right)\left(1+\sqrt{\frac{T}{N}}\right)^{2}.

Equations (36), (37), (35) and (34) conclude the proof. ∎

The following lemma quantifies the backward mixing property induced by Assumption 2.

Lemma E.2.

Under the same setting as Theorem 2.2, we have

\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(m,\cdot),B_{t}^{N,\mathrm{FFBS}}(n,\cdot)\right)\leq 1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}

for all $m,n\in\left\{1,\ldots,N\right\}$ and $t\in\left\{1,\ldots,T\right\}$ .

Proof.

We have

	$\displaystyle 1-\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(m,\cdot),B_{t}^{N,\mathrm{FFBS}}(n,\cdot)\right)$
	$\displaystyle=\left[\sum_{i=1}^{N}\min\left(\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{m})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{m})},\right.\right.$
	$\displaystyle\qquad\left.\left.\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{n})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{n})}\right)\right]\text{ by Lemma~{}\ref{lem:properties_TV} (Supplement~{}\ref{apx:tv})}$
	$\displaystyle\geq\left[\sum_{i=1}^{N}\frac{G_{t}(X_{t-1}^{i})\bar{M}_{\ell}}{\sum_{j=1}^{N}G_{t}(X_{t-1}^{j})\bar{M}_{h}}\right]\text{ by Assumption~{}\ref{asp:mt_2ways_bound}}$
	$\displaystyle=(\bar{M}_{\ell}/\bar{M}_{h}).$

∎

Lemma E.3.

Under the same settings as in Theorem 2.2, for any $m,n\in\left\{1,\ldots,N\right\}$ and $s,s^{\prime}\in\left\{0,\ldots,T\right\}$ , we have

(38)

\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}_{s-1}^{m}},X_{s}^{\mathcal{I}_{s}^{m}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}_{s^{\prime}-1}^{n}},X_{s^{\prime}}^{\mathcal{I}_{s^{\prime}}^{n}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq\quad 2\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{\left|s-s^{\prime}\right|-1}\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\times\begin{cases}2\tilde{C}/N&\text{ if }m\neq n\\ 1&\text{ if }m=n\end{cases}

where $\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}})$ is a constant that does not depend on $T$ (and which arises in the formulation of Lemma E.4). If $s$ or $s^{\prime}$ is equal to $0$ , we adopt the natural convention $\psi_{0}(x_{-1},x_{0}):=\psi_{0}(x_{0})$ .

Proof.

We first handle the case $m\neq n$ . Without loss of generality, assume that $m=1$ , $n=2$ and $s\geq s^{\prime}$ . The covariance bound of Lemma A.1 yields

(39)

\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{1}_{s-1}},X_{s}^{\mathcal{I}^{1}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{2}_{s^{\prime}-1}},X_{s^{\prime}}^{\mathcal{I}^{2}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\operatorname{TV}\left((\mathcal{I}^{1}_{s-1:s},\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}})|\mathcal{F}_{T}^{-},(\mathcal{I}^{1}_{s-1:s}|\mathcal{F}_{T}^{-})\otimes(\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}|\mathcal{F}_{T}^{-})\right).

We shall bound this total variation distance via the coupling inequality of Lemma A.1 (Supplement A.2). The idea is to construct, in addition to $\mathcal{I}^{1}_{0:T}$ and $\mathcal{I}^{2}_{0:T}$ , two trajectories $\mathcal{I}^{*1}_{0:T}$ and $\mathcal{I}^{*2}_{0:T}$ i.i.d. given $\mathcal{F}_{T}^{-}$ such that each of them is conditionally distributed according to $\mathcal{I}^{1}_{0:T}$ (cf. Corollary 4). To make the coupling inequality efficient, it is desirable to make $\mathcal{I}^{1}_{0:T}$ and $\mathcal{I}^{*1}_{0:T}$ as similar as possible (same thing for $\mathcal{I}^{2}_{0:T}$ and $\mathcal{I}^{*2}_{0:T}$ ).

The detailed construction of the four trajectories $\mathcal{I}^{1}_{0:T}$ , $\mathcal{I}^{2}_{0:T}$ , $\mathcal{I}^{*1}_{0:T}$ and $\mathcal{I}^{*2}_{0:T}$ given $\mathcal{F}_{T}^{-}$ is described in Algorithm 15. In particular, we ensure that $\forall t\geq s-1$ , we have $\mathcal{I}^{1}_{t}=\mathcal{I}^{*1}_{t}$ . For $t\leq s-1$ , if $\mathcal{I}^{2}_{t}=\mathcal{I}^{*2}_{t}$ , it is guaranteed that $\mathcal{I}^{2}_{\ell}=\mathcal{I}^{*2}_{\ell}$ holds $\forall\ell\leq t$ . The rationale for different coupling behaviours between the times $t\geq s-1$ and $t\leq s-1$ will become clear in the proof: the former aim to control the correlation between two different trajectories $m=1$ and $n=2$ and result in the $\tilde{C}/N$ term of (38); the latter are for bounding the correlation between two times $s$ and $s^{\prime}$ and result in the $(1-\bar{M}_{\ell}/\bar{M}_{h})^{|s-s^{\prime}|-1}$ term of the same equation.

Input: Feynman-Kac model (1), variables

X_{0:T}^{1:N}

from the output of Algorithm 1, integer

s\geq 0

(see statement of Lemma E.3)

Sample

\mathcal{I}^{1}_{T},\mathcal{I}^{2}_{T}\overset{i.i.d.}{\sim}\mathcal{M}(W_{T}^{1:N})

Set

\mathcal{I}^{*1}_{T}\leftarrow\mathcal{I}^{1}_{T}

and

\mathcal{I}^{*2}_{T}\leftarrow\mathcal{I}^{2}_{T}

for $t\leftarrow T$ to $1$ do

if $\mathcal{I}^{1}_{t}\neq\mathcal{I}^{2}_{t}$ then

for $k\in\left\{1,2\right\}$ do

Sample

(\mathcal{I}_{t-1}^{k},\mathcal{I}^{*k}_{t-1})

from any maximal coupling of

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot)

and

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot)

(cf. Lemma A.1)

else

Sample the

\mathbb{R}^{N}

vector

\hat{B}_{t}^{N}(\mathcal{I}_{t}^{1},\cdot)

from

p(\hat{b}_{t}^{N}(i_{t}^{1},\cdot)|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})

Sample

\mathcal{I}_{t-1}^{1},\mathcal{I}^{2}_{t-1}\overset{i.i.d.}{\sim}\hat{B}_{t}^{N}(\mathcal{I}^{1}_{t},\cdot)

Set

k\leftarrow 1,\ell\leftarrow 2

t\geq s

and

k\leftarrow 2,\ell\leftarrow 1

otherwise

Sample

\mathcal{I}_{t-1}^{*k}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot)

such that

(\mathcal{I}_{t-1}^{*k},\mathcal{I}_{t-1}^{k})

is any maximal coupling of

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot)

and

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot)

given

\mathcal{I}_{t:T}^{1:2}

\mathcal{I}_{t:T}^{*1:2}

and

\mathcal{F}_{T}^{-}

(

(\star)

- see text for validity of this step)

Sample

\mathcal{I}_{t-1}^{*\ell}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*\ell},\cdot)

Output: Four trajectories

\mathcal{I}_{0:T}^{1}

\mathcal{I}^{2}_{0:T}

\mathcal{I}^{*1}_{0:T}

\mathcal{I}^{*2}_{0:T}

to be used in the proof of Lemma E.3

Algorithm 15 Sampler for the variables

\mathcal{I}^{1}_{0:T}

\mathcal{I}^{2}_{0:T}

\mathcal{I}^{*1}_{0:T}

and

\mathcal{I}^{*2}_{0:T}

(see proof of Lemma E.3)

The correctness of Algorithm 15 is asserted by Lemma E.1. Step $(\star)$ is valid because that lemma states that the distribution of $\mathcal{I}_{t-1}^{k}$ given $\mathcal{F}_{T}^{-}$ , $\mathcal{I}_{t:T}^{1,2}$ and $\mathcal{I}_{t:T}^{*1,2}$ is $B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot)$ . Furthermore, we note that $(R_{T-t})_{t=0}^{T}$ where

R_{t}:=(\mathcal{I}_{t}^{1},\mathcal{I}_{t}^{2},\mathcal{I}^{*1}_{t},\mathcal{I}_{t}^{*2}),

is a Markov chain given $\mathcal{F}_{T}^{-}$ .

From (39), applying the coupling inequality of Lemma A.1 gives

(40)

\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{1}_{s-1}},X_{s}^{\mathcal{I}^{1}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{2}_{s^{\prime}-1}},X_{s^{\prime}}^{\mathcal{I}^{2}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{(\mathcal{I}^{1}_{s-1:s},\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}})\neq(\mathcal{I}^{*1}_{s-1:s},\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}})}\right|{\mathcal{F}_{T}^{-}}\right)\\ =2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)

where the last equality results from the construction of Algorithm 15. The sub-case $s=s^{\prime}$ following directly from Lemma E.4, we now focus on the sub-case $s\geq s^{\prime}+1$ . For all $t\leq s-1$ ,

(41)			$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{*2}_{t-1}}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}}\right\|{\mathcal{F}_{T}^{-}}\right)$
			by construction of Algorithm 15
			$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)}\right\|{\mathcal{F}_{T}^{-}}\right]$
			by the law of total expectation
			$\displaystyle=\mathbb{E}\left[\left.{\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{2}_{t},\cdot),B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{2}_{t},\cdot)\right)\mathbb{1}\left\{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}\right\}}\right\|{\mathcal{F}_{T}^{-}}\right]$
			$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by Lemma~{}\ref{lem:backward_mixing}}.$

Thus

	$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}}}\right\|{\mathcal{F}_{T}^{-}}\right)$
	$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by construction of Algorithm~{}\ref{algo:four_trajs}}$
	$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s-1}\neq\mathcal{I}^{*2}_{s-1}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by applying \eqref{eq:two_twostar_diff_bw} recursively}$
	$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\frac{\tilde{C}}{N}\text{ by Lemma~{}\ref{lem:I_s_star_likely_same},}$

which, combined with (40) finishes the proof for the current sub-case $s\geq s^{\prime}+1$ . It remains to show (38) when $m=n$ . The proof follows the same lines as in the case $m\neq n$ , although we shall briefly outline some arguments to show how the factor $\tilde{C}/N$ disappeared. The case $s=s^{\prime}$ being trivial, suppose that $s\geq s^{\prime}+1$ and without loss of generality that $m=n=3$ . To use the coupling tools of Lemma A.1, we construct trajectories $\mathcal{I}^{3}_{0:T}$ , $\mathcal{I}^{*3}_{0:T}$ and $\mathcal{I}^{*4}_{0:T}$ via Algorithm 16 and write, in the spirit of (40):

Input: Feynman-Kac model (1), variables

X_{0:T}^{1:N}

from the output of Algorithm 1, integer

s\geq 0

(see statement of Lemma E.3)

Sample

\mathcal{I}^{*3}_{T},\mathcal{I}^{*4}_{T}\overset{i.i.d.}{\sim}\mathcal{M}(W_{T}^{1:N})

Set

\mathcal{I}^{3}_{T}\leftarrow\mathcal{I}^{*3}_{T}

for $t\leftarrow T$ to 1 do

if $t\geq s$ then

Sample

\mathcal{I}^{*3}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*3}_{t},\cdot)

and

\mathcal{I}^{*4}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*4}_{t},\cdot)

Set

\mathcal{I}^{3}_{t-1}\leftarrow\mathcal{I}^{*3}_{t-1}

else

Sample

(\mathcal{I}^{3}_{t-1},\mathcal{I}^{*4}_{t-1})

from a maximal coupling of

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{3}_{t},\cdot)

and

B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*4}_{t},\cdot)

Sample

\mathcal{I}^{*3}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*3}_{t},\cdot)

Output: Three trajectories

\mathcal{I}^{3}_{0:T}

\mathcal{I}^{*3}_{0:T}

and

\mathcal{I}^{*4}_{0:T}

to be used in the proof of Lemma E.3

Algorithm 16 Sampler for the variables

\mathcal{I}^{3}_{0:T}

\mathcal{I}^{*3}_{0:T}

and

\mathcal{I}^{*4}_{0:T}

(see proof of Lemma E.3)

(42)

\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{3}_{s-1}},X_{s}^{\mathcal{I}^{3}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{3}_{s-1}},X_{s^{\prime}}^{\mathcal{I}^{3}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{(\mathcal{I}^{3}_{s-1:s},\mathcal{I}^{3}_{s^{\prime}-1:s^{\prime}})\neq(\mathcal{I}^{*3}_{s-1:s},\mathcal{I}^{*4}_{s^{\prime}-1:s^{\prime}})}\right|{\mathcal{F}_{T}^{-}}\right)\\ =2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s^{\prime}}\neq\mathcal{I}^{*4}_{s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)

where the last equality follows from the construction of Algorithm 16 and the hypothesis $s\geq s^{\prime}+1$ . For all $t\leq s-1$ , the inequality

(43)

\mathbb{P}\left(\left.{\mathcal{I}^{3}_{t-1}\neq\mathcal{I}^{*4}_{t-1}}\right|{\mathcal{F}_{T}^{-}}\right)\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{P}\left(\left.{\mathcal{I}^{3}_{t}\neq\mathcal{I}^{*4}_{t}}\right|{\mathcal{F}_{T}^{-}}\right)

can be proved using the same techniques as those used to prove (41): applying Lemma E.2 given $(\mathcal{I}^{3}_{t},\mathcal{I}^{*3}_{t},\mathcal{I}^{*4}_{t})$ then invoking the law of total expectation. Repeatedly instantiating (43) gives

	$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s^{\prime}}\neq\mathcal{I}^{*4}_{s^{\prime}}}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s-1}\neq\mathcal{I}^{*4}_{s-1}}\right\|{\mathcal{F}_{T}^{-}}\right)$
		$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}$

which, when plugged into (42), finishes the proof. ∎

Lemma E.4.

For $\mathcal{I}_{s}^{2}$ and $\mathcal{I}_{s}^{2*}$ defined by the output of Algorithm 15, we have

	$\displaystyle\mathbb{P}\left(\mathcal{I}^{2}_{s}\neq\mathcal{I}^{*2}_{s}\|\mathcal{F}_{T}^{-}\right)$	$\displaystyle\leq\tilde{C}/N,\text{ and }$
	$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s-1}\neq\mathcal{I}^{*2}_{s-1}}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\leq\tilde{C}/N,\text{ if }s\geq 1,$

for some constant $\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}})$ .

Proof.

Define $A_{t}:=\mathbb{1}\left\{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}\right\}$ , $B_{t}:=\mathbb{1}\left\{\mathcal{I}_{t}^{2}=\mathcal{I}_{t}^{*2}\right\}$ and $\Gamma_{t}:=A_{t}B_{t}$ and recall that $R_{t}:=(\mathcal{I}_{t}^{1},\mathcal{I}_{t}^{2},\mathcal{I}_{t}^{*1},\mathcal{I}_{t}^{*2})$ . The sequence $(R_{T-\ell})_{\ell=0}^{T}$ is a Markov chain given $\mathcal{F}_{T}^{-}$ , but this is not necessarily the case for the sequence $(\Gamma_{T-\ell})_{\ell=0}^{T}$ of Bernoulli random variables. Nevertheless, Lemma E.5 below shows that one can get bounds on two-step “transition probabilities” for $(\Gamma_{T-\ell})$ , i.e. the probabilities under $\mathcal{F}_{T}^{-}$ that $\Gamma_{t-2}=1$ given $\Gamma_{t}$ and $R_{t}$ . This motivates our following construction of actual Markov chains approximating the dynamic of $\Gamma_{t}$ . Let $\Gamma^{*}_{T}$ and $\Gamma^{*}_{T-1}$ be two independent Bernoulli random variables given $\mathcal{F}_{T}^{-}$ such that

(44)		$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$
(44)		$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right).$

Let $\Gamma^{*}_{T},\Gamma^{*}_{T-2},\Gamma^{*}_{T-4},\ldots$ and $\Gamma^{*}_{T-1},\Gamma^{*}_{T-3},\ldots$ be two homogeneous Markov chains given $\mathcal{F}_{T}^{-}$ with the same transition kernel $\overleftarrow{\mathfrak{C}^{2}}$ defined by

(45)			$\displaystyle\mathbb{P}_{\mathcal{F}_{T}^{-}}(\Gamma^{}_{t-2}=1\|\Gamma^{}_{t}=1)$		$\displaystyle=1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}$		$\displaystyle=:\overleftarrow{\mathfrak{C}^{2}}_{11}$
(45)			$\displaystyle\mathbb{P}_{\mathcal{F}_{T}^{-}}(\Gamma^{}_{t-2}=1\|\Gamma^{}_{t}=0)$		$\displaystyle=\frac{\bar{M}_{\ell}\varepsilon_{\mathrm{S}}}{2\bar{M}_{h}}$		$\displaystyle=:\overleftarrow{\mathfrak{C}^{2}}_{01}$

where for two events $E_{1}$ , $E_{2}$ , the notation $\mathbb{P}_{\mathcal{F}_{T}^{-}}(E_{1}|E_{2})$ is the ratio between $\mathbb{P}\left(\left.{E_{1},E_{2}}\right|{\mathcal{F}_{T}^{-}}\right)$ and $\mathbb{P}\left(\left.{E_{2}}\right|{\mathcal{F}_{T}^{-}}\right)$ . We shall now prove by backward induction the following statement:

(46)

\mathbb{P}\left(\left.{\Gamma_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right)\geq\mathbb{P}\left(\left.{\Gamma^{*}_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right),\forall t\geq s-1.

Firstly, (46) holds for $t=T$ and $t=T-1$ . Now suppose that it holds for some $t\geq s+1$ and we wish to justify it for $t-2$ . By Lemma E.5,

	$\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{t}=1}$	$\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{11}\mathbb{1}_{\Gamma_{t}=1}$
	$\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{t}=0}$	$\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{01}\mathbb{1}_{\Gamma_{t}=0}.$

Applying the law of total expectation gives

	$\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{11}\mathbb{P}\left(\left.{\Gamma_{t}=1}\right\|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}\mathbb{P}\left(\left.{\Gamma_{t}=0}\right\|{\mathcal{F}_{T}^{-}}\right)$
		$\displaystyle=\left(\overleftarrow{\mathfrak{C}^{2}}_{11}-\overleftarrow{\mathfrak{C}^{2}}_{01}\right)\mathbb{P}\left(\left.{\Gamma_{t}=1}\right\|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}$
		$\displaystyle\geq\left(\overleftarrow{\mathfrak{C}^{2}}_{11}-\overleftarrow{\mathfrak{C}^{2}}_{01}\right)\mathbb{P}\left(\left.{\Gamma^{*}_{t}=1}\right\|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}$
		$\displaystyle\text{ if }N\text{ is large enough, by induction hypothesis}$
		$\displaystyle=\mathbb{P}\left(\left.{\Gamma^{*}_{t-2}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$

and (46) is now proved. To finish the proof of the lemma, it is necessary to lower bound its right hand side. We start by controlling the distribution $\Gamma^{*}_{t}$ for $t=T$ and $t=T-1$ . We have

(47)	$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by \eqref{eq:gammastar_initial}}$
		$\displaystyle=1-\mathbb{P}\left(\left.{A_{T}=0}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ as }B_{T}=1\text{ by Algorithm~{}\ref{algo:four_trajs}}$
		$\displaystyle=1-\sum_{i=1}^{N}\mathbb{P}\left(\left.{\mathcal{I}_{T}^{1}=\mathcal{I}_{T}^{2}=i}\right\|{\mathcal{F}_{T}^{-}}\right)$
		$\displaystyle=1-\sum_{i=1}^{N}\left(\frac{G(X_{T}^{i})}{\sum_{j=1}^{N}G(X_{T}^{j})}\right)^{2}$
		$\displaystyle\geq 1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}\text{ by Assumption~{}\ref{asp:g_2ways_bound}}$

and

(48)	$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\geq\mathbb{P}\left(\left.{\Gamma_{T}=1,\Gamma_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$
		$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\Gamma_{T}=1,\Gamma_{T-1}=1}\right\|{R_{T},\mathcal{F}_{T}^{-}}\right)}\right\|{\mathcal{F}_{T}^{-}}\right]$
		by the law of total expectation
		$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\Gamma_{T-1}=1}\right\|{R_{T},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{T}=1}}\right\|{\mathcal{F}_{T}^{-}}\right]$
		$\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{P}\left(\left.{\Gamma_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ via \eqref{ieq_c_4trajs_backward_reg_bound}}$
		$\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}\right].$

The contraction property of Lemma A.1 makes it possible to relate the intermediate distributions $\Gamma^{*}_{t}|\mathcal{F}_{T}^{-}$ to the end point ones $\Gamma^{*}_{T-1}|\mathcal{F}_{T}^{-}$ and $\Gamma^{*}_{T}|\mathcal{F}_{T}^{-}$ . More specifically, (45) and Lemma A.1 lead to

(49)

\operatorname{TV}(\Gamma^{*}_{t}|\mathcal{F}_{T}^{-},\mu^{*})\leq\max\left(\operatorname{TV}(\Gamma^{*}_{T}|\mathcal{F}_{T}^{-},\mu^{*}),\operatorname{TV}(\Gamma^{*}_{T-1}|\mathcal{F}_{T}^{-},\mu^{*})\right)

where $\mu^{*}$ is the invariant distribution of a Markov chain with transition matrix $\overleftarrow{\mathfrak{C}^{2}}$ , namely

(50)

\begin{cases}\mu^{*}(\left\{0\right\})&=\frac{\overleftarrow{\mathfrak{C}^{2}}_{10}}{\overleftarrow{\mathfrak{C}^{2}}_{01}+\overleftarrow{\mathfrak{C}^{2}}_{10}}\\ \mu^{*}(\left\{1\right\})&=1-\mu^{*}(\left\{0\right\}).\end{cases}

Furthermore, an alternative expression of the total variation distance given in Lemma A.1 implies that the total variation distance between two Bernoulli distributions of parameters $p$ and $q$ is $\left|p-q\right|$ . Combining this with (49), the triangle inequality and the rough estimate $\max(a,b)\leq a+b\ \forall a,b\geq 0$ , we get

\mathbb{P}\left(\left.{\Gamma^{*}_{t}=0}\right|{\mathcal{F}_{T}^{-}}\right)\leq 3\mu^{*}(\left\{0\right\})+\mathbb{P}\left(\left.{\Gamma^{*}_{T}=0}\right|{\mathcal{F}_{T}^{-}}\right)+\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=0}\right|{\mathcal{F}_{T}^{-}}\right)\leq\tilde{C}/N

where $\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}})$ . The last inequality is straightforwardly derived by plugging respectively (50), (47) and (48) into the three terms of the preceding sum. This combined with (46) finishes the proof. ∎

Lemma E.5.

For $s$ defined in the statement of Lemma E.3; $A_{t}$ , $B_{t}$ and $R_{t}$ defined in the proof of Lemma E.4 and all $t\geq s+1$ , we have

	$\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}$	$\displaystyle\geq\left(1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t}B_{t}=1};$
	$\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\geq\frac{\bar{M}_{\ell}\varepsilon_{\mathrm{S}}}{2\bar{M}_{h}}$

where the inequalities hold for $N$ large enough, i.e., $N\geq N_{0}=N_{0}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}})$ .

Proof.

We start by showing the following three inequalities for all $t\geq s$ and $N$ sufficiently large:

(51)	$\displaystyle\mathbb{P}\left(\left.{A_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)$	$\displaystyle\geq\varepsilon_{\mathrm{S}};$
(52)	$\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}$	$\displaystyle\geq(\bar{M}_{\ell}/2\bar{M}_{h})\mathbb{1}_{A_{t}=1};$
(53)	$\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}$	$\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{1}_{A_{t}B_{t}=1}.$

For (51), we have

(54)

\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}\neq 1}=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}\geq\varepsilon_{\mathrm{S}}\mathbb{1}_{A_{t}\neq 1}

by Lemma E.1. Next,

(55)			$\displaystyle\mathbb{P}\left(\left.{A_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}$
			$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}$
			$\displaystyle=\left[1-\sum_{i}\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}=\mathcal{I}_{t-1}^{2}=i}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}$
			$\displaystyle=\left[1-\sum_{i=1}^{N}\prod_{k=1}^{2}\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{\mathcal{I}_{t}^{k}})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{\mathcal{I}_{t}^{k}})}\right]\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}\text{ by Lemma~{}\ref{lem:two_backward_traj}}$
			$\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{1}_{A_{t}=1}\text{ by Assumptions~{}\ref{asp:mt_2ways_bound} and~{}\ref{asp:g_2ways_bound}.}$

Combining (54) and (55) yields (51) for $N$ large enough. To prove (52), we write

(56)			$\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}$
			$\displaystyle=\left[1-\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=0}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{A_{t}=1}$
			$\displaystyle\geq\left[1-\mathbb{P}\left(\left.{A_{t-1}=0}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)-\mathbb{P}\left(\left.{B_{t-1}=0}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{A_{t}=1}$
			$\displaystyle=\left[\mathbb{P}\left(\left.{A_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)+\mathbb{P}\left(\left.{B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)-1\right]\mathbb{1}_{A_{t}=1}.$

We analyse the second term in the above expression. We have

(57)			$\displaystyle\mathbb{P}\left(\left.{B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}$
			$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{2}=\mathcal{I}_{t-1}^{*2}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}$
			$\displaystyle=\left[1-\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{2},\cdot),B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*2},\cdot)\right)\right]\mathbb{1}_{A_{t}=1}$
			by construction of Algorithm 15
			$\displaystyle\geq(\bar{M}_{\ell}/\bar{M}_{h})\mathbb{1}_{A_{t}=1}\text{ by Lemma~{}\ref{lem:backward_mixing}.}$

Plugging (55) and (57) into (56) yields

\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}\geq\left(-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}+\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{1}_{A_{t}=1}

and thus (52) follows if $N$ is large enough. The inequality (53) is justified by combining (55), the simple decomposition $\mathbb{1}_{A_{t}B_{t}=1}=\mathbb{1}_{A_{t}=1}\mathbb{1}_{B_{t}=1}$ and the fact that Algorithm 15 guarantees $B_{t-1}=1$ if $A_{t}=B_{t}=1$ .

We can now deduce the two inequalities in the statement of the Lemma. The first one is a straightforward double application of (53):

	$\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}$
	$\displaystyle\geq\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}B_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}$
	$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}B_{t-1}=1}\right\|{R_{t-1},R_{t},\mathcal{F}_{T}^{-}}\right)}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}$
	by the law of total expectation
	$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t-1},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t-1}B_{t-1}=1}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}$
	$\displaystyle\text{ since }(R_{T-\ell})_{\ell=0}^{T}\text{ is Markov given }\mathcal{F}_{T}^{-}$
	$\displaystyle\geq\mathbb{E}\left[\left.{\left(1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t-1}B_{t-1}=1}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}$
	$\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]^{2}\mathbb{1}_{A_{t}B_{t}=1}\geq\left(1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t}B_{t}=1}.$

Finally, we have

	$\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)$
	$\displaystyle\geq\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)$
	$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right\|{R_{t-1},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t-1}=1}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right]$
	using law of total expectation and the Markov property as above
	$\displaystyle\geq\frac{\bar{M}_{\ell}}{2\bar{M}_{h}}\mathbb{P}\left(\left.{A_{t-1}=1}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)\text{ by \eqref{ieq_b_4trajs_backward_reg_bound}}$
	$\displaystyle\geq\frac{\bar{M}_{\ell}}{2\bar{M}_{h}}\varepsilon_{\mathrm{S}}\text{ by \eqref{ieq_a_4trajs_backward_reg_bound}}$

and the second inequality is proved. ∎

E.4. Proof of Proposition 3 (hybrid rejection validity)

Proof.

Put $Z_{n}:=(X_{n},U_{n}C\mu_{0}(X_{n}))$ . Then $Z_{n}$ is uniformly distributed on

\mathcal{G}_{0}:=\left\{(x,y)\in\mathcal{X}\times\mathbb{R}_{+},y\leq C\mu_{0}(x)\right\}.

The proof would be done if one could show that, given $K^{*}\leq K$ , the variable $Z_{K^{*}}$ is uniformly distributed on

\mathcal{G}_{1}:=\left\{(x,y)\in\mathcal{X}\times\mathbb{R}_{+},y\leq\mu_{1}(x)\right\}.

Note that $K^{*}$ is, by definition, the first time index where the sequence $(Z_{n})$ touches $\mathcal{G}_{1}$ . Let $B$ be any subset of $\mathcal{G}_{1}$ . We have

(58)			$\displaystyle\mathbb{P}\left(\left.{Z_{K^{}}\in B}\right\|{K^{}\leq K}\right)\propto\mathbb{P}(Z_{K^{}}\in B,K^{}\leq K)$
			$\displaystyle=\sum_{k^{}=1}^{\infty}\mathbb{P}\left(Z_{k^{}}\in B,K^{}=k^{},K\geq k^{*}\right)$
			$\displaystyle=\sum_{k^{}=1}^{\infty}\mathbb{P}\left(Z_{k^{}}\in B,Z_{1:k^{}-1}\notin\mathcal{G}_{1},K>k^{}-1\right)$
			$\displaystyle=\sum_{k^{}=1}^{\infty}\mathbb{P}(Z_{k^{}}\in B)\mathbb{P}\left(Z_{1:k^{}-1}\notin\mathcal{G}_{1},K>k^{}-1\right)\mbox{since $K$ stopping time}$
			$\displaystyle=\mathbb{P}(Z_{1}\in B)\sum_{k^{}=1}^{\infty}\mathbb{P}\left(Z_{1:k^{}-1}\notin\mathcal{G}_{1},K>k^{*}-1\right)$
			$\displaystyle\propto\mathbb{P}(Z_{1}\in B)\propto\mathbb{P}\left(\left.{Z_{1}\in B}\right\|{Z_{1}\in\mathcal{G}_{1}}\right).$

By considering the special case $B=\mathcal{G}_{1}$ , we see that the constant of proportionality between the first and the last terms of (58) must be $1$ , from which the proof follows. ∎

E.5. Proof of Theorem 3.1 (hybrid algorithm’s intermediate complexity)

From (16), one may have the correct intuition that as $N\to\infty$ , $\tau_{t}^{1,\mathrm{PaRIS}}$ tends in distribution to that of the variable $\tau_{t}^{\infty,\mathrm{PaRIS}}$ defined as

(59)

\tau_{t}^{\infty,\mathrm{PaRIS}}\textrm{ }|\textrm{ }X_{t}^{\infty,\mathrm{PaRIS}}\sim\operatorname{Geo}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)

where $X_{t}^{\infty,\mathrm{PaRIS}}\sim\mathbb{Q}_{t-1}M_{t}(\mathrm{d}x_{t})$ is distributed according to the predictive distribution of $X_{t}$ given $Y_{0:t-1}$ and $r_{t}$ is the density of $X_{t}^{\infty,\mathrm{PaRIS}}$ with respect to the Lebesgue measure (cf. Definition 1). The following proposition formalises the connection between $\tau_{t}^{1,\mathrm{PaRIS}}$ and $\tau_{t}^{\infty,\mathrm{PaRIS}}$ .

Proposition 6.

We have $\tau_{t}^{1,\mathrm{PaRIS}}\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}}$ as $N\to\infty$ .

Proof.

From (16) and Definition 1 one has

(60)

\tau_{t}^{1,\mathrm{PaRIS}}\textrm{ }|\textrm{ }X_{t}^{1},\mathcal{F}_{t-1}\sim\operatorname{Geo}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right).

In light of (59), it suffices to establish that

(61)

\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\Rightarrow\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}.

Indeed, this would mean that for any continuous bounded function $\psi$ , we have

	$\displaystyle\mathbb{E}[\psi(\tau_{t}^{1,\mathrm{PaRIS}})]=\mathbb{E}\left[(\operatorname{Geo}^{\star}\psi)\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)\right]$	$\displaystyle\rightarrow\mathbb{E}\left[(\operatorname{Geo}^{\star}\psi)\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)\right]$
		$\displaystyle=\mathbb{E}[\psi(\tau_{t}^{\infty,\mathrm{PaRIS}})]$

where $\operatorname{Geo}^{\star}$ is the geometric Markov kernel that sends each $\lambda$ to the geometric distribution of parameter $\lambda$ , i.e. $\operatorname{Geo}^{\star}(\lambda,\mathrm{d}x)=\operatorname{Geo}(\lambda)$ . To this end, write

(62)

\begin{split}r_{t}^{N}(X_{t}^{1})-r_{t}(X_{t}^{1})&=\frac{\sum_{n}G_{t-1}(X_{t-1}^{n})m_{t}(X_{t-1}^{n},X_{t}^{1})}{\sum_{n}G_{t-1}(X_{t-1}^{n})}-r_{t}(X_{t}^{1})\\ &=\frac{\sum_{n}N^{-1}G_{t-1}(X_{t-1}^{n})\left[m_{t}(X_{t-1}^{n},X_{t}^{1})-r_{t}(X_{t}^{1})\right]}{N^{-1}\sum_{n}G_{t-1}(X_{t-1}^{n})}.\end{split}

We study the mean squared error of the numerator:

	$\displaystyle\mathbb{E}\left\{\frac{1}{N}\sum_{n}G_{t-1}(X_{t-1}^{n})\left[m_{t}(X_{t-1}^{n},X_{t}^{1})-r_{t}(X_{t}^{1})\right]\right\}^{2}$
	$\displaystyle=\frac{1}{N}\mathbb{E}\left\{G_{t-1}(X_{t-1}^{1})^{2}\left[m_{t}(X_{t-1}^{1},X_{t}^{1})-r_{t}(X_{t}^{1})\right]^{2}\right\}$
	$\displaystyle\quad+\frac{N(N-1)}{N^{2}}\mathbb{E}\Big{\{}G_{t-1}(X_{t-1}^{1})G_{t-1}(X_{t-1}^{2})\left[m_{t}(X_{t-1}^{1},X_{t}^{1})-r_{t}(X_{t}^{1})\right]$
	$\displaystyle\qquad\times\left[m_{t}(X_{t-1}^{2},X_{t}^{1})-r_{t}(X_{t}^{1})\right]\Big{\}}$

where we have again used the exchangeability induced by step $(\star)$ of Algorithm 5. The first term obviously tends to $0$ as $N\to\infty$ by Assumptions 4 and 1. The second term also vanishes asymptotically thanks to Lemma E.6 below and Assumption 6. Assumption 1 also implies that the denominator of (62) converges in probability to some constant, via the consistency of particle approximations, see e.g. Del Moral, (2004) or Chopin and Papaspiliopoulos, (2020). Thus, $r_{t}^{N}(X_{t}^{1})-r_{t}(X_{t}^{1})\Rightarrow 0$ by Slutsky’s theorem. Moreover, $r_{t}(X_{t}^{1})\Rightarrow r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})$ by the continuity of $r_{t}$ and the consistency of particle approximations. Using again Slutsky’s theorem yields (61). ∎

The following lemma is needed to complete the proof of Proposition 6 and is related to the propagation of chaos property, see Del Moral, (2004, Chapter 8).

Lemma E.6.

We have $(X_{t-1}^{1},X_{t-1}^{2},X_{t}^{1})\Rightarrow\mathbb{Q}_{t-2}M_{t-1}\otimes\mathbb{Q}_{t-2}M_{t-1}\otimes Q_{t-1}M_{t}$ .

Proof.

For vectors $u$ , $v$ , and $w$ , we have, by the symmetry of the distribution of particles:

	$\displaystyle\mathbb{E}\left[\exp\left(iuX_{t-1}^{1}+ivX_{t-1}^{2}+iwX_{t}^{1}\right)\right]$
	$\displaystyle=\mathbb{E}\left[\left(\frac{1}{N}\sum e^{iuX_{t-1}^{n}}\right)\left(\frac{1}{N}\sum e^{ivX_{t-1}^{n}}\right)\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right]$
	$\displaystyle\quad-\frac{N}{N^{2}}\mathbb{E}\left[e^{iuX_{t-1}^{1}}e^{ivX_{t-1}^{1}}\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right]$
	$\displaystyle\quad+\frac{N}{N^{2}}\mathbb{E}\left[e^{iuX_{t-1}^{1}}e^{ivX_{t-1}^{2}}\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right].$

Note that

\frac{1}{N}\sum e^{iuX_{t-1}^{n}}\overset{\textrm{a.s.}}{\longrightarrow}\mathbb{Q}_{t-2}M_{t-1}\left(\exp(iu\bullet)\right)

and

\frac{1}{N}\sum e^{iwX_{t}^{n}}\overset{\textrm{a.s.}}{\longrightarrow}\mathbb{Q}_{t-1}M_{t}\left(\exp(iw\bullet)\right).

The dominated convergence theorem, applicable since $\left|e^{iu}\right|\leq 1$ for $u\in\mathbb{R}$ , finishes the proof. ∎

Proof of Theorem 3.1.

First of all,

(63)

\mathbb{E}[\tau_{t}^{\infty,\mathrm{PaRIS}}]=\mathbb{E}\left[\frac{\bar{M}_{h}}{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}\right]=\int_{\mathcal{X}_{t}}\frac{\bar{M}_{h}}{r_{t}(x_{t})}r_{t}(x_{t})\mathrm{d}x_{t}=\infty

by Assumption 5. Next, for any $x\in\mathbb{R}\setminus\mathbb{Z}$ and $N>x$ ,

\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\leq x\right)=\mathbb{P}\left(\tau_{t}^{1,\mathrm{PaRIS}}\leq x\right)\to\mathbb{P}\left(\tau_{t}^{\infty,\mathrm{PaRIS}}\leq x\right)

by Proposition 6. Thus, by Portmanteau theorem,

(64)

\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}}.

Altogether, we have

	$\displaystyle\liminf_{N\to\infty}\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]$	$\displaystyle=\liminf_{N\to\infty}\sum k\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)=k\right)$
		$\displaystyle\geq\sum\liminf_{N\to\infty}k\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)=k\right)\textrm{ by Fatou's lemma}$
		$\displaystyle=\sum k\mathbb{P}\left(\tau_{t}^{\infty,\mathrm{PaRIS}}=k\right)\textrm{ by \eqref{eq:proof_thm2_tnn_to_tinf}}$
		$\displaystyle=\infty{\textrm{ by \eqref{eq:proof_thm2_expectation_tauinf}}}$

and

\lim_{N\to\infty}\frac{1}{N}\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]=\lim_{N\to\infty}\mathbb{E}\left[\min\left(\frac{\tau_{t}^{1,\mathrm{PaRIS}}}{N},1\right)\right]\to 0

since $\tau_{t}^{1,\mathrm{PaRIS}}\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}}$ implies that the sequence of random variables

\min\left(\frac{\tau_{t}^{1,\mathrm{PaRIS}}}{N},1\right)

converges to $0$ in distribution while being bounded between $0$ and $1$ . ∎

E.6. Proof of Theorem 3.2 (hybrid PaRIS near-linear complexity)

The following proposition shows that the real execution time for the hybrid algorithm is asymptotically at most of the same order as the “oracle” hybrid execution time.

Proposition 7.

We have

\limsup_{N\to\infty}\frac{\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]}{\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right]}<\infty.

Proof.

Put

(65)

z^{N}(\lambda):=\frac{1-(1-\lambda)^{N}}{\lambda}=\sum_{n=0}^{N-1}(1-\lambda)^{n}.

One can quickly verify (using the memorylessness of the geometric distribution for example) that $z^{N}(\lambda)=\mathbb{E}\left[\left.{\min(G,N)}\right|{G\sim\operatorname{Geo}(\lambda)}\right]$ . It will be useful to keep in mind the elementary estimate $z^{N}(\lambda)\leq\min(N,\lambda^{-1})$ . We can now write

	$\displaystyle\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]$	$\displaystyle=\mathbb{E}\left[z^{N}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)\right]\textrm{ (by \eqref{eq:dist_tauN_with_r})}=\mathbb{E}\left[\mathbb{E}\left[\left.{z^{N}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)}\right\|{\mathcal{F}_{t-1}}\right]\right]$
		$\displaystyle=\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]$
		$\displaystyle\leq c_{t}\left(\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}(x_{t})\lambda_{t}(\mathrm{d}x_{t})+b_{t}\right)\text{ by Lemma~{}\ref{lem:magic_jensen}}$
		$\displaystyle=c_{t}\left(\mathbb{E}[\min(\tau_{t}^{{\infty,\mathrm{PaRIS}}},N)]+b_{t}\right)$

from which the proposition is immediate. ∎

Lemma E.7.

In addition to notations of Algorithm 1, let the function $z^{N}$ be defined as in (65) and the functions $r_{t}$ and $r_{t}^{N}$ be defined as in Definition 1. Let $\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}_{>0}$ be a bounded non-negative deterministic function. Then, under Assumptions 1 and 4, there exist constants $b_{t}$ and $c_{t}$ depending only on the model such that

\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right]\leq c_{t}\left(\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\phi_{t}+b_{t}\left\lVert\phi_{t}\right\rVert_{\infty}\right)

where for brevity, we shortened the integration notation (e.g. dropping $\lambda_{t}(\mathrm{d}x_{t})$ , dropping $x_{t}$ from $\phi(x_{t})$ , etc.) whenever there is no ambiguity.

Proof.

We have

(66)

\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right]\leq\int_{\mathcal{X}_{t}}z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\mathbb{E}\left[r_{t}^{N}(x_{t})\right]\phi_{t}

using Fubini’s theorem and the concavity of $\lambda\mapsto\lambda z^{N}(\lambda)$ on $[0,1]$ . By a well-known result on the bias of a particle filter (which is in fact the propagation of chaos in the special case of $q=1$ particle), we have:

	$\displaystyle\left\|\mathbb{E}\left[r_{t}^{N}(x_{t})\right]-r_{t}(x_{t})\right\|$	$\displaystyle=\left\|\mathbb{E}\left[\sum W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})\right]-r_{t}(x_{t})\right\|$
		$\displaystyle=\left\|\mathbb{E}\left[m_{t}\left(X_{t-1}^{A_{t}^{1}},x_{t}\right)\right]-\mathbb{Q}_{t-1}\left(m_{t}\left(\bullet,x_{t}\right)\right)\right\|$
		$\displaystyle\leq\frac{b_{t}\bar{M}_{h}}{N}$

for some constant $b_{t}$ . We next show that such a bias does not change the asymptotic behavior of $z^{N}$ . More precisely,

(67)	$\displaystyle z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)$	$\displaystyle\leq z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}-\frac{b_{t}}{N}\right)$
		$\displaystyle=\sum_{n=0}^{N-1}\left(\frac{1-{r_{t}(x_{t})}/{\bar{M}_{h}}+{b_{t}}/{N}}{1-{r_{t}(x_{t})}/{\bar{M}_{h}}}\right)^{n}\left(1-\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)^{n}$
		$\displaystyle\leq\sum_{n=0}^{N-1}\left(1+\frac{b_{t}}{N\left(1-r_{t}(x_{t})/\bar{M}_{h}\right)}\right)^{N}\left(1-\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)^{n}$
		$\displaystyle\leq\exp\left(\frac{b_{t}}{1-r_{t}(x_{t})/\bar{M}_{h}}\right)z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)$
		$\displaystyle\leq e^{2b_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)$

if $x_{t}$ is such that $r_{t}(x_{t})/\bar{M}_{h}\leq 1/2$ . In contrast, if $r_{t}(x_{t})/\bar{M}_{h}\geq 1/2$ , then provided that $N\geq 6b_{t}$ , we have

(68)

z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\leq z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}-\frac{b_{t}}{N}\right)\leq z^{N}\left(\frac{1}{3}\right)\leq 3z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right).

Putting together (67) and (68), we have, for $N\geq 6b_{t}$ ,

z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\leq\left(e^{2b_{t}}+3\right)z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)

and so, by (66),

	$\displaystyle\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right]$	$\displaystyle\leq\left(e^{2b_{t}}+3\right)\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)\mathbb{E}\left[r_{t}^{N}(x_{t})\right]\phi_{t}$
		$\displaystyle=\left(e^{2b_{t}}+3\right)\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{1})}{\bar{M}_{h}}\right)\phi_{t}(X_{t}^{1})\right].$

Again, using the result on the bias of a particle filter,

\left|\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{1})}{\bar{M}_{h}}\right)\phi_{t}(X_{t}^{1})\right]-\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\phi_{t}\right|\leq\frac{b_{t}\left\lVert z^{N}\right\rVert_{\infty}\left\lVert\phi_{t}\right\rVert_{\infty}}{N}=b_{t}\left\lVert\phi_{t}\right\rVert_{\infty}

which, together with the previous inequality, implies the desired result. ∎

Proposition 8.

In linear Gaussian state space models, we have

\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right]=\mathcal{O}\left((\log N)^{d_{t}/2}\right).

Proof.

Let $\mu_{t}$ and $\Sigma_{t}$ be such that $X_{t}^{\infty,\mathrm{PaRIS}}\sim\mathcal{N}(\mu_{t},\Sigma_{t})$ . Then

\log(r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})/\bar{M}_{h})=b^{\prime}_{t}-W_{t}

where $b^{\prime}_{t}$ is some constant and

W_{t}:=\frac{(X_{t}^{\infty,\mathrm{PaRIS}}-\mu_{t})^{\top}\Sigma_{t}^{-1}(X_{t}^{\infty,\mathrm{PaRIS}}-\mu_{t})}{2}\sim\operatorname{Gamma}\left(\frac{d_{t}}{2},1\right).

We have

	$\displaystyle\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right]$	$\displaystyle=\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)\right]=\mathbb{E}\left[z^{N}(e^{b^{\prime}_{t}-W_{t}})\right]$
		$\displaystyle=\int_{0}^{\infty}z^{N}(e^{b^{\prime}_{t}-w})\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w$
		$\displaystyle\leq\int_{0}^{\log N}e^{w-b^{\prime}_{t}}\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w+\int_{\log N}^{\infty}N\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w$

using the bound $z^{N}(\lambda)\leq\min(N,1/\lambda)$ . The first term is of order $\mathcal{O}(\log^{d_{t}/2}N)$ by elementary calculus, while the second term is of order $\mathcal{O}(\log^{d_{t}/2-1}N)$ using asymptotic properties of the incomplete Gamma function, see Olver et al., (2010, Section 8.11). ∎

Proof of Theorem 3.2.

The theorem is a straightforward consequence of Proposition 7 and Proposition 8. ∎

E.7. Proof of Theorems B.1 and B.2 (pure rejection FFBS complexity)

We start with a useful remark linking the projection kernels $\Pi$ and the cost-to-go functions defined in Supplement A with the L-kernels formulated in (27). The proof is simple and therefore omitted.

Lemma E.8.

We have $L_{t:T}(x_{0:t},\mathbbm{1})=H_{t:T}(x_{t})$ for all $x_{0:t}$ . Moreover, for any function $\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}$ , we have

L_{t:T}\Pi^{0:T}_{t}\phi_{t}=\Pi^{0:t}_{t}(\phi_{t}\times H_{t:T}).

Theorems B.1 and B.2 both rely on an induction argument wrapped up in the following proposition.

Proposition 9.

We use the notations of Algorithm 2. Let $\mathbb{Q}_{t}^{N}$ be defined as in (24), where the $B_{s}^{N}$ kernels can be $B_{s}^{N,\mathrm{FFBS}}$ or any other kernels satisfying the hypotheses of Theorem 2.1. Suppose that Assumption 1 holds. Let $f_{t}^{N}:\mathcal{X}_{t}\to\mathbb{R}_{\geq 0}$ be a (possibly random) function such that $f_{t}^{N}(x_{t})$ is $\mathcal{F}_{t-1}$ -measurable. Then the following assertions are true:

(a)

Suppose that $\mathbb{E}\left[\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]=\infty$ , where $r_{t}^{N}$ and $\lambda_{t}$ are defined in Definition 1. Then

$\mathbb{E}\left[\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})\right]=\infty.$
(b)

Suppose that $\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0$ . Then

$\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0.$

Proof.

Part (a). We shall prove by induction the statement

\mathbb{E}\left[\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right]=\infty,\forall\ t-1\leq s\leq T.

For $s=t-1$ , it follows from part (a)’s hypothesis and Lemma E.8. Indeed,

	$\displaystyle\mathbb{Q}_{t-1}^{N}L_{t-1:T}\Pi_{t}^{0:T}f_{t}^{N}$
	$\displaystyle=\mathbb{Q}_{t-1}^{N}L_{t-1:t}L_{t:T}\Pi_{t}^{0:T}f_{t}^{N}=\mathbb{Q}_{t-1}^{N}L_{t-1:t}\Pi_{t}^{0:t}(f_{t}^{N}\times H_{t:T})$
	$\displaystyle=\iint_{\mathcal{X}_{t-1}\times\mathcal{X}_{t}}\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{t-1})m_{t}(x_{t-1},x_{t})\lambda_{t}(\mathrm{d}x_{t})G_{t}(x_{t})(f_{t}^{N}\times H_{t:T})(x_{t})$
	$\displaystyle=\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t}).$

For $s\geq t$ , we have

	$\displaystyle\mathbb{E}\left[\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right]$	$\displaystyle=\mathbb{E}\left[\frac{{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})}{\ell_{s}^{N}}\right]\text{ by Corollary~{}\ref{corol:fundamental}}$
		$\displaystyle\geq\frac{1}{\left\lVert G_{s}\right\rVert_{\infty}}\mathbb{E}\left[{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})\right]$
		$\displaystyle\text{by Assumption~{}\ref{asp:Gbound} and definition of }\ell_{s}^{N}\text{ (see Algorithm~{}\ref{algo:bootstrap})}$
		$\displaystyle\geq\frac{1}{\left\lVert G_{s}\right\rVert_{\infty}}\mathbb{E}\left[\mathbb{Q}_{s-1}^{N}L_{s-1:s}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right]$
		by Corollary 3 and law of total expectation
		$\displaystyle=\mathbb{E}\left[\mathbb{Q}_{s-1}^{N}L_{{s-1}:T}\Pi^{0:T}_{t}f_{t}^{N}\right]=\infty\text{ (induction hypothesis).}$

Part (b). Similar to part (a), we shall prove by induction the statement

\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0,\forall\ t-1\leq s\leq T.

Again, by Corollary 3, this quantity is equal to

\frac{{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})}{\ell_{s}^{N}},

and the expectation of the numerator given $\mathcal{F}_{s-1}$ is $\mathbb{Q}_{s-1}^{N}L_{s-1:T}\Pi_{t}^{0:T}f_{t}^{N}$ , which tends to $0$ in probability by induction hypothesis. Lemma E.11 (see below at the end of the section), the classical result $\ell_{s}^{N}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}\ell_{s}:=\mathbb{Q}_{s-1}M_{s}(G_{s})$ and Stutsky’s theorem concludes the proof. ∎

Proof of Theorem B.1.

By (21), we have $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\mathbb{E}[\int\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})]$ where

f_{t}^{N}(x_{t})=\frac{\bar{M}_{h}}{\sum W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})}=\frac{\bar{M}_{h}}{r_{t}^{N}}

with $r_{t}^{N}$ given in Definition 1. Proposition 9(a) gives a sufficient condition for $\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\infty$ to hold, namely

\int_{\mathcal{X}_{t}}(r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T})(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty,

which is equivalent to the hypothesis of the theorem. ∎

Proof of Theorem B.2.

We use notations from Definition 1 and Supplement A.1. We note $\mathcal{N}(x|\mu,\Sigma)$ the density of the specified normal distribution at point $x$ . Using Lemma E.9, Proposition 9 and (21), we have

	$\displaystyle\mathbb{E}[(\tau_{t}^{1,\mathrm{FFBS}})]=\infty$	$\displaystyle\Leftrightarrow\mathbb{E}\left[\frac{1}{r_{t}^{N}(X_{t}^{\mathcal{I}_{t}^{1}})^{k}}\right]=\infty$
		$\displaystyle\Leftrightarrow\mathbb{E}\left[\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})\frac{1}{r_{t}^{N}(x_{t})^{k}}\right]=\infty$
		$\displaystyle\Leftarrow\mathbb{E}\left[\int_{\mathcal{X}_{t}}\frac{r_{t}^{N}G_{t}H_{t:T}}{(r_{t}^{N})^{k}}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]=\infty$
		$\displaystyle\Leftarrow\int_{\mathcal{X}_{t}}\frac{r_{t}G_{t}H_{t:T}}{(r_{t}^{N})^{k-1}r_{t}}(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty\text{ almost surely}$
		$\displaystyle\Leftrightarrow\int_{\mathcal{X}_{t}}\frac{\mathcal{N}(x_{t}\|\mu_{t}^{\mathrm{smth}},\Sigma^{\mathrm{smth}}_{t})}{r_{t}^{N}(x_{t})^{k-1}\mathcal{N}(x_{t}\|\mu_{t}^{\mathrm{pred}},\Sigma^{\mathrm{pred}}_{t})}\lambda_{t}(\mathrm{d}x_{t})=\infty\text{ a.s. }$

The theorem then follows from elementary arguments, by noting that $r_{t}^{N}$ is a mixture of $N$ Gaussian distributions with covariance matrix $C_{X}$ . ∎

Lemma E.9.

Let $L$ be a $]0,1]$ -valued random variable. Suppose $X$ is another random variable such that $X|L\sim\operatorname{Geo}(L)$ . Then for any real number $k>0$ ,

\mathbb{E}[X^{k}]=\infty\Leftrightarrow\mathbb{E}[L^{-k}]=\infty.

Proof.

By the definition of $X$ , we have

\mathbb{E}[X^{k}]=\mathbb{E}\left[\sum_{x=1}^{\infty}x^{k}(1-L)^{x-1}L\right].

A natural idea is then to approximate the sum by the integral $\int_{0}^{\infty}x^{k}(1-L)^{x-1}L\mathrm{d}x$ , from which one easily extracts the $L^{-k}$ factor. This is however technically laborious, since the function $x\mapsto x^{k}(1-L)^{x-1}L$ is not monotone on the whole real line. It is only so starting from a certain $x_{0}$ which itself depends on $L$ . We would therefore rather write

\begin{split}\mathbb{E}[X^{k}]&=\int_{0}^{\infty}\mathbb{P}(X^{k}\geq x)\mathrm{d}x=\int_{0}^{\infty}\mathbb{P}(X\geq x^{1/k})\mathrm{d}x\\ &=\int_{0}^{\infty}\mathbb{E}\left[(1-L)^{\lfloor x^{1/k}\rfloor}\right]\mathrm{d}x\\ &\text{where the two integrands are equal Lebesgue--almost-everywhere}\\ &=\mathbb{E}\left[\int_{0}^{\infty}\exp\left(-\left|\log(1-L)\right|\lfloor x^{1/k}\rfloor\right)\mathrm{d}x\right]\end{split}

with the natural interpretation of expressions when $L=1$ . Using $u\sim v$ as a shorthand for “ $u$ and $v$ are either both finite or both infinite”, we have

\begin{split}\mathbb{E}[X^{k}]&\sim\mathbb{E}\left[\int_{0}^{\infty}\exp\left(-\left|\log(1-L)\right|{x^{1/k}}\right)\mathrm{d}x\right]\text{ by Lemma~{}\ref{lem:equivalent_in_01}}\\ &=k\ \Gamma(k)\ \mathbb{E}\left[\frac{1}{\left|\log(1-L)\right|^{k}}\right]\sim\mathbb{E}\left[\frac{1}{L^{k}}\right]\text{ by Lemma~{}\ref{lem:equivalent_in_01} again.}\end{split}

∎

The following lemma is elementary. Its proof is therefore omitted.

Lemma E.10.

Let $L$ be a $]0,1]$ -valued random variable and let $f_{1}$ and $f_{2}$ be two continuous functions from $]0,1]$ to $\mathbb{R}$ . Suppose that $\limsup_{\ell\to 0^{+}}f_{1}(\ell)/f_{2}(\ell)$ and $\limsup_{\ell\to 0^{+}}f_{2}(\ell)/f_{1}(\ell)$ are both finite. Then $\mathbb{E}[f_{1}(L)]$ is finite if and only if $\mathbb{E}[f_{2}(L)]$ is so.

Lemma E.11.

Let $Z_{1},Z_{2},\ldots$ be non-negative random variables. Suppose that there exist $\sigma$ -algebras $\mathcal{F}_{1},\mathcal{F}_{2},\ldots$ such that $\mathbb{E}[Z_{n}|\mathcal{F}_{n}]\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0$ . Then $Z_{n}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0$ .

Proof.

Fix $\varepsilon>0$ . By Markov’s inequality, $\mathbb{P}(Z_{n}\geq\varepsilon|\mathcal{F}_{n})\leq{\varepsilon}^{-1}\mathbb{E}[Z_{n}|\mathcal{F}_{n}]$ . Therefore, the $[0,1]-$ bounded random variable $\mathbb{P}(Z_{n}\geq\varepsilon|\mathcal{F}_{n})$ tends to $0$ in probability, hence also in expectation. The law of total expectation then gives $\mathbb{P}(Z_{n}\geq\varepsilon)\to 0$ , which, by varying $\varepsilon$ , establishes the convergence of $Z_{n}$ to $0$ in probability. ∎

E.8. Proof of Theorem B.3 and Corollary 2 (hybrid FFBS complexity)

Proof of Theorem B.3.

According to Janson, (2011, Lemma 3), it is sufficient to show that

\frac{\sum_{n}\min(\tau_{t}^{n,\mathrm{FFBS}},N)}{N\alpha_{N}}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0

for any deterministic sequence $\alpha_{N}$ such that ${\alpha_{N}}/{\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]}\to\infty$ . By Lemma E.11, we can take expectation with respect to $\mathcal{F}_{T}$ to derive a sufficient condition, namely

\begin{split}&\hskip 14.22636pt\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0\text{ with }z^{N}\text{ defined in \eqref{eq:def_zn}}\\ &\Leftarrow\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0\text{ by Proposition~{}\ref{prop:ffbs_exec_induction}(b)}\\ &\Leftarrow\mathbb{E}\left[\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\right]\to 0\\ &\Leftarrow\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\to 0\text{ by Lemma~{}\ref{lem:magic_jensen}}\\ &\Leftrightarrow\frac{\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]}{\alpha_{N}}\to 0.\end{split}

The proof is now complete. ∎

Proof of Corollary 2.

We have, using the cost-to-go, the $z^{N}$ functions and the $\tau_{t}^{\infty,\mathrm{PaRIS}}$ distribution defined respectively in (20), (65) and (59):

\begin{split}&\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]=\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)\mathbb{Q}_{T}(\mathrm{d}x_{t})\\ =&\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)(G_{t}H_{t:T})(x_{t})(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})\\ \leq&\left\lVert G_{t}H_{t:T}\right\rVert_{\infty}\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})\\ =&\left\lVert G_{t}H_{t:T}\right\rVert_{\infty}\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)].\end{split}

Proposition 8 then finishes the proof. ∎

E.9. Proof of Proposition 4 (MCMC kernel properties)

Proof.

Let $J_{t}^{n}$ be such that $J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot)\sim B_{t}^{N,\mathrm{IMH}}(n,\cdot)$ . Moreover, let $K_{t}^{n}$ be such that

K_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},A_{t}^{n},\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot)\sim B_{t}^{N,\mathrm{IMH}}(n,\cdot).

Given $X_{t-1}^{1:N}$ , $X_{t}^{n}$ and $A_{t}^{n}$ , the kernel $B_{t}^{N,\mathrm{IMH}}(n,\cdot)$ applies to $A_{t}^{n}$ one or more several MCMC steps keeping invariant $B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ . Since $A_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ , it follows that $K_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ too. On the other hand, $J_{t}^{n}$ and $K_{t}^{n}$ share the same distribution given $X_{t-1}^{1:N}$ , $X_{t}^{n}$ and $\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot)$ . Hence they also do, given $X_{t-1}^{1:N}$ and $X_{t}^{n}$ only. This implies that $J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$ , which is the same as $A_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}$ . Thus $(J_{t}^{n},X_{t}^{n})$ have the same distribution as $(A_{t}^{n},X_{t}^{n})$ given $X_{t-1}^{1:N}$ , as required by Theorem 2.1. The arguments for the kernel $B_{t}^{N,\mathrm{IMHP}}$ are similar.

To show that a certain kernel $B_{t}^{N}$ satisfies (13), we look at two conditionally i.i.d. simulations $J_{t}^{n,1}$ and $J_{t}^{n,2}$ from $B_{t}^{N}(n,\cdot)$ and lower bound the probability that they are different. For the kernel $B_{t}^{N,\mathrm{IMH}}$ , the variables $J_{t}^{n,1}$ and $J_{t}^{n,2}$ both result from one step of MH applied to $A_{t}^{n}$ . Let $J_{t}^{n,1*}$ and $J_{t}^{n,2*}$ be the corresponding MH proposals. A sufficient condition for $J_{t}^{n,1}\neq J_{t}^{n,2}$ is that $J_{t}^{n,1*}\neq J_{t}^{n,2*}$ and the two proposals are both accepted. The acceptance rate is at least $\bar{M}_{\ell}/\bar{M}_{h}$ by Assumption 2 and the probability that $J_{t}^{n,1*}\neq J_{t}^{n,2*}$ is

1-\sum_{n=1}^{N}(W_{t-1}^{n})^{2}\geq 1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}

by Assumption 3. Thus (13) is satisfied for $\varepsilon_{\mathrm{S}}=\bar{M}_{\ell}/2\bar{M}_{h}$ for $N$ large enough. Similarly, the probability that $J_{t}^{n,1}\neq J_{t}^{n,2}$ for the $B_{t}^{N,\mathrm{IMHP}}$ kernel with $\tilde{N}=2$ can be lower-bounded via the probability that $\tilde{J}_{t}^{n,1}\neq\tilde{J}_{t}^{n,2}$ (where $\tilde{J}_{t}^{n,1}$ and $\tilde{J}_{t}^{n,2}$ are defined in (18)). Thus using the same arguments, (13) is satisfied here for $\varepsilon_{\mathrm{S}}=\bar{M}_{\ell}/4\bar{M}_{h}$ . ∎

E.10. Conditional probability of maximal couplings

In general, there exist multiple maximal couplings of two random distributions (i.e. couplings that maximise the probability of equality of the two variables). However, they all satisfy a certain conditional probability property stated in the following lemma. It is closely related to results on the coupling density on the diagonal (see e.g. Wang et al.,, 2021, Lemma 2 or Douc et al.,, 2018, Theorem 19.1.6). Its statement, which we were unable to find in the literature in the exact form we need, is obvious in the discrete case but requires lengthier arguments in the continuous one.

Proposition 10.

Let $X_{1}$ and $X_{2}$ be two random variables with densities $f_{1}$ and $f_{2}$ with respect to some dominating measure defined on a space $\mathcal{X}$ . Then, the following inequality holds almost surely:

(69)

\mathbb{P}(X_{2}=X_{1}|X_{1})\leq 1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}.

Moreover, the equality occurs almost surely if and only if $X_{1}$ and $X_{2}$ form a maximal coupling.

Proof.

Let $h$ be any non-negative test function from $\mathcal{X}$ to $\mathbb{R}$ . Putting

	$\displaystyle A_{1}$	$\displaystyle:=\left\{x\in\mathcal{X}\mid f_{1}(x)\geq f_{2}(x)\right\}$
	$\displaystyle A_{2}$	$\displaystyle:=\left\{x\in\mathcal{X}\mid f_{2}(x)\geq f_{1}(x)\right\},$

we have

\begin{split}\mathbb{E}[\mathbb{P}(X_{2}=X_{1}|X_{1})h(X_{1})]&=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}h(X_{1})]\\ &=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbbm{1}_{X_{1}\in A_{1}}h(X_{1})]+\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbbm{1}_{X_{2}\in A_{1}}h(X_{2})]+\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &\leq\mathbb{E}[\mathbb{1}_{X_{2}\in A_{1}}h(X_{2})]+\mathbb{E}[\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &=\int h(x)f_{2}\land f_{1}(x)\mathrm{d}x=\mathbb{E}\left[\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)h(X_{1})\right].\end{split}

The inequality (69) is now proved almost-surely. As a result, almost-sure equality occurs if and only if the expectation of the two sides of (69) are equal, which means, via Lemma A.1, that the two variables are maximally coupled. ∎

The following lemma establishes the symmetry of Assumption 8. Again, its statement is obvious in the discrete case, though some work is needed to rigorously justify the continuous one.

Lemma E.12.

Let $X_{1}$ and $X_{2}$ be two random variables of densities $f_{1}$ and $f_{2}$ w.r.t. some dominating measure defined on some space $\mathcal{X}$ . Suppose that almost-surely

\mathbb{P}(X_{2}=X_{1}|X_{1})\geq\varepsilon\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)

for some $\varepsilon>0$ . Then almost-surely,

\mathbb{P}(X_{1}=X_{2}|X_{2})\geq\varepsilon\left(1\land\frac{f_{1}(X_{2})}{f_{2}(X_{2})}\right).

Proof.

We introduce a non-negative test function $h_{2}:\mathcal{X}\to\mathbb{R}$ and write

\begin{split}\mathbb{E}[\mathbb{P}(X_{1}=X_{2}|X_{2})h(X_{2})]&=\mathbb{E}[\mathbb{1}_{X_{1}=X_{2}}h(X_{2})]=\mathbb{E}[\mathbb{1}_{X_{2}=X_{1}}h(X_{1})]\\ &=\mathbb{E}[\mathbb{P}(X_{2}=X_{1}|X_{1})h(X_{1})]\\ &\geq\mathbb{E}\left[\varepsilon\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)h(X_{1})\right]\\ &=\int\varepsilon f_{1}\land f_{2}(x)h(x)\mathrm{d}x\\ &=\mathbb{E}\left[\varepsilon\left(1\land\frac{f_{1}(X_{2})}{f_{1}(X_{2})}\right)h(X_{2})\right]\end{split}

which implies the desired result. ∎

E.11. Proof of Theorem 4.1 (intractable kernel properties)

Proof.

Let $J_{t}^{n}$ be a random variable such that

J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N,\mathrm{ITR}}(n,\cdot)\sim B_{t}^{N,\mathrm{ITR}}(n,\cdot).

By construction of Algorithm 7, given $X_{t-1}^{1:N}$ , the couple $(J_{t}^{n},X_{t}^{n})$ has the same distribution as $(A_{t}^{n,L},X_{t}^{n,L})$ . Thus, $B_{t}^{N,\mathrm{ITR}}$ satisfies the hypotheses of Theorem 2.1. To verify (13), we define the variables $J_{t}^{n,1:2}$ accordingly and write:

\mathbb{P}\left(\left.{J_{t}^{n,1}\neq J_{t}^{n,2}}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \begin{aligned} &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2},L=1}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{L=1,X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\text{(by symmetry)}\\ &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\frac{1}{2}\begin{multlined}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{X_{t}^{n,2}=x_{t}}\right|{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2},X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\times\\ \times\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\end{multlined}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{X_{t}^{n,2}=x_{t}}\right|{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2},X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\times\\ \times\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &\geq\frac{1}{2}\begin{multlined}\varepsilon_{D}\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \text{(by Assumptions~{}\ref{asp:coupling:dynamics} and~{}\ref{asp:mt_2ways_bound})}\end{multlined}\varepsilon_{D}\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \text{(by Assumptions~{}\ref{asp:coupling:dynamics} and~{}\ref{asp:mt_2ways_bound})}\\ &\geq\frac{1}{2}\varepsilon_{D}\left(\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{2}\varepsilon_{A}\text{ by Lemma~{}\ref{lem:modified_epsa}.}\end{aligned}

The proof is complete. ∎

Lemma E.13.

We use the notations of Algorithm 7. Under Assumptions 2 and 7, we have

\mathbb{P}\left(\left.{A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1},X_{t-1}^{1:N}}\right)\geq\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\varepsilon_{A}.

Proof.

We write (and define new notations along the way):

\begin{split}\pi(a_{t}^{n,1},a_{t}^{n,2})&:=p(a_{t}^{n,1},a_{t}^{n,2}|X_{t}^{n,1},X_{t-1}^{1:N})\\ &\propto p(a_{t}^{n,1},a_{t}^{n,2}|X_{t-1}^{1:N})m_{t}(X_{t-1}^{a_{t}^{n,1}},X_{t}^{n,1})\\ &=:p(a_{t}^{n,1},a_{t}^{n,2}|X_{t-1}^{1:N})\phi(a_{t}^{n,1})\\ &=:\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1}).\end{split}

Thus

\begin{split}\mathbb{P}\left(\left.{A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1},X_{t-1}^{1:N}}\right)&=\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi(a_{t}^{n,1},a_{t}^{n,2})\\ &=\frac{\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1})}{\int\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1})}\\ &\geq\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\end{split}

by the boundedness of the function $\phi$ between $\bar{M}_{\ell}$ and $\bar{M}_{h}$ . From this, we get the desired result by virtue of Assumption 7. ∎

E.12. Validity of Algorithm 14 (modified Lindvall-Rogers coupler)

Recall that generating a random variable is equivalent to uniformly simulating under the graph of its density (see e.g. Robert and Casella,, 2004, The Fundamental Theorem of Simulation, chapter 2.3.1). Algorithm 14’s correctness is thus a direct corollary of the following intuitive lemma.

Lemma E.14.

Let $S_{A}$ and $S_{B}$ be two subsets of $\mathbb{R}^{d}$ with finite Lebesgue measures. Let $A$ and $B$ be two not necessarily independent random variables distributed according to $\operatorname{Uniform}(S_{A})$ and $\operatorname{Uniform}(S_{B})$ respectively. Denote by $S_{0}$ the intersection of $S_{A}$ and $S_{B}$ ; and by $C$ a certain $\operatorname{Uniform}(S_{A})$ -distributed random variable that is independent from $(A,B)$ . Define $A^{\star}$ and $B^{\star}$ as

A^{\star}=\begin{cases}C&\text{ if }(A,C)\in S_{0}\times S_{0}\\ A&\text{ otherwise}\end{cases}

and

B^{\star}=\begin{cases}C&\text{ if }(B,C)\in S_{0}\times S_{0}\\ B&\text{ otherwise}\end{cases}

Then $A^{\star}\sim\operatorname{Uniform}(S_{A})$ and $B^{\star}\sim\operatorname{Uniform}(S_{B})$ .

Proof.

Given $(A,C)\in S_{0}\times S_{0}$ , the two variables $A$ and $C$ have the same distribution (which is $\operatorname{Uniform}(S_{0})$ ). Thus, the definition of $A^{\star}$ implies that $A$ and $A^{\star}$ have the same (unconditional) distribution. The same argument applies to $B$ and $B^{\star}$ notwithstanding the asymmetry in the definition of $C$ . ∎

E.13. Hoeffding inequalities

This section proves a Hoeffding inequality for ratios, which helps us to bound (28). It is essentially a reformulation of Douc et al., (2011, Lemma 4) in a slightly more general manner.

Definition 2.

A real-valued random variable $X$ is called $(C,S)$ -sub-Gaussian if

\mathbb{P}\left(\frac{\left|X\right|}{S}>t\right)\leq 2Ce^{-t^{2}/2},\forall\ t\geq 0.

This definition is close to other sub-Gaussian definitions in the literature, see e.g. Vershynin, (2018, Chapter 2.5). It basically means that the tails of $X$ decreases at least as fast as the tails of the $\mathcal{N}(0,S^{2})$ distribution, which is itself $(1,S)$ -sub-Gaussian. The following result is classic.

Theorem E.15 (Hoeffding’s inequality).

Let $X_{1},\ldots,X_{N}$ be $N$ i.i.d. random variables with mean $\mu$ and almost surely contained between $a$ and $b$ . Then $N^{1/2}(\sum X_{i}/N-\mu)$ is $(1,(b-a)/2)$ -sub-Gaussian.

The following lemma is elementary from Definition 2. The proof is omitted.

Lemma E.16.

Let $X$ and $Y$ be two (not necessarily independent) random variables. If $X$ is $(C_{1},S_{1})$ -sub-Gaussian and $Y$ is $(C_{2},S_{2})$ -sub-Gaussian, then $X+Y$ is $(C_{1}+C_{2},S_{1}+S_{2})$ -sub-Gaussian.

We are ready to state the main result of this section.

Proposition 11 (Hoeffding’s inequality for ratios).

Let $a_{N}$ , $b_{N}$ , $a^{*}$ , $b^{*}$ be random variables such that $\sqrt{N}(a_{N}-a^{*})$ is $(C_{a},S_{a})$ -sub-Gaussian and $\sqrt{N}(b_{N}-b^{*})$ is $(C_{b},S_{b})$ -sub-Gaussian. Then $\sqrt{N}\left({a_{N}}/{b_{N}}-{a^{*}}/{b^{*}}\right)$ is sub-Gaussian with parameters $(C^{*},S^{*})$ where

\begin{cases}C^{*}&=C_{a}+C_{b}\\ S^{*}&=\left\lVert\frac{1}{b^{*}}\right\rVert_{\infty}(S_{a}+S_{b}\left\lVert\frac{a_{N}}{b_{N}}\right\rVert_{\infty}).\end{cases}

The terms with inf-norm can be infinite if the corresponding random variables are unbounded.

Proof.

We have

	$\displaystyle\left\|\sqrt{N}\left(\frac{a_{N}}{b_{N}}-\frac{a^{}}{b^{}}\right)\right\|$	$\displaystyle\leq\left\|\sqrt{N}\left(\frac{a_{N}}{b_{N}}-\frac{a_{N}}{b^{}}\right)\right\|+\left\|\sqrt{N}\left(\frac{a_{N}}{b^{}}-\frac{a^{}}{b^{}}\right)\right\|$
		$\displaystyle=\left\|\frac{a_{N}}{b_{N}}\right\|\left\|\frac{1}{b^{}}\right\|\left\|\sqrt{N}(b_{N}-b^{})\right\|+\left\|\frac{1}{b^{}}\right\|\left\|\sqrt{N}(a_{N}-a^{})\right\|$

by which the proposition follows from Lemma E.16. ∎

	$\displaystyle\mathbb{E}_{\mathbb{Q}_{t}^{N}}[\psi_{s}(X_{s-1},X_{s})]$	$\displaystyle=\sum_{i_{s},i_{s-1}}\hat{q}_{s\|t}^{N}[1,i_{s}]\hat{B}_{s}^{N}[i_{s},i_{s-1}]\hat{\psi}_{s}^{N}[i_{s-1},i_{s}]$
		$\displaystyle=\sum_{i_{s}}\hat{q}_{s\|t}^{N}[1,i_{s}](\hat{B}_{s}^{N}\hat{\psi}_{s}^{N})[i_{s},i_{s}]$
		$\displaystyle=\hat{q}_{s\|t}^{N}\operatorname{diag}(\hat{B}_{s}^{N}\hat{\psi}_{s}^{N}).$

(35)			$\displaystyle\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}})}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\frac{1}{N^{2}}\sum_{n,m\leq N}\sum_{s,t\leq T}\operatorname{Cov}\left(\left.\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}}),\psi_{s}(X_{s-1}^{\mathcal{I}_{s-1}^{m}},X_{s}^{\mathcal{I}_{s}^{m}})\right\|\mathcal{F}_{T}^{-}\right)$
			$\displaystyle\leq\frac{2}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n=m\end{subarray}}\sum_{s,t\leq T}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}+$
			$\displaystyle\quad+\frac{4}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n\neq m\end{subarray}}\sum_{s,t\leq T}\frac{\tilde{C}}{N}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}$
			$\displaystyle=\left(\sum_{s,t\leq T}2\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left\|t-s\right\|-1}\right)\frac{(2\tilde{C}+1)N-2\tilde{C}}{N^{2}}$
			$\displaystyle\leq\left[\sum_{s,t\leq T}\left(\left\lVert\psi_{t}\right\rVert_{\infty}^{2}+\left\lVert\psi_{s}\right\rVert_{\infty}^{2}\right)\rho^{\left\|t-s\right\|-1}\right]\frac{2\tilde{C}+1}{N}\leq\frac{4(2\tilde{C}+1)}{N\rho(1-\rho)}\sum\left\lVert\psi_{t}\right\rVert_{\infty}^{2}.$

(41)			$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{*2}_{t-1}}\right\|{\mathcal{F}_{T}^{-}}\right)$
			$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}}\right\|{\mathcal{F}_{T}^{-}}\right)$
			by construction of Algorithm 15
			$\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}}\right\|{R_{t},\mathcal{F}_{T}^{-}}\right)}\right\|{\mathcal{F}_{T}^{-}}\right]$
			by the law of total expectation
			$\displaystyle=\mathbb{E}\left[\left.{\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{2}_{t},\cdot),B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{2}_{t},\cdot)\right)\mathbb{1}\left\{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{2}_{t}\right\}}\right\|{\mathcal{F}_{T}^{-}}\right]$
			$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by Lemma~{}\ref{lem:backward_mixing}}.$

	$\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}}}\right\|{\mathcal{F}_{T}^{-}}\right)$
	$\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by construction of Algorithm~{}\ref{algo:four_trajs}}$
	$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s-1}\neq\mathcal{I}^{*2}_{s-1}}\right\|{\mathcal{F}_{T}^{-}}\right)\text{ by applying \eqref{eq:two_twostar_diff_bw} recursively}$
	$\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\frac{\tilde{C}}{N}\text{ by Lemma~{}\ref{lem:I_s_star_likely_same},}$

(44)		$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$
(44)		$\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right)$	$\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T-1}=1}\right\|{\mathcal{F}_{T}^{-}}\right).$