
On backward smoothing algorithms

Hai-Dang Dau & Nicolas Chopin CREST-ENSAE, Institut Polytechnique de Paris nicolas.chopin@ensae.fr
Abstract.

In the context of state-space models, skeleton-based smoothing algorithms rely on a backward sampling step which by default has an $\mathcal{O}(N^{2})$ complexity (where $N$ is the number of particles). Existing improvements in the literature are unsatisfactory: a popular rejection sampling-based approach, as we shall show, might lead to badly behaved execution time; another rejection sampler with stopping lacks complexity analysis; yet another MCMC-inspired algorithm comes with no stability guarantee. We provide several results that close these gaps. In particular, we prove a novel non-asymptotic stability theorem, thus enabling smoothing with truly linear complexity and adequate theoretical justification. We propose a general framework which unites most skeleton-based smoothing algorithms in the literature and allows us to simultaneously prove their convergence and stability, both in online and offline contexts. Furthermore, we derive, as a special case of that framework, a new coupling-based smoothing algorithm applicable to models with intractable transition densities. We elaborate practical recommendations and confirm them with numerical experiments.

1. Introduction

1.1. Background

A state-space model is composed of an unobserved Markov process X0,,XTX_{0},\ldots,X_{T} and observed data Y0,,YTY_{0},\ldots,Y_{T}. Given X0,,XTX_{0},\ldots,X_{T}, the data Y0,,YTY_{0},\ldots,Y_{T} are independent and generated through some specified emission distribution Yt|Xt𝒇t(dyt|xt)Y_{t}|X_{t}\sim\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t}). These models have wide-ranging applications (e.g. in biology, economics and engineering). Two important inference tasks related to state-space models are filtering (computing the distribution of XtX_{t} given Y0,,YtY_{0},\ldots,Y_{t}) and smoothing (computing the distribution of the whole trajectory (X0,,Xt)(X_{0},\ldots,X_{t}), again given all data until time tt). Filtering is usually carried out through a particle filter, that is, a sequential Monte Carlo algorithm that propagates NN weighted particles (realisations) through Markov and importance sampling steps; see Chopin and Papaspiliopoulos, (2020) for a general introduction to state-space models (Chapter 2) and particle filters (Chapter 10).

This paper is concerned with skeleton-based smoothing algorithms, i.e. algorithms that approximate the smoothing distributions with empirical distributions based on the output of a particle filter (i.e. the locations and weights of the $N$ particles at each time step). A simple example is genealogy tracking (empirically proposed in Kitagawa, 1996 and theoretically analysed in Del Moral and Miclo, 2001), which keeps track of the ancestry (past states) of each particle. This smoother suffers from degeneracy: for $t$ large enough, all the particles have the same ancestor at time 0.

The forward filtering backward smoothing (FFBS) algorithm (Godsill et al.,, 2004) has been proposed as a solution to this problem. Starting from the filtering approximation at time tt, the algorithm samples successively particles at times t1t-1, t2t-2, etc. using backward kernels. Its theoretical properties, in particular the stability as tt\to\infty, have been studied by Del Moral et al., (2010); Douc et al., (2011).

In many applications, one is mainly interested in approximating smoothing expectations of additive functions of the form

\mathbb{E}\left[\left.{\psi_{0}(X_{0})+\psi_{1}(X_{0},X_{1})+\cdots+\psi_{t}(X_{t-1},X_{t})}\right|{Y_{0},\ldots,Y_{t}}\right]

for some functions ψ0,,ψt\psi_{0},\ldots,\psi_{t}. Such expectations can be approximated in an online fashion by a procedure described in Del Moral et al., (2010). Inspired by this, the particle-based, rapid incremental smoother (PaRIS) algorithm of Olsson and Westerborn, (2017) replaces some of the calculations with an additional layer of Monte Carlo approximation.

The backward sampling operation is central to both the FFBS and the PaRIS algorithms. The naive implementation has an 𝒪(N2)\mathcal{O}(N^{2}) cost. There have been numerous attempts at alleviating this problem in the literature, but, to our knowledge, they all lack formal support in terms of either computation complexity or stability as TT\to\infty.

In the following five paragraphs, we elaborate on this limitation for each of the three major contenders, and we point out two related challenges with current backward sampling algorithms that we try to resolve in this article.

1.2. State of the art

First, Douc et al., (2011) proposed to use rejection sampling for the generation of backward indices in FFBS, and Olsson and Westerborn, (2017) extended this technique to PaRIS. If the model has upper- and lower-bounded transition densities, this sampler has an $\mathcal{O}(N)$ expected execution time (Douc et al., 2011, Proposition 2 and Olsson and Westerborn, 2017, Theorem 10). Unfortunately, most practical state-space models (including linear Gaussian ones) violate this assumption, and the behaviour of the algorithm in this case is unclear. Empirically, it has been observed (Taghavi et al., 2013; Bunch and Godsill, 2013; Olsson and Westerborn, 2017, Section 4.3) that in real examples, FFBS-reject and PaRIS-reject frequently suffer from low acceptance rates, contrary to what users would expect from an algorithm with linear complexity. To cite Bunch and Godsill, (2013), “[a]lthough theoretically elegant, the […] algorithm has been found to suffer from such high rejection rates as to render it consistently slower than direct sampling implementation on problems with more than one state dimension”. To the best of our knowledge, no theoretical result has been put forward to formalise or quantify this bad behaviour.

Second, Taghavi et al., (2013) and Olsson and Westerborn, (2017, Section 4.3) suggest putting a threshold on the number of rejection sampling trials to get more stable execution times. The thresholds are either chosen adaptively using a Kalman filter in Taghavi et al., (2013) or fixed at N\sqrt{N} in Olsson and Westerborn, (2017, Section 4.3). Although improvements are empirically observed, to the best of our knowledge, no theoretical analysis of the complexity of the resulting algorithm or formal justification of the proposed threshold is available.

Third, Bunch and Godsill, (2013) use MCMC moves starting from the filtering ancestor instead of a full backward sampling step. Empirically, this algorithm seems to prevent the degeneracy associated with the genealogy tracking smoother using a very small number of MCMC steps (e.g. fewer than five). Unfortunately, as far as we know, this stability property is not proved anywhere in the literature, which deters the adoption of the algorithm. Using MCMC moves provides a procedure with truly linear and deterministic run time, and a stability result is the only missing piece of the puzzle to completely resolve the $\mathcal{O}(N^{2})$ problem. We believe one reason for the current state of affairs is that the stability proof techniques employed by Douc et al., (2011) and Olsson and Westerborn, (2017) are difficult to extend to the MCMC case.

Fourth, and this is related to the third point, the stability of the PaRIS algorithm has only been proved in the asymptotic regime. More specifically, Olsson and Westerborn, (2017) established a central limit theorem as NN\to\infty in Corollary 5, then showed that the corresponding asymptotic error remains controlled as TT\to\infty in Theorem 8 and Theorem 9. While non-asymptotic stability bounds for the FFBS algorithm are already available in Del Moral et al., (2010); Douc et al., (2011); Dubarry and Le Corff, (2013), we do not think that they can be extended straightforwardly to PaRIS and we are not aware of any such attempt in the literature.

Fifth, all backward samplers mentioned thus far require access to the transition density. Many models have dynamics that can be simulated from but transition densities that are not explicitly calculable. Enabling backward sampling in this scenario is challenging and will certainly require some kind of problem-specific knowledge to extract information from the transition densities, despite not being able to evaluate them exactly.

1.3. Structure and contribution

Section 2 presents a general framework which admits as particular cases a wide variety of existing algorithms (e.g. FFBS, forward-additive smoothing, PaRIS) as well as the novel ones considered later in the paper. It allows us to simultaneously prove the consistency as $N\to\infty$ and the stability as $T\to\infty$ for all of them. The main ingredient is the discrete backward kernels, which are essentially random $N\times N$ matrices employed differently in the offline and the online settings. On the technical side, the stability result is proved using a new technique, yielding a non-asymptotic bound that addresses the fourth point in subsection 1.2.

Next, we closely look at the use of rejection sampling and realise that in many models, the resulting execution time may be significantly heavy-tailed; see Section 3. For instance, the run time of PaRIS may have infinite expectation, whereas the run time of FFBS may have infinite variance. (Since it is technically more involved, the material for the FFBS algorithm is delegated to Supplement B.) These results address the first point in subsection 1.2 and we discuss their severe practical implications.

We then derive and analyse hybrid rejection sampling schemes (i.e. schemes that use rejection sampling only up to a certain number of attempts, and then switch to the standard method). We show that they lead to a nearly 𝒪(N)\mathcal{O}(N) algorithm (up to some log\log factor) in Gaussian models; again see Section 3. This stems from the subtle interaction between the tail of Gaussian densities and the finite Feynman-Kac approximation. Outside this class of model, the hybrid algorithm can still escape the 𝒪(N2)\mathcal{O}(N^{2}) complexity, although it might not reach the ideal linear run time target. These results shed some light on the second issue mentioned in subsection 1.2.

In Section 4, we look at backward kernels that are more efficient to simulate than the FFBS and the PaRIS ones. Section 4.1 describes backward kernels based on MCMC (Markov chain Monte Carlo) following Bunch and Godsill, (2013) and extends them to the online scenario. We cast this family of algorithms as a particular case of the general framework developed in Section 2, which allows convergence and stability to follow immediately. This solves the long-standing problem described in the third point of subsection 1.2.

MCMC methods require evaluation of the likelihood and thus cannot be applied to models with intractable transition densities. In Section 4.2, we show how the use of forward coupling can replace the role of backward MCMC steps in these scenarios. This makes it possible to obtain stable performance in both on-line and off-line scenarios (with intractable transition densities) and provides a possible solution to the fifth challenge described in subsection 1.2.

Section 5 illustrates the aforementioned algorithms in both on-line and off-line uses. We highlight how hybrid and MCMC samplers lead to a more user-friendly (i.e. smaller, less random and less model-dependent) execution time than the pure rejection sampler. We also apply our smoother for intractable densities to a continuous-time diffusion process with discretization. We observe that our procedure can indeed prevent degeneracy as TT\to\infty, provided that some care is taken to build couplings with good performance. Section 6 concludes the paper with final practical recommendations and further research directions. Table 1 gives an overview of existing and novel algorithms as well as our contributions for each.

1.4. Related work

Proposition 1 of Douc et al., (2011) states that under certain assumptions, the FFBS-reject algorithm has an asymptotic 𝒪(N)\mathcal{O}_{\mathbb{P}}(N) complexity. This does not contradict our results, which point out the undesirable properties of the non-asymptotic execution time. Clearly, non-asymptotic behaviours are what users really observe in practice. From a technical point of view, the proof of Douc et al., (2011, Prop. 1) is a simple application of Theorem 5 of the same paper. In contrast, non-asymptotic results such as Theorem B.1 and Theorem B.2 require more delicate finite sample analyses.

Figure 1 of Olsson and Westerborn, (2017) and the accompanying discussion provide an excellent intuition on the stability of smoothing algorithms based on the support size of the backward kernels. We formalise this support size condition for the first time by the inequality  (13) and construct a novel non-asymptotic stability result based on it. In contrast, Olsson and Westerborn, (2017) depart from their initial intuition and use an entirely different technique to establish stability. Their result is asymptotic in nature.

Gloaguen et al., (2022) briefly mention the use of MCMC in the PaRIS algorithm, but their algorithm is fundamentally different from, and less efficient than, that of Bunch and Godsill, (2013). Indeed, they do not start the MCMC chains at the ancestors previously obtained during the filtering step. They are thus obliged to perform a large number of MCMC iterations for decorrelation, whereas the algorithms described in our Proposition 4, built upon the ideas of Bunch and Godsill, (2013), only require a single MCMC step to guarantee stability. However, we would like to stress again that Bunch and Godsill, (2013) did not prove this important fact.

Another way to reduce the computation time is to perform the expensive backward sampling steps at certain times tt only. For other values of tt, the naive genealogy tracking smoother is used instead. This idea has been recently proposed by Mastrototaro et al., (2021), who also provided a practical recipe for deciding at which values of tt the backward sampling should take place and derived corresponding theoretical results.

Smoothing in models with intractable transition densities is very challenging. If these densities can be estimated accurately, the algorithms proposed by Gloaguen et al., (2022) make it possible to attack this problem. A case of particular interest is diffusion models, where unbiased transition density estimators are provided in Beskos et al., (2006); Fearnhead et al., (2008). More recently, Yonekura and Beskos, (2022) use a special bridge path-space construction to overcome the unavailability of transition densities when the diffusion (possibly with jumps) must be discretised.

Our smoother for intractable models is based on a general coupling principle that is not specific to diffusions. We only require users to be able to simulate their dynamics (e.g. using discretisation in the case of diffusions) and to manipulate random numbers in their simulations so that dynamics starting from two different points can meet with some probability. Our method does not directly provide an estimator for the gradient of the transition density with respect to model parameters and thus cannot be used in its current form to perform maximum likelihood estimation (MLE) in intractable models, whereas the aforementioned works have been able to do so in the case of diffusions. However, the main advantage of our approach lies in its generality beyond the diffusion case. Furthermore, modifications that allow MLE to be performed are possible and might be explored in further work specifically dedicated to the parameter estimation problem.

The idea of coupling has been incorporated in the smoothing problem in a different manner by Jacob et al., (2019). There, the goal is to provide offline unbiased estimates of the expectation under the smoothing distribution. Coupling and more generally ideas based on correlated random numbers are also useful in the context of partially observed diffusions via the multilevel approach (Jasra et al.,, 2017).

In this work, we consider smoothing algorithms that are based on a single pass of the particle filter. Offline smoothing can be done using repeated iterations of the conditional particle filter (Andrieu et al.,, 2010). Full trajectories can also be constructed in an online manner if one is willing to accept some lag approximations (Duffield and Singh,, 2022). Another approach to smoothing consists of using an additional information filter (Fearnhead et al.,, 2010), but it is limited to functions depending on one state only. Each of these algorithmic families has its own advantages and disadvantages, a detailed discussion of which is beyond the scope of this article (see however Nordh and Antonsson,, 2015).

2. General structure of smoothing algorithms

In this section, we decompose each smoothing algorithm into two separate parts: the backward kernel (which determines its theoretical properties such as the convergence and the stability) and the execution mode (which is either online or offline and determines its implementation). This has two advantages: first, it induces an easy-to-navigate categorization of algorithms (see Table 1); and second, it allows us to prove the convergence and the stability of each of them by verifying sufficient conditions on the backward kernel component only.

Mode \ Kernel | FFBS kernel | PaRIS kernel | MCMC kernels | Intract.
Offline | (*) FFBS; (+) Thm. B.1, Thm. B.2; (+) Thm. B.3, Cor. 2 | | (*) FFBS-MCMC; (+) Prop. 4 | (**)
Online | (*) Forward-additive | (*) PaRIS; (+) Thm. 2.2, Prop. 2; (+) Thm. 3.1, Thm. 3.2 | (**) | (**)
Table 1. Summary of smoothing algorithms considered in this paper (classified by the backward kernel and the execution mode) and our contributions. (*) denotes an existing algorithm, (+) a novel theoretical result and (**) a novel algorithm.

2.1. Notations

Measure-kernel-function notations. Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces with respective $\sigma$-algebras $\mathcal{B}(\mathcal{X})$ and $\mathcal{B}(\mathcal{Y})$. The following definitions involve integrals and only make sense when these are well-defined. For a measure $\mu$ on $\mathcal{X}$ and a function $f:\mathcal{X}\to\mathbb{R}$, the notations $\mu f$ and $\mu(f)$ refer to $\int f(x)\mu(\mathrm{d}x)$. A kernel (resp. Markov kernel) $K$ is a mapping from $\mathcal{X}\times\mathcal{B}(\mathcal{Y})$ to $\mathbb{R}$ (resp. $[0,1]$) such that, for $B\in\mathcal{B}(\mathcal{Y})$ fixed, $x\mapsto K(x,B)$ is a measurable function on $\mathcal{X}$; and for $x$ fixed, $B\mapsto K(x,B)$ is a measure (resp. probability measure) on $\mathcal{Y}$. For a real-valued function $g$ defined on $\mathcal{Y}$, let $Kg:\mathcal{X}\to\mathbb{R}$ be the function $Kg(x):=\int g(y)K(x,\mathrm{d}y)$. We sometimes write $K(x,g)$ for the same expression. The product of the measure $\mu$ on $\mathcal{X}$ and the kernel $K$ is a measure on $\mathcal{Y}$, defined by $\mu K(B):=\int K(x,B)\mu(\mathrm{d}x)$.
Other notations.
• The notation $X_{0:t}$ is a shorthand for $(X_{0},\ldots,X_{t})$.
• We denote by $\mathcal{M}(W^{1:N})$ the multinomial distribution supported on $\left\{1,2,\ldots,N\right\}$, with respective probabilities $W_{1},\ldots,W_{N}$. If they do not sum to $1$, we implicitly refer to the normalised version obtained by multiplying the weights by the appropriate constant.
• The symbol $\stackrel{\mathbb{P}}{\rightarrow}$ means convergence in probability and $\Rightarrow$ means convergence in distribution.
• The geometric distribution with parameter $\lambda$ is supported on $\mathbb{Z}_{\geq 1}$, has probability mass function $f(n)=\lambda(1-\lambda)^{n-1}$ and is denoted by $\operatorname{Geo}(\lambda)$.
• Let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces and let $\mu$ and $\nu$ be two probability measures on $\mathcal{X}$ and $\mathcal{Y}$ respectively. The product measure $\mu\otimes\nu$ is defined via $(\mu\otimes\nu)(h):=\iint h(x,y)\mu(\mathrm{d}x)\nu(\mathrm{d}y)$ for bounded functions $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. If $X\sim\mu$ and $Y\sim\nu$, we sometimes denote $\mu\otimes\nu$ by $X\otimes Y$.

2.2. Feynman-Kac formalism and the bootstrap particle filter

Let 𝒳0:T\mathcal{X}_{0:T} be a sequence of measurable spaces and M1:TM_{1:T} be a sequence of Markov kernels such that MtM_{t} is a kernel from 𝒳t1\mathcal{X}_{t-1} to 𝒳t\mathcal{X}_{t}. Let X0:TX_{0:T} be an unobserved inhomogeneous Markov chain with starting distribution X0𝕄0(dx0)X_{0}\sim\mathbb{M}_{0}(\mathrm{d}x_{0}) and Markov kernels M1:TM_{1:T}; i.e. Xt|Xt1Mt(Xt1,dxt)X_{t}|X_{t-1}\sim M_{t}(X_{t-1},\mathrm{d}x_{t}) for t1t\geq 1. We aim to study the distribution of X0:TX_{0:T} given observed data Y0:TY_{0:T}. Conditioned on X0:TX_{0:T}, the data Y0,,YTY_{0},\ldots,Y_{T} are independent and Yt|X0:TYt|Xt𝒇t(|Xt)Y_{t}|X_{0:T}\equiv Y_{t}|X_{t}\sim\boldsymbol{f}_{t}(\cdot|X_{t}) for a certain emission distribution 𝒇t(dyt|xt)\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t}). Assume that there exists dominating measures λ~t\tilde{\lambda}_{t} not depending on xtx_{t} such that

\boldsymbol{f}_{t}(\mathrm{d}y_{t}|x_{t})=f_{t}(y_{t}|x_{t})\tilde{\lambda}_{t}(\mathrm{d}y_{t}).

The distribution of X0:t|Y0:tX_{0:t}|Y_{0:t} is then given by

(1) \mathbb{Q}_{t}(\mathrm{d}x_{0:t})=\frac{1}{L_{t}}\mathbb{M}_{0}(\mathrm{d}x_{0})\prod_{s=1}^{t}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s})

where $G_{s}(x_{s}):=f_{s}(y_{s}|x_{s})$ and $L_{t}>0$ is the normalising constant. Moreover, $\mathbb{Q}_{-1}:=\mathbb{M}_{0}$ and $L_{-1}:=1$ by convention. Equation (1) defines a Feynman-Kac model (Del Moral,, 2004). It does not require $M_{t}$ to admit a transition density, although herein we only consider models where this assumption holds. Let $\lambda_{t}$ be a dominating measure on $\mathcal{X}_{t}$ in the sense that there exists a function $m_{t}$ (not necessarily tractable) such that

(2) M_{t}(x_{t-1},\mathrm{d}x_{t})=m_{t}(x_{t-1},x_{t})\lambda_{t}(\mathrm{d}x_{t}).

A special case of the current framework are linear Gaussian state space models. They will serve as a running example for the article, and some of the results will be specifically demonstrated for models of this class. The rationale is that many real-world dynamics are partly, or close to, Gaussian. The notations for linear Gaussian models are given in Supplement A.1 and we will refer to them whenever this model class is discussed.

Particle filters are algorithms that sample from t(dxt)\mathbb{Q}_{t}(\mathrm{d}x_{t}) in an on-line manner. In this article, we only consider the bootstrap particle filter (Gordon et al.,, 1993) and we detail its notations in Algorithm 1. Many results in the following do apply to the auxiliary filter (Pitt and Shephard,, 1999) as well, and we shall as a rule indicate explicitly when it is not the case.

Input: Feynman-Kac model (1)
Simulate $X_{0}^{1:N}\overset{\text{i.i.d.}}{\sim}\mathbb{M}_{0}$
Set $\omega_{0}^{n}\leftarrow G_{0}(X_{0}^{n})$ for $n=1,\ldots,N$
Set $\ell_{0}^{N}\leftarrow\sum_{n=1}^{N}\omega_{0}^{n}/N$
Set $W_{0}^{n}\leftarrow\omega_{0}^{n}/N\ell_{0}^{N}$ for $n=1,\ldots,N$
for $t\leftarrow 1$ to $T$ do
       Resample. Simulate $A_{t}^{1:N}\overset{\text{i.i.d.}}{\sim}\mathcal{M}(W_{t-1}^{1:N})$
       Move. Simulate $X_{t}^{n}\sim M_{t}(X_{t-1}^{A_{t}^{n}},\mathrm{d}x_{t})$ for $n=1,\ldots,N$
       Reweight. Set $\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n})$ for $n=1,2,\ldots,N$
       Set $\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N$
       Set $W_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N}$ for $n=1,2,\ldots,N$
      
Output: For all $t\geq 0$ and function $\varphi:\mathcal{X}_{t}\to\mathbb{R}$, the quantity $\sum_{n=1}^{N}W_{t}^{n}\varphi(X_{t}^{n})$ approximates $\int\mathbb{Q}_{t}(\mathrm{d}x_{0:t})\varphi(x_{t})$ and the quantity $\ell_{t}^{N}$ approximates $L_{t}/L_{t-1}$
Algorithm 1 Bootstrap particle filter
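To make the notation concrete, here is a minimal NumPy sketch of Algorithm 1 for a toy one-dimensional model; the functions sample_M0, sample_M and G below are illustrative placeholders (a stationary AR(1) transition and a Gaussian emission), not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_M0(N):                      # X_0 ~ M_0 = N(0, 1) (illustrative choice)
    return rng.normal(0.0, 1.0, size=N)

def sample_M(xprev):                   # X_t | X_{t-1} ~ N(0.9 x_{t-1}, 1) (illustrative choice)
    return 0.9 * xprev + rng.normal(0.0, 1.0, size=xprev.shape)

def G(x, y, sigma_y=0.5):              # G_t(x_t) = f_t(y_t | x_t), Gaussian emission
    return np.exp(-0.5 * ((y - x) / sigma_y) ** 2) / (sigma_y * np.sqrt(2 * np.pi))

def bootstrap_filter(y, N):
    """Run Algorithm 1 and return particles X, normalised weights W and ancestors A,
    with the convention that A[t - 1] stores A_t^{1:N}."""
    T = len(y) - 1
    X = np.empty((T + 1, N)); W = np.empty((T + 1, N)); A = np.empty((T, N), dtype=int)
    X[0] = sample_M0(N)
    w = G(X[0], y[0]); W[0] = w / w.sum()
    for t in range(1, T + 1):
        A[t - 1] = rng.choice(N, size=N, p=W[t - 1])      # resample
        X[t] = sample_M(X[t - 1, A[t - 1]])                # move
        w = G(X[t], y[t]); W[t] = w / w.sum()              # reweight
    return X, W, A

# Usage: X, W, A = bootstrap_filter(y=np.zeros(50), N=1000)
```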

We end this subsection with the definition of two sigma-algebras that will be referred to throughout the paper. Using the notations of Algorithm 1, let

(3) \begin{split}\mathcal{F}_{t}&:=\sigma(X_{0:t}^{1:N},A_{1:t}^{1:N}),\\ \mathcal{F}_{t}^{-}&:=\sigma(X_{0:t}^{1:N}).\end{split}

2.3. Backward kernels and off-line smoothing

In this subsection, we first describe three examples of backward kernels, in which we emphasise both the random measure and the random matrix viewpoints. We then formalise their use by stating a generic off-line smoothing algorithm.

Example 1 (FFBS algorithm, Godsill et al.,, 2004).

Once Algorithm 1 has been run, the FFBS procedure generates a trajectory approximating the smoothing distribution in a backward manner. More precisely, it starts by simulating index T(WT1:N)\mathcal{I}_{T}\sim\mathcal{M}(W_{T}^{1:N}) at time TT. Then, recursively for t=T,,1t=T,\ldots,1, given indices t:T\mathcal{I}_{t:T}, it generates t1{1,,N}\mathcal{I}_{t-1}\in\left\{1,\ldots,N\right\} with probability proportional to Wt1nmt(Xt1n,Xtt)W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}}). The smoothing trajectory is returned as (X00,,XTT)(X_{0}^{\mathcal{I}_{0}},\ldots,X_{T}^{\mathcal{I}_{T}}). Formally, given T\mathcal{F}_{T}, the indices 0:T\mathcal{I}_{0:T} are generated according to the distribution

\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\left[B_{T}^{N,\mathrm{FFBS}}(i_{T},\mathrm{d}i_{T-1})B_{T-1}^{N,\mathrm{FFBS}}(i_{T-1},\mathrm{d}i_{T-2})\ldots B_{1}^{N,\mathrm{FFBS}}(i_{1},\mathrm{d}i_{0})\right]

where the (random) backward kernels BtN,FFBSB_{t}^{N,\mathrm{FFBS}} are defined by

(4) B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1}):=\sum_{n=1}^{N}\frac{W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{i_{t}})}{\sum_{k=1}^{N}W_{t-1}^{k}m_{t}(X_{t-1}^{k},X_{t}^{i_{t}})}\delta_{n}(\mathrm{d}i_{t-1}).

More simply, we can also look at these random kernels as random N×NN\times N matrices of which entries are given by

(5) \hat{B}_{t}^{N,\mathrm{FFBS}}[i_{t},i_{t-1}]:=\frac{W_{t-1}^{i_{t-1}}m_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})}{\sum_{k=1}^{N}W_{t-1}^{k}m_{t}(X_{t-1}^{k},X_{t}^{i_{t}})}.

We will need both the kernel viewpoint (4) and the matrix viewpoint (5) in this paper as the better choice depends on the context.
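The backward pass of Example 1 may be sketched as follows, assuming arrays X and W returned by a bootstrap filter such as the sketch after Algorithm 1; the transition density m below is an illustrative placeholder matching that toy model.

```python
import numpy as np

rng = np.random.default_rng(1)

def m(xprev, xnext):                   # m_t(x_{t-1}, x_t) for X_t | X_{t-1} ~ N(0.9 x_{t-1}, 1)
    return np.exp(-0.5 * (xnext - 0.9 * xprev) ** 2) / np.sqrt(2 * np.pi)

def ffbs_trajectory(X, W):
    """Sample one index path I_{0:T} from M(W_T^{1:N}) and the kernels in (4),
    then return the corresponding smoothed state trajectory."""
    T, N = X.shape[0] - 1, X.shape[1]
    I = np.empty(T + 1, dtype=int)
    I[T] = rng.choice(N, p=W[T])
    for t in range(T, 0, -1):
        bw = W[t - 1] * m(X[t - 1], X[t, I[t]])            # unnormalised backward weights
        I[t - 1] = rng.choice(N, p=bw / bw.sum())          # O(N) cost per step
    return X[np.arange(T + 1), I]
```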

Example 2 (Genealogy tracking, Kitagawa,, 1996; Del Moral and Miclo,, 2001).

It is well known that Algorithm 1 already gives as a by-product an approximation of the smoothing distribution. This information can be extracted from the genealogy, by first simulating index T(WT1:N)\mathcal{I}_{T}\sim\mathcal{M}(W_{T}^{1:N}) at time TT, then successively appending ancestors until time 0 (i.e. setting sequentially t1Att\mathcal{I}_{t-1}\leftarrow A_{t}^{\mathcal{I}_{t}}). The smoothed trajectory is returned as (X00,,XTT)(X_{0}^{\mathcal{I}_{0}},\ldots,X_{T}^{\mathcal{I}_{T}}). More formally, conditioned on T\mathcal{F}_{T}, we simulate the indices 0:T\mathcal{I}_{0:T} according to

\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\left[B_{T}^{N,\mathrm{GT}}(i_{T},\mathrm{d}i_{T-1})B_{T-1}^{N,\mathrm{GT}}(i_{T-1},\mathrm{d}i_{T-2})\ldots B_{1}^{N,\mathrm{GT}}(i_{1},\mathrm{d}i_{0})\right]

where GT stands for “genealogy tracking” and the kernels BtN,GTB_{t}^{N,\mathrm{GT}} are simply

(6) B_{t}^{N,\mathrm{GT}}(i_{t},\mathrm{d}i_{t-1}):=\delta_{A_{t}^{i_{t}}}(\mathrm{d}i_{t-1}).

Again, it may be more intuitive to view this random kernel as a random N×NN\times N matrix, the elements of which are given by

\hat{B}_{t}^{N,\mathrm{GT}}[i_{t},i_{t-1}]:=\mathbbm{1}\left\{i_{t-1}=A_{t}^{i_{t}}\right\}.
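For comparison, a minimal sketch of the genealogy tracking smoother of Example 2: the backward step is a mere lookup in the ancestor variables returned by the filter (with the same illustrative convention as before that A[t - 1] stores $A_{t}^{1:N}$).

```python
import numpy as np

rng = np.random.default_rng(2)

def genealogy_trajectory(X, W, A):
    """Trace one smoothed trajectory back through the genealogy of the filter."""
    T, N = X.shape[0] - 1, X.shape[1]
    I = np.empty(T + 1, dtype=int)
    I[T] = rng.choice(N, p=W[T])
    for t in range(T, 0, -1):
        I[t - 1] = A[t - 1, I[t]]      # B_t^{N,GT}(i_t, .) is a Dirac mass at the ancestor
    return X[np.arange(T + 1), I]
```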
Example 3 (MCMC backward samplers, Bunch and Godsill,, 2013).

In Example 2, the backward variable t1\mathcal{I}_{t-1} is simply set to AttA_{t}^{\mathcal{I}_{t}}. On the contrary, in Example 1, we need to launch a simulator for the discrete measure Wt1nmt(Xt1n,Xtt)W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}}). Interestingly, the current value of AttA_{t}^{\mathcal{I}_{t}} is not taken into account in that simulator. Therefore, a natural idea to combine the two previous examples is to apply one (or several) MCMC steps to AttA_{t}^{\mathcal{I}_{t}} and assign the result to t1\mathcal{I}_{t-1}. The MCMC algorithm operates on the space {1,2,,N}\left\{1,2,\ldots,N\right\} and targets the invariant measure Wt1nmt(Xt1n,Xtt)W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{\mathcal{I}_{t}}). If only one independent Metropolis-Hastings (MH) step is used and the proposal is (Wt11:N)\mathcal{M}(W_{t-1}^{1:N}), the corresponding random matrix B^tN,IMH\hat{B}_{t}^{N,\mathrm{IMH}} has values

\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},i_{t-1}]=W_{t-1}^{i_{t-1}}\min\left(1,{m_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})}/{m_{t}(X_{t-1}^{A_{t}^{i_{t}}},X_{t}^{i_{t}})}\right)

if $i_{t-1}\neq A_{t}^{i_{t}}$, and

\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},A_{t}^{i_{t}}]=1-\sum_{n\neq A_{t}^{i_{t}}}\hat{B}_{t}^{N,\mathrm{IMH}}[i_{t},n].

This third example shows that some elements of the matrix $\hat{B}_{t}^{N,\mathrm{IMH}}$ might be expensive to calculate. If several MCMC steps are performed, all elements of $\hat{B}_{t}^{N,\mathrm{IMH}}$ will have non-trivial expressions. Still, simulating from $B_{t}^{N,\mathrm{IMH}}(i_{t},\mathrm{d}i_{t-1})$ is easy as it amounts to running a standard MCMC algorithm. MCMC backward samplers are studied in more detail in Section 4.1.
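The following sketch illustrates one draw from $B_{t}^{N,\mathrm{IMH}}(i_{t},\cdot)$, i.e. a single independent Metropolis-Hastings step started at the filtering ancestor; m is again an illustrative placeholder for the transition density.

```python
import numpy as np

rng = np.random.default_rng(3)

def m(xprev, xnext):                   # illustrative Gaussian transition density
    return np.exp(-0.5 * (xnext - 0.9 * xprev) ** 2) / np.sqrt(2 * np.pi)

def imh_backward_step(Xprev, Wprev, xt, ancestor):
    """One draw from B_t^{N,IMH}(i_t, .): propose from M(W_{t-1}^{1:N}), accept with
    probability min(1, ratio of transition densities), otherwise keep the ancestor."""
    N = len(Wprev)
    prop = rng.choice(N, p=Wprev)
    ratio = m(Xprev[prop], xt) / m(Xprev[ancestor], xt)
    return prop if rng.uniform() <= min(1.0, ratio) else ancestor
```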

We formalise how off-line smoothing can be done given random matrices B^1:TN\hat{B}_{1:T}^{N}; see Algorithm 2. Note that in the above examples, our matrices B^tN\hat{B}_{t}^{N} are t\mathcal{F}_{t}-measurable (i.e. they depend on particles and indices up to time tt), but this is not necessarily the case in general (i.e. they may also depend on additional random variables, see Section 2.5). Furthermore, Algorithm 2 describes how to perform smoothing using the matrices B^1:TN\hat{B}_{1:T}^{N}, but does not say where they come from. At this point, it is useful to keep in mind the above three examples. In Section 2.4, we will give a general recipe for constructing valid matrices B^tN\hat{B}_{t}^{N} (i.e. those that give a consistent algorithm).

Input: Filtering results $X_{0:T}^{1:N}$, $W_{0:T}^{1:N}$, and $A_{1:T}^{1:N}$ from Algorithm 1; random matrices $\hat{B}_{1:T}^{N}$ (see Section 2.3 for two examples of such matrices and Section 2.4 for a general recipe to construct them)
for $n\leftarrow 1$ to $N$ do
       Simulate $\mathcal{I}_{T}^{n}\sim\mathcal{M}(W_{T}^{1:N})$
       for $t\leftarrow T$ to $1$ do
             Simulate $\mathcal{I}_{t-1}^{n}\sim B_{t}^{N}(\mathcal{I}_{t}^{n},\mathrm{d}i_{t-1})$ (the kernel $B_{t}^{N}(\mathcal{I}_{t},\cdot)$ is defined by the $\mathcal{I}_{t}$-th row of the input matrix $\hat{B}_{t}^{N}$)
            
      
Output: The $N$ smoothed trajectories $(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})$ for $n=1,\ldots,N$
Algorithm 2 Generic off-line smoother

Algorithm 2 simulates, given T\mathcal{F}_{T} and B^1:TN\hat{B}_{1:T}^{N}, NN i.i.d. index sequences 0:Tn\mathcal{I}_{0:T}^{n}, each distributed according to

\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\prod_{t=T}^{1}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1}).

Once the indices 0:T1:N\mathcal{I}_{0:T}^{1:N} are simulated, the NN smoothed trajectories are returned as (X00n,,XTTn)(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}}). Given T\mathcal{F}_{T} and B^1:TN\hat{B}_{1:T}^{N}, they are thus conditionally i.i.d. and their conditional distribution is described by the x0:Tx_{0:T} component of the joint distribution

(7) \bar{\mathbb{Q}}_{T}^{N}(\mathrm{d}x_{0:T},\mathrm{d}i_{0:T}):=\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})\left[\prod_{t=T}^{1}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\right]\left[\prod_{t=T}^{0}\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})\right].

Throughout the paper, the symbol ¯TN\bar{\mathbb{Q}}_{T}^{N} will refer to this joint distribution, while the symbol TN\mathbb{Q}_{T}^{N} will refer to the x0:Tx_{0:T}-marginal of ¯TN\bar{\mathbb{Q}}_{T}^{N} only. This allows the notation TNφ\mathbb{Q}_{T}^{N}\varphi to make sense, where φ=φ(x0,,xT)\varphi=\varphi(x_{0},\ldots,x_{T}) is a real-valued function defined on the hidden states.

2.4. Validity and convergence

The kernels BtN,FFBSB_{t}^{N,\text{FFBS}} and BtN,GTB_{t}^{N,\text{GT}} are both valid backward kernels to generate convergent approximation of the smoothing distribution (Del Moral,, 2004; Douc et al.,, 2011). This subsection shows that they are not the only ones and gives a sufficient condition for a backward kernel to be valid. It will prove a necessary tool to build more efficient BtNB_{t}^{N} later in the paper.

Recall that Algorithm 1 outputs particles X0:T1:NX_{0:T}^{1:N}, weights W0:T1:NW_{0:T}^{1:N} and ancestor variables A1:T1:NA_{1:T}^{1:N}. Imagine that the A1:T1:NA_{1:T}^{1:N} were discarded after filtering has been done and we wish to simulate them back. We note that, since the X0:T1:NX_{0:T}^{1:N} are given, the T×NT\times N variables A1:T1:NA_{1:T}^{1:N} are conditionally i.i.d. We can thus simulate them back from

p(a_{t}^{n}|x_{0:T}^{1:N})=p(a_{t}^{n}|x_{t-1}^{1:N},x_{t}^{n})\propto w_{t-1}^{a_{t}^{n}}m_{t}(x_{t-1}^{a_{t}^{n}},x_{t}^{n}).

This is precisely the distribution of $B_{t}^{N,\text{FFBS}}(n,\cdot)$. It turns out that any other invariant kernel that can be used for simulating back the discarded $A_{1:T}^{1:N}$ will lead to a convergent algorithm as well. For instance, $B_{t}^{N,\mathrm{GT}}(n,\cdot)$ (Example 2) simply returns the old $A_{t}^{n}$, unlike $B_{t}^{N,\mathrm{FFBS}}(n,\cdot)$, which creates a new version. The kernel $B_{t}^{N,\mathrm{IMH}}(n,\cdot)$ (Example 3) is intermediate between the two. We formalise these intuitions in the following theorem. It is stated for the bootstrap particle filter but, as a matter of fact, the proof can be extended straightforwardly to auxiliary particle filters as well.

Assumption 1.

For all 0tT0\leq t\leq T, Gt(xt)>0G_{t}(x_{t})>0 and Gt<\left\lVert G_{t}\right\rVert_{\infty}<\infty.

Theorem 2.1.

We use the same notations as in Algorithms 1 and 2 (in particular, $\hat{B}_{t}^{N}$ denotes the transition matrix that corresponds to the considered kernel $B_{t}^{N}$). Assume that for any $1\leq t\leq T$, the random matrix $\hat{B}_{t}^{N}$ satisfies the following conditions:

  • given $\mathcal{F}_{t-1}$ and $\hat{B}_{1:t-1}^{N}$, the variables $(X_{t}^{n},A_{t}^{n},\hat{B}_{t}^{N}(n,\cdot))$ for $n=1,\ldots,N$ are i.i.d. and their distribution only depends on $X_{t-1}^{1:N}$, where $\hat{B}_{t}^{N}(n,\cdot)$ is the $n$-th row of matrix $\hat{B}_{t}^{N}$;

  • if $J_{t}^{n}$ is a random variable such that

    $J_{t}^{n}\ |\ X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N}(n,\cdot)\sim B_{t}^{N}(n,\cdot),$

    then $(J_{t}^{n},X_{t}^{n})$ has the same distribution as $(A_{t}^{n},X_{t}^{n})$ given $X_{t-1}^{1:N}$.

Then, under Assumption 1, there exist constants $C_{T}>0$ and $S_{T}<\infty$ such that, for any $\delta>0$ and function $\varphi=\varphi(x_{0},\ldots,x_{T})$:

(8) \mathbb{P}\left(\left|\mathbb{Q}_{T}^{N}\varphi-\mathbb{Q}_{T}\varphi\right|\geq\frac{\sqrt{-2\log(\delta/2C_{T})}S_{T}\left\lVert\varphi\right\rVert_{\infty}}{\sqrt{N}}\right)\leq\delta

where TN\mathbb{Q}_{T}^{N} is defined by (7).

A typical relation between variables defined in the statement of the theorem is illustrated by a graphical model in Figure 1. (See Bishop, 2006, Chapter 8 for the formal definition of graphical models and how to use them.) By “typical”, we mean that Theorem 2.1 technically allows for more complicated relations, but the aforementioned figure captures the most essential cases.

Figure 1. Relation between variables described in Theorem 2.1 (a graphical model over $X_{t-1}^{1:N}$, $A_{t}^{n}$, $X_{t}^{n}$, $\hat{B}_{t}^{N}(n,\cdot)$ and $J_{t}^{n}$).

Theorem 2.1 is a generalisation of Douc et al., (2011, Theorem 5). Its proof thus follows the same lines (Supplement E.1). However, in our case the measure TN(dx0:T)\mathbb{Q}_{T}^{N}(\mathrm{d}x_{0:T}) is no longer Markovian. This is because the backward kernel BtN(it,dit1)B_{t}^{N}(i_{t},\mathrm{d}i_{t-1}) does not depend on XtitX_{t}^{i_{t}} alone, but also possibly on its ancestor and extra random variables. This small difference has a big consequence: compared to Douc et al., (2011, Theorem 5), Theorem 2.1 has a much broader applicability and encompasses, for instance, the MCMC-based algorithms presented in Section 4.1 and novel kernels presented in Section 4.2 for intractable densities.

As we have seen in (7), $\mathbb{Q}_{T}^{N}$ is fundamentally a discrete measure whose support contains $N^{T+1}$ elements. As such, $\mathbb{Q}_{T}^{N}\varphi$ cannot be computed exactly in general and must be approximated using $N$ trajectories $(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})$ simulated via Algorithm 2. Theorem 2.1 is thus completed by the following corollary, which is an immediate consequence of Hoeffding's inequality (Supplement E.13).

Corollary 1.

Under the same setting as Theorem 2.1, we have

\mathbb{P}\left(\left|\frac{1}{N}\sum_{n}\varphi(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})-\mathbb{Q}_{T}\varphi\right|\geq\frac{\sqrt{-2\log\left(\frac{\delta}{2(C_{T}+1)}\right)}(S_{T}+1)\left\lVert\varphi\right\rVert_{\infty}}{\sqrt{N}}\right)\leq\delta.

2.5. Generic on-line smoother

As we have seen in Section 2.3 and Section 2.4, in general, the expectation $\mathbb{Q}_{T}^{N}\varphi$, for a real-valued function $\varphi=\varphi(x_{0},\ldots,x_{T})$ of the hidden states, cannot be computed exactly due to the large support ($N^{T+1}$ elements) of $\mathbb{Q}_{T}^{N}$. Moreover, in certain settings we are interested in the quantities $\mathbb{Q}_{t}^{N}\varphi_{t}$ for different functions $\varphi_{t}$. They cannot be approximated in an on-line manner without more assumptions on the connection between $\varphi_{t-1}$ and $\varphi_{t}$. If the family $(\varphi_{t})$ is additive, i.e. there exist functions $\psi_{t}$ such that

(9) \varphi_{t}(x_{0:t}):=\psi_{0}(x_{0})+\psi_{1}(x_{0},x_{1})+\cdots+\psi_{t}(x_{t-1},x_{t})

then we can calculate tNφt\mathbb{Q}_{t}^{N}\varphi_{t} both exactly and on-line. The procedure was first described in Del Moral et al., (2010) for the kernel tN,FFBS\mathbb{Q}_{t}^{N,\text{FFBS}} (i.e. the measure defined by (7) and the random kernels BtN,FFBSB_{t}^{N,\text{FFBS}}), but we will use the idea for other kernels as well. In this subsection, we first explain the principle of the method, then discuss its computational complexity and the link to the PaRIS algorithm (Olsson and Westerborn,, 2017).

Principle

For simplicity, we start with the special case φt(x0:t)=ψ0(x0)\varphi_{t}(x_{0:t})=\psi_{0}(x_{0}). Equation (7) and the matrix viewpoint of Markov kernels then give

\mathbb{Q}_{t}^{N}\varphi_{t}=\begin{bmatrix}W_{t}^{1}\ldots W_{t}^{N}\end{bmatrix}\hat{B}_{t}^{N}\hat{B}_{t-1}^{N}\ldots\hat{B}_{1}^{N}\begin{bmatrix}\psi_{0}(X_{0}^{1})\\ \vdots\\ \psi_{0}(X_{0}^{N})\end{bmatrix}.

This naturally suggests the following recursion formula to compute tNφt\mathbb{Q}_{t}^{N}\varphi_{t}:

\mathbb{Q}_{t}^{N}\varphi_{t}=\begin{bmatrix}W_{t}^{1}\ldots W_{t}^{N}\end{bmatrix}\hat{S}_{t}^{N}

with $\hat{S}_{0}^{N}=[\psi_{0}(X_{0}^{1})\ldots\psi_{0}(X_{0}^{N})]^{\top}$ and

(10) \hat{S}_{t}^{N}:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}.

In the general case where functions φt\varphi_{t} are given by (9), simple calculations (Supplement E.2) show that (10) is replaced by

(11) \hat{S}_{t}^{N}:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}+\operatorname{diag}(\hat{B}_{t}^{N}\hat{\psi}_{t}^{N})

where the N×NN\times N matrix ψ^tN\hat{\psi}_{t}^{N} is defined by

\hat{\psi}_{t}^{N}[i_{t-1},i_{t}]:=\psi_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})

and the operator diag:N×NN\operatorname{diag}:\mathbb{R}^{N\times N}\rightarrow\mathbb{R}^{N} extracts the diagonal of a matrix. This is exactly what is done in Algorithm 3.

Input: Particles $X_{t-1}^{1:N}$ and weights $W_{t-1}^{1:N}$ at time $t-1$; the $N\times 1$ vector $\hat{S}_{t-1}^{N}$ (see text); additive function (9)
Generate $X^{1:N}_{t}$ and $W_{t}^{1:N}$ according to the particle filter (Algorithm 1)
Calculate the random matrix $\hat{B}_{t}^{N}$ (see Section 2.3 and Section 2.4)
Create the $N\times 1$ vector $\hat{S}_{t}^{N}$ according to (11). More precisely:
for $i_{t}\leftarrow 1$ to $N$ do
       $\hat{S}_{t}^{N}[i_{t}]\leftarrow\sum_{i_{t-1}}\hat{B}_{t}^{N}[i_{t},i_{t-1}]\left(\hat{S}_{t-1}^{N}[i_{t-1}]+\psi_{t}(X_{t-1}^{i_{t-1}},X_{t}^{i_{t}})\right)$
Output: Quantity $\sum_{n}W_{t}^{n}\hat{S}_{t}^{N}[n]$, which is equal to $\mathbb{Q}_{t}^{N}\varphi_{t}$ and is an estimate of $\mathbb{Q}_{t}(\varphi_{t})$; particles $X_{t}^{1:N}$, weights $W_{t}^{1:N}$ and vector $\hat{S}_{t}^{N}$ for the next step
Algorithm 3 Generic on-line smoother for additive functions (one step)
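As an illustration, here is a minimal sketch of one step of Algorithm 3 with the FFBS kernel, i.e. the $\mathcal{O}(N^{2})$ recursion (11); m and psi are illustrative placeholders for the transition density and the summand in (9).

```python
import numpy as np

def m(xprev, xnext):                   # illustrative Gaussian transition density
    return np.exp(-0.5 * (xnext - 0.9 * xprev) ** 2) / np.sqrt(2 * np.pi)

def psi(xprev, xnext):                 # psi_t(x_{t-1}, x_t), e.g. a sufficient statistic
    return xprev * xnext

def online_step_ffbs(Xprev, Wprev, Xt, Wt, Sprev):
    """Update the N x 1 vector S via (11) with B_t^{N,FFBS}, and return the
    updated vector together with the current estimate of Q_t^N(phi_t)."""
    B = Wprev[None, :] * m(Xprev[None, :], Xt[:, None])    # (5): row index i_t, column i_{t-1}
    B /= B.sum(axis=1, keepdims=True)
    Psi = psi(Xprev[None, :], Xt[:, None])                 # psi_t(X_{t-1}^{i_{t-1}}, X_t^{i_t})
    S = (B * (Sprev[None, :] + Psi)).sum(axis=1)           # recursion (11), row by row
    return S, np.dot(Wt, S)
```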

Computational complexity and the PaRIS algorithm

Equations (10) and (11) involve a matrix-vector multiplication and thus require, in general, $\mathcal{O}(N^{2})$ operations to be evaluated. When $\hat{B}_{t}^{N}\equiv\hat{B}_{t}^{N,\text{FFBS}}$, Algorithm 3 becomes the $\mathcal{O}(N^{2})$ on-line smoothing algorithm of Del Moral et al., (2010). The $\mathcal{O}(N^{2})$ complexity can however be lowered to $\mathcal{O}(N)$ if the matrices $\hat{B}_{t}^{N}$ are sparse. This is the idea behind the PaRIS algorithm (Olsson and Westerborn,, 2017), where the full matrix $\hat{B}_{t}^{N,\text{FFBS}}$ is unbiasedly estimated by a sparse matrix $\hat{B}_{t}^{N,\text{PaRIS}}$. More specifically, for any integer $\tilde{N}>1$ and any $n\in\{1,\ldots,N\}$, let $J_{t}^{n,1},\ldots,J_{t}^{n,\tilde{N}}$ be conditionally i.i.d. random variables simulated from $B_{t}^{N,\text{FFBS}}(n,\cdot)$. The random matrix $\hat{B}_{t}^{N,\text{PaRIS}}$ is then defined as

\hat{B}_{t}^{N,\text{PaRIS}}[n,m]:=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\mathbbm{1}\left\{J_{t}^{n,\tilde{n}}=m\right\}

and the corresponding random kernel is

(12) B_{t}^{N,\text{PaRIS}}(n,\mathrm{d}m)=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\delta_{J_{t}^{n,\tilde{n}}}(\mathrm{d}m).

The following straightforward proposition establishes the validity of the BtN,PaRISB_{t}^{N,\text{PaRIS}} kernel. Together with Theorem 2.1, it can be thought of as a reformulation of the consistency of the PaRIS algorithm (Olsson and Westerborn,, 2017, Corollary 2) in the language of our framework.

Proposition 1.

The matrix $\hat{B}_{t}^{N,\mathrm{PaRIS}}$ has only $\mathcal{O}(N\tilde{N})$ non-zero elements out of $N^{2}$. It is an unbiased estimate of $\hat{B}_{t}^{N,\mathrm{FFBS}}$ in the sense that

\mathbb{E}\left[\left.{\hat{B}_{t}^{N,\mathrm{PaRIS}}}\right|{\mathcal{F}_{t}}\right]=\hat{B}_{t}^{N,\mathrm{FFBS}}.

Moreover, the sequence of matrices B1:TN,PaRISB_{1:T}^{N,\mathrm{PaRIS}} satisfies the two conditions of Theorem 2.1.

The proposition also justifies the 𝒪(N)\mathcal{O}(N) complexity of (10) and (11), as long as N~\tilde{N} is fixed as NN\to\infty. But it is important to remark that the preceding 𝒪(N)\mathcal{O}(N) complexity does not include the cost of generating the matrices B^tN,PaRIS\hat{B}_{t}^{N,\text{PaRIS}} themselves, i.e., the operations required to simulate the indices Jtn,n~J_{t}^{n,\tilde{n}}. In Olsson and Westerborn, (2017) it is argued that such simulations have an 𝒪(N)\mathcal{O}(N) cost using the rejection sampling method whenever the transition density is both upper and lower bounded. Section 3 investigates the claim when this hypothesis is violated.
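A minimal sketch of one PaRIS update follows; for clarity the indices $J_{t}^{n,\tilde{n}}$ are drawn here by exact multinomial sampling, which costs $\mathcal{O}(N^{2})$ overall, whereas the rejection-based samplers of Section 3 aim at $\mathcal{O}(N)$. The functions m and psi are the same illustrative placeholders as in the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(4)

def m(xprev, xnext):
    return np.exp(-0.5 * (xnext - 0.9 * xprev) ** 2) / np.sqrt(2 * np.pi)

def psi(xprev, xnext):
    return xprev * xnext

def paris_step(Xprev, Wprev, Xt, Wt, Sprev, Ntilde=2):
    """One step of Algorithm 3 with the sparse kernel (12): each row of B is an
    empirical measure of Ntilde draws from B_t^{N,FFBS}(n, .)."""
    N = len(Xt)
    S = np.empty(N)
    for n in range(N):
        bw = Wprev * m(Xprev, Xt[n])                       # backward weights for particle n
        J = rng.choice(N, size=Ntilde, p=bw / bw.sum())    # J_t^{n,1:Ntilde}
        S[n] = np.mean(Sprev[J] + psi(Xprev[J], Xt[n]))    # row n of (11) with (12)
    return S, np.dot(Wt, S)
```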

2.6. Stability

When B^tNB^tN,GT\hat{B}_{t}^{N}\equiv\hat{B}_{t}^{N,\text{GT}}, Algorithms 2 and 3 reduce to the genealogy tracking smoother (Kitagawa,, 1996). The matrix B^tN,GT\hat{B}_{t}^{N,\text{GT}} is indeed sparse, leading to the well-known 𝒪(N)\mathcal{O}(N) complexity of this on-line procedure. As per Theorem 2.1, smoothing via genealogy tracking is convergent at rate 𝒪(N1/2)\mathcal{O}(N^{-1/2}) if TT is fixed. When TT\to\infty however, all particles will eventually share the same ancestor at time 0 (or any fixed time tt). Mathematically, this phenomenon is manifested in two ways: (a) for fixed tt and function ϕt:𝒳t\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}, the error of estimating 𝔼[ϕt(Xt)|Y0:T]\mathbb{E}[\phi_{t}(X_{t})|Y_{0:T}] grows linearly with TT; and (b) the error of estimating 𝔼[t=0Tψt(xt1,xt)|Y0:T]\mathbb{E}\left[\left.{\sum_{t=0}^{T}\psi_{t}(x_{t-1},x_{t})}\right|{Y_{0:T}}\right] grows quadratically with TT. These correspond respectively to the degeneracy for the fixed marginal smoothing and the additive smoothing problems; see also the introductory section of Olsson and Westerborn, (2017) for a discussion. The random matrices B^tN,GT\hat{B}_{t}^{N,\text{GT}} are therefore said to be unstable as TT\to\infty, which is not the case for B^tN,FFBS\hat{B}_{t}^{N,\text{FFBS}} or B^tN,PaRIS\hat{B}_{t}^{N,\text{PaRIS}}. This subsection gives sufficient conditions to ensure the stability of a general B^tN\hat{B}_{t}^{N}.

The essential point behind smoothing stability is simple: the support of BtN,FFBS(n,)B_{t}^{N,\mathrm{FFBS}}(n,\cdot) or BtN,PaRIS(n,)B_{t}^{N,\mathrm{PaRIS}}(n,\cdot) for N~2\tilde{N}\geq 2 contains more than one element, contrary to that of BtN,GT(n,)B_{t}^{N,\mathrm{GT}}(n,\cdot). This property is formalised by (13). To explain the intuitions, we use the notations of Algorithm 2 and consider the estimate

N^{-1}\left(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}})+\cdots+\psi_{0}(X_{0}^{\mathcal{I}_{0}^{N}})\right)

of 𝔼[ψ0(X0)|Y0:T]\mathbb{E}\left[\left.{\psi_{0}(X_{0})}\right|{Y_{0:T}}\right] when TT\to\infty. The variance of the quantity above is a sum of Cov(ψ0(X00i),ψ0(X00j))\mathrm{Cov}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{i}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{j}})) terms. It can therefore be understood by looking at a pair of trajectories simulated using Algorithm 2.

At final time t=Tt=T, T1\mathcal{I}_{T}^{1} and T2\mathcal{I}_{T}^{2} both follow the (WT1:N)\mathcal{M}(W_{T}^{1:N}) distribution. Under regularity conditions (e.g. no extreme weights), they are likely to be different, i.e., (T1=T2)=𝒪(1/N)\mathbb{P}(\mathcal{I}_{T}^{1}=\mathcal{I}_{T}^{2})=\mathcal{O}(1/N). This property can be propagated backward: as long as t1t2\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}, the two variables t11\mathcal{I}_{t-1}^{1} and t12\mathcal{I}_{t-1}^{2} are also likely to be different, with however a small 𝒪(1/N)\mathcal{O}(1/N) chance of being equal. Moreover, as long as the two trajectories have not met, they can be simulated independently given T\mathcal{F}_{T}^{-} (the sigma algebra defined in (3)). In mathematical terms, under the two hypotheses of Theorem 2.1, given T\mathcal{F}_{T}^{-} and t:T1,2\mathcal{I}_{t:T}^{1,2}, it can be proved that the two variables t11\mathcal{I}_{t-1}^{1} and t12\mathcal{I}_{t-1}^{2} are independent if t1t2\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2} (Lemma E.1, Supplement E.3).

Since there is an $\mathcal{O}(1/N)$ chance of meeting at each time step, if $T\gg N$, it is likely that the two paths will meet at some point $t\gg 0$. When $\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}$, the two indices $\mathcal{I}_{t-1}^{1}$ and $\mathcal{I}_{t-1}^{2}$ are both simulated according to $B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot)$. In the genealogy tracking algorithm, $B_{t}^{N,\text{GT}}(i,\cdot)$ is a Dirac measure, leading to $\mathcal{I}_{t-1}^{1}=\mathcal{I}_{t-1}^{2}$ almost surely. This spreads until time 0, so $\operatorname{Corr}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{2}}))$ is almost $1$ if $T\gg N$.

Other kernels like BtN,FFBSB_{t}^{N,\text{FFBS}} or BtN,PaRISB_{t}^{N,\text{PaRIS}} do not suffer from the same problem. For these, the support size of BtN(t1,)B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot) is greater than one and thus there is some real chance that t11t12\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}. If that does happen, we are again back to the regime where the next states of the two paths can be simulated independently. Note also that the support of BtN(t1,)B_{t}^{N}(\mathcal{I}_{t}^{1},\cdot) does not need to be large and can contain as few as 22 elements. Even if t11\mathcal{I}_{t-1}^{1} might still be equal to t12\mathcal{I}_{t-1}^{2} with some probability, the two paths will have new chances to diverge at times t2t-2, t3t-3 and so on. Overall, this makes Corr(ψ0(X001),ψ0(X002))\operatorname{Corr}(\psi_{0}(X_{0}^{\mathcal{I}_{0}^{1}}),\psi_{0}(X_{0}^{\mathcal{I}_{0}^{2}})) quite small (Lemma E.3, Supplement E.3).

We formalise these arguments in the following theorem, whose proof (Supplement E.3) follows them very closely. The price for proof intuitiveness is that the theorem is specific to the bootstrap filter, although numerical evidence (Section 5) suggests that other filters are stable as well.

Assumption 2.

The transition densities mtm_{t} are upper and lower bounded:

\bar{M}_{\ell}\leq m_{t}(x_{t-1},x_{t})\leq\bar{M}_{h}

for constants 0<M¯<M¯h<0<\bar{M}_{\ell}<\bar{M}_{h}<\infty.

Assumption 3.

The potential functions GtG_{t} are upper and lower bounded:

\bar{G}_{\ell}\leq G_{t}(x_{t})\leq\bar{G}_{h}

for constants 0<G¯<G¯h<0<\bar{G}_{\ell}<\bar{G}_{h}<\infty.

Remark. Since Assumption 2 implies that the $\mathcal{X}_{t}$'s are compact, Assumption 1 automatically implies Assumption 3 as soon as the $G_{t}$'s are continuous functions.

Theorem 2.2.

We use the notations of Algorithms 1 and 2. Suppose that Assumptions 2 and 3 hold and the random kernels B1:TNB_{1:T}^{N} satisfy the conditions of Theorem 2.1. If, in addition, for the pair of random variables (Jtn,1,Jtn,2)(J_{t}^{n,1},J_{t}^{n,2}) whose distribution given Xt11:NX_{t-1}^{1:N}, XtnX_{t}^{n} and B^tN(n,)\hat{B}_{t}^{N}(n,\cdot) is defined by BtN(n,)BtN(n,)B_{t}^{N}(n,\cdot)\otimes B_{t}^{N}(n,\cdot), we have

(13) \mathbb{P}\left(\left.{J_{t}^{n,1}\neq J_{t}^{n,2}}\right|{X_{t-1}^{1:N},X_{t}^{n}}\right)\geq\varepsilon_{\mathrm{S}}

for some εS>0\varepsilon_{\mathrm{S}}>0 and all tt, nn; then there exists a constant CC not depending on TT such that:

  • fixed marginal smoothing is stable, i.e. for $s\in\left\{0,\ldots,T\right\}$ and a real-valued function $\phi_{s}:\mathcal{X}_{s}\to\mathbb{R}$ of the hidden state $X_{s}$, we have

    (14) \mathbb{E}\left[\left(\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{s})\phi_{s}(x_{s})-\mathbb{E}\left[\left.{\phi_{s}(X_{s})}\right|{Y_{0:T}}\right]\right)^{2}\right]\leq\frac{C\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}{N};

  • additive smoothing is stable, i.e. for $T\geq 2$ and the function $\varphi_{T}$ defined in (9), we have

    (15) \mathbb{E}\left[\left(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right]\leq\frac{C\sum_{t=0}^{T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}}{N}\left(1+\sqrt{\frac{T}{N}}\right)^{2}.

In particular, when BtNB_{t}^{N} is the PaRIS kernel with N~2\tilde{N}\geq 2, Theorem 2.2 implies a novel non-asymptotic bound for the PaRIS algorithm. Olsson and Westerborn, (2017) first established a central limit theorem as NN\to\infty and TT fixed, then showed that the asymptotic variance is controlled as TT\to\infty. In contrast, we follow an original approach (whose intuition is explained at the beginning of this subsection) in order to derive a finite sample size bound.

The main technical difficulty is to prove the fast mixing of the Markov kernel product $B_{t}^{N}B_{t-1}^{N}\ldots B_{t^{\prime}}^{N}$ in terms of $t-t^{\prime}$. For the original FFBS kernel, the stability proof by Douc et al., (2011) relies on the uniform Doeblin property of each of the terms $B_{s}^{N,\mathrm{FFBS}}$ (page 2136, towards the end of their proof of Lemma 10) and, from there, deduces the exponentially fast mixing of the product. When $B_{s}^{N,\mathrm{FFBS}}$ is approximated by a sparse matrix $B_{s}^{N}$ (which is the case for PaRIS, but also for certain MCMC-based and coupling-based smoothers that we shall see later), the aforementioned property no longer holds for each individual term $B_{s}^{N}$. Interestingly however, the good mixing of $B_{t}^{N,\mathrm{FFBS}}\ldots B_{t^{\prime}}^{N,\mathrm{FFBS}}$ is still conserved in the product $B_{t}^{N}\ldots B_{t^{\prime}}^{N}$. In Lemma E.3, we show that two trajectories generated via the latter kernel have such a small correlation that they are virtually indistinguishable from two independent trajectories generated via the former one.

Theorem 2.2 is stated under strong assumptions (similar to those used in Chopin and Papaspiliopoulos, 2020, Chapter 11.4, and slightly stronger than Douc et al., 2011, Assumption 4). On the other hand, it applies to a large class of backward kernels (rather than only FFBS), including the new ones introduced in the forthcoming sections.

In the proof of this theorem, we proceed in two steps: first, we apply existing bounds (Dubarry and Le Corff,, 2013, Theorem 3.1 and Del Moral,, 2013, Chapter 17) for the error between the BtN,FFBSB_{t}^{N,\mathrm{FFBS}}-induced distribution and the true target; and second, we use our own techniques to control the error when BtN,FFBSB_{t}^{N,\mathrm{FFBS}} is replaced by any other kernel BtnB_{t}^{n} satisfying (13). The (1+T/N)2(1+\sqrt{T/N})^{2} term in (15) comes from the first part and we do not know whether it can be dropped. However, it does not affect the scaling of the algorithm. Indeed, with or without it, the inequality implies that in order to have a constant error in the additive smoothing problem, one only has to take N=𝒪(T)N=\mathcal{O}(T) (instead of N=𝒪(T2)N=\mathcal{O}(T^{2}) without backward sampling). Moreover, from an asymptotic point of view, we always have σ2(T)=𝒪(T)\sigma^{2}(T)=\mathcal{O}(T) regardless of the presence of the (1+T/N)2(1+\sqrt{T/N})^{2} term, where σ2(T):=limNN𝔼[(TN(φT)T(φT))2]\sigma^{2}(T):=\lim_{N\to\infty}N\mathbb{E}\left[\left(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right].

3. Sampling from the FFBS Backward Kernels

Sampling from the FFBS backward kernel lies at the heart of both the FFBS algorithm (Example 1) and the PaRIS one (Section 2.5). Indeed, at time $t$, they require generating random variables distributed according to $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ for $i_{t}$ running from $1$ to $N$. Since sampling from a discrete measure on $N$ elements requires $\mathcal{O}(N)$ operations (e.g. via CDF inversion), the total computational cost becomes $\mathcal{O}(N^{2})$. To reduce this, we start by considering the subclass of models satisfying the following assumption, which is much weaker than Assumption 2.

Assumption 4.

The transition density $m_{t}(x_{t-1},x_{t})$ is strictly positive and upper bounded, i.e. there exists $\bar{M}_{h}>0$ such that $0<m_{t}(x_{t-1},x_{t})\leq\bar{M}_{h}$ for all $(x_{t-1},x_{t})$.

The motivation for the first condition $0<m_{t}(x_{t-1},x_{t})$ will become clear once Assumption 5 is stated. For now, we note that it is possible to sample from $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ using rejection sampling with the proposal distribution $\mathcal{M}(W_{t-1}^{1:N})$. After an $\mathcal{O}(N)$-cost initialisation, new draws can be simulated from the proposal in amortised $\mathcal{O}(1)$ time; see Chopin and Papaspiliopoulos, (2020, Python Corner, Chapter 9), and also Douc et al., (2011, Appendix B.1) for an alternative algorithm with an $\mathcal{O}(\log N)$ cost per draw. The resulting procedure is summarised in Algorithm 4. Compared to traditional FFBS or PaRIS implementations, these rejection-based variants have a random execution time that is more difficult to analyse. Under Assumption 2, Douc et al., (2011) and Olsson and Westerborn, (2017) derive an $\mathcal{O}(N\bar{M}_{h}/\bar{M}_{\ell})$ expected complexity. However, the general picture, where the state space is not compact and only Assumption 4 holds, is less clear.
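Regarding the amortised-$\mathcal{O}(1)$ proposal sampler mentioned above, the following is a minimal sketch of one classical construction, Walker's alias method (named here only as an illustration; it is not necessarily the implementation used in the references above).

```python
import numpy as np

def build_alias(weights):
    """Pre-compute Walker's alias tables; O(N) once, then O(1) per draw."""
    w = np.asarray(weights, dtype=float)
    n = len(w)
    prob = w * n / w.sum()          # scaled so that the average entry is 1
    alias = np.zeros(n, dtype=int)
    small = [i for i in range(n) if prob[i] < 1.0]
    large = [i for i in range(n) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                 # leftover mass of cell s points to cell l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias, rng):
    """One draw from M(W^{1:N}) in O(1) time."""
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]

rng = np.random.default_rng(0)
prob, alias = build_alias(rng.random(1000))
draws = [alias_draw(prob, alias, rng) for _ in range(10)]
```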

Input: Particles Xt11:NX_{t-1}^{1:N} and weights Wt11:NW_{t-1}^{1:N} at time t1t-1; particle XtitX_{t}^{i_{t}} at time tt; constant M¯h\bar{M}_{h}; pre-initialised 𝒪(1)\mathcal{O}(1) sampler for (Wt11:N)\mathcal{M}(W_{t-1}^{1:N})
repeat
       t1(Wt11:N)\mathcal{I}_{t-1}\sim\mathcal{M}(W_{t-1}^{1:N}) using the pre-initialised 𝒪(1)\mathcal{O}(1) sampler
       UUnif[0,1]U\sim\operatorname{Unif}[0,1]
      
until Umt(Xt1t1,Xtit)/M¯hU\leq m_{t}(X_{t-1}^{\mathcal{I}_{t-1}},X_{t}^{i_{t}})/\bar{M}_{h}
Output: t1\mathcal{I}_{t-1}, which is distributed according to BtN,FFBS(it,dit1)B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1}).
Algorithm 4 Pure rejection sampler for simulating from BtN,FFBS(it,dit1)B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})
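For concreteness, here is a minimal Python sketch of Algorithm 4, with a toy one-dimensional Gaussian random-walk transition density (our own choice, purely for illustration); `sample_proposal` stands for the pre-initialised $\mathcal{O}(1)$ sampler and `m_t` for the transition density.

```python
import numpy as np

def pure_rejection_backward(x_prev, x_t, m_bar, m_t, sample_proposal, rng):
    """Algorithm 4: rejection sampler for B_t^{N,FFBS}(i_t, .).
    x_prev: particles at time t-1; x_t: the particle X_t^{i_t};
    m_t(x', x): transition density, bounded by m_bar;
    sample_proposal(rng): O(1) draw of an index from M(W_{t-1}^{1:N})."""
    while True:
        i = sample_proposal(rng)
        if rng.random() <= m_t(x_prev[i], x_t) / m_bar:
            return i

# Toy one-dimensional Gaussian random-walk transition (an assumption for illustration).
sigma = 1.0
m_t = lambda xp, x: np.exp(-0.5 * ((x - xp) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
m_bar = 1.0 / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(1000)                        # toy particles at time t-1
sample_proposal = lambda rng: rng.integers(len(x_prev))   # uniform weights in this toy case
i_backward = pure_rejection_backward(x_prev, 0.3, m_bar, m_t, sample_proposal, rng)
```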

The present subsection fills this gap. Our main focus is the PaRIS algorithm, whose presentation is simpler; results for the FFBS algorithm can be found in Supplement B. We restrict ourselves to the case where $\mathcal{X}_{t}=\mathbb{R}^{d_{t}}$, although extensions to other non-compact state spaces are possible. Only the bootstrap particle filter is considered, and the results of this section do not extend trivially to other filtering algorithms. In Section 5, we shall employ different types of particle filters and see that the performance can change from one type to another, which is an additional weak point of rejection-based algorithms.

Assumption 5.

The hidden state XtX_{t} is defined on the space 𝒳t=dt\mathcal{X}_{t}=\mathbb{R}^{d_{t}}. The measure λt(dxt)\lambda_{t}(\mathrm{d}x_{t}) with respect to which the transition density mt(xt1,xt)m_{t}(x_{t-1},x_{t}) is defined (cf. (2)) is the Lebesgue measure on dt\mathbb{R}^{d_{t}}.

This assumption together with the condition mt(xt1,xt)>0m_{t}(x_{t-1},x_{t})>0 of Assumption 4 ensures that the state space model is “truly non-compact”. Indeed, if mt(xt1,xt)m_{t}(x_{t-1},x_{t}) is zero whenever xt1𝒞t1x_{t-1}\notin\mathcal{C}_{t-1} or xt𝒞tx_{t}\notin\mathcal{C}_{t}, where 𝒞t1\mathcal{C}_{t-1} and 𝒞t\mathcal{C}_{t} are respectively two compact subsets of dt1\mathbb{R}^{d_{t-1}} and dt\mathbb{R}^{d_{t}}, then we are basically reduced to a state space model where 𝒳t1=𝒞t1\mathcal{X}_{t-1}=\mathcal{C}_{t-1} and 𝒳t=𝒞t\mathcal{X}_{t}=\mathcal{C}_{t}.

3.1. Complexity of PaRIS algorithm with pure rejection sampling

We consider the PaRIS algorithm (i.e. Algorithm 3 using the BtN,PaRISB_{t}^{N,\mathrm{PaRIS}} kernels). Algorithm 5 provides a concrete description of the resulting procedure, using the bootstrap particle filter. At each time tt, let τtn,PaRIS\tau_{t}^{n,\mathrm{PaRIS}} be the number of rejection trials required to sample from BtN,FFBS(n,dm)B_{t}^{N,\mathrm{FFBS}}(n,\mathrm{d}m). We then have

(16) τtn,PaRIS | t1,XtnGeo(iWt1imt(Xt1i,Xtn)M¯h)\tau_{t}^{n,\mathrm{PaRIS}}\textrm{ }|\textrm{ }\mathcal{F}_{t-1},X_{t}^{n}\sim\operatorname{Geo}\left(\frac{\sum_{i}W_{t-1}^{i}m_{t}(X_{t-1}^{i},X_{t}^{n})}{\bar{M}_{h}}\right)

with M¯h\bar{M}_{h} defined in Assumption 4.

Input: Particles Xt11:NX_{t-1}^{1:N}; weights Wt11:NW_{t-1}^{1:N}; vector St1NS_{t-1}^{N} in N\mathbb{R}^{N}; pre-initialised sampler for (Wt11:N)\mathcal{M}(W_{t-1}^{1:N}); function ψt\psi_{t} (cf. (9)); user-specified parameter N~\tilde{N}
for n1n\leftarrow 1 to NN  do
       Atn(Wt11:N)A_{t}^{n}\sim\mathcal{M}(W_{t-1}^{1:N}) ()(\star)
       XtnMt(Xt1Atn,dxt)X_{t}^{n}\sim M_{t}(X_{t-1}^{A_{t}^{n}},\mathrm{d}x_{t})
       Simulate Jtn,1:N~i.i.d.BtN,FFBS(n,dn)J_{t}^{n,1:\tilde{N}}\overset{\textrm{i.i.d.}}{\sim}B_{t}^{N,\mathrm{FFBS}}(n,\mathrm{d}n^{\prime}) using either the pure rejection sampler (Algorithm 4) or the hybrid rejection sampler (Algorithm 6)
       StN[n]N~1n~=1N~{St1N[Jtn,n~]+ψt(Xt1Jtn,n~,Xtn)}S_{t}^{N}[n]\leftarrow\tilde{N}^{-1}\sum_{\tilde{n}=1}^{\tilde{N}}\left\{S_{t-1}^{N}[J_{t}^{n,\tilde{n}}]+\psi_{t}(X_{t-1}^{J_{t}^{n,\tilde{n}}},X_{t}^{n})\right\}
      
for n1n\leftarrow 1 to NN do
       WtnGt(Xtn)/iGt(Xti)W_{t}^{n}\leftarrow G_{t}(X_{t}^{n})/\sum_{i}G_{t}(X_{t}^{i})
      
μtNn=1NWtnStN(n)\mu_{t}^{N}\leftarrow\sum_{n=1}^{N}W_{t}^{n}S_{t}^{N}(n)
Initialise a sampler for (Wt1:N)\mathcal{M}(W_{t}^{1:N})
Output: Estimate μtN\mu_{t}^{N} of 𝔼[φ(X0:t)|Y0:t]\mathbb{E}\left[\left.{\varphi(X_{0:t})}\right|{Y_{0:t}}\right]; particles Xt1:NX_{t}^{1:N}; weights Wt1:NW_{t}^{1:N}; vector StNS_{t}^{N} in N\mathbb{R}^{N} and pre-initialised sampler (Wt1:N)\mathcal{M}(W_{t}^{1:N}) for the next iteration
Algorithm 5 Concrete implementation of PaRIS algorithm (i.e. Algorithm 3 with the BtN,PaRISB_{t}^{N,\mathrm{PaRIS}} backward kernel) using the bootstrap particle filter

By exchangeability of the particles, the expected cost of the PaRIS algorithm at step $t$ is proportional to $N\tilde{N}\,\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]$, where $\tilde{N}$ is a fixed user-chosen parameter. Occasionally, $X_{t}^{1}$ falls into an unlikely region of $\mathbb{R}^{d_{t}}$ and the acceptance rate becomes low. In other words, $\tau_{t}^{1,\mathrm{PaRIS}}$ is a mixture of geometric distributions, some components of which might have a large expectation. Unfortunately, these inefficiencies add up and produce an execution time with unbounded expectation, as shown in the following proposition.

Proposition 2.

Under Assumptions 4 and 5, the version of Algorithm 5 using the pure rejection sampler satisfies 𝔼[τt1,PaRIS]=\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]=\infty, where τt1,PaRIS\tau_{t}^{1,\mathrm{PaRIS}} is defined in (16).

Proof.

We have

𝔼[τt1,PaRIS]\displaystyle\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}] =M¯h𝔼[1nmt(Xt1n,Xt1)Wt1n]via (16)\displaystyle=\bar{M}_{h}\mathbb{E}\left[\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},X_{t}^{1})W_{t-1}^{n}}\right]\quad\textrm{via \eqref{eq:dist_tau_N}}
=M¯h𝔼[𝔼[1nmt(Xt1n,Xt1)Wt1n|t1]]\displaystyle=\bar{M}_{h}\mathbb{E}\left[\mathbb{E}\left[\left.{\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},X_{t}^{1})W_{t-1}^{n}}}\right|{\mathcal{F}_{t-1}}\right]\right]
=M¯h𝔼[𝒳t1nmt(Xt1n,x)Wt1n(mt(Xt1n,x)Wt1n)λt(dx)]\displaystyle=\bar{M}_{h}\mathbb{E}\left[\int_{\mathcal{X}_{t}}\frac{1}{\sum_{n}m_{t}(X_{t-1}^{n},x)W_{t-1}^{n}}\left(\sum m_{t}(X_{t-1}^{n},x)W_{t-1}^{n}\right)\lambda_{t}(\mathrm{d}x)\right]
=M¯h𝔼[𝒳t1×λt(dx)]=by Assumption 5.\displaystyle=\bar{M}_{h}\mathbb{E}\left[\int_{\mathcal{X}_{t}}1\times\lambda_{t}(\mathrm{d}x)\right]=\infty\quad\textrm{by Assumption~{}\ref{asp:space}}.

In highly parallel computing architectures, each processor only handles one or a small number of particles. As such, the heavy-tailed nature of the execution time means that a few machines might prevent the whole system from moving forward. In any computing architecture, an execution time without a finite expectation is essentially unpredictable. A common practice to estimate execution time is to run the algorithm with a small number $N$ of particles, then extrapolate to the $N_{\mathrm{final}}$ of the definitive run. However, as $\mathbb{E}[\tau_{t}^{1,\mathrm{PaRIS}}]$ is infinite for any $N$, it is unclear what kind of information such preliminary runs provide. In Supplement B, besides studying the execution time of rejection-based implementations of the FFBS algorithm, we delve deeper into the difference between the non-parallel and parallel settings.
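To illustrate Proposition 2 numerically, here is a small simulation sketch under a toy one-dimensional Gaussian model of our own choosing (uniform weights, random-walk transition); the running mean of the simulated trial counts does not stabilise, which is the practical signature of an infinite expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma, reps = 100, 0.3, 10**5
m_bar = 1.0 / (sigma * np.sqrt(2 * np.pi))

def r_N(x, particles):
    """Particle approximation r_t^N(x) of the predictive density (uniform weights)."""
    return m_bar * np.mean(np.exp(-0.5 * ((x - particles) / sigma) ** 2))

taus = np.empty(reps)
for k in range(reps):
    particles = rng.standard_normal(N)                    # toy 'filtering' particles at t-1
    x_t = particles[rng.integers(N)] + sigma * rng.standard_normal()
    taus[k] = rng.geometric(r_N(x_t, particles) / m_bar)  # number of trials, cf. (16)

# The running mean typically keeps jumping upward as more repetitions are added,
# rather than converging, in line with the infinite expectation of Proposition 2.
print(taus[: reps // 100].mean(), taus.mean())
```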

From the proof of Proposition 2, it is clear that the quantity nWt1nmt(Xt1n,xt)\sum_{n}W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t}) will play a key role in the upcoming developments. We thus define it formally.

Definition 1.

The true predictive density function rtr_{t} and its approximation rtNr_{t}^{N} are defined as

r_{t}(x_{t}) := \frac{(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})}{\lambda_{t}(\mathrm{d}x_{t})}
r_{t}^{N}(x_{t}) := \sum_{n}W_{t-1}^{n}\,m_{t}(X_{t-1}^{n},x_{t})

where the first equation is understood in the sense of the Radon-Nikodym derivative and the density $m_{t}(x_{t-1},x_{t})$ is defined with respect to the dominating measure $\lambda_{t}(\mathrm{d}x_{t})$ on $\mathcal{X}_{t}$ (cf. (2)).

3.2. Hybrid rejection sampling

To address the aforementioned issues of the pure rejection sampling procedure, we propose a hybrid rejection sampling scheme. The basic observation is that a single direct simulation (e.g. via CDF inversion) from $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ costs $\mathcal{O}(N)$. Thus, once $K=\mathcal{O}(N)$ rejection sampling trials have been attempted, one should switch to direct simulation instead; asymptotically, it does not make sense to switch after $K$ trials if $K$ grows either much more slowly or much faster than $N$. The validity of this method is established in the following proposition, where we actually allow $K$ to depend on the trials drawn so far. The proof, which is not an immediate consequence of the validity of ordinary rejection sampling, is given in Supplement E.4.

Proposition 3.

Let μ0(x)\mu_{0}(x) and μ1(x)\mu_{1}(x) be two probability densities defined on some measurable space 𝒳\mathcal{X} with respect to a dominating measure λ(dx)\lambda(\mathrm{d}x). Suppose that there exists C>0C>0 such that μ1(x)Cμ0(x)\mu_{1}(x)\leq C\mu_{0}(x). Let (X1,U1),(X2,U2),(X_{1},U_{1}),(X_{2},U_{2}),\ldots be a sequence of i.i.d. random variables distributed according to μ0Unif[0,1]\mu_{0}\otimes\operatorname{Unif}[0,1] and let Xμ1X^{*}\sim\mu_{1} be independent of that sequence. Put

K:=inf{n1 such that Unμ1(Xn)Cμ0(Xn)}K^{*}:=\inf\left\{n\in\mathbb{Z}_{\geq 1}\textrm{ such that }U_{n}\leq\frac{\mu_{1}(X_{n})}{C\mu_{0}(X_{n})}\right\}

and let KK be any stopping time with respect to the natural filtration associated with the sequence {(Xn,Un)}n=1\left\{(X_{n},U_{n})\right\}_{n=1}^{\infty}. Let ZZ be defined as XKX_{K^{*}} if KKK^{*}\leq K and XX^{*} otherwise. Then ZZ is μ1\mu_{1}-distributed.

Proposition 3 thus allows users to pick $K=\alpha N$, where $\alpha>0$ may be chosen adaptively based on earlier trials. In the following, we only consider the simple rule $K=N$, which induces no loss of generality in terms of the asymptotic behaviour and is easy to implement. The resulting iteration is described in Algorithm 6.

Input: Particles Xt11:NX_{t-1}^{1:N} and weights Wt11:NW_{t-1}^{1:N} at time t1t-1; particle XtitX_{t}^{i_{t}} at time tt; constant M¯h\bar{M}_{h}; pre-initialised 𝒪(1)\mathcal{O}(1) sampler for (Wt11:N)\mathcal{M}(W_{t-1}^{1:N})
acceptedFalseaccepted\leftarrow\operatorname{False}
for i1i\leftarrow 1 to NN do
       t1(Wt11:N)\mathcal{I}_{t-1}\sim\mathcal{M}(W_{t-1}^{1:N}) using the pre-initialised 𝒪(1)\mathcal{O}(1) sampler
       UUnif[0,1]U\sim\operatorname{Unif}[0,1]
       if Umt(Xt1t1,Xtit)/M¯hU\leq m_{t}(X_{t-1}^{\mathcal{I}_{t-1}},X_{t}^{i_{t}})/\bar{M}_{h} then
             acceptedTrueaccepted\leftarrow\operatorname{True}
             break
            
      
if not acceptedaccepted then
       $\mathcal{I}_{t-1}\sim\mathcal{M}\bigl((W_{t-1}^{n}m_{t}(X_{t-1}^{n},X_{t}^{i_{t}}))_{n=1}^{N}\bigr)$, with weights normalised over $n$ (i.e. exact $\mathcal{O}(N)$ sampling from $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$)
      
Output: t1\mathcal{I}_{t-1}, which is distributed according to BtN,FFBS(it,dit1)B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1}).
Algorithm 6 Hybrid rejection sampler for simulating from BtN,FFBS(it,dit1)B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})
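A minimal Python sketch of Algorithm 6, reusing the toy transition density and the helper names of the pure rejection sketch above (these names are illustrative and not part of the particles package):

```python
import numpy as np

def hybrid_rejection_backward(x_prev, w_prev, x_t, m_bar, m_t, sample_proposal, rng):
    """Algorithm 6: at most N rejection trials, then an exact O(N) fallback.
    w_prev: normalised weights W_{t-1}^{1:N}; other arguments as in the
    pure rejection sketch."""
    N = len(w_prev)
    for _ in range(N):                                  # K = N trials, as in the text
        i = sample_proposal(rng)                        # O(1) draw from M(W_{t-1}^{1:N})
        if rng.random() <= m_t(x_prev[i], x_t) / m_bar:
            return i
    # Fallback: exact sampling from the FFBS backward kernel at O(N) cost.
    probs = w_prev * np.array([m_t(xp, x_t) for xp in x_prev])
    return rng.choice(N, p=probs / probs.sum())
```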

When applied in the context of Algorithm 5, Algorithm 6 gives a smoother of expected complexity proportional to

NN~𝔼[min(τt1,PaRIS,N)]N\tilde{N}\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]

at time $t$, where $\tau_{t}^{1,\mathrm{PaRIS}}$ is defined in (16). This quantity is no longer infinite, but its growth as $N\to\infty$ might depend on the model. Still, in all cases, the resulting complexity grows faster than $N$ but more slowly than $N^{2}$. Perhaps more surprisingly, in linear Gaussian models (see Supplement A.1 for detailed notations), the smoother has near-linear complexity (up to logarithmic factors). The following two theorems formalise these claims.

Assumption 6.

The predictive density rtr_{t} of XtX_{t} given Y0:t1Y_{0:t-1} and the potential function GtG_{t} are continuous functions on dt\mathbb{R}^{d_{t}}. The transition density mt(xt1,xt)m_{t}(x_{t-1},x_{t}) is a continuous function on dt1×dt\mathbb{R}^{d_{t-1}}\times\mathbb{R}^{d_{t}}.

Theorem 3.1.

Under Assumptions 1, 4, 5 and 6, the version of Algorithm 5 using the hybrid rejection sampler (Algorithm 6) satisfies limN𝔼[min(τt1,PaRIS,N)]=\lim_{N\to\infty}\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]=\infty and limN𝔼[min(τt1,PaRIS,N)]/N=0\lim_{N\to\infty}{\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]}/{N}=0, where τt1,PaRIS\tau_{t}^{1,\mathrm{PaRIS}} is defined in (16).

Theorem 3.2.

We assume the same setting as Theorem 3.1. In linear Gaussian state space models (Supplement A.1), we have 𝔼[min(τt1,PaRIS,N)]=𝒪((logN)dt/2)\mathbb{E}[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)]=\mathcal{O}((\log N)^{d_{t}/2}).

While Proposition 2 shows that τt1,PaRIS\tau_{t}^{1,\mathrm{PaRIS}} has infinite expectation, Theorem 3.2 implies that its NN-thresholded version only displays a slowly increasing mean. To give a very rough intuition on the phenomenon, consider X𝒩(0,1)X\sim\mathcal{N}(0,1). Then

\mathbb{E}\left[e^{X^{2}/2}\right]=\int_{\mathbb{R}}e^{x^{2}/2}\frac{e^{-x^{2}/2}}{\sqrt{2\pi}}\,\mathrm{d}x=+\infty

whereas

(17) 𝔼[min(eX2/2,N)]=min(ex2/2,N)12πex2/2dx=|x|2logN12πdx+N|x|>2logN12πex2/2dx4logNπ+1πlogN\begin{split}\mathbb{E}\left[\min(e^{X^{2}/2},N)\right]&=\int_{\mathbb{R}}\min(e^{x^{2}/2},N)\frac{1}{\sqrt{2\pi}}e^{-{x^{2}/2}}\mathrm{d}x\\ &=\int_{\left|x\right|\leq\sqrt{2\log N}}\frac{1}{\sqrt{2\pi}}\mathrm{d}x+N\int_{\left|x\right|>\sqrt{2\log N}}\frac{1}{\sqrt{2\pi}}e^{-{x^{2}/2}}\mathrm{d}x\\ &\leq\sqrt{\frac{4\log N}{\pi}}+\frac{1}{\sqrt{\pi\log N}}\end{split}

using the bound (X>x)ex2/2x2π\mathbb{P}(X>x)\leq\frac{e^{-x^{2}/2}}{x\sqrt{2\pi}} for x>0x>0. The main technical difficulty of the proof of Theorem 3.2 (see Supplement E.6) is to perform this kind of argument under the error induced by the finite sample size particle approximation. In the language of this oversimplified example, we want (17) to hold when XX does not follow 𝒩(0,1)\mathcal{N}(0,1) any more, but only an NN-dependent approximation of it.
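As a quick sanity check of this intuition, the following snippet (a small illustration we add here, not part of the original experiments) compares a Monte Carlo estimate of $\mathbb{E}[\min(e^{X^{2}/2},N)]$ with the right-hand side of (17) for a few values of $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**7)
for N in [10**2, 10**4, 10**6]:
    mc = np.mean(np.minimum(np.exp(x**2 / 2), N))    # Monte Carlo estimate of E[min(e^{X^2/2}, N)]
    bound = np.sqrt(4 * np.log(N) / np.pi) + 1 / np.sqrt(np.pi * np.log(N))
    print(N, mc, bound)   # the estimate grows slowly with N, in line with the bound in (17)
```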

4. Efficient backward kernels

4.1. MCMC Backward Kernels

This subsection analyses and extends the MCMC backward kernel defined in Example 3. As we remarked there, the matrix B^tN,IMH\hat{B}_{t}^{N,\mathrm{IMH}} is not sparse and even has some expensive-to-evaluate entries. We thus reserve it for use in the off-line smoother (Algorithm 2) whereas in the on-line scenario (Algorithm 3), we use its PaRIS-like counterpart

(18) B^tN,IMHP[it,it1]:=1N~n~=1N~𝟙{it1=J~tit,n~}\hat{B}_{t}^{N,\mathrm{IMHP}}[i_{t},i_{t-1}]:=\frac{1}{\tilde{N}}\sum_{\tilde{n}=1}^{\tilde{N}}\mathbbm{1}\left\{i_{t-1}=\tilde{J}_{t}^{i_{t},\tilde{n}}\right\}

where $\tilde{J}_{t}^{i_{t},1:\tilde{N}}$ is an independent Metropolis-Hastings chain started at $\tilde{J}_{t}^{i_{t},1}:=A_{t}^{i_{t}}$, targeting the measure $B_{t}^{N,\mathrm{FFBS}}(i_{t},\mathrm{d}i_{t-1})$ and using the proposal distribution $\mathcal{M}(W_{t-1}^{1:N})$. Thus, the parameter $\tilde{N}$ signifies that $\tilde{N}-1$ MCMC steps are applied to $A_{t}^{i_{t}}$, and we use the same convention for the kernel $B_{t}^{N,\mathrm{IMH}}$. In both cases, the complexity of the corresponding algorithms is $\mathcal{O}((\tilde{N}-1)N)$, which is $\mathcal{O}(N)$ as long as $\tilde{N}$ remains fixed as $N\to\infty$.
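For concreteness, a single independent Metropolis-Hastings step of this kind can be sketched as follows; `m_t` and `sample_proposal` are hypothetical helper names used only for illustration. Since the target is proportional to $W_{t-1}^{j}\,m_{t}(X_{t-1}^{j},X_{t}^{i_{t}})$ and the proposal is $\mathcal{M}(W_{t-1}^{1:N})$, the acceptance ratio reduces to a ratio of transition densities.

```python
import numpy as np

def imh_backward_step(j_current, x_prev, x_t, m_t, sample_proposal, rng):
    """One independent MH step keeping B_t^{N,FFBS}(i_t, .) invariant.
    j_current: current index; x_prev: particles at time t-1; x_t: X_t^{i_t};
    sample_proposal(rng): draw from M(W_{t-1}^{1:N})."""
    j_star = sample_proposal(rng)
    ratio = m_t(x_prev[j_star], x_t) / m_t(x_prev[j_current], x_t)
    return j_star if rng.random() <= ratio else j_current

# In the on-line kernel B_t^{N,IMHP} with N_tilde = 2, the chain is started at the
# filtering ancestor A_t^{i_t} and a single such step is applied.
```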

The validity and the stability of $\hat{B}_{t}^{N,\mathrm{IMH}}$ and $\hat{B}_{t}^{N,\mathrm{IMHP}}$ are established in the following proposition (proved in Supplement E.9). For simplicity, only the case $\tilde{N}=2$ is examined, but the proposition remains true for any $\tilde{N}\geq 2$.

Proposition 4.

The kernels BtN,IMHB_{t}^{N,\mathrm{IMH}} and BtN,IMHPB_{t}^{N,\mathrm{IMHP}} with N~=2\tilde{N}=2 satisfy the hypotheses of Theorem 2.1 and, under Assumptions 2 and 3, those of Theorem 2.2. Hence, their respective uses in Algorithms 2 and 3 guarantee a convergent and stable smoother.

From a theoretical viewpoint, Proposition 4 is the first result establishing the stability for the use of MCMC moves inside backward sampling. It relies on technical innovations that we have explained in Section 2.6, in particular after the statement of Theorem 2.2.

From a practical viewpoint, the advantages of independent Metropolis-Hastings kernels over the rejection samplers of Section 3 are that no explicit bound $\bar{M}_{h}$ needs to be specified and that the execution time is deterministic and $\mathcal{O}(N)$. In practice, we observe that the MCMC smoothers are usually 10 to 20 times faster than their rejection sampling-based counterparts (see e.g. Figure 4) while producing essentially the same sample quality. Finally, it is not hard to imagine situations where a proposal smarter than $\mathcal{M}(W_{t-1}^{1:N})$ would be beneficial; we nevertheless only consider that one here, mainly because it already performs satisfactorily in our numerical examples.

4.2. Dealing with intractable transition densities

4.2.1. Intuition and formulation

The purpose of backward sampling is to re-generate, for each particle, a new ancestor that is different from that of the filtering step. However, backward sampling is infeasible if the transition density mt(xt1,xt)m_{t}(x_{t-1},x_{t}) cannot be calculated. To get around this, we modify the particle filter so that each particle might, in some sense, have two ancestors right from the forward pass.

Consider the standard PF (Algorithm 1). Among the NN resampled particles Xt1At1:NX_{t-1}^{A_{t}^{1:N}}, let us track two of them, say xt1x_{t-1} and xt1x_{t-1}^{\prime} for simplicity. The move step of Algorithm 1 will push them through MtM_{t} using independent noises, resulting in xtx_{t} and xtx_{t}^{\prime} (that is, given xt1x_{t-1} and xt1x_{t-1}^{\prime}, we have xtMt(xt1,)x_{t}\sim M_{t}(x_{t-1},\cdot) and xtMt(xt1,)x_{t}^{\prime}\sim M_{t}(x_{t-1}^{\prime},\cdot) such that xtx_{t} and xtx_{t}^{\prime} are independent). Thus, for e.g. linear Gaussian models, we have (xt=xt)=0\mathbb{P}(x_{t}=x_{t}^{\prime})=0. However, if the two simulations xtMt(xt1,)x_{t}\sim M_{t}(x_{t-1},\cdot) and xtMt(xt1,)x_{t}^{\prime}\sim M_{t}(x_{t-1}^{\prime},\cdot) are done with specifically correlated noises, it can happen that (xt=xt)>0\mathbb{P}(x_{t}=x_{t}^{\prime})>0. The joint distribution (xt,xt)(x_{t},x_{t}^{\prime}) given (xt1,xt1)(x_{t-1},x_{t-1}^{\prime}) is called a coupling of Mt(xt1,)M_{t}(x_{t-1},\cdot) and Mt(xt1,)M_{t}(x_{t-1}^{\prime},\cdot); the event xt=xtx_{t}=x_{t}^{\prime} is called the meeting event and we say that the coupling is successful when it occurs. In that case, the particle xtx_{t} automatically has two ancestors xt1x_{t-1} and xt1x_{t-1}^{\prime} at time t1t-1 without needing any backward sampling.
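To make the meeting event concrete, here is a minimal sketch of a maximal coupling of $M_{t}(x_{t-1},\cdot)$ and $M_{t}(x_{t-1}^{\prime},\cdot)$ for the toy choice $M_{t}(x,\cdot)=\mathcal{N}(x,\sigma^{2})$ in one dimension (an assumption made purely for illustration; the couplings actually used for the models of Section 5 are described in Supplement D.2.1).

```python
import numpy as np

def coupled_move(x1, x2, sigma, rng):
    """Maximal ('gamma') coupling of M_t(x1, .) and M_t(x2, .) with
    M_t(x, .) = N(x, sigma^2). Returns (X_t^{n,1}, X_t^{n,2}); the meeting
    event X_t^{n,1} = X_t^{n,2} occurs with the maximal probability
    1 - TV(M_t(x1, .), M_t(x2, .))."""
    pdf = lambda y, mu: np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    xt1 = x1 + sigma * rng.standard_normal()
    if rng.uniform(0.0, pdf(xt1, x1)) <= pdf(xt1, x2):
        return xt1, xt1                 # meeting: the particle gets two ancestors
    while True:
        y = x2 + sigma * rng.standard_normal()
        if rng.uniform(0.0, pdf(y, x2)) > pdf(y, x1):
            return xt1, y
```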

The precise formulation of the modified forward pass is detailed in Algorithm 7. It consists of building in an on-line manner the backward kernels BtN,ITRB_{t}^{N,\mathrm{ITR}} (where ITR stands for “intractable”). The main interest of this algorithm lies in the fact that while the function mtm_{t} may prove impossible to evaluate, it is usually possible to make xtx_{t} and xtx_{t}^{\prime} meet by correlating somehow the random numbers used in their simulations. One typical example which this article focuses on is the coupling of continuous-time processes, but it is useful to keep in mind that Algorithm 7 is conceptually more general than that.

Input: Feynman-Kac model (1), particles Xt11:NX_{t-1}^{1:N} and weights Wt11:NW_{t-1}^{1:N}
that approximate the filtering distribution at time t1t-1
for n1n\leftarrow 1 to NN do
       Resample. Simulate (Atn,1,Atn,2)(A_{t}^{n,1},A_{t}^{n,2}) such that marginally each component is distributed according to (Wt11:N)\mathcal{M}(W_{t-1}^{1:N})
       Move. Simulate (Xtn,1,Xtn,2)(X_{t}^{n,1},X_{t}^{n,2}) such that marginally the two components are distributed respectively according to Mt(Xt1Atn,1,dxt)M_{t}(X_{t-1}^{A_{t}^{n,1}},\mathrm{d}x_{t}) and Mt(Xt1Atn,2,dxt)M_{t}(X_{t-1}^{A_{t}^{n,2}},\mathrm{d}x_{t})
       Choose LUniform({1,2})L\sim\operatorname{Uniform}(\{1,2\})
       Set XtnXtn,LX_{t}^{n}\leftarrow X_{t}^{n,L}
       Calculate backward kernel.
       if Xtn,1=Xtn,2X_{t}^{n,1}=X_{t}^{n,2} then
             BtN,ITR(n,dit1)(δ{Atn,1}+δ{Atn,2})/2B_{t}^{N,\mathrm{ITR}}(n,\mathrm{d}i_{t-1})\leftarrow\left(\delta\left\{A_{t}^{n,1}\right\}+\delta\left\{A_{t}^{n,2}\right\}\right)/2
            
      else
             BtN,ITR(n,dit1)δ{Atn,L}B_{t}^{N,\mathrm{ITR}}(n,\mathrm{d}i_{t-1})\leftarrow\delta\left\{{A_{t}^{n,L}}\right\}
            
      
Reweight. Set ωtnGt(Xtn)\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n}) for n=1,2,,Nn=1,2,\ldots,N
Set tNn=1Nωtn/N\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N
Set Wtnωtn/NtNW_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N} for n=1,2,,Nn=1,2,\ldots,N
Output: Particles Xt1:NX_{t}^{1:N} and weights Wt1:NW_{t}^{1:N} that approximate the filtering distribution at time tt; backward kernel BtN,ITRB_{t}^{N,\mathrm{ITR}} that can be used in either Algorithm 2 or 3
Algorithm 7 Modified forward pass for smoothing of intractable models (one time step)

4.2.2. Validity and stability

The consistency of Algorithm 7 follows straightforwardly from Theorem 2.1. To obtain a stable procedure, however, some conditions must be imposed on the couplings $(A_{t}^{n,1},A_{t}^{n,2})$ and $(X_{t}^{n,1},X_{t}^{n,2})$. We want $A_{t}^{n,1}$ to differ from $A_{t}^{n,2}$ as frequently as possible. At the same time, we aim for a coupling of the two distributions $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(X_{t-1}^{A_{t}^{n,2}},\cdot)$ with a high success rate, so as to maximise the probability that $X_{t}^{n,1}=X_{t}^{n,2}$.

Assumption 7.

There exists an εA>0\varepsilon_{A}>0 such that (Atn,1Atn,2|Xt11:N)εA.\mathbb{P}(A_{t}^{n,1}\neq A_{t}^{n,2}|X_{t-1}^{1:N})\geq\varepsilon_{A}.

Assumption 8.

There exists an εD>0\varepsilon_{D}>0 such that

\mathbb{P}(X_{t}^{n,2}=X_{t}^{n,1}\,|\,X_{t-1}^{1:N},A_{t}^{n,1},A_{t}^{n,2},X_{t}^{n,1})\geq\varepsilon_{D}\left(1\land\frac{m_{t}(X_{t-1}^{A_{t}^{n,2}},X_{t}^{n,1})}{m_{t}(X_{t-1}^{A_{t}^{n,1}},X_{t}^{n,1})}\right).

The letters A and D in $\varepsilon_{A}$ and $\varepsilon_{D}$ stand for “ancestors” and “dynamics”. Assumption 8 means that the user-chosen coupling of $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(X_{t-1}^{A_{t}^{n,2}},\cdot)$ must be at least $\varepsilon_{D}$ times as efficient as their maximal coupling. For details on this interpretation, see Proposition 10 in the Supplement. In Lemma E.12, we also show that, in spite of its appearance, Assumption 8 is actually symmetric with regard to $X_{t}^{n,1}$ and $X_{t}^{n,2}$.

We are now ready to state the main theorem of this subsection (see Supplement E.11 for a proof).

Theorem 4.1.

The kernels BtN,ITRB_{t}^{N,\mathrm{ITR}} generated by Algorithm 7 satisfy the hypotheses of Theorem 2.1. Thus, under Assumption 1, Algorithm 7 provides a consistent smoothing estimate. If, in addition, the Feynman-Kac model (1) satisfies Assumptions 2 and 3 and the user-chosen couplings satisfy Assumptions 7 and 8, the kernels BtN,ITRB_{t}^{N,\mathrm{ITR}} also fulfil (13) and the smoothing estimates generated by Algorithm 7 are stable.

4.2.3. Good ancestor couplings

It is notable that Assumption 7 only considers the event Atn,1Atn,2A_{t}^{n,1}\neq A_{t}^{n,2}, which is a pure index condition that does not take into account the underlying particles Xt1Atn,1X_{t-1}^{A_{t}^{n,1}} and Xt1Atn,2X_{t-1}^{A_{t}^{n,2}}. Indeed, if smoothing algorithms prevent degeneracy by creating multiple ancestors for a particle, we would expect that their separation (i.e. that they are far away in the state space 𝒳t1\mathcal{X}_{t-1}, e.g. d\mathbb{R}^{d}) is critical to the performance. Surprisingly, it is unnecessary: two very close particles (in d\mathbb{R}^{d}) at time t1t-1 may have ancestors far away at time t2t-2 thanks to the mixing of the model.

We advise choosing an ancestor coupling $(A_{t}^{n,1},A_{t}^{n,2})$ such that the distance between $X_{t-1}^{A_{t}^{n,1}}$ and $X_{t-1}^{A_{t}^{n,2}}$ is small. It is then easier to design a dynamic coupling of $M_{t}(X_{t-1}^{A_{t}^{n,1}},\cdot)$ and $M_{t}(X_{t-1}^{A_{t}^{n,2}},\cdot)$ with a high success rate. Furthermore, simulating the dynamic coupling from two close rather than far-away starting points can also take less time when, for instance, the dynamics involve multiple intermediate steps but the two processes couple early. One way to achieve an ancestor coupling with the aforementioned property is to first simulate $A_{t}^{n,1}\sim\mathcal{M}(W_{t-1}^{1:N})$, then move $A_{t}^{n,1}$ through an MCMC step which keeps $\mathcal{M}(W_{t-1}^{1:N})$ invariant, and set the result to $A_{t}^{n,2}$. It suffices to use a proposal that looks at indices whose underlying particles are close (in $\mathbb{R}^{d}$) to $X_{t-1}^{A_{t}^{n,1}}$; a minimal sketch is given below. Finding nearby particles is efficient if they are first sorted using the Hilbert curve, hashed using locality-sensitive hashing, or stored in a KD-tree (see Samet,, 2006, for a comprehensive review). In the context of particle filters, such techniques have been studied for different purposes in Gerber and Chopin, (2015), Jacob et al., (2019) and Sen et al., (2018).
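The following sketch illustrates one such construction in a simplified setting of our own: particles are sorted along a one-dimensional key (standing in for a Hilbert-curve ordering), and $A_{t}^{n,2}$ is obtained from $A_{t}^{n,1}$ by a single Metropolis-Hastings step whose cyclic $\pm K$-neighbour proposal is symmetric, so that $\mathcal{M}(W_{t-1}^{1:N})$ is kept invariant.

```python
import numpy as np

def neighbour_ancestor_coupling(a1, weights, order, rank, K, rng):
    """One MH step keeping M(W_{t-1}^{1:N}) invariant, proposing an index whose
    particle is close to X_{t-1}^{a1} in a pre-computed ordering. The cyclic
    +/-K neighbourhood proposal is symmetric, so the acceptance probability is
    simply min(1, W^{j} / W^{a1}). Note that a rejection returns a1 itself, in
    which case the two ancestors coincide (allowed under Assumption 7)."""
    N = len(weights)
    offset = int(rng.integers(1, K + 1)) * (1 if rng.random() < 0.5 else -1)
    j = order[(rank[a1] + offset) % N]
    return j if rng.random() <= min(1.0, weights[j] / weights[a1]) else a1

# Pre-computation, assuming 1-d particles x_prev and normalised weights w_prev:
# order = np.argsort(x_prev)
# rank = np.empty_like(order); rank[order] = np.arange(len(order))
```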

4.2.4. Conditionally-correlated version

In Algorithm 7, the ancestor pairs (Atn,1,Atn,2)n=1N(A_{t}^{n,1},A_{t}^{n,2})_{n=1}^{N} are conditionally independent given t\mathcal{F}_{t}^{-} and the same holds for the particles (Xtn)n=1N(X_{t}^{n})_{n=1}^{N}. These conditional independences allow easier theoretical analysis, in particular, the casting of Algorithm 7 in the framework of Theorems 2.1 and 2.2. However, they are not optimal for performance in two important ways: (a) they do not allow keeping both Xtn,1X_{t}^{n,1} and Xtn,2X_{t}^{n,2} when the two are not equal, and (b) the set of ancestor variables (Atn,1)n=1N(A_{t}^{n,1})_{n=1}^{N} is multinomially resampled from {1,2,,N}\left\{1,2,\ldots,N\right\} with weights Wt11:NW_{t-1}^{1:N}. We know that multinomial resampling is not the ideal scheme, see Supplement C.1 for discussion.

Consequently, in practice, we shall allow ourselves to break free from conditional independence. The resulting procedure is described in Algorithm 9 (Supplement C). Despite a lack of rigorous theoretical support, this is the algorithm that we will use in Section 5 since it enjoys better performance and it constitutes a fair comparison with practical implementations of the standard particle filter, which are mostly based on alternative resampling schemes.

5. Numerical experiments

5.1. Linear Gaussian state-space models

Linear Gaussian models constitute a particular class of state space models, characterised by Gaussian Markov dynamics and observations that are projections of the hidden states plus Gaussian noise. Supplement A.1 defines the notations that we use here for the different components of these models. In this section, we consider an instance described in Guarniero et al., (2017), where the matrix $F_{X}$ satisfies $F_{X}[i,j]=\alpha^{1+|i-j|}$ for some $\alpha$. We consider the problem with $\operatorname{dim}_{X}=\operatorname{dim}_{Y}=2$, and the observations are noisy versions of the hidden states, with $C_{Y}$ being $\sigma_{Y}^{2}$ times the identity matrix of size $2$. Unless otherwise specified, we take $\alpha=0.4$ and $\sigma_{Y}^{2}=0.5$.
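For concreteness, the following snippet builds the observation-side matrices of this example; $F_{Y}$ is taken as the identity since the observations are noisy versions of the states, while $C_{X}$ and the law of $X_{0}$ are left unspecified here (see Supplement A.1 and Guarniero et al., (2017) for the full specification).

```python
import numpy as np

alpha, dim_x, sigma_y2 = 0.4, 2, 0.5
idx = np.arange(dim_x)
FX = alpha ** (1.0 + np.abs(idx[:, None] - idx[None, :]))   # F_X[i, j] = alpha^{1 + |i - j|}
FY = np.eye(dim_x)             # observations are noisy versions of the hidden states
CY = sigma_y2 * np.eye(dim_x)  # observation noise covariance, sigma_Y^2 * I_2
```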

In this section, we focus on the performance of different online smoothers based on genealogy tracking, pure/hybrid rejection sampling, or MCMC. Rejection-based online smoothing amounts to the PaRIS algorithm, for which we use $\tilde{N}=2$ in the $B_{t}^{N,\mathrm{PaRIS}}$ kernel. We take $T=3000$ and simulate the data from the model. The benchmark additive function is simply $\varphi_{t}(x_{0:t})=\sum_{s=0}^{t}x_{s}(0)$, where $x_{s}(0)$ is the first coordinate of the $\mathbb{R}^{2}$ vector $x_{s}=[x_{s}(0),x_{s}(1)]$. For a study of offline smoothers (including FFBS), see Supplement D.1. In all experiments, both here and in the supplement, we choose $N=1000$ and use systematic resampling for the forward particle filters (see Section C.1). Regarding MCMC smoothers, we employ the kernels $B_{t}^{N,\mathrm{IMH}}$ or $B_{t}^{N,\mathrm{IMHP}}$ with a single MCMC step. All results are based on $150$ independent runs.

Although our theoretical results are only proved for the bootstrap filter, we expect that some of them extend to other filters as well. Therefore, we also consider guided particle filters in the simulations. An introduction to this topic can be found in Chopin and Papaspiliopoulos, (2020, Chapter 10.3.2), where the expression for the optimal proposal is also provided. In linear Gaussian models, this proposal is fully tractable and is the one we use.

To present efficiently the combination of two different filters (bootstrap and guided) and four different algorithms (naive genealogy tracking, pure/hybrid rejection and MCMC) we use the following abbreviations: “B” for bootstrap, “G” for guided, “N” for naive genealogy tracking, “P” for pure rejection, “H” for hybrid rejection and “M” for MCMC. For instance, the algorithm referred to as “BM” uses the bootstrap filter for the forward pass and the MCMC backward kernels to perform smoothing. Furthermore, the letter “R” will refer to the rejection kernel whenever the distinction between pure rejection and hybrid rejection is not necessary. (Recall that the two rejection methods produce estimators with the same distribution.)

Figure 2 shows the squared interquartile range of the online smoothing estimates of $\mathbb{Q}_{t}(\varphi_{t})$ as a function of $t$. It corroborates the rates of Theorem 2.2, even though linear Gaussian models are not strongly mixing in the sense of Assumptions 2 and 3: the grid lines hint at a variance growth rate of $\mathcal{O}(T)$ for the MCMC and rejection-based smoothers and of $\mathcal{O}(T^{2})$ for the genealogy tracking ones. Unsurprisingly, guided filters perform better than bootstrap ones.

Figure 2. Squared interquartile range of the estimators t(φt)\mathbb{Q}_{t}(\varphi_{t}) with respect to tt, for different online smoothing algorithms. The model is linear Gaussian with parameters specified in section 5.1. See text for full explanation of the legend. For readability, the curves are down-sampled to 5050 points before being drawn.

Figure 3 shows box plots of the execution time (divided by $NT$) for different algorithms over $150$ runs. By execution time, we mean the number of Markov kernel transition density evaluations. We see that the bootstrap particle filter coupled with pure rejection sampling has a very heavy-tailed execution time. This behaviour is expected in light of Proposition 2. Using the guided particle filter seems to fare better, but Figure 4 (for the same model but with $\sigma_{Y}^{2}=2$) makes it clear that this cannot be relied on either. Overall, these results highlight two fundamental problems with pure rejection sampling: the computational time has heavy tails and it depends on the type of forward particle filter being used.

Figure 3. Box plots (based on 150150 runs) of averaged execution times (numbers of transition density evaluations divided by NTNT) for different algorithms on the linear Gaussian model of section 5.1. Left: original figure, right: zoomed-in version.
Figure 4. Same as Figure 3, but for the modified model where σY2=2\sigma_{Y}^{2}=2.

On the other hand, hybrid rejection sampling, despite having a random execution time in principle, displays a very consistent number of transition density evaluations across independent runs; it is thus safe to say that the algorithm has a virtually deterministic execution time. The catch is that the average computational load (around $16$ in Figure 3) cannot easily be calculated beforehand. In any case, it is much larger than the value $1$ of the MCMC smoothers (since only one MCMC step is performed in the kernel $B_{t}^{N,\mathrm{IMHP}}$), whereas the performance (Figure 2) is comparable.

The bottom line is that MCMC smoothers should be the default option, and one MCMC step seems to be enough. If for some reason one would like to use rejection-based methods, using hybrid rejection is a must.

5.2. Lotka-Volterra SDE

Lotka-Voleterra models (originated in Lotka,, 1925 and Volterra,, 1928) describe the population fluctuation of species due to natural birth and death as well as the consumption of one species by others. The emblematic case of two species is also known as the predator-prey model. In this subsection, we study the stochastic differential equation (SDE) version that appears in Hening and Nguyen, (2018). Let Xt=(Xt(0),Xt(1))X_{t}=(X_{t}(0),X_{t}(1)) represent respectively the populations of the prey and the predator at time tt and let us consider the dynamics

(19) {dXt(0)=[β0Xt(0)12τ0[Xt(0)]2τ1Xt(0)Xt(1)]dt+Xt(0)dEt(0)dXt(1)=[β1Xt(1)+τ1Xt(0)Xt(1)]dt+Xt(1)dEt(1)\left\{\begin{aligned} \mathrm{d}X_{t}(0)&=\bigl{[}\beta_{0}X_{t}(0)-\frac{1}{2}\tau_{0}[X_{t}(0)]^{2}&&-\tau_{1}X_{t}(0)X_{t}(1)\bigr{]}\mathrm{d}t+X_{t}(0)\mathrm{d}E_{t}(0)\\ \mathrm{d}X_{t}(1)&=\bigl{[}\qquad\quad-\beta_{1}X_{t}(1)&&+\tau_{1}X_{t}(0)X_{t}(1)\bigr{]}\mathrm{d}t+X_{t}(1)\mathrm{d}E_{t}(1)\end{aligned}\right.

where $E_{t}=\Gamma W_{t}$, with $W_{t}$ the standard Brownian motion in $\mathbb{R}^{2}$ and $\Gamma$ some $2\times 2$ matrix. The parameters $\beta_{0}$ and $\beta_{1}$ are the natural birth rate of the prey and the natural death rate of the predator. The predator interacts with (eats) the prey at rate $\tau_{1}$. The quantity $\tau_{0}$ encodes intra-species competition within the prey population. The factor $\frac{1}{2}$ in its parametrisation is chosen to line up with the Lotka-Volterra jump process in $\mathbb{Z}^{2}$, where the population sizes are integers and the interaction term becomes $\tau_{0}X_{t}(0)[X_{t}(0)-1]/2$.

The state space model is comprised of the process XtX_{t} and its noisy observations YtY_{t} recorded at integer times. The Markov dynamics cannot be simulated exactly, but can be approximated through (Euler) discretisation. Nevertheless, the Euler transition density mtE(xt1,xt)m_{t}^{\mathrm{E}}(x_{t-1},x_{t}) remains intractable (unless the step size is exactly 11). Thus, the algorithms presented in Subsection 4.2 are useful. The missing bit is a method to efficiently couple mtE(xt1,)m_{t}^{\mathrm{E}}(x_{t-1},\cdot) and mtE(xt1,)m_{t}^{\mathrm{E}}(x_{t-1}^{\prime},\cdot), which we carefully describe in Supplement D.2.1.
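To fix ideas, here is a minimal sketch of one Euler-Maruyama step of (19) with step size $\delta$; this is an illustration only, and the discretisation and the coupling actually used in the experiments are described in Supplement D.2.1.

```python
import numpy as np

def euler_step(x, delta, beta0, beta1, tau0, tau1, gamma, rng):
    """One Euler-Maruyama step of the Lotka-Volterra SDE (19), step size delta.
    x = (prey, predator); gamma is the 2x2 matrix such that E_t = gamma W_t."""
    prey, pred = x
    drift = np.array([
        beta0 * prey - 0.5 * tau0 * prey ** 2 - tau1 * prey * pred,
        -beta1 * pred + tau1 * prey * pred,
    ])
    noise = gamma @ rng.standard_normal(2) * np.sqrt(delta)  # increment of E_t over delta
    return x + drift * delta + np.array([prey, pred]) * noise
```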

We consider the model with τ0=1/800\tau_{0}=1/800, τ1=1/400\tau_{1}=1/400, β0=0.3125\beta_{0}=0.3125 and β1=0.25\beta_{1}=0.25. The matrix Γ\Gamma is such that the covariance matrix of E1E_{1} is [1/1001/2001/2001/100]\begin{bmatrix}1/100&1/200\\ 1/200&1/100\end{bmatrix}. The observations are recorded on the log scale with Gaussian error of covariance matrix [0.040.020.020.04]\begin{bmatrix}0.04&0.02\\ 0.02&0.04\end{bmatrix}. The distribution of X0X_{0} is two-dimensional normal with mean [100,100][100,100] and covariance matrix [1005050100]\begin{bmatrix}100&50\\ 50&100\end{bmatrix}. This choice is motivated by the fact that the preceding parameters give the stationary population vector [100,100][100,100]. According to Hening and Nguyen, (2018), they also guarantee that neither animal goes extinct almost surely as tt\to\infty.

By discretising (19) with time step $\delta=1$, one can get some very rough intuition on the dynamics. For instance, per unit of time, about $31$ prey are born. Approximately the same number die (to maintain equilibrium), of which $6$ die from intra-species competition and $25$ are eaten by the predator. The duration between two recorded observations corresponds more or less to one third of a prey generation and one quarter of a predator generation. The standard deviation of the variation due to environmental noise is about $10$ individuals per observation period, for each species.

Again, these intuitions are highly approximate. For readers wishing to get more familiar with the model, Supplement D.2.2 contains real plots of the states and the observations; as well as data on the performance of different smoothing algorithms for moderate values of TT. We now showcase the results obtained in a large scale problem where T=3000T=3000 and the data is simulated from the model.

We consider the additive function $\varphi_{t}(x_{0:t}):=\sum_{s=0}^{t}\left[x_{s}(0)-100\right]$. Figure 5 uses box plots to represent the distributions of the estimators of $\mathbb{Q}_{T}(\varphi_{T})$ obtained with either the genealogy tracking smoother (with systematic resampling; see Supplement C.1) or Algorithm 9. Our proposed smoother greatly reduces the variance, at a computational cost which is empirically $1.5$ to $2$ times that of the naive method. Since we use the Hilbert curve to design good ancestor couplings (see Section 4.2.3), the coupling of the dynamics succeeds about $80\%$ of the time. As discussed in the aforementioned section, starting two diffusion dynamics from nearby points makes them couple earlier, which reduces the subsequent computational load.

Figure 5. Box plot of estimators (over 5050 independent runs with N=1000N=1000 particles) for T(φT)\mathbb{Q}_{T}(\varphi_{T}) in the Lotka-Volterra SDE model with T=3000T=3000. They are calculated using either the naive genealogy tracking smoother or our smoother developed for intractable models (Algorithm 9).

Figure 6 plots with respect to tt the squared interquartile range of the two methods for the estimation of t(φt)\mathbb{Q}_{t}(\varphi_{t}). Grid lines hint at a quadratic growth for the genealogy tracking smoother (as analysed in Olsson and Westerborn,, 2017, Sect. 1) and a linear growth for the kernel BtN,ITRCB_{t}^{N,\mathrm{ITRC}} (as described in Theorem 2.2).

Figure 6. Squared interquartile range for the genealogy tracking smoother and our proposed one. Same context as in Figure 5.

Finally, Figure 20 (Supplement D.2.2) shows properties of the effective sample size (ESS) ratio for this model. In a nutshell, while it is globally stable (between $40\%$ and $70\%$), it occasionally drops close to $0$ due to unusual data points. At these moments, resampling kills most of the particles and aggravates the degeneracy problem for the naive smoother. As we have seen in the above figures, systematic resampling is not enough to mitigate this in the long run.

6. Conclusion

6.1. Practical recommendations

Our first recommendation does not concern the smoothing algorithm per se. It is of paramount importance that the particle filter used in the preliminary filtering step performs reasonably well, since its output defines the support of the approximations generated by the subsequent smoothing algorithm. (Standard recommendations to obtain good performance from a particle filter are to increase $N$, to use better proposal distributions, or both.)

When the transition density is tractable, we recommend the MCMC smoother by default (rather than the standard $\mathcal{O}(N^{2})$ approach). It has a deterministic $\mathcal{O}(N)$ complexity, it does not require the transition density to be bounded, and it performs well with only one or two MCMC steps. If one still prefers the rejection smoother, there is no reason not to use the hybrid method.

Although the assumptions under which we prove the stability of the smoothing estimates are strong, the general message still holds: the Markov kernel and the potential functions must make the model forget its past in some way; otherwise, we get an unstable model for which no smoothing method can work. The rejection sampling-based smoothing algorithms can therefore serve as the ultimate test: since they simulate exactly independent trajectories given the skeleton, there is no hope of performing better, unless one switches to another family of smoothing algorithms.

For intractable models, the key issue is to design couplings with a high meeting probability. Fortunately, the inherent mixing of the model makes it possible to choose two very close starting points for the dynamics, and thus to obtain a reasonable meeting probability. If difficulties persist, there is a practical (and very heuristic) recipe to test whether a given coupling of $M_{t}(x,\cdot)$ and $M_{t}(x^{\prime},\cdot)$ is close to optimal: approximate $M_{t}(x,\cdot)$ and $M_{t}(x^{\prime},\cdot)$ by Gaussian distributions and deduce the optimal meeting probability from their total variation distance. There is no closed-form formula for the total variation distance between two Gaussian distributions in high dimensions, but it can be reliably estimated from the fact that the total variation distance equals one minus the integral of the pointwise minimum of the two densities (cf. Lemma A.1). In this way, one can get a rough idea of the extent to which a certain coupling realises the meeting potential of the two distributions. If the coupling seems good and the trajectories still look degenerate, it may well be that the model itself is unstable.
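As an illustration of this recipe, here is a minimal sketch (our own, for illustration only) of a Monte Carlo estimate of the total variation distance between two Gaussian approximations, based on the identity $\operatorname{TV}(p,q)=1-\mathbb{E}_{x\sim p}[\min(1,q(x)/p(x))]$; the quantity $1-\operatorname{TV}$ then upper bounds the meeting probability achievable by any coupling.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def tv_gaussians_mc(mu1, cov1, mu2, cov2, n=10**5, rng=None):
    """Monte Carlo estimate of TV(N(mu1, cov1), N(mu2, cov2)) via
    TV = 1 - E_{x ~ p}[min(1, q(x)/p(x))] (cf. Lemma A.1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.multivariate_normal(mu1, cov1, size=n)
    log_ratio = mvn.logpdf(x, mu2, cov2) - mvn.logpdf(x, mu1, cov1)
    return 1.0 - np.mean(np.minimum(1.0, np.exp(log_ratio)))
```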

6.2. Further directions

The major limitation of our work is that the theoretical analysis is carried out exclusively under the bootstrap particle filter. Moreover, we require that the $N$ new particles generated at step $t$ be conditionally independent given the particles at time $t-1$; this excludes practical optimisations such as systematic resampling and Algorithm 9. Finally, the backward sampling step is also used in other algorithms (in particular particle Markov chain Monte Carlo, see Andrieu et al.,, 2010) and it would be interesting to see to what extent our techniques can be applied there.

6.3. Data and code

The code used to run numerical experiments is available at https://github.com/hai-dang-dau/backward-samplers-code. Some of the algorithms are already available in an experimental branch of the particles Python package at https://github.com/nchopin/particles.

Acknowledgements

The first author acknowledges a CREST PhD scholarship via AMX funding, and wishes to thank the members of his PhD defense jury (Stéphanie Allassonière, Randal Douc, Arnaud Doucet, Anthony Lee, Pierre del Moral, Christian Robert) for helpful comments on the corresponding chapter of his thesis.

We also thank Adrien Corenflos, Samuel Duffield, the associate editor, and the referees for their comments on a preliminary version of the paper.

References

  • Andrieu et al., (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(3):269–342.
  • Beskos et al., (2006) Beskos, A., Papaspiliopoulos, O., Roberts, G. O., and Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382. With discussions and a reply by the authors.
  • Bishop, (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. Information Science and Statistics. Springer, New York.
  • Bou-Rabee et al., (2020) Bou-Rabee, N., Eberle, A., and Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3):1209 – 1250.
  • Bunch and Godsill, (2013) Bunch, P. and Godsill, S. (2013). Improved particle approximations to the joint smoothing distribution using Markov chain Monte Carlo. IEEE Transactions on Signal Processing, 61(4):956–963.
  • Carpenter et al., (1999) Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proc. Radar, Sonar Navigation, 146(1):2–7.
  • Chopin, (2004) Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist., 32(6):2385–2411.
  • Chopin and Papaspiliopoulos, (2020) Chopin, N. and Papaspiliopoulos, O. (2020). An Introduction to Sequential Monte Carlo. Springer Series in Statistics. Springer.
  • Corenflos and Särkkä, (2022) Corenflos, A. and Särkkä, S. (2022). The Coupled Rejection Sampler. arXiv preprint arXiv:2201.09585.
  • Del Moral, (2004) Del Moral, P. (2004). Feynman-Kac formulae. Genealogical and interacting particle systems with applications. Probability and its Applications. Springer Verlag, New York.
  • Del Moral, (2013) Del Moral, P. (2013). Mean field simulation for Monte Carlo integration, volume 126 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.
  • Del Moral et al., (2010) Del Moral, P., Doucet, A., and Singh, S. S. (2010). A backward particle interpretation of Feynman-Kac formulae. M2AN Math. Model. Numer. Anal., 44(5):947–975.
  • Del Moral and Miclo, (2001) Del Moral, P. and Miclo, L. (2001). Genealogies and increasing propagation of chaos for Feynman-Kac and genetic models. Ann. Appl. Probab., 11(4):1166–1198.
  • Douc et al., (2005) Douc, R., Cappé, O., and Moulines, E. (2005). Comparison of resampling schemes for particle filtering. In ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pages 64–69. IEEE.
  • Douc et al., (2011) Douc, R., Garivier, A., Moulines, E., and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab., 21(6):2109–2145.
  • Douc et al., (2018) Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018). Markov chains. Springer Series in Operations Research and Financial Engineering. Springer, Cham.
  • Dubarry and Le Corff, (2013) Dubarry, C. and Le Corff, S. (2013). Non-asymptotic deviation inequalities for smoothed additive functionals in nonlinear state-space models. Bernoulli, 19(5B):2222–2249.
  • Duffield and Singh, (2022) Duffield, S. and Singh, S. S. (2022). Online particle smoothing with application to map-matching. IEEE Trans. Signal Process., 70:497–508.
  • Fearnhead et al., (2008) Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. (2008). Particle filters for partially observed diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(4):755–777.
  • Fearnhead et al., (2010) Fearnhead, P., Wyncoll, D., and Tawn, J. (2010). A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464.
  • Gerber and Chopin, (2015) Gerber, M. and Chopin, N. (2015). Sequential quasi Monte Carlo. J. R. Stat. Soc. Ser. B. Stat. Methodol., 77(3):509–579.
  • Gerber et al., (2019) Gerber, M., Chopin, N., and Whiteley, N. (2019). Negative association, ordering and convergence of resampling methods. Ann. Statist., 47(4):2236–2260.
  • Gloaguen et al., (2022) Gloaguen, P., Le Corff, S., and Olsson, J. (2022). A pseudo-marginal sequential Monte Carlo online smoothing algorithm. Bernoulli, 28(4):2606–2633.
  • Godsill et al., (2004) Godsill, S. J., Doucet, A., and West, M. (2004). Monte Carlo smoothing for nonlinear times series. J. Amer. Statist. Assoc., 99(465):156–168.
  • Gordon et al., (1993) Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Comm., Radar, Signal Proc., 140(2):107–113.
  • Guarniero et al., (2017) Guarniero, P., Johansen, A. M., and Lee, A. (2017). The iterated auxiliary particle filter. J. Amer. Statist. Assoc., 112(520):1636–1647.
  • Hening and Nguyen, (2018) Hening, A. and Nguyen, D. H. (2018). Stochastic Lotka-Volterra food chains. J. Math. Biol., 77(1):135–163.
  • Jacob et al., (2019) Jacob, P. E., Lindsten, F., and Schön, T. B. (2019). Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association.
  • Jacob et al., (2020) Jacob, P. E., O’Leary, J., and Atchadé, Y. F. (2020). Unbiased Markov chain Monte Carlo methods with couplings. J. R. Stat. Soc. Ser. B. Stat. Methodol., 82(3):543–600.
  • Janson, (2011) Janson, S. (2011). Probability asymptotics: notes on notation. arXiv preprint arXiv:1108.3924.
  • Jasra et al., (2017) Jasra, A., Kamatani, K., Law, K. J., and Zhou, Y. (2017). Multilevel particle filters. SIAM Journal on Numerical Analysis, 55(6):3068–3096.
  • Kalman, (1960) Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45.
  • Kalman and Bucy, (1961) Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. ASME Ser. D. J. Basic Engrg., 83:95–108.
  • Kitagawa, (1996) Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist., 5(1):1–25.
  • Lévy, (1940) Lévy, P. (1940). Sur certains processus stochastiques homogènes. Compositio mathematica, 7:283–339.
  • Lindvall and Rogers, (1986) Lindvall, T. and Rogers, L. C. G. (1986). Coupling of multidimensional diffusions by reflection. Ann. Probab., 14(3):860–872.
  • Lotka, (1925) Lotka, A. J. (1925). Elements of physical biology. Williams & Wilkins.
  • Mastrototaro et al., (2021) Mastrototaro, A., Olsson, J., and Alenlöv, J. (2021). Fast and numerically stable particle-based online additive smoothing: the adasmooth algorithm. arXiv preprint arXiv:2108.00432.
  • Mörters and Peres, (2010) Mörters, P. and Peres, Y. (2010). Brownian motion, volume 30. Cambridge University Press.
  • Nordh and Antonsson, (2015) Nordh, J. and Antonsson, J. (2015). A Quantitative Evaluation of Monte Carlo Smoothers. Technical report.
  • Olsson and Westerborn, (2017) Olsson, J. and Westerborn, J. (2017). Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 23(3):1951–1996.
  • Olver et al., (2010) Olver, F. W. J., Lozier, D. W., Boisvert, R. F., and Clark, C. W., editors (2010). NIST handbook of mathematical functions. U.S. Department of Commerce, National Institute of Standards and Technology, Washington, DC; Cambridge University Press, Cambridge. With 1 CD-ROM (Windows, Macintosh and UNIX).
  • Pitt and Shephard, (1999) Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: auxiliary particle filters. J. Amer. Statist. Assoc., 94(446):590–599.
  • Robert and Casella, (2004) Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer-Verlag, New York.
  • Roberts and Rosenthal, (2004) Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv., 1:20–71.
  • Samet, (2006) Samet, H. (2006). Foundations of multidimensional and metric data structures. Morgan Kaufmann.
  • Sen et al., (2018) Sen, D., Thiery, A. H., and Jasra, A. (2018). On coupling particle filter trajectories. Statistics and Computing, 28(2):461–475.
  • Taghavi et al., (2013) Taghavi, E., Lindsten, F., Svensson, L., and Schön, T. B. (2013). Adaptive stopping for fast particle smoothing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6293–6297.
  • Vershynin, (2018) Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • Volterra, (1928) Volterra, V. (1928). Variations and fluctuations of the number of individuals in animal species living together. ICES Journal of Marine Science, 3(1):3–51.
  • Wang et al., (2021) Wang, G., O’Leary, J., and Jacob, P. (2021). Maximal couplings of the Metropolis-Hastings algorithm. In International Conference on Artificial Intelligence and Statistics, pages 1225–1233. PMLR.
  • Yonekura and Beskos, (2022) Yonekura, S. and Beskos, A. (2022). Online smoothing for diffusion processes observed with noise. Journal of Computational and Graphical Statistics, 0(0):1–17.

Supplement A Additional notations

This section defines new notations that do not appear in the main text (except notations for linear Gaussian models) but are used in the Supplement.

A.1. Linear Gaussian models

Let dimX\operatorname{dim}_{X} and dimY\operatorname{dim}_{Y} be two strictly positive integers and FXF_{X} and FYF_{Y} be two full-rank matrices of sizes dimX×dimX\operatorname{dim}_{X}\times\operatorname{dim}_{X} and dimY×dimX\operatorname{dim}_{Y}\times\operatorname{dim}_{X} respectively. Let CXC_{X} and CYC_{Y} be two symmetric positive definite matrices of respective sizes dimX×dimX\operatorname{dim}_{X}\times\operatorname{dim}_{X} and dimY×dimY\operatorname{dim}_{Y}\times\operatorname{dim}_{Y}. A linear Gaussian state space model has the underlying Markov process defined by

Xt|X0:t1𝒩(FXXt1,CX),X_{t}|X_{0:t-1}\sim\mathcal{N}(F_{X}X_{t-1},C_{X}),

where X0X_{0} also follows a Gaussian distribution; and admits the observation process

Yt|Xt𝒩(FYXt,CY).Y_{t}|X_{t}\sim\mathcal{N}(F_{Y}X_{t},C_{Y}).

The predictive (XtX_{t} given Y0:t1Y_{0:t-1}), filtering (XtX_{t} given Y0:tY_{0:t}) and smoothing (XtX_{t} given Y0:TY_{0:T}) distributions are all Gaussian and their parameters can be explicitly calculated via recurrence formulas (Kalman,, 1960; Kalman and Bucy,, 1961). We shall denote their respective mean vectors and covariance matrices by (μtpred,Σtpred)(\mu_{t}^{\mathrm{pred}},\Sigma_{t}^{\mathrm{pred}}), (μtfilt,Σtfilt)(\mu_{t}^{\mathrm{filt}},\Sigma_{t}^{\mathrm{filt}}) and (μtsmth,Σtsmth)(\mu_{t}^{\mathrm{smth}},\Sigma_{t}^{\mathrm{smth}}). In particular, the starting distribution X0X_{0} is 𝒩(μ0pred,Σ0pred)\mathcal{N}(\mu_{0}^{\mathrm{pred}},\Sigma_{0}^{\mathrm{pred}}).
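For reference, a minimal sketch of one step of the corresponding Kalman recursion (prediction followed by update) in the notation above; this is the standard textbook recursion, included only as an illustration.

```python
import numpy as np

def kalman_step(mu_filt, sigma_filt, y, FX, CX, FY, CY):
    """One Kalman prediction/update step for the linear Gaussian model above.
    Returns the predictive and filtering moments at the next time step."""
    # Prediction: X_t | Y_{0:t-1}
    mu_pred = FX @ mu_filt
    sigma_pred = FX @ sigma_filt @ FX.T + CX
    # Update: X_t | Y_{0:t}
    s = FY @ sigma_pred @ FY.T + CY                  # innovation covariance
    gain = sigma_pred @ FY.T @ np.linalg.inv(s)      # Kalman gain
    mu_new = mu_pred + gain @ (y - FY @ mu_pred)
    sigma_new = sigma_pred - gain @ FY @ sigma_pred
    return (mu_pred, sigma_pred), (mu_new, sigma_new)
```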

A.2. Total variation distance

Let μ\mu and ν\nu be two probability measures on 𝒳\mathcal{X}. The total variation distance between μ\mu and ν\nu, sometimes also denoted TV(μ,ν)\operatorname{TV}(\mu,\nu), is defined as μνTV:=supf:𝒳[0,1]|μ(f)ν(f)|\left\lVert\mu-\nu\right\rVert_{\operatorname{TV}}:=\sup_{f:\mathcal{X}\to[0,1]}\left|\mu(f)-\nu(f)\right|. The definition remains valid if ff is restricted to the class of indicator functions on measurable subsets of 𝒳\mathcal{X}. It implies in particular that |μ(f)ν(f)|foscTV(μ,ν)\left|\mu(f)-\nu(f)\right|\leq\left\lVert f\right\rVert_{\mathrm{osc}}\operatorname{TV}(\mu,\nu).

Next, we state a lemma summarising basic properties of the total variation distance and defining coupling-related notions (see, e.g. Proposition 3 and formula (13) of Roberts and Rosenthal, (2004)). While the last property (covariance bound) is not in the aforementioned reference and does not seem popular in the literature, its proof is straightforward and therefore omitted.

Lemma A.1.

The total variation distance has the following properties:

  • (Alternative expressions.) If μ\mu and ν\nu admit densities f(x)f(x) and g(x)g(x) respectively with reference to a dominating measure λ\lambda, we have

    TV(μ,ν)=12|f(x)g(x)|λ(dx)=1min(f(x),g(x))λ(dx).\operatorname{TV}(\mu,\nu)=\frac{1}{2}\int\left|f(x)-g(x)\right|\lambda(\mathrm{d}x)=1-\int\min(f(x),g(x))\lambda(\mathrm{d}x).
  • (Coupling inequality & maximal coupling.) For any pair of random variables (M,N)(M,N) such that MμM\sim\mu and NνN\sim\nu, we have

    (MN)TV(μ,ν).\mathbb{P}(M\neq N)\geq\operatorname{TV}(\mu,\nu).

    There exist pairs (M,N)(M^{*},N^{*}) for which equality holds. They are called maximal couplings of μ\mu and ν\nu.

  • (Contraction property.) Let (Xn)(X_{n}) be a Markov chain with invariant measure μ\mu^{\star}. Then

    TV(Xn,μ)TV(Xn+1,μ).\operatorname{TV}(X_{n},\mu^{*})\geq\operatorname{TV}(X_{n+1},\mu^{*}).
  • (Covariance bound.) For any pair of random variables (M,N)(M,N) such that MμM\sim\mu and NνN\sim\nu and real-valued functions h1h_{1} and h2h_{2}, we have

    |Cov(h1(M),h2(N))|2h1h2TV((M,N),μν).\left|\mathrm{Cov}(h_{1}(M),h_{2}(N))\right|\leq 2\left\lVert h_{1}\right\rVert_{\infty}\left\lVert h_{2}\right\rVert_{\infty}\operatorname{TV}\left((M,N),\mu\otimes\nu\right).
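As a small numerical illustration (not taken from the paper), the two alternative expressions of the first item can be evaluated by quadrature for a pair of univariate Gaussians; the means, variances and integration grid below are arbitrary choices.

```python
# Numerical check (a sketch, not from the paper) of the two alternative
# expressions of the total variation distance for two univariate Gaussians.
import numpy as np
from scipy.stats import norm

f = norm(loc=0.0, scale=1.0).pdf     # density of mu
g = norm(loc=1.0, scale=2.0).pdf     # density of nu

x = np.linspace(-20.0, 20.0, 200_001)
dx = x[1] - x[0]
tv_half_l1 = 0.5 * np.sum(np.abs(f(x) - g(x))) * dx
tv_one_minus_min = 1.0 - np.sum(np.minimum(f(x), g(x))) * dx
print(tv_half_l1, tv_one_minus_min)  # the two values agree up to quadrature error
```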

A.3. Cost-to-go function

In the context of the Feynman-Kac model (1), define the associated cost-to-go function Ht:TH_{t:T} as (see e.g. Chopin and Papaspiliopoulos, (2020, Chapter 5))

(20) H_{t:T}(x_{t}):=\int\prod_{s=t+1}^{T}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

This function bridges t(dxt)\mathbb{Q}_{t}(\mathrm{d}x_{t}) and T(dxt)\mathbb{Q}_{T}(\mathrm{d}x_{t}), since T(dxt)t(dxt)Ht:T(xt)\mathbb{Q}_{T}(\mathrm{d}x_{t})\propto\mathbb{Q}_{t}(\mathrm{d}x_{t})H_{t:T}(x_{t}).
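Unrolling (20) backward in time gives H_{T:T}\equiv 1 and H_{t:T}(x_{t})=\int M_{t+1}(x_{t},\mathrm{d}x_{t+1})G_{t+1}(x_{t+1})H_{t+1:T}(x_{t+1}). The following finite-state sketch (not from the paper; the three-state kernels and potentials are made up) runs this recursion and checks the bridging identity above by brute-force enumeration of paths.

```python
# Cost-to-go recursion and the bridging identity on a small finite-state
# Feynman-Kac model (a sketch; the model below is arbitrary).
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, T, t = 3, 4, 2
m0 = rng.dirichlet(np.ones(K))                              # initial distribution
M = [rng.dirichlet(np.ones(K), size=K) for _ in range(T)]   # M[s-1][i, j] stands for M_s(i, j)
G = [rng.uniform(0.1, 1.0, size=K) for _ in range(T + 1)]   # potentials G_0, ..., G_T

# Backward recursion for the cost-to-go functions H_{t:T}.
H = [None] * (T + 1)
H[T] = np.ones(K)
for s in range(T - 1, -1, -1):
    H[s] = M[s] @ (G[s + 1] * H[s + 1])    # M[s] stands for M_{s+1}

def gamma(path):
    """Unnormalised Feynman-Kac weight of the partial path x_{0:len(path)-1}."""
    w = m0[path[0]] * G[0][path[0]]
    for s in range(1, len(path)):
        w *= M[s - 1][path[s - 1], path[s]] * G[s][path[s]]
    return w

def marginal_at_t(length):
    """Normalised marginal of X_t under the path measure up to time `length`."""
    out = np.zeros(K)
    for p in itertools.product(range(K), repeat=length + 1):
        out[p[t]] += gamma(p)
    return out / out.sum()

Qt, QT = marginal_at_t(t), marginal_at_t(T)
bridged = Qt * H[t]
print(np.allclose(QT, bridged / bridged.sum()))   # True: Q_T(dx_t) is prop. to Q_t(dx_t) H_{t:T}(x_t)
```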

A.4. The projection kernel

Let 𝒳\mathcal{X} and 𝒴\mathcal{Y} be two measurable spaces. The projection kernel Π𝒳(𝒳,𝒴)\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}} is defined by

Π𝒳(𝒳,𝒴)((x,y),dx):=δx(dx).\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}\left((x,y),\mathrm{d}x^{*}\right):=\delta_{x}(\mathrm{d}x^{*}).

In particular, for any function g:𝒳g:\mathcal{X}\to\mathbb{R} and measure μ(dx,dy)\mu(\mathrm{d}x,\mathrm{d}y) defined on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, we have

(Π𝒳(𝒳,𝒴)g)(x,y)\displaystyle(\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}}g)(x,y) =g(x)\displaystyle=g(x)
(μΠ𝒳(𝒳,𝒴))(g)\displaystyle(\mu\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}})(g) =g(x)μ(dx,dy)=g(x)μ(dx)\displaystyle=\iint g(x)\mu(\mathrm{d}x,\mathrm{d}y)=\int g(x)\mu(\mathrm{d}x)

where the second identity shows the marginalising action of Π𝒳(𝒳,𝒴)\Pi^{(\mathcal{X},\mathcal{Y})}_{\mathcal{X}} on μ\mu. In the context of state space models, we define the shorthand

Πt0:T:=Π𝒳t(𝒳0,,𝒳T).\Pi^{0:T}_{t}:=\Pi^{(\mathcal{X}_{0},\ldots,\mathcal{X}_{T})}_{\mathcal{X}_{t}}.
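As a quick sanity check (not from the paper), the marginalising action of the projection kernel can be verified on a small discrete joint distribution; the joint table and the function g below are arbitrary.

```python
# Discrete sketch of the projection kernel identities: (Pi g)(x, y) = g(x), and
# integrating Pi g against a joint measure mu equals integrating g against the
# x-marginal of mu.
import numpy as np

rng = np.random.default_rng(1)
mu = rng.uniform(size=(4, 5))
mu /= mu.sum()                       # joint pmf on X x Y with |X| = 4, |Y| = 5
g = rng.normal(size=4)               # a function on X

lhs = sum(g[x] * mu[x, y] for x in range(4) for y in range(5))  # (mu Pi)(g)
rhs = g @ mu.sum(axis=1)                                        # g against the X-marginal
print(np.allclose(lhs, rhs))         # True
```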

A.5. Other notations

  • For a real number x, let \lfloor x\rfloor be the largest integer not exceeding x; the mapping x\mapsto\lfloor x\rfloor is called the floor function.
  • The Gamma function \Gamma(a) is defined for a>0 by \Gamma(a):=\int_{\mathbb{R}_{+}}e^{-x}x^{a-1}\mathrm{d}x.
  • Let \mathcal{X} and \mathcal{Y} be two measurable spaces and let K(x,\mathrm{d}y) be a (not necessarily probability) kernel from \mathcal{X} to \mathcal{Y}. The norm of K is defined by \left\lVert K\right\rVert_{\infty}:=\sup_{f:\mathcal{Y}\to\mathbb{R},f\neq 0}\left\lVert Kf\right\rVert_{\infty}/\left\lVert f\right\rVert_{\infty}. In particular, for any function f:\mathcal{Y}\to\mathbb{R}, we have \left\lVert Kf\right\rVert_{\infty}\leq\left\lVert K\right\rVert_{\infty}\left\lVert f\right\rVert_{\infty}.
  • Let X_{n} be a sequence of random variables. We say that X_{n}=\mathcal{O}_{\mathbb{P}}(1) if for any \varepsilon>0, there exist M>0 and N_{0}, both depending on \varepsilon, such that \mathbb{P}(|X_{n}|\geq M)\leq\varepsilon for all n\geq N_{0}. For a strictly positive deterministic sequence a_{n}, we say that X_{n}=\mathcal{O}_{\mathbb{P}}(a_{n}) if X_{n}/a_{n}=\mathcal{O}_{\mathbb{P}}(1). See Janson, (2011) for a discussion.
  • We use the notation \mathcal{N}(x|\mu,\Sigma) to refer to the value at x of the density function of the normal distribution \mathcal{N}(\mu,\Sigma).
  • Let f:U\to V be a function from some space U to another space V and let S be a subset of U. The restriction of f to S, written f|_{S}, is the function from S to V defined by f|_{S}(x)=f(x) for all x\in S.

Supplement B FFBS complexity for different rejection schemes

B.1. Framework and notations

The FFBS algorithm is a particular instance of Algorithm 2 where BtN,FFBSB_{t}^{N,\mathrm{FFBS}} kernels are used. If backward simulation is done using pure rejection sampling (Algorithm 4), the computational cost to simulate the t1t-1-th index of the nn-th trajectory has conditional distribution

(21) τtn,FFBS|T,t:TnGeo(iWt1imt(Xt1i,Xttn)M¯h).\tau_{t}^{n,\mathrm{FFBS}}|\ \mathcal{F}_{T},\mathcal{I}_{t:T}^{n}\sim\operatorname{Geo}\left(\frac{\sum_{i}W_{t-1}^{i}m_{t}(X_{t-1}^{i},X_{t}^{\mathcal{I}_{t}^{n}})}{\bar{M}_{h}}\right).

At this point, it is instructive to compare this formula with (16) for the PaRIS algorithm. The difference is subtle, but it drives markedly different behaviour for rejection-based FFBS.

If hybrid rejection sampling (Algorithm 6) is used instead, we are interested in the distribution of \min(\tau_{t}^{n,\mathrm{FFBS}},N), for reasons discussed in Subsection 3.2. In a highly parallel setting, it is preferable that the distribution of the individual execution times, i.e. \tau_{t}^{n,\mathrm{FFBS}} or \min(\tau_{t}^{n,\mathrm{FFBS}},N), is not heavy-tailed. In contrast, on non-parallel hardware, only the cumulative execution times, i.e. \sum_{n=1}^{N}\tau_{t}^{n,\mathrm{FFBS}} or \sum_{n=1}^{N}\min(\tau_{t}^{n,\mathrm{FFBS}},N), matter. Even though the individual times might behave badly, the cumulative times can be much more regular thanks to the averaging effect of the central limit theorem, whenever it applies. Nevertheless, studying the finiteness of the k-th order moment of \tau_{t}^{1,\mathrm{FFBS}} is still a good way to obtain information about both types of execution times, since it implies the finiteness (or infiniteness) of the corresponding moment for both of them.

B.2. Execution time for pure rejection sampling

We show that, under certain circumstances, the execution time of the pure rejection procedure has infinite expectation. Proposition 1 in Douc et al., (2011) hints that the cost per trajectory of FFBS-reject might tend to infinity as N\to\infty. In contrast, we show that the expectation may already be infinite for finite sample sizes. We first give the statements for general state-space models, then focus on their implications for Gaussian ones. In particular, while infinite expectations occur only under certain configurations, infinite higher-order moments occur in all linear Gaussian models with non-degenerate dynamics.

Theorem B.1.

Using the setting and notations of Supplement B.1, under Assumptions 1 and 4 , we have 𝔼[τt1,FFBS]=\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\infty whenever

𝒳tGt(xt)Ht:T(xt)λt(dxt)=\int_{\mathcal{X}_{t}}G_{t}(x_{t})H_{t:T}(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty

where the cost-to-go function Ht:TH_{t:T} is defined in (20) and the measure λt\lambda_{t} is defined in (2).

Theorem B.2.

Using the setting and notations of Supplement B.1, consider a linear Gaussian model with the notations of Supplement A.1. Then \mathbb{E}[(\tau_{t}^{1,\mathrm{FFBS}})^{k}]=\infty whenever k is greater than k_{0}, defined as the smallest eigenvalue of the matrix \operatorname{Id}+C_{X}^{1/2}\left({(\Sigma^{\mathrm{smth}}_{t})}^{-1}-{(\Sigma^{\mathrm{pred}}_{t})}^{-1}\right)C_{X}^{1/2}.

The proofs of the two assertions are given in Supplement E.7. We now look at how they are manifested in concrete examples. The first remark is that for technical reasons, Theorem B.2 gives no information on the finiteness of E[(τt1,FFBS)k]E[(\tau_{t}^{1,\mathrm{FFBS}})^{k}] for k=1k=1 (since k0k_{0} is already greater than or equal to 11 by definition). To study the finiteness of 𝔼[τt1,FFBS]\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}], we thus turn to Theorem B.1.

Example 4.

In linear Gaussian models, the integral of Theorem B.1 is equal to

𝒩(yt|FYxt,CY)s=t+1T𝒩(xs|FXxs1,CX)𝒩(ys|FYxs,CY)dxt:T\int\mathcal{N}(y_{t}|F_{Y}x_{t},C_{Y})\prod_{s=t+1}^{T}\mathcal{N}(x_{s}|F_{X}x_{s-1},C_{X})\mathcal{N}(y_{s}|F_{Y}x_{s},C_{Y})\mathrm{d}x_{t:T}

where the notation 𝒩(μ,Σ)\mathcal{N}(\mu,\Sigma) refers to the density of the normal distribution. The integrand is proportional to exp[0.5(Q(xt:T)R(xt:T))]\exp[-0.5(Q(x_{t:T})-R(x_{t:T}))] for some quadratic form Q(xt:T)Q(x_{t:T}) and linear form R(xt:T)R(x_{t:T}). The integral is finite if and only if QQ is positive definite. In our case, this means that there is no non-trivial root for the equation Q(xt:T)=0Q(x_{t:T})=0, which is equivalent to

{FYxs=0,s=t,TFXxs1=xs,s=t+1,,T.\begin{cases}F_{Y}x_{s}&=0,\forall s=t,\ldots T\\ F_{X}x_{s-1}&=x_{s},\forall s=t+1,\ldots,T.\end{cases}

Put another way, 𝔼[τt1,FFBS]\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}] is infinite whenever the intersection

k=0TtKer(FYFXk)=k=0TtFXk(Ker(FY))\bigcap_{k=0}^{T-t}\operatorname{Ker}(F_{Y}F_{X}^{k})=\bigcap_{k=0}^{T-t}F_{X}^{-k}(\operatorname{Ker}(F_{Y}))

contains vectors other than the zero vector. A common and particularly troublesome situation is when F_{X}=c\operatorname{Id} for some c>0 (but C_{X} can be arbitrary) and the dimension of the states (\operatorname{dim}_{X}) is greater than that of the observations (\operatorname{dim}_{Y}). Then the above intersection remains non-trivial no matter how large T-t is, so that \mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}] is infinite for every t. In general, the problem is less severe, as successive intersections quickly shrink the space to \{0\}; consequently, Theorem B.1 only establishes the infiniteness of \mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}] for t close to T. The bad news, however, comes from higher-order moments, as the example below shows.

We will now focus on a simple but particularly striking example. Our purpose here is to illustrate the concepts as well as to show that their implications are relevant even in small, familiar settings. More advanced scenarios are presented in Section 5 devoted to numerical experiments.

Example 5.

We consider two one-dimensional Gaussian state-space models: both have F_{X}=0.5, C_{X}=1, X_{0}\sim\mathcal{N}(0,C_{X}/(1-F_{X}^{2})) and T=3. The only difference between them is that one has \sigma_{y}^{2}:=C_{Y}=0.5^{2} and the other has \sigma_{y}^{2}=3^{2}. We are interested in the execution times \tau_{1}^{n,\mathrm{FFBS}} at time t=1 (i.e. the rejection-based simulation of the indices \mathcal{I}_{0}^{n} at time t=0). Theorem B.2 then gives k_{0}\approx 1.14 for \sigma_{y}=3 and k_{0}\approx 5 for \sigma_{y}=0.5. The first implication is that in both cases \tau_{1}^{n,\mathrm{FFBS}} is a heavy-tailed random variable, so FFBS-reject is not a viable option in a highly parallel setting. But an interesting phenomenon occurs in the sequential hardware scenario, where one is instead interested in the cumulative execution time \sum_{n=1}^{N}\tau_{1}^{n,\mathrm{FFBS}}, or equivalently the mean number of trials per particle. In the \sigma_{y}=3 case, the non-existence of the second moment prevents the regularising effect of the central limit theorem on the cumulative time. This is not the case for \sigma_{y}=0.5, for which the cumulative execution time actually behaves nicely (Figures 7 and 8). Perhaps the most valuable message from this example, however, is that the performance of FFBS-reject depends in a non-trivial (hard to predict) way on the model parameters.
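The k_{0} values quoted above can be reproduced from the scalar Kalman and Rauch-Tung-Striebel recursions. The sketch below (not taken from the paper's code) assumes F_{Y}=1, which the example leaves implicit; it returns approximately 1.14 and 5.2, consistent with the values quoted above.

```python
# Sketch reproducing the k_0 values of Example 5 from scalar Kalman/RTS
# recursions; F_Y = 1 is an assumption not stated in the example.
def k0_scalar(F_X, C_X, F_Y, C_Y, T, t):
    # Predictive and filtering variances; in the linear Gaussian case they do
    # not depend on the observed data.
    pred = [C_X / (1.0 - F_X ** 2)]              # stationary variance of X_0
    filt = []
    for s in range(T + 1):
        filt.append(1.0 / (1.0 / pred[s] + F_Y ** 2 / C_Y))
        pred.append(F_X ** 2 * filt[s] + C_X)
    # Rauch-Tung-Striebel backward pass for the smoothing variances.
    smth = [None] * (T + 1)
    smth[T] = filt[T]
    for s in range(T - 1, -1, -1):
        J = filt[s] * F_X / pred[s + 1]
        smth[s] = filt[s] + J ** 2 * (smth[s + 1] - pred[s + 1])
    # Theorem B.2 in dimension one: k_0 = 1 + C_X (1/Sigma_t^smth - 1/Sigma_t^pred).
    return 1.0 + C_X * (1.0 / smth[t] - 1.0 / pred[t])

for sigma_y in (3.0, 0.5):
    print(sigma_y, k0_scalar(F_X=0.5, C_X=1.0, F_Y=1.0, C_Y=sigma_y ** 2, T=3, t=1))
# approximately 1.14 for sigma_y = 3 and 5.2 for sigma_y = 0.5
```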

Figure 7. Box plots of the mean number of trials per particle needed to simulate indices at time 0, for the models described in Example 5 and for FFBS algorithms based on pure and hybrid rejection sampling. The figure is obtained by running bootstrap particle filters with N=500 over 1500 independent executions.
Figure 8. Zoom of Figure 7 to the range 0\leq y\leq 8.

B.3. Execution time for hybrid rejection sampling

Formula (21) suggests defining the limit distribution τt,FFBS\tau_{t}^{\infty,\mathrm{FFBS}} as

τt,FFBS|Xt,FFBSGeo(rt(Xt,FFBS)M¯h)\tau_{t}^{\infty,\mathrm{FFBS}}\ |\ X_{t}^{\infty,\mathrm{FFBS}}\sim\operatorname{Geo}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{FFBS}})}{\bar{M}_{h}}\right)

where X_{t}^{\infty,\mathrm{FFBS}}\sim\mathbb{Q}_{T}(\mathrm{d}x_{t}) and r_{t} is the function given in Definition 1. These quantities provide the following characterisation of the cumulative execution time of the hybrid FFBS algorithm (proved in Section E.8).

Theorem B.3.

Under Assumptions 1 and 4 and the setting of Section B.1, we have

n=1Nmin(τtn,FFBS,N)N=𝒪(𝔼[min(τt,FFBS,N)])\frac{\sum_{n=1}^{N}\min(\tau_{t}^{n,\mathrm{FFBS}},N)}{N}=\mathcal{O}_{\mathbb{P}}\left(\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]\right)

where the notation 𝒪\mathcal{O}_{\mathbb{P}} is defined in Supplement A.

This theorem admits the following corollary for linear Gaussian models (also proved in Section E.8).

Corollary 2.

For linear Gaussian models (Supplement A.1), if smoothing is performed using the hybrid rejection version of the FFBS algorithm, the mean execution time per particle at time step tt is 𝒪(logdt/2N)\mathcal{O}_{\mathbb{P}}(\log^{d_{t}/2}N) where dtd_{t} is the dimension of 𝒳t\mathcal{X}_{t}.

The bound \mathcal{O}_{\mathbb{P}}(\log^{d_{t}/2}N) is actually quite conservative. For instance, with either \sigma_{y}=0.5 or \sigma_{y}=3, the model considered in Example 5 satisfies \mathbb{E}[\tau_{t}^{\infty,\mathrm{FFBS}}]<\infty. (The Gaussian dynamics allow exact analytic calculations, which make the claim straightforward to verify.) Theorem B.3 then gives an execution time per particle of order \mathcal{O}_{\mathbb{P}}(1) for hybrid FFBS, which is better than the \mathcal{O}_{\mathbb{P}}(\sqrt{\log N}) predicted by Corollary 2. Another unsatisfactory aspect of the result is that it fails to explain the spectacular improvement brought by hybrid rejection sampling over the ordinary procedure in the \sigma_{y}=3 case (see Figure 7). As explained in Example 5, this improvement is connected to the variance of \tau_{t}^{1,\mathrm{FFBS}} and not merely its expectation; a study of the second-order properties of N^{-1}\sum_{n}\min(\tau_{t}^{n,\mathrm{FFBS}},N) would therefore be desirable.

Supplement C Conditionally-correlated versions of particle algorithms

C.1. Alternative resampling schemes

In Algorithm 1, the indices At1:NA_{t}^{1:N} are drawn conditionally i.i.d. from the multinomial distribution (Wt11:N)\mathcal{M}(W_{t-1}^{1:N}). They satisfy

𝔼[j=1N𝟙Atj=i|t1]=NWt1i\mathbb{E}\left[\left.{\sum_{j=1}^{N}\mathbb{1}_{A_{t}^{j}=i}}\right|{\mathcal{F}_{t-1}}\right]=NW_{t-1}^{i}

for any i=1,,Ni=1,\ldots,N. There are other ways to generate At1:NA_{t}^{1:N} from Wt11:NW_{t-1}^{1:N} that still verify this identity. We call them unbiased resampling schemes, and the natural one used in Algorithm 1 multinomial resampling.

The main motivation for alternative resampling schemes is performance. We refer to Chopin, (2004); Douc et al., (2005); Gerber et al., (2019) for more details, but would like to mention that the theoretical studies of particle algorithms using other resampling schemes are more complicated since Xt1:NX_{t}^{1:N} are no longer i.i.d. given t1\mathcal{F}_{t-1}. We use systematic resampling (Carpenter et al.,, 1999) in our experiments. See Algorithm 8 for a succinct description and Chopin and Papaspiliopoulos, (2020, Chapter 9) for efficient implementations in 𝒪(N)\mathcal{O}(N) running time.

Input: Weights Wt11:NW_{t-1}^{1:N} summing to 11
Generate UUniform[0,1]U\sim\operatorname{Uniform}[0,1]
for n1n\leftarrow 1 to NN do
       Set A_{t}^{n} to the unique index k satisfying
W_{t-1}^{1}+\cdots+W_{t-1}^{k-1}\leq\frac{n-1+U}{N}<W_{t-1}^{1}+\cdots+W_{t-1}^{k}
Output: Resampled indices At1:NA_{t}^{1:N}
Algorithm 8 Systematic resampling
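A compact numpy sketch of Algorithm 8 follows (it is not the implementation referenced above; the call to searchsorted makes it \mathcal{O}(N\log N) rather than \mathcal{O}(N), which is enough for illustration).

```python
# Systematic resampling (Algorithm 8): a numpy sketch returning 0-based indices.
import numpy as np

def systematic_resampling(W, rng):
    """Ancestor indices A^{1:N} for weights W summing to 1."""
    N = len(W)
    u = rng.uniform()
    points = (np.arange(N) + u) / N              # (n - 1 + U) / N for n = 1, ..., N
    cumulative = np.cumsum(W)
    # A^n is the unique k with W^1 + ... + W^{k-1} <= (n - 1 + U)/N < W^1 + ... + W^k.
    idx = np.searchsorted(cumulative, points, side="right")
    return np.minimum(idx, N - 1)                # guard against round-off in the last bin

rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(10))
print(systematic_resampling(W, rng))
```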

C.2. Conditionally-correlated version of Algorithm 7

In this part, we present an alternative version of Algorithm 7 that does not create conditionally i.i.d. particles at each time step. The procedure is detailed in Algorithm 9. It creates on the fly backward kernels BtN,ITRCB_{t}^{N,\mathrm{ITRC}} (for “intractable, conditionally correlated”). It involves a resampling step which can be done in principle using any unbiased resampling scheme. Following the intuitions of Subsection 4.2.3 and the notations of Algorithm 9, we want a scheme such that in most cases, At2k1At2kA_{t}^{2k-1}\neq A_{t}^{2k} but the Euclidean distance between Xt1At2k1X_{t-1}^{A_{t}^{2k-1}} and Xt1At2kX_{t-1}^{A_{t}^{2k}} is small. Algorithm 10 proposes such a method (which we name the Adjacent Resampler). It can run in 𝒪(N)\mathcal{O}(N) time using a suitably implemented linked list.

Input: Feynman-Kac model (1), particles Xt11:NX_{t-1}^{1:N} and weights Wt11:NW_{t-1}^{1:N} that approximate t1(dxt1)\mathbb{Q}_{t-1}(\mathrm{d}x_{t-1})
Resample At1:NA_{t}^{1:N} from {1,2,,N}\left\{1,2,\ldots,N\right\} with weights Wt11:NW_{t-1}^{1:N} using any resampling scheme (such as the Adjacent Resampler in Algorithm 10)
for k1k\leftarrow 1 to N/2N/2 do
       Move. Simulate Xt2k1X_{t}^{2k-1} and Xt2kX_{t}^{2k} such that marginally, Xt2k1Mt(Xt1At2k1,)X_{t}^{2k-1}\sim M_{t}(X_{t-1}^{A_{t}^{2k-1}},\cdot) and Xt2kMt(Xt1At2k,)X_{t}^{2k}\sim M_{t}(X_{t-1}^{A_{t}^{2k}},\cdot)
       Calculate backward kernel.
       if Xt2k1=Xt2kX_{t}^{2k-1}=X_{t}^{2k} then
             Set BtN,ITRC(2k1,)(δ{At2k1}+δ{At2k})/2B_{t}^{N,\mathrm{ITRC}}(2k-1,\cdot)\leftarrow\left(\delta\left\{A_{t}^{2k-1}\right\}+\delta\left\{A_{t}^{2k}\right\}\right)/2
             Set BtN,ITRC(2k,)(δ{At2k1}+δ{At2k})/2B_{t}^{N,\mathrm{ITRC}}(2k,\cdot)\leftarrow\left(\delta\left\{A_{t}^{2k-1}\right\}+\delta\left\{A_{t}^{2k}\right\}\right)/2
      else
             Set BtN,ITRC(2k1,)δ{At2k1}B_{t}^{N,\mathrm{ITRC}}(2k-1,\cdot)\leftarrow\delta\left\{A_{t}^{2k-1}\right\}
             Set BtN,ITRC(2k,)δ{At2k}B_{t}^{N,\mathrm{ITRC}}(2k,\cdot)\leftarrow\delta\left\{A_{t}^{2k}\right\}
            
      
Reweight. Set ωtnGt(Xtn)\omega_{t}^{n}\leftarrow G_{t}(X_{t}^{n}) for n=1,2,,Nn=1,2,\ldots,N
Set tNn=1Nωtn/N\ell_{t}^{N}\leftarrow\sum_{n=1}^{N}\omega_{t}^{n}/N
Set Wtnωtn/NtNW_{t}^{n}\leftarrow\omega_{t}^{n}/N\ell_{t}^{N} for n=1,2,,Nn=1,2,\ldots,N
Output: Particles Xt1:NX_{t}^{1:N} and weights Wt1:NW_{t}^{1:N} that approximate t(dxt)\mathbb{Q}_{t}(\mathrm{d}x_{t}); backward kernel BtN,ITRCB_{t}^{N,\mathrm{ITRC}} for use in Algorithms 2 and 3
Algorithm 9 Conditionally-correlated version of Algorithm 7
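A schematic Python version of one step of Algorithm 9 is sketched below (it is not the paper's code). The function coupled_move is a placeholder for any model-specific coupling of M_{t}(x,\cdot), such as Algorithm 14; the backward kernel is stored as a list of equally-weighted ancestor indices, and N is assumed even.

```python
# One step of Algorithm 9 (sketch). X_prev is an (N, d) array, W_prev the weights,
# G_t the potential function, resample any unbiased resampling scheme (e.g. the
# Adjacent Resampler of Algorithm 10) and coupled_move a user-supplied coupling.
import numpy as np

def itrc_step(X_prev, W_prev, G_t, coupled_move, resample, rng):
    N = len(W_prev)
    A = resample(W_prev, rng)
    X = np.empty_like(X_prev)
    B = [None] * N                               # B[n]: equally-weighted ancestor indices
    for k in range(N // 2):
        i, j = 2 * k, 2 * k + 1                  # the pair (2k-1, 2k) in 1-based notation
        X[i], X[j] = coupled_move(X_prev[A[i]], X_prev[A[j]], rng)
        if np.all(X[i] == X[j]):                 # the coupling succeeded
            B[i] = B[j] = [A[i], A[j]]           # mass 1/2 on each ancestor
        else:
            B[i], B[j] = [A[i]], [A[j]]
    w = np.array([G_t(x) for x in X])            # reweight
    ell = w.mean()
    return X, w / (N * ell), B
```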
Input: Particles Xt11:NX_{t-1}^{1:N}, weights Wt11:NW_{t-1}^{1:N}
Sort the particles Xt11:NX_{t-1}^{1:N} using the Hilbert curve. Let s[s1sN]s\leftarrow[s_{1}\ldots s_{N}] be the corresponding indices
Resample from {1,,N}\left\{1,\ldots,N\right\} with weights Wt11:NW_{t-1}^{1:N} using systematic resampling (Carpenter et al.,, 1999; Gerber et al.,, 2019), then let f:{1,,N}f:\left\{1,\ldots,N\right\}\to\mathbb{Z} be the function defined by f(i)f(i) being the number of times the index sis_{i} was resampled. Obviously i=1Nf(i)=N\sum_{i=1}^{N}f(i)=N
Initialise i\leftarrow\min\left\{\ell\mid f(\ell)>0\right\} (the smallest sorted position with a positive count)
for n1n\leftarrow 1 to NN do
       Set AtnsiA_{t}^{n}\leftarrow s_{i}
       Update f(i)f(i)1f(i)\leftarrow f(i)-1
       Let \Omega_{1} be the set \left\{\min\left\{\ell>i\mid f(\ell)>0\right\}\right\} (which has one element if the minimum is well-defined and is empty otherwise)
       Let \Omega_{2} be the set \left\{\max\left\{\ell<i\mid f(\ell)>0\right\}\right\} (which has one element if the maximum is well-defined and is empty otherwise)
       If Ω1Ω2\Omega_{1}\cup\Omega_{2} is not empty, update iargmaxf|Ω1Ω2i\leftarrow\operatorname{argmax}f|_{\Omega_{1}\cup\Omega_{2}} (see section A.5 for the restriction notation). If there is more than one argmax, pick one randomly
      
Output: Resampled indices At1:NA_{t}^{1:N}
Algorithm 10 The Adjacent Resampler

Supplement D Additional information on numerical experiments

D.1. Offline smoothing in linear Gaussian models

In this section, we study offline smoothing for the linear Gaussian model specified in Section 5.1. Since offline processing requires storing the particles of all time steps t in memory, we use T=500 here instead of T=3000. Apart from that, the algorithmic and benchmark settings remain the same.

Figure 9 plots the squared interquartile range of the estimators of \mathbb{Q}_{T}(\varphi_{t}) with respect to t, for the different algorithms. For small t, the function \varphi_{t} only looks at states close to time 0, whereas for larger t, recent states, which are less affected by degeneracy, are also taken into account. In all cases though, we see that the MCMC and rejection-based smoothers have superior performance.

Figure 9. Squared interquartile range of the estimators of \mathbb{Q}_{T}(\varphi_{t}) with respect to t, for different algorithms applied to the model of Section D.1. See Section 5.1 for the meaning of the acronyms in the legend.

Figure 10 shows box plots of the average execution times (per particle and per time step) based on 150 runs. The observations are comparable to those in Section 5.1. We see a performance difference between the rejection-based smoothers using the bootstrap and the guided filters. Both have an execution time that is much more variable than that of the hybrid rejection algorithms. The latter still require around 10 times more CPU load than the MCMC smoothers, for essentially the same precision.

Figure 10. Box plots of the number of transition density evaluations divided by NT for different algorithms in the offline linear Gaussian model of Section D.1.

We now take a closer look at the reason behind the performance difference between the bootstrap filter and the guided one when pure rejection sampling is used. Figure 11 shows the effective sample size (ESS) of both filters as a function of time. We can see that there is an outlier in the data around time t=40. Figure 12 box-plots the execution times divided by N at t=40 for the pure rejection sampling algorithm, whereas Figures 13 and 14 do the same for t=38 and t=42. The root of the problem is now clear: at most times t there is very little difference between the execution times of the bootstrap and the guided filters. However, when an outlier is present in the data, the guided filter suddenly requires a very large number of transition density evaluations in the rejection sampler. This gives yet another reason to avoid pure rejection sampling.

Figure 11. Evolution of the ESS for the linear Gaussian model of Section D.1.
Figure 12. Box plots of the average execution time per particle for the pure rejection algorithm at time t=40. Figure produced based on 150 independent runs of the model described in Section D.1.
Figure 13. Same as Figure 12, but for t=38.
Figure 14. Same as Figure 12, but for t=42.

D.2. Lotka-Volterra SDE

D.2.1. Coupling of Euler discretisations

Consider the SDE

(22) dXt=b(Xt)dt+σ(Xt)dWt\mathrm{d}X_{t}=b(X_{t})\mathrm{d}t+\sigma(X_{t})\mathrm{d}W_{t}

and two starting points X_{0}^{\mathrm{A}} and X_{0}^{\mathrm{B}} in \mathbb{R}^{d}. We wish to simulate X_{1}^{\mathrm{A}} and X_{1}^{\mathrm{B}} such that the transitions from X_{0}^{\mathrm{A}} to X_{1}^{\mathrm{A}} and from X_{0}^{\mathrm{B}} to X_{1}^{\mathrm{B}} both follow the Euler-discretised version of the equation, but X_{1}^{\mathrm{A}} and X_{1}^{\mathrm{B}} are correlated in a way that increases, as much as possible, the probability that they are equal. Algorithm 11 makes it clear that it all boils down to the coupling of two Gaussian distributions.

Input: Functions b:\mathbb{R}^{d}\to\mathbb{R}^{d} and \sigma:\mathbb{R}^{d}\to\mathbb{R}^{d\times d}, two starting points X_{0}^{\mathrm{A}} and X_{0}^{\mathrm{B}} at time 0, number of discretisation steps N_{\mathrm{dist}}
Initialise XAX0AX^{\mathrm{A}}\leftarrow X_{0}^{\mathrm{A}}
Initialise XBX0BX^{\mathrm{B}}\leftarrow X_{0}^{\mathrm{B}}
Set δ1/Ndist\delta\leftarrow 1/N_{\mathrm{dist}}
for i1i\leftarrow 1 to NdistN_{\mathrm{dist}} do
       Simulate (X~A,X~B)(\tilde{X}^{\mathrm{A}},\tilde{X}^{\mathrm{B}}) from a coupling of
𝒩(XA+δb(XA),δσ(XA)σ(XA))\mathcal{N}(X^{\mathrm{A}}+\delta b(X^{\mathrm{A}}),\delta\sigma(X^{\mathrm{A}})\sigma(X^{\mathrm{A}})^{\top})
and
𝒩(XB+δb(XB),δσ(XB)σ(XB)),\mathcal{N}(X^{\mathrm{B}}+\delta b(X^{\mathrm{B}}),\delta\sigma(X^{\mathrm{B}})\sigma(X^{\mathrm{B}})^{\top}),
such as Algorithm 14
       Update (XA,XB)(X~A,X~B)(X^{\mathrm{A}},X^{\mathrm{B}})\leftarrow(\tilde{X}^{\mathrm{A}},\tilde{X}^{\mathrm{B}})
      
Set (X1A,X1B)(XA,XB)(X_{1}^{\mathrm{A}},X_{1}^{\mathrm{B}})\leftarrow(X^{\mathrm{A}},X^{\mathrm{B}})
Output: Two endpoints X1AX_{1}^{\mathrm{A}} and X1BX_{1}^{\mathrm{B}} at time 11, obtained by passing X0AX_{0}^{\mathrm{A}} and X0BX_{0}^{\mathrm{B}} in a correlated manner through a discretised version of (22)
Algorithm 11 Coupling of two Euler discretisations

Lindvall and Rogers, (1986) propose the following construction: if two diffusions XtAX_{t}^{\mathrm{A}} and XtBX_{t}^{\mathrm{B}} both follow the dynamics of (22), that is,

dXtA\displaystyle\mathrm{d}X_{t}^{\mathrm{A}} =b(XtA)dt+σ(XtA)dWtA\displaystyle=b(X_{t}^{\mathrm{A}})\mathrm{d}t+\sigma(X_{t}^{\mathrm{A}})\mathrm{d}W_{t}^{\mathrm{A}}
dXtB\displaystyle\mathrm{d}X_{t}^{\mathrm{B}} =b(XtB)dt+σ(XtB)dWtB\displaystyle=b(X_{t}^{\mathrm{B}})\mathrm{d}t+\sigma(X_{t}^{\mathrm{B}})\mathrm{d}W_{t}^{\mathrm{B}}

and the two Brownian motions are correlated via

(23) dWtB=[Id2u(XA,XB)u(XA,XB)]dWtA\mathrm{d}W_{t}^{\mathrm{B}}=[\operatorname{Id}-2u(X^{\mathrm{A}},X^{\mathrm{B}})u(X^{\mathrm{A}},X^{\mathrm{B}})^{\top}]\mathrm{d}W_{t}^{\mathrm{A}}

where Id\operatorname{Id} is the identity matrix and the vector uu is defined by

u(x,x)=σ(x)1(xx)σ(x)1(xx)2,u(x,x^{\prime})=\frac{\sigma(x^{\prime})^{-1}(x-x^{\prime})}{\left\lVert\sigma(x^{\prime})^{-1}(x-x^{\prime})\right\rVert_{2}},

then under some regularity conditions, the two diffusions meet almost surely. (Note two special features of (23): it is valid because the term in the square bracket is an orthogonal matrix; and it ceases to be well-defined once the two trajectories have met.) Simulating the meeting time τ\tau turns out to be very challenging. The Euler discretisation (Algorithm 11 + Algorithm 12) has a fixed step size δ\delta, and there is zero probability that τ\tau is of the form kδk\delta for some integer kk. Since the coupling transform is deterministic, the two Euler-simulated trajectories will never meet. Figure 15 depicts this difficulty in the special case of two Brownian motions in dimension 11 (i.e. b(x)0b(x)\equiv 0 and σ1\sigma\equiv 1). Under this setting, (23) means that the two Brownian increments are symmetric with respect to the midpoint of the segment connecting their initial states. Note that the two dashed lines do cross at two points, but using them as meeting points is invalid: since they are not part of the discretisation but the result of some heuristic “linear interpolation”, it would change the distribution of the trajectories.

Input: Two vectors μA,μB\mu^{\mathrm{A}},\mu^{\mathrm{B}} in d\mathbb{R}^{d} and two d×dd\times d matrices σA\sigma^{\mathrm{A}} and σB\sigma^{\mathrm{B}}
Calculate u(σB)1(μAμB)u\leftarrow(\sigma^{\mathrm{B}})^{-1}(\mu^{\mathrm{A}}-\mu^{\mathrm{B}})
Normalise uu/u2u\leftarrow u/\left\lVert u\right\rVert_{2}
Simulate WA𝒩(0,Id)W^{\mathrm{A}}\sim\mathcal{N}(0,\operatorname{Id})
Set WB(Id2uu)WAW^{\mathrm{B}}\leftarrow(\operatorname{Id}-2uu^{\top})W^{\mathrm{A}}
Set XAμA+σAWAX^{\mathrm{A}}\leftarrow\mu^{\mathrm{A}}+\sigma^{\mathrm{A}}W^{\mathrm{A}}
Set XBμB+σBWBX^{\mathrm{B}}\leftarrow\mu^{\mathrm{B}}+\sigma^{\mathrm{B}}W^{\mathrm{B}}
Output: Two correlated points XAX^{\mathrm{A}} and XBX^{\mathrm{B}} marginally distributed according to 𝒩(μA,σA(σA))\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top}) and 𝒩(μB,σB(σB))\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top}) respectively
Algorithm 12 Lindvall-Rogers coupling of two Gaussian distributions
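A direct numpy transcription of Algorithm 12 reads as follows (a sketch; it assumes \mu^{\mathrm{A}}\neq\mu^{\mathrm{B}}, otherwise the reflection vector u is undefined).

```python
# Lindvall-Rogers (reflection) coupling of N(mu_a, sigma_a sigma_a^T) and
# N(mu_b, sigma_b sigma_b^T); Algorithm 12 written with numpy.
import numpy as np

def lindvall_rogers(mu_a, mu_b, sigma_a, sigma_b, rng):
    u = np.linalg.solve(sigma_b, mu_a - mu_b)    # (sigma^B)^{-1} (mu^A - mu^B)
    u = u / np.linalg.norm(u)
    w_a = rng.standard_normal(len(mu_a))
    w_b = w_a - 2.0 * u * (u @ w_a)              # (Id - 2 u u^T) W^A
    return mu_a + sigma_a @ w_a, mu_b + sigma_b @ w_b
```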
Figure 15. Coupling of two Brownian motions in \mathbb{R} starting from 0 and 4 respectively. The true Lindvall-Rogers coupling (23) is represented by the continuous grey lines. The discretised simulation (Algorithm 11 + Algorithm 12) is shown by the dashed lines. The discretised trajectories not only miss the true meeting point \tau but also never meet afterwards (see text).

We therefore need some coupling that has a non-zero meeting probability at each δ\delta-step. This can be achieved by the rejection maximal coupling (Algorithm 13, see also, e.g. Roberts and Rosenthal,, 2004) as well as the recently proposed coupled rejection sampler (Corenflos and Särkkä,, 2022). However, they all make use of rejection sampling in one way or another, which renders the execution time random. We wish to avoid this if possible. The reflection-maximal coupling (Bou-Rabee et al.,, 2020; Jacob et al.,, 2020) has deterministic cost and optimal meeting probability, but is only applicable for two Gaussian distributions of the same covariance matrix, which is not our case.

Input: Two probability distributions fAf^{\mathrm{A}} and fBf^{\mathrm{B}}
Simulate XAfAX^{\mathrm{A}}\sim f^{\mathrm{A}}
Simulate UAUniform[0,fA(XA)]U^{\mathrm{A}}\sim\operatorname{Uniform}[0,f^{\mathrm{A}}(X^{\mathrm{A}})]
if UAfB(XA)U^{\mathrm{A}}\leq f^{\mathrm{B}}(X^{\mathrm{A}}) then
       Set XBXAX^{\mathrm{B}}\leftarrow X^{\mathrm{A}}
      
else
       repeat
             Simulate XBfBX^{\mathrm{B}}\sim f^{\mathrm{B}}
             Simulate UBUniform[0,fB(XB)]U^{\mathrm{B}}\sim\operatorname{Uniform}[0,f^{\mathrm{B}}(X^{\mathrm{B}})]
            
      until UB>fA(XB)U^{\mathrm{B}}>f^{\mathrm{A}}(X^{\mathrm{B}})
Output: Two maximally-coupled realisations XAX^{\mathrm{A}} and XBX^{\mathrm{B}}, marginally fAf^{\mathrm{A}}-distributed and fBf^{\mathrm{B}}-distributed respectively
Algorithm 13 Rejection maximal coupler for two distributions
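The sketch below implements Algorithm 13 for two multivariate Gaussians; scipy is only used for the densities and samplers, and any pair of distributions with tractable densities would do.

```python
# Rejection maximal coupling (Algorithm 13) of two distributions given as
# scipy "frozen" distributions with .rvs and .pdf methods.
import numpy as np
from scipy.stats import multivariate_normal

def rejection_maximal_coupling(dist_a, dist_b, rng):
    x_a = dist_a.rvs(random_state=rng)
    if rng.uniform(0.0, dist_a.pdf(x_a)) <= dist_b.pdf(x_a):
        return x_a, x_a                          # the two outputs are equal
    while True:
        x_b = dist_b.rvs(random_state=rng)
        if rng.uniform(0.0, dist_b.pdf(x_b)) > dist_a.pdf(x_b):
            return x_a, x_b

rng = np.random.default_rng(0)
f_a = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
f_b = multivariate_normal(mean=[1.0, 0.0], cov=2.0 * np.eye(2))
print(rejection_maximal_coupling(f_a, f_b, rng))
```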

As suggested by Figure 15, the discretised Lindvall-Rogers coupling (Algorithm 12) is actually very good at bringing together two faraway trajectories. It is only when the trajectories start getting close that it falls short. At that point, the two distributions corresponding to the next \delta-step have a non-negligible overlap and would preferably be coupled in the style of Algorithm 13. We propose a modified coupling scheme that acts like Algorithm 12 when the two trajectories are far apart and behaves like Algorithm 13 otherwise.

The idea is to first generate a uniform draw in the “overlapping zone” of the two distributions (if they are close enough to make that easy). Next, we perform Algorithm 12; each of the two simulated points that falls in the overlapping zone is then replaced by the aforementioned preliminary draw (when that draw is available). The precise mathematical formulation is given in Algorithm 14 and the proof in Supplement E.12.

Input: Two vectors μA\mu^{\mathrm{A}} and μB\mu^{\mathrm{B}} in d\mathbb{R}^{d}, two d×dd\times d matrices σA\sigma^{\mathrm{A}} and σB\sigma^{\mathrm{B}}
Let fAf^{\mathrm{A}} and fBf^{\mathrm{B}} be respectively the probability densities of 𝒩(μA,σA(σA))\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top}) and 𝒩(μB,σB(σB))\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top})
Simulate XAX^{\mathrm{A}} and XBX^{\mathrm{B}} from Algorithm 12
Simulate UUniform[0,1]U\sim\operatorname{Uniform}[0,1]
Set UAUfA(XA)U^{\mathrm{A}}\leftarrow Uf^{\mathrm{A}}(X^{\mathrm{A}}) and UBUfB(XB)U^{\mathrm{B}}\leftarrow Uf^{\mathrm{B}}(X^{\mathrm{B}})
Simulate YfAY\sim f^{\mathrm{A}} and VUniform[0,fA(Y)]V\sim\operatorname{Uniform}[0,f^{\mathrm{A}}(Y)]
if VfB(Y)V\leq f^{\mathrm{B}}(Y) then
       if UAfB(XA)U^{\mathrm{A}}\leq f^{\mathrm{B}}(X^{\mathrm{A}}) then update (XA,UA)(Y,V)(X^{\mathrm{A}},U^{\mathrm{A}})\leftarrow(Y,V)
       if UBfA(XB)U^{\mathrm{B}}\leq f^{\mathrm{A}}(X^{\mathrm{B}}) then update (XB,UB)(Y,V)(X^{\mathrm{B}},U^{\mathrm{B}})\leftarrow(Y,V)
      
Output: Two correlated random vectors XAX^{\mathrm{A}} and XBX^{\mathrm{B}}, distributed marginally according to 𝒩(μA,σA(σA))\mathcal{N}(\mu^{\mathrm{A}},\sigma^{\mathrm{A}}(\sigma^{\mathrm{A}})^{\top}) and 𝒩(μB,σB(σB))\mathcal{N}(\mu^{\mathrm{B}},\sigma^{\mathrm{B}}(\sigma^{\mathrm{B}})^{\top})
Algorithm 14 Modified Lindvall-Rogers (MLR) coupler of two Gaussian distributions
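A numpy/scipy sketch of Algorithm 14 is given below (not the paper's implementation); \sigma^{\mathrm{A}} and \sigma^{\mathrm{B}} are the matrix square roots appearing in the pseudocode, and \mu^{\mathrm{A}}\neq\mu^{\mathrm{B}} is assumed so that the reflection step is well defined.

```python
# Modified Lindvall-Rogers (MLR) coupler of two Gaussian distributions, a
# sketch of Algorithm 14 combining reflection with an "overlapping zone" draw.
import numpy as np
from scipy.stats import multivariate_normal

def mlr_coupling(mu_a, mu_b, sigma_a, sigma_b, rng):
    f_a = multivariate_normal(mean=mu_a, cov=sigma_a @ sigma_a.T)
    f_b = multivariate_normal(mean=mu_b, cov=sigma_b @ sigma_b.T)
    # Step 1: Lindvall-Rogers reflection coupling (Algorithm 12).
    u = np.linalg.solve(sigma_b, mu_a - mu_b)
    u = u / np.linalg.norm(u)
    w_a = rng.standard_normal(len(mu_a))
    w_b = w_a - 2.0 * u * (u @ w_a)
    x_a, x_b = mu_a + sigma_a @ w_a, mu_b + sigma_b @ w_b
    # Step 2: common uniform level under each simulated point.
    u01 = rng.uniform()
    u_a, u_b = u01 * f_a.pdf(x_a), u01 * f_b.pdf(x_b)
    # Step 3: a draw (Y, V) under the graph of f_A; if it also lies under f_B
    # (the overlapping zone), use it to replace points that fall in that zone.
    y = f_a.rvs(random_state=rng)
    v = rng.uniform(0.0, f_a.pdf(y))
    if v <= f_b.pdf(y):
        if u_a <= f_b.pdf(x_a):
            x_a = y
        if u_b <= f_a.pdf(x_b):
            x_b = y
    return x_a, x_b
```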

Algorithm 14 has a deterministic execution time, but it does not attain the optimal coupling rate. Yet, as δ0\delta\to 0, we see empirically that it still recovers the oracle coupling time defined by (23) (although we did not try to prove this formally). In Figure 16, we couple two standard Brownian motions starting from a=0a=0 and b=1.5b=1.5 using Algorithm 14 with different values of δ\delta. It is known, by a simple application of the reflection principle (Lévy,, 1940; see also Chapter 2.2 of Mörters and Peres,, 2010), that the reflection coupling (23) succeeds after a Levy(0,(ba)2/4)\operatorname{Levy}(0,(b-a)^{2}/4)-distributed time. We therefore have to deal with a heavy-tailed distribution and restrict ourselves to the interval [0,5][0,5]. We see that the law of the meeting time is stable and convergent as δ0\delta\to 0. Thus, at least empirically, Algorithm 14 does not suffer from the instability problem as δ0\delta\to 0, contrary to a naive path space augmentation approach (see Yonekura and Beskos,, 2022 for a discussion).

Figure 16. Densities of the meeting times restricted to [0,5] for two Brownian motions started from 0 and 1.5. The curves are drawn using 20\,000 simulations from either a Levy distribution (for the “True distribution” curve) or Algorithm 14 (for the MLR ones). The boundary effect of kernel density estimators causes spills beyond 0 and 5.

D.2.2. Supplementary figures

Figure 17 plots a realisation of the states and data with parameters given in Subsection 5.2, for a relatively small scale dataset (T=50T=50). While the periodic trait seen in classical deterministic Lotka-Volterra equations is still visible (with a period of around 2020), it is clear that here random perturbations have added considerable chaos to the system. Figures 18 and 19 show respectively the performances of the naive genealogy tracking smoother and ours (Algorithm 9) on the dataset of Figure 17. Our smoother has successfully prevented the degeneracy phenomenon, particularly for times close to 0. Figure 20 shows, in two different ways, the properties of effective sample sizes (ESS) in the T=3000T=3000 scenario (see Section 5.2).

Figure 17. A realisation of the Lotka-Volterra SDE with parameters described in Section 5.2. The stationary point of the system is [100,100].
Figure 18. Smoothing trajectories for the dataset of Figure 17 using the naive genealogy tracking smoother (B_{t}^{N,\mathrm{GT}} kernels) with systematic resampling (see Section C.1). We took N=100 and randomly plotted 30 smoothing trajectories.
Figure 19. Same as Figure 18, but smoothing was done using Algorithm 9 instead.
Figure 20. Effective sample size (ESS) for the Lotka-Volterra SDE model with T=3000 (Section 5.2). The left pane draws the box plot of the collection of all estimated ESS for t=0,\ldots,3000. The right pane plots the evolution of the ESS with time. The quantity changes so chaotically that the curve only plots one value every 20 time steps for readability.

Supplement E Proofs

E.1. Proof of Theorem  2.1 (general convergence theorem)

In line with (7), we define the distribution tN(dx0:t)\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t}) for t<Tt<T as the x0:tx_{0:t} marginal of the joint distribution

(24) ¯tN(dx0:t,di0:t):=(Wt1:N)(dit)[s=t1BsN(is,dis1)][s=t0δXsis(dxs)].\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t},\mathrm{d}i_{0:t}):=\mathcal{M}(W_{t}^{1:N})(\mathrm{d}i_{t})\left[\prod_{s=t}^{1}B_{s}^{N}(i_{s},\mathrm{d}i_{s-1})\right]\left[\prod_{s=t}^{0}\delta_{X_{s}^{i_{s}}}(\mathrm{d}x_{s})\right].

The proof builds on an inductive argument which links \mathbb{Q}_{t}^{N} with \mathbb{Q}_{t-1}^{N} through the new innovations at time t. More precisely, we have the following fundamental proposition, where \mathcal{F}_{t}^{+} is defined as the smallest \sigma-algebra containing \mathcal{F}_{t} and \hat{B}_{1:t}^{N}.

Proposition 5.

tN\mathbb{Q}_{t}^{N} is a mixture distribution that admits the representation

(25) tN(dx0:t)=(tN)1N1nGt(xt)KtN(n,dx0:t)\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t})=(\ell_{t}^{N})^{-1}N^{-1}\sum_{n}G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})

where tN\ell_{t}^{N} is defined in Algorithm 1 and KtN(n,dx0:t)K_{t}^{N}(n,\mathrm{d}x_{0:t}) is a certain probability measure satisfying

(26) 𝔼[KtN(n,dx0:t)|t1+]=t1N(dx0:t1)Mt(xt1,dxt).\mathbb{E}\left[\left.{K_{t}^{N}(n,\mathrm{d}x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:t-1})M_{t}(x_{t-1},\mathrm{d}x_{t}).

In other words, for any (possibly random) function φtN:𝒳0××𝒳t\varphi_{t}^{N}:\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{t}\to\mathbb{R} such that φtN(x0:t)\varphi_{t}^{N}(x_{0:t}) is t1+\mathcal{F}_{t-1}^{+}-measurable, we have

𝔼[KtN(n,dx0:t)φtN(x0:t)|t1+]=t1N(dx0:t1)Mt(xt1,dxt)φtN(x0:t).\mathbb{E}\left[\left.{\int K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\int\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:t-1})M_{t}(x_{t-1},\mathrm{d}x_{t})\varphi_{t}^{N}(x_{0:t}).

Moreover, KtN(n,dx0:t)φtN(x0:t)\int K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t}), for n=1,,Nn=1,\ldots,N are i.i.d. given t1+\mathcal{F}_{t-1}^{+}.

The proof is postponed until the end of this subsection. This proposition gives the expression (25) for tN\mathbb{Q}_{t}^{N}, which is easier to manipulate than (24) and which highlights, through (26), its connection to t1N\mathbb{Q}_{t-1}^{N}. To further simplify the notations, let us define, following Douc et al., (2011), the kernel Lt1:t2L_{t_{1}:t_{2}}, for t1t2t_{1}\leq t_{2}, as

(27) Lt1:t2(x0:t1,dx0:t2):=δx0:t1(dx0:t1)s=t1+1t2Ms(xs1,dxs)Gs(xs).L_{t_{1}:t_{2}}(x^{\star}_{0:t_{1}},\mathrm{d}x_{0:t_{2}}):=\delta_{x^{\star}_{0:t_{1}}}(\mathrm{d}x_{0:t_{1}})\prod_{s=t_{1}+1}^{t_{2}}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

In other words, for real-valued functions φt2=φt2(x0,,xt2)\varphi_{t_{2}}=\varphi_{t_{2}}(x_{0},\ldots,x_{t_{2}}), we have

Lt1:t2(x0:t1,φt2)=φt2(x0,,xt1,xt1+1,,xt2)s=t1+1t2Ms(xs1,dxs)Gs(xs).L_{t_{1}:t_{2}}(x^{\star}_{0:t_{1}},\varphi_{t_{2}})=\int\varphi_{t_{2}}(x_{0}^{\star},\ldots,x_{t_{1}}^{\star},x_{t_{1}+1},\ldots,x_{t_{2}})\prod_{s=t_{1}+1}^{t_{2}}M_{s}(x_{s-1},\mathrm{d}x_{s})G_{s}(x_{s}).

The usefulness of these kernels will come from the simple remark t2t1Lt1:t2\mathbb{Q}_{t_{2}}\propto\mathbb{Q}_{t_{1}}L_{t_{1}:t_{2}}. We also see that

Lt1:t2φt2φt2s=t1+1t2Gs,\left\lVert L_{t_{1}:t_{2}}\varphi_{t_{2}}\right\rVert_{\infty}\leq\left\lVert\varphi_{t_{2}}\right\rVert_{\infty}\prod_{s=t_{1}+1}^{t_{2}}\left\lVert G_{s}\right\rVert_{\infty},

which gives \left\lVert L_{t_{1}:t_{2}}\right\rVert_{\infty}<\infty, where the norm of a kernel is defined in Supplement A.5. We are now in a position to state an importance sampling-like representation of \mathbb{Q}_{t}^{N}.

Corollary 3.

Let φtN:𝒳0××𝒳t\varphi_{t}^{N}:\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{t}\to\mathbb{R} be a (possibly random) function such that φtN(x0:t)\varphi_{t}^{N}(x_{0:t}) is t1+\mathcal{F}_{t-1}^{+}-measurable. Suppose that φtN\varphi_{t}^{N} is either uniformly non-negative (i.e. φtN(x0:t)0\varphi_{t}^{N}(x_{0:t})\geq 0 almost surely) or uniformly bounded (i.e. there exists a deterministic CC such that |φtN(x0:t)|C|\varphi_{t}^{N}(x_{0:t})|\leq C almost surely). Then

tNφtN=N1nK~tN(n,φtN)N1nK~tN(n,𝟙),\mathbb{Q}_{t}^{N}\varphi_{t}^{N}=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})},

where K~tN(n,)\tilde{K}_{t}^{N}(n,\cdot) is a certain random kernel such that

  • 𝔼[K~tN(n,φtN)|t1+]=(t1NLt1:t)φtN\mathbb{E}\left[\left.{\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}\right|{\mathcal{F}_{t-1}^{+}}\right]=(\mathbb{Q}_{t-1}^{N}L_{{t-1}:t})\varphi_{t}^{N};

  • N1nK~tN(n,𝟙)=tNN^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})=\ell_{t}^{N};

  • (K~tN(n,φtN))n=1,,N\left(\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\right)_{n=1,\ldots,N} are i.i.d. given t1+\mathcal{F}_{t-1}^{+};

  • almost surely, |K~tN(n,φtN)|φtNGt\left|\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\right|\leq\left\lVert\varphi_{t}^{N}\right\rVert_{\infty}\left\lVert G_{t}\right\rVert_{\infty} if φtN\varphi_{t}^{N} is uniformly bounded and K~tN(n,φtN)0\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})\geq 0 if φtN\varphi_{t}^{N} is uniformly non-negative.

These statements are valid for t=0t=0 under the convention 1NL1:0=1L1:0=𝕄0\mathbb{Q}_{-1}^{N}L_{-1:0}=\mathbb{Q}_{-1}L_{-1:0}=\mathbb{M}_{0} and 1\mathcal{F}_{-1} being the trivial σ\sigma-algebra.

Proof.

Put K~tN(n,φtN):=Gt(xt)KtN(n,dx0:t)φtN(x0:t)\tilde{K}_{t}^{N}(n,\varphi_{t}^{N}):=\int G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t}) where KtNK_{t}^{N} is defined in Proposition 5. Then

tN(φtN)=N1nGt(xt)KtN(n,dx0:t)φtN(x0:t)tN=N1nK~tN(n,φtN)tN.\begin{split}\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})&=\frac{N^{-1}\sum_{n}\int G_{t}(x_{t})K_{t}^{N}(n,\mathrm{d}x_{0:t})\varphi_{t}^{N}(x_{0:t})}{\ell_{t}^{N}}\\ &=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{\ell_{t}^{N}}.\end{split}

Since tN\mathbb{Q}_{t}^{N} is a probability measure, applying this identity twice yields

tN(φtN)=tN(φtN)tN(𝟙)=N1nK~tN(n,φtN)N1nK~tN(n,𝟙).\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})=\frac{\mathbb{Q}_{t}^{N}(\varphi_{t}^{N})}{\mathbb{Q}_{t}^{N}(\mathbbm{1})}=\frac{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\varphi_{t}^{N})}{N^{-1}\sum_{n}\tilde{K}_{t}^{N}(n,\mathbbm{1})}.

The remaining points are simple consequences of the definition of K~tN\tilde{K}_{t}^{N} and Lt1:tL_{t-1:t}. ∎

The corollary hints at a natural induction proof for Theorem 2.1.

Proof of Theorem 2.1.

The following calculations are valid for all T0T\geq 0, under the convention defined at the end of Corollary 3. They will prove (8) for T=0T=0 and, at the same time, prove it for any T1T\geq 1 under the hypothesis that it already holds true for T1T-1. Let φT=φT(x0,,xT)\varphi_{T}=\varphi_{T}(x_{0},\ldots,x_{T}) be a real-valued function on 𝒳0××𝒳T\mathcal{X}_{0}\times\cdots\times\mathcal{X}_{T}. Write

(28) N(TNφTTφT)=N(N1nK~TN(n,φT)N1nK~TN(n,𝟙)T1LT1:TφTT1LT1:T𝟙)\sqrt{N}(\mathbb{Q}_{T}^{N}\varphi_{T}-\mathbb{Q}_{T}\varphi_{T})=\sqrt{N}\left(\frac{{N}^{-1}\sum_{n}\tilde{K}_{T}^{N}(n,\varphi_{T})}{{N}^{-1}\sum_{n}\tilde{K}_{T}^{N}(n,\mathbbm{1})}-\frac{\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T}}{\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1}}\right)

where the rewriting of \mathbb{Q}_{T}\varphi_{T} is a consequence of \mathbb{Q}_{T}\propto\mathbb{Q}_{T-1}L_{T-1:T}. We will bound this difference by Hoeffding's inequalities for ratios (see Supplement E.13 for notations, including the definition of sub-Gaussian variables that we shall use below). We have

  • that \sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\varphi_{T})-\mathbb{Q}_{T-1}^{N}L_{T-1:T}\varphi_{T}) is (1,\left\lVert\varphi_{T}\right\rVert_{\infty}\left\lVert G_{T}\right\rVert_{\infty})-sub-Gaussian conditioned on \mathcal{F}_{T-1}^{+} because of Theorem E.15 (and thus unconditionally, by the law of total expectation);

  • and that \sqrt{N}(\mathbb{Q}_{T-1}^{N}L_{T-1:T}\varphi_{T}-\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T}) is sub-Gaussian with parameters

    (CT1,ST1LT1:TφT)(C_{T-1},S_{T-1}\left\lVert L_{T-1:T}\right\rVert_{\infty}\left\lVert\varphi_{T}\right\rVert_{\infty})

    if T1T\geq 1 by induction hypothesis. The quantity is equal to 0 if T=0T=0.

This permits to apply Lemma E.16, which results in the sub-Gaussian properties of

  • the quantity N(N1K~TN(n,φT)T1LT1:TφT)\sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\varphi_{T})-\mathbb{Q}_{T-1}L_{T-1:T}\varphi_{T}), with parameters (1+CT1,ST1φT)(1+C_{T-1},S_{T-1}^{\prime}\left\lVert\varphi_{T}\right\rVert_{\infty}), for a certain constant ST1S_{T-1}^{\prime};

  • and the quantity N(N1K~TN(n,𝟙)T1LT1:T𝟙)\sqrt{N}({N}^{-1}\sum\tilde{K}_{T}^{N}(n,\mathbbm{1})-\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1}), which is a special case of the former one, with parameters (1+CT1,ST1)(1+C_{T-1},S_{T-1}^{\prime}).

Finally, we invoke Proposition 11 and deduce the sub-Gaussian property of (28) with parameters

(2+2CT1,2ST1φTT1LT1:T𝟙)\left(2+2C_{T-1},2\frac{S^{\prime}_{T-1}\left\lVert\varphi_{T}\right\rVert_{\infty}}{\mathbb{Q}_{T-1}L_{T-1:T}\mathbbm{1}}\right)

which finishes the proof. ∎

Proof of Proposition 5.

From (24), we have

tN(dx0:t)=it¯tN(dit)¯tN(dx0:t|it)=(tN)1N1itGt(Xtit)¯tN(dx0:t|it)=(tN)1N1itGt(xt)¯tN(dx0:t|it)\begin{split}\mathbb{Q}_{t}^{N}(\mathrm{d}x_{0:t})&=\sum_{i_{t}}\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}i_{t})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\\ &=(\ell_{t}^{N})^{-1}N^{-1}\sum_{i_{t}}G_{t}(X_{t}^{i_{t}})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\\ &=(\ell_{t}^{N})^{-1}N^{-1}\sum_{i_{t}}G_{t}(x_{t})\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})\end{split}

since ¯tN(dx0:t|it)\bar{\mathbb{Q}}_{t}^{N}(dx_{0:t}|i_{t}) has a δXtit(dxt)\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t}) term. In fact, the identity

¯tN(dx0:t,dit1|it)=δXtit(dxt)BtN(it,dit1)¯t1N(dx0:t1|it1)\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t},\mathrm{d}i_{t-1}|i_{t})=\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})

follows directly from the backward recursive nature of Algorithm 2, and thus

(29) ¯tN(dx0:t|it)=δXtit(dxt)it1BtN(it,dit1)¯t1N(dx0:t1|it1).\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t})=\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})\int_{i_{t-1}}B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1}).

The \bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1}) term is \mathcal{F}_{t-1}^{+}-measurable. We shall calculate the expectation of \delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1}) given \mathcal{F}_{t-1}^{+}. The following arguments are needed for formal verification, but the result (30) is natural in light of the ancestor regeneration intuition explained in Section 2.4.

Let ftN:{1,,N}×𝒳tf_{t}^{N}:\left\{1,\ldots,N\right\}\times\mathcal{X}_{t}\to\mathbb{R} be a (possibly random) function such that ftN(it1,xt)f_{t}^{N}(i_{t-1},x_{t}) is t1+\mathcal{F}_{t-1}^{+}-measurable. Let JtitJ_{t}^{i_{t}} be a random variable such that given t1+\mathcal{F}_{t-1}^{+}, XtitX_{t}^{i_{t}} and B^tN(it,)\hat{B}_{t}^{N}(i_{t},\cdot), JtitJ_{t}^{i_{t}} is BtN(it,dit1)B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})-distributed. This automatically makes JtitJ_{t}^{i_{t}} satisfy the second hypothesis of Theorem 2.1. Additionally, by virtue of its first hypothesis, the distribution of (Jtit,Atit)(J_{t}^{i_{t}},A_{t}^{i_{t}}) is the same given either t1+\mathcal{F}_{t-1}^{+} or Xt11:NX_{t-1}^{1:N} (see also Figure 1). We can now write

𝔼[ftN(it1,xt)δXtit(dxt)BtN(it,dit1)|t1+]=𝔼[ftN(it1,Xtit)BtN(it,dit1)|t1+]=𝔼[𝔼[ftN(Jtit,Xtit)|t1+,Xtit,B^tN(it,)]|t1+]=𝔼[ftN(Jtit,Xtit)|t1+] by the law of total expectation=𝔼[ftN(Atit,Xtit)|t1+] by the second hypothesis of Theorem 2.1=ftN(it1,xt)(Wt11:N)(dit1)Mt(Xt1it1,dxt).\begin{split}&\mathbb{E}\left[\left.{\int f_{t}^{N}(i_{t-1},x_{t})\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{\int f_{t}^{N}(i_{t-1},X_{t}^{i_{t}})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{\mathbb{E}\left[\left.{f_{t}^{N}(J_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+},X_{t}^{i_{t}},\hat{B}_{t}^{N}(i_{t},\cdot)}\right]}\right|{\mathcal{F}_{t-1}^{+}}\right]\\ =&\mathbb{E}\left[\left.{f_{t}^{N}(J_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+}}\right]\text{ by the law of total expectation}\\ =&\mathbb{E}\left[\left.{f_{t}^{N}(A_{t}^{i_{t}},X_{t}^{i_{t}})}\right|{\mathcal{F}_{t-1}^{+}}\right]\text{ by the second hypothesis of Theorem~{}\ref{thm:convergence_mcmc}}\\ =&\int f_{t}^{N}(i_{t-1},x_{t})\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t}).\end{split}

This equality means that

(30) 𝔼[δXtit(dxt)BtN(it,dit1)|t1+]=(Wt11:N)(dit1)Mt(Xt1it1,dxt),\mathbb{E}\left[\left.{\delta_{X_{t}^{i_{t}}}(\mathrm{d}x_{t})B_{t}^{N}(i_{t},\mathrm{d}i_{t-1})}\right|{\mathcal{F}_{t-1}^{+}}\right]=\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t}),

Now, put

KN(it,dx0:t):=¯tN(dx0:t|it).K^{N}(i_{t},\mathrm{d}x_{0:t}):=\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}x_{0:t}|i_{t}).

From (29) and (30), we have

𝔼[KN(it,dx0:t)|t1+]=it1(Wt11:N)(dit1)Mt(Xt1it1,dxt)¯t1N(dx0:t1|it1)=Mt(xt1,dxt)it1(Wt11:N)(dit1)¯t1N(dx0:t1|it1)since ¯t1N(dx0:t1|it1) has a δXt1it1(dxt1) term=Mt(xt1,dxt)t1N(dx0:t1)\begin{split}\mathbb{E}\left[\left.{K^{N}(i_{t},\mathrm{d}x_{0:t})}\right|{\mathcal{F}_{t-1}^{+}}\right]&=\int_{i_{t-1}}\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})M_{t}(X_{t-1}^{i_{t-1}},\mathrm{d}x_{t})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\\ &=M_{t}(x_{t-1},\mathrm{d}x_{t})\int_{i_{t-1}}\mathcal{M}(W_{t-1}^{1:N})(\mathrm{d}i_{t-1})\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\\ &\text{since }\bar{\mathbb{Q}}_{t-1}^{N}(\mathrm{d}x_{0:t-1}|i_{t-1})\text{ has a }\delta_{X_{t-1}^{i_{t-1}}(\mathrm{d}x_{t-1})}\text{ term}\\ &=M_{t}(x_{t-1},\mathrm{d}x_{t})\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{0:{t-1}})\end{split}

which finishes the proof. ∎

E.2. Proof of Equation (11) (online smoothing recursion)

Proof.

Using (7) and the matrix notations, the distribution ¯tN(dis)\bar{\mathbb{Q}}_{t}^{N}(\mathrm{d}i_{s}) can be represented by the 1×N1\times N vector

q^s|tN:=[Wt1WtN]B^tNB^s+1N.\hat{q}_{s|t}^{N}:=[W_{t}^{1}\ldots W_{t}^{N}]\hat{B}_{t}^{N}\ldots\hat{B}_{s+1}^{N}.

Defining the N×NN\times N matrix ψ^sN\hat{\psi}_{s}^{N} as

ψ^sN[is1,is]:=ψs(Xs1is1,Xsis),\hat{\psi}_{s}^{N}[i_{s-1},i_{s}]:=\psi_{s}(X_{s-1}^{i_{s-1}},X_{s}^{i_{s}}),

we have

𝔼tN[ψs(Xs1,Xs)]\displaystyle\mathbb{E}_{\mathbb{Q}_{t}^{N}}[\psi_{s}(X_{s-1},X_{s})] =is,is1q^s|tN[1,is]B^sN[is,is1]ψ^sN[is1,is]\displaystyle=\sum_{i_{s},i_{s-1}}\hat{q}_{s|t}^{N}[1,i_{s}]\hat{B}_{s}^{N}[i_{s},i_{s-1}]\hat{\psi}_{s}^{N}[i_{s-1},i_{s}]
=isq^s|tN[1,is](B^sNψ^sN)[is,is]\displaystyle=\sum_{i_{s}}\hat{q}_{s|t}^{N}[1,i_{s}](\hat{B}_{s}^{N}\hat{\psi}_{s}^{N})[i_{s},i_{s}]
=q^s|tNdiag(B^sNψ^sN).\displaystyle=\hat{q}_{s|t}^{N}\operatorname{diag}(\hat{B}_{s}^{N}\hat{\psi}_{s}^{N}).

Therefore,

tNφt=s=0t[Wt1WtN]B^tNB^s+1Ndiag(B^sNψ^sN)\mathbb{Q}_{t}^{N}\varphi_{t}=\sum_{s=0}^{t}[W_{t}^{1}\ldots W_{t}^{N}]\hat{B}_{t}^{N}\ldots\hat{B}_{s+1}^{N}\operatorname{diag}(\hat{B}_{s}^{N}\hat{\psi}_{s}^{N})

from which follows the recursion

{tNφt=[Wt1WtN]S^tN,S^tN:=B^tNS^t1N+diag(B^tNψ^tN).\begin{cases}\mathbb{Q}_{t}^{N}\varphi_{t}&\ =[W_{t}^{1}\ldots W_{t}^{N}]\hat{S}_{t}^{N},\\ \hat{S}_{t}^{N}&:=\hat{B}_{t}^{N}\hat{S}_{t-1}^{N}+\operatorname{diag}(\hat{B}_{t}^{N}\hat{\psi}_{t}^{N}).\end{cases}

This is exactly (11). ∎
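As a sanity check of the recursion just derived (not part of the proof), the sketch below compares the online update of \hat{S}_{t}^{N} with a direct evaluation of the double sum, using random placeholder matrices in place of the \hat{B}_{t}^{N} and \hat{\psi}_{t}^{N} that an actual particle filter would produce.

```python
# Numerical check of recursion (11): S_t = B_t S_{t-1} + diag(B_t psi_t),
# with random row-stochastic placeholders for the backward matrices.
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4
B = [rng.dirichlet(np.ones(N), size=N) for _ in range(T + 1)]     # placeholder hat{B}_t
psi = [rng.normal(size=(N, N)) for _ in range(T + 1)]             # placeholder hat{psi}_t
W = rng.dirichlet(np.ones(N))                                     # weights W_T^{1:N}

# Online recursion (11).
S = np.diag(B[0] @ psi[0])
for t in range(1, T + 1):
    S = B[t] @ S + np.diag(B[t] @ psi[t])
online = W @ S

# Direct evaluation of the sum over s of W_T hat{B}_T ... hat{B}_{s+1} diag(hat{B}_s hat{psi}_s).
direct = 0.0
for s in range(T + 1):
    v = np.diag(B[s] @ psi[s])
    for r in range(s + 1, T + 1):
        v = B[r] @ v
    direct += W @ v
print(np.allclose(online, direct))   # True
```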

E.3. Proof of Theorem  2.2 (general stability theorem)

The following lemma describes the simultaneous backward construction of two trajectories 0:T1\mathcal{I}_{0:T}^{1} and 0:T2\mathcal{I}_{0:T}^{2} given T\mathcal{F}_{T}^{-}.

Lemma E.1.

We use the same notations as in Algorithms 1 and 2. Suppose that the hypotheses of Theorem 2.1 are satisfied. Then, given t:T1\mathcal{I}_{t:T}^{1}, t:T2\mathcal{I}_{t:T}^{2} and T\mathcal{F}_{T}^{-},

  • if t1t2\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}, the two variables t11\mathcal{I}_{t-1}^{1} and t12\mathcal{I}_{t-1}^{2} are conditionally independent and their marginal distributions are respectively BtN,FFBS(t1,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{1},\cdot) and BtN,FFBS(t2,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{2},\cdot);

  • if t1=t2\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}, under the aforementioned conditioning, the two variables t11\mathcal{I}_{t-1}^{1} and t12\mathcal{I}_{t-1}^{2} are both marginally distributed according to BtN,FFBS(t1,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{1},\cdot). Moreover, if (13) holds, we have

    (31) (t11t12|t:T1,2,T)𝟙t1=t2εS 1t1=t2.\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right|{\mathcal{I}_{t:T}^{1,2},\mathcal{F}_{T}^{-}}\right)\mathbbm{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}\geq\varepsilon_{\mathrm{S}}\ \mathbbm{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}.

In particular, the sequence of variables (Ts1,Ts2)s=0T(\mathcal{I}_{T-s}^{1},\mathcal{I}_{T-s}^{2})_{s=0}^{T} is a Markov chain given T\mathcal{F}_{T}^{-}.

Proof.

To simplify the notations, let \tilde{b}_{t}^{n} denote the \mathbb{R}^{N} vector \hat{B}_{t}^{N}(n,\cdot). The relations between the variables generated by Algorithm 2 are depicted as a graphical model in Figure 21. We consider

Figure 21. Directed graph representing the relations between variables generated in Algorithm 2. Only those necessary for the proof of Lemma E.1 are included.
(32) p(b~t1:N,it11:2|T,it:T1:2)=p(b~t1:N|T,it:T1:2)p(it11:2|b~t1:N,T,it:T1,2)=p(b~t1:N|xt11:N,xt1:N)p(it11:2|b~t1:N,it1:2)(by properties of graphical models, see Figure 21)=[np(b~tn|xt11:N,xtn)]b~tit1(it11)b~tit2(it12).\begin{split}p(\tilde{b}_{t}^{1:N},i_{t-1}^{1:2}|\mathcal{F}_{T}^{-},i_{t:T}^{1:2})&=p(\tilde{b}_{t}^{1:N}|\mathcal{F}_{T}^{-},i_{t:T}^{1:2})\ p(i_{t-1}^{1:2}|\tilde{b}_{t}^{1:N},\mathcal{F}_{T}^{-},i_{t:T}^{1,2})\\ &=p(\tilde{b}_{t}^{1:N}|x_{t-1}^{1:N},x_{t}^{1:N})\ p(i_{t-1}^{1:2}|\tilde{b}_{t}^{1:N},i_{t}^{1:2})\\ &\textrm{(by properties of graphical models, see Figure~{}\ref{fig:alg2:variables})}\\ &=\left[\prod_{n}p(\tilde{b}_{t}^{n}|x_{t-1}^{1:N},x_{t}^{n})\right]\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1})\tilde{b}_{t}^{i_{t}^{2}}(i_{t-1}^{2}).\end{split}

The distribution of it11i_{t-1}^{1} given T\mathcal{F}_{T}^{-} and it:T1:2i_{t:T}^{1:2} is thus the it11i_{t-1}^{1}-marginal of

(33) p(b~tit1|xt11:N,xtit1)b~tit1(it11),p(\tilde{b}_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1}),

which is exactly the distribution of p(jtit1|xt11:N,xtit1)p(j_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}}) where the JJ’s are defined in the statement of Theorem 2.1. By the second hypothesis of that theorem, the aforementioned distribution is equal to p(atit1|xt11:N,xtit1)p(a_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}}), which is in turn no other than BtN,FFBS(it1,)B_{t}^{N,\mathrm{FFBS}}(i_{t}^{1},\cdot). Moreover, if it1it2i_{t}^{1}\neq i_{t}^{2}, (32) straightforwardly implies the conditional independence of it11i_{t-1}^{1} and it12i_{t-1}^{2}. When it1=it2i_{t}^{1}=i_{t}^{2}, the distribution of it11:2i_{t-1}^{1:2} given T\mathcal{F}_{T}^{-} and it:T1:2i_{t:T}^{1:2} is the it11:2i_{t-1}^{1:2}-marginal of

p(b~tit1|xt11:N,xtit1)b~tit1(it11)b~tit1(it12).p(\tilde{b}_{t}^{i_{t}^{1}}|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{1})\tilde{b}_{t}^{i_{t}^{1}}(i_{t-1}^{2}).

Thus, we can apply (13) for n=it1n=i_{t}^{1}, where it11:2i_{t-1}^{1:2} here plays the role of Jt1:2J_{t}^{1:2} there. Equation (31) is now proved. ∎

As Lemma E.1 describes the distribution of two trajectories, it immediately gives the distribution of a single trajectory.

Corollary 4.

Under the same settings as in Lemma E.1, given T\mathcal{F}_{T}^{-}, the distribution of 0:T1\mathcal{I}_{0:T}^{1} is

(WT1:N)(diT)BTN,FFBS(iT,diT1)B1N,FFBS(i1,di0).\mathcal{M}(W_{T}^{1:N})(\mathrm{d}i_{T})B_{T}^{N,\mathrm{FFBS}}(i_{T},\mathrm{d}i_{T-1})\ldots B_{1}^{N,\mathrm{FFBS}}(i_{1},\mathrm{d}i_{0}).

Note that the corollary applies even if the backward kernel used in Algorithm 2 is not the FFBS one. This is due to the conditioning on T\mathcal{F}_{T}^{-} and the second hypothesis of Theorem 2.1.

Proof of Theorem 2.2.

First of all, we remark that as per Algorithm 2, using index variables 0:T1:N\mathcal{I}_{0:T}^{1:N} adds a level of Monte Carlo approximation to TN(dx0:T)\mathbb{Q}_{T}^{N}(\mathrm{d}x_{0:T}). Therefore

𝔼[(TN(φT)T(φT))2]\displaystyle\mathbb{E}\left[(\mathbb{Q}_{T}^{N}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T}))^{2}\right] =𝔼[(1Nn=1NφT(X00n,,XTTn)T(φT))2]\displaystyle=\mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})-\mathbb{Q}_{T}(\varphi_{T})\right)^{2}\right]
(34) =𝔼[(TN,FFBS(φT)T(φT))2]+\displaystyle=\mathbb{E}\left[(\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T}))^{2}\right]+
+𝔼[Var(1Nn=1NφT(X00n,,XTTn)|T)]\displaystyle\quad+\mathbb{E}\left[\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})}\right|{\mathcal{F}_{T}^{-}}\right)\right]

where the last equality is justified by the law of total expectation and Corollary 4. (Note that $(\mathcal{I}_{0:T}^{n})_{n=1}^{N}$ are identically distributed but not necessarily independent given $\mathcal{F}_{T}^{-}$.) Using Lemma E.3 (stated and proved below) and putting $\rho:=1-\bar{M}_{\ell}/\bar{M}_{h}$, we have

(35) Var(1Nn=1NφT(X00n,,XTTn)|T)\displaystyle\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\varphi_{T}(X_{0}^{\mathcal{I}_{0}^{n}},\ldots,X_{T}^{\mathcal{I}_{T}^{n}})}\right|{\mathcal{F}_{T}^{-}}\right)
=Var(1Nn=1Nt=0Tψt(Xt1t1n,Xttn)|T)\displaystyle=\mathrm{Var}\left(\left.{\frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}})}\right|{\mathcal{F}_{T}^{-}}\right)
=1N2n,mNs,tTCov(ψt(Xt1t1n,Xttn),ψs(Xs1s1m,Xssm)|T)\displaystyle=\frac{1}{N^{2}}\sum_{n,m\leq N}\sum_{s,t\leq T}\operatorname{Cov}\left(\left.\psi_{t}(X_{t-1}^{\mathcal{I}_{t-1}^{n}},X_{t}^{\mathcal{I}_{t}^{n}}),\psi_{s}(X_{s-1}^{\mathcal{I}_{s-1}^{m}},X_{s}^{\mathcal{I}_{s}^{m}})\right|\mathcal{F}_{T}^{-}\right)
2N2n,mNn=ms,tTψtψsρ|ts|1+\displaystyle\leq\frac{2}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n=m\end{subarray}}\sum_{s,t\leq T}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left|t-s\right|-1}+
+4N2n,mNnms,tTC~Nψtψsρ|ts|1\displaystyle\quad+\frac{4}{N^{2}}\sum_{\begin{subarray}{c}n,m\leq N\\ n\neq m\end{subarray}}\sum_{s,t\leq T}\frac{\tilde{C}}{N}\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left|t-s\right|-1}
=(s,tT2ψtψsρ|ts|1)(2C~+1)N2C~N2\displaystyle=\left(\sum_{s,t\leq T}2\left\lVert\psi_{t}\right\rVert_{\infty}\left\lVert\psi_{s}\right\rVert_{\infty}\rho^{\left|t-s\right|-1}\right)\frac{(2\tilde{C}+1)N-2\tilde{C}}{N^{2}}
[s,tT(ψt2+ψs2)ρ|ts|1]2C~+1N4(2C~+1)Nρ(1ρ)ψt2.\displaystyle\leq\left[\sum_{s,t\leq T}\left(\left\lVert\psi_{t}\right\rVert_{\infty}^{2}+\left\lVert\psi_{s}\right\rVert_{\infty}^{2}\right)\rho^{\left|t-s\right|-1}\right]\frac{2\tilde{C}+1}{N}\leq\frac{4(2\tilde{C}+1)}{N\rho(1-\rho)}\sum\left\lVert\psi_{t}\right\rVert_{\infty}^{2}.
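For completeness, the last bound in (35) can be spelled out: for every $t$, $\sum_{s\leq T}\rho^{|t-s|-1}\leq\rho^{-1}\left(1+2\sum_{k\geq 1}\rho^{k}\right)\leq 2/\{\rho(1-\rho)\}$, so that, by symmetry in $s$ and $t$,

\sum_{s,t\leq T}\left(\left\lVert\psi_{t}\right\rVert_{\infty}^{2}+\left\lVert\psi_{s}\right\rVert_{\infty}^{2}\right)\rho^{\left|t-s\right|-1}=2\sum_{t\leq T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}\sum_{s\leq T}\rho^{\left|t-s\right|-1}\leq\frac{4}{\rho(1-\rho)}\sum_{t\leq T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}.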

We now look at the first term of (34). In the fixed marginal smoothing case, for any $s\in\mathbb{Z}_{+}$ with $s\leq T$ and any bounded function $\phi_{s}:\mathcal{X}_{s}\to\mathbb{R}$, Douc et al., (2011) proved that

(|TN,FFBS(φT)T(φT)|ε)BeCNε2/ϕs2\mathbb{P}\left(\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|\geq\varepsilon\right)\leq B^{\prime}e^{-C^{\prime}N\varepsilon^{2}/\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}

for φT(x0:T)=ϕs(xs)\varphi_{T}(x_{0:T})=\phi_{s}(x_{s}) and constants BB^{\prime} and CC^{\prime} not depending on TT. Using 𝔼[Δ2]=0(Δ2t)dt\mathbb{E}[\Delta^{2}]=\int_{0}^{\infty}\mathbb{P}(\Delta^{2}\geq t)\mathrm{d}t, the inequality implies

(36) 𝔼[|TN,FFBS(φT)T(φT)|2]Bϕs2CN\mathbb{E}\left[\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|^{2}\right]\leq\frac{B^{\prime}\left\lVert\phi_{s}\right\rVert_{\infty}^{2}}{C^{\prime}N}

for φT(x0:T)=ϕs(xs)\varphi_{T}(x_{0:T})=\phi_{s}(x_{s}). In the additive smoothing case, Dubarry and Le Corff, (2013) proved that, for T2T\geq 2,

(37) 𝔼[|TN,FFBS(φT)T(φT)|2]CN(t=0Tψt2)(1+TN)2.\mathbb{E}\left[\left|\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\varphi_{T})-\mathbb{Q}_{T}(\varphi_{T})\right|^{2}\right]\leq\frac{C^{\prime}}{N}\left(\sum_{t=0}^{T}\left\lVert\psi_{t}\right\rVert_{\infty}^{2}\right)\left(1+\sqrt{\frac{T}{N}}\right)^{2}.

Plugging (35) and either (36) or (37) into (34) concludes the proof. ∎

The following lemma quantifies the backward mixing property induced by Assumption 2.

Lemma E.2.

Under the same setting as Theorem 2.2, we have

TV(BtN,FFBS(m,),BtN,FFBS(n,))1M¯M¯h\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(m,\cdot),B_{t}^{N,\mathrm{FFBS}}(n,\cdot)\right)\leq 1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}

for all m,n{1,,N}m,n\in\left\{1,\ldots,N\right\} and t{1,,T}t\in\left\{1,\ldots,T\right\}.

Proof.

We have

1TV(BtN,FFBS(m,),BtN,FFBS(n,))\displaystyle 1-\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(m,\cdot),B_{t}^{N,\mathrm{FFBS}}(n,\cdot)\right)
=[i=1Nmin(Gt1(Xt1i)mt(Xt1i,Xtm)j=1NGt1(Xt1j)mt(Xt1j,Xtm),\displaystyle=\left[\sum_{i=1}^{N}\min\left(\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{m})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{m})},\right.\right.
Gt1(Xt1i)mt(Xt1i,Xtn)j=1NGt1(Xt1j)mt(Xt1j,Xtn))] by Lemma A.1 (Supplement A.2)\displaystyle\qquad\left.\left.\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{n})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{n})}\right)\right]\text{ by Lemma~{}\ref{lem:properties_TV} (Supplement~{}\ref{apx:tv})}
\displaystyle\geq\left[\sum_{i=1}^{N}\frac{G_{t-1}(X_{t-1}^{i})\bar{M}_{\ell}}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})\bar{M}_{h}}\right]\text{ by Assumption 2}
\displaystyle=\bar{M}_{\ell}/\bar{M}_{h}. ∎
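As a numerical illustration of Lemma E.2, the following Python sketch (assuming NumPy; the Gaussian random-walk transition and the potential $G_{t-1}$ are illustrative placeholders, and the bounds are taken as the minimum and maximum of $m_{t}$ over the realised particle pairs, standing in for $\bar{M}_{\ell}$ and $\bar{M}_{h}$) builds two rows of the FFBS backward kernel and checks that their total variation distance does not exceed $1-\bar{M}_{\ell}/\bar{M}_{h}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x_prev = rng.normal(size=N)                    # particles X_{t-1}^{1:N} (illustrative)
G_prev = np.exp(-0.5 * x_prev ** 2)            # potential G_{t-1}(X_{t-1}^i) (illustrative)
x_m, x_n = 1.3, -0.7                           # two particles X_t^m and X_t^n

def m_t(x_prev, x_t, sigma=1.0):
    """Gaussian random-walk transition density m_t(x_{t-1}, x_t) (illustrative)."""
    return np.exp(-0.5 * (x_t - x_prev) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

def backward_row(x_t):
    """Row of B_t^{N,FFBS}: proportional to G_{t-1}(X_{t-1}^i) m_t(X_{t-1}^i, x_t)."""
    w = G_prev * m_t(x_prev, x_t)
    return w / w.sum()

b_m, b_n = backward_row(x_m), backward_row(x_n)
tv = 0.5 * np.abs(b_m - b_n).sum()
vals = np.concatenate([m_t(x_prev, x_m), m_t(x_prev, x_n)])
M_low, M_high = vals.min(), vals.max()         # stand-ins for the bounds of Assumption 2
print(tv, 1.0 - M_low / M_high)                # the first value should not exceed the second
```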

Lemma E.3.

Under the same settings as in Theorem 2.2, for any m,n{1,,N}m,n\in\left\{1,\ldots,N\right\} and s,s{0,,T}s,s^{\prime}\in\left\{0,\ldots,T\right\}, we have

(38) Cov(ψs(Xs1s1m,Xssm),ψs(Xs1s1n,Xssn)|T)2(1M¯M¯h)|ss|1ψsψs×{2C~/N if mn1 if m=n\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}_{s-1}^{m}},X_{s}^{\mathcal{I}_{s}^{m}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}_{s^{\prime}-1}^{n}},X_{s^{\prime}}^{\mathcal{I}_{s^{\prime}}^{n}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq\quad 2\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{\left|s-s^{\prime}\right|-1}\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\times\begin{cases}2\tilde{C}/N&\text{ if }m\neq n\\ 1&\text{ if }m=n\end{cases}

where C~=C~(M¯,M¯h,G¯,G¯h,εS)\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}}) is a constant that does not depend on TT (and which arises in the formulation of Lemma E.4). If ss or ss^{\prime} is equal to 0, we adopt the natural convention ψ0(x1,x0):=ψ0(x0)\psi_{0}(x_{-1},x_{0}):=\psi_{0}(x_{0}).

Proof.

We first handle the case mnm\neq n. Without loss of generality, assume that m=1m=1, n=2n=2 and sss\geq s^{\prime}. The covariance bound of Lemma A.1 yields

(39) Cov(ψs(Xs1s11,Xss1),ψs(Xs1s12,Xss2)|T)2ψsψsTV((s1:s1,s1:s2)|T,(s1:s1|T)(s1:s2|T)).\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{1}_{s-1}},X_{s}^{\mathcal{I}^{1}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{2}_{s^{\prime}-1}},X_{s^{\prime}}^{\mathcal{I}^{2}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\operatorname{TV}\left((\mathcal{I}^{1}_{s-1:s},\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}})|\mathcal{F}_{T}^{-},(\mathcal{I}^{1}_{s-1:s}|\mathcal{F}_{T}^{-})\otimes(\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}|\mathcal{F}_{T}^{-})\right).

We shall bound this total variation distance via the coupling inequality of Lemma A.1 (Supplement A.2). The idea is to construct, in addition to $\mathcal{I}^{1}_{0:T}$ and $\mathcal{I}^{2}_{0:T}$, two trajectories $\mathcal{I}^{*1}_{0:T}$ and $\mathcal{I}^{*2}_{0:T}$ which are i.i.d. given $\mathcal{F}_{T}^{-}$ and such that each of them has, conditionally on $\mathcal{F}_{T}^{-}$, the same distribution as $\mathcal{I}^{1}_{0:T}$ (cf. Corollary 4). To make the coupling inequality efficient, it is desirable to make $\mathcal{I}^{1}_{0:T}$ and $\mathcal{I}^{*1}_{0:T}$ as similar as possible (and similarly for $\mathcal{I}^{2}_{0:T}$ and $\mathcal{I}^{*2}_{0:T}$).

The detailed construction of the four trajectories 0:T1\mathcal{I}^{1}_{0:T}, 0:T2\mathcal{I}^{2}_{0:T}, 0:T1\mathcal{I}^{*1}_{0:T} and 0:T2\mathcal{I}^{*2}_{0:T} given T\mathcal{F}_{T}^{-} is described in Algorithm 15. In particular, we ensure that ts1\forall t\geq s-1, we have t1=t1\mathcal{I}^{1}_{t}=\mathcal{I}^{*1}_{t}. For ts1t\leq s-1, if t2=t2\mathcal{I}^{2}_{t}=\mathcal{I}^{*2}_{t}, it is guaranteed that 2=2\mathcal{I}^{2}_{\ell}=\mathcal{I}^{*2}_{\ell} holds t\forall\ell\leq t. The rationale for different coupling behaviours between the times ts1t\geq s-1 and ts1t\leq s-1 will become clear in the proof: the former aim to control the correlation between two different trajectories m=1m=1 and n=2n=2 and result in the C~/N\tilde{C}/N term of (38); the latter are for bounding the correlation between two times ss and ss^{\prime} and result in the (1M¯/M¯h)|ss|1(1-\bar{M}_{\ell}/\bar{M}_{h})^{|s-s^{\prime}|-1} term of the same equation.

Input: Feynman-Kac model (1), variables X0:T1:NX_{0:T}^{1:N} from the output of Algorithm 1, integer s0s\geq 0 (see statement of Lemma E.3)
Sample T1,T2i.i.d.(WT1:N)\mathcal{I}^{1}_{T},\mathcal{I}^{2}_{T}\overset{i.i.d.}{\sim}\mathcal{M}(W_{T}^{1:N})
Set T1T1\mathcal{I}^{*1}_{T}\leftarrow\mathcal{I}^{1}_{T} and T2T2\mathcal{I}^{*2}_{T}\leftarrow\mathcal{I}^{2}_{T}
for tTt\leftarrow T to 11 do
       if t1t2\mathcal{I}^{1}_{t}\neq\mathcal{I}^{2}_{t} then
             for k{1,2}k\in\left\{1,2\right\} do
                   Sample (t1k,t1k)(\mathcal{I}_{t-1}^{k},\mathcal{I}^{*k}_{t-1}) from any maximal coupling of BtN,FFBS(tk,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot) and BtN,FFBS(tk,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot) (cf. Lemma A.1)
            
      else
             Sample the N\mathbb{R}^{N} vector B^tN(t1,)\hat{B}_{t}^{N}(\mathcal{I}_{t}^{1},\cdot) from p(b^tN(it1,)|xt11:N,xtit1)p(\hat{b}_{t}^{N}(i_{t}^{1},\cdot)|x_{t-1}^{1:N},x_{t}^{i_{t}^{1}})
             Sample t11,t12i.i.d.B^tN(t1,)\mathcal{I}_{t-1}^{1},\mathcal{I}^{2}_{t-1}\overset{i.i.d.}{\sim}\hat{B}_{t}^{N}(\mathcal{I}^{1}_{t},\cdot)
             Set k1,2k\leftarrow 1,\ell\leftarrow 2 if tst\geq s and k2,1k\leftarrow 2,\ell\leftarrow 1 otherwise
             Sample t1kBtN,FFBS(tk,)\mathcal{I}_{t-1}^{*k}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot) such that (t1k,t1k)(\mathcal{I}_{t-1}^{*k},\mathcal{I}_{t-1}^{k}) is any maximal coupling of BtN,FFBS(tk,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*k},\cdot) and BtN,FFBS(tk,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot) given t:T1:2\mathcal{I}_{t:T}^{1:2}, t:T1:2\mathcal{I}_{t:T}^{*1:2} and T\mathcal{F}_{T}^{-} (()(\star) - see text for validity of this step)
             Sample t1BtN,FFBS(t,)\mathcal{I}_{t-1}^{*\ell}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*\ell},\cdot)
            
      
Output: Four trajectories 0:T1\mathcal{I}_{0:T}^{1}, 0:T2\mathcal{I}^{2}_{0:T}, 0:T1\mathcal{I}^{*1}_{0:T}, 0:T2\mathcal{I}^{*2}_{0:T} to be used in the proof of Lemma E.3
Algorithm 15 Sampler for the variables 0:T1\mathcal{I}^{1}_{0:T}, 0:T2\mathcal{I}^{2}_{0:T}, 0:T1\mathcal{I}^{*1}_{0:T} and 0:T2\mathcal{I}^{*2}_{0:T} (see proof of Lemma E.3)
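The maximal couplings required by Algorithm 15 (and by Algorithm 16 below) involve discrete distributions on $\{1,\ldots,N\}$, for which an explicit construction is available; the following Python sketch (assuming NumPy) implements the standard one: with probability $\sum_{i}\min(p_{i},q_{i})$ the two draws are forced to coincide, otherwise they are drawn from the normalised residuals, so that $\mathbb{P}(I\neq J)$ equals the total variation distance.

```python
import numpy as np

def maximal_coupling(rng, p, q):
    """Sample (I, J) with I ~ p, J ~ q and P(I != J) = TV(p, q); p, q are probability vectors."""
    overlap = np.minimum(p, q)
    alpha = overlap.sum()                       # alpha = 1 - TV(p, q)
    if rng.uniform() < alpha:
        i = rng.choice(len(p), p=overlap / alpha)
        return i, i                             # coupled branch: I = J
    # uncoupled branch: the residuals have disjoint supports, hence I != J almost surely
    rp = (p - overlap) / (1.0 - alpha)
    rq = (q - overlap) / (1.0 - alpha)
    return rng.choice(len(p), p=rp), rng.choice(len(q), p=rq)
```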

The correctness of Algorithm 15 is asserted by Lemma E.1. Step ()(\star) is valid because that lemma states that the distribution of t1k\mathcal{I}_{t-1}^{k} given T\mathcal{F}_{T}^{-}, t:T1,2\mathcal{I}_{t:T}^{1,2} and t:T1,2\mathcal{I}_{t:T}^{*1,2} is BtN,FFBS(tk,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{k},\cdot). Furthermore, we note that (RTt)t=0T(R_{T-t})_{t=0}^{T} where

Rt:=(t1,t2,t1,t2),R_{t}:=(\mathcal{I}_{t}^{1},\mathcal{I}_{t}^{2},\mathcal{I}^{*1}_{t},\mathcal{I}_{t}^{*2}),

is a Markov chain given T\mathcal{F}_{T}^{-}.

From (39), applying the coupling inequality of Lemma A.1 gives

(40) Cov(ψs(Xs1s11,Xss1),ψs(Xs1s12,Xss2)|T)2ψsψs((s1:s1,s1:s2)(s1:s1,s1:s2)|T)=2ψsψs(s1:s2s1:s2|T)\operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{1}_{s-1}},X_{s}^{\mathcal{I}^{1}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{2}_{s^{\prime}-1}},X_{s^{\prime}}^{\mathcal{I}^{2}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{(\mathcal{I}^{1}_{s-1:s},\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}})\neq(\mathcal{I}^{*1}_{s-1:s},\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}})}\right|{\mathcal{F}_{T}^{-}}\right)\\ =2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)

where the last equality results from the construction of Algorithm 15. The sub-case $s=s^{\prime}$ follows directly from Lemma E.4, so we now focus on the sub-case $s\geq s^{\prime}+1$. For all $t\leq s-1$,

(41) (t12t12|T)\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{*2}_{t-1}}\right|{\mathcal{F}_{T}^{-}}\right)
=(t12t12,t2t2|T)\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{*2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}}\right|{\mathcal{F}_{T}^{-}}\right)
by construction of Algorithm 15
=𝔼[(t12t12,t2t2|Rt,T)|T]\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t-1}\neq\mathcal{I}^{*2}_{t-1},\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)}\right|{\mathcal{F}_{T}^{-}}\right]
by the law of total expectation
=𝔼[TV(BtN,FFBS(t2,),BtN,FFBS(t2,))𝟙{t2t2}|T]\displaystyle=\mathbb{E}\left[\left.{\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{2}_{t},\cdot),B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*2}_{t},\cdot)\right)\mathbb{1}\left\{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}\right\}}\right|{\mathcal{F}_{T}^{-}}\right]
(1M¯M¯h)(t2t2|T) by Lemma E.2.\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{P}\left(\left.{\mathcal{I}^{2}_{t}\neq\mathcal{I}^{*2}_{t}}\right|{\mathcal{F}_{T}^{-}}\right)\text{ by Lemma~{}\ref{lem:backward_mixing}}.

Thus

(s1:s2s1:s2|T)\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}-1:s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}-1:s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)
=(s2s2|T) by construction of Algorithm 15\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s^{\prime}}\neq\mathcal{I}^{*2}_{s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)\text{ by construction of Algorithm~{}\ref{algo:four_trajs}}
(1M¯M¯h)ss1(s12s12|T) by applying (41) recursively\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s-1}\neq\mathcal{I}^{*2}_{s-1}}\right|{\mathcal{F}_{T}^{-}}\right)\text{ by applying \eqref{eq:two_twostar_diff_bw} recursively}
(1M¯M¯h)ss1C~N by Lemma E.4,\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\frac{\tilde{C}}{N}\text{ by Lemma~{}\ref{lem:I_s_star_likely_same},}

which, combined with (40), finishes the proof for the current sub-case $s\geq s^{\prime}+1$. It remains to show (38) when $m=n$. The proof follows the same lines as in the case $m\neq n$, although we shall briefly outline some arguments to show how the factor $\tilde{C}/N$ disappears. The case $s=s^{\prime}$ being trivial, suppose that $s\geq s^{\prime}+1$ and, without loss of generality, that $m=n=3$. To use the coupling tools of Lemma A.1, we construct trajectories $\mathcal{I}^{3}_{0:T}$, $\mathcal{I}^{*3}_{0:T}$ and $\mathcal{I}^{*4}_{0:T}$ via Algorithm 16 and write, in the spirit of (40):

Input: Feynman-Kac model (1), variables X0:T1:NX_{0:T}^{1:N} from the output of Algorithm 1, integer s0s\geq 0 (see statement of Lemma E.3)
Sample T3,T4i.i.d.(WT1:N)\mathcal{I}^{*3}_{T},\mathcal{I}^{*4}_{T}\overset{i.i.d.}{\sim}\mathcal{M}(W_{T}^{1:N})
Set T3T3\mathcal{I}^{3}_{T}\leftarrow\mathcal{I}^{*3}_{T}
for tTt\leftarrow T to 1 do
       if tst\geq s then
             Sample t13BtN,FFBS(t3,)\mathcal{I}^{*3}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*3}_{t},\cdot) and t14BtN,FFBS(t4,)\mathcal{I}^{*4}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*4}_{t},\cdot)
             Set t13t13\mathcal{I}^{3}_{t-1}\leftarrow\mathcal{I}^{*3}_{t-1}
            
      else
             Sample (t13,t14)(\mathcal{I}^{3}_{t-1},\mathcal{I}^{*4}_{t-1}) from a maximal coupling of BtN,FFBS(t3,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{3}_{t},\cdot) and BtN,FFBS(t4,)B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*4}_{t},\cdot)
             Sample t13BtN,FFBS(t3,)\mathcal{I}^{*3}_{t-1}\sim B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}^{*3}_{t},\cdot)
      
Output: Three trajectories 0:T3\mathcal{I}^{3}_{0:T}, 0:T3\mathcal{I}^{*3}_{0:T} and 0:T4\mathcal{I}^{*4}_{0:T} to be used in the proof of Lemma E.3
Algorithm 16 Sampler for the variables 0:T3\mathcal{I}^{3}_{0:T}, 0:T3\mathcal{I}^{*3}_{0:T} and 0:T4\mathcal{I}^{*4}_{0:T} (see proof of Lemma E.3)
(42) \operatorname{Cov}\left(\left.\psi_{s}(X_{s-1}^{\mathcal{I}^{3}_{s-1}},X_{s}^{\mathcal{I}^{3}_{s}}),\psi_{s^{\prime}}(X_{s^{\prime}-1}^{\mathcal{I}^{3}_{s^{\prime}-1}},X_{s^{\prime}}^{\mathcal{I}^{3}_{s^{\prime}}})\right|\mathcal{F}_{T}^{-}\right)\\ \leq 2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{(\mathcal{I}^{3}_{s-1:s},\mathcal{I}^{3}_{s^{\prime}-1:s^{\prime}})\neq(\mathcal{I}^{*3}_{s-1:s},\mathcal{I}^{*4}_{s^{\prime}-1:s^{\prime}})}\right|{\mathcal{F}_{T}^{-}}\right)\\ =2\left\lVert\psi_{s}\right\rVert_{\infty}\left\lVert\psi_{s^{\prime}}\right\rVert_{\infty}\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s^{\prime}}\neq\mathcal{I}^{*4}_{s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right)

where the last equality follows from the construction of Algorithm 16 and the hypothesis ss+1s\geq s^{\prime}+1. For all ts1t\leq s-1, the inequality

(43) (t13t14|T)(1M¯M¯h)(t3t4|T)\mathbb{P}\left(\left.{\mathcal{I}^{3}_{t-1}\neq\mathcal{I}^{*4}_{t-1}}\right|{\mathcal{F}_{T}^{-}}\right)\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{P}\left(\left.{\mathcal{I}^{3}_{t}\neq\mathcal{I}^{*4}_{t}}\right|{\mathcal{F}_{T}^{-}}\right)

can be proved using the same techniques as those used to prove (41): applying Lemma E.2 given (t3,t3,t4)(\mathcal{I}^{3}_{t},\mathcal{I}^{*3}_{t},\mathcal{I}^{*4}_{t}) then invoking the law of total expectation. Repeatedly instantiating (43) gives

(s3s4|T)\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s^{\prime}}\neq\mathcal{I}^{*4}_{s^{\prime}}}\right|{\mathcal{F}_{T}^{-}}\right) (1M¯M¯h)ss1(s13s14|T)\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}\mathbb{P}\left(\left.{\mathcal{I}^{3}_{s-1}\neq\mathcal{I}^{*4}_{s-1}}\right|{\mathcal{F}_{T}^{-}}\right)
(1M¯M¯h)ss1\displaystyle\leq\left(1-\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{s-s^{\prime}-1}

which, when plugged into (42), finishes the proof. ∎

Lemma E.4.

For $\mathcal{I}_{s}^{2}$ and $\mathcal{I}_{s}^{*2}$ defined by the output of Algorithm 15, we have

(s2s2|T)\displaystyle\mathbb{P}\left(\mathcal{I}^{2}_{s}\neq\mathcal{I}^{*2}_{s}|\mathcal{F}_{T}^{-}\right) C~/N, and\displaystyle\leq\tilde{C}/N,\text{ and }
(s12s12|T)\displaystyle\mathbb{P}\left(\left.{\mathcal{I}^{2}_{s-1}\neq\mathcal{I}^{*2}_{s-1}}\right|{\mathcal{F}_{T}^{-}}\right) C~/N, if s1,\displaystyle\leq\tilde{C}/N,\text{ if }s\geq 1,

for some constant C~=C~(M¯,M¯h,G¯,G¯h,εS)\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}}).

Proof.

Define $A_{t}:=\mathbb{1}\left\{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}\right\}$, $B_{t}:=\mathbb{1}\left\{\mathcal{I}_{t}^{2}=\mathcal{I}_{t}^{*2}\right\}$ and $\Gamma_{t}:=A_{t}B_{t}$, and recall that $R_{t}:=(\mathcal{I}_{t}^{1},\mathcal{I}_{t}^{2},\mathcal{I}_{t}^{*1},\mathcal{I}_{t}^{*2})$. The sequence $(R_{T-\ell})_{\ell=0}^{T}$ is a Markov chain given $\mathcal{F}_{T}^{-}$, but this is not necessarily the case for the sequence $(\Gamma_{T-\ell})_{\ell=0}^{T}$ of Bernoulli random variables. Nevertheless, Lemma E.5 below shows that one can get bounds on two-step “transition probabilities” for $(\Gamma_{T-\ell})$, i.e. the probabilities under $\mathcal{F}_{T}^{-}$ that $\Gamma_{t-2}=1$ given $\Gamma_{t}$ and $R_{t}$. This motivates the following construction of actual Markov chains approximating the dynamics of $\Gamma_{t}$. Let $\Gamma^{*}_{T}$ and $\Gamma^{*}_{T-1}$ be two independent Bernoulli random variables given $\mathcal{F}_{T}^{-}$ such that

(44) (ΓT=1|T)\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T}=1}\right|{\mathcal{F}_{T}^{-}}\right) =(ΓT=1|T)\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T}=1}\right|{\mathcal{F}_{T}^{-}}\right)
(ΓT1=1|T)\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=1}\right|{\mathcal{F}_{T}^{-}}\right) =(ΓT1=1|T).\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T-1}=1}\right|{\mathcal{F}_{T}^{-}}\right).

Let ΓT,ΓT2,ΓT4,\Gamma^{*}_{T},\Gamma^{*}_{T-2},\Gamma^{*}_{T-4},\ldots and ΓT1,ΓT3,\Gamma^{*}_{T-1},\Gamma^{*}_{T-3},\ldots be two homogeneous Markov chains given T\mathcal{F}_{T}^{-} with the same transition kernel 2\overleftarrow{\mathfrak{C}^{2}} defined by

(45) T(Γt2=1|Γt=1)\displaystyle\mathbb{P}_{\mathcal{F}_{T}^{-}}(\Gamma^{*}_{t-2}=1|\Gamma^{*}_{t}=1) =12N(G¯hM¯hG¯M¯)2\displaystyle=1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2} =:211\displaystyle=:\overleftarrow{\mathfrak{C}^{2}}_{11}
T(Γt2=1|Γt=0)\displaystyle\mathbb{P}_{\mathcal{F}_{T}^{-}}(\Gamma^{*}_{t-2}=1|\Gamma^{*}_{t}=0) =M¯εS2M¯h\displaystyle=\frac{\bar{M}_{\ell}\varepsilon_{\mathrm{S}}}{2\bar{M}_{h}} =:201\displaystyle=:\overleftarrow{\mathfrak{C}^{2}}_{01}

where for two events E1E_{1}, E2E_{2}, the notation T(E1|E2)\mathbb{P}_{\mathcal{F}_{T}^{-}}(E_{1}|E_{2}) is the ratio between (E1,E2|T)\mathbb{P}\left(\left.{E_{1},E_{2}}\right|{\mathcal{F}_{T}^{-}}\right) and (E2|T)\mathbb{P}\left(\left.{E_{2}}\right|{\mathcal{F}_{T}^{-}}\right). We shall now prove by backward induction the following statement:

(46) (Γt=1|T)(Γt=1|T),ts1.\mathbb{P}\left(\left.{\Gamma_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right)\geq\mathbb{P}\left(\left.{\Gamma^{*}_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right),\forall t\geq s-1.

Firstly, (46) holds for t=Tt=T and t=T1t=T-1. Now suppose that it holds for some ts+1t\geq s+1 and we wish to justify it for t2t-2. By Lemma E.5,

(Γt2=1|Rt,T)𝟙Γt=1\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{t}=1} 211𝟙Γt=1\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{11}\mathbb{1}_{\Gamma_{t}=1}
(Γt2=1|Rt,T)𝟙Γt=0\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{t}=0} 201𝟙Γt=0.\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{01}\mathbb{1}_{\Gamma_{t}=0}.

Applying the law of total expectation gives

(Γt2=1|T)\displaystyle\mathbb{P}\left(\left.{\Gamma_{t-2}=1}\right|{\mathcal{F}_{T}^{-}}\right) 211(Γt=1|T)+201(Γt=0|T)\displaystyle\geq\overleftarrow{\mathfrak{C}^{2}}_{11}\mathbb{P}\left(\left.{\Gamma_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}\mathbb{P}\left(\left.{\Gamma_{t}=0}\right|{\mathcal{F}_{T}^{-}}\right)
=(211201)(Γt=1|T)+201\displaystyle=\left(\overleftarrow{\mathfrak{C}^{2}}_{11}-\overleftarrow{\mathfrak{C}^{2}}_{01}\right)\mathbb{P}\left(\left.{\Gamma_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}
(211201)(Γt=1|T)+201\displaystyle\geq\left(\overleftarrow{\mathfrak{C}^{2}}_{11}-\overleftarrow{\mathfrak{C}^{2}}_{01}\right)\mathbb{P}\left(\left.{\Gamma^{*}_{t}=1}\right|{\mathcal{F}_{T}^{-}}\right)+\overleftarrow{\mathfrak{C}^{2}}_{01}
if N is large enough, by induction hypothesis\displaystyle\text{ if }N\text{ is large enough, by induction hypothesis}
=(Γt2=1|T)\displaystyle=\mathbb{P}\left(\left.{\Gamma^{*}_{t-2}=1}\right|{\mathcal{F}_{T}^{-}}\right)

and (46) is now proved. To finish the proof of the lemma, it remains to lower bound the right-hand side of (46). We start by controlling the distributions of $\Gamma^{*}_{t}$ for $t=T$ and $t=T-1$. We have

(47) (ΓT=1|T)\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T}=1}\right|{\mathcal{F}_{T}^{-}}\right) =(ΓT=1|T) by (44)\displaystyle=\mathbb{P}\left(\left.{\Gamma_{T}=1}\right|{\mathcal{F}_{T}^{-}}\right)\text{ by \eqref{eq:gammastar_initial}}
=1(AT=0|T) as BT=1 by Algorithm 15\displaystyle=1-\mathbb{P}\left(\left.{A_{T}=0}\right|{\mathcal{F}_{T}^{-}}\right)\text{ as }B_{T}=1\text{ by Algorithm~{}\ref{algo:four_trajs}}
=1i=1N(T1=T2=i|T)\displaystyle=1-\sum_{i=1}^{N}\mathbb{P}\left(\left.{\mathcal{I}_{T}^{1}=\mathcal{I}_{T}^{2}=i}\right|{\mathcal{F}_{T}^{-}}\right)
\displaystyle=1-\sum_{i=1}^{N}\left(\frac{G_{T}(X_{T}^{i})}{\sum_{j=1}^{N}G_{T}(X_{T}^{j})}\right)^{2}
11N(G¯hG¯)2 by Assumption 3\displaystyle\geq 1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}\text{ by Assumption~{}\ref{asp:g_2ways_bound}}

and

(48) (ΓT1=1|T)\displaystyle\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=1}\right|{\mathcal{F}_{T}^{-}}\right) (ΓT=1,ΓT1=1|T)\displaystyle\geq\mathbb{P}\left(\left.{\Gamma_{T}=1,\Gamma_{T-1}=1}\right|{\mathcal{F}_{T}^{-}}\right)
=𝔼[(ΓT=1,ΓT1=1|RT,T)|T]\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\Gamma_{T}=1,\Gamma_{T-1}=1}\right|{R_{T},\mathcal{F}_{T}^{-}}\right)}\right|{\mathcal{F}_{T}^{-}}\right]
by the law of total expectation
=𝔼[(ΓT1=1|RT,T)𝟙ΓT=1|T]\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{\Gamma_{T-1}=1}\right|{R_{T},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\Gamma_{T}=1}}\right|{\mathcal{F}_{T}^{-}}\right]
[11N(G¯hM¯hG¯M¯)2](ΓT=1|T) via (53)\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{P}\left(\left.{\Gamma_{T}=1}\right|{\mathcal{F}_{T}^{-}}\right)\text{ via \eqref{ieq_c_4trajs_backward_reg_bound}}
[11N(G¯hM¯hG¯M¯)2][11N(G¯hG¯)2].\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}\right].

The contraction property of Lemma A.1 makes it possible to relate the intermediate distributions Γt|T\Gamma^{*}_{t}|\mathcal{F}_{T}^{-} to the end point ones ΓT1|T\Gamma^{*}_{T-1}|\mathcal{F}_{T}^{-} and ΓT|T\Gamma^{*}_{T}|\mathcal{F}_{T}^{-}. More specifically, (45) and Lemma A.1 lead to

(49) TV(Γt|T,μ)max(TV(ΓT|T,μ),TV(ΓT1|T,μ))\operatorname{TV}(\Gamma^{*}_{t}|\mathcal{F}_{T}^{-},\mu^{*})\leq\max\left(\operatorname{TV}(\Gamma^{*}_{T}|\mathcal{F}_{T}^{-},\mu^{*}),\operatorname{TV}(\Gamma^{*}_{T-1}|\mathcal{F}_{T}^{-},\mu^{*})\right)

where μ\mu^{*} is the invariant distribution of a Markov chain with transition matrix 2\overleftarrow{\mathfrak{C}^{2}}, namely

(50) {μ({0})=210201+210μ({1})=1μ({0}).\begin{cases}\mu^{*}(\left\{0\right\})&=\frac{\overleftarrow{\mathfrak{C}^{2}}_{10}}{\overleftarrow{\mathfrak{C}^{2}}_{01}+\overleftarrow{\mathfrak{C}^{2}}_{10}}\\ \mu^{*}(\left\{1\right\})&=1-\mu^{*}(\left\{0\right\}).\end{cases}
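Plugging (45) into (50) makes the order of $\mu^{*}(\{0\})$ explicit: writing $\overleftarrow{\mathfrak{C}^{2}}_{10}:=1-\overleftarrow{\mathfrak{C}^{2}}_{11}=\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}$, we get

\mu^{*}(\{0\})=\frac{\overleftarrow{\mathfrak{C}^{2}}_{10}}{\overleftarrow{\mathfrak{C}^{2}}_{01}+\overleftarrow{\mathfrak{C}^{2}}_{10}}\leq\frac{\overleftarrow{\mathfrak{C}^{2}}_{10}}{\overleftarrow{\mathfrak{C}^{2}}_{01}}=\frac{4\bar{M}_{h}}{N\bar{M}_{\ell}\varepsilon_{\mathrm{S}}}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2},

so that the term $3\mu^{*}(\{0\})$ appearing below is indeed of order $1/N$.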

Furthermore, an alternative expression of the total variation distance given in Lemma A.1 implies that the total variation distance between two Bernoulli distributions of parameters pp and qq is |pq|\left|p-q\right|. Combining this with (49), the triangle inequality and the rough estimate max(a,b)a+ba,b0\max(a,b)\leq a+b\ \forall a,b\geq 0, we get

(Γt=0|T)3μ({0})+(ΓT=0|T)+(ΓT1=0|T)C~/N\mathbb{P}\left(\left.{\Gamma^{*}_{t}=0}\right|{\mathcal{F}_{T}^{-}}\right)\leq 3\mu^{*}(\left\{0\right\})+\mathbb{P}\left(\left.{\Gamma^{*}_{T}=0}\right|{\mathcal{F}_{T}^{-}}\right)+\mathbb{P}\left(\left.{\Gamma^{*}_{T-1}=0}\right|{\mathcal{F}_{T}^{-}}\right)\leq\tilde{C}/N

where C~=C~(M¯,M¯h,G¯,G¯h,εS)\tilde{C}=\tilde{C}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}}). The last inequality is straightforwardly derived by plugging respectively (50), (47) and (48) into the three terms of the preceding sum. This combined with (46) finishes the proof. ∎

Lemma E.5.

Let $s$ be as defined in the statement of Lemma E.3, and let $A_{t}$, $B_{t}$ and $R_{t}$ be as defined in the proof of Lemma E.4. Then, for all $t\geq s+1$, we have

(At2Bt2=1|Rt,T)𝟙AtBt=1\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1} (12N(G¯hM¯hG¯M¯)2)𝟙AtBt=1;\displaystyle\geq\left(1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t}B_{t}=1};
(At2Bt2=1|Rt,T)\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right) M¯εS2M¯h\displaystyle\geq\frac{\bar{M}_{\ell}\varepsilon_{\mathrm{S}}}{2\bar{M}_{h}}

where the inequalities hold for NN large enough, i.e., NN0=N0(M¯,M¯h,G¯,G¯h,εS)N\geq N_{0}=N_{0}(\bar{M}_{\ell},\bar{M}_{h},\bar{G}_{\ell},\bar{G}_{h},\varepsilon_{\mathrm{S}}).

Proof.

We start by showing the following three inequalities for all tst\geq s and NN sufficiently large:

(51) (At1=1|Rt,T)\displaystyle\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right) εS;\displaystyle\geq\varepsilon_{\mathrm{S}};
(52) (At1Bt1=1|Rt,T)𝟙At=1\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1} (M¯/2M¯h)𝟙At=1;\displaystyle\geq(\bar{M}_{\ell}/2\bar{M}_{h})\mathbb{1}_{A_{t}=1};
(53) (At1Bt1=1|Rt,T)𝟙AtBt=1\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1} [11N(G¯hM¯hG¯M¯)2]𝟙AtBt=1.\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{1}_{A_{t}B_{t}=1}.

For (51), we have

(54) (At1=1|Rt,T)𝟙At1=(t11t12|Rt,T)𝟙t1=t2εS𝟙At1\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}\neq 1}=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}=\mathcal{I}_{t}^{2}}\geq\varepsilon_{\mathrm{S}}\mathbb{1}_{A_{t}\neq 1}

by Lemma E.1. Next,

(55) (At1=1|Rt,T)𝟙At=1\displaystyle\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}
=(t11t12|Rt,T)𝟙t1t2\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}\neq\mathcal{I}_{t-1}^{2}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}
=[1i(t11=t12=i|Rt,T)]𝟙t1t2\displaystyle=\left[1-\sum_{i}\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{1}=\mathcal{I}_{t-1}^{2}=i}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}
=[1i=1Nk=12Gt1(Xt1i)mt(Xt1i,Xttk)j=1NGt1(Xt1j)mt(Xt1j,Xttk)]𝟙t1t2 by Lemma E.1\displaystyle=\left[1-\sum_{i=1}^{N}\prod_{k=1}^{2}\frac{G_{t-1}(X_{t-1}^{i})m_{t}(X_{t-1}^{i},X_{t}^{\mathcal{I}_{t}^{k}})}{\sum_{j=1}^{N}G_{t-1}(X_{t-1}^{j})m_{t}(X_{t-1}^{j},X_{t}^{\mathcal{I}_{t}^{k}})}\right]\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}\text{ by Lemma~{}\ref{lem:two_backward_traj}}
[11N(G¯hM¯hG¯M¯)2]𝟙At=1 by Assumptions 2 and 3.\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]\mathbb{1}_{A_{t}=1}\text{ by Assumptions~{}\ref{asp:mt_2ways_bound} and~{}\ref{asp:g_2ways_bound}.}

Combining (54) and (55) yields (51) for NN large enough. To prove (52), we write

(56) (At1Bt1=1|Rt,T)𝟙At=1\displaystyle\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}
=[1(At1Bt1=0|Rt,T)]𝟙At=1\displaystyle=\left[1-\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=0}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{A_{t}=1}
[1(At1=0|Rt,T)(Bt1=0|Rt,T)]𝟙At=1\displaystyle\geq\left[1-\mathbb{P}\left(\left.{A_{t-1}=0}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)-\mathbb{P}\left(\left.{B_{t-1}=0}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\right]\mathbb{1}_{A_{t}=1}
=[(At1=1|Rt,T)+(Bt1=1|Rt,T)1]𝟙At=1.\displaystyle=\left[\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)+\mathbb{P}\left(\left.{B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)-1\right]\mathbb{1}_{A_{t}=1}.

We analyse the second term in the above expression. We have

(57) (Bt1=1|Rt,T)𝟙At=1\displaystyle\mathbb{P}\left(\left.{B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}
=(t12=t12|Rt,T)𝟙t1t2\displaystyle=\mathbb{P}\left(\left.{\mathcal{I}_{t-1}^{2}=\mathcal{I}_{t-1}^{*2}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{\mathcal{I}_{t}^{1}\neq\mathcal{I}_{t}^{2}}
=[1TV(BtN,FFBS(t2,),BtN,FFBS(t2,))]𝟙At=1\displaystyle=\left[1-\operatorname{TV}\left(B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{2},\cdot),B_{t}^{N,\mathrm{FFBS}}(\mathcal{I}_{t}^{*2},\cdot)\right)\right]\mathbb{1}_{A_{t}=1}
by construction of Algorithm 15
(M¯/M¯h)𝟙At=1 by Lemma E.2.\displaystyle\geq(\bar{M}_{\ell}/\bar{M}_{h})\mathbb{1}_{A_{t}=1}\text{ by Lemma~{}\ref{lem:backward_mixing}.}

Plugging (55) and (57) into (56) yields

(At1Bt1=1|Rt,T)𝟙At=1(1N(G¯hM¯hG¯M¯)2+M¯M¯h)𝟙At=1\mathbb{P}\left(\left.{A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}=1}\geq\left(-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}+\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)\mathbb{1}_{A_{t}=1}

and thus (52) follows if NN is large enough. The inequality (53) is justified by combining (55), the simple decomposition 𝟙AtBt=1=𝟙At=1𝟙Bt=1\mathbb{1}_{A_{t}B_{t}=1}=\mathbb{1}_{A_{t}=1}\mathbb{1}_{B_{t}=1} and the fact that Algorithm 15 guarantees Bt1=1B_{t-1}=1 if At=Bt=1A_{t}=B_{t}=1.

We can now deduce the two inequalities in the statement of the Lemma. The first one is a straightforward double application of (53):

(At2Bt2=1|Rt,T)𝟙AtBt=1\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}
(At2Bt2=1,At1Bt1=1|Rt,T)𝟙AtBt=1\displaystyle\geq\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}B_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t}B_{t}=1}
=𝔼[(At2Bt2=1,At1Bt1=1|Rt1,Rt,T)|Rt,T]𝟙AtBt=1\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}B_{t-1}=1}\right|{R_{t-1},R_{t},\mathcal{F}_{T}^{-}}\right)}\right|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}
by the law of total expectation
=𝔼[(At2Bt2=1|Rt1,T)𝟙At1Bt1=1|Rt,T]𝟙AtBt=1\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t-1},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t-1}B_{t-1}=1}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}
since (RT)=0T is Markov given T\displaystyle\text{ since }(R_{T-\ell})_{\ell=0}^{T}\text{ is Markov given }\mathcal{F}_{T}^{-}
𝔼[(11N(G¯hM¯hG¯M¯)2)𝟙At1Bt1=1|Rt,T]𝟙AtBt=1\displaystyle\geq\mathbb{E}\left[\left.{\left(1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t-1}B_{t-1}=1}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right]\mathbb{1}_{A_{t}B_{t}=1}
[11N(G¯hM¯hG¯M¯)2]2𝟙AtBt=1(12N(G¯hM¯hG¯M¯)2)𝟙AtBt=1.\displaystyle\geq\left[1-\frac{1}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right]^{2}\mathbb{1}_{A_{t}B_{t}=1}\geq\left(1-\frac{2}{N}\left(\frac{\bar{G}_{h}\bar{M}_{h}}{\bar{G}_{\ell}\bar{M}_{\ell}}\right)^{2}\right)\mathbb{1}_{A_{t}B_{t}=1}.

Finally, we have

(At2Bt2=1|Rt,T)\displaystyle\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)
(At2Bt2=1,At1=1|Rt,T)\displaystyle\geq\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1,A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)
=𝔼[(At2Bt2=1|Rt1,T)𝟙At1=1|Rt,T]\displaystyle=\mathbb{E}\left[\left.{\mathbb{P}\left(\left.{A_{t-2}B_{t-2}=1}\right|{R_{t-1},\mathcal{F}_{T}^{-}}\right)\mathbb{1}_{A_{t-1}=1}}\right|{R_{t},\mathcal{F}_{T}^{-}}\right]
using law of total expectation and the Markov property as above
M¯2M¯h(At1=1|Rt,T) by (52)\displaystyle\geq\frac{\bar{M}_{\ell}}{2\bar{M}_{h}}\mathbb{P}\left(\left.{A_{t-1}=1}\right|{R_{t},\mathcal{F}_{T}^{-}}\right)\text{ by \eqref{ieq_b_4trajs_backward_reg_bound}}
M¯2M¯hεS by (51)\displaystyle\geq\frac{\bar{M}_{\ell}}{2\bar{M}_{h}}\varepsilon_{\mathrm{S}}\text{ by \eqref{ieq_a_4trajs_backward_reg_bound}}

and the second inequality is proved. ∎

E.4. Proof of Proposition  3 (hybrid rejection validity)

Proof.

Put Zn:=(Xn,UnCμ0(Xn))Z_{n}:=(X_{n},U_{n}C\mu_{0}(X_{n})). Then ZnZ_{n} is uniformly distributed on

𝒢0:={(x,y)𝒳×+,yCμ0(x)}.\mathcal{G}_{0}:=\left\{(x,y)\in\mathcal{X}\times\mathbb{R}_{+},y\leq C\mu_{0}(x)\right\}.

The proof would be done if one could show that, given KKK^{*}\leq K, the variable ZKZ_{K^{*}} is uniformly distributed on

𝒢1:={(x,y)𝒳×+,yμ1(x)}.\mathcal{G}_{1}:=\left\{(x,y)\in\mathcal{X}\times\mathbb{R}_{+},y\leq\mu_{1}(x)\right\}.

Note that KK^{*} is, by definition, the first time index where the sequence (Zn)(Z_{n}) touches 𝒢1\mathcal{G}_{1}. Let BB be any subset of 𝒢1\mathcal{G}_{1}. We have

(58) (ZKB|KK)(ZKB,KK)\displaystyle\mathbb{P}\left(\left.{Z_{K^{*}}\in B}\right|{K^{*}\leq K}\right)\propto\mathbb{P}(Z_{K^{*}}\in B,K^{*}\leq K)
=k=1(ZkB,K=k,Kk)\displaystyle=\sum_{k^{*}=1}^{\infty}\mathbb{P}\left(Z_{k^{*}}\in B,K^{*}=k^{*},K\geq k^{*}\right)
=k=1(ZkB,Z1:k1𝒢1,K>k1)\displaystyle=\sum_{k^{*}=1}^{\infty}\mathbb{P}\left(Z_{k^{*}}\in B,Z_{1:k^{*}-1}\notin\mathcal{G}_{1},K>k^{*}-1\right)
=k=1(ZkB)(Z1:k1𝒢1,K>k1)since K stopping time\displaystyle=\sum_{k^{*}=1}^{\infty}\mathbb{P}(Z_{k^{*}}\in B)\mathbb{P}\left(Z_{1:k^{*}-1}\notin\mathcal{G}_{1},K>k^{*}-1\right)\mbox{since $K$ stopping time}
=(Z1B)k=1(Z1:k1𝒢1,K>k1)\displaystyle=\mathbb{P}(Z_{1}\in B)\sum_{k^{*}=1}^{\infty}\mathbb{P}\left(Z_{1:k^{*}-1}\notin\mathcal{G}_{1},K>k^{*}-1\right)
(Z1B)(Z1B|Z1𝒢1).\displaystyle\propto\mathbb{P}(Z_{1}\in B)\propto\mathbb{P}\left(\left.{Z_{1}\in B}\right|{Z_{1}\in\mathcal{G}_{1}}\right).

By considering the special case B=𝒢1B=\mathcal{G}_{1}, we see that the constant of proportionality between the first and the last terms of (58) must be 11, from which the proof follows. ∎
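A minimal Python sketch of the hybrid scheme analysed above may help fix ideas (assuming NumPy; the proposal $\mu_{0}$, target $\mu_{1}$, envelope constant $C$ and cap $K$ below are illustrative placeholders, and the fallback is written as a generic exact sampler, which in the backward-sampling application would be an $\mathcal{O}(N)$ multinomial draw): rejection sampling is attempted for at most $K$ trials, and on the event $\{K^{*}>K\}$ the exact sampler is used instead.

```python
import numpy as np

def hybrid_rejection(rng, sample_mu0, mu0_pdf, mu1_pdf, C, K, exact_sampler):
    """Sample from mu1: rejection sampling with proposal mu0 and envelope C*mu0 >= mu1,
    capped at K attempts, with an exact fallback on the event {K* > K}."""
    for _ in range(K):
        x = sample_mu0(rng)
        u = rng.uniform()
        # Z_n = (x, u * C * mu0(x)) is uniform under the graph of C*mu0;
        # accept iff it also lies under the graph of mu1
        if u * C * mu0_pdf(x) <= mu1_pdf(x):
            return x
    return exact_sampler(rng)

# toy check: target mu1 = N(0, 0.5^2), proposal mu0 = N(0, 1); C = 2 dominates everywhere
npdf = lambda x, s: np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
rng = np.random.default_rng(1)
xs = [hybrid_rejection(rng, lambda r: r.normal(0.0, 1.0),
                       lambda x: npdf(x, 1.0), lambda x: npdf(x, 0.5),
                       C=2.0, K=10, exact_sampler=lambda r: r.normal(0.0, 0.5))
      for _ in range(10000)]
print(np.mean(xs), np.std(xs))                  # should be close to 0 and 0.5
```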

E.5. Proof of Theorem  3.1 (hybrid algorithm’s intermediate complexity)

From (16), one may have the correct intuition that, as $N\to\infty$, $\tau_{t}^{1,\mathrm{PaRIS}}$ converges in distribution to the variable $\tau_{t}^{\infty,\mathrm{PaRIS}}$ defined by

(59) τt,PaRIS | Xt,PaRISGeo(rt(Xt,PaRIS)M¯h)\tau_{t}^{\infty,\mathrm{PaRIS}}\textrm{ }|\textrm{ }X_{t}^{\infty,\mathrm{PaRIS}}\sim\operatorname{Geo}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)

where Xt,PaRISt1Mt(dxt)X_{t}^{\infty,\mathrm{PaRIS}}\sim\mathbb{Q}_{t-1}M_{t}(\mathrm{d}x_{t}) is distributed according to the predictive distribution of XtX_{t} given Y0:t1Y_{0:t-1} and rtr_{t} is the density of Xt,PaRISX_{t}^{\infty,\mathrm{PaRIS}} with respect to the Lebesgue measure (cf. Definition 1). The following proposition formalises the connection between τt1,PaRIS\tau_{t}^{1,\mathrm{PaRIS}} and τt,PaRIS\tau_{t}^{\infty,\mathrm{PaRIS}}.

Proposition 6.

We have τt1,PaRISτt,PaRIS\tau_{t}^{1,\mathrm{PaRIS}}\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}} as NN\to\infty.

Proof.

From (16) and Definition 1 one has

(60) τt1,PaRIS | Xt1,t1Geo(rtN(Xt1)M¯h).\tau_{t}^{1,\mathrm{PaRIS}}\textrm{ }|\textrm{ }X_{t}^{1},\mathcal{F}_{t-1}\sim\operatorname{Geo}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right).

In light of (59), it suffices to establish that

(61) rtN(Xt1)M¯hrt(Xt,PaRIS)M¯h.\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\Rightarrow\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}.

Indeed, this would mean that for any continuous bounded function ψ\psi, we have

𝔼[ψ(τt1,PaRIS)]=𝔼[(Geoψ)(rtN(Xt1)M¯h)]\displaystyle\mathbb{E}[\psi(\tau_{t}^{1,\mathrm{PaRIS}})]=\mathbb{E}\left[(\operatorname{Geo}^{\star}\psi)\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)\right] 𝔼[(Geoψ)(rt(Xt,PaRIS)M¯h)]\displaystyle\rightarrow\mathbb{E}\left[(\operatorname{Geo}^{\star}\psi)\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)\right]
=𝔼[ψ(τt,PaRIS)]\displaystyle=\mathbb{E}[\psi(\tau_{t}^{\infty,\mathrm{PaRIS}})]

where Geo\operatorname{Geo}^{\star} is the geometric Markov kernel that sends each λ\lambda to the geometric distribution of parameter λ\lambda, i.e. Geo(λ,dx)=Geo(λ)\operatorname{Geo}^{\star}(\lambda,\mathrm{d}x)=\operatorname{Geo}(\lambda). To this end, write

(62) rtN(Xt1)rt(Xt1)=nGt1(Xt1n)mt(Xt1n,Xt1)nGt1(Xt1n)rt(Xt1)=nN1Gt1(Xt1n)[mt(Xt1n,Xt1)rt(Xt1)]N1nGt1(Xt1n).\begin{split}r_{t}^{N}(X_{t}^{1})-r_{t}(X_{t}^{1})&=\frac{\sum_{n}G_{t-1}(X_{t-1}^{n})m_{t}(X_{t-1}^{n},X_{t}^{1})}{\sum_{n}G_{t-1}(X_{t-1}^{n})}-r_{t}(X_{t}^{1})\\ &=\frac{\sum_{n}N^{-1}G_{t-1}(X_{t-1}^{n})\left[m_{t}(X_{t-1}^{n},X_{t}^{1})-r_{t}(X_{t}^{1})\right]}{N^{-1}\sum_{n}G_{t-1}(X_{t-1}^{n})}.\end{split}

We study the mean squared error of the numerator:

𝔼{1NnGt1(Xt1n)[mt(Xt1n,Xt1)rt(Xt1)]}2\displaystyle\mathbb{E}\left\{\frac{1}{N}\sum_{n}G_{t-1}(X_{t-1}^{n})\left[m_{t}(X_{t-1}^{n},X_{t}^{1})-r_{t}(X_{t}^{1})\right]\right\}^{2}
=1N𝔼{Gt1(Xt11)2[mt(Xt11,Xt1)rt(Xt1)]2}\displaystyle=\frac{1}{N}\mathbb{E}\left\{G_{t-1}(X_{t-1}^{1})^{2}\left[m_{t}(X_{t-1}^{1},X_{t}^{1})-r_{t}(X_{t}^{1})\right]^{2}\right\}
+N(N1)N2𝔼{Gt1(Xt11)Gt1(Xt12)[mt(Xt11,Xt1)rt(Xt1)]\displaystyle\quad+\frac{N(N-1)}{N^{2}}\mathbb{E}\Big{\{}G_{t-1}(X_{t-1}^{1})G_{t-1}(X_{t-1}^{2})\left[m_{t}(X_{t-1}^{1},X_{t}^{1})-r_{t}(X_{t}^{1})\right]
×[mt(Xt12,Xt1)rt(Xt1)]}\displaystyle\qquad\times\left[m_{t}(X_{t-1}^{2},X_{t}^{1})-r_{t}(X_{t}^{1})\right]\Big{\}}

where we have again used the exchangeability induced by step ()(\star) of Algorithm 5. The first term obviously tends to 0 as NN\to\infty by Assumptions 4 and 1. The second term also vanishes asymptotically thanks to Lemma E.6 below and Assumption 6. Assumption 1 also implies that the denominator of (62) converges in probability to some constant, via the consistency of particle approximations, see e.g. Del Moral, (2004) or Chopin and Papaspiliopoulos, (2020). Thus, rtN(Xt1)rt(Xt1)0r_{t}^{N}(X_{t}^{1})-r_{t}(X_{t}^{1})\Rightarrow 0 by Slutsky’s theorem. Moreover, rt(Xt1)rt(Xt,PaRIS)r_{t}(X_{t}^{1})\Rightarrow r_{t}(X_{t}^{\infty,\mathrm{PaRIS}}) by the continuity of rtr_{t} and the consistency of particle approximations. Using again Slutsky’s theorem yields (61). ∎

The following lemma is needed to complete the proof of Proposition 6 and is related to the propagation of chaos property, see Del Moral, (2004, Chapter 8).

Lemma E.6.

We have $(X_{t-1}^{1},X_{t-1}^{2},X_{t}^{1})\Rightarrow\mathbb{Q}_{t-2}M_{t-1}\otimes\mathbb{Q}_{t-2}M_{t-1}\otimes\mathbb{Q}_{t-1}M_{t}$.

Proof.

For vectors uu, vv, and ww, we have, by the symmetry of the distribution of particles:

𝔼[exp(iuXt11+ivXt12+iwXt1)]\displaystyle\mathbb{E}\left[\exp\left(iuX_{t-1}^{1}+ivX_{t-1}^{2}+iwX_{t}^{1}\right)\right]
=𝔼[(1NeiuXt1n)(1NeivXt1n)(1NeiwXtn)]\displaystyle=\mathbb{E}\left[\left(\frac{1}{N}\sum e^{iuX_{t-1}^{n}}\right)\left(\frac{1}{N}\sum e^{ivX_{t-1}^{n}}\right)\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right]
NN2𝔼[eiuXt11eivXt11(1NeiwXtn)]\displaystyle\quad-\frac{N}{N^{2}}\mathbb{E}\left[e^{iuX_{t-1}^{1}}e^{ivX_{t-1}^{1}}\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right]
+NN2𝔼[eiuXt11eivXt12(1NeiwXtn)].\displaystyle\quad+\frac{N}{N^{2}}\mathbb{E}\left[e^{iuX_{t-1}^{1}}e^{ivX_{t-1}^{2}}\left(\frac{1}{N}\sum e^{iwX_{t}^{n}}\right)\right].

Note that

1NeiuXt1na.s.t2Mt1(exp(iu))\frac{1}{N}\sum e^{iuX_{t-1}^{n}}\overset{\textrm{a.s.}}{\longrightarrow}\mathbb{Q}_{t-2}M_{t-1}\left(\exp(iu\bullet)\right)

and

1NeiwXtna.s.t1Mt(exp(iw)).\frac{1}{N}\sum e^{iwX_{t}^{n}}\overset{\textrm{a.s.}}{\longrightarrow}\mathbb{Q}_{t-1}M_{t}\left(\exp(iw\bullet)\right).

The dominated convergence theorem, applicable since |eiu|1\left|e^{iu}\right|\leq 1 for uu\in\mathbb{R}, finishes the proof. ∎

Proof of Theorem 3.1.

First of all,

(63) 𝔼[τt,PaRIS]=𝔼[M¯hrt(Xt,PaRIS)]=𝒳tM¯hrt(xt)rt(xt)dxt=\mathbb{E}[\tau_{t}^{\infty,\mathrm{PaRIS}}]=\mathbb{E}\left[\frac{\bar{M}_{h}}{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}\right]=\int_{\mathcal{X}_{t}}\frac{\bar{M}_{h}}{r_{t}(x_{t})}r_{t}(x_{t})\mathrm{d}x_{t}=\infty

by Assumption 5. Next, for any xx\in\mathbb{R}\setminus\mathbb{Z} and N>xN>x,

(min(τt1,PaRIS,N)x)=(τt1,PaRISx)(τt,PaRISx)\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\leq x\right)=\mathbb{P}\left(\tau_{t}^{1,\mathrm{PaRIS}}\leq x\right)\to\mathbb{P}\left(\tau_{t}^{\infty,\mathrm{PaRIS}}\leq x\right)

by Proposition 6. Thus, by Portmanteau theorem,

(64) min(τt1,PaRIS,N)τt,PaRIS.\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}}.

Altogether, we have

lim infN𝔼[min(τt1,PaRIS,N)]\displaystyle\liminf_{N\to\infty}\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right] =lim infNk(min(τt1,PaRIS,N)=k)\displaystyle=\liminf_{N\to\infty}\sum k\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)=k\right)
lim infNk(min(τt1,PaRIS,N)=k) by Fatou’s lemma\displaystyle\geq\sum\liminf_{N\to\infty}k\mathbb{P}\left(\min(\tau_{t}^{1,\mathrm{PaRIS}},N)=k\right)\textrm{ by Fatou's lemma}
=k(τt,PaRIS=k) by (64)\displaystyle=\sum k\mathbb{P}\left(\tau_{t}^{\infty,\mathrm{PaRIS}}=k\right)\textrm{ by \eqref{eq:proof_thm2_tnn_to_tinf}}
= by (63)\displaystyle=\infty{\textrm{ by \eqref{eq:proof_thm2_expectation_tauinf}}}

and

\lim_{N\to\infty}\frac{1}{N}\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]=\lim_{N\to\infty}\mathbb{E}\left[\min\left(\frac{\tau_{t}^{1,\mathrm{PaRIS}}}{N},1\right)\right]=0

since τt1,PaRISτt,PaRIS\tau_{t}^{1,\mathrm{PaRIS}}\Rightarrow\tau_{t}^{\infty,\mathrm{PaRIS}} implies that the sequence of random variables

min(τt1,PaRISN,1)\min\left(\frac{\tau_{t}^{1,\mathrm{PaRIS}}}{N},1\right)

converges to 0 in distribution while being bounded between 0 and 11. ∎

E.6. Proof of Theorem  3.2 (hybrid PaRIS near-linear complexity)

The following proposition shows that the real execution time for the hybrid algorithm is asymptotically at most of the same order as the “oracle” hybrid execution time.

Proposition 7.

We have

lim supN𝔼[min(τt1,PaRIS,N)]𝔼[min(τt,PaRIS,N)]<.\limsup_{N\to\infty}\frac{\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right]}{\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right]}<\infty.
Proof.

Put

(65) zN(λ):=1(1λ)Nλ=n=0N1(1λ)n.z^{N}(\lambda):=\frac{1-(1-\lambda)^{N}}{\lambda}=\sum_{n=0}^{N-1}(1-\lambda)^{n}.

One can quickly verify (using the memorylessness of the geometric distribution for example) that zN(λ)=𝔼[min(G,N)|GGeo(λ)]z^{N}(\lambda)=\mathbb{E}\left[\left.{\min(G,N)}\right|{G\sim\operatorname{Geo}(\lambda)}\right]. It will be useful to keep in mind the elementary estimate zN(λ)min(N,λ1)z^{N}(\lambda)\leq\min(N,\lambda^{-1}). We can now write

𝔼[min(τt1,PaRIS,N)]\displaystyle\mathbb{E}\left[\min(\tau_{t}^{1,\mathrm{PaRIS}},N)\right] =𝔼[zN(rtN(Xt1)M¯h)] (by (60))=𝔼[𝔼[zN(rtN(Xt1)M¯h)|t1]]\displaystyle=\mathbb{E}\left[z^{N}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)\right]\textrm{ (by \eqref{eq:dist_tauN_with_r})}=\mathbb{E}\left[\mathbb{E}\left[\left.{z^{N}\left(\frac{r_{t}^{N}(X_{t}^{1})}{\bar{M}_{h}}\right)}\right|{\mathcal{F}_{t-1}}\right]\right]
=𝔼[𝒳tzN(rtN(xt)M¯h)rtN(xt)λt(dxt)]\displaystyle=\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]
ct(𝒳tzN(rt(xt)M¯h)rt(xt)λt(dxt)+bt) by Lemma E.7\displaystyle\leq c_{t}\left(\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}(x_{t})\lambda_{t}(\mathrm{d}x_{t})+b_{t}\right)\text{ by Lemma~{}\ref{lem:magic_jensen}}
=ct(𝔼[min(τt,PaRIS,N)]+bt)\displaystyle=c_{t}\left(\mathbb{E}[\min(\tau_{t}^{{\infty,\mathrm{PaRIS}}},N)]+b_{t}\right)

from which the proposition is immediate. ∎
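As a quick sanity check of the identity $z^{N}(\lambda)=\mathbb{E}\left[\min(G,N)\,|\,G\sim\operatorname{Geo}(\lambda)\right]$ and of the estimate $z^{N}(\lambda)\leq\min(N,\lambda^{-1})$ used above, one may run the following Python snippet (assuming NumPy; the $\operatorname{Geo}(\lambda)$ convention counts the number of trials, with support $\{1,2,\ldots\}$, which matches `numpy.random.Generator.geometric`).

```python
import numpy as np

def zN(lam, N):
    """z^N(lambda) = (1 - (1 - lambda)^N) / lambda, cf. (65)."""
    return (1.0 - (1.0 - lam) ** N) / lam

rng = np.random.default_rng(0)
N = 50
for lam in (0.02, 0.1, 0.5):
    g = rng.geometric(lam, size=10**6)          # G ~ Geo(lam), support {1, 2, ...}
    mc = np.minimum(g, N).mean()                # Monte Carlo estimate of E[min(G, N)]
    # zN and the Monte Carlo estimate should agree up to Monte Carlo error,
    # and both are bounded by min(N, 1/lam)
    print(lam, zN(lam, N), mc, min(N, 1.0 / lam))
```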

Lemma E.7.

In addition to the notations of Algorithm 1, let the function $z^{N}$ be defined as in (65) and the functions $r_{t}$ and $r_{t}^{N}$ be defined as in Definition 1. Let $\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}_{\geq 0}$ be a bounded non-negative deterministic function. Then, under Assumptions 1 and 4, there exist constants $b_{t}$ and $c_{t}$ depending only on the model such that

𝔼[𝒳tzN(rtN(xt)M¯h)rtNϕt]ct(𝒳tzN(rt(xt)M¯h)rtϕt+btϕt)\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right]\leq c_{t}\left(\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\phi_{t}+b_{t}\left\lVert\phi_{t}\right\rVert_{\infty}\right)

where for brevity, we shortened the integration notation (e.g. dropping λt(dxt)\lambda_{t}(\mathrm{d}x_{t}), dropping xtx_{t} from ϕ(xt)\phi(x_{t}), etc.) whenever there is no ambiguity.

Proof.

We have

(66) 𝔼[𝒳tzN(rtN(xt)M¯h)rtNϕt]𝒳tzN(𝔼[rtN(xt)M¯h])𝔼[rtN(xt)]ϕt\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right]\leq\int_{\mathcal{X}_{t}}z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\mathbb{E}\left[r_{t}^{N}(x_{t})\right]\phi_{t}

using Fubini's theorem and Jensen's inequality applied to the concave function $\lambda\mapsto\lambda z^{N}(\lambda)$ on $[0,1]$. By a well-known result on the bias of a particle filter (which is in fact the propagation of chaos property in the special case of $q=1$ particle), we have:

|𝔼[rtN(xt)]rt(xt)|\displaystyle\left|\mathbb{E}\left[r_{t}^{N}(x_{t})\right]-r_{t}(x_{t})\right| =|𝔼[Wt1nmt(Xt1n,xt)]rt(xt)|\displaystyle=\left|\mathbb{E}\left[\sum W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})\right]-r_{t}(x_{t})\right|
=|𝔼[mt(Xt1At1,xt)]t1(mt(,xt))|\displaystyle=\left|\mathbb{E}\left[m_{t}\left(X_{t-1}^{A_{t}^{1}},x_{t}\right)\right]-\mathbb{Q}_{t-1}\left(m_{t}\left(\bullet,x_{t}\right)\right)\right|
btM¯hN\displaystyle\leq\frac{b_{t}\bar{M}_{h}}{N}

for some constant btb_{t}. We next show that such a bias does not change the asymptotic behavior of zNz^{N}. More precisely,

(67) zN(𝔼[rtN(xt)M¯h])\displaystyle z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right) zN(rt(xt)M¯hbtN)\displaystyle\leq z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}-\frac{b_{t}}{N}\right)
=n=0N1(1rt(xt)/M¯h+bt/N1rt(xt)/M¯h)n(1rt(xt)M¯h)n\displaystyle=\sum_{n=0}^{N-1}\left(\frac{1-{r_{t}(x_{t})}/{\bar{M}_{h}}+{b_{t}}/{N}}{1-{r_{t}(x_{t})}/{\bar{M}_{h}}}\right)^{n}\left(1-\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)^{n}
n=0N1(1+btN(1rt(xt)/M¯h))N(1rt(xt)M¯h)n\displaystyle\leq\sum_{n=0}^{N-1}\left(1+\frac{b_{t}}{N\left(1-r_{t}(x_{t})/\bar{M}_{h}\right)}\right)^{N}\left(1-\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)^{n}
exp(bt1rt(xt)/M¯h)zN(rt(xt)M¯h)\displaystyle\leq\exp\left(\frac{b_{t}}{1-r_{t}(x_{t})/\bar{M}_{h}}\right)z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)
e2btzN(rt(xt)M¯h)\displaystyle\leq e^{2b_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)

if xtx_{t} is such that rt(xt)/M¯h1/2r_{t}(x_{t})/\bar{M}_{h}\leq 1/2. In contrast, if rt(xt)/M¯h1/2r_{t}(x_{t})/\bar{M}_{h}\geq 1/2, then provided that N6btN\geq 6b_{t}, we have

(68) zN(𝔼[rtN(xt)M¯h])zN(rt(xt)M¯hbtN)zN(13)3zN(rt(xt)M¯h).z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\leq z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}-\frac{b_{t}}{N}\right)\leq z^{N}\left(\frac{1}{3}\right)\leq 3z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right).

Putting together (67) and (68), we have, for N6btN\geq 6b_{t},

zN(𝔼[rtN(xt)M¯h])(e2bt+3)zN(rt(xt)M¯h)z^{N}\left(\mathbb{E}\left[\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right]\right)\leq\left(e^{2b_{t}}+3\right)z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)

and so, by (66),

𝔼[𝒳tzN(rtN(xt)M¯h)rtNϕt]\displaystyle\mathbb{E}\left[\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\phi_{t}\right] (e2bt+3)𝒳tzN(rt(xt)M¯h)𝔼[rtN(xt)]ϕt\displaystyle\leq\left(e^{2b_{t}}+3\right)\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)\mathbb{E}\left[r_{t}^{N}(x_{t})\right]\phi_{t}
=(e2bt+3)𝔼[zN(rt(Xt1)M¯h)ϕt(Xt1)].\displaystyle=\left(e^{2b_{t}}+3\right)\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{1})}{\bar{M}_{h}}\right)\phi_{t}(X_{t}^{1})\right].

Again, using the result on the bias of a particle filter,

|𝔼[zN(rt(Xt1)M¯h)ϕt(Xt1)]𝒳tzN(rt(xt)M¯h)rtϕt|btzNϕtN=btϕt\left|\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{1})}{\bar{M}_{h}}\right)\phi_{t}(X_{t}^{1})\right]-\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\phi_{t}\right|\leq\frac{b_{t}\left\lVert z^{N}\right\rVert_{\infty}\left\lVert\phi_{t}\right\rVert_{\infty}}{N}=b_{t}\left\lVert\phi_{t}\right\rVert_{\infty}

which, together with the previous inequality, implies the desired result. ∎

Proposition 8.

In linear Gaussian state space models, we have

𝔼[min(τt,PaRIS,N)]=𝒪((logN)dt/2).\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right]=\mathcal{O}\left((\log N)^{d_{t}/2}\right).
Proof.

Let μt\mu_{t} and Σt\Sigma_{t} be such that Xt,PaRIS𝒩(μt,Σt)X_{t}^{\infty,\mathrm{PaRIS}}\sim\mathcal{N}(\mu_{t},\Sigma_{t}). Then

log(rt(Xt,PaRIS)/M¯h)=btWt\log(r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})/\bar{M}_{h})=b^{\prime}_{t}-W_{t}

where btb^{\prime}_{t} is some constant and

Wt:=(Xt,PaRISμt)Σt1(Xt,PaRISμt)2Gamma(dt2,1).W_{t}:=\frac{(X_{t}^{\infty,\mathrm{PaRIS}}-\mu_{t})^{\top}\Sigma_{t}^{-1}(X_{t}^{\infty,\mathrm{PaRIS}}-\mu_{t})}{2}\sim\operatorname{Gamma}\left(\frac{d_{t}}{2},1\right).

We have

𝔼[min(τt,PaRIS,N)]\displaystyle\mathbb{E}\left[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)\right] =𝔼[zN(rt(Xt,PaRIS)M¯h)]=𝔼[zN(ebtWt)]\displaystyle=\mathbb{E}\left[z^{N}\left(\frac{r_{t}(X_{t}^{\infty,\mathrm{PaRIS}})}{\bar{M}_{h}}\right)\right]=\mathbb{E}\left[z^{N}(e^{b^{\prime}_{t}-W_{t}})\right]
=0zN(ebtw)wdt/21ewΓ(dt/2)dw\displaystyle=\int_{0}^{\infty}z^{N}(e^{b^{\prime}_{t}-w})\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w
0logNewbtwdt/21ewΓ(dt/2)dw+logNNwdt/21ewΓ(dt/2)dw\displaystyle\leq\int_{0}^{\log N}e^{w-b^{\prime}_{t}}\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w+\int_{\log N}^{\infty}N\frac{w^{d_{t}/2-1}e^{-w}}{\Gamma(d_{t}/2)}\mathrm{d}w

using the bound zN(λ)min(N,1/λ)z^{N}(\lambda)\leq\min(N,1/\lambda). The first term is of order 𝒪(logdt/2N)\mathcal{O}(\log^{d_{t}/2}N) by elementary calculus, while the second term is of order 𝒪(logdt/21N)\mathcal{O}(\log^{d_{t}/2-1}N) using asymptotic properties of the incomplete Gamma function, see Olver et al., (2010, Section 8.11). ∎
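To illustrate Proposition 8 numerically, the following Python sketch (assuming NumPy; a standard normal predictive distribution in dimension $d$, with $\bar{M}_{h}$ set to the maximum of its density, is an illustrative placeholder) estimates $\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)]=\mathbb{E}[z^{N}(r_{t}(X)/\bar{M}_{h})]$ by Monte Carlo and prints its ratio to $(\log N)^{d/2}$, which should remain roughly bounded as $N$ grows.

```python
import numpy as np

def zN(lam, N):
    """z^N(lambda) = (1 - (1 - lambda)^N) / lambda, cf. (65)."""
    return (1.0 - (1.0 - lam) ** N) / lam

rng = np.random.default_rng(0)
d = 2
log_Mh = -0.5 * d * np.log(2 * np.pi)                    # max of the N(0, I_d) log-density
for N in (10**2, 10**3, 10**4, 10**5):
    x = rng.normal(size=(10**5, d))                      # X ~ N(0, I_d) (illustrative r_t)
    log_rt = -0.5 * (x ** 2).sum(axis=1) + log_Mh        # log r_t(X)
    lam = np.exp(log_rt - log_Mh)                        # r_t(X) / M_h, in (0, 1]
    est = zN(lam, N).mean()                              # Monte Carlo estimate of E[min(tau, N)]
    print(N, est, est / np.log(N) ** (d / 2))            # last column should stay roughly bounded
```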

Proof of Theorem 3.2.

The theorem is a straightforward consequence of Proposition 7 and Proposition 8. ∎

E.7. Proof of Theorems  B.1 and B.2 (pure rejection FFBS complexity)

We start with a useful remark linking the projection kernels Π\Pi and the cost-to-go functions defined in Supplement A with the L-kernels formulated in (27). The proof is simple and therefore omitted.

Lemma E.8.

We have Lt:T(x0:t,𝟙)=Ht:T(xt)L_{t:T}(x_{0:t},\mathbbm{1})=H_{t:T}(x_{t}) for all x0:tx_{0:t}. Moreover, for any function ϕt:𝒳t\phi_{t}:\mathcal{X}_{t}\to\mathbb{R}, we have

Lt:TΠt0:Tϕt=Πt0:t(ϕt×Ht:T).L_{t:T}\Pi^{0:T}_{t}\phi_{t}=\Pi^{0:t}_{t}(\phi_{t}\times H_{t:T}).

Theorems B.1 and B.2 both rely on an induction argument wrapped up in the following proposition.

Proposition 9.

We use the notations of Algorithm 2. Let tN\mathbb{Q}_{t}^{N} be defined as in (24), where the BsNB_{s}^{N} kernels can be BsN,FFBSB_{s}^{N,\mathrm{FFBS}} or any other kernels satisfying the hypotheses of Theorem 2.1. Suppose that Assumption 1 holds. Let ftN:𝒳t0f_{t}^{N}:\mathcal{X}_{t}\to\mathbb{R}_{\geq 0} be a (possibly random) function such that ftN(xt)f_{t}^{N}(x_{t}) is t1\mathcal{F}_{t-1}-measurable. Then the following assertions are true:

(a)

    Suppose that 𝔼[𝒳t{rtN×ftN×Gt×Ht:T}(xt)λt(dxt)]=\mathbb{E}\left[\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]=\infty, where rtNr_{t}^{N} and λt\lambda_{t} are defined in Definition 1. Then

    𝔼[TN(dxt)ftN(xt)]=.\mathbb{E}\left[\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})\right]=\infty.
(b)

    Suppose that 𝒳t{rtN×ftN×Gt×Ht:T}(xt)λt(dxt)0\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0. Then

    TN(dxt)ftN(xt)0.\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0.
Proof.

Part (a). We shall prove by induction the statement

𝔼[sNLs:TΠt0:TftN]=,t1sT.\mathbb{E}\left[\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right]=\infty,\forall\ t-1\leq s\leq T.

For s=t1s=t-1, it follows from part (a)’s hypothesis and Lemma E.8. Indeed,

t1NLt1:TΠt0:TftN\displaystyle\mathbb{Q}_{t-1}^{N}L_{t-1:T}\Pi_{t}^{0:T}f_{t}^{N}
=t1NLt1:tLt:TΠt0:TftN=t1NLt1:tΠt0:t(ftN×Ht:T)\displaystyle=\mathbb{Q}_{t-1}^{N}L_{t-1:t}L_{t:T}\Pi_{t}^{0:T}f_{t}^{N}=\mathbb{Q}_{t-1}^{N}L_{t-1:t}\Pi_{t}^{0:t}(f_{t}^{N}\times H_{t:T})
=𝒳t1×𝒳tt1N(dxt1)mt(xt1,xt)λt(dxt)Gt(xt)(ftN×Ht:T)(xt)\displaystyle=\iint_{\mathcal{X}_{t-1}\times\mathcal{X}_{t}}\mathbb{Q}_{t-1}^{N}(\mathrm{d}x_{t-1})m_{t}(x_{t-1},x_{t})\lambda_{t}(\mathrm{d}x_{t})G_{t}(x_{t})(f_{t}^{N}\times H_{t:T})(x_{t})
=𝒳t{rtN×ftN×Gt×Ht:T}(xt)λt(dxt).\displaystyle=\int_{\mathcal{X}_{t}}\left\{r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T}\right\}(x_{t})\lambda_{t}(\mathrm{d}x_{t}).

For sts\geq t, we have

𝔼[sNLs:TΠt0:TftN]\displaystyle\mathbb{E}\left[\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right] =𝔼[N1K~sN(n,Ls:TΠt0:TftN)sN] by Corollary 3\displaystyle=\mathbb{E}\left[\frac{{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})}{\ell_{s}^{N}}\right]\text{ by Corollary~{}\ref{corol:fundamental}}
1Gs𝔼[N1K~sN(n,Ls:TΠt0:TftN)]\displaystyle\geq\frac{1}{\left\lVert G_{s}\right\rVert_{\infty}}\mathbb{E}\left[{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})\right]
by Assumption 1 and definition of sN (see Algorithm 1)\displaystyle\text{by Assumption~{}\ref{asp:Gbound} and definition of }\ell_{s}^{N}\text{ (see Algorithm~{}\ref{algo:bootstrap})}
1Gs𝔼[s1NLs1:sLs:TΠt0:TftN]\displaystyle\geq\frac{1}{\left\lVert G_{s}\right\rVert_{\infty}}\mathbb{E}\left[\mathbb{Q}_{s-1}^{N}L_{s-1:s}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\right]
by Corollary 3 and law of total expectation
=𝔼[s1NLs1:TΠt0:TftN]= (induction hypothesis).\displaystyle=\mathbb{E}\left[\mathbb{Q}_{s-1}^{N}L_{{s-1}:T}\Pi^{0:T}_{t}f_{t}^{N}\right]=\infty\text{ (induction hypothesis).}

Part (b). As in part (a), we prove by induction the statement

sNLs:TΠt0:TftN0,t1sT.\mathbb{Q}_{s}^{N}L_{s:T}\Pi^{0:T}_{t}f_{t}^{N}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0,\forall\ t-1\leq s\leq T.

Again, by Corollary 3, this quantity is equal to

N1K~sN(n,Ls:TΠt0:TftN)sN,\frac{{N}^{-1}\sum\tilde{K}_{s}^{N}(n,L_{s:T}\Pi^{0:T}_{t}f_{t}^{N})}{\ell_{s}^{N}},

and the expectation of the numerator given s1\mathcal{F}_{s-1} is s1NLs1:TΠt0:TftN\mathbb{Q}_{s-1}^{N}L_{s-1:T}\Pi_{t}^{0:T}f_{t}^{N}, which tends to 0 in probability by the induction hypothesis. Lemma E.11 (stated at the end of this section), the classical result sNs:=s1Ms(Gs)\ell_{s}^{N}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}\ell_{s}:=\mathbb{Q}_{s-1}M_{s}(G_{s}) and Slutsky's theorem then conclude the proof. ∎

Proof of Theorem B.1.

By (21), we have 𝔼[τt1,FFBS]=𝔼[TN,FFBS(dxt)ftN(xt)]\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\mathbb{E}[\int\mathbb{Q}_{T}^{N,\mathrm{FFBS}}(\mathrm{d}x_{t})f_{t}^{N}(x_{t})] where

ftN(xt)=M¯hWt1nmt(Xt1n,xt)=M¯hrtNf_{t}^{N}(x_{t})=\frac{\bar{M}_{h}}{\sum W_{t-1}^{n}m_{t}(X_{t-1}^{n},x_{t})}=\frac{\bar{M}_{h}}{r_{t}^{N}}

with rtNr_{t}^{N} given in Definition 1. Proposition 9(a) gives a sufficient condition for 𝔼[τt1,FFBS]=\mathbb{E}[\tau_{t}^{1,\mathrm{FFBS}}]=\infty to hold, namely

𝒳t(rtN×ftN×Gt×Ht:T)(xt)λt(dxt)=,\int_{\mathcal{X}_{t}}(r_{t}^{N}\times f_{t}^{N}\times G_{t}\times H_{t:T})(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty,

which is equivalent to the hypothesis of the theorem. ∎

Proof of Theorem B.2.

We use the notations of Definition 1 and Supplement A.1, and write 𝒩(x|μ,Σ)\mathcal{N}(x|\mu,\Sigma) for the density at point xx of the normal distribution with the indicated mean and covariance. Using Lemma E.9, Proposition 9 and (21), we have

𝔼[(τt1,FFBS)k]=\displaystyle\mathbb{E}[(\tau_{t}^{1,\mathrm{FFBS}})^{k}]=\infty 𝔼[1rtN(Xtt1)k]=\displaystyle\Leftrightarrow\mathbb{E}\left[\frac{1}{r_{t}^{N}(X_{t}^{\mathcal{I}_{t}^{1}})^{k}}\right]=\infty
𝔼[TN(dxt)1rtN(xt)k]=\displaystyle\Leftrightarrow\mathbb{E}\left[\int\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})\frac{1}{r_{t}^{N}(x_{t})^{k}}\right]=\infty
𝔼[𝒳trtNGtHt:T(rtN)k(xt)λt(dxt)]=\displaystyle\Leftarrow\mathbb{E}\left[\int_{\mathcal{X}_{t}}\frac{r_{t}^{N}G_{t}H_{t:T}}{(r_{t}^{N})^{k}}(x_{t})\lambda_{t}(\mathrm{d}x_{t})\right]=\infty
𝒳trtGtHt:T(rtN)k1rt(xt)λt(dxt)= almost surely\displaystyle\Leftarrow\int_{\mathcal{X}_{t}}\frac{r_{t}G_{t}H_{t:T}}{(r_{t}^{N})^{k-1}r_{t}}(x_{t})\lambda_{t}(\mathrm{d}x_{t})=\infty\text{ almost surely}
𝒳t𝒩(xt|μtsmth,Σtsmth)rtN(xt)k1𝒩(xt|μtpred,Σtpred)λt(dxt)= a.s.\displaystyle\Leftrightarrow\int_{\mathcal{X}_{t}}\frac{\mathcal{N}(x_{t}|\mu_{t}^{\mathrm{smth}},\Sigma^{\mathrm{smth}}_{t})}{r_{t}^{N}(x_{t})^{k-1}\mathcal{N}(x_{t}|\mu_{t}^{\mathrm{pred}},\Sigma^{\mathrm{pred}}_{t})}\lambda_{t}(\mathrm{d}x_{t})=\infty\text{ a.s. }

The theorem then follows from elementary arguments, by noting that rtNr_{t}^{N} is a mixture of NN Gaussian distributions with covariance matrix CXC_{X}. ∎

Lemma E.9.

Let LL be a ]0,1]]0,1]-valued random variable. Suppose XX is another random variable such that X|LGeo(L)X|L\sim\operatorname{Geo}(L). Then for any real number k>0k>0,

𝔼[Xk]=𝔼[Lk]=.\mathbb{E}[X^{k}]=\infty\Leftrightarrow\mathbb{E}[L^{-k}]=\infty.
Proof.

By the definition of XX, we have

𝔼[Xk]=𝔼[x=1xk(1L)x1L].\mathbb{E}[X^{k}]=\mathbb{E}\left[\sum_{x=1}^{\infty}x^{k}(1-L)^{x-1}L\right].

A natural idea is then to approximate the sum by the integral 0xk(1L)x1Ldx\int_{0}^{\infty}x^{k}(1-L)^{x-1}L\mathrm{d}x, from which one easily extracts the LkL^{-k} factor. This is however technically laborious, since the function xxk(1L)x1Lx\mapsto x^{k}(1-L)^{x-1}L is not monotone on the whole real line; it only becomes monotone beyond a certain x0x_{0} which itself depends on LL. We therefore write instead

𝔼[Xk]=0(Xkx)dx=0(Xx1/k)dx=0𝔼[(1L)x1/k]dxwhere the two integrands are equal Lebesgue–almost-everywhere=𝔼[0exp(|log(1L)|x1/k)dx]\begin{split}\mathbb{E}[X^{k}]&=\int_{0}^{\infty}\mathbb{P}(X^{k}\geq x)\mathrm{d}x=\int_{0}^{\infty}\mathbb{P}(X\geq x^{1/k})\mathrm{d}x\\ &=\int_{0}^{\infty}\mathbb{E}\left[(1-L)^{\lfloor x^{1/k}\rfloor}\right]\mathrm{d}x\\ &\text{where the two integrands are equal Lebesgue--almost-everywhere}\\ &=\mathbb{E}\left[\int_{0}^{\infty}\exp\left(-\left|\log(1-L)\right|\lfloor x^{1/k}\rfloor\right)\mathrm{d}x\right]\end{split}

with the natural interpretation of expressions when L=1L=1. Using uvu\sim v as a shorthand for “uu and vv are either both finite or both infinite”, we have

𝔼[Xk]𝔼[0exp(|log(1L)|x1/k)dx] by Lemma E.10=kΓ(k)𝔼[1|log(1L)|k]𝔼[1Lk] by Lemma E.10 again.\begin{split}\mathbb{E}[X^{k}]&\sim\mathbb{E}\left[\int_{0}^{\infty}\exp\left(-\left|\log(1-L)\right|{x^{1/k}}\right)\mathrm{d}x\right]\text{ by Lemma~{}\ref{lem:equivalent_in_01}}\\ &=k\ \Gamma(k)\ \mathbb{E}\left[\frac{1}{\left|\log(1-L)\right|^{k}}\right]\sim\mathbb{E}\left[\frac{1}{L^{k}}\right]\text{ by Lemma~{}\ref{lem:equivalent_in_01} again.}\end{split}
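As a sanity check of Lemma E.9 (an illustration only, not part of the proof), one may take the hypothetical choice L ~ Beta(a, 1), for which E[L^{-k}] is finite precisely when a > k, and monitor the empirical mean of X^k: it stabilises when a > k and keeps drifting upwards otherwise.

```python
# Monte Carlo illustration of Lemma E.9 with the hypothetical choice
# L ~ Beta(a, 1), for which E[L^{-k}] is finite iff a > k.
import numpy as np

rng = np.random.default_rng(0)
k = 1.0
for a in [2.0, 0.5]:  # a > k: finite k-th moment of X; a < k: infinite
    estimates = []
    for n in [10 ** 4, 10 ** 5, 10 ** 6]:
        L = rng.beta(a, 1.0, size=n)
        X = rng.geometric(L).astype(float)  # X | L ~ Geo(L), support {1, 2, ...}
        estimates.append(np.mean(X ** k))
    print(f"a = {a}:", estimates)  # stabilises for a = 2.0, keeps growing for a = 0.5
```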

The following lemma is elementary. Its proof is therefore omitted.

Lemma E.10.

Let LL be a ]0,1]]0,1]-valued random variable and let f1f_{1} and f2f_{2} be two continuous functions from ]0,1]]0,1] to \mathbb{R}. Suppose that lim sup0+f1()/f2()\limsup_{\ell\to 0^{+}}f_{1}(\ell)/f_{2}(\ell) and lim sup0+f2()/f1()\limsup_{\ell\to 0^{+}}f_{2}(\ell)/f_{1}(\ell) are both finite. Then 𝔼[f1(L)]\mathbb{E}[f_{1}(L)] is finite if and only if 𝔼[f2(L)]\mathbb{E}[f_{2}(L)] is so.

Lemma E.11.

Let Z1,Z2,Z_{1},Z_{2},\ldots be non-negative random variables. Suppose that there exist σ\sigma-algebras 1,2,\mathcal{F}_{1},\mathcal{F}_{2},\ldots such that 𝔼[Zn|n]0\mathbb{E}[Z_{n}|\mathcal{F}_{n}]\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0. Then Zn0Z_{n}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0.

Proof.

Fix ε>0\varepsilon>0. By Markov’s inequality, (Znε|n)ε1𝔼[Zn|n]\mathbb{P}(Z_{n}\geq\varepsilon|\mathcal{F}_{n})\leq{\varepsilon}^{-1}\mathbb{E}[Z_{n}|\mathcal{F}_{n}]. Therefore, the [0,1][0,1]-bounded random variable (Znε|n)\mathbb{P}(Z_{n}\geq\varepsilon|\mathcal{F}_{n}) tends to 0 in probability, hence also in expectation. The law of total expectation then gives (Znε)0\mathbb{P}(Z_{n}\geq\varepsilon)\to 0, which, by varying ε\varepsilon, establishes the convergence of ZnZ_{n} to 0 in probability. ∎

E.8. Proof of Theorem  B.3 and Corollary 2 (hybrid FFBS complexity)

Proof of Theorem B.3.

According to Janson, (2011, Lemma 3), it is sufficient to show that

nmin(τtn,FFBS,N)NαN0\frac{\sum_{n}\min(\tau_{t}^{n,\mathrm{FFBS}},N)}{N\alpha_{N}}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0

for any deterministic sequence αN\alpha_{N} such that αN/𝔼[min(τt,FFBS,N)]{\alpha_{N}}/{\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]}\to\infty. By Lemma E.11, it suffices to show that the conditional expectation of this quantity given T\mathcal{F}_{T} converges to 0 in probability, namely

𝒳tαN1zN(rtN(xt)M¯h)TN(dxt)0 with zN defined in (65)𝒳tαN1zN(rtN(xt)M¯h)rtN×Gt×Ht:Tdλt0 by Proposition 9(b)𝔼[𝒳tαN1zN(rtN(xt)M¯h)rtN×Gt×Ht:Tdλt]0𝒳tαN1zN(rt(xt)M¯h)rt×Gt×Ht:Tdλt0 by Lemma E.7𝔼[min(τt,FFBS,N)]αN0.\begin{split}&\hskip 14.22636pt\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)\mathbb{Q}_{T}^{N}(\mathrm{d}x_{t})\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0\text{ with }z^{N}\text{ defined in \eqref{eq:def_zn}}\\ &\Leftarrow\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\stackrel{{\scriptstyle\mathbb{P}}}{{\rightarrow}}0\text{ by Proposition~{}\ref{prop:ffbs_exec_induction}(b)}\\ &\Leftarrow\mathbb{E}\left[\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}^{N}(x_{t})}{\bar{M}_{h}}\right)r_{t}^{N}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\right]\to 0\\ &\Leftarrow\int_{\mathcal{X}_{t}}{\alpha_{N}}^{-1}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)r_{t}\times G_{t}\times H_{t:T}\ \mathrm{d}\lambda_{t}\to 0\text{ by Lemma~{}\ref{lem:magic_jensen}}\\ &\Leftrightarrow\frac{\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]}{\alpha_{N}}\to 0.\end{split}

The proof is now complete. ∎

Proof of Corollary 2.

Using the cost-to-go functions, the zNz^{N} function and the distribution of τt,PaRIS\tau_{t}^{\infty,\mathrm{PaRIS}}, defined respectively in (20), (65) and (59), we have:

𝔼[min(τt,FFBS,N)]=𝒳tzN(rt(xt)M¯h)T(dxt)=[(t1Mt)(GtHt:T)]1𝒳tzN(rt(xt)M¯h)(GtHt:T)(xt)(t1Mt)(dxt)GtHt:T[(t1Mt)(GtHt:T)]1𝒳tzN(rt(xt)M¯h)(t1Mt)(dxt)=GtHt:T[(t1Mt)(GtHt:T)]1𝔼[min(τt,PaRIS,N)].\begin{split}&\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{FFBS}},N)]=\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)\mathbb{Q}_{T}(\mathrm{d}x_{t})\\ =&\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)(G_{t}H_{t:T})(x_{t})(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})\\ \leq&\left\lVert G_{t}H_{t:T}\right\rVert_{\infty}\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\int_{\mathcal{X}_{t}}z^{N}\left(\frac{r_{t}(x_{t})}{\bar{M}_{h}}\right)(\mathbb{Q}_{t-1}M_{t})(\mathrm{d}x_{t})\\ =&\left\lVert G_{t}H_{t:T}\right\rVert_{\infty}\left[(\mathbb{Q}_{t-1}M_{t})(G_{t}H_{t:T})\right]^{-1}\mathbb{E}[\min(\tau_{t}^{\infty,\mathrm{PaRIS}},N)].\end{split}

Proposition 8 then finishes the proof. ∎

E.9. Proof of Proposition  4 (MCMC kernel properties)

Proof.

Let JtnJ_{t}^{n} be such that Jtn|Xt11:N,Xtn,B^tN,IMH(n,)BtN,IMH(n,)J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot)\sim B_{t}^{N,\mathrm{IMH}}(n,\cdot). Moreover, let KtnK_{t}^{n} be such that

Ktn|Xt11:N,Xtn,Atn,B^tN,IMH(n,)BtN,IMH(n,).K_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},A_{t}^{n},\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot)\sim B_{t}^{N,\mathrm{IMH}}(n,\cdot).

Given Xt11:NX_{t-1}^{1:N}, XtnX_{t}^{n} and AtnA_{t}^{n}, the kernel BtN,IMH(n,)B_{t}^{N,\mathrm{IMH}}(n,\cdot) applies one or several MCMC steps to AtnA_{t}^{n}, each of which leaves BtN,FFBS(n,)B_{t}^{N,\mathrm{FFBS}}(n,\cdot) invariant. Since Atn|Xt11:N,XtnBtN,FFBS(n,)A_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot), it follows that Ktn|Xt11:N,XtnBtN,FFBS(n,)K_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot) as well. On the other hand, JtnJ_{t}^{n} and KtnK_{t}^{n} have the same distribution given Xt11:NX_{t-1}^{1:N}, XtnX_{t}^{n} and B^tN,IMH(n,)\hat{B}_{t}^{N,\mathrm{IMH}}(n,\cdot); hence they also have the same distribution given Xt11:NX_{t-1}^{1:N} and XtnX_{t}^{n} only. This implies that Jtn|Xt11:N,XtnBtN,FFBS(n,)J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}\sim B_{t}^{N,\mathrm{FFBS}}(n,\cdot), which is precisely the conditional distribution of Atn|Xt11:N,XtnA_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n}. Thus (Jtn,Xtn)(J_{t}^{n},X_{t}^{n}) has the same distribution as (Atn,Xtn)(A_{t}^{n},X_{t}^{n}) given Xt11:NX_{t-1}^{1:N}, as required by Theorem 2.1. The arguments for the kernel BtN,IMHPB_{t}^{N,\mathrm{IMHP}} are similar.

To show that a certain kernel BtNB_{t}^{N} satisfies (13), we consider two conditionally i.i.d. simulations Jtn,1J_{t}^{n,1} and Jtn,2J_{t}^{n,2} from BtN(n,)B_{t}^{N}(n,\cdot) and lower bound the probability that they are different. For the kernel BtN,IMHB_{t}^{N,\mathrm{IMH}}, the variables Jtn,1J_{t}^{n,1} and Jtn,2J_{t}^{n,2} both result from one step of MH applied to AtnA_{t}^{n}. Let Jtn,1J_{t}^{n,1*} and Jtn,2J_{t}^{n,2*} be the corresponding MH proposals. A sufficient condition for Jtn,1Jtn,2J_{t}^{n,1}\neq J_{t}^{n,2} is that Jtn,1Jtn,2J_{t}^{n,1*}\neq J_{t}^{n,2*} and the two proposals are both accepted. Each proposal is accepted with probability at least M¯/M¯h\bar{M}_{\ell}/\bar{M}_{h} by Assumption 2, and the probability that Jtn,1Jtn,2J_{t}^{n,1*}\neq J_{t}^{n,2*} is

1n=1N(Wt1n)211N(G¯hG¯)21-\sum_{n=1}^{N}(W_{t-1}^{n})^{2}\geq 1-\frac{1}{N}\left(\frac{\bar{G}_{h}}{\bar{G}_{\ell}}\right)^{2}

by Assumption 3. Thus (13) is satisfied for εS=M¯/2M¯h\varepsilon_{\mathrm{S}}=\bar{M}_{\ell}/2\bar{M}_{h} for NN large enough. Similarly, the probability that Jtn,1Jtn,2J_{t}^{n,1}\neq J_{t}^{n,2} for the BtN,IMHPB_{t}^{N,\mathrm{IMHP}} kernel with N~=2\tilde{N}=2 can be lower-bounded via the probability that J~tn,1J~tn,2\tilde{J}_{t}^{n,1}\neq\tilde{J}_{t}^{n,2} (where J~tn,1\tilde{J}_{t}^{n,1} and J~tn,2\tilde{J}_{t}^{n,2} are defined in (18)). Thus using the same arguments, (13) is satisfied here for εS=M¯/4M¯h\varepsilon_{\mathrm{S}}=\bar{M}_{\ell}/4\bar{M}_{h}. ∎
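To make the move analysed in this proof concrete, here is a minimal sketch (our own simplified illustration, with a hypothetical Gaussian random-walk transition standing in for m_t and generic variable names) of one independence Metropolis-Hastings step on an ancestor index: the proposal is drawn from Categorical(W_{t-1}) and the target is proportional to W_{t-1}^n m_t(X_{t-1}^n, x_t), so the acceptance probability reduces to a ratio of transition densities, and is at least M̄_ℓ/M̄_h under Assumption 2.

```python
# Minimal sketch (simplified illustration, not the exact algorithm of the paper)
# of one independence MH move on an ancestor index, leaving the discrete backward
# distribution with probabilities proportional to W_{t-1}^n * m_t(X_{t-1}^n, x_t)
# invariant. The transition density and variable names below are hypothetical.
import numpy as np

def imh_backward_step(rng, W_prev, X_prev, x_t, a_current, log_mt):
    """Propose an index from Categorical(W_{t-1}); accept with probability
    min(1, m_t(X_{t-1}^{j*}, x_t) / m_t(X_{t-1}^{a}, x_t))."""
    j_star = rng.choice(len(W_prev), p=W_prev)
    log_ratio = log_mt(X_prev[j_star], x_t) - log_mt(X_prev[a_current], x_t)
    return j_star if np.log(rng.uniform()) < log_ratio else a_current

# Toy usage with a stand-in random-walk transition density.
rng = np.random.default_rng(1)
N, d = 100, 1
X_prev = rng.normal(size=(N, d))
W_prev = np.full(N, 1.0 / N)
log_mt = lambda xp, x: -0.5 * np.sum((x - xp) ** 2)
a_new = imh_backward_step(rng, W_prev, X_prev, np.zeros(d), a_current=0, log_mt=log_mt)
```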

E.10. Conditional probability of maximal couplings

In general, there exist multiple maximal couplings of two probability distributions (i.e. couplings that maximise the probability that the two variables are equal). However, they all satisfy a certain conditional probability property, stated in the following proposition. It is closely related to results on the coupling density on the diagonal (see e.g. Wang et al.,, 2021, Lemma 2 or Douc et al.,, 2018, Theorem 19.1.6). Its statement, which we were unable to find in the literature in the exact form we need, is obvious in the discrete case but requires lengthier arguments in the continuous one.

Proposition 10.

Let X1X_{1} and X2X_{2} be two random variables with densities f1f_{1} and f2f_{2} with respect to some dominating measure defined on a space 𝒳\mathcal{X}. Then, the following inequality holds almost surely:

(69) (X2=X1|X1)1f2(X1)f1(X1).\mathbb{P}(X_{2}=X_{1}|X_{1})\leq 1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}.

Moreover, the equality occurs almost surely if and only if X1X_{1} and X2X_{2} form a maximal coupling.

Proof.

Let hh be any non-negative test function from 𝒳\mathcal{X} to \mathbb{R}. Putting

A1\displaystyle A_{1} :={x𝒳f1(x)f2(x)}\displaystyle:=\left\{x\in\mathcal{X}\mid f_{1}(x)\geq f_{2}(x)\right\}
A2\displaystyle A_{2} :={x𝒳f2(x)>f1(x)},\displaystyle:=\left\{x\in\mathcal{X}\mid f_{2}(x)>f_{1}(x)\right\},

we have

𝔼[(X2=X1|X1)h(X1)]=𝔼[𝟙X2=X1h(X1)]=𝔼[𝟙X2=X1𝟙X1A1h(X1)]+𝔼[𝟙X2=X1𝟙X1A2h(X1)]=𝔼[𝟙X2=X1𝟙X2A1h(X2)]+𝔼[𝟙X2=X1𝟙X1A2h(X1)]𝔼[𝟙X2A1h(X2)]+𝔼[𝟙X1A2h(X1)]=h(x)f2f1(x)dx=𝔼[(1f2(X1)f1(X1))h(X1)].\begin{split}\mathbb{E}[\mathbb{P}(X_{2}=X_{1}|X_{1})h(X_{1})]&=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}h(X_{1})]\\ &=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbbm{1}_{X_{1}\in A_{1}}h(X_{1})]+\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &=\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbbm{1}_{X_{2}\in A_{1}}h(X_{2})]+\mathbb{E}[\mathbbm{1}_{X_{2}=X_{1}}\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &\leq\mathbb{E}[\mathbb{1}_{X_{2}\in A_{1}}h(X_{2})]+\mathbb{E}[\mathbb{1}_{X_{1}\in A_{2}}h(X_{1})]\\ &=\int h(x)f_{2}\land f_{1}(x)\mathrm{d}x=\mathbb{E}\left[\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)h(X_{1})\right].\end{split}

The inequality (69) is now proved almost-surely. As a result, almost-sure equality occurs if and only if the expectation of the two sides of (69) are equal, which means, via Lemma A.1, that the two variables are maximally coupled. ∎
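In the discrete case, the statement of Proposition 10 can be checked directly by simulation. The following minimal sketch (arbitrary illustrative distributions on three points, not part of the proof) builds the standard maximal coupling and compares the empirical conditional probabilities P(X2 = X1 | X1 = x) with 1 ∧ f2(x)/f1(x).

```python
# Numerical illustration of Proposition 10 in the discrete case (arbitrary
# example distributions). We build the standard maximal coupling and compare
# the empirical value of P(X2 = X1 | X1 = x) with min(1, f2(x) / f1(x)).
import numpy as np

rng = np.random.default_rng(2)
f1 = np.array([0.5, 0.3, 0.2])
f2 = np.array([0.2, 0.2, 0.6])
overlap = np.minimum(f1, f2)
p_meet = overlap.sum()

def sample_pair():
    if rng.uniform() < p_meet:                      # common component: X1 = X2
        x = rng.choice(3, p=overlap / p_meet)
        return x, x
    # residual components have disjoint supports, hence X1 != X2 on this branch
    x1 = rng.choice(3, p=(f1 - overlap) / (1.0 - p_meet))
    x2 = rng.choice(3, p=(f2 - overlap) / (1.0 - p_meet))
    return x1, x2

pairs = np.array([sample_pair() for _ in range(100_000)])
for x in range(3):
    sel = pairs[:, 0] == x
    print(x, (pairs[sel, 1] == x).mean(), min(1.0, f2[x] / f1[x]))  # the two values match
```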

The following lemma establishes the symmetry of Assumption 8. Again, its statement is obvious in the discrete case, though some work is needed to rigorously justify the continuous one.

Lemma E.12.

Let X1X_{1} and X2X_{2} be two random variables with densities f1f_{1} and f2f_{2} w.r.t. some dominating measure defined on some space 𝒳\mathcal{X}. Suppose that, almost-surely,

(X2=X1|X1)ε(1f2(X1)f1(X1))\mathbb{P}(X_{2}=X_{1}|X_{1})\geq\varepsilon\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)

for some ε>0\varepsilon>0. Then almost-surely,

(X1=X2|X2)ε(1f1(X2)f2(X2)).\mathbb{P}(X_{1}=X_{2}|X_{2})\geq\varepsilon\left(1\land\frac{f_{1}(X_{2})}{f_{2}(X_{2})}\right).
Proof.

We introduce a non-negative test function h:𝒳h:\mathcal{X}\to\mathbb{R} and write

𝔼[(X1=X2|X2)h(X2)]=𝔼[𝟙X1=X2h(X2)]=𝔼[𝟙X2=X1h(X1)]=𝔼[(X2=X1|X1)h(X1)]𝔼[ε(1f2(X1)f1(X1))h(X1)]=εf1f2(x)h(x)dx=𝔼[ε(1f1(X2)f2(X2))h(X2)]\begin{split}\mathbb{E}[\mathbb{P}(X_{1}=X_{2}|X_{2})h(X_{2})]&=\mathbb{E}[\mathbb{1}_{X_{1}=X_{2}}h(X_{2})]=\mathbb{E}[\mathbb{1}_{X_{2}=X_{1}}h(X_{1})]\\ &=\mathbb{E}[\mathbb{P}(X_{2}=X_{1}|X_{1})h(X_{1})]\\ &\geq\mathbb{E}\left[\varepsilon\left(1\land\frac{f_{2}(X_{1})}{f_{1}(X_{1})}\right)h(X_{1})\right]\\ &=\int\varepsilon f_{1}\land f_{2}(x)h(x)\mathrm{d}x\\ &=\mathbb{E}\left[\varepsilon\left(1\land\frac{f_{1}(X_{2})}{f_{2}(X_{2})}\right)h(X_{2})\right]\end{split}

which implies the desired result. ∎

E.11. Proof of Theorem  4.1 (intractable kernel properties)

Proof.

Let JtnJ_{t}^{n} be a random variable such that

Jtn|Xt11:N,Xtn,B^tN,ITR(n,)BtN,ITR(n,).J_{t}^{n}|X_{t-1}^{1:N},X_{t}^{n},\hat{B}_{t}^{N,\mathrm{ITR}}(n,\cdot)\sim B_{t}^{N,\mathrm{ITR}}(n,\cdot).

By construction of Algorithm 7, given Xt11:NX_{t-1}^{1:N}, the couple (Jtn,Xtn)(J_{t}^{n},X_{t}^{n}) has the same distribution as (Atn,L,Xtn,L)(A_{t}^{n,L},X_{t}^{n,L}). Thus, BtN,ITRB_{t}^{N,\mathrm{ITR}} satisfies the hypotheses of Theorem 2.1. To verify (13), we define the variables Jtn,1:2J_{t}^{n,1:2} accordingly and write:

(Jtn,1Jtn,2|Xtn=xt,Xt11:N=xt11:N)=12(Xtn,1=Xtn,2,Atn,1Atn,2|Xtn=xt,Xt11:N=xt11:N)=(Xtn,1=Xtn,2,Atn,1Atn,2,L=1|Xtn=xt,Xt11:N=xt11:N)=12(Xtn,1=Xtn,2,Atn,1Atn,2|L=1,Xtn=xt,Xt11:N=xt11:N)(by symmetry)=12(Xtn,1=Xtn,2,Atn,1Atn,2|Xtn,1=xt,Xt11:N=xt11:N)=12atn,1atn,2(Xtn,2=xt|Atn,1=atn,1,Atn,2=atn,2,Xtn,1=xt,Xt11:N=xt11:N)××(Atn,1=atn,1,Atn,2=atn,2|Xtn,1=xt,Xt11:N=xt11:N)12εDM¯M¯hatn,1atn,2(Atn,1=atn,1,Atn,2=atn,2|Xtn,1=xt,Xt11:N=xt11:N)(by Assumptions 8 and 2)12εD(M¯M¯h)2εA by Lemma E.13.\mathbb{P}\left(\left.{J_{t}^{n,1}\neq J_{t}^{n,2}}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \begin{aligned} &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2},L=1}\right|{X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{L=1,X_{t}^{n}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\text{(by symmetry)}\\ &=\frac{1}{2}\mathbb{P}\left(\left.{X_{t}^{n,1}=X_{t}^{n,2},A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &=\frac{1}{2}\begin{multlined}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{X_{t}^{n,2}=x_{t}}\right|{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2},X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\times\\ \times\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\end{multlined}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{X_{t}^{n,2}=x_{t}}\right|{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2},X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\times\\ \times\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ &\geq\frac{1}{2}\begin{multlined}\varepsilon_{D}\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \text{(by Assumptions~{}\ref{asp:coupling:dynamics} and~{}\ref{asp:mt_2ways_bound})}\end{multlined}\varepsilon_{D}\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\sum_{a_{t}^{n,1}\neq a_{t}^{n,2}}\mathbb{P}\left(\left.{A_{t}^{n,1}=a_{t}^{n,1},A_{t}^{n,2}=a_{t}^{n,2}}\right|{X_{t}^{n,1}=x_{t},X_{t-1}^{1:N}=x_{t-1}^{1:N}}\right)\\ \text{(by Assumptions~{}\ref{asp:coupling:dynamics} and~{}\ref{asp:mt_2ways_bound})}\\ &\geq\frac{1}{2}\varepsilon_{D}\left(\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\right)^{2}\varepsilon_{A}\text{ by Lemma~{}\ref{lem:modified_epsa}.}\end{aligned}

The proof is complete. ∎

Lemma E.13.

We use the notations of Algorithm 7. Under Assumptions 2 and 7, we have

(Atn,1Atn,2|Xtn,1,Xt11:N)M¯M¯hεA.\mathbb{P}\left(\left.{A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1},X_{t-1}^{1:N}}\right)\geq\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\varepsilon_{A}.
Proof.

We write (and define new notations along the way):

π(atn,1,atn,2):=p(atn,1,atn,2|Xtn,1,Xt11:N)p(atn,1,atn,2|Xt11:N)mt(Xt1atn,1,Xtn,1)=:p(atn,1,atn,2|Xt11:N)ϕ(atn,1)=:π0(atn,1,atn,2)ϕ(atn,1).\begin{split}\pi(a_{t}^{n,1},a_{t}^{n,2})&:=p(a_{t}^{n,1},a_{t}^{n,2}|X_{t}^{n,1},X_{t-1}^{1:N})\\ &\propto p(a_{t}^{n,1},a_{t}^{n,2}|X_{t-1}^{1:N})m_{t}(X_{t-1}^{a_{t}^{n,1}},X_{t}^{n,1})\\ &=:p(a_{t}^{n,1},a_{t}^{n,2}|X_{t-1}^{1:N})\phi(a_{t}^{n,1})\\ &=:\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1}).\end{split}

Thus

(Atn,1Atn,2|Xtn,1,Xt11:N)=𝟙{atn,1atn,2}π(atn,1,atn,2)=𝟙{atn,1atn,2}π0(atn,1,atn,2)ϕ(atn,1)π0(atn,1,atn,2)ϕ(atn,1)𝟙{atn,1atn,2}π0(atn,1,atn,2)M¯M¯h\begin{split}\mathbb{P}\left(\left.{A_{t}^{n,1}\neq A_{t}^{n,2}}\right|{X_{t}^{n,1},X_{t-1}^{1:N}}\right)&=\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi(a_{t}^{n,1},a_{t}^{n,2})\\ &=\frac{\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1})}{\int\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\phi(a_{t}^{n,1})}\\ &\geq\int\mathbbm{1}\left\{a_{t}^{n,1}\neq a_{t}^{n,2}\right\}\pi_{0}(a_{t}^{n,1},a_{t}^{n,2})\frac{\bar{M}_{\ell}}{\bar{M}_{h}}\end{split}

by the boundedness of the function ϕ\phi between M¯\bar{M}_{\ell} and M¯h\bar{M}_{h}. From this, we get the desired result by virtue of Assumption 7. ∎

E.12. Validity of Algorithm 14 (modified Lindvall-Rogers coupler)

Recall that generating a random variable is equivalent to simulating uniformly under the graph of its density (see e.g. Robert and Casella,, 2004, Chapter 2.3.1, the Fundamental Theorem of Simulation). The correctness of Algorithm 14 is thus a direct corollary of the following intuitive lemma.

Lemma E.14.

Let SAS_{A} and SBS_{B} be two subsets of d\mathbb{R}^{d} with finite Lebesgue measures. Let AA and BB be two (not necessarily independent) random variables distributed according to Uniform(SA)\operatorname{Uniform}(S_{A}) and Uniform(SB)\operatorname{Uniform}(S_{B}) respectively. Denote by S0S_{0} the intersection of SAS_{A} and SBS_{B}, and by CC a Uniform(SA)\operatorname{Uniform}(S_{A})-distributed random variable that is independent of (A,B)(A,B). Define AA^{\star} and BB^{\star} as

A={C if (A,C)S0×S0A otherwiseA^{\star}=\begin{cases}C&\text{ if }(A,C)\in S_{0}\times S_{0}\\ A&\text{ otherwise}\end{cases}

and

B={C if (B,C)S0×S0B otherwiseB^{\star}=\begin{cases}C&\text{ if }(B,C)\in S_{0}\times S_{0}\\ B&\text{ otherwise}\end{cases}

Then AUniform(SA)A^{\star}\sim\operatorname{Uniform}(S_{A}) and BUniform(SB)B^{\star}\sim\operatorname{Uniform}(S_{B}).

Proof.

Given (A,C)S0×S0(A,C)\in S_{0}\times S_{0}, the two variables AA and CC have the same distribution (which is Uniform(S0)\operatorname{Uniform}(S_{0})). Thus, the definition of AA^{\star} implies that AA and AA^{\star} have the same (unconditional) distribution. The same argument applies to BB and BB^{\star} notwithstanding the asymmetry in the definition of CC. ∎
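The construction of Lemma E.14 can likewise be checked by simulation. The following minimal sketch uses the hypothetical one-dimensional choices S_A = [0, 2] and S_B = [1, 3] (and, for simplicity, independent A and B, although the lemma does not require independence): the marginals of A* and B* remain uniform, while A* = B* with positive probability.

```python
# Monte Carlo illustration of Lemma E.14 with S_A = [0, 2], S_B = [1, 3],
# hence S_0 = [1, 2] (hypothetical one-dimensional example).
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
A = rng.uniform(0.0, 2.0, size=n)  # A ~ Uniform(S_A)
B = rng.uniform(1.0, 3.0, size=n)  # B ~ Uniform(S_B)
C = rng.uniform(0.0, 2.0, size=n)  # C ~ Uniform(S_A), independent of (A, B)

in_S0 = lambda x: (x >= 1.0) & (x <= 2.0)
A_star = np.where(in_S0(A) & in_S0(C), C, A)
B_star = np.where(in_S0(B) & in_S0(C), C, B)

# Rough marginal checks: A* should remain Uniform(0, 2) and B* Uniform(1, 3),
# while A* = B* with probability 1/8 in this example.
print(np.quantile(A_star, [0.25, 0.5, 0.75]))  # approximately [0.5, 1.0, 1.5]
print(np.quantile(B_star, [0.25, 0.5, 0.75]))  # approximately [1.5, 2.0, 2.5]
print(np.mean(A_star == B_star))               # approximately 0.125
```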

E.13. Hoeffding inequalities

This section proves a Hoeffding inequality for ratios, which is used to bound (28). It is essentially a reformulation of Douc et al., (2011, Lemma 4) in a slightly more general form.

Definition 2.

A real-valued random variable XX is called (C,S)(C,S)-sub-Gaussian if

(|X|S>t)2Cet2/2,t0.\mathbb{P}\left(\frac{\left|X\right|}{S}>t\right)\leq 2Ce^{-t^{2}/2},\forall\ t\geq 0.

This definition is close to other sub-Gaussian definitions in the literature; see e.g. Vershynin, (2018, Chapter 2.5). It basically means that the tails of XX decrease at least as fast as those of the 𝒩(0,S2)\mathcal{N}(0,S^{2}) distribution, which is itself (1,S)(1,S)-sub-Gaussian. The following result is classic.

Theorem E.15 (Hoeffding’s inequality).

Let X1,,XNX_{1},\ldots,X_{N} be NN i.i.d. random variables with mean μ\mu and almost surely contained between aa and bb. Then N1/2(Xi/Nμ)N^{1/2}(\sum X_{i}/N-\mu) is (1,(ba)/2)(1,(b-a)/2)-sub-Gaussian.

The following lemma follows in an elementary way from Definition 2; its proof is omitted.

Lemma E.16.

Let XX and YY be two (not necessarily independent) random variables. If XX is (C1,S1)(C_{1},S_{1})-sub-Gaussian and YY is (C2,S2)(C_{2},S_{2})-sub-Gaussian, then X+YX+Y is (C1+C2,S1+S2)(C_{1}+C_{2},S_{1}+S_{2})-sub-Gaussian.

We are ready to state the main result of this section.

Proposition 11 (Hoeffding’s inequality for ratios).

Let aNa_{N}, bNb_{N}, aa^{*}, bb^{*} be random variables such that N(aNa)\sqrt{N}(a_{N}-a^{*}) is (Ca,Sa)(C_{a},S_{a})-sub-Gaussian and N(bNb)\sqrt{N}(b_{N}-b^{*}) is (Cb,Sb)(C_{b},S_{b})-sub-Gaussian. Then N(aN/bNa/b)\sqrt{N}\left({a_{N}}/{b_{N}}-{a^{*}}/{b^{*}}\right) is sub-Gaussian with parameters (C,S)(C^{*},S^{*}) where

{C=Ca+CbS=1b(Sa+SbaNbN).\begin{cases}C^{*}&=C_{a}+C_{b}\\ S^{*}&=\left\lVert\frac{1}{b^{*}}\right\rVert_{\infty}(S_{a}+S_{b}\left\lVert\frac{a_{N}}{b_{N}}\right\rVert_{\infty}).\end{cases}

The terms involving the supremum norm can be infinite if the corresponding random variables are unbounded.

Proof.

We have

|N(aNbNab)|\displaystyle\left|\sqrt{N}\left(\frac{a_{N}}{b_{N}}-\frac{a^{*}}{b^{*}}\right)\right| |N(aNbNaNb)|+|N(aNbab)|\displaystyle\leq\left|\sqrt{N}\left(\frac{a_{N}}{b_{N}}-\frac{a_{N}}{b^{*}}\right)\right|+\left|\sqrt{N}\left(\frac{a_{N}}{b^{*}}-\frac{a^{*}}{b^{*}}\right)\right|
=|aNbN||1b||N(bNb)|+|1b||N(aNa)|\displaystyle=\left|\frac{a_{N}}{b_{N}}\right|\left|\frac{1}{b^{*}}\right|\left|\sqrt{N}(b_{N}-b^{*})\right|+\left|\frac{1}{b^{*}}\right|\left|\sqrt{N}(a_{N}-a^{*})\right|

from which the proposition follows by Lemma E.16. ∎
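As a quick sanity check (with hypothetical bounded distributions, not part of the proof), the sketch below simulates the ratio of two empirical means and compares the empirical tail of the normalised ratio error with the bound 2 C* exp(-t^2/2) of Proposition 11; the bound is conservative but is never violated.

```python
# Monte Carlo illustration of Proposition 11 (hypothetical example).
# a-samples are Uniform(0, 1), b-samples Uniform(1, 2), so S_a = S_b = 1/2,
# C_a = C_b = 1, ||1/b*|| = 2/3 and ||a_N/b_N|| <= 1, hence C* = 2, S* = 2/3.
import numpy as np

rng = np.random.default_rng(4)
N, reps = 400, 20_000
a_star, b_star = 0.5, 1.5
C_star, S_star = 2.0, 2.0 / 3.0

a_N = rng.uniform(0.0, 1.0, size=(reps, N)).mean(axis=1)
b_N = rng.uniform(1.0, 2.0, size=(reps, N)).mean(axis=1)
Z = np.sqrt(N) * (a_N / b_N - a_star / b_star)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(Z) / S_star > t)
    print(t, empirical, 2.0 * C_star * np.exp(-t ** 2 / 2.0))  # empirical <= bound
```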