
Simulating Diffusion Bridges with Score Matching

Jeremy Heng (corresponding author: heng@essec.edu), ESSEC Business School, Singapore 139408, Singapore; Valentin De Bortoli, Center for Science of Data, ENS Ulm, Paris 75005, France; Arnaud Doucet, Department of Statistics, University of Oxford, Oxford OX1 3LB, U.K.; James Thornton, Department of Statistics, University of Oxford, Oxford OX1 3LB, U.K.
Abstract

We consider the problem of simulating diffusion bridges, which are diffusion processes that are conditioned to initialize and terminate at two given states. The simulation of diffusion bridges has applications in diverse scientific fields and plays a crucial role in the statistical inference of discretely-observed diffusions. This is known to be a challenging problem that has received much attention in the last two decades. This article contributes to this rich body of literature by presenting a new avenue to obtain diffusion bridge approximations. Our approach is based on a backward time representation of a diffusion bridge, which may be simulated if one can time-reverse the unconditioned diffusion. We introduce a variational formulation to learn this time-reversal with function approximation and rely on a score matching method to circumvent intractability. Another iteration of our proposed methodology approximates the Doob's $h$-transform defining the forward time representation of a diffusion bridge. We discuss algorithmic considerations and extensions, and present numerical results on an Ornstein–Uhlenbeck process, a model from financial econometrics for interest rates, and a model from genetics for cell differentiation and development to illustrate the effectiveness of our approach.

Keywords: Diffusion; Diffusion bridge; Score matching; Stochastic differential equation; Time-reversal.

1 Introduction

Diffusion processes have been used extensively in the mathematical and natural sciences. A diffusion process $X=(X_{t})_{t\in[0,T]}$ in $\mathbb{R}^{d}$ is defined by the stochastic differential equation

$$\mathrm{d}X_{t}=f(t,X_{t})\,\mathrm{d}t+\sigma(t,X_{t})\,\mathrm{d}W_{t}, \qquad (1)$$

where $f:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a drift function, $\sigma:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}$ is a diffusion coefficient, and $W=(W_{t})_{t\in[0,T]}$ is a $d$-dimensional Brownian motion. We suppose $f$ and $\sigma$ are sufficiently regular to induce a unique weak solution and that $\Sigma(t,x_{t})=(\sigma\sigma^{\mathrm{T}})(t,x_{t})$ is uniformly positive definite for all $(t,x_{t})\in[0,T]\times\mathbb{R}^{d}$. For any $0\leq s<t\leq T$, we denote the transition density of (1) with respect to the Lebesgue measure on $\mathbb{R}^{d}$ by $p(t,x_{t}\mid s,x_{s})$ and assume that it is positive for ease of exposition. While the numerical simulation of $X$ can be routinely handled by time-discretization schemes (Kloeden and Platen, 1992), simulating $X$ initialized at $X_{0}=x_{0}$ and conditioned to terminate at $X_{T}=x_{T}$ is a challenging problem that has received much attention over the last two decades.
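Simulating unconditioned paths of (1) by time-discretization is routine; as a minimal illustrative sketch (the function names and the Ornstein–Uhlenbeck example below are ours), the Euler–Maruyama scheme can be written as:

```python
import numpy as np

def euler_maruyama(f, sigma, x0, T, M, rng):
    """Simulate one path of dX_t = f(t, X_t) dt + sigma(t, X_t) dW_t
    on [0, T] with M steps of size dt = T / M."""
    dt = T / M
    d = x0.shape[0]
    path = np.empty((M + 1, d))
    path[0] = x0
    for m in range(M):
        t = m * dt
        x = path[m]
        dW = np.sqrt(dt) * rng.standard_normal(d)  # Brownian increment
        path[m + 1] = x + f(t, x) * dt + sigma(t, x) @ dW
    return path

# Example: Ornstein--Uhlenbeck drift f(t, x) = -2x, identity diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(lambda t, x: -2.0 * x,
                      lambda t, x: np.eye(len(x)),
                      np.ones(2), T=1.0, M=100, rng=rng)
```

This is the forward simulation that the score matching procedure in Section 2.2 relies on to generate training paths.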

Simulating the conditioned process $X^{\star}=(X_{t}^{\star})_{t\in[0,T]}$, commonly referred to as a diffusion bridge, has applications in diverse fields such as computational chemistry (Bolhuis et al., 2002; Wang et al., 2020), financial econometrics (Elerian et al., 2001; Durham and Gallant, 2002), genetics (Wang et al., 2011), and shape analysis (Arnaudon et al., 2022). When performing statistical inference for parameters of $f$ and $\sigma$ in the case where $X$ is observed at discrete time points, diffusion bridge simulation is a crucial tool that allows one to impute missing paths between observations within an expectation-maximization algorithm or a Gibbs sampler (Pedersen, 1995; Roberts and Stramer, 2001; Eraker, 2001; Beskos et al., 2006; Golightly and Wilkinson, 2008; van der Meulen and Schauer, 2017).

By Doob's $h$-transform (Rogers and Williams, 2000, p. 83), it is well known that $X^{\star}$ satisfies

$$\mathrm{d}X_{t}^{\star}=\{f(t,X_{t}^{\star})+\Sigma(t,X_{t}^{\star})\nabla\log h(t,X_{t}^{\star})\}\mathrm{d}t+\sigma(t,X_{t}^{\star})\,\mathrm{d}W_{t},\quad X_{0}^{\star}=x_{0}, \qquad (2)$$

where $h(t,x_{t})=p(T,x_{T}\mid t,x_{t})$ and $\nabla$ denotes the gradient operator. The term $\Sigma(t,x_{t})\nabla\log h(t,x_{t})$ forces the conditioned process towards the terminal condition $X_{T}^{\star}=x_{T}$. As the transition density, and hence its logarithmic gradient, is intractable for most diffusions, exploiting this result to simulate diffusion bridges is highly non-trivial. To this end, one can characterize $h$ as the solution of the backward Kolmogorov equation

$$\partial_{t}h(t,x_{t})+(\mathcal{L}h)(t,x_{t})=0, \qquad (3)$$

with terminal condition at time $T$ given by the Dirac measure at $x_{T}$, where $\mathcal{L}$ denotes the generator of $X$ (Stroock and Varadhan, 1997). Equation (3) reveals that $h$ propagates information about the terminal constraint backwards in time. However, numerical resolution of this partial differential equation is particularly challenging due to the singularity at time $T$, and computationally demanding when the dimension $d$ is large (Wang et al., 2020). Furthermore, one must run a solver for every pair of conditioned states $(x_{0},x_{T})$ considered.

A common approach to address these difficulties is to simulate a proposal bridge process $X^{\circ}=(X_{t}^{\circ})_{t\in[0,T]}$ satisfying $\mathrm{d}X_{t}^{\circ}=f^{\circ}(t,X_{t}^{\circ})\mathrm{d}t+\sigma(t,X_{t}^{\circ})\mathrm{d}W_{t}$ with $X_{0}^{\circ}=x_{0}$. One constructs $f^{\circ}:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ using a tractable approximation of (2), and corrects for the discrepancy using importance sampling or an independent Metropolis–Hastings algorithm (Papaspiliopoulos and Roberts, 2012; Elerian et al., 2001). The simple choice $f^{\circ}(t,x_{t})=f(t,x_{t})$ typically performs poorly as it does not take the constraint $X_{T}=x_{T}$ into account (Pedersen, 1995). The drift of a Brownian bridge, $f^{\circ}(t,x_{t})=(x_{T}-x_{t})/(T-t)$, has been considered in several works (Durham and Gallant, 2002; Delyon and Hu, 2006; Stramer and Yan, 2007; Papaspiliopoulos et al., 2013), and was improved by Whitaker et al. (2017) using an innovative decomposition of the process into deterministic and stochastic parts. Clark (1990), followed by Delyon and Hu (2006), studied the choice $f^{\circ}(t,x_{t})=f(t,x_{t})+(x_{T}-x_{t})/(T-t)$ that incorporates the dynamics of the original process $X$. To introduce more flexibility and better mimic the structure of (2), Schauer et al. (2017) proposed setting $f^{\circ}(t,x_{t})=f(t,x_{t})+\Sigma(t,x_{t})\nabla\log\tilde{h}(t,x_{t})$, where $\tilde{h}(t,x_{t})=\tilde{p}(T,x_{T}\mid t,x_{t})$ is an analytically tractable transition density of an auxiliary process. For tractability, the latter is typically chosen from the class of linear processes and can be optimized to obtain the best approximation within this class (van der Meulen and Schauer, 2017).
Other Markov chain Monte Carlo approaches include Gibbs sampling (Eraker, 2001), Langevin-type stochastic partial differential equations (Stuart et al., 2004; Beskos et al., 2008), piecewise deterministic Monte Carlo (Bierkens et al., 2021), and manifold Hamiltonian Monte Carlo methods (Graham et al., 2022).
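The simple proposal drifts surveyed above can be written down directly; a minimal sketch with illustrative function names (not from any of the cited packages):

```python
import numpy as np

def brownian_bridge_drift(xT, T):
    """Modified diffusion bridge drift: pull straight towards x_T."""
    return lambda t, x: (xT - x) / (T - t)

def cdh_drift(f, xT, T):
    """Clark--Delyon--Hu drift: original drift plus the Brownian bridge pull."""
    return lambda t, x: f(t, x) + (xT - x) / (T - t)

# Example with an Ornstein--Uhlenbeck drift f(t, x) = -2x.
f = lambda t, x: -2.0 * x
fc = cdh_drift(f, xT=np.array([1.0]), T=1.0)
```

Either drift can be plugged into a standard Euler–Maruyama loop to simulate the corresponding proposal bridge, whose discrepancy from the true bridge is then corrected by importance sampling or Metropolis–Hastings as described above.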

The exact simulation algorithms developed in Beskos and Roberts (2005) and Beskos et al. (2006) can be employed to sample diffusion bridges without any time-discretization error. However, these elegant methods are limited to the class of diffusion processes that can be transformed to have unit diffusion coefficient. Bladt and Sørensen (2014) and Bladt et al. (2016) devised an ingenious methodology to simulate diffusion bridges based on coupling and time-reversal of diffusions. Their proposed method is applicable to the class of ergodic diffusions with an invariant density that is either explicitly known or numerically approximated. Closely related approaches include sequential Monte Carlo algorithms that resample using backward information filter approximations (Guarniero, 2017), information from backward pilot paths (Lin et al., 2010), or guided weight functions (Del Moral and Murray, 2015). The main idea underlying these works is the representation of the diffusion bridge in (2) and (3).

2 Diffusion bridges

2.1 Time-reversed bridge process

It can be shown that the time-reversed bridge process $Z^{\star}=(Z_{t}^{\star})_{t\in[0,T]}=(X_{T-t}^{\star})_{t\in[0,T]}$ satisfies

$$\mathrm{d}Z_{t}^{\star}=b(t,Z_{t}^{\star})\,\mathrm{d}t+\sigma(T-t,Z_{t}^{\star})\,\mathrm{d}B_{t},\quad Z_{0}^{\star}=x_{T}, \qquad (4)$$

with drift function $b(t,z_{t})=-f(T-t,z_{t})+\Sigma(T-t,z_{t})s(T-t,z_{t})+\nabla\cdot\Sigma(T-t,z_{t})$, another standard Brownian motion $B=(B_{t})_{t\in[0,T]}$, $s(t,x_{t})=s^{\star}(t,x_{t})-\nabla\log h(t,x_{t})$, and $\nabla\cdot\Sigma(t,x)=(\sum_{j=1}^{d}\partial_{x_{j}}\Sigma^{1,j}(t,x),\ldots,\sum_{j=1}^{d}\partial_{x_{j}}\Sigma^{d,j}(t,x))$ the divergence of $\Sigma$. Here $s^{\star}(t,x_{t})=\nabla\log p^{\star}(t,x_{t})$ denotes the score of the marginal density $p^{\star}(t,x_{t})$ of the diffusion bridge process $X_{t}^{\star}$ at time $t\in(0,T)$. We refer readers to Haussmann and Pardoux (1986), Millet et al. (1989), and Cattiaux et al. (2023) for conditions under which the representation in (4) holds. By the Markov property, we have the relation

$$p^{\star}(t,x_{t})=p\{t,x_{t}\mid(0,x_{0}),(T,x_{T})\}=\frac{p(t,x_{t}\mid 0,x_{0})h(t,x_{t})}{p(T,x_{T}\mid 0,x_{0})}, \qquad (5)$$

as $h(t,x_{t})=p(T,x_{T}\mid t,x_{t})$. This implies that $s(t,x_{t})=\nabla\log p(t,x_{t}\mid 0,x_{0})$ is simply the score of the transition density of $X$.

Exploiting this backward time representation to derive diffusion bridge approximations is also highly non-trivial due to the intractability of the transition density $\eta(t,x_{t})=p(t,x_{t}\mid 0,x_{0})$, which is now characterized by the forward Kolmogorov equation $\partial_{t}\eta(t,x_{t})+(\mathcal{F}\eta)(t,x_{t})=0$, with initial condition at time $0$ given by the Dirac measure at $x_{0}$, where $\mathcal{F}$ denotes the Fokker–Planck operator of $X$ (Stroock and Varadhan, 1997). Numerical resolution of $\eta$ using partial differential equation solvers suffers from the same difficulties as (3). A key observation is that (4) can be understood as first setting $Z_{0}^{\star}=x_{T}$ to satisfy the terminal constraint, and then evolving $Z^{\star}$ using the time-reversal of (1). Due to the influence of the score $s(t,x_{t})$, the process will end at the initial constraint $Z_{T}^{\star}=x_{0}$ by construction. This connection between simulation of a diffusion bridge and time-reversal of its original diffusion process forms the basis of our score approximation. We refer readers to Section B of the Supplementary Material for an alternative and more elementary argument establishing this connection.
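To make this observation concrete, here is a minimal sketch (function names are ours) of simulating the backward representation (4) by Euler discretization, given some approximation of the score $s(t,x)$; for simplicity it assumes a diffusion matrix $\Sigma$ that is constant in $(t,x)$, so the divergence term in $b$ vanishes:

```python
import numpy as np

def simulate_backward_bridge(f, Sigma, score, x0, xT, T, M, rng):
    """Euler discretization of (4): start at Z_0 = x_T and evolve the
    time-reversal of (1). `score(t, x)` approximates s(t, x) = grad log p(t, x | 0, x0);
    Sigma is assumed constant so the divergence term vanishes."""
    dt = T / M
    chol = np.linalg.cholesky(Sigma)
    z = np.empty((M + 1, len(xT)))
    z[0] = xT
    for m in range(M):
        s = T - m * dt  # forward time corresponding to backward time m * dt
        drift = -f(s, z[m]) + Sigma @ score(s, z[m])
        z[m + 1] = z[m] + drift * dt + np.sqrt(dt) * chol @ rng.standard_normal(len(xT))
    z[M] = x0       # pin the initial constraint; holds almost surely for the exact score
    return z[::-1]  # reverse to obtain a forward-time path from x0 to xT

# Illustration with a placeholder (zero) score, for shapes only.
rng = np.random.default_rng(1)
path = simulate_backward_bridge(lambda t, x: -2.0 * x, np.eye(1),
                                lambda t, x: np.zeros_like(x),
                                np.zeros(1), np.ones(1), T=1.0, M=50, rng=rng)
```

With a learned score $s_{\phi}$ in place of the placeholder, this is the proposal bridge that is subsequently corrected by importance sampling or Metropolis–Hastings.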

2.2 Learning time-reversal with score matching

We introduce a variational formulation to learn the time-reversal of (1), involving path measures on the space of continuous functions from $[0,T]$ to $\mathbb{R}^{d}$, equipped with the cylinder $\sigma$-algebra. Consider the time-reversed process $Z=(Z_{t})_{t\in[0,T]}$ satisfying

$$\mathrm{d}Z_{t}=b_{\phi}(t,Z_{t})\,\mathrm{d}t+\sigma(T-t,Z_{t})\,\mathrm{d}B_{t},\quad Z_{0}\sim p(T,\mathrm{d}x_{T}\mid 0,x_{0}), \qquad (6)$$

with drift function $b_{\phi}(t,z_{t})=-f(T-t,z_{t})+\Sigma(T-t,z_{t})s_{\phi}(T-t,z_{t})+\nabla\cdot\Sigma(T-t,z_{t})$, which mimics the form of $b(t,z_{t})$ in (4). Here $s_{\phi}:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a function approximation of the score $s(t,x_{t})$ that depends on parameters $\phi\in\Phi$ to be optimized. We measure the score approximation error as

$$e(\phi)=E^{x_{0}}\left\{\int_{0}^{T}\left\|s_{\phi}(t,X_{t})-s(t,X_{t})\right\|_{\Sigma(t,X_{t})}^{2}\mathrm{d}t\right\}, \qquad (7)$$

where $E^{x_{0}}$ denotes expectation with respect to the path measure $\mathbb{P}^{x_{0}}$ induced by $(X_{t})_{t\in[0,T]}$ in (1) with $X_{0}=x_{0}$, and $\|\cdot\|_{A}$ denotes the Euclidean norm weighted by a positive definite matrix $A\in\mathbb{R}^{d\times d}$.

Let $\mathbb{Q}_{\phi}^{x_{0}}$ be the path measure induced by the parameterized forward process $(Z_{T-t})_{t\in[0,T]}$. We consider minimizing the Kullback–Leibler divergence $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})=E^{x_{0}}\{\log\mathrm{d}\mathbb{P}^{x_{0}}/\mathrm{d}\mathbb{Q}_{\phi}^{x_{0}}(X)\}$, where $\mathrm{d}\mathbb{P}^{x_{0}}/\mathrm{d}\mathbb{Q}_{\phi}^{x_{0}}$ denotes the Radon–Nikodym derivative of $\mathbb{P}^{x_{0}}$ with respect to $\mathbb{Q}_{\phi}^{x_{0}}$. The following result gives an upper bound on this objective and shows that the process $Z$ ends at the initial constraint $Z_{T}=x_{0}$ if the score approximation error is finite.

Proposition 1.

Assuming $e(\phi)<\infty$, we have $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})\leq e(\phi)/2$ and $Z_{T}=x_{0}$ holds $\mathbb{Q}_{\phi}^{x_{0}}$-almost surely.

Under Novikov's condition, an exponential integrability assumption, Girsanov's theorem shows that $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})=e(\phi)/2$ (Le Gall, 2016a, Theorems 4.13 and 5.22). We do not provide here any assumption on $s_{\phi}(t,x_{t})$ to guarantee that $e(\phi)<\infty$. Previous work by Delyon and Hu (2006) and Schauer et al. (2017) has studied necessary conditions for the law of a proposal bridge process and the law of the diffusion bridge to be equivalent. These findings suggest that $b_{\phi}(t,z_{t})$ should behave like $(x_{0}-z_{t})/t$ as $t\rightarrow 0$, which can be used to guide our parameterization of the score approximation. For example, we consider $s_{\phi}(t,x_{t})=\Sigma^{-1}(t,x_{t})(x_{0}-x_{t})/t+v_{\phi}(t,x_{t})$, where $v_{\phi}:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ denotes a bounded function.

Although Proposition 1 clearly relates the approximation of the time-reversal of (1) to the approximation of the score $s(t,x_{t})$, its form is not amenable to optimization. The following result gives an alternative, practical expression by adapting the idea of denoising score matching (Vincent, 2011) to our setting.

Proposition 2.

For any partition $(t_{m})_{m=0}^{M}$ of the interval $[0,T]$, we have $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})\leq L(\phi)+C$ if $e(\phi)<\infty$, where $C$ is a constant independent of $\phi\in\Phi$ and the loss function is defined as

$$L(\phi)=\frac{1}{2}\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}E^{x_{0}}\left\{\left\|s_{\phi}(t,X_{t})-g(t_{m-1},X_{t_{m-1}},t,X_{t})\right\|_{\Sigma(t,X_{t})}^{2}\right\}\mathrm{d}t, \qquad (8)$$

and $g(s,x_{s},t,x_{t})=\nabla\log p(t,x_{t}\mid s,x_{s})$ for $0\leq s<t\leq T$ and $x_{s},x_{t}\in\mathbb{R}^{d}$.

Therefore minimizing the Kullback–Leibler divergence $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})$ is equivalent to minimizing the loss function $L(\phi)$. This allows us to circumvent the intractable score $s(t,x_{t})$ by working with $g(t_{m-1},x_{t_{m-1}},t,x_{t})$, the score of the transition density $p(t,x_{t}\mid t_{m-1},x_{t_{m-1}})$. Although the latter is also intractable, it can be approximated when the sub-interval $[t_{m-1},t_{m}]$ is sufficiently small. For example, under the Euler–Maruyama scheme (Kloeden and Platen, 1992, p. 340) with stepsize $\delta t=T/M$,

$$g(t_{m-1},x_{t_{m-1}},t_{m},x_{t_{m}})\approx-(\delta t)^{-1}\Sigma^{-1}(t_{m-1},x_{t_{m-1}})\{x_{t_{m}}-x_{t_{m-1}}-\delta t\,f(t_{m-1},x_{t_{m-1}})\}.$$

Hence the loss $L(\phi)$ can be approximated and minimized using stochastic gradient algorithms by simulating time-discretized paths under (1). The minimal loss of $L(\phi)=-C$, achieved when $s_{\phi}(t,x_{t})=s(t,x_{t})$ $\mathbb{P}^{x_{0}}$-almost surely, is unknown as the constant $C$ is typically intractable. After obtaining the score approximation $s_{\phi}$, we can simulate a proposal bridge $Z$ from (6) with $Z_{0}=x_{T}$ and correct it using importance sampling or independent Metropolis–Hastings. Time-discretization considerations and proposal correction procedures are detailed in Section C of the Supplementary Material.
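As a minimal sketch of this step (all names are illustrative, and we assume a constant diffusion matrix and a uniform time grid for simplicity), the Monte Carlo estimate of the loss (8), with $g$ replaced by its Euler–Maruyama approximation, can be evaluated as:

```python
import numpy as np

def dsm_loss(s_phi, paths, ts, f, Sigma):
    """Monte Carlo estimate of L(phi) in (8), with g replaced by its
    Euler--Maruyama approximation. `paths` has shape (N, M+1, d) and holds N
    time-discretized paths of (1) on the uniform grid `ts`; Sigma is constant."""
    Sigma_inv = np.linalg.inv(Sigma)
    dt = ts[1] - ts[0]
    total = 0.0
    for m in range(1, paths.shape[1]):
        x_prev, x_cur = paths[:, m - 1], paths[:, m]
        # g(t_{m-1}, x_{m-1}, t_m, x_m) ~ -(dt)^{-1} Sigma^{-1} (x_m - x_{m-1} - dt f)
        g = -(x_cur - x_prev - dt * f(ts[m - 1], x_prev)) @ Sigma_inv.T / dt
        diff = s_phi(ts[m], x_cur) - g
        # squared norm weighted by Sigma, averaged over the N paths
        total += 0.5 * dt * np.einsum('ni,ij,nj->n', diff, Sigma, diff).mean()
    return total
```

In practice one would differentiate this loss with respect to the parameters of $s_{\phi}$ with an automatic differentiation library and run a stochastic gradient algorithm, resampling fresh batches of paths under (1).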

In scenarios where one is interested in multiple pairs of conditioned states $(x_{0},x_{T})$, we can extend the above methodology to avoid learning multiple score approximations as follows. We let the score approximation $s_{\phi}(t,x_{t})$ in (6) also depend on the initial condition $x_{0}$, and average the Kullback–Leibler objective $\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})$ over a distribution $p_{0}(\mathrm{d}x_{0})$ on $\mathbb{R}^{d}$ that can be sampled from. By applying the arguments of Proposition 2 conditionally on $X_{0}=x_{0}$, we obtain a loss function given by averaging (8) over $p_{0}(\mathrm{d}x_{0})$, which can be minimized using time-discretization and stochastic gradient algorithms. In this setting, we parameterize the score approximation as $s_{\phi}(t,x_{t},x_{0})=\Sigma^{-1}(t,x_{t})(x_{0}-x_{t})/t+v_{\phi}(t,x_{t},x_{0})$, where $v_{\phi}(t,x_{t},x_{0})$ is now a function that also depends on the initial condition $x_{0}$.

2.3 Learning Doob's $h$-transform

It is instructive to consider the time-reversal of (4), which gives

$$\mathrm{d}X_{t}^{\star}=f^{\star}(t,X_{t}^{\star})\,\mathrm{d}t+\sigma(t,X_{t}^{\star})\,\mathrm{d}W_{t},\quad X_{0}^{\star}=x_{0}, \qquad (9)$$

with drift function $f^{\star}(t,x_{t})=-b(T-t,x_{t})+\Sigma(t,x_{t})s^{\star}(t,x_{t})+\nabla\cdot\Sigma(t,x_{t})$. Using the form of $b(t,z_{t})$ and the relation $s(t,x_{t})=s^{\star}(t,x_{t})-\nabla\log h(t,x_{t})$, we can rewrite $f^{\star}$ as

$$f^{\star}(t,x_{t})=f(t,x_{t})+\Sigma(t,x_{t})\{s^{\star}(t,x_{t})-s(t,x_{t})\}=f(t,x_{t})+\Sigma(t,x_{t})\nabla\log h(t,x_{t}), \qquad (10)$$

and recover Doob's $h$-transform in (2). Although this is to be expected, as the reversal of the time-reversed process should recover the original process, it forms the basis of our approximation of (2).

After obtaining an approximation of $s(t,x_{t})$, another iteration of our methodology can be used to obtain a function approximation of $s^{\star}(t,x_{t})$. For brevity, this is detailed in Section 2.3 of the Supplementary Material. Plugging both approximations into (10) then gives an approximation of $X^{\star}$, which could be of interest in algorithms that require a forward time representation of the proposal bridge process (Lin et al., 2010; Del Moral and Murray, 2015). However, this comes at the cost of learning two approximations and some accumulation of errors. Recent work by Baker et al. (2024) has extended our approach to directly approximate $\nabla\log h(t,x_{t})$, which is not necessarily the score of a probability density.

3 Related work on generative modeling

Denoising diffusion models are popular state-of-the-art generative models (Sohl-Dickstein et al., 2015; Song et al., 2021). These models are based on a diffusion process that transforms data into random normal noise. Their time-reversal is then approximated to obtain a generative model that maps sampled noise to synthetic data. Recent extensions have generalized these models to allow the distribution at the terminal time $T$ to be arbitrary rather than a normal distribution (De Bortoli et al., 2021; Vargas et al., 2021; Chen et al., 2022). While there are similarities between these methods and our proposed methodology, the challenges in simulating a diffusion bridge between two states $x_{0}$ and $x_{T}$ are distinctly different. Firstly, the diffusion process in (1) usually represents a probabilistic model for a problem of interest with an intractable transition density. In contrast, denoising diffusion models employ an Ornstein–Uhlenbeck process, leveraging its analytically tractable transition density to learn the score. Secondly, while the time dimension in (1) arises from modeling time series data, with the time interval $T$ representing the time between observations, the time variable $T$ in denoising diffusion models is artificially introduced and represents the time necessary for the diffusion to transform the data distribution into a normal distribution. Thirdly, the time-reversed process (4) is initialized from a conditioned state $x_{T}$ in our setting, whereas it is initialized from a normal distribution in generative models.

4 Numerical examples

4.1 Preliminaries

As our methodology allows one to employ any function approximator, we harness the flexibility of neural networks and the ease of implementation using modern software to approximate score functions. We adopt the parameterization of the score approximation in Section 2.2 with $v_{\phi}$ defined by a neural network. The choice of neural network and stochastic gradient algorithm is detailed in Section D.2 of the Supplementary Material. Optimization times ranged from several seconds to a few minutes on a simple desktop machine and can be reduced with hardware accelerators. As such computational overheads are marginal when deploying proposal bridge processes within an independent Metropolis–Hastings algorithm with many iterations or an importance sampler with many samples, we focus on assessing the quality of our proposals in settings where existing proposal methods are unsatisfactory. The performance measures considered are the acceptance rate of independent Metropolis–Hastings and, in the case of an importance sampling estimator $\hat{p}(T,x_{T}\mid 0,x_{0})$ of $p(T,x_{T}\mid 0,x_{0})$, the effective sample size proportion and the variance $\mathrm{var}\{\log\hat{p}(T,x_{T}\mid 0,x_{0})\}$, or the mean squared error $E[\{\log\hat{p}(T,x_{T}\mid 0,x_{0})-\log p(T,x_{T}\mid 0,x_{0})\}^{2}]$ when the true transition density is known. More implementation details are described in Section D of the Supplementary Material. These measures were computed using 1024 samples or iterations and 100 independent repetitions of each method. We benchmark our approximations of the backward diffusion bridge (BDB) and forward diffusion bridge (FDB) in (4) and (9) against the Clark–Delyon–Hu (CDH) proposal bridge studied by Clark (1990) and Delyon and Hu (2006), the forward diffusion (FD) of Pedersen (1995), the guided proposal bridge (GPB) of Schauer et al. (2017), and the modified diffusion bridge (MDB) of Durham and Gallant (2002). Given the difficulty of comparing the wide range of methods for diffusion bridges in a completely fair manner, as their strengths and weaknesses can depend on the specificities of the problem under consideration, we note that our objective is merely to illustrate a new avenue to improve the construction of proposal bridge processes. A Python package to reproduce all numerical results is available online at https://github.com/jeremyhengjm/DiffusionBridge.
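For concreteness, the effective sample size proportion used above can be computed from unnormalized log importance weights with the standard formula $1/(N\sum_{n}\bar{w}_{n}^{2})$ for normalized weights $\bar{w}_{n}$; a minimal sketch (the function name is ours):

```python
import numpy as np

def ess_proportion(log_w):
    """Effective sample size proportion from unnormalized log importance weights."""
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    w /= w.sum()                     # normalize the weights
    return 1.0 / (len(w) * np.sum(w ** 2))
```

Equal weights give a proportion of 1, while a single dominant weight among $N$ samples drives it towards $1/N$.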

4.2 Ornstein–Uhlenbeck process

Consider (1) with linear drift function $f(t,x_{t})=-2x_{t}$ and identity diffusion coefficient $\sigma(t,x_{t})=I_{d}$. The transition density and score function are explicitly known and used as ground truth. The first and second rows of Fig. 1 illustrate the impact of the time horizon $T$ and dimension $d$ on algorithmic performance with the constraints $x_{0}=x_{T}=(1,\ldots,1)$. To study how performance degrades when we condition on rare states under the diffusion process (1), in the third row of Fig. 1 we set the initial state as $x_{0}=1$ and vary how far the terminal state $x_{T}$ lies in the tails of the transition density $p(T,x_{T}\mid 0,x_{0})$, in multiples of its standard deviation.
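For this example the ground truth is available in closed form: with $f(t,x)=-2x$ and $\sigma=I_{d}$, the transition distribution is $X_{t}\mid X_{0}=x_{0}\sim N(x_{0}e^{-2t},\,(1-e^{-4t})/4\cdot I_{d})$, so the score of the transition density is exact. A sketch (function name is ours):

```python
import numpy as np

def ou_score(t, x, x0):
    """Exact score grad_x log p(t, x | 0, x0) for dX = -2 X dt + dW:
    X_t | X_0 = x0 is N(x0 exp(-2t), (1 - exp(-4t))/4 * I)."""
    mean = x0 * np.exp(-2.0 * t)
    var = (1.0 - np.exp(-4.0 * t)) / 4.0
    return -(x - mean) / var  # Gaussian score: -(x - mean) / variance
```

This closed form is what allows the learned score approximation to be benchmarked directly in this example.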

We omit the guided proposal bridge in this example as it performs perfectly when its auxiliary process is the Ornstein–Uhlenbeck process. Our proposed methods offer substantial improvements over other methods without exploiting the correct parameterization. When comparing our forward and backward diffusion bridge approximations, we notice some accumulation of error, which is to be expected as the forward process is constructed using an additional score approximation. Given this observation, we will only consider the backward process in the following.

Figure 1: Results for the Ornstein–Uhlenbeck example with fixed $d=1$ (first row), fixed $T=1$ (second row), and fixed $d=1,T=1$ (third row).

4.3 Interest rates model

We consider a special case of an interest rates model in Aït-Sahalia and Lo (1998), defined by (1) with $f(t,x_{t})=4/x_{t}-x_{t}$ and $\sigma(t,x_{t})=1$. This specification admits a tractable transition density and score function as ground truth, given in terms of modified Bessel functions and their derivatives. For each $T$, we first learn a single score approximation to handle multiple conditioned states $(x_{0},x_{T})$ by minimizing the loss in (8) averaged over initial states $X_{0}=x_{0}$ drawn from the gamma distribution with shape $5$ and rate $2$. We construct guided proposal bridges based on a first-order Taylor approximation of $f$ at the stable stationary point $x^{\star}=2$. We plot our score approximation error for $T=1$ in Fig. 2, and examine how algorithmic performance depends on $T$ and $(x_{0},x_{T})$ in Fig. 3. Our backward diffusion bridge approximation performs well for all $T$ considered and when conditioning requires the process to move away from the stationary point (first and third columns). Its performance is similar to the guided proposal bridge and significantly better than other existing methods. While the performance of the modified diffusion bridge typically degrades with $T$, the results of the forward diffusion and Clark–Delyon–Hu bridge depend on the specific conditioned states.

Figure 2: True score function $s(t,x)$ (first row) and the function approximation error $s(t,x)-s_{\phi}(t,x)$ (second row) for various times $t$ and initial conditions $x_{0}$.
Figure 3: Results for the interest rates model for three pairs of conditioned states $(x_{0},x_{T})$.

4.4 Cell model

We end with a cell differentiation and development model from Wang et al. (2011). Cellular expression levels $X_{t}=(X_{t,1},X_{t,2})$ of two genes are modelled by (1) with $f=(f_{1},f_{2})$, where $f_{i}(t,x_{t})=x_{t,i}^{4}/(2^{-4}+x_{t,i}^{4})+2^{-4}/(2^{-4}+x_{t,j}^{4})-x_{t,i}$ for $(i,j)\in\{(1,2),(2,1)\}$; the three terms describe self-activation, mutual inhibition, and inactivation respectively, and $\sigma(t,x_{t})=\sigma_{X}I_{2}$ captures intrinsic or external fluctuations. We consider the cell development from the undifferentiated state $x_{0}=(1,1)$ to a differentiated state $x_{T}=(x_{T,1},x_{T,2})$ defined by another stable fixed point of $f$ satisfying $x_{T,1}<x_{T,2}$. We employ an auxiliary Ornstein–Uhlenbeck process in the guided proposal bridge and optimize its parameters using the Kullback–Leibler divergence considered in Schauer et al. (2017). Fig. 4 displays the performance of all proposal methods for approximating this diffusion bridge. For this application, we observe improvement over the guided proposal bridge when the diffusion coefficient $\sigma_{X}$ is smaller, and significant improvement over other methods when the time horizon is long. These findings are consistent with numerical results obtained in Baker et al. (2024) for the same model.
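The drift of the cell model can be coded directly from the formula above; a sketch (the function name is ours). One can check that the undifferentiated state $(1,1)$ is indeed a fixed point of $f$:

```python
import numpy as np

def cell_drift(t, x):
    """Drift f = (f_1, f_2) of the cell model: self-activation,
    mutual inhibition, and inactivation terms for each gene."""
    x1, x2 = x
    f1 = x1**4 / (2.0**-4 + x1**4) + 2.0**-4 / (2.0**-4 + x2**4) - x1
    f2 = x2**4 / (2.0**-4 + x2**4) + 2.0**-4 / (2.0**-4 + x1**4) - x2
    return np.array([f1, f2])
```

At $x=(1,1)$ each component equals $1/(2^{-4}+1)+2^{-4}/(2^{-4}+1)-1=0$, confirming the fixed point used as the starting state.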

Figure 4: Results for the cell model with $\sigma_{X}^{2}=0.1$ (first row) and $\sigma_{X}^{2}=1$ (second row).

5 Discussion

While implicit score matching (Hyvärinen, 2005) could have been an alternative to denoising score matching (Vincent, 2011), we found it to be less effective and computationally more expensive. If the terminal state $x_{T}$ is highly unlikely under the law of the unconditioned diffusion, the score approximation learned using unconditional sample paths could be poor near $x_{T}$. Nevertheless, we observed satisfactory experimental results for an Ornstein–Uhlenbeck process up to 8 standard deviations. In statistical problems where some hyperparameters of the diffusion are estimated using maximum likelihood or Bayesian inference, a direct application of our approach would require a new diffusion bridge approximation for each set of parameters considered. To avoid incurring significant computational cost, one could incorporate parameter dependence in the score approximation procedure (Boserup et al., 2024). Recent work has also extended our approach to approximate $\nabla\log h(t,x_{t})$ directly, without having to approximate a time-reversal (Baker et al., 2024), and to infinite-dimensional diffusion processes (Baker et al., 2024).

Acknowledgement

The authors thank the editor, associate editor, and reviewers for their comments, which have helped to improve the paper significantly. Arnaud Doucet and James Thornton were partially supported by the U.K. Engineering and Physical Sciences Research Council.

References

  • Aït-Sahalia, Y. and A. W. Lo (1998). Nonparametric estimation of state-price densities implicit in financial asset prices. The Journal of Finance 53(2), 499–547.
  • Ambrosio, L., N. Gigli, and G. Savaré (2005). Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Birkhäuser.
  • Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(3), 269–342.
  • Arnaudon, A., F. van der Meulen, M. Schauer, and S. Sommer (2022). Diffusion bridges for stochastic Hamiltonian systems and shape evolutions. SIAM Journal on Imaging Sciences 15(1), 293–323.
  • Baker, E. L., M. Schauer, and S. Sommer (2024). Score matching for bridges without time-reversals. arXiv preprint arXiv:2407.15455.
  • Baker, E. L., G. Yang, M. Severinsen, C. Hipsley, and S. Sommer (2024). Conditioning non-linear and infinite-dimensional diffusion processes. Advances in Neural Information Processing Systems 37, 10801–10826.
  • Benton, J., V. De Bortoli, A. Doucet, and G. Deligiannidis (2024). Nearly $d$-linear convergence bounds for diffusion models via stochastic localization. In The Twelfth International Conference on Learning Representations.
  • Beskos, A., O. Papaspiliopoulos, and G. O. Roberts (2006). Retrospective exact simulation of diffusion sample paths with applications. Bernoulli 12(6), 1077–1098.
  • Beskos, A., O. Papaspiliopoulos, G. O. Roberts, and P. Fearnhead (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 333–382.
  • Beskos, A. and G. O. Roberts (2005). Exact simulation of diffusions. The Annals of Applied Probability 15(4), 2422–2444.
  • Beskos, A., G. O. Roberts, A. Stuart, and J. Voss (2008). MCMC methods for diffusion bridges. Stochastics and Dynamics 8(3), 319–350.
  • Bierkens, J., S. Grazzi, F. van der Meulen, and M. Schauer (2021). A piecewise deterministic Monte Carlo method for diffusion bridges. Statistics and Computing 31(3), 1–21.
  • Bladt, M., S. Finch, and M. Sørensen (2016). Simulation of multivariate diffusion bridges. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 343–369.
  • Bladt, M. and M. Sørensen (2014). Simple simulation of diffusion bridges with application to likelihood inference for diffusions. Bernoulli 20(2), 645–675.
  • Bolhuis, P. G., D. Chandler, C. Dellago, and P. L. Geissler (2002). Transition path sampling: Throwing ropes over rough mountain passes, in the dark. Annual Review of Physical Chemistry 53(1), 291–318.
  • Boserup, N., G. Yang, M. L. Severinsen, C. A. Hipsley, and S. Sommer (2024). Parameter inference via differentiable diffusion bridge importance sampling. arXiv preprint arXiv:2411.08993.
  • Cattiaux et al. (2023) Cattiaux, P., G. Conforti, I. Gentil, and C. Léonard (2023). Time reversal of diffusion processes under a finite entropy condition. Annales de l’Institut Henri Poincare (B) Probabilites et statistiques 59(4), 1844–1881.
  • Chen et al. (2022) Chen, S., S. Chewi, J. Li, Y. Li, A. Salim, and A. Zhang (2022). Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations.
  • Chen et al. (2022) Chen, T., G.-H. Liu, and E. Theodorou (2022). Likelihood training of Schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations.
  • Clark (1990) Clark, J. M. C. (1990). The simulation of pinned diffusions. In 29th IEEE Conference on Decision and Control, pp. 1418–1420. IEEE.
  • De Bortoli et al. (2021) De Bortoli, V., J. Thornton, J. Heng, and A. Doucet (2021). Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems. PMLR.
  • Del Moral and Murray (2015) Del Moral, P. and L. M. Murray (2015). Sequential Monte Carlo with highly informative observations. SIAM/ASA Journal on Uncertainty Quantification 3(1), 969–997.
  • Delyon and Hu (2006) Delyon, B. and Y. Hu (2006). Simulation of conditioned diffusion and application to parameter estimation. Stochastic Processes and their Applications 116(11), 1660–1675.
  • Durham and Gallant (2002) Durham, G. B. and A. R. Gallant (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics 20(3), 297–338.
  • Elerian et al. (2001) Elerian, O., S. Chib, and N. Shephard (2001). Likelihood inference for discretely observed nonlinear diffusions. Econometrica 69(4), 959–993.
  • Eraker (2001) Eraker, B. (2001). MCMC analysis of diffusion models with application to finance. Journal of Business & Economic Statistics 19(2), 177–191.
  • Golightly and Wilkinson (2008) Golightly, A. and D. J. Wilkinson (2008). Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics & Data Analysis 52(3), 1674–1693.
  • Graham et al. (2022) Graham, M. M., A. H. Thiery, and A. Beskos (2022). Manifold Markov chain Monte Carlo methods for Bayesian inference in diffusion models. Journal of the Royal Statistical Society Series B: Statistical Methodology 84(4), 1229–1256.
  • Guarniero (2017) Guarniero, P. (2017). The Iterated Auxiliary Particle Filter and Applications to State Space Models and Diffusion Processes. Ph. D. thesis, University of Warwick.
  • Haussmann and Pardoux (1986) Haussmann, U. G. and E. Pardoux (1986). Time reversal of diffusions. The Annals of Probability, 1188–1205.
  • Hyvärinen (2005) Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(4).
  • Kingma and Ba (2014) Kingma, D. P. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kloeden and Platen (1992) Kloeden, P. E. and E. Platen (1992). Stochastic differential equations. In Numerical Solution of Stochastic Differential Equations, pp.  103–160. Springer.
  • Le Gall (2016a) Le Gall, J.-F. (2016a). Brownian Motion, Martingales, and Stochastic Calculus. Springer.
  • Le Gall (2016b) Le Gall, J.-F. (2016b). Brownian Motion, Martingales, and Stochastic Calculus. Springer.
  • Lin et al. (2010) Lin, M., R. Chen, and P. Mykland (2010). On generating Monte Carlo samples of continuous diffusion bridges. Journal of the American Statistical Association 105(490), 820–838.
  • Middleton et al. (2019) Middleton, L., G. Deligiannidis, A. Doucet, and P. E. Jacob (2019). Unbiased smoothing using particle independent Metropolis–Hastings. In International Conference on Artificial Intelligence and Statistics. PMLR.
  • Millet et al. (1989) Millet, A., D. Nualart, and M. Sanz (1989). Integration by parts and time reversal for diffusion processes. The Annals of Probability, 208–238.
  • Papaspiliopoulos and Roberts (2012) Papaspiliopoulos, O. and G. O. Roberts (2012). Importance sampling techniques for estimation of diffusion models. In M. Kessler, A. Lindner, and M. Sorensen (Eds.), Statistical Methods for Stochastic Differential Equations, pp.  311–340. Chapman and Hall/CRC.
  • Papaspiliopoulos et al. (2013) Papaspiliopoulos, O., G. O. Roberts, and O. Stramer (2013). Data augmentation for diffusions. Journal of Computational and Graphical Statistics 22(3), 665–688.
  • Pedersen (1995) Pedersen, A. R. (1995). Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli, 257–279.
  • Roberts and Stramer (2001) Roberts, G. O. and O. Stramer (2001). On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika 88(3), 603–621.
  • Rogers and Williams (2000) Rogers, L. C. G. and D. Williams (2000). Diffusions, Markov processes and Martingales: Volume 2: Itô Calculus, Volume 2. Cambridge University Press.
  • Schauer et al. (2017) Schauer, M., F. Van Der Meulen, and H. Van Zanten (2017). Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli 23(4A), 2917–2950.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR.
  • Song and Kingma (2021) Song, Y. and D. P. Kingma (2021). How to train your energy-based models. arXiv preprint arXiv:2101.03288.
  • Song et al. (2021) Song, Y., J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
  • Stramer and Yan (2007) Stramer, O. and J. Yan (2007). On simulated likelihood of discretely observed diffusion processes and comparison to closed-form approximation. Journal of Computational and Graphical Statistics 16(3), 672–691.
  • Stroock and Varadhan (1997) Stroock, D. W. and S. S. Varadhan (1997). Multidimensional Diffusion Processes, Volume 233. Springer Science & Business Media.
  • Stuart et al. (2004) Stuart, A. M., J. Voss, and P. Wilberg (2004). Conditional path sampling of SDEs and the Langevin MCMC method. Communications in Mathematical Sciences 2(4), 685–697.
  • van der Meulen and Schauer (2017) van der Meulen, F. and M. Schauer (2017). Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics 11(1), 2358–2396.
  • Vargas et al. (2021) Vargas, F., P. Thodoroff, A. Lamacraft, and N. Lawrence (2021). Solving Schrödinger bridges via maximum likelihood. Entropy 23(9), 1134.
  • Vaswani et al. (2017) Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • Vincent (2011) Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation 23(7), 1661–1674.
  • Wang et al. (2011) Wang, J., K. Zhang, L. Xu, and E. Wang (2011). Quantifying the Waddington landscape and biological paths for development and differentiation. Proceedings of the National Academy of Sciences 108(20), 8257–8262.
  • Wang et al. (2020) Wang, S., D. Ramkrishna, and V. Narsimhan (2020). Exact sampling of polymer conformations using Brownian bridges. The Journal of Chemical Physics 153(3), 034901.
  • Whitaker et al. (2017) Whitaker, G. A., A. Golightly, R. J. Boys, and C. Sherlock (2017). Improved bridge constructs for stochastic differential equations. Statistics and Computing 27(4), 885–900.

Appendix A Proofs of Propositions 1 and 2

In the following, we write x,yA=xTAy\langle x,y\rangle_{A}=x^{\mathrm{\scriptscriptstyle T}}Ay for the inner product of x,ydx,y\in\mathbb{R}^{d} weighted by a positive definite matrix Ad×dA\in\mathbb{R}^{d\times d} and xA=x,xA\|x\|_{A}=\sqrt{\langle x,x\rangle_{A}} for the induced weighted Euclidean norm. Before diving into the proof of Proposition 1, we recall the data processing inequality. A proof of this result can be found in Ambrosio et al. (2005, Lemma 9.4.5).

Lemma 1.

Assume that π0,π1\pi_{0},\pi_{1} are two probability distributions over a metric space 𝒳\mathcal{X}. Let φ:𝒳𝒳\varphi:\mathcal{X}\to\mathcal{X} be a Borel map, i.e. a measurable map for the σ\sigma-algebra generated by the open sets of 𝒳\mathcal{X}. Then, we have that

KL(φ#π0|φ#π1)KL(π0|π1).\mathrm{KL}(\varphi_{\#}\pi_{0}|\varphi_{\#}\pi_{1})\leq\mathrm{KL}(\pi_{0}|\pi_{1}).
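Lemma 1 can be illustrated with a small discrete example (our own, purely for intuition): pushing two distributions through a coarsening map can only decrease their Kullback–Leibler divergence.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions.
    return float(np.sum(p * np.log(p / q)))

# Two distributions on {0, 1, 2} and the coarsening map phi with
# phi(0) = 0 and phi(1) = phi(2) = 1.
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.2, 0.5, 0.3])
push0 = np.array([p0[0], p0[1] + p0[2]])
push1 = np.array([p1[0], p1[1] + p1[2]])

# Data processing inequality: KL(phi#p0 | phi#p1) <= KL(p0 | p1).
print(kl(push0, push1), "<=", kl(p0, p1))
```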

We are now ready to prove the main result of this section.

Proof of Proposition 1.

The proof is similar to Chen et al. (2022, Section 5.2), see also Benton et al. (2024). We recall that a process (Bt)t[0,T](B_{t})_{t\in[0,T]} defined on (Ω,,(t)t0,)(\Omega,\mathcal{F},(\mathcal{F}_{t})_{t\geq 0},\mathbb{P}) is a \mathbb{P}-Brownian motion if (Bt)t[0,T](B_{t})_{t\in[0,T]} is almost surely continuous, has independent increments, and for any s,t[0,T]s,t\in[0,T], BtBsB_{t}-B_{s} is a normal random variable with zero mean and covariance matrix |ts|Id|t-s|I_{d}. If there is no possible ambiguity, we simply say that (Bt)t[0,T](B_{t})_{t\in[0,T]} is a Brownian motion. First, we give a version of Girsanov’s theorem which is a consequence of Le Gall (2016b, Theorem 4.13, Theorem 5.22).

Let us give a few more details on the applications of Le Gall (2016b, Theorem 4.13) and Le Gall (2016b, Theorem 5.22) for the derivation of Theorem 1. We follow the approach described in Le Gall (2016b, p.136), where a process t=0tΔsdBs\mathcal{L}_{t}=\int_{0}^{t}\Delta_{s}\mathrm{d}B_{s} is first defined. Using Le Gall (2016b, Theorem 4.13), we immediately get that the process (t)t[0,T](\mathcal{L}_{t})_{t\in[0,T]} is a martingale. We follow the rest of the paragraph of Le Gall (2016b, p.136), notably leveraging that E[()T]=1{E}_{\mathbb{P}}[\mathcal{E}(\mathcal{L})_{T}]=1, to obtain that (()t)t[0,T](\mathcal{E}(\mathcal{L})_{t})_{t\in[0,T]} is a \mathbb{P}-martingale. The rest of the theorem is a direct application of Le Gall (2016b, Theorem 5.22).

Theorem 1.

Let (Ω,,(t)t0,)(\Omega,\mathcal{F},(\mathcal{F}_{t})_{t\geq 0},\mathbb{P}) be a filtered probability space and (Δt)t[0,T](\Delta_{t})_{t\in[0,T]} be a predictable process with E[0TΔs2ds]<{E}_{\mathbb{P}}\big{[}\int_{0}^{T}\|\Delta_{s}\|^{2}\mathrm{d}s\big{]}<\infty. Let (Bt)t0(B_{t})_{t\geq 0} be a \mathbb{P}-Brownian motion and define t=0tΔsdBs\mathcal{L}_{t}=\int_{0}^{t}\Delta_{s}\mathrm{d}B_{s}. Then, (t)t[0,T](\mathcal{L}_{t})_{t\in[0,T]} is a \mathbb{P}-martingale. For any t[0,T]t\in[0,T], we define

()t=exp(0tΔsdBs120tΔs2ds).\mathcal{E}(\mathcal{L})_{t}=\exp\Big{(}\int_{0}^{t}\Delta_{s}\mathrm{d}B_{s}-\frac{1}{2}\int_{0}^{t}\|\Delta_{s}\|^{2}\mathrm{d}s\Big{)}.

Assume that E[()T]=1{E}_{\mathbb{P}}[\mathcal{E}(\mathcal{L})_{T}]=1; then (()t)t[0,T](\mathcal{E}(\mathcal{L})_{t})_{t\in[0,T]} is a \mathbb{P}-martingale. Let \mathbb{Q} be a probability measure such that d/d=()T\mathrm{d}\mathbb{Q}/\mathrm{d}\mathbb{P}=\mathcal{E}(\mathcal{L})_{T} and let (βt)t[0,T](\beta_{t})_{t\in[0,T]} be defined for any t[0,T]t\in[0,T] by

βt=Bt0tΔsds.\beta_{t}=B_{t}-\int_{0}^{t}\Delta_{s}\mathrm{d}s.

Then (βt)t[0,T](\beta_{t})_{t\in[0,T]} is a \mathbb{Q}-Brownian motion.
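Theorem 1 can be sanity-checked by Monte Carlo in the simplest setting of a constant integrand Δt ≡ c, for which ℰ(ℒ)T = exp(cBT − c²T/2). The sketch below (our own illustration; the constants are arbitrary) verifies that the stochastic exponential has unit mean and that reweighting by it shifts the mean of BT to cT, consistent with βt = Bt − ct being a ℚ-Brownian motion.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, T = 200_000, 0.5, 1.0

# Terminal value of a P-Brownian motion: B_T ~ N(0, T).
B_T = np.sqrt(T) * rng.standard_normal(n)

# Stochastic exponential for the constant integrand Delta_t = c:
# E(L)_T = exp(c * B_T - 0.5 * c^2 * T).
weights = np.exp(c * B_T - 0.5 * c**2 * T)

# E_P[E(L)_T] = 1, and E_Q[B_T] = E_P[E(L)_T * B_T] = c * T,
# since B_t acquires drift c under Q.
print(weights.mean())          # approximately 1
print((weights * B_T).mean())  # approximately c * T = 0.5
```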

Let x0dx_{0}\in\mathbb{R}^{d} and recall that ϕx0\mathbb{Q}_{\phi}^{x_{0}} is the path measure induced by a time-reversed process Z=(Zt)t[0,T]Z=(Z_{t})_{t\in[0,T]}. More precisely, ϕx0\mathbb{Q}_{\phi}^{x_{0}} is the distribution of (ZTt)t[0,T](Z_{T-t})_{t\in[0,T]}, where Z=(Zt)t[0,T]Z=(Z_{t})_{t\in[0,T]} satisfies

dZt=bϕ(t,Zt)dt+σ(Tt,Zt)dBt,Z0p(T,dxT0,x0),\displaystyle\mathrm{d}Z_{t}=b_{\phi}(t,Z_{t})\mathrm{d}t+\sigma(T-t,Z_{t})\mathrm{d}B_{t},\quad Z_{0}\sim p(T,\mathrm{d}x_{T}\mid 0,x_{0}), (11)

with drift function bϕ(t,zt)=f(Tt,zt)+Σ(Tt,zt)sϕ(Tt,zt)+Σ(Tt,zt)b_{\phi}(t,z_{t})=-f(T-t,z_{t})+\Sigma(T-t,z_{t})s_{\phi}(T-t,z_{t})+\nabla\cdot\Sigma(T-t,z_{t}). In addition, we have that Z=(Zt)t[0,T]=(XTt)t[0,T]Z^{\star}=(Z_{t}^{\star})_{t\in[0,T]}=(X_{T-t})_{t\in[0,T]} is associated with x0\mathbb{P}^{x_{0}}. More precisely x0\mathbb{P}^{x_{0}} is the distribution of (Xt)t[0,T]=(ZTt)t[0,T](X_{t})_{t\in[0,T]}=(Z_{T-t}^{\star})_{t\in[0,T]} which satisfies

dZt=b(t,Zt)dt+σ(Tt,Zt)dBt,Z0p(T,dxT0,x0),\displaystyle\mathrm{d}Z_{t}^{\star}=b(t,Z_{t}^{\star})\mathrm{d}t+\sigma(T-t,Z_{t}^{\star})\mathrm{d}B_{t}^{\prime},\quad Z_{0}^{\star}\sim p(T,\mathrm{d}x_{T}\mid 0,x_{0}), (12)

with drift function b(t,zt)=f(Tt,zt)+Σ(Tt,zt)s(Tt,zt)+Σ(Tt,zt)b(t,z_{t})=-f(T-t,z_{t})+\Sigma(T-t,z_{t})s(T-t,z_{t})+\nabla\cdot\Sigma(T-t,z_{t}). Here (Bt)t[0,T](B^{\prime}_{t})_{t\in[0,T]} defines a dd-dimensional Brownian motion. In what follows, we define (Δt)t[0,T](\Delta_{t})_{t\in[0,T]} such that for any t[0,T]t\in[0,T]

Δt=σ(Tt,Zt)T{sϕ(Tt,Zt)s(Tt,Zt)}.\Delta_{t}=\sigma(T-t,Z_{t}^{\star})^{\mathrm{\scriptscriptstyle T}}\{s_{\phi}(T-t,Z_{t}^{\star})-s(T-t,Z_{t}^{\star})\}.

We recall that

e(ϕ)\displaystyle e(\phi) =Ex0{0Tsϕ(t,Xt)s(t,Xt)Σ(t,Xt)2dt}\displaystyle={E}^{x_{0}}\left\{\int_{0}^{T}\left\|s_{\phi}(t,X_{t})-s(t,X_{t})\right\|_{\Sigma(t,X_{t})}^{2}\mathrm{d}t\right\} (13)
=Ex0{0Tsϕ(Tt,Zt)s(Tt,Zt)Σ(Tt,Zt)2dt},\displaystyle={E}^{x_{0}}\left\{\int_{0}^{T}\left\|s_{\phi}(T-t,Z_{t}^{\star})-s(T-t,Z_{t}^{\star})\right\|_{\Sigma(T-t,Z_{t}^{\star})}^{2}\mathrm{d}t\right\}, (14)

where Ex0E^{x_{0}} denotes expectation with respect to x0\mathbb{P}^{x_{0}}.

We now provide an outline for the rest of the proof. First, we show that we can define a sequence of measures which approximate ϕx0\mathbb{Q}_{\phi}^{x_{0}}. This sequence, denoted by (ϕn)n(\mathbb{Q}_{\phi}^{n})_{n\in\mathbb{N}}, corresponds to changing the behaviour of the approximate process, see (16). In particular, near time TT, we use the true score ss instead of the approximate score sϕs_{\phi}. The important property of this sequence of measures is that we can derive a uniform bound on the Kullback–Leibler divergence KL(x0|ϕn)\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}^{n}_{\phi}), see (15). This is done by leveraging results from Girsanov theory, see Theorem 1. Finally, we need to ensure that this result implies that KL(x0|ϕx0)\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}^{x_{0}}_{\phi}) is bounded. It can easily be seen that when truncating the time, i.e. considering projε(ω)(t)=ω(t(Tε))\mathrm{proj}_{\varepsilon}(\omega)(t)=\omega(t\wedge(T-\varepsilon)), we have that for any ε>0\varepsilon>0, there exists n0n_{0}\in\mathbb{N} such that for any nn\in\mathbb{N} with nn0n\geq n_{0}, (projε)#ϕn=(projε)#ϕx0(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}_{\phi}^{n}=(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}_{\phi}^{x_{0}}. Therefore, using lower semi-continuity of the Kullback–Leibler divergence, see Ambrosio et al. (2005, Lemma 9.4.3), we are able to provide a finite bound for KL((projε)#x0|(projε)#ϕx0)\textsc{KL}((\mathrm{proj}_{\varepsilon})_{\#}\mathbb{P}^{x_{0}}|(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{x_{0}}_{\phi}) which is uniform in ε\varepsilon. Finally, we can use the data processing inequality in Lemma 1 to conclude the proof.

Note that (Δt)t[0,T](\Delta_{t})_{t\in[0,T]} is a predictable process and using the assumption that e(ϕ)<+e(\phi)<+\infty, we have

Ex0[0TΔt2dt]<+.{E}^{x_{0}}\Big{[}\int_{0}^{T}\|\Delta_{t}\|^{2}\mathrm{d}t\Big{]}<+\infty.

For any t[0,T]t\in[0,T], let t=0tΔsdBs\mathcal{L}_{t}=\int_{0}^{t}\Delta_{s}\mathrm{d}B^{\prime}_{s}. Then, using Le Gall (2016b, Proposition 5.11) we have that (()t)t[0,T](\mathcal{E}(\mathcal{L})_{t})_{t\in[0,T]} is a continuous local martingale. As a result there exists a sequence of stopping times (Tn)n(T_{n})_{n\in\mathbb{N}} such that TnTT_{n}\rightarrow T almost surely and (()tTn)t[0,T](\mathcal{E}(\mathcal{L})_{t\wedge T_{n}})_{t\in[0,T]} is a continuous martingale for each nn.

For any nn\in\mathbb{N} and t[0,T]t\in[0,T], let tn=0tΔs1[0,Tn](s)dBs\mathcal{L}_{t}^{n}=\int_{0}^{t}\Delta_{s}1_{[0,T_{n}]}(s)\mathrm{d}B_{s}^{\prime}, where 1A(s)=11_{A}(s)=1 if sAs\in A and 0 if sAs\notin A. Then for any t[0,T]t\in[0,T], we have ()tTn=(n)t\mathcal{E}(\mathcal{L})_{t\wedge T_{n}}=\mathcal{E}(\mathcal{L}^{n})_{t}, so (n)\mathcal{E}(\mathcal{L}^{n}) is a continuous martingale, and it follows that Ex0[(n)T]=1{E}^{x_{0}}[\mathcal{E}(\mathcal{L}^{n})_{T}]=1. Hence, using Theorem 1, for any nn\in\mathbb{N}, the measure ϕn\mathbb{Q}^{n}_{\phi} defined by dϕn/dx0=(n)T\mathrm{d}\mathbb{Q}^{n}_{\phi}/\mathrm{d}\mathbb{P}^{x_{0}}=\mathcal{E}(\mathcal{L}^{n})_{T} is a probability measure. Moreover, for any nn\in\mathbb{N}, (βtn)t[0,T](\beta^{n}_{t})_{t\in[0,T]} given by

βtn=Bt0tΔs1[0,Tn](s)ds\beta^{n}_{t}=B_{t}^{\prime}-\int_{0}^{t}\Delta_{s}1_{[0,T_{n}]}(s)\mathrm{d}s

is a ϕn\mathbb{Q}^{n}_{\phi}-Brownian motion. We also have

dZt\displaystyle\mathrm{d}Z_{t}^{\star} ={f(Tt,Zt)+Σ(Tt,Zt)sϕ(Tt,Zt)+Σ(Tt,Zt)}1[0,Tn](t)dt\displaystyle=\{-f(T-t,Z_{t}^{\star})+\Sigma(T-t,Z_{t}^{\star})s_{\phi}(T-t,Z_{t}^{\star})+\nabla\cdot\Sigma(T-t,Z_{t}^{\star})\}1_{[0,T_{n}]}(t)\mathrm{d}t
+{f(Tt,Zt)+Σ(Tt,Zt)s(Tt,Zt)+Σ(Tt,Zt)}1(Tn,T](t)dt+σ(Tt,Zt)dβtn.\displaystyle\qquad+\{-f(T-t,Z_{t}^{\star})+\Sigma(T-t,Z_{t}^{\star})s(T-t,Z_{t}^{\star})+\nabla\cdot\Sigma(T-t,Z_{t}^{\star})\}1_{(T_{n},T]}(t)\mathrm{d}t+\sigma(T-t,Z_{t}^{\star})\mathrm{d}\beta^{n}_{t}.

Equivalently, this can be rewritten as

dZt\displaystyle\mathrm{d}Z_{t}^{\star} ={bϕ(t,Zt)1[0,Tn](t)+b(t,Zt)1(Tn,T](t)}dt+σ(Tt,Zt)dβtn.\displaystyle=\{b_{\phi}(t,Z_{t}^{\star})1_{[0,T_{n}]}(t)+b(t,Z_{t}^{\star})1_{(T_{n},T]}(t)\}\mathrm{d}t+\sigma(T-t,Z_{t}^{\star})\mathrm{d}\beta^{n}_{t}.

In addition, we have that

KL(x0|ϕn)\displaystyle\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}^{n}_{\phi}) =Ex0[logdx0/dϕn]=Ex0[log(n)T]\displaystyle={E}^{x_{0}}[\log\mathrm{d}\mathbb{P}^{x_{0}}/\mathrm{d}\mathbb{Q}^{n}_{\phi}]=-{E}^{x_{0}}[\log\mathcal{E}(\mathcal{L}^{n})_{T}]
=Ex0[Tn+120TnΔs2ds]=12Ex0[0TnΔs2ds]\displaystyle={E}^{x_{0}}\left[-\mathcal{L}_{T_{n}}+\frac{1}{2}\int_{0}^{T_{n}}\|\Delta_{s}\|^{2}\mathrm{d}s\right]=\frac{1}{2}{E}^{x_{0}}\left[\int_{0}^{T_{n}}\|\Delta_{s}\|^{2}\mathrm{d}s\right] (15)

as \mathcal{L} is a x0\mathbb{P}^{x_{0}}-martingale. Finally, we define for any nn\in\mathbb{N}

dZtn\displaystyle\mathrm{d}Z_{t}^{n} ={f(Tt,Ztn)+Σ(Tt,Ztn)sϕ(Tt,Ztn)+Σ(Tt,Ztn)}1[0,Tn](t)dt\displaystyle=\{-f(T-t,Z_{t}^{n})+\Sigma(T-t,Z_{t}^{n})s_{\phi}(T-t,Z_{t}^{n})+\nabla\cdot\Sigma(T-t,Z_{t}^{n})\}1_{[0,T_{n}]}(t)\mathrm{d}t
+{f(Tt,Ztn)+Σ(Tt,Ztn)s(Tt,Ztn)+Σ(Tt,Ztn)}1(Tn,T](t)dt+σ(Tt,Ztn)dWt,\displaystyle\qquad+\{-f(T-t,Z_{t}^{n})+\Sigma(T-t,Z_{t}^{n})s(T-t,Z_{t}^{n})+\nabla\cdot\Sigma(T-t,Z_{t}^{n})\}1_{(T_{n},T]}(t)\mathrm{d}t+\sigma(T-t,Z_{t}^{n})\mathrm{d}W_{t},

with (Wt)t[0,T](W_{t})_{t\in[0,T]} a dd-dimensional Brownian motion defined over a filtered probability space (Ω,,(t)t0,𝕄)(\Omega,\mathcal{F},(\mathcal{F}_{t})_{t\geq 0},\mathbb{M}) and Z0np(T,dxT|0,x0)Z_{0}^{n}\sim p(T,\mathrm{d}x_{T}|0,x_{0}). For any nn\in\mathbb{N}, we have that ϕn\mathbb{Q}^{n}_{\phi} is the probability measure associated with (Ztn)t[0,T](Z_{t}^{n})_{t\in[0,T]}. We also define

dZtϕ\displaystyle\mathrm{d}Z_{t}^{\phi} ={f(Tt,Ztϕ)+Σ(Tt,Ztϕ)sϕ(Tt,Ztϕ)+Σ(Tt,Ztϕ)}dt+σ(Tt,Ztϕ)dWt,\displaystyle=\{-f(T-t,Z_{t}^{\phi})+\Sigma(T-t,Z_{t}^{\phi})s_{\phi}(T-t,Z_{t}^{\phi})+\nabla\cdot\Sigma(T-t,Z_{t}^{\phi})\}\mathrm{d}t+\sigma(T-t,Z_{t}^{\phi})\mathrm{d}W_{t},

and Z0ϕp(T,dxT|0,x0)Z_{0}^{\phi}\sim p(T,\mathrm{d}x_{T}|0,x_{0}). Note that (Ztϕ)t[0,T](Z_{t}^{\phi})_{t\in[0,T]} is associated with ϕx0\mathbb{Q}_{\phi}^{x_{0}}. Let ε>0\varepsilon>0 and define projε:𝒞([0,T],d)𝒞([0,T],d)\mathrm{proj}_{\varepsilon}:\mathcal{C}([0,T],\mathbb{R}^{d})\rightarrow\mathcal{C}([0,T],\mathbb{R}^{d}) by projε(ω)(t)=ω(t(Tε))\mathrm{proj}_{\varepsilon}(\omega)(t)=\omega(t\wedge(T-\varepsilon)) for t[0,T]t\in[0,T]. Then, for any ε>0\varepsilon>0, projε(Zn)projε(Zϕ)\mathrm{proj}_{\varepsilon}(Z^{n})\rightarrow\mathrm{proj}_{\varepsilon}(Z^{\phi}) uniformly over [0,T][0,T] almost surely. Therefore, using Chen et al. (2022, Lemma 12), we have that (projε)#ϕn(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{n}_{\phi} converges weakly to (projε)#ϕx0(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{x_{0}}_{\phi}. Using Ambrosio et al. (2005, Lemma 9.4.3), the data processing inequality in Lemma 1, and (15), we get

KL((projε)#x0|(projε)#ϕx0)\displaystyle\textsc{KL}((\mathrm{proj}_{\varepsilon})_{\#}\mathbb{P}^{x_{0}}|(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{x_{0}}_{\phi}) lim infnKL((projε)#x0|(projε)#ϕn)\displaystyle\leq\liminf_{n\rightarrow\infty}\textsc{KL}((\mathrm{proj}_{\varepsilon})_{\#}\mathbb{P}^{x_{0}}|(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{n}_{\phi})
lim infnKL(x0|ϕn)e(ϕ)<+.\displaystyle\leq\liminf_{n\rightarrow\infty}\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}^{n}_{\phi})\leq e(\phi)<+\infty. (16)

Finally, letting ε0\varepsilon\rightarrow 0 we have that projε(ω)ω\mathrm{proj}_{\varepsilon}(\omega)\rightarrow\omega uniformly on [0,T][0,T] (Chen et al., 2022, Lemma 13), and hence using Ambrosio et al. (2005, Corollary 9.4.6), we get that KL((projε)#x0|(projε)#ϕx0)KL(x0|ϕx0)\textsc{KL}((\mathrm{proj}_{\varepsilon})_{\#}\mathbb{P}^{x_{0}}|(\mathrm{proj}_{\varepsilon})_{\#}\mathbb{Q}^{x_{0}}_{\phi})\rightarrow\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}^{x_{0}}_{\phi}). Hence KL(x0|ϕx0)<\textsc{KL}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})<\infty under the assumption that e(ϕ)<e(\phi)<\infty. ∎

Proof of Proposition 2.

By expanding the square in (13), we can decompose the upper bound on the Kullback–Leibler divergence as

kl(x0|ϕx0)C1+C2C3,\displaystyle\textsc{kl}(\mathbb{P}^{x_{0}}|\mathbb{Q}_{\phi}^{x_{0}})\leq C_{1}+C_{2}-C_{3}, (17)

where

C1\displaystyle C_{1} =120Tds(t,xt)Σ(t,xt)2p(t,xt0,x0)dxtdt,\displaystyle=\frac{1}{2}\int_{0}^{T}\int_{\mathbb{R}^{d}}\left\|s(t,x_{t})\right\|_{\Sigma(t,x_{t})}^{2}p(t,x_{t}\mid 0,x_{0})\mathrm{d}x_{t}\mathrm{d}t,
C2\displaystyle C_{2} =12m=1Mtm1tmdsϕ(t,xt)Σ(t,xt)2p(t,xt0,x0)dxtdt,\displaystyle=\frac{1}{2}\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}\int_{\mathbb{R}^{d}}\|s_{\phi}(t,x_{t})\|_{\Sigma(t,x_{t})}^{2}p(t,x_{t}\mid 0,x_{0})\mathrm{d}x_{t}\mathrm{d}t, (18)
C3\displaystyle C_{3} =m=1Mtm1tmdsϕ(t,xt),s(t,xt)Σ(t,xt)p(t,xt0,x0)dxtdt.\displaystyle=\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}\int_{\mathbb{R}^{d}}\left\langle s_{\phi}(t,x_{t}),s(t,x_{t})\right\rangle_{\Sigma(t,x_{t})}p(t,x_{t}\mid 0,x_{0})\mathrm{d}x_{t}\mathrm{d}t.

We examine the term C3C_{3} that depends on the unknown score function s(t,xt)s(t,x_{t}). Firstly, we can write

C3=m=1Mtm1tmdsϕ(t,xt),p(t,xt0,x0)Σ(t,xt)dxtdt.\displaystyle C_{3}=\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}\int_{\mathbb{R}^{d}}\left\langle s_{\phi}(t,x_{t}),\nabla p(t,x_{t}\mid 0,x_{0})\right\rangle_{\Sigma(t,x_{t})}\mathrm{d}x_{t}\mathrm{d}t.

By differentiating the Chapman–Kolmogorov equation with respect to the variable xtx_{t}

p(t,xt0,x0)\displaystyle\nabla p(t,x_{t}\mid 0,x_{0}) =dp(t,xttm1,xtm1)p(tm1,xtm10,x0)dxtm1\displaystyle=\nabla\int_{\mathbb{R}^{d}}p(t,x_{t}\mid t_{m-1},x_{t_{m-1}})p(t_{m-1},x_{t_{m-1}}\mid 0,x_{0})\mathrm{d}x_{t_{m-1}}
=dlogp(t,xttm1,xtm1)p(t,xttm1,xtm1)p(tm1,xtm10,x0)dxtm1,\displaystyle=\int_{\mathbb{R}^{d}}\nabla\log p(t,x_{t}\mid t_{m-1},x_{t_{m-1}})p(t,x_{t}\mid t_{m-1},x_{t_{m-1}})p(t_{m-1},x_{t_{m-1}}\mid 0,x_{0})\mathrm{d}x_{t_{m-1}},

we obtain

C3=m=1Mtm1tmEx0{sϕ(t,Xt),logp(t,Xttm1,Xtm1)Σ(t,Xt)}dt.\displaystyle C_{3}=\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}{E}^{x_{0}}\left\{\left\langle s_{\phi}(t,X_{t}),\nabla\log p(t,X_{t}\mid t_{m-1},X_{t_{m-1}})\right\rangle_{\Sigma(t,X_{t})}\right\}\mathrm{d}t. (19)

By expanding the square in (8) and using (18), (19), and

C4\displaystyle C_{4} =12m=1Mtm1tmEx0{g(tm1,Xtm1,t,Xt)Σ(t,Xt)2}dt,\displaystyle=\frac{1}{2}\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}{E}^{x_{0}}\left\{\|g(t_{m-1},X_{t_{m-1}},t,X_{t})\|_{\Sigma(t,X_{t})}^{2}\right\}\mathrm{d}t,

we have

L(ϕ)=C2C3+C4.\displaystyle L(\phi)=C_{2}-C_{3}+C_{4}.

The claim follows by noting the decomposition in (17) and taking C=C1C4C=C_{1}-C_{4}. ∎
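The decomposition above shows that replacing the unknown score s by the tractable transition score g leaves the minimizer unchanged. This can be checked explicitly for a standard Brownian motion started at x0 with a single-interval partition (M = 1): there g(0, x0, t, xt) = −(xt − x0)/t coincides with s(t, xt) = ∇log p(t, xt ∣ 0, x0), so the discretized loss vanishes exactly at the true score. The following sketch (our own illustration, with Σ = 1) performs this check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps, T, x0 = 5_000, 100, 1.0, 0.0
dt = T / n_steps
ts = dt * np.arange(1, n_steps + 1)  # avoid t = 0, where the score is singular

# Simulate Brownian paths X_t = x0 + B_t on the grid.
increments = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
X = x0 + np.cumsum(increments, axis=1)

def loss(s_phi):
    # Time-discretized loss with M = 1: the regression target is
    # g(0, x0, t, x) = grad log N(x; x0, t) = -(x - x0) / t.
    g = -(X - x0) / ts
    return 0.5 * np.mean(np.sum((s_phi(ts, X) - g) ** 2 * dt, axis=1))

true_score = lambda t, x: -(x - x0) / t        # equals g exactly in this case
wrong_score = lambda t, x: -(x - x0) / (2 * t)

print(loss(true_score), loss(wrong_score))  # 0.0 and a positive value
```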

Appendix B Diffusion bridges

B.1 Time-reversed bridge process

Here we provide an alternative way to establish that the time-reversed bridge process (Zt)t[0,T]=(XTt)t[0,T](Z_{t}^{\star})_{t\in[0,T]}=(X_{T-t}^{\star})_{t\in[0,T]} evolves according to the time-reversal of the original diffusion process (Xt)t[0,T](X_{t})_{t\in[0,T]} in (1) with initialization at Z0=xTZ_{0}^{\star}=x_{T}. For any M3M\geq 3, let 0=t0<t1<<tM=T0=t_{0}<t_{1}<\cdots<t_{M}=T denote a partition of the interval [0,T][0,T]. The finite-dimensional distribution of (Xtm)m=1M1(X_{t_{m}}^{\star})_{m=1}^{M-1} is equal to that of (Xtm)m=1M1(X_{t_{m}})_{m=1}^{M-1} conditioned on X0=x0X_{0}=x_{0} and XT=xTX_{T}=x_{T}, which can be written as

Px0,xT(dxt1,,dxtM1)=Px0,xT(dxtM1)m=1M2Px0,xT(dxtmxtm+1,,xtM1).\displaystyle P^{x_{0},x_{T}}(\mathrm{d}x_{t_{1}},\ldots,\mathrm{d}x_{t_{M-1}})=P^{x_{0},x_{T}}(\mathrm{d}x_{t_{M-1}})\prod_{m=1}^{M-2}P^{x_{0},x_{T}}(\mathrm{d}x_{t_{m}}\mid x_{t_{m+1}},\ldots,x_{t_{M-1}}). (20)

We have

Px0,xT(dxtM1)=p(T,xTtM1,xtM1)p(tM1,xtM10,x0)dxtM1p(T,xT0,x0),\displaystyle P^{x_{0},x_{T}}(\mathrm{d}x_{t_{M-1}})=\frac{p(T,x_{T}\mid t_{M-1},x_{t_{M-1}})p(t_{M-1},x_{t_{M-1}}\mid 0,x_{0})\mathrm{d}x_{t_{M-1}}}{p(T,x_{T}\mid 0,x_{0})}, (21)

and for each m{1,,M2}m\in\{1,\ldots,M-2\}

Px0,xT(dxtmxtm+1,,xtM1)=p(tm+1,xtm+1tm,xtm)p(tm,xtm0,x0)dxtmp(tm+1,xtm+10,x0),\displaystyle P^{x_{0},x_{T}}(\mathrm{d}x_{t_{m}}\mid x_{t_{m+1}},\ldots,x_{t_{M-1}})=\frac{p(t_{m+1},x_{t_{m+1}}\mid t_{m},x_{t_{m}})p(t_{m},x_{t_{m}}\mid 0,x_{0})\mathrm{d}x_{t_{m}}}{p(t_{m+1},x_{t_{m+1}}\mid 0,x_{0})}, (22)

which are indeed the transition kernels of the time-reversed process (XtMm)m=1M1(X_{t_{M-m}})_{m=1}^{M-1}.
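For standard Brownian motion, p(t, x ∣ s, y) = N(x; y, t − s) and the backward kernel (22) is available in closed form, so the construction can be checked numerically: it should be a probability kernel whose mean and variance match the familiar Brownian bridge formulas. The grid-based sketch below (our own illustration; the time points are arbitrary) performs this check.

```python
import numpy as np

def gauss(x, mean, var):
    # Univariate normal density N(x; mean, var).
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

x0, x_next, t_m, t_next = 0.0, 1.5, 0.4, 0.7  # arbitrary configuration

# Backward kernel density in x at time t_m given x_next at time t_{m+1}:
# p(t_{m+1}, x_next | t_m, x) p(t_m, x | 0, x0) / p(t_{m+1}, x_next | 0, x0).
x = np.linspace(-10.0, 10.0, 20_001)
dx = x[1] - x[0]
dens = gauss(x_next, x, t_next - t_m) * gauss(x, x0, t_m) / gauss(x_next, x0, t_next)

mass = np.sum(dens) * dx
mean = np.sum(x * dens) * dx
var = np.sum((x - mean) ** 2 * dens) * dx

# Chapman-Kolmogorov gives total mass 1; the Brownian bridge pinned at
# (0, x0) and (t_{m+1}, x_next) has mean x0 + (x_next - x0) t_m / t_{m+1}
# and variance t_m (t_{m+1} - t_m) / t_{m+1}.
print(mass, mean, var)
```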

B.2 Learning Doob’s hh-transform

Suppose we have found a minimizer ϕ^argminϕΦL(ϕ)\hat{\phi}\in\arg\min_{\phi\in\Phi}L(\phi) satisfying e(ϕ^)<e(\hat{\phi})<\infty and denote the corresponding score approximation as s^(t,xt)=sϕ^(t,xt)\hat{s}(t,x_{t})=s_{\hat{\phi}}(t,x_{t}) and drift function as b^(t,zt)=bϕ^(t,zt)\hat{b}(t,z_{t})=b_{\hat{\phi}}(t,z_{t}). Consider a time-reversed bridge process Z^=(Z^t)t[0,T]\hat{Z}=(\hat{Z}_{t})_{t\in[0,T]} satisfying

dZ^t=b^(t,Z^t)dt+σ(Tt,Z^t)dBt,Z^0=xT,\displaystyle\mathrm{d}\hat{Z}_{t}=\hat{b}(t,\hat{Z}_{t})\mathrm{d}t+\sigma(T-t,\hat{Z}_{t})\mathrm{d}B_{t},\quad\hat{Z}_{0}=x_{T}, (23)

which should be seen as an approximation of (4). Let q^(t,zts,zs)\hat{q}(t,z_{t}\mid s,z_{s}) denote its transition density for any 0s<tT0\leq s<t\leq T, ^x0,xT\hat{\mathbb{Q}}^{x_{0},x_{T}} be the induced path measure, and E^x0,xT\hat{E}^{x_{0},x_{T}} denote expectation with respect to ^x0,xT\hat{\mathbb{Q}}^{x_{0},x_{T}}. Note that p^(t,xt)=q^(Tt,xt0,xT)\hat{p}^{\star}(t,x_{t})=\hat{q}(T-t,x_{t}\mid 0,x_{T}) is an approximation of the marginal density p(t,xt)p^{\star}(t,x_{t}) in (5) for each t(0,T)t\in(0,T).

Our discussion around (10) suggests using the time-reversal of (23) as an approximation of the Doob’s hh-transform process XX^{\star} in (2). The bridge process X^=(X^t)t[0,T]=(Z^Tt)t[0,T]\hat{X}=(\hat{X}_{t})_{t\in[0,T]}=(\hat{Z}_{T-t})_{t\in[0,T]} satisfies

dX^t=f^(t,X^t)dt+σ(t,X^t)dWt,X^0=x0,\displaystyle\mathrm{d}\hat{X}_{t}=\hat{f}(t,\hat{X}_{t})\mathrm{d}t+\sigma(t,\hat{X}_{t})\mathrm{d}W_{t},\quad\hat{X}_{0}=x_{0},

with drift function f^(t,xt)=b^(Tt,xt)+Σ(t,xt)s^(t,xt)+Σ(t,xt)\hat{f}(t,x_{t})=-\hat{b}(T-t,x_{t})+\Sigma(t,x_{t})\hat{s}^{\star}(t,x_{t})+\nabla\cdot\Sigma(t,x_{t}) which is to be seen as an approximation of f(t,xt)f^{\star}(t,x_{t}). We can approximate the score s^(t,xt)=logp^(t,xt)\hat{s}^{\star}(t,x_{t})=\nabla\log\hat{p}^{\star}(t,x_{t}) of the marginal density p^(t,xt)\hat{p}^{\star}(t,x_{t}) and hence the time-reversal of (23) using the methodology described in Section 2.2. The following summarizes the key elements involved.

Consider a path measure ^ϕx0,xT\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}} that is induced by the bridge process X^=(X^t)t[0,T]\hat{X}=(\hat{X}_{t})_{t\in[0,T]} satisfying

dX^t=f^ϕ(t,X^t)dt+σ(t,X^t)dWt,X^0=x0,\displaystyle\mathrm{d}\hat{X}_{t}=\hat{f}_{\phi}(t,\hat{X}_{t})\mathrm{d}t+\sigma(t,\hat{X}_{t})\mathrm{d}W_{t},\quad\hat{X}_{0}=x_{0}, (24)

with drift function f^ϕ(t,xt)=b^(Tt,xt)+Σ(t,xt)s^ϕ(t,xt)+Σ(t,xt)\hat{f}_{\phi}(t,x_{t})=-\hat{b}(T-t,x_{t})+\Sigma(t,x_{t})\hat{s}^{\star}_{\phi}(t,x_{t})+\nabla\cdot\Sigma(t,x_{t}), where s^ϕ:[0,T]×dd\hat{s}_{\phi}^{\star}:[0,T]\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} denotes a function approximation of the score s^(t,xt)\hat{s}^{\star}(t,x_{t}) with parameters ϕΦ\phi\in\Phi to be optimized. We now measure the score approximation error as

e^(ϕ)=E^x0,xT{0Ts^ϕ(t,X^t)s^(t,X^t)Σ(t,X^t)2dt}.\displaystyle\hat{e}(\phi)=\hat{E}^{x_{0},x_{T}}\left\{\int_{0}^{T}\left\|\hat{s}_{\phi}^{\star}(t,\hat{X}_{t})-\hat{s}^{\star}(t,\hat{X}_{t})\right\|_{\Sigma(t,\hat{X}_{t})}^{2}\mathrm{d}t\right\}.
Proposition 3.

Assuming e^(ϕ)<\hat{e}(\phi)<\infty, we have kl(^x0,xT|^ϕx0,xT)e^(ϕ)/2\textsc{kl}(\hat{\mathbb{Q}}^{x_{0},x_{T}}|\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}})\leq\hat{e}(\phi)/2 and X^T=xT\hat{X}_{T}=x_{T} holds ^ϕx0,xT\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}}-almost surely.

Proposition 4.

For any partition (tm)m=0M(t_{m})_{m=0}^{M} of the interval [0,T][0,T], we have kl(^x0,xT|^ϕx0,xT)L^(ϕ)+C^\textsc{kl}(\hat{\mathbb{Q}}^{x_{0},x_{T}}|\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}})\leq\hat{L}(\phi)+\hat{C} if e^(ϕ)<\hat{e}(\phi)<\infty, where C^\hat{C} is a constant independent of ϕΦ\phi\in\Phi, the loss function is defined as

L^(ϕ)=12m=1Mtm1tmE^x0,xT{s^ϕ(Tt,Z^t)g^(tm1,Z^tm1,t,Z^t)Σ(t,Z^t)2}dt,\displaystyle\hat{L}(\phi)=\frac{1}{2}\sum_{m=1}^{M}\int_{t_{m-1}}^{t_{m}}\hat{E}^{x_{0},x_{T}}\left\{\left\|\hat{s}_{\phi}^{\star}(T-t,\hat{Z}_{t})-\hat{g}(t_{m-1},\hat{Z}_{t_{m-1}},t,\hat{Z}_{t})\right\|_{\Sigma(t,\hat{Z}_{t})}^{2}\right\}\mathrm{d}t, (25)

and g^(s,zs,t,zt)=logq^(t,zts,zs)\hat{g}(s,z_{s},t,z_{t})=\nabla\log\hat{q}(t,z_{t}\mid s,z_{s}) for 0s<tT0\leq s<t\leq T and zs,ztdz_{s},z_{t}\in\mathbb{R}^{d}.

The proof of these results is similar to Section A of the Supplementary Material and is thus omitted. As before, this allows us to circumvent the intractability of the Kullback–Leibler divergence kl(^x0,xT|^ϕx0,xT)\textsc{kl}(\hat{\mathbb{Q}}^{x_{0},x_{T}}|\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}}) by minimizing the loss function L^(ϕ)\hat{L}(\phi). In the ideal case where s^ϕ(t,xt)=s^(t,xt)\hat{s}_{\phi}^{\star}(t,x_{t})=\hat{s}^{\star}(t,x_{t}) holds ^x0,xT\hat{\mathbb{Q}}^{x_{0},x_{T}}-almost surely, the loss attains its minimal value L^(ϕ)=C^\hat{L}(\phi)=-\hat{C}, which is also unknown in practice; moreover, ^ϕx0,xT\hat{\mathbb{Q}}_{\phi}^{x_{0},x_{T}} recovers the law of the diffusion bridge process XX^{\star} only if the initial score approximation error satisfies e(ϕ^)=0e(\hat{\phi})=0.

Given a minimizer ϕargminϕΦL^(ϕ)\phi^{\circ}\in\arg\min_{\phi\in\Phi}\hat{L}(\phi) and the corresponding score approximation s(t,xt)=s^ϕ(t,xt)s^{\circ}(t,x_{t})=\hat{s}_{\phi^{\circ}}^{\star}(t,x_{t}), by rewriting the drift f(t,xt)=f^ϕ(t,xt)f^{\circ}(t,x_{t})=\hat{f}_{\phi^{\circ}}(t,x_{t}) as

f(t,xt)=f(t,xt)+Σ(t,xt){s(t,xt)s^(t,xt)}\displaystyle f^{\circ}(t,x_{t})=f(t,x_{t})+\Sigma(t,x_{t})\{s^{\circ}(t,x_{t})-\hat{s}(t,x_{t})\}

and comparing it with (10), we see that the correction term Σ(t,xt){s(t,xt)s^(t,xt)}\Sigma(t,x_{t})\{s^{\circ}(t,x_{t})-\hat{s}(t,x_{t})\} on the right provides an approximation of the term Σ(t,xt)logh(t,xt)\Sigma(t,x_{t})\nabla\log h(t,x_{t}) in Doob’s hh-transform.

Appendix C Numerical implementation

In this section, we detail various numerical considerations to implement our proposed methodology. For simplicity, we employ the Euler–Maruyama scheme (Kloeden and Platen, 1992, p. 340) on a uniform discretization of the interval [0,T][0,T], denoted by 0=t0<t1<<tM=T0=t_{0}<t_{1}<\cdots<t_{M}=T, with stepsize δt=T/M\delta t=T/M for M>1M>1. Non-uniform discretizations involve only minor modifications; some higher-order schemes could also be considered. In the following, we denote a multivariate normal distribution with mean vector μd\mu\in\mathbb{R}^{d} and covariance matrix Σd×d\Sigma\in\mathbb{R}^{d\times d} as 𝒩(μ,Σ)\mathcal{N}(\mu,\Sigma), and its density as x𝒩(x;μ,Σ)x\mapsto\mathcal{N}(x;\mu,\Sigma). We write the zero vector as 0dd0_{d}\in\mathbb{R}^{d} and the identity matrix as Idd×dI_{d}\in\mathbb{R}^{d\times d}.

The time-discretization of the stochastic differential equation defining XX satisfies the following recursion

Xtm=Xtm1+δtf(tm1,Xtm1)+σ(tm1,Xtm1)(WtmWtm1),X0=x0,\displaystyle X_{t_{m}}=X_{t_{m-1}}+\delta tf(t_{m-1},X_{t_{m-1}})+\sigma(t_{m-1},X_{t_{m-1}})(W_{t_{m}}-W_{t_{m-1}}),\quad X_{0}=x_{0}, (26)

for m{1,,M}m\in\{1,\ldots,M\}, with independent Brownian increments WtmWtm1𝒩(0d,δtId)W_{t_{m}}-W_{t_{m-1}}\sim\mathcal{N}(0_{d},\delta tI_{d}). Equation (26) induces a normal approximation of the transition density p(tm,xtmtm1,xtm1)p(t_{m},x_{t_{m}}\mid t_{m-1},x_{t_{m-1}}) of the form

pM(tm,xtmtm1,xtm1)=𝒩{xtm;xtm1+δtf(tm1,xtm1),δtΣ(tm1,xtm1)}.\displaystyle p_{M}(t_{m},x_{t_{m}}\mid t_{m-1},x_{t_{m-1}})=\mathcal{N}\{x_{t_{m}};x_{t_{m-1}}+\delta tf(t_{m-1},x_{t_{m-1}}),\delta t\Sigma(t_{m-1},x_{t_{m-1}})\}.

By replacing the score of p(tm,xtmtm1,xtm1)p(t_{m},x_{t_{m}}\mid t_{m-1},x_{t_{m-1}}) with that of pM(tm,xtmtm1,xtm1)p_{M}(t_{m},x_{t_{m}}\mid t_{m-1},x_{t_{m-1}}), g(tm1,xtm1,tm,xtm)g(t_{m-1},x_{t_{m-1}},t_{m},x_{t_{m}}) in Proposition 2 can be approximated by

gM(tm1,xtm1,tm,xtm)=(δt)1Σ1(tm1,xtm1){xtmxtm1δtf(tm1,xtm1)}.\displaystyle g_{M}(t_{m-1},x_{t_{m-1}},t_{m},x_{t_{m}})=-(\delta t)^{-1}\Sigma^{-1}(t_{m-1},x_{t_{m-1}})\{x_{t_{m}}-x_{t_{m-1}}-\delta tf(t_{m-1},x_{t_{m-1}})\}.

This approximation is of order (δt)1/2(\delta t)^{-1/2} as δt0\delta t\rightarrow 0. One could consider a control variate approach to stabilize the approximation as discussed in Song and Kingma (2021). While one could also consider higher-order discretization schemes to approximate the function gg, it may not be feasible or worthwhile to compute higher-order derivatives of ff and σ\sigma, particularly if the time-discretization error is dominated by the neural network approximation error.
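As a concrete illustration, the Euler–Maruyama recursion (26) and the discretized score target gMg_{M} can be sketched as follows in the scalar case. This is a minimal sketch: the drift and diffusion functions passed in are user-supplied placeholders, not the models used in the paper.

```python
import numpy as np

def simulate_em(f, sigma, x0, T, M, rng):
    """One Euler-Maruyama path of dX = f(t,X)dt + sigma(t,X)dW with X_0 = x0."""
    dt = T / M
    x = np.empty(M + 1)
    x[0] = x0
    for m in range(1, M + 1):
        t = (m - 1) * dt
        x[m] = (x[m - 1] + dt * f(t, x[m - 1])
                + sigma(t, x[m - 1]) * rng.normal(0.0, np.sqrt(dt)))
    return x

def g_M(f, sigma, t_prev, x_prev, x_next, dt):
    """Score of the Gaussian transition p_M, used as the regression target."""
    Sigma = sigma(t_prev, x_prev) ** 2      # Sigma = sigma sigma^T in one dimension
    return -(x_next - x_prev - dt * f(t_prev, x_prev)) / (dt * Sigma)
```

For instance, with an Ornstein–Uhlenbeck drift f(t,x)=-x and unit diffusion coefficient, sample paths from `simulate_em` supply the pairs (X_{t_m}, g_M) that the loss below regresses the score approximation onto.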

We then define the following approximation of the loss function L(ϕ)L(\phi) in (8)

LM(ϕ)=12δtm=1MEMx0{sϕ(tm,Xtm)gM(tm1,Xtm1,tm,Xtm)Σ(tm,Xtm)2},L_{M}(\phi)=\frac{1}{2}\delta t\sum_{m=1}^{M}{E}_{M}^{x_{0}}\left\{\left\|s_{\phi}(t_{m},X_{t_{m}})-g_{M}(t_{m-1},X_{t_{m-1}},t_{m},X_{t_{m}})\right\|_{\Sigma(t_{m},X_{t_{m}})}^{2}\right\},

where EMx0{E}_{M}^{x_{0}} denotes expectation with respect to the law of the time-discretized process under (26). To obtain a minimizer ϕ^argminϕΦLM(ϕ)\hat{\phi}\in\arg\min_{\phi\in\Phi}L_{M}(\phi) using stochastic gradient algorithms, the gradient with respect to parameters ϕΦ\phi\in\Phi

ϕLM(ϕ)=δtm=1MEMx0[(ϕsϕTΣ)(tm,Xtm){sϕ(tm,Xtm)gM(tm1,Xtm1,tm,Xtm)}]\displaystyle\nabla_{\phi}L_{M}(\phi)=\delta t\sum_{m=1}^{M}E_{M}^{x_{0}}\left[(\nabla_{\phi}s_{\phi}^{\mathrm{\scriptscriptstyle T}}\Sigma)(t_{m},X_{t_{m}})\{s_{\phi}(t_{m},X_{t_{m}})-g_{M}(t_{m-1},X_{t_{m-1}},t_{m},X_{t_{m}})\}\right] (27)

can be unbiasedly estimated using independent sample paths from (26). The above notation ϕsϕ\nabla_{\phi}s_{\phi} refers to the Jacobian of sϕs_{\phi}. Equation (27) can be seen as an approximation of the gradient ϕL(ϕ)\nabla_{\phi}{L}(\phi).
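To make the optimization step concrete, the following sketch performs full-batch gradient descent on LM(ϕ)L_{M}(\phi) using the gradient formula (27). All modelling choices here are illustrative assumptions on our part: a toy scalar Ornstein–Uhlenbeck diffusion stands in for the model, and a deliberately simple linear-in-parameters score sϕ(t,x)=ϕ1+ϕ2xs_{\phi}(t,x)=\phi_{1}+\phi_{2}x replaces the neural network used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumption): dX = -X dt + dW on [0, 1], X_0 = 1, so Sigma = 1.
f = lambda t, x: -x
Sigma = 1.0
T, M = 1.0, 50
dt = T / M

def simulate_paths(n):
    """n Euler-Maruyama paths, returned as an array of shape (M+1, n)."""
    x = np.full(n, 1.0)
    out = [x]
    for m in range(M):
        x = x + dt * f(m * dt, x) + np.sqrt(dt) * rng.normal(size=n)
        out.append(x)
    return np.stack(out)

def loss_and_grad(phi, paths):
    """L_M(phi) and its gradient (27) for s_phi(t, x) = phi[0] + phi[1] * x."""
    L, grad = 0.0, np.zeros_like(phi)
    for m in range(1, M + 1):
        x_prev, x_next = paths[m - 1], paths[m]
        g = -(x_next - x_prev - dt * f((m - 1) * dt, x_prev)) / (dt * Sigma)
        feat = np.stack([np.ones_like(x_next), x_next])   # grad_phi of s_phi
        resid = phi @ feat - g
        L += 0.5 * dt * np.mean(Sigma * resid ** 2)
        grad += dt * (feat * (Sigma * resid)).mean(axis=1)
    return L, grad

paths = simulate_paths(200)
phi = np.zeros(2)
L0, _ = loss_and_grad(phi, paths)
for _ in range(200):                       # full-batch gradient descent
    L, grad = loss_and_grad(phi, paths)
    phi -= 0.05 * grad
L_final, _ = loss_and_grad(phi, paths)
```

Since the loss is quadratic in ϕ\phi for this linear model, gradient descent with a small stepsize decreases it monotonically; with neural network parameterizations one would instead draw fresh sample paths at each iteration and use stochastic gradients.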

After obtaining ϕ^\hat{\phi} with optimization, we can simulate a proposal bridge Z^=(Z^t)t[0,T]\hat{Z}=(\hat{Z}_{t})_{t\in[0,T]} satisfying (23) with drift function b^(t,zt)=bϕ^(t,zt)\hat{b}(t,z_{t})=b_{\hat{\phi}}(t,z_{t}). We employ the following modified Euler–Maruyama scheme

Z^tm=Z^tm1+δtb^(tm1,Z^tm1)+(TtmTtm1)1/2σ(Ttm1,Z^tm1)(BtmBtm1),\displaystyle\hat{Z}_{t_{m}}=\hat{Z}_{t_{m-1}}+\delta t\hat{b}(t_{m-1},\hat{Z}_{t_{m-1}})+\left(\frac{T-t_{m}}{T-t_{m-1}}\right)^{1/2}\sigma(T-t_{m-1},\hat{Z}_{t_{m-1}})(B_{t_{m}}-B_{t_{m-1}}), (28)

for m{1,,M1}m\in\{1,\ldots,M-1\}, with independent Brownian increments BtmBtm1𝒩(0d,δtId)B_{t_{m}}-B_{t_{m-1}}\sim\mathcal{N}(0_{d},\delta tI_{d}), initial condition Z^0=xT\hat{Z}_{0}=x_{T}, and terminal constraint Z^T=x0\hat{Z}_{T}=x_{0}. This scales the variance of the usual Euler–Maruyama transitions by a factor of (Ttm)/(Ttm1)(T-t_{m})/(T-t_{m-1}) at time step mm. We found that this modification can improve practical performance for times near the endpoint by lowering the transition variances. Such behaviour is consistent with findings in earlier works by Durham and Gallant (2002) and Papaspiliopoulos et al. (2013) when constructing proposal bridge processes with the drift of a Brownian bridge. This gives a normal approximation of the transition density q^(tm,ztmtm1,ztm1)\hat{q}(t_{m},z_{t_{m}}\mid t_{m-1},z_{t_{m-1}})

q^M(tm,ztmtm1,ztm1)\displaystyle\hat{q}_{M}(t_{m},z_{t_{m}}\mid t_{m-1},z_{t_{m-1}})
=𝒩{ztm;ztm1+δtb^(tm1,ztm1),δt(TtmTtm1)Σ(Ttm1,ztm1)}.\displaystyle=\mathcal{N}\left\{z_{t_{m}};z_{t_{m-1}}+\delta t\hat{b}(t_{m-1},z_{t_{m-1}}),\delta t\left(\frac{T-t_{m}}{T-t_{m-1}}\right)\Sigma(T-t_{m-1},z_{t_{m-1}})\right\}. (29)
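A sketch of the modified scheme (28) in the scalar case is given below; the drift `b_hat` and diffusion coefficient `sigma` are user-supplied placeholders, and the final state is pinned directly to the terminal constraint.

```python
import numpy as np

def simulate_modified_em(b_hat, sigma, xT, x0, T, M, rng):
    """Simulate the backward proposal bridge (28): start at x_T, end pinned at x_0.
    The transition variance is scaled by (T - t_m)/(T - t_{m-1}), which shrinks
    towards zero near the endpoint and enforces the terminal constraint."""
    dt = T / M
    z = np.empty(M + 1)
    z[0] = xT
    for m in range(1, M):
        t_prev, t_m = (m - 1) * dt, m * dt
        scale = np.sqrt((T - t_m) / (T - t_prev))
        z[m] = (z[m - 1] + dt * b_hat(t_prev, z[m - 1])
                + scale * sigma(T - t_prev, z[m - 1]) * rng.normal(0.0, np.sqrt(dt)))
    z[M] = x0                      # terminal constraint Z_T = x_0
    return z
```

The sample path `z` is indexed forward in the time variable of the backward process, so `z[0]` sits at the terminal state of the original diffusion and `z[M]` at its initial state.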

We can perform importance sampling on 𝔼M=(d)M1\mathbb{E}_{M}=(\mathbb{R}^{d})^{M-1} to correct for the discrepancy between the law of our proposal bridge process

Q^Mx0,xT(zt1:tM1)=m=1M1q^M(tm,ztmtm1,ztm1),zt1:tM1=(ztm)m=1M1𝔼M,\displaystyle\hat{Q}_{M}^{x_{0},x_{T}}(z_{t_{1}:t_{M-1}})=\prod_{m=1}^{M-1}\hat{q}_{M}(t_{m},z_{t_{m}}\mid t_{m-1},z_{t_{m-1}}),\quad z_{t_{1}:t_{M-1}}=(z_{t_{m}})_{m=1}^{M-1}\in\mathbb{E}_{M}, (30)

and the law of the time-discretized diffusion bridge process

PMx0,xT(xt1:tM1)=γMx0,xT(xt1:tM1)pM(T,xT0,x0),xt1:tM1=(xtm)m=1M1𝔼M,\displaystyle P_{M}^{x_{0},x_{T}}(x_{t_{1}:t_{M-1}})=\frac{\gamma_{M}^{x_{0},x_{T}}(x_{t_{1}:t_{M-1}})}{p_{M}(T,x_{T}\mid 0,x_{0})},\quad x_{t_{1}:t_{M-1}}=(x_{t_{m}})_{m=1}^{M-1}\in\mathbb{E}_{M}, (31)

with γMx0,xT(xt1:tM1)=m=1MpM(tm,xtmtm1,xtm1)\gamma_{M}^{x_{0},x_{T}}(x_{t_{1}:t_{M-1}})=\prod_{m=1}^{M}p_{M}(t_{m},x_{t_{m}}\mid t_{m-1},x_{t_{m-1}}), and also estimate

pM(T,xT0,x0)=𝔼MγMx0,xT(xt1:tM1)dxt1:tM1,\displaystyle p_{M}(T,x_{T}\mid 0,x_{0})=\int_{\mathbb{E}_{M}}\gamma_{M}^{x_{0},x_{T}}(x_{t_{1}:t_{M-1}})\mathrm{d}x_{t_{1}:t_{M-1}}, (32)

which is an approximation of the transition density p(T,xT0,x0)p(T,x_{T}\mid 0,x_{0}) under the Euler–Maruyama scheme. The corresponding unnormalized importance weight is ω(zt1:tM1)=γMx0,xT(xt1:tM1)/Q^Mx0,xT(zt1:tM1)\omega(z_{t_{1}:t_{M-1}})=\gamma_{M}^{x_{0},x_{T}}(x_{t_{1}:t_{M-1}})/\hat{Q}_{M}^{x_{0},x_{T}}(z_{t_{1}:t_{M-1}}) with xt1:tM1=(zTtm)m=1M1x_{t_{1}:t_{M-1}}=(z_{T-t_{m}})_{m=1}^{M-1}, and an unbiased importance sampling estimator of the transition density pM(T,xT0,x0)p_{M}(T,x_{T}\mid 0,x_{0}) is N1n=1Nω(zt1:tM1n)N^{-1}\sum_{n=1}^{N}\omega(z_{t_{1}:t_{M-1}}^{n}) where (zt1:tM1n)n=1N(z_{t_{1}:t_{M-1}}^{n})_{n=1}^{N} denote NN\in\mathbb{N} independent sample paths from Q^Mx0,xT\hat{Q}_{M}^{x_{0},x_{T}}. As noted by Lin et al. (2010), the relative mean squared error of this transition density estimator is approximately equal to the χ2\chi^{2}-divergence of Q^Mx0,xT\hat{Q}_{M}^{x_{0},x_{T}} from PMx0,xTP_{M}^{x_{0},x_{T}} divided by the sample size NN. One can also employ proposals from (30) within an independent Metropolis–Hastings algorithm that has (31) as its invariant law (Elerian et al., 2001). At each iteration of the algorithm, a sample path zt1:tM1Q^Mx0,xTz_{t_{1}:t_{M-1}}^{\circ}\sim\hat{Q}_{M}^{x_{0},x_{T}} is accepted with probability min{1,ω(zt1:tM1)/ω(zt1:tM1)}\min\{1,\omega(z_{t_{1}:t_{M-1}}^{\circ})/\omega(z_{t_{1}:t_{M-1}})\}, where zt1:tM1z_{t_{1}:t_{M-1}} denotes the current state of the Markov chain. The efficiency of this Markov chain Monte Carlo algorithm can be assessed by monitoring its acceptance probability. To improve the acceptance probability, we can also combine independent Metropolis–Hastings with importance sampling within a particle independent Metropolis–Hastings algorithm (Andrieu et al., 2010) that has invariant law (31).
Each iteration of this algorithm involves selecting a proposed sample path zt1:tM1z_{t_{1}:t_{M-1}}^{\circ} among NN candidates (zt1:tM1n)n=1NQ^Mx0,xT(z_{t_{1}:t_{M-1}}^{n})_{n=1}^{N}\sim\hat{Q}_{M}^{x_{0},x_{T}} according to probabilities proportional to their weights (ω(zt1:tM1n))n=1N(\omega(z_{t_{1}:t_{M-1}}^{n}))_{n=1}^{N}, and accepting it with probability min{1,p^M(T,xT0,x0)/p^M(T,xT0,x0)}\min\{1,\hat{p}_{M}^{\circ}(T,x_{T}\mid 0,x_{0})/\hat{p}_{M}(T,x_{T}\mid 0,x_{0})\} that depends on the ratio of the new and current transition density estimators p^M(T,xT0,x0)=N1n=1Nω(zt1:tM1n)\hat{p}_{M}^{\circ}(T,x_{T}\mid 0,x_{0})=N^{-1}\sum_{n=1}^{N}\omega(z_{t_{1}:t_{M-1}}^{n}) and p^M(T,xT0,x0)\hat{p}_{M}(T,x_{T}\mid 0,x_{0}), respectively. Under mild assumptions, consistency of importance sampling estimators as NN\rightarrow\infty implies that the acceptance probability of the particle independent Metropolis–Hastings algorithm converges to one. This algorithm can also be combined with unbiased Markov chain Monte Carlo methods to provide unbiased estimates of expectations with respect to the law of the time-discretized diffusion bridge (Middleton et al., 2019).
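The two correction mechanisms above can be sketched generically as follows, taking the unnormalized log-weights logω(zn)\log\omega(z^{n}) as given (computing them requires the transition densities of the preceding displays); the function names are ours.

```python
import numpy as np

def is_estimate(log_w):
    """Importance sampling estimator N^{-1} sum_n w_n, computed stably
    from log-weights via the log-sum-exp trick."""
    c = np.max(log_w)
    return np.exp(c) * np.mean(np.exp(log_w - c))

def imh_accept(log_w_current, log_w_proposed, rng):
    """Independent Metropolis-Hastings: accept the proposed path with
    probability min{1, w_proposed / w_current}."""
    return np.log(rng.uniform()) < log_w_proposed - log_w_current
```

In the particle variant, one would instead select one candidate among NN with probability proportional to its weight and accept it based on the ratio of `is_estimate` values computed on the proposed and current batches of candidates.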

Lastly, for the sake of brevity, we only sketch the key steps to learn the Doob’s hh-transform process. Using the score of the normal transition density in (29), we may approximate g^(tm1,ztm1,tm,ztm)\hat{g}(t_{m-1},z_{t_{m-1}},t_{m},z_{t_{m}}) and hence the loss function L^(ϕ)\hat{L}(\phi) in (25). The approximate loss can be minimized using stochastic gradient algorithms and sample paths from (28). By time-discretizing the resulting proposal bridge process in (24), we may then employ it as an importance proposal to approximate the law in (31) and the transition density in (32), or as a proposal within independent Metropolis–Hastings algorithms.

Appendix D Implementation details

D.1 Benchmarking proposal bridge processes

We benchmark our diffusion bridge approximations against several existing approaches to construct proposal bridge processes. These methods simulate a proposal bridge process X=(Xt)t[0,T]X^{\circ}=(X_{t}^{\circ})_{t\in[0,T]} satisfying

dXt=f(t,Xt)dt+σ(t,Xt)dWt,X0=x0.\displaystyle\mathrm{d}X_{t}^{\circ}=f^{\circ}(t,X_{t}^{\circ})\mathrm{d}t+\sigma(t,X_{t}^{\circ})\mathrm{d}W_{t},\quad X_{0}^{\circ}=x_{0}. (33)

As summarized in Table 1, each method can be understood as a specific choice of the drift function ff^{\circ} that approximates the diffusion bridge process XX^{\star} given by Doob’s hh-transform in (2). We time-discretize (33) using the Euler–Maruyama (EM) scheme (26) for the forward diffusion method, the modified Euler–Maruyama (Modified EM) scheme (28) for the modified diffusion bridge, the Euler–Maruyama scheme with the time-change τ(t)=t(2t/T)\tau(t)=t(2-t/T) proposed by van der Meulen and Schauer (2017) (Time-change EM) for the Clark–Delyon–Hu bridge and the guided proposal bridge.

We perform an importance sampling or independent Metropolis–Hastings correction as described above, with the exception that for the Clark–Delyon–Hu bridge and the guided proposal bridge, the importance weight

ω(Xτ(t1):τ(tM1))=p~(T,xT0,x0)p(T,xT0,x0)exp[m=1M(ff~)T(logh~){τ(tm1),Xτ(tm1)}{τ(tm)τ(tm1)}]\displaystyle\omega(X_{\tau(t_{1}):\tau(t_{M-1})}^{\circ})=\frac{\tilde{p}(T,x_{T}\mid 0,x_{0})}{p(T,x_{T}\mid 0,x_{0})}\exp\left[\sum_{m=1}^{M}(f-\tilde{f})^{\mathrm{\scriptscriptstyle T}}(\nabla\log\tilde{h})\{\tau(t_{m-1}),X_{\tau(t_{m-1})}^{\circ}\}\{\tau(t_{m})-\tau(t_{m-1})\}\right]

of the sample path (Xτ(tm))m=1M1(X_{\tau(t_{m})}^{\circ})_{m=1}^{M-1} is obtained by approximating the Radon–Nikodym derivative

Ψ(X)=p~(T,xT0,x0)p(T,xT0,x0)exp[0T(ff~)T(logh~){t,Xt}dt],\displaystyle\Psi(X^{\circ})=\frac{\tilde{p}(T,x_{T}\mid 0,x_{0})}{p(T,x_{T}\mid 0,x_{0})}\exp\left[\int_{0}^{T}(f-\tilde{f})^{\mathrm{\scriptscriptstyle T}}(\nabla\log\tilde{h})\{t,X_{t}^{\circ}\}\mathrm{d}t\right], (34)

where f~(t,xt)\tilde{f}(t,x_{t}) denotes the drift function of the associated auxiliary process. The numerical results in van der Meulen and Schauer (2017, Section 5.2) show improved time-discretization of Ψ(X)\Psi(X^{\circ}) using Euler–Maruyama with the time-change τ(t)=t(2t/T)\tau(t)=t(2-t/T). The Clark–Delyon–Hu bridge can be understood as having Brownian motion as the auxiliary process, in which case f~(t,xt)=0\tilde{f}(t,x_{t})=0, while the guided proposal bridge typically involves selecting an auxiliary Ornstein–Uhlenbeck process with a linear drift f~(t,xt)\tilde{f}(t,x_{t}) whose parameters are determined by minimizing a Kullback–Leibler objective (Schauer et al., 2017, Section 1.3) or by understanding the behaviour of the diffusion process XX (van der Meulen and Schauer, 2017, Section 4.4). We will detail the choice of these parameters for each example in the following.

Method Drift f(t,xt)f^{\circ}(t,x_{t}) References Time-discretization
Forward diffusion f(t,xt)f(t,x_{t}) Pedersen (1995) EM
Modified diffusion bridge (xTxt)/(Tt)(x_{T}-x_{t})/(T-t) Durham and Gallant (2002) Modified EM
Clark–Delyon–Hu f(t,xt)+(xTxt)/(Tt)f(t,x_{t})+(x_{T}-x_{t})/(T-t) Clark (1990); Delyon and Hu (2006) Time-change EM
Guided proposal f(t,xt)+Σ(t,xt)logh~(t,xt)f(t,x_{t})+\Sigma(t,x_{t})\nabla\log\tilde{h}(t,x_{t}) Schauer et al. (2017) Time-change EM
Table 1: Benchmark proposal bridge methods

D.2 Neural network and stochastic optimization

The architecture of the neural networks we employed is illustrated in Fig. 5. For all numerical experiments, optimization was performed using the stochastic gradient algorithm of Kingma and Ba (2014) with a momentum of 0.990.99 and learning rate of 0.010.01.

Refer to caption
Refer to caption
Figure 5: Neural network architectures involve multi-layer perceptron (MLP) blocks and an “Encoding” block which applies the sine transform described in Vaswani et al. (2017). MLP (1a) and (1b) have one hidden layer and MLP (2) has two hidden layers. All neurons use the Leaky ReLU activation function.

D.3 Ornstein–Uhlenbeck process

Let XX be an Ornstein–Uhlenbeck process, defined by (1) with linear drift function f(t,xt)=αβxtf(t,x_{t})=\alpha-\beta x_{t} and identity diffusion coefficient σ(t,xt)=Id\sigma(t,x_{t})=I_{d}. In this analytically tractable example, for any 0s<tT0\leq s<t\leq T, the transition density of XX is a normal density p(t,xts,xs)=𝒩{xt;m(ts,xs),v(ts)Id}p(t,x_{t}\mid s,x_{s})=\mathcal{N}\{x_{t};m(t-s,x_{s}),v(t-s)I_{d}\}, with the following mean and variance

m(ts,xs)=αβ+(xsαβ)exp{β(ts)},v(ts)=1exp{2β(ts)}2β.\displaystyle m(t-s,x_{s})=\frac{\alpha}{\beta}+\left(x_{s}-\frac{\alpha}{\beta}\right)\exp\{-\beta(t-s)\},\quad v(t-s)=\frac{1-\exp\{-2\beta(t-s)\}}{2\beta}.

Hence the logarithmic gradient term in the Doob’s hh-transform of (2) is

logh(t,xt)=exp{β(Tt)}v(Tt){xTm(Tt,xt)},\displaystyle\nabla\log h(t,x_{t})=\frac{\exp\{-\beta(T-t)\}}{v(T-t)}\{x_{T}-m(T-t,x_{t})\},

and the score of the transition density in (4) is

s(t,xt)=v(t)1{m(t,x0)xt}.\displaystyle s(t,x_{t})=v(t)^{-1}\{m(t,x_{0})-x_{t}\}.

For this example, we can select the auxiliary process of Schauer et al. (2017) as the Ornstein–Uhlenbeck process by setting f~(t,xt)=αβxt\tilde{f}(t,x_{t})=\alpha-\beta x_{t}, in which case the linear guiding term logh~(t,xt)=logh(t,xt)\nabla\log\tilde{h}(t,x_{t})=\nabla\log h(t,x_{t}) is exact, and the Radon–Nikodym derivative in (34) satisfies Ψ(X)=1\Psi(X^{\circ})=1.
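These closed-form expressions are straightforward to implement; the scalar sketch below (with illustrative parameter values of our choosing) also makes it easy to check logh\nabla\log h and the score ss against finite differences of the corresponding Gaussian log-densities.

```python
import numpy as np

# Scalar Ornstein-Uhlenbeck with drift f(t,x) = alpha - beta*x and unit diffusion.
alpha, beta = 1.0, 2.0   # illustrative parameter values

def m(lag, xs):
    """Transition mean m(t - s, x_s) over a lag of t - s."""
    return alpha / beta + (xs - alpha / beta) * np.exp(-beta * lag)

def v(lag):
    """Transition variance v(t - s) over a lag of t - s."""
    return (1.0 - np.exp(-2.0 * beta * lag)) / (2.0 * beta)

def grad_log_h(t, xt, xT, T):
    """Guiding term of Doob's h-transform for the OU bridge."""
    return np.exp(-beta * (T - t)) / v(T - t) * (xT - m(T - t, xt))

def score(t, xt, x0):
    """Score of the transition density from x_0 at time 0."""
    return (m(t, x0) - xt) / v(t)
```

Since h(t,xt)h(t,x_{t}) is the Gaussian density of xTx_{T} given xtx_{t}, differentiating its log-density in xtx_{t} by finite differences should reproduce `grad_log_h` up to numerical error, and similarly for `score`.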

For dimension d=1d=1 and varying either the time horizon T{1,2,4,8}T\in\{1,2,4,8\} or the terminal state xTx_{T}, we employed a time-discretization stepsize of δt=0.02\delta t=0.02, 500500 optimization iterations, and 100100 sample paths per iteration. For the case of T=1T=1 and varying d{1,2,4,8}d\in\{1,2,4,8\}, we decreased the stepsize and increased the number of optimization iterations and the capacity of the neural network with dimension.

D.4 Interest rates model

We consider a special case of an interest rates model in Aït-Sahalia and Lo (1998), defined by (1) with drift function f(t,xt)=θ/xtxtf(t,x_{t})=\theta/x_{t}-x_{t}, where θ=4\theta=4, and diffusion coefficient σ(t,xt)=1\sigma(t,x_{t})=1. This diffusion has a stable stationary point at x=θx^{\star}=\sqrt{\theta}, and its transition density is known and given by

logp(t,xts,xs)\displaystyle\log p(t,x_{t}\mid s,x_{s}) =θlog(xt/xs)+12log(xtxs)xt2+(θ+12)(ts)\displaystyle=\theta\log(x_{t}/x_{s})+\frac{1}{2}\log(x_{t}x_{s})-x_{t}^{2}+\left(\theta+\frac{1}{2}\right)(t-s)
logsinh(ts)xt2+xs2exp{2(ts)}1+logIθ1/2{xtxssinh(ts)},\displaystyle-\log\mathrm{sinh}(t-s)-\frac{x_{t}^{2}+x_{s}^{2}}{\exp\{2(t-s)\}-1}+\log I_{\theta-1/2}\left\{\frac{x_{t}x_{s}}{\mathrm{sinh}(t-s)}\right\},

for 0s<tT0\leq s<t\leq T, where IνI_{\nu} denotes the modified Bessel function of order ν\nu. The logarithmic gradient term in the Doob’s hh-transform of (2) is

logh(t,xt)\displaystyle\nabla\log h(t,x_{t}) =θxt+12xt2xtexp{2(Tt)}1\displaystyle=-\frac{\theta}{x_{t}}+\frac{1}{2x_{t}}-\frac{2x_{t}}{\exp\{2(T-t)\}-1}
+Iθ1/2{xTxtsinh(Tt)}1Jθ1/2{xTxtsinh(Tt)}xTsinh(Tt),\displaystyle+I_{\theta-1/2}\left\{\frac{x_{T}x_{t}}{\mathrm{sinh}(T-t)}\right\}^{-1}J_{\theta-1/2}\left\{\frac{x_{T}x_{t}}{\mathrm{sinh}(T-t)}\right\}\frac{x_{T}}{\mathrm{sinh}(T-t)},

and the score of the transition density in (4) is

s(t,xt)=θxt+12xt2xt2xtexp(2t)1+Iθ1/2{xtx0sinh(t)}1Jθ1/2{xtx0sinh(t)}x0sinh(t),\displaystyle s(t,x_{t})=\frac{\theta}{x_{t}}+\frac{1}{2x_{t}}-2x_{t}-\frac{2x_{t}}{\exp(2t)-1}+I_{\theta-1/2}\left\{\frac{x_{t}x_{0}}{\mathrm{sinh}(t)}\right\}^{-1}J_{\theta-1/2}\left\{\frac{x_{t}x_{0}}{\mathrm{sinh}(t)}\right\}\frac{x_{0}}{\mathrm{sinh}(t)},

where JνJ_{\nu} denotes the derivative of IνI_{\nu}.
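These Bessel-function expressions can be implemented directly, assuming SciPy is available for the modified Bessel function and its derivative; the sketch below implements the closed-form log transition density and the score, so that the latter can be verified against a finite difference of the former.

```python
import numpy as np
from scipy.special import iv, ivp   # I_nu and its derivative J_nu

theta = 4.0   # drift parameter: f(t, x) = theta/x - x

def log_p(t, xt, x0):
    """Closed-form log transition density log p(t, x_t | 0, x_0)."""
    z = xt * x0 / np.sinh(t)
    return (theta * np.log(xt / x0) + 0.5 * np.log(xt * x0) - xt ** 2
            + (theta + 0.5) * t - np.log(np.sinh(t))
            - (xt ** 2 + x0 ** 2) / np.expm1(2.0 * t)
            + np.log(iv(theta - 0.5, z)))

def score(t, xt, x0):
    """s(t, x_t): gradient of log_p in x_t, with ivp giving J_{theta-1/2}."""
    z = xt * x0 / np.sinh(t)
    return (theta / xt + 0.5 / xt - 2.0 * xt - 2.0 * xt / np.expm1(2.0 * t)
            + ivp(theta - 0.5, z) / iv(theta - 0.5, z) * x0 / np.sinh(t))
```

Here `np.expm1(2*t)` evaluates e2t1e^{2t}-1 stably, and `scipy.special.ivp` returns the first derivative of IνI_{\nu} by default.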

Numerical experiments for all T{1,2,4,8}T\in\{1,2,4,8\} employed a time-discretization stepsize of δt=0.02\delta t=0.02, 10001000 optimization iterations, and 10001000 sample paths per iteration, with 1010 unique initial conditions X0=x0X_{0}=x_{0} sampled from the gamma distribution with shape 55 and rate 22. We select the auxiliary process of Schauer et al. (2017) as an Ornstein–Uhlenbeck process with drift f~(t,xt)=2θ2xt\tilde{f}(t,x_{t})=2\sqrt{\theta}-2x_{t} and unit diffusion coefficient. The choice of f~\tilde{f} is based on the first-order Taylor approximation

f(t,xt)f(t,x)+xf(t,x)(xtx)=f~(t,xt).\displaystyle f(t,x_{t})\approx f(t,x^{\star})+\partial_{x}f(t,x^{\star})(x_{t}-x^{\star})=\tilde{f}(t,x_{t}).
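This linearization is easy to verify numerically: since f(x)=0f(x^{\star})=0 and xf(x)=θ/(x)21=2\partial_{x}f(x^{\star})=-\theta/(x^{\star})^{2}-1=-2, the expansion reduces to f~(t,xt)=2θ2xt\tilde{f}(t,x_{t})=2\sqrt{\theta}-2x_{t}. A minimal check (the variable names are ours):

```python
import numpy as np

theta = 4.0
x_star = np.sqrt(theta)                 # stable stationary point of the drift

f = lambda x: theta / x - x             # drift, time argument suppressed
df = lambda x: -theta / x ** 2 - 1.0    # exact derivative of the drift

# First-order Taylor expansion of f around x_star: equals 2*sqrt(theta) - 2*x
f_tilde = lambda x: f(x_star) + df(x_star) * (x - x_star)
```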

D.5 Cell model

Our numerical experiments for all T{2,4,8,16}T\in\{2,4,8,16\} and σX2{0.1,1}\sigma_{X}^{2}\in\{0.1,1\} employed a time-discretization stepsize of δt=0.02\delta t=0.02, 20002000 optimization iterations, and 100100 sample paths per iteration. In our implementation of the guided proposal bridge, we choose an auxiliary Ornstein–Uhlenbeck process with drift function f~(t,xt)=A(θxt)\tilde{f}(t,x_{t})=A(\theta-x_{t}) with θd\theta\in\mathbb{R}^{d} and A=UDUTd×dA=UDU^{\mathrm{\scriptscriptstyle T}}\in\mathbb{R}^{d\times d} parameterized by a matrix Ud×dU\in\mathbb{R}^{d\times d} whose columns are eigenvectors and a diagonal matrix Dd×dD\in\mathbb{R}^{d\times d} whose diagonal entries are eigenvalues. This eigendecomposition facilitates computation of matrix exponentials appearing in the expressions of the transition density p~(t,xt0,x0)\tilde{p}(t,x_{t}\mid 0,x_{0}) and the gradient logh~(t,xt)\nabla\log\tilde{h}(t,x_{t}).

Following Schauer et al. (2017), we optimize the parameters ϕ=(θ,U,D)\phi=(\theta,U,D) by minimizing the Kullback–Leibler divergence kl(x0,xT|ϕx0,xT)\textsc{kl}(\mathbb{P}^{x_{0},x_{T}}|\mathbb{Q}_{\phi}^{x_{0},x_{T}}), where x0,xT\mathbb{P}^{x_{0},x_{T}} denotes the law of the diffusion bridge and ϕx0,xT\mathbb{Q}_{\phi}^{x_{0},x_{T}} is the law induced by the guided proposal bridge process in (33). As considered in Schauer et al. (2017), we employ importance sampling to approximate the intractable objective by rewriting it as

kl(x0,xT|ϕx0,xT)=Eφx0,xT{log(dx0,xT/dϕx0,xT)(dx0,xT/dφx0,xT)(X)},\displaystyle\textsc{kl}(\mathbb{P}^{x_{0},x_{T}}|\mathbb{Q}_{\phi}^{x_{0},x_{T}})=E_{\varphi}^{x_{0},x_{T}}\{\log(\mathrm{d}\mathbb{P}^{x_{0},x_{T}}/\mathrm{d}\mathbb{Q}_{\phi}^{x_{0},x_{T}})(\mathrm{d}\mathbb{P}^{x_{0},x_{T}}/\mathrm{d}\mathbb{Q}_{\varphi}^{x_{0},x_{T}})(X^{\circ})\}, (35)

where Eφx0,xTE_{\varphi}^{x_{0},x_{T}} denotes expectation with respect to the proposal law φx0,xT\mathbb{Q}_{\varphi}^{x_{0},x_{T}} with reference parameters φ\varphi that are obtained from earlier iterations of a stochastic gradient descent algorithm.

Initialization of φ\varphi is crucial as estimators of the gradient of (35) with respect to ϕ\phi will have large variance when the importance sampling approximation under φx0,xT\mathbb{Q}_{\varphi}^{x_{0},x_{T}} is poor. For this application, we initialize by setting (θ,U,D)=(XT,Id,Id)(\theta,U,D)=(X_{T},I_{d},I_{d}), corresponding to starting AA with the identity matrix and choosing θ\theta to induce mean-reversion towards the stable stationary point XTX_{T}. After each gradient update, we also perform projection to ensure that columns of UU are orthogonal.
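One simple way to implement this projection (an illustrative choice on our part; the text does not specify the method) is through a QR decomposition, which maps UU to a matrix with orthonormal columns spanning the same column space.

```python
import numpy as np

def project_orthogonal(U):
    """Project a full-rank square matrix onto matrices with orthonormal columns
    via QR. Column signs are fixed so that an already-orthonormal U (with
    positive diagonal of R) is left unchanged by the projection."""
    Q, R = np.linalg.qr(U)
    return Q * np.sign(np.diag(R))
```

Applying this map after each gradient update keeps the eigenvector matrix UU valid without otherwise interfering with the stochastic gradient iterations.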