
Iterated Schrödinger bridge approximation
to Wasserstein Gradient Flows

Medha Agarwal
Department of Statistics
University of Washington
Seattle WA 98195, USA
Email: medhaaga@uw.edu
Zaid Harchaoui
Department of Statistics
University of Washington
Seattle WA 98195, USA
Email: zaid@uw.edu
Garrett Mulcahy
Department of Mathematics
University of Washington
Seattle WA 98195, USA
Email: gmulcahy@uw.edu
and Soumik Pal
Department of Mathematics
University of Washington
Seattle WA 98195, USA
Email: soumik@uw.edu
Abstract.

We introduce a novel discretization scheme for Wasserstein gradient flows that involves successively computing Schrödinger bridges with the same marginals. This scheme is different from both the forward/geodesic approximation and the backward/Jordan-Kinderlehrer-Otto (JKO) approximation. It has two advantages: first, it avoids the use of the score function, and, second, it is amenable to particle-based approximations via the Sinkhorn algorithm. Our proof hinges upon showing that the relative entropy between the Schrödinger bridge with the same marginals at temperature $\varepsilon$ and the joint distribution of a stationary Langevin diffusion at times zero and $\varepsilon$ is of order $o(\varepsilon^{2})$, with an explicit dependence given by the Fisher information. Owing to this inequality, we show, using a triangular approximation argument, that the interpolated iteration of the Schrödinger bridge approximation converges to the Wasserstein gradient flow, for a class of gradient flows including the heat flow. The results also provide a probabilistic and rigorous framework for the convergence of the self-attention mechanism in transformer networks to the solution of the heat flow, first observed in the inspiring machine learning work [SABP22].

Key words and phrases:
Schrödinger bridges, Wasserstein gradient flows, JKO approximation, Fisher information, Sinkformers, Transformers, self-attention
2000 Mathematics Subject Classification:
49N99, 49Q22, 60J60
This research is partially supported by the following grants. All the authors are supported by NSF grant DMS-2134012. Additionally, Pal is supported by NSF grant DMS-2133244; Harchaoui is supported by NSF grant CCF-2019844, DMS-2023166, DMS-2133244. Mulcahy is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2140004. Thanks to PIMS Kantorovich Initiative for facilitating this collaboration supported through a PIMS PRN and the NSF Infrastructure grant DMS 2133244. The authors are listed in alphabetical order.

1. Introduction

Let $\mathcal{P}_{2}(\mathbb{R}^{d})$ denote the space of Borel probability measures on $\mathbb{R}^{d}$ with finite second moments. Throughout this paper we let $\rho\in\mathcal{P}_{2}(\mathbb{R}^{d})$ refer to both the measure and its Lebesgue density, whenever the latter exists. Equipped with the Wasserstein-$2$ metric [AGS08, Chapter 7], $\mathcal{P}_{2}(\mathbb{R}^{d})$ becomes a metric space, which we refer to simply as the Wasserstein space. For any pair of probability measures $(\rho_{1},\rho_{2})$ in the Wasserstein space, let $\Pi(\rho_{1},\rho_{2})$ denote the set of couplings, i.e., joint distributions with marginals $\rho_{1}$ and $\rho_{2}$. Further, define the relative entropy, also known as the Kullback-Leibler (KL) divergence, between $\rho_{1}$ and $\rho_{2}$ as

(1) $H(\rho_{1}|\rho_{2}):=\begin{cases}\int\log\left(\frac{d\rho_{1}}{d\rho_{2}}\right)d\rho_{1}&\text{if }\rho_{1}\ll\rho_{2},\\ +\infty&\text{otherwise.}\end{cases}$

To understand our results, let us start with an example that was first described in [SABP22]. The static Schrödinger bridge [Léo14] at temperature $\varepsilon$ with both marginals equal to $\rho$ is the solution to the following optimization problem

(2) $\pi_{\rho,\varepsilon}:=\operatorname*{argmin}_{\pi\in\Pi(\rho,\rho)}H(\pi|\mu_{\rho,\varepsilon}),$

where $\mu_{\rho,\varepsilon}\in\mathcal{P}(\mathbb{R}^{d}\times\mathbb{R}^{d})$ has density

(3) $\mu_{\rho,\varepsilon}(x,y):=\rho(x)\frac{1}{(2\pi\varepsilon)^{d/2}}\exp\left(-\frac{1}{2\varepsilon}\|x-y\|^{2}\right).$

If $(X,Y)\sim\pi_{\rho,\varepsilon}$, define the barycentric projection [PNW22] as the function from $\mathbb{R}^{d}$ to itself given by

(4) $\mathcal{B}_{\rho,\varepsilon}(x)=\mathrm{E}_{\pi_{\rho,\varepsilon}}[Y|X=x].$

Now define a map from $\mathcal{P}_{2}(\mathbb{R}^{d})$ to itself by the pushforward

(5) $\text{SB}_{\varepsilon}(\rho):=\left(2\operatorname{\mathrm{Id}}-\mathcal{B}_{\rho,\varepsilon}\right)_{\#}\rho.$

We call this map $\text{SB}_{\varepsilon}(\cdot)$ to emphasize the central role of the Schrödinger bridge in its definition. That is, if $(X,Y)\sim\pi_{\rho,\varepsilon}$, then $\text{SB}_{\varepsilon}(\rho)$ is the law of the random variable $2X-\mathrm{E}_{\pi_{\rho,\varepsilon}}[Y|X]$. If $\varepsilon\approx 0$, then $\pi_{\rho,\varepsilon}$ should closely approximate the quadratic-cost optimal transport plan from $\rho$ to itself [Léo12]. That is, one would expect $\mathrm{E}_{\pi_{\rho,\varepsilon}}[Y|X]$ to be close to $X$ itself (see, for instance, [PNW22, Corollary 1]), and therefore $2X-\mathrm{E}_{\pi_{\rho,\varepsilon}}[Y|X]\approx X$ as well. Hence, the pushforward should not alter $\rho$ by much.

What happens if we iterate this map and rescale time by $\varepsilon$? That is, fix some $\rho_{0}\in\mathcal{P}_{2}(\mathbb{R}^{d})$ and a $T>0$. Let $N_{\varepsilon}=\lfloor T\varepsilon^{-1}\rfloor$. With $\rho_{0}^{\varepsilon}:=\rho_{0}$, define iteratively

$\rho_{k}^{\varepsilon}:=\text{SB}_{\varepsilon}(\rho_{k-1}^{\varepsilon}),\quad k\in[N_{\varepsilon}]:=\{1,2,\ldots,N_{\varepsilon}\}.$

Define the continuous-time piecewise constant interpolant $\left(\rho^{\varepsilon}(t)=\rho^{\varepsilon}_{\lfloor t/\varepsilon\rfloor},\;t\in[0,T]\right)$. The question is whether, as $\varepsilon\rightarrow 0+$, this curve on the Wasserstein space has an absolutely continuous limit, and, if so, whether we can write down its characterizing continuity equation.

Before we give the answer, consider the simulation in Figure 1, which shows a particle-system approximation to the above iterative scheme. We push $n=500$ particles, initially distributed as a mixture of two normal distributions, $\rho_{0}:=0.5\,\mathcal{N}(-2,1)+0.5\,\mathcal{N}(2,1)$, through the dynamical system for $T=5$ with step size $\varepsilon=0.1$. At every step the Schrödinger bridge is approximated by the (matrix) Sinkhorn algorithm [Cut13] applied to the current empirical distribution of the particles, and the corresponding barycentric projection is approximated from this estimate. The pushforward is then applied to the $n$ particles themselves. This gives a time-evolving process of empirical distributions that can be thought of as a discrete approximation of our iterative scheme. One can see that the histogram gets smoothed out gradually and ends in a shape that appears Gaussian. In fact, this smoothing is reminiscent of the heat flow starting from $\rho_{0}$. For this simple starting distribution, the heat flow $(\rho_{t},t\geq 0)$ is known analytically; its curves are superimposed as continuous curves. The broken curves are the exact $\rho^{\varepsilon}(\cdot)$, which in this case can also be computed. The histograms and the two curves all closely match one another.
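For readers who wish to reproduce a simulation of this kind, the scheme takes only a few lines. The following is a minimal sketch of ours, not the authors' implementation: the function names are illustrative, the Sinkhorn solver is the plain matrix-scaling version of [Cut13] with a fixed iteration budget, and no convergence monitoring or log-domain stabilization is done.

```python
import numpy as np

def sinkhorn_plan(x, eps, n_iter=100):
    """Entropic self-transport plan of the empirical measure of x (uniform weights)."""
    n = len(x)
    C = 0.5 * (x[:, None] - x[None, :]) ** 2        # quadratic cost
    K = np.exp(-C / eps)                            # Gibbs kernel
    a = np.full(n, 1.0 / n)
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iter):                         # Sinkhorn matrix scaling
        u = a / (K @ v)
        v = a / (K.T @ u)
    return u[:, None] * K * v[None, :]

def sb_step(x, eps):
    """One step of (5): push each particle x_i to 2 x_i - E[Y | X = x_i]."""
    P = sinkhorn_plan(x, eps)
    bary = (P @ x) / P.sum(axis=1)                  # barycentric projection (4)
    return 2.0 * x - bary

rng = np.random.default_rng(0)
# rho_0 = 0.5 N(-2,1) + 0.5 N(2,1), n = 500 particles, T = 5, eps = 0.1
x = np.where(rng.random(500) < 0.5, -2.0, 2.0) + rng.standard_normal(500)
v0 = x.var()
for _ in range(50):
    x = sb_step(x, 0.1)
# under the heat flow (d/dt) rho_t = (1/2) Laplacian rho_t, the variance grows over time
print(v0, x.var())
```

One expects the particle variance to grow roughly linearly in time, as it does for the heat flow; the histogram of `x` should gradually approach a Gaussian shape, as in Figure 1.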

Figure 1. Histograms of $n=500$ particles at increments of one time unit. Here $(\rho_{t},t\in[0,T])$ is the true gradient flow and $(\hat{\rho}_{t},t\in[0,T])$ is the piecewise constant interpolation of the analytically known scheme $(\rho_{k}^{\varepsilon},k\in[N_{\varepsilon}])$.

One of the results that we prove in this paper shows that, under appropriate assumptions, $\left(\rho^{\varepsilon}(t),\;t\in[0,T]\right)$ indeed converges to the solution of the heat equation $\frac{\partial}{\partial t}\rho_{t}=\frac{1}{2}\Delta\rho_{t}$ with initial condition $\rho_{0}$. In a different context and language, this result was first observed in [SABP22, Theorem 1]; however, their proof is incomplete (see the discussion below). The statement of Theorem 4 in this paper lays down sufficient conditions under which this convergence happens.

It is well known from [JKO98, Theorem 5.1] that the heat flow is the Wasserstein gradient flow of (one half) the entropy. We prove a generalization of the above result in which a similar discrete dynamical system converges, under assumptions, to other Wasserstein gradient flows that satisfy an evolutionary PDE of the type [AGS08, eqn. (11.1.6)]. The general existence and uniqueness of gradient flows (curves of maximal slope) in the Wasserstein space are quite subtle [AGS08, Chapter 11]. However, evolutionary PDEs of the type [AGS08, eqn. (11.1.6)] include the classical cases of the gradient flow of entropy, namely the heat equation, and that of the Kullback-Leibler divergence (also called relative entropy), namely the Fokker-Planck equation [JKO98, Theorem 5.1]. Our iterative approximating schemes always iterate a function of the barycentric projection of the Schrödinger bridge with the same marginals, where the marginal itself changes with the iteration. Let us give the reader an intuitive idea of how such schemes work.

The foundation of all these results is Theorem 1, which shows that the two-point joint distribution of a stationary Langevin diffusion closely approximates the Schrödinger bridge with equal marginals. More concretely, let $\rho=e^{-g}$ be an element of $\mathcal{P}_{2}(\mathbb{R}^{d})$ for which the Langevin diffusion process with stationary distribution $\rho$ exists and is unique in law. This diffusion process is governed by the stochastic differential equation

(6) $dX_{t}=-\frac{1}{2}\nabla g(X_{t})\,dt+dB_{t},\quad X_{0}\sim\rho,$

where $(B_{t},t\geq 0)$ is a standard $d$-dimensional Brownian motion. Therefore, for any $\varepsilon>0$, $\ell_{\rho,\varepsilon}:=\text{Law}(X_{0},X_{\varepsilon})$ is an element of $\Pi(\rho,\rho)$. Theorem 1 then shows that, under appropriate conditions on $\rho$, the symmetrized relative entropy, i.e., the Jensen-Shannon divergence, between $\ell_{\rho,\varepsilon}$ and the Schrödinger bridge $\pi_{\rho,\varepsilon}$ is $o(\varepsilon^{2})$. Although this level of precision is sufficient for our purpose, it is shown via explicit computations in (28) that when $\rho$ is Gaussian this symmetrized relative entropy is actually $O(\varepsilon^{4})$, an order of magnitude tighter.

In any case, one would expect that, under suitable assumptions, the barycentric projection satisfies

(7) $\mathcal{B}_{\rho,\varepsilon}(x)\approx\mathrm{E}_{\ell_{\rho,\varepsilon}}[Y|X=x]\approx x-\frac{\varepsilon}{2}\nabla g(x)=x+\frac{\varepsilon}{2}\nabla\log\rho(x).$

The quantity $\nabla\log\rho$ is called the score function. This is a reformulation of [SABP22, Theorem 1], which states that the difference between the two sides converges to zero in $L^{2}(\rho)$. In other words, one would expect that, under suitable assumptions,

(8) $\text{SB}_{\varepsilon}(\rho):=\left(2\operatorname{\mathrm{Id}}-\mathcal{B}_{\rho,\varepsilon}\right)_{\#}\rho\approx\left(\operatorname{\mathrm{Id}}-\frac{\varepsilon}{2}\nabla\log\rho\right)_{\#}\rho.$

We claim that this is approximately the solution of the heat equation at time $\varepsilon$, starting from $\rho$.
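The approximation (7) can be sanity-checked in the simplest case. Take $\rho=\mathcal{N}(0,1)$, so that $g(x)=x^{2}/2$ up to a constant and the Langevin diffusion (6) is the Ornstein-Uhlenbeck process, for which $(X_{0},X_{\varepsilon})$ is jointly Gaussian with correlation $e^{-\varepsilon/2}$. The following numerical sketch (ours, not from the paper) uses the Langevin coupling $\ell_{\rho,\varepsilon}$ in place of the bridge $\pi_{\rho,\varepsilon}$, in the spirit of Theorem 1, and checks that the pushforward by $2\operatorname{\mathrm{Id}}-\mathrm{E}[Y|X=\cdot]$ reproduces the heat flow variance $1+\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, n = 0.05, 200_000
# stationary OU pair: X_0 ~ N(0,1), Corr(X_0, X_eps) = exp(-eps/2)
r = np.exp(-eps / 2)
x0 = rng.standard_normal(n)
# for a Gaussian pair the conditional mean is linear and exact: E[Y | X = x] = r x,
# which matches x - (eps/2) x = x + (eps/2) (d/dx) log rho(x) to first order
pushed = 2 * x0 - r * x0                 # one step of the scheme (5)
# the heat flow started from N(0,1) has variance 1 + eps at time eps
print(pushed.var(), 1 + eps)
```

Here the exact conditional mean is available in closed form, so no Sinkhorn step is needed; the two printed values agree up to $O(\varepsilon^{2})$ and Monte Carlo error.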

To see this, recall that absolutely continuous curves $(\rho_{t},\;t\geq 0)$ on the Wasserstein space (see [AGS08, Definition 1.1.1]) are characterized as weak solutions [AGS08, eqn. (8.1.4)] of continuity equations

(9) $\frac{\partial}{\partial t}\rho_{t}+\nabla\cdot\left(v_{t}\rho_{t}\right)=0.$

Here $v_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is the time-dependent Borel velocity field of the continuity equation, whose properties are given by [AGS08, Theorem 8.3.1]. For any absolutely continuous curve, if $v_{t}$ belongs to the tangent space at $\rho_{t}$ [AGS08, Definition 8.4.1], then [AGS08, Proposition 8.4.6] gives the following approximation result for Lebesgue-a.e. $t\geq 0$:

(10) $\lim_{\varepsilon\to 0}\frac{1}{\varepsilon}\mathbb{W}_{2}\left(\rho_{t+\varepsilon},(\operatorname{\mathrm{Id}}+\varepsilon v_{t})_{\#}\rho_{t}\right)=0.$

For example, if $\rho$ is the Wasserstein gradient flow of a function $\mathcal{F}:\mathcal{P}_{2}(\mathbb{R}^{d})\to(-\infty,\infty]$ satisfying [AGS08, eqn. (11.1.6)], then $v_{t}=-\nabla\frac{\delta\mathcal{F}}{\delta\rho_{t}}$, where $\frac{\delta\mathcal{F}}{\delta\rho_{t}}$ is the first variation of $\mathcal{F}$ at $\rho_{t}$ (see [AGS08, Section 10.4.1]). For the heat flow, which is the gradient flow of (half) the entropy, $\frac{\delta\mathcal{F}}{\delta\rho_{t}}=\frac{1}{2}\log\rho_{t}$ and $v_{t}=-\frac{1}{2}\nabla\log\rho_{t}$. Comparing (8) with (10) explains the previous claim.
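As a quick consistency check of the last display, substituting $v_{t}=-\frac{1}{2}\nabla\log\rho_{t}$ into the continuity equation (9) recovers the heat equation:

```latex
\frac{\partial}{\partial t}\rho_{t}
= -\nabla\cdot\left(v_{t}\rho_{t}\right)
= \nabla\cdot\left(\frac{1}{2}\rho_{t}\nabla\log\rho_{t}\right)
= \nabla\cdot\left(\frac{1}{2}\nabla\rho_{t}\right)
= \frac{1}{2}\Delta\rho_{t},
```

using the identity $\rho_{t}\nabla\log\rho_{t}=\nabla\rho_{t}$.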

The claimed approximation in (8) is what we call the one-step approximation; in a somewhat more general setting, it is proved in Theorem 2. Note that a crucial advantage of the Schrödinger bridge scheme, absent in both the Langevin diffusion and the continuity equation, is that we no longer need to compute the score function $\nabla\log\rho$. Instead, it suffices to be able to draw samples from $\rho$ and compute an approximate Schrödinger bridge, say by the Sinkhorn algorithm [Cut13], as done in Figure 1.

To summarize, the flow of ideas in the paper goes as follows. First, we show that the Schrödinger bridge map $\text{SB}_{\varepsilon}(\rho)$, defined in its full generality in (SB), is an approximation of the pushforward of $\rho$ by the tangent vector of the gradient flow (10). The latter is known to be close to the gradient flow curve itself [AGS08, Proposition 8.4.6]. So one may expect an iteration of this idea to work: an absolutely continuous curve approximated by iterated pushforwards such as (10) can, under suitable assumptions, be approximated by iterated Schrödinger bridge operators $\text{SB}_{\varepsilon}(\cdot)$. One has to be mindful that the cumulative error remains controlled as we iterate, but in the best of situations this might work. And, in fact, it does for the heat flow, under suitable conditions, as proved in Theorem 4.

Now, a natural question is: how general is this method? Can we approximate other gradient flows, say that of the Kullback-Leibler divergence with respect to a log-concave density? What about absolutely continuous curves that are not gradient flows? The case of KL is partially settled in this paper in Theorem 5. Let us state our proposed Schrödinger bridge scheme in somewhat general notation. Consider a suitably nice functional $\mathcal{F}:\mathcal{P}_{2}(\mathbb{R}^{d})\to\mathbb{R}\cup\{+\infty\}$ and fix an initial measure $\rho_{0}\in\mathcal{P}_{2}(\mathbb{R}^{d})$. Suppose the gradient flow of $\mathcal{F}$ can be approximated by a sequence of explicit Euler iterations $\left(\tilde{\rho}_{\varepsilon}(k),\;k\in\mathbb{N}\right)$ where, starting from $\tilde{\rho}_{\varepsilon}(0)=\rho_{0}$, we iteratively define

(11) $\tilde{\rho}_{\varepsilon}(k)=\left(\operatorname{\mathrm{Id}}+\varepsilon\nabla u_{k-1}\right)_{\#}\tilde{\rho}_{\varepsilon}(k-1),$

with $-\nabla u_{k-1}$ being the Wasserstein gradient of $\mathcal{F}$ at $\tilde{\rho}_{\varepsilon}(k-1)$. Such an approximation scheme will be referred to as an explicit Euler iteration scheme, and conditions must be imposed for it to converge to the underlying gradient flow. See Theorem 3 below for sufficient conditions.

Ignoring technical details, our goal is to approximate each such explicit Euler step via a corresponding Schrödinger bridge step. This is accomplished by the introduction of what we call a surrogate measure. For simplicity, consider the first step $\tilde{\rho}_{\varepsilon}(1)=\left(\operatorname{\mathrm{Id}}+\varepsilon\nabla u_{0}\right)_{\#}\rho_{0}$. Supposing that there is some $\theta\in\mathbb{R}\setminus\{0\}$ such that $\int e^{2\theta u_{0}}<\infty$, define the surrogate measure $\sigma_{0}$ to be the one with density

(12) $\sigma_{0}(x):=\exp\left(2\theta u_{0}(x)-\Lambda_{0}(\theta)\right),\quad\text{where}\quad\Lambda_{0}(\theta):=\log\int e^{2\theta u_{0}(y)}dy.$

As explained in (7), if we consider the Schrödinger bridge at temperature $\varepsilon$ with both marginals equal to $\sigma_{0}$, then $\mathcal{B}_{\sigma_{0},\varepsilon}(x)\approx x+\frac{\varepsilon}{2}\nabla\log\sigma_{0}(x)=x+\theta\varepsilon\nabla u_{0}(x)$. We now define one step of the SB scheme as

(SB) $\text{SB}_{\varepsilon}(\rho_{0})=\left(\left(1-\frac{1}{\theta}\right)\operatorname{\mathrm{Id}}+\frac{1}{\theta}\mathcal{B}_{\sigma_{0},\varepsilon}\right)_{\#}\rho_{0}.$

As $\varepsilon\downarrow 0$, we recover $\text{SB}_{\varepsilon}(\rho_{0})\approx(\operatorname{\mathrm{Id}}+\varepsilon\nabla u_{0})_{\#}\rho_{0}$. Notice that the choice of $\theta$ plays no role except to guarantee that the surrogate $\sigma_{0}$ in (12) is integrable. For the case of the heat flow, $u_{0}=-\frac{1}{2}\log\rho_{0}$; hence, if we pick $\theta=-1$, then $\sigma_{0}=\rho_{0}$ and the scheme (5) becomes a special case of (SB).
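The cancellation of $\theta$ can be verified directly: substituting the first-order approximation $\mathcal{B}_{\sigma_{0},\varepsilon}\approx\operatorname{\mathrm{Id}}+\theta\varepsilon\nabla u_{0}$ into (SB) gives

```latex
\left(1-\frac{1}{\theta}\right)\operatorname{\mathrm{Id}}+\frac{1}{\theta}\mathcal{B}_{\sigma_{0},\varepsilon}
\approx \left(1-\frac{1}{\theta}\right)\operatorname{\mathrm{Id}}+\frac{1}{\theta}\left(\operatorname{\mathrm{Id}}+\theta\varepsilon\nabla u_{0}\right)
= \operatorname{\mathrm{Id}}+\varepsilon\nabla u_{0},
```

so, to first order in $\varepsilon$, every admissible $\theta$ produces the same explicit Euler step (11).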

Let us consider the sequence of iterates of (SB), interpolated in continuous time for some $0<T<\infty$: with $\rho^{\varepsilon}(0):=\rho_{0}$,

$\rho^{\varepsilon}((k+1)\varepsilon):=\text{SB}_{\varepsilon}(\rho^{\varepsilon}(k\varepsilon)),\quad k\in\{0,1,\ldots,N_{\varepsilon}-1\},\quad\text{where}\;N_{\varepsilon}:=\lfloor T/\varepsilon\rfloor.$

Theorems 4 and 5 list sufficient conditions under which such a scheme converges uniformly to the gradient flow curve as $\varepsilon\rightarrow 0+$ for specific choices of the functional $\mathcal{F}$.

We should point out that the primary reason we need the assumptions in Theorems 4 and 5 is to guarantee that the errors go to zero uniformly over a growing number of iterations. Much of the technical work in this paper goes into producing a simple set of sufficient conditions that ensure this. Consequently, our theorems are specialized to the case of gradient flows of geodesically convex regular functions. But the scope of the Schrödinger bridge scheme is potentially larger. As shown in Section 5.3 via explicit calculations, the Schrödinger bridge scheme may be used to approximate even time-reversed gradient flows. The corresponding PDEs, such as the backward heat equation and the backward Fokker-Planck equation, are ill-posed, and care must be taken in implementing any approximating scheme. Interestingly, if one replaces (5) by $\text{SB}_{\varepsilon}(\rho):=\left(\mathcal{B}_{\rho,\varepsilon}\right)_{\#}\rho$, then, starting from a Gaussian initial distribution and for all small enough $\varepsilon$, the iterates are shown to converge to the time-reversed gradient flow of one-half entropy.

In Section 3 we also generalize our Schrödinger bridge scheme (SB) by picking the surrogate measure depending on $\varepsilon$. This extension is needed to cover tangent vectors that are not necessarily of gradient type but are approximated by a sequence of gradient fields. It also covers the case of the renowned JKO scheme [JKO98], which can be likened to an implicit Euler discretization of the gradient flow curve. This, however, introduces an additional layer of complexity.

Some of the technical challenges we overcome in this paper involve proving results about Schrödinger bridges at low temperatures when the marginals are not compactly supported. Compact support is a common assumption in the literature; see, for example, [Pal19, PNW22, CCGT22, SABP22]. However, the existence of the Langevin diffusion requires unbounded support. Much of this work is done in Section 2, which may be of independent interest. For example, we extend the expansion of the entropic OT cost function from [CT21, Theorem 1.6] to the case of equal marginals in (42). Throughout this analysis an unexpected mathematical object appears repeatedly: the harmonic characteristic [CVR18], defined below in (14). It appears in the computation of the Fisher information of the Langevin and Schrödinger bridges (see (37)) and in the geodesic approximation of the heat flow (Lemma 7). The role of this object merits a deeper study.

Discussion on [SABP22]. The inspiring work [SABP22] is one of the motivations of our investigation. Its authors propose and analyze a modification of the self-attention mechanism from the Transformer artificial neural network architecture in which the self-attention matrix is enforced to be doubly stochastic. Since the Sinkhorn algorithm is applied to enforce double stochasticity, the architecture is called the Sinkformer. A major contribution of [SABP22] is the keen observation that self-attention mechanisms can be understood as determining a dynamical system whose particles are the input tokens (possibly after a suitable transformation). See also [GLPR24] for other work in this area. Specifically, [SABP22, Propositions 2 and 3] observe that, as a regularization parameter vanishes, the associated particle system under doubly stochastic self-attention evolves according to the heat flow, and that this property follows naturally from the doubly stochastic constraints.

Theorems 1-5 can be understood as probabilistic approaches to [SABP22, Theorem 1], as well as a finer treatment of key steps in the mathematical analysis of the convergence to the heat flow equation. The first statement of [SABP22, Theorem 1] concerns the convergence of the rescaled gradients of entropic potentials for the Schrödinger bridge with the same marginals. Consider the so-called entropic (or Schrödinger) potentials $(f_{\varepsilon},\varepsilon>0)$ as defined later in (19). In this language, [SABP22, Theorem 1] states that $\varepsilon^{-1}\nabla f_{\varepsilon}\to-\nabla\log\rho$ in $L^{2}(\rho)$ as $\varepsilon\rightarrow 0+$. We are not able to prove this particular convergence, not least because we work in a slightly different setting, i.e., with measures fully supported on $\mathbb{R}^{d}$, whereas [SABP22] assumes compact support (although see the discussion after the proof of Theorem 1 for a similar result). Instead, our argument relies on approximating the Schrödinger bridge by the Langevin diffusion, which is undefined for measures with compact support. Moreover, we develop a general machinery that demonstrates the ability of the Schrödinger bridge scheme to approximate gradient flows of functionals of interest beyond entropy.

2. Approximation of Schrödinger Bridge by Langevin Diffusion

2.1. Preliminaries

In this section we fix a measure $\rho\in\mathcal{P}_{2}(\mathbb{R}^{d})$ with density $\rho=e^{-g}$ for some function $g:\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$. Here we follow the standard convention in optimal transport of using the same notation for an absolutely continuous measure and its density. We will state assumptions on $g$ later. Let $\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})$ denote the collection of probability measures on $\mathbb{R}^{d}$ with a density and finite second moments. The aim of this section is to prove that when the two marginals are the same, the Schrödinger bridge and the stationary Langevin diffusion with that marginal are very close, in the sense made precise in Theorem 2.

We start by collating standard results about the Langevin diffusion and Schrödinger bridge and introducing the notions from stochastic calculus of which we will make use. The relevant notation is all summarized in Table Notations.

Let $C^{d}[0,\infty)$ denote the set of continuous paths from $[0,\infty)$ to $\mathbb{R}^{d}$, equipped with the locally uniform metric. Denote the coordinate process by $(\omega_{t},t\geq 0)$. Endow this space with a filtration satisfying the usual conditions [KS91, Chapter 1, Definition 2.25]. For $x\in\mathbb{R}^{d}$, let $W_{x}$ denote the law on $C^{d}[0,\infty)$ of the standard $d$-dimensional Brownian motion started from $x$. All stochastic integrals are Itô integrals.

Let $g\in C^{2}(\mathbb{R}^{d})$ and assume that $(Y_{t},t\geq 0)$, the stationary Langevin diffusion with stationary measure $\rho=e^{-g}$, exists. That is, its law on $C^{d}[0,\infty)$ is a weak solution to (6) with initial condition $Y_{0}\sim\rho$. Let $Q$ denote the law of this process on $C^{d}[0,\infty)$, and let $Q_{x}$ denote the law of the Langevin diffusion started from $x$. As in the Introduction, for $\varepsilon>0$, set $\ell_{\rho,\varepsilon}=\text{Law}(Y_{0},Y_{\varepsilon})$. Under suitable assumptions on $g$ (see Assumption 1 below), $(Y_{t},t\geq 0)$ exists with a.s. infinite explosion time [Roy07, Theorem 2.2.19], and by [Roy07, Lemma 2.2.21], for each time $t>0$ the laws $Q_{x}$ and $W_{x}$ restricted to $C^{d}[0,t]$ are mutually absolutely continuous with Radon-Nikodym derivative

(13) $\left.\dfrac{dQ_{x}}{dW_{x}}\right|_{C^{d}[0,t]}(\omega)=\sqrt{\dfrac{\rho(\omega_{t})}{\rho(x)}}\exp\left(-\int_{0}^{t}\left(\frac{1}{8}\|\nabla g(\omega_{s})\|^{2}-\frac{1}{4}\Delta g(\omega_{s})\right)ds\right).$

The expression inside the integral in the $\exp(\cdot)$ is a distinguished quantity in the Langevin diffusion literature (see [LK93]). Denote it by

(14) $\mathcal{U}(x):=\frac{1}{8}\|\nabla g(x)\|^{2}-\frac{1}{4}\Delta g(x).$

This function is called the harmonic characteristic in [CVR18, eqn. (2)], which provides a heuristic physical interpretation of $\mathcal{U}$ and $\nabla\mathcal{U}$ (the latter of which appears in Theorem 1) as the "mean acceleration" of the Langevin bridge. This quantity also appears in (69) as the second derivative in time of a quantity of interest, giving another way to see it as a mean acceleration.

Let $(p_{t}(\cdot,\cdot),t>0)$ and $(q_{t}(\cdot,\cdot),t>0)$ denote the transition kernels of the standard $d$-dimensional Brownian motion and the Langevin diffusion, respectively. By (13),

$\dfrac{q_{t}(x,y)}{p_{t}(x,y)}=\sqrt{\dfrac{\rho(y)}{\rho(x)}}\,\mathrm{E}_{W_{x}}\left[\exp\left(-\int_{0}^{t}\mathcal{U}(\omega_{s})ds\right)\,\middle|\;\omega_{t}=y\right].$

Notice that $\mathrm{E}_{W_{x}}[\cdot|\omega_{t}=y]$ is the expectation with respect to the law of the Brownian bridge from $x$ to $y$ over $[0,t]$. Define the function

(15) $c(x,y,\varepsilon):=-\log\left(\mathrm{E}_{W_{x}}\left[\exp\left(-\int_{0}^{\varepsilon}\mathcal{U}(\omega_{s})ds\right)\,\middle|\;\omega_{\varepsilon}=y\right]\right).$

Recall that $\ell_{\rho,\varepsilon}$ is the joint density of $(Y_{0},Y_{\varepsilon})$ under the Langevin diffusion; it can now be written as

(16) $\ell_{\rho,\varepsilon}(x,y)=\rho(x)q_{\varepsilon}(x,y)=\sqrt{\rho(x)\rho(y)}\,p_{\varepsilon}(x,y)\exp(-c(x,y,\varepsilon))$
(17) $\phantom{\ell_{\rho,\varepsilon}(x,y)}=\sqrt{\rho(x)\rho(y)}\,\dfrac{1}{(2\pi\varepsilon)^{d/2}}\exp\left(-\dfrac{1}{2\varepsilon}\|x-y\|^{2}-c(x,y,\varepsilon)\right).$

Let $(G_{t},t\geq 0)$ denote the corresponding Langevin semigroup and $L$ its extended infinitesimal generator (see [RY04, Chapter VII, Section 1] for definitions). That is, for $t\geq 0$ and $f:\mathbb{R}^{d}\to\mathbb{R}$ bounded and measurable,

$G_{t}f(x):=\mathrm{E}_{Q}[f(\omega_{t})|\omega_{0}=x].$

Letting $C^{2}(\mathbb{R}^{d})$ denote the collection of all twice continuously differentiable functions, by [RY04, Chapter VIII, Proposition 3.4] we have for $f\in C^{2}(\mathbb{R}^{d})$ that

(18) $Lf=\frac{1}{2}\Delta f-\frac{1}{2}\nabla g\cdot\nabla f.$

We now define the static Schrödinger bridge with both marginals equal to $\rho$, denoted by $\pi_{\rho,\varepsilon}$. For an excellent and detailed account we refer readers to [Léo14] and the references therein; we develop only the notions used in our proof. Under mild regularity assumptions on $\rho$ (for instance, finite entropy and a moment condition), it is known that for each $\varepsilon>0$, $\pi_{\rho,\varepsilon}$ admits an $(f,g)$-decomposition [Léo14, Theorem 2.8]. As we consider the equal-marginal Schrödinger bridge, we can in fact insist that $f=g$. Thus, there exists $f_{\varepsilon}:\mathbb{R}^{d}\to\mathbb{R}$ such that $\pi_{\rho,\varepsilon}$ admits a Lebesgue density of the form

(19) $\pi_{\rho,\varepsilon}(x,y)=\dfrac{1}{(2\pi\varepsilon)^{d/2}}\exp\left(\dfrac{1}{\varepsilon}\left(f_{\varepsilon}(x)+f_{\varepsilon}(y)-\dfrac{1}{2}\|x-y\|^{2}\right)\right)\rho(x)\rho(y).$

We refer to $(f_{\varepsilon},\varepsilon>0)$ as the entropic potentials. Note that, in general, entropic potentials are unique only up to an additive constant; however, our symmetry convention forces each $f_{\varepsilon}$ to be unique. Moreover, combining (3) and (19), the Schrödinger cost can be written in terms of the entropic potentials as

(20) $H(\pi_{\rho,\varepsilon}|\mu_{\rho,\varepsilon})=\mathrm{E}_{\pi_{\rho,\varepsilon}}\left[\log\dfrac{d\pi_{\rho,\varepsilon}}{d\mu_{\rho,\varepsilon}}\right]=\mathrm{E}_{\rho}\left[\dfrac{2}{\varepsilon}f_{\varepsilon}(X)\right]+\mathrm{Ent}(\rho).$

As outlined in [Léo14, Proposition 2.3], there is a corresponding dynamic Schrödinger bridge, obtained from the static bridge $\pi_{\rho,\varepsilon}$ by interpolating with suitably rescaled Brownian bridges. More precisely, let $R_{\varepsilon}$ denote the law of $(\sqrt{\varepsilon}B_{t},t\in[0,1])$ on $C^{d}[0,1]$, where $(B_{t},t\geq 0)$ is the reversible $d$-dimensional Brownian motion. Then the law of the $\varepsilon$-dynamic Schrödinger bridge on $C^{d}[0,1]$ is given by $S_{\rho,\varepsilon}$, where

(21) $\dfrac{dS_{\rho,\varepsilon}}{dR_{\varepsilon}}(\omega)=a^{\varepsilon}(\omega_{0})a^{\varepsilon}(\omega_{1}),$

with $a^{\varepsilon}(x):=\exp\left(\frac{1}{\varepsilon}f_{\varepsilon}(x)+\log\rho(x)\right)$. Letting $(P_{t},t\geq 0)$ denote the (standard) heat semigroup, we define

(22) $\psi^{\varepsilon}(t,x):=\log\mathrm{E}_{R_{\varepsilon}}\left[a^{\varepsilon}(\omega_{1})|\omega_{t}=x\right]=\log\mathrm{E}_{R_{\varepsilon}}\left[a^{\varepsilon}(\omega_{1-t})|\omega_{0}=x\right]=\log P_{\varepsilon(1-t)}a^{\varepsilon}(x),$

where the second identity follows from the Markov property of $R_{\varepsilon}$.

Fix ε>0\varepsilon>0, set ρsε:=(πs)#Sρ,ε\rho_{s}^{\varepsilon}\mathrel{\mathop{\mathchar 58\relax}}=(\pi_{s})_{\#}S_{\rho,\varepsilon} and call (ρsε,s[0,1])(\rho_{s}^{\varepsilon},s\in[0,1]) the ε\varepsilon-entropic interpolation from ρ\rho to itself, explicitly

(23) ρtε=PεtaεPε(1t)aε=exp(ψε(t,)+ψε(1t,)).\displaystyle\rho_{t}^{\varepsilon}=P_{\varepsilon t}a^{\varepsilon}P_{\varepsilon(1-t)}a^{\varepsilon}=\exp(\psi^{\varepsilon}(t,\cdot)+\psi^{\varepsilon}(1-t,\cdot)).

By [Léo14, Proposition 2.3], we construct a random variable with law ρtε\rho_{t}^{\varepsilon} as follows. Let (Xε,Yε)πρ,ε(X_{\varepsilon},Y_{\varepsilon})\sim\pi_{\rho,\varepsilon} and ZN(0,Id)Z\sim N(0,\operatorname{\mathrm{Id}}) be independent of the pair; then

(24) Xtε=(1t)Xε+tYε+εt(1t)Zρtε.\displaystyle X_{t}^{\varepsilon}=(1-t)X_{\varepsilon}+tY_{\varepsilon}+\sqrt{\varepsilon t(1-t)}Z\sim\rho_{t}^{\varepsilon}.
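As an illustration (not part of the development), in the univariate Gaussian case ρ=N(0,1)\rho=N(0,1) the static bridge πρ,ε\pi_{\rho,\varepsilon} is a centered bivariate Gaussian with unit variances and a known correlation bb (see Section 2.2 below), so (24) can be sampled directly. The following Python sketch checks the Monte Carlo variance of XtεX_{t}^{\varepsilon} against the value implied by (24):

```python
import numpy as np

rng = np.random.default_rng(2)
eps, t, n = 0.25, 0.5, 200_000
b = 0.5 * (np.sqrt(eps**2 + 4) - eps)          # bridge correlation for rho = N(0,1)
XY = rng.multivariate_normal([0.0, 0.0], [[1.0, b], [b, 1.0]], size=n)
Z = rng.standard_normal(n)
# The construction in (24):
Xt = (1 - t) * XY[:, 0] + t * XY[:, 1] + np.sqrt(eps * t * (1 - t)) * Z
# Variance implied by (24) for a Gaussian bridge with correlation b:
var_exact = (1 - t)**2 + t**2 + 2 * t * (1 - t) * b + eps * t * (1 - t)
print(Xt.var(), var_exact)
```

Note that at t=0t=0 and t=1t=1 the formula returns variance one, consistent with ρ0ε=ρ1ε=ρ\rho_{0}^{\varepsilon}=\rho_{1}^{\varepsilon}=\rho.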

Thanks to an analogue of the celebrated Benamou-Brenier formula for entropic cost ([GLR17, Theorem 5.1]), the entropic interpolation is in fact an absolutely continuous curve in the Wasserstein space. From [CT21, Equation (2.14)] the entropic interpolation satisfies the continuity equation (9) with (ρtε,vtε)t[0,1](\rho_{t}^{\varepsilon},v_{t}^{\varepsilon})_{t\in[0,1]} given by

(25) vtε=ε2(ψε(t,)ψε(1t,)).\displaystyle v_{t}^{\varepsilon}=\frac{\varepsilon}{2}\left(\nabla\psi^{\varepsilon}(t,\cdot)-\nabla\psi^{\varepsilon}(1-t,\cdot)\right).

Moreover, under mild regularity assumptions on ρ\rho, we obtain the following entropic variant of the Benamou-Brenier formulation [GLR17, Corollary 5.8] (with appropriate rescaling)

(26) H(πρ,ε|μρ,ε)\displaystyle H(\pi_{\rho,\varepsilon}{}|\mu_{\rho,\varepsilon}{}) =inf(ρt,vt)(12ε01vtL2(ρt)2𝑑t+ε801I(ρt)𝑑t),\displaystyle=\inf\limits_{(\rho_{t},v_{t})}\left(\frac{1}{2\varepsilon}\int_{0}^{1}\|v_{t}\|_{L^{2}(\rho_{t})}^{2}dt+\frac{\varepsilon}{8}\int_{0}^{1}I(\rho_{t})dt\right),

where I()I(\cdot) is the Fisher information, defined for σ𝒫2ac(d)\sigma\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) by I(σ)=logσ2𝑑σI(\sigma)=\int\|\nabla\log\sigma\|^{2}d\sigma. The infimum in (26) is taken over all (ρt,vt)t[0,1](\rho_{t},v_{t})_{t\in[0,1]} satisfying ρ0=ρ1=ρ\rho_{0}=\rho_{1}=\rho and (9), and it is attained by the entropic interpolation (25). For an AC curve (ρt,t[0,1])(\rho_{t},t\in[0,1]), we will call the quantity 01I(ρt)𝑑t\int_{0}^{1}I(\rho_{t})dt the integrated Fisher information.

Lastly, for the same-marginal dynamic Schrödinger bridge, we make the following observation about the continuity at ε=0\varepsilon=0 of the integrated Fisher information along the entropic interpolation.

Lemma 1.

Let ρ𝒫2(d)\rho\in\mathcal{P}_{2}(\mathbb{R}^{d}) satisfy <Ent(ρ)<+-\infty<\mathrm{Ent}(\rho)<+\infty and I(ρ)<+I(\rho)<+\infty, then limε001I(ρtε)𝑑t=I(ρ)\lim\limits_{\varepsilon\downarrow 0}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt=I(\rho).

Proof.

By [Léo12, Theorem 3.7(3)], for each s[0,1]s\in[0,1], ρsε\rho_{s}^{\varepsilon} converges weakly to ρ\rho as ε0\varepsilon\downarrow 0. By the lower semicontinuity of Fisher information with respect to weak convergence [Bob22, Proposition 14.2] and Fatou’s Lemma,

I(ρ)\displaystyle I(\rho) =01I(ρ)𝑑t01lim infε0I(ρtε)dtlim infε001I(ρtε)𝑑t.\displaystyle=\int_{0}^{1}I(\rho)dt\leq\int_{0}^{1}\liminf\limits_{\varepsilon\downarrow 0}I(\rho_{t}^{\varepsilon})dt\leq\liminf\limits_{\varepsilon\downarrow 0}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt.

Lastly, by (26) and the suboptimality of the constant curve ρt=ρ\rho_{t}=\rho, vt=0v_{t}=0 for t[0,1]t\in[0,1],

12ε01vtεL2(ρtε)2𝑑t+ε801I(ρtε)𝑑tε8I(ρ)01I(ρtε)𝑑tI(ρ).\displaystyle\frac{1}{2\varepsilon}\int_{0}^{1}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt+\frac{\varepsilon}{8}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\leq\frac{\varepsilon}{8}I(\rho)\Rightarrow\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\leq I(\rho).

Altogether, it follows that

0lim supε0(I(ρ)01I(ρtε)𝑑t)\displaystyle 0\leq\limsup\limits_{\varepsilon\downarrow 0}\left(I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\right) =I(ρ)lim infε001I(ρtε)𝑑tI(ρ)I(ρ)=0.\displaystyle=I(\rho)-\liminf\limits_{\varepsilon\downarrow 0}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\leq I(\rho)-I(\rho)=0.
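As a numerical aside illustrating Lemma 1 (an illustration, not part of the proof): for ρ=N(0,1)\rho=N(0,1), each ρtε\rho_{t}^{\varepsilon} is a centered Gaussian whose variance can be read off from the construction (24) with the closed-form bridge correlation from [JMPC20], and I(N(0,σ2))=1/σ2I(N(0,\sigma^{2}))=1/\sigma^{2}. The integrated Fisher information can then be evaluated by quadrature and seen to approach I(ρ)=1I(\rho)=1:

```python
import numpy as np

t = (np.arange(2000) + 0.5) / 2000                # midpoint quadrature grid on [0,1]
for eps in [1.0, 0.1, 0.01]:
    b = 0.5 * (np.sqrt(eps**2 + 4) - eps)         # bridge correlation for N(0,1)
    # Variance of rho_t^eps implied by (24) in the Gaussian case:
    var_t = (1 - t)**2 + t**2 + 2*t*(1 - t)*b + eps*t*(1 - t)
    # Integrated Fisher information: int_0^1 I(rho_t^eps) dt with I = 1/variance
    print(eps, (1.0 / var_t).mean())
```

The printed values approach 11 as ε0\varepsilon\downarrow 0, in agreement with the lemma.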

2.2. Langevin Approximation to Schrödinger Bridge

The main result of this section is the relative entropy bound between ρ,ε\ell_{\rho,\varepsilon} and πρ,ε\pi_{\rho,\varepsilon} given in Theorem 1.

Assumption 1.

Let ρ=eg\rho=e^{-g} for g:dg\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R}, and recall that 𝒰=18g214Δg\mathcal{U}=\frac{1}{8}\|\nabla g\|^{2}-\frac{1}{4}\Delta g. Assume that

  • (1)

    ρ\rho is subexponential ([Ver18, Definition 2.7.5]), gC3(d)g\in C^{3}(\mathbb{R}^{d}), 𝒰\mathcal{U} is bounded below, and g(x)g(x)\to\infty as x\|x\|\to\infty.

  • (2)

    There exists C>0C>0 and n1n\geq 1 such that 𝒰(x)2C(1+xn)\|\nabla\mathcal{U}(x)\|^{2}\leq C(1+\|x\|^{n}).

  • (3)

    <Ent(ρ)<+-\infty<\mathrm{Ent}(\rho)<+\infty and I(ρ)<+I(\rho)<+\infty.

Note that Assumption 1 (3) ensures the existence of πρ,ε\pi_{\rho,\varepsilon} and that (26) holds [GLR17, (Exi),(Reg1),(Reg2)].

These assumptions are, for instance, satisfied by the multivariate Gaussian N(μ,Σ)N(\mu,\Sigma) with μd\mu\in\mathbb{R}^{d}, Σd×d\Sigma\in\mathbb{R}^{d\times d} positive definite. This provides an essential class of examples as explicit computations of the Schrödinger bridge between Gaussians are known [JMPC20]. In this instance, g(x)=12(xμ)TΣ1(xμ)+12log((2π)ddetΣ)g(x)=\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)+\frac{1}{2}\log((2\pi)^{d}\det\Sigma) and thus

𝒰(x)=18(xμ)TΣ2(xμ)14Tr(Σ1)\displaystyle\mathcal{U}(x)=\frac{1}{8}(x-\mu)^{T}\Sigma^{-2}(x-\mu)-\frac{1}{4}\text{Tr}(\Sigma^{-1}) and 𝒰(x)=14Σ2(xμ).\displaystyle\text{ and }\nabla\mathcal{U}(x)=\frac{1}{4}\Sigma^{-2}(x-\mu).

Importantly, for XN(μ,Σ)X\sim N(\mu,\Sigma) this establishes that 𝒰(X)N(0,116Σ3)\nabla\mathcal{U}(X)\sim N(0,\frac{1}{16}\Sigma^{-3}) and thus E𝒰(X)2=116Tr(Σ3)\mathrm{E}\|\nabla\mathcal{U}(X)\|^{2}=\frac{1}{16}\text{Tr}(\Sigma^{-3}). More generally, these assumptions are also satisfied whenever gg is a polynomial such that deg<+\int_{\mathbb{R}^{d}}e^{-g}<+\infty.
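The identity E𝒰(X)2=116Tr(Σ3)\mathrm{E}\|\nabla\mathcal{U}(X)\|^{2}=\frac{1}{16}\text{Tr}(\Sigma^{-3}) admits a quick Monte Carlo sanity check (an illustration with arbitrary parameters, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 200_000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                   # arbitrary positive definite covariance
mu = rng.standard_normal(d)
Si = np.linalg.inv(Sigma)
X = rng.multivariate_normal(mu, Sigma, size=n)
grad_U = 0.25 * (X - mu) @ (Si @ Si)              # rows: grad U(x) = Sigma^{-2}(x - mu)/4
mc = (grad_U ** 2).sum(axis=1).mean()             # Monte Carlo E||grad U(X)||^2
exact = np.trace(Si @ Si @ Si) / 16               # Tr(Sigma^{-3}) / 16
print(mc, exact)
```

The two printed numbers agree up to Monte Carlo error of a fraction of a percent.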

Additionally, we remark that the class of functions gg satisfying Assumption 1 is stable under the addition of functions hC3(d)h\in C^{3}(\mathbb{R}^{d}) such that hh and all of its derivatives up to and including order 33 are bounded (e.g. hCc3(d)h\in C_{c}^{3}(\mathbb{R}^{d})). It is also stable under the addition of linear and quadratic functions, so long as the resulting function has a finite integral when exponentiated.

We begin with a preparatory lemma establishing some important integrability properties we make use of in the proof of Theorem 1.

Lemma 2.

Let ρ=eg𝒫2(d)\rho=e^{-g}\in\mathcal{P}_{2}(\mathbb{R}^{d}) satisfy Assumption 1.

  • (1)

    For each ε>0\varepsilon>0, the densities along the entropic interpolation (ρsε,s[0,1])(\rho_{s}^{\varepsilon},s\in[0,1]) are uniformly subexponential, meaning that for each T>0T>0 there is a constant K>0K>0 such that for XsερsεX_{s}^{\varepsilon}\sim\rho_{s}^{\varepsilon} we have supε(0,T)sups[0,1]Xsεψ1K\sup\limits_{\varepsilon\in(0,T)}\sup\limits_{s\in[0,1]}\|X_{s}^{\varepsilon}\|_{\psi_{1}}\leq K, where .ψ1\|.\|_{\psi_{1}} is the subexponential norm defined in [Ver18, Definition 2.7.5].

  • (2)

    The following integrability conditions hold:

    • (a)

      The function 𝒰=18g214Δg\mathcal{U}=\frac{1}{8}\|\nabla g\|^{2}-\frac{1}{4}\Delta g is in L2(ρsε)L^{2}(\rho_{s}^{\varepsilon}) for all s[0,1]s\in[0,1].

    • (b)

      The function from [0,1]×Cd[0,1][0,1]\times C^{d}[0,1] to \mathbb{R} given by (t,ω)𝒰(ωt)(t,\omega)\mapsto\mathcal{U}(\omega_{t}) is in L2(dtdSρ,ε)L^{2}(dt\otimes dS_{\rho,\varepsilon}). Similarly, (t,ω)𝒰(ωt)(t,\omega)\mapsto\nabla\mathcal{U}(\omega_{t}) is in L2(dtdSρ,ε)L^{2}(dt\otimes dS_{\rho,\varepsilon}).

    • (c)

      For any T>0T>0, supε(0,T)01ESρ,ε[𝒰(ωs)2]𝑑s<+\sup\limits_{\varepsilon\in(0,T)}\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}[\|\nabla\mathcal{U}(\omega_{s})\|^{2}]ds<+\infty.

Proof.

Fix ε>0\varepsilon>0 and s[0,1]s\in[0,1]. Let (X,Y)(X,Y) be random variables with joint law πρ,ε\pi_{\rho,\varepsilon}. Let ZN(0,Id)Z\sim N(0,\operatorname{\mathrm{Id}}) be independent of (X,Y)(X,Y). Then XsεX_{s}^{\varepsilon} as given in (24) has law ρsε\rho_{s}^{\varepsilon}. As ψ1\|\cdot\|_{\psi_{1}} is a norm, observe that for any T>0T>0

supε(0,T],s[0,1]Xsεψ1\displaystyle\sup\limits_{\varepsilon\in(0,T],s\in[0,1]}\|X_{s}^{\varepsilon}\|_{\psi_{1}} supε(0,T],s[0,1]((1s)Xψ1+sYψ1+εs(1s)Zψ1)<+\displaystyle\leq\sup\limits_{\varepsilon\in(0,T],s\in[0,1]}\left((1-s)\|X\|_{\psi_{1}}+s\|Y\|_{\psi_{1}}+\sqrt{\varepsilon s(1-s)}\|Z\|_{\psi_{1}}\right)<+\infty

establishing the uniform upper bound on the subexponential norm.

It then follows that

supε(0,T],s[0,1]E[𝒰(Xsε)2]\displaystyle\sup\limits_{\varepsilon\in(0,T],s\in[0,1]}\mathrm{E}[\|\nabla\mathcal{U}(X^{\varepsilon}_{s})\|^{2}] supε(0,T],s[0,1]C(1+E[Xsεn])<+.\displaystyle\leq\sup\limits_{\varepsilon\in(0,T],s\in[0,1]}C(1+\mathrm{E}[\|X_{s}^{\varepsilon}\|^{n}])<+\infty.

This observation also establishes that (t,ω)𝒰(ωt)(t,\omega)\mapsto\nabla\mathcal{U}(\omega_{t}) is in L2(dtdSρ,ε)L^{2}(dt\otimes dS_{\rho,\varepsilon}). By assumption on 𝒰\nabla\mathcal{U}, 𝒰\mathcal{U} is also of uniform polynomial growth, and thus the same argument establishes the remaining claims. ∎

Theorem 1.

For ρ𝒫2(d)\rho\in\mathcal{P}_{2}(\mathbb{R}^{d}) satisfying Assumption 1(1,3), for all ε>0\varepsilon>0

(27) H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε)12ε2(I(ρ)01I(ρtε)𝑑t)1/2(01Eρtε𝒰2𝑑t)1/2.\displaystyle H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon})\leq\frac{1}{2}\varepsilon^{2}\left(I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\right)^{1/2}\left(\int_{0}^{1}\mathrm{E}_{\rho_{t}^{\varepsilon}}\|\nabla\mathcal{U}\|^{2}dt\right)^{1/2}.

In particular, additionally under Assumption 1(2),

limε01ε2(H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε))=0.\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon^{2}}\left(H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon})\right)=0.

Before proving the theorem, we pause to comment on the sharpness of the inequality presented in (27). As the Schrödinger bridge is explicitly known for Gaussian marginals [JMPC20], consider the case ρ=N(0,1)\rho=N(0,1). Define two matrices

Σ1=(1eε/2eε/21),Σ2=(112(ε2+4ε)12(ε2+4ε)1).\displaystyle\Sigma_{1}=\begin{pmatrix}1&e^{-\varepsilon/2}\\ e^{-\varepsilon/2}&1\end{pmatrix},\Sigma_{2}=\begin{pmatrix}1&\frac{1}{2}(\sqrt{\varepsilon^{2}+4}-\varepsilon)\\ \frac{1}{2}(\sqrt{\varepsilon^{2}+4}-\varepsilon)&1\end{pmatrix}.

It is well known that ρ,εN(0,Σ1)\ell_{\rho,\varepsilon}\sim N(0,\Sigma_{1}), the joint law of the stationary Ornstein-Uhlenbeck process at times zero and ε\varepsilon. By [JMPC20], it is known that πρ,εN(0,Σ2)\pi_{\rho,\varepsilon}\sim N(0,\Sigma_{2}). One then computes

(28) H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε)=12Tr(Σ11Σ2)+12Tr(Σ21Σ1)2=11152ε4+𝒪(ε5).\displaystyle H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon})=\frac{1}{2}\text{Tr}(\Sigma_{1}^{-1}\Sigma_{2})+\frac{1}{2}\text{Tr}(\Sigma_{2}^{-1}\Sigma_{1})-2=\frac{1}{1152}\varepsilon^{4}+\mathcal{O}(\varepsilon^{5}).

On the other hand, the entropic interpolation between Gaussians is computed in [GLR17], and I(ρtε)I(\rho_{t}^{\varepsilon}) is minimized at t=1/2t=1/2. From a computation in Section 5, see (71),

I(ρ)01I(ρtε)𝑑tI(ρ)I(ρ1/2ε)=𝒪(ε2).\displaystyle I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\leq I(\rho)-I(\rho_{1/2}^{\varepsilon})=\mathcal{O}(\varepsilon^{2}).

Altogether, the RHS of (27) is then 𝒪(ε3)\mathcal{O}(\varepsilon^{3}). That is, the transition densities of the OU process approximate the same-marginal Schrödinger bridge an order of magnitude more tightly than Theorem 1 guarantees.
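The coefficient 1/11521/1152 in (28) can be checked numerically; the matrices below are exactly Σ1\Sigma_{1} and Σ2\Sigma_{2} above, and the log-determinant terms of the two Gaussian relative entropies cancel in the symmetrized sum:

```python
import numpy as np

def sym_kl(S1, S2):
    # Symmetrized KL between N(0, S1) and N(0, S2); log-det terms cancel.
    d = S1.shape[0]
    return 0.5 * np.trace(np.linalg.solve(S1, S2)) \
         + 0.5 * np.trace(np.linalg.solve(S2, S1)) - d

eps = 0.1
a = np.exp(-eps / 2)                       # off-diagonal of Sigma_1 (OU correlation)
b = 0.5 * (np.sqrt(eps**2 + 4) - eps)      # off-diagonal of Sigma_2 (bridge correlation)
S1 = np.array([[1.0, a], [a, 1.0]])
S2 = np.array([[1.0, b], [b, 1.0]])
print(sym_kl(S1, S2), eps**4 / 1152)       # the two values nearly agree
```

At ε=0.1\varepsilon=0.1 the ratio of the two printed quantities is within about ten percent of one, consistent with the 𝒪(ε5)\mathcal{O}(\varepsilon^{5}) remainder in (28).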

Proof of Theorem 1.

We argue in the following manner. First, using the densities for πρ,ε\pi_{\rho,\varepsilon} and ρ,ε\ell_{\rho,\varepsilon} given in (16) and (19), respectively, we compute their likelihood ratio and obtain exact expressions for the two relative entropy terms appearing in the statement of the theorem. The resulting expression is the difference of the expectations of c(x,y,ε)c(x,y,\varepsilon), defined in (15), under the Schrödinger bridge and the Langevin diffusion. We then extract the leading order terms from both expectations and show that they cancel. To obtain the stated rate of decay, we then analyze the remaining term with an analytic argument based on the continuity equation.

Step 1. The following holds

(29) H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε)\displaystyle H(\ell_{\rho,\varepsilon}{}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon}) =Eπρ,ε[c(X,Y,ε)]Eρ,ε[c(X,Y,ε)].\displaystyle=\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)]-\mathrm{E}_{\ell_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)].

To prove (29), from (16) and (19)

dρ,εdπρ,ε(x,y)=1ρ(x)ρ(y)exp(1εfε(x)1εfε(y)c(x,y,ε)).\displaystyle\frac{d\ell_{\rho,\varepsilon}{}}{d\pi_{\rho,\varepsilon}{}}(x,y)=\frac{1}{\sqrt{\rho(x)\rho(y)}}\exp\left(-\frac{1}{\varepsilon}f_{\varepsilon}(x)-\frac{1}{\varepsilon}f_{\varepsilon}(y)-c(x,y,\varepsilon)\right).

As both πρ,ε,ρ,εΠ(ρ,ρ)\pi_{\rho,\varepsilon}{},\ell_{\rho,\varepsilon}{}\in\Pi(\rho,\rho), we have that

H(ρ,ε|πρ,ε)\displaystyle H(\ell_{\rho,\varepsilon}{}|\pi_{\rho,\varepsilon}) =Eρ,ε[logdρ,εdπρ,ε]=2εEρ[fε(X)]Ent(ρ)Eρ,ε[c(X,Y,ε)],\displaystyle=\mathrm{E}_{\ell_{\rho,\varepsilon}{}}\left[\log\frac{d\ell_{\rho,\varepsilon}{}}{d\pi_{\rho,\varepsilon}{}}\right]=-\frac{2}{\varepsilon}\mathrm{E}_{\rho}[f_{\varepsilon}(X)]-\mathrm{Ent}(\rho)-\mathrm{E}_{\ell_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)],
H(πρ,ε|ρ,ε)\displaystyle H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon}{}) =Eπρ,ε[logdρ,εdπρ,ε]=2εEρ[fε(X)]+Ent(ρ)+Eπρ,ε[c(X,Y,ε)].\displaystyle=\mathrm{E}_{\pi_{\rho,\varepsilon}}\left[-\log\frac{d\ell_{\rho,\varepsilon}{}}{d\pi_{\rho,\varepsilon}{}}\right]=\frac{2}{\varepsilon}\mathrm{E}_{\rho}[f_{\varepsilon}(X)]+\mathrm{Ent}(\rho)+\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)].

Equation (29) follows by adding these expressions together.

Step 2. Recall RεR_{\varepsilon} as used in (21) and let

(30) R(x,y,ε)\displaystyle R(x,y,\varepsilon) :=log(ERε[exp(01ε(𝒰(ωs)𝒰(ω0))ds|ω0=x,ω1=y)]).\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=-\log\left(\mathrm{E}_{R_{\varepsilon}}\left[\exp\left(-\int_{0}^{1}\varepsilon\left(\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})\right)ds\middle|\omega_{0}=x,\omega_{1}=y\right)\right]\right).

Then we claim that

(31) H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε)\displaystyle H(\ell_{\rho,\varepsilon}{}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon}) Eπρ,ε[R(X,Y,ε)].\displaystyle\leq\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[R(X,Y,\varepsilon)].

Recall that WxW_{x} is Wiener measure starting from xx, and

c(x,y,ε)\displaystyle c(x,y,\varepsilon) =log(EWx[exp(0ε𝒰(ωs)ds)|ωε=y])\displaystyle=-\log\left(\mathrm{E}_{W_{x}}\left[\exp\left(-\int_{0}^{\varepsilon}\mathcal{U}(\omega_{s})ds\right)\middle|\;\omega_{\varepsilon}=y\right]\right)
=ε𝒰(x)log(EWx[exp(0ε(𝒰(ωs)𝒰(ω0))ds)|ωε=y]).\displaystyle=\varepsilon\mathcal{U}(x)-\log\left(\mathrm{E}_{W_{x}}\left[\exp\left(-\int_{0}^{\varepsilon}\left(\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})\right)ds\right)\middle|\;\omega_{\varepsilon}=y\right]\right).

For the integral inside the exp()\exp\left(\cdot\right), we rescale ss from [0,ε][0,\varepsilon] to [0,1][0,1]. By Brownian scaling the law of (ωεs,s0)(\omega_{\varepsilon s},s\geq 0) under WxW_{x} is equal to the law of (ωs,s0)(\omega_{s},s\geq 0) under RεR_{\varepsilon} with initial condition xx, which gives

c(x,y,ε)=ε𝒰(x)+R(x,y,ε).\displaystyle c(x,y,\varepsilon)=\varepsilon\mathcal{U}(x)+R(x,y,\varepsilon).

As πρ,εΠ(ρ,ρ)\pi_{\rho,\varepsilon}{}\in\Pi(\rho,\rho), thanks to an integration by parts,

Eπρ,ε[𝒰(X)]\displaystyle\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[\mathcal{U}(X)] =d𝒰(x)ρ(x)𝑑x=18I(ρ)d14Δg(x)eg(x)𝑑x\displaystyle=\int_{\mathbb{R}^{d}}\mathcal{U}(x)\rho(x)dx=\frac{1}{8}I(\rho)-\int_{\mathbb{R}^{d}}\frac{1}{4}\Delta g(x)e^{-g(x)}dx
=18I(ρ)d14g(x)2eg(x)𝑑x=18I(ρ).\displaystyle=\frac{1}{8}I(\rho)-\int_{\mathbb{R}^{d}}\frac{1}{4}\|\nabla g(x)\|^{2}e^{-g(x)}dx=-\frac{1}{8}I(\rho).
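The identity just derived, Eπρ,ε[𝒰(X)]=18I(ρ)\mathrm{E}_{\pi_{\rho,\varepsilon}}[\mathcal{U}(X)]=-\frac{1}{8}I(\rho), can be sanity-checked by Monte Carlo in the Gaussian case, where I(N(μ,Σ))=Tr(Σ1)I(N(\mu,\Sigma))=\text{Tr}(\Sigma^{-1}) (an illustrative aside with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 400_000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                   # arbitrary positive definite covariance
mu = rng.standard_normal(d)
Si = np.linalg.inv(Sigma)
X = rng.multivariate_normal(mu, Sigma, size=n) - mu
grad_g = X @ Si                                    # rows: grad g(x) = Sigma^{-1}(x - mu)
# U = ||grad g||^2 / 8 - (Laplacian g) / 4, with Laplacian g = Tr(Sigma^{-1})
U = 0.125 * (grad_g ** 2).sum(axis=1) - 0.25 * np.trace(Si)
print(U.mean(), -np.trace(Si) / 8)                 # should nearly agree
```

Here the sample mean of 𝒰\mathcal{U} matches 18Tr(Σ1)=18I(ρ)-\frac{1}{8}\text{Tr}(\Sigma^{-1})=-\frac{1}{8}I(\rho) up to Monte Carlo error.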

Thus from (29) observe

Eπρ,ε[c(X,Y,ε)]Eρ,ε[c(X,Y,ε)]\displaystyle\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)]-\mathrm{E}_{\ell_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)] ε8I(ρ)+Eπρ,ε[R(X,Y,ε)]Eρ,ε[c(X,Y,ε)].\displaystyle\leq-\frac{\varepsilon}{8}I(\rho)+\mathrm{E}_{\pi_{\rho,\varepsilon}}[R(X,Y,\varepsilon)]-\mathrm{E}_{\ell_{\rho,\varepsilon}{}}[c(X,Y,\varepsilon)].

Thus, to prove (31) it will suffice to show

(32) Eρ,ε[c(X,Y,ε)]ε8I(ρ).\displaystyle\mathrm{E}_{\ell_{\rho,\varepsilon}{}}[-c(X,Y,\varepsilon)]\leq\frac{\varepsilon}{8}I(\rho).

Observe that

dρ,εdμρ,ε\displaystyle\frac{d\ell_{\rho,\varepsilon}{}}{d\mu_{\rho,\varepsilon}{}} =ρ(y)ρ(x)exp(c(x,y,ε)).\displaystyle=\frac{\sqrt{\rho(y)}}{\sqrt{\rho(x)}}\exp(-c(x,y,\varepsilon)).

Thus, as ρ,εΠ(ρ,ρ)\ell_{\rho,\varepsilon}\in\Pi(\rho,\rho) and <Ent(ρ)<+-\infty<\mathrm{Ent}(\rho)<+\infty,

(33) H(ρ,ε|μρ,ε)\displaystyle H(\ell_{\rho,\varepsilon}{}|\mu_{\rho,\varepsilon}{}) =Eρ,ε[logdρ,εdμρ,ε]=Eρ,ε[c(X,Y,ε)].\displaystyle=\mathrm{E}_{\ell_{\rho,\varepsilon}{}}\left[\log\frac{d\ell_{\rho,\varepsilon}{}}{d\mu_{\rho,\varepsilon}{}}\right]=\mathrm{E}_{\ell_{\rho,\varepsilon}{}}\left[-c(X,Y,\varepsilon)\right].

On Cd[0,ε]C^{d}[0,\varepsilon] let WρW_{\rho} denote the law of Brownian motion with initial distribution ρ\rho. By a well-known formula (see [DGW04, Equation (5.1)] for instance)

H(Qx|Wx)\displaystyle H(Q_{x}|W_{x}) =12EQx[0ε12g(ωs)2𝑑s]=180εEQx[g(ωs)2]𝑑s.\displaystyle=\frac{1}{2}\mathrm{E}_{Q_{x}}\left[\int_{0}^{\varepsilon}\|-\frac{1}{2}\nabla g(\omega_{s})\|^{2}ds\right]=\frac{1}{8}\int_{0}^{\varepsilon}\mathrm{E}_{Q_{x}}[\|\nabla g(\omega_{s})\|^{2}]ds.

It then follows from the factorization of relative entropy that

H(Q|Wρ)\displaystyle H(Q|W_{\rho}) =H(Qx|Wx)ρ(x)𝑑x=18(0εEQx[g(ωs)2]𝑑s)ρ(x)𝑑x\displaystyle=\int H(Q_{x}|W_{x})\rho(x)dx=\frac{1}{8}\int\left(\int_{0}^{\varepsilon}\mathrm{E}_{Q_{x}}\left[\|\nabla g(\omega_{s})\|^{2}\right]ds\right)\rho(x)dx
=180εEQ[g(ωs)2]𝑑s=ε8I(ρ).\displaystyle=\frac{1}{8}\int_{0}^{\varepsilon}E_{Q}\left[\|\nabla g(\omega_{s})\|^{2}\right]ds=\frac{\varepsilon}{8}I(\rho).

As ρ,ε=(π0,πε)#Q\ell_{\rho,\varepsilon}=(\pi_{0},\pi_{\varepsilon})_{\#}Q, (32) follows by (33) and the information processing inequality (see [AGS08, Lemma 9.4.5] for instance).

Step 3. Show Eπρ,ε[R(X,Y,ε)]\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[R(X,Y,\varepsilon)] is bounded above by the RHS in (27).

To start, apply Jensen’s inequality to interchange the log-\log and ERε[|ω0=x,ω1=y]\mathrm{E}_{R_{\varepsilon}}[\;\cdot\;|\;\omega_{0}=x,\omega_{1}=y] in the expression for R(x,y,ε)R(x,y,\varepsilon) in (30) to obtain

R(x,y,ε)\displaystyle R(x,y,\varepsilon) εERε[01(𝒰(ωs)𝒰(ω0))ds|ω0=x,ω1=y].\displaystyle\leq\varepsilon\mathrm{E}_{R_{\varepsilon}}\left[\int_{0}^{1}\left(\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})\right)ds\middle|\;\omega_{0}=x,\omega_{1}=y\right].

Since the dynamic Schrödinger bridge, Sρ,εS_{\rho,\varepsilon} defined in (21), disintegrates as the Brownian bridge once its endpoints are fixed [Léo14, Proposition 2.3], we have that

(34) Eπρ,ε[R(X,Y,ε)]\displaystyle\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[R(X,Y,\varepsilon)] εEπρ,ε[ERε[01(𝒰(ωs)𝒰(ω0))ds|ω0=X,ω1=Y]]\displaystyle\leq\varepsilon\mathrm{E}_{\pi_{\rho,\varepsilon}{}}\left[\mathrm{E}_{R_{\varepsilon}}\left[\int_{0}^{1}\left(\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})\right)ds\middle|\;\omega_{0}=X,\omega_{1}=Y\right]\right]
(35) =ε01ESρ,ε[𝒰(ωs)𝒰(ω0)]𝑑s,\displaystyle=\varepsilon\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})\right]ds,

where the last equality follows from Fubini’s Theorem.

Recall that (ρtε,vtε)t[0,1](\rho_{t}^{\varepsilon},v_{t}^{\varepsilon})_{t\in[0,1]} is a weak solution for the continuity equation (9) if for all ψCc1(d)\psi\in C_{c}^{1}(\mathbb{R}^{d}) and t[0,1]t\in[0,1]

ddtEρtε[ψ]\displaystyle\frac{d}{dt}\mathrm{E}_{\rho_{t}^{\varepsilon}}[\psi] =Eρtε[vtε,ψ].\displaystyle=\mathrm{E}_{\rho_{t}^{\varepsilon}}[\langle v_{t}^{\varepsilon},\nabla\psi\rangle].

Integrating in time over [0,s][0,s] for s(0,1]s\in(0,1] gives

Eρsε[ψ]Eρ0ε[ψ]\displaystyle\mathrm{E}_{\rho_{s}^{\varepsilon}}[\psi]-\mathrm{E}_{\rho_{0}^{\varepsilon}}[\psi] =0sEρuε[vuε,ψ]𝑑u.\displaystyle=\int_{0}^{s}\mathrm{E}_{\rho_{u}^{\varepsilon}}[\langle v_{u}^{\varepsilon},\nabla\psi\rangle]du.

We now use a density argument to extend the above identity to

(36) ESρ,ε[𝒰(ωs)𝒰(ω0)]=Eρsε[𝒰]Eρ0ε[𝒰]\displaystyle\mathrm{E}_{S_{\rho,\varepsilon}}[\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})]=\mathrm{E}_{\rho_{s}^{\varepsilon}}[\mathcal{U}]-\mathrm{E}_{\rho_{0}^{\varepsilon}}[\mathcal{U}] =0sEρuε[vuε,𝒰]𝑑u.\displaystyle=\int_{0}^{s}\mathrm{E}_{\rho_{u}^{\varepsilon}}[\langle v_{u}^{\varepsilon},\nabla\mathcal{U}\rangle]du.

For R>0R>0 let χR:d\chi_{R}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R} be such that χRCc(d)\chi_{R}\in C_{c}^{\infty}(\mathbb{R}^{d}), 0χR10\leq\chi_{R}\leq 1, χR|B(0,R)1\left.\chi_{R}\right|_{B(0,R)}\equiv 1, and χR2\|\nabla\chi_{R}\|_{\infty}\leq 2. As χR𝒰Cc1(d)\chi_{R}\mathcal{U}\in C_{c}^{1}(\mathbb{R}^{d}),

ESρ,ε[(χR𝒰)(ωs)(χR𝒰)(ω0)]\displaystyle\mathrm{E}_{S_{\rho,\varepsilon}}[(\chi_{R}\mathcal{U})(\omega_{s})-(\chi_{R}\mathcal{U})(\omega_{0})] =0sESρ,ε[vuε(ωu),(χR𝒰)(ωu)]𝑑u\displaystyle=\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}[\langle v_{u}^{\varepsilon}(\omega_{u}),\nabla(\chi_{R}\mathcal{U})(\omega_{u})\rangle]du
=0sESρ,ε[vuε,𝒰χR]𝑑u+0sESρ,ε[vuε,χR𝒰]𝑑u.\displaystyle=\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\langle v_{u}^{\varepsilon},\nabla\mathcal{U}\rangle\chi_{R}\right]du+\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\langle v_{u}^{\varepsilon},\nabla\chi_{R}\rangle\mathcal{U}\right]du.

By Lemma 2, 01ESρ,ε[𝒰(ωs)2]𝑑s<+\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}[\|\nabla\mathcal{U}(\omega_{s})\|^{2}]ds<+\infty. Moreover, it is shown in (38) that 01ESρ,ε[vsε(ωs)2]𝑑s<+\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}[\|v_{s}^{\varepsilon}(\omega_{s})\|^{2}]ds<+\infty. Thus, by the Dominated Convergence Theorem

limR+0sESρ,ε[vuε,𝒰χR]𝑑u\displaystyle\lim\limits_{R\uparrow+\infty}\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\langle v_{u}^{\varepsilon},\nabla\mathcal{U}\rangle\chi_{R}\right]du =0sESρ,ε[vuε,𝒰]𝑑u.\displaystyle=\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\langle v_{u}^{\varepsilon},\nabla\mathcal{U}\rangle\right]du.

As |χR𝒰(ωs)+χR𝒰(ω0)||𝒰(ωs)|+|𝒰(ω0)|\mathinner{\!\left\lvert\chi_{R}\mathcal{U}(\omega_{s})+\chi_{R}\mathcal{U}(\omega_{0})\right\rvert}\leq\mathinner{\!\left\lvert\mathcal{U}(\omega_{s})\right\rvert}+\mathinner{\!\left\lvert\mathcal{U}(\omega_{0})\right\rvert} and

|vuε,χR𝒰|2vuε|𝒰|vuε(ωu)2+|𝒰(ωu)|2\displaystyle\mathinner{\!\left\lvert\langle v_{u}^{\varepsilon},\nabla\chi_{R}\rangle\mathcal{U}\right\rvert}\leq 2\|v_{u}^{\varepsilon}\|\mathinner{\!\left\lvert\mathcal{U}\right\rvert}\leq\|v_{u}^{\varepsilon}(\omega_{u})\|^{2}+\mathinner{\!\left\lvert\mathcal{U}(\omega_{u})\right\rvert}^{2}

and both upper bounds are integrable in their suitable spaces (again by Lemma 2 and (38)), (36) follows from dominated convergence by sending R+R\uparrow+\infty.

Integrating (36) once more in ss over [0,1][0,1], multiplying by ε\varepsilon, and plugging in (25) gives the RHS of (34) as

ε01ESρ,ε[𝒰(ωs)𝒰(ω0)]𝑑s\displaystyle\varepsilon\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}[\mathcal{U}(\omega_{s})-\mathcal{U}(\omega_{0})]ds =ε2010sEρuε[1εvuε,𝒰]𝑑u𝑑s.\displaystyle=\varepsilon^{2}\int_{0}^{1}\int_{0}^{s}\mathrm{E}_{\rho_{u}^{\varepsilon}}\left[\left\langle\frac{1}{\varepsilon}v_{u}^{\varepsilon},\nabla\mathcal{U}\right\rangle\right]duds.

Apply Cauchy-Schwarz on L2(Cd[0,1]×[0,1],dSρ,εdu)L^{2}(C^{d}[0,1]\times[0,1],dS_{\rho,\varepsilon}\otimes du) to obtain for each s[0,1]s\in[0,1]

|0sESρ,ε[1εvuε(ωu),𝒰(ωu)]𝑑u|\displaystyle\mathinner{\!\left\lvert-\int_{0}^{s}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\left\langle\frac{1}{\varepsilon}v_{u}^{\varepsilon}(\omega_{u}),\nabla\mathcal{U}(\omega_{u})\right\rangle\right]du\right\rvert} 01ESρ,ε[1εvuε(ωu)𝒰(ωu)]𝑑u\displaystyle\leq\int_{0}^{1}\mathrm{E}_{S_{\rho,\varepsilon}}\left[\frac{1}{\varepsilon}\|v_{u}^{\varepsilon}(\omega_{u})\|\|\nabla\mathcal{U}(\omega_{u})\|\right]du
(011ε2vuεL2(ρuε)2𝑑u)1/2(01Eρuε[𝒰2]𝑑u)1/2\displaystyle\leq\left(\int_{0}^{1}\frac{1}{\varepsilon^{2}}\|v_{u}^{\varepsilon}\|_{L^{2}(\rho_{u}^{\varepsilon})}^{2}du\right)^{1/2}\left(\int_{0}^{1}\mathrm{E}_{\rho_{u}^{\varepsilon}}[\|\nabla\mathcal{U}\|^{2}]du\right)^{1/2}

Thus, altogether we have that

(37) Eπρ,ε[R(X,Y,ε)]\displaystyle\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[R(X,Y,\varepsilon)] ε2(011ε2vtεL2(ρtε)2𝑑t)1/2(01Eρtε[𝒰2]𝑑t)1/2.\displaystyle\leq\varepsilon^{2}\left(\int_{0}^{1}\frac{1}{\varepsilon^{2}}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt\right)^{1/2}\left(\int_{0}^{1}\mathrm{E}_{\rho_{t}^{\varepsilon}}[\|\nabla\mathcal{U}\|^{2}]dt\right)^{1/2}.

Recall the entropic Benamou-Brenier expression for the entropic cost given in (26). By the optimality of (25) and the suboptimality of the constant curve ρt=ρ\rho_{t}=\rho for all t[0,1]t\in[0,1],

H(πρ,ε|μρ,ε)\displaystyle H(\pi_{\rho,\varepsilon}|\mu_{\rho,\varepsilon}) =12ε01vtεL2(ρtε)2𝑑t+ε801I(ρtε)𝑑tε8I(ρ).\displaystyle=\frac{1}{2\varepsilon}\int_{0}^{1}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt+\frac{\varepsilon}{8}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\leq\frac{\varepsilon}{8}I(\rho).

Rearranging terms gives

(38) 1ε201vtεL2(ρtε)2𝑑t14(I(ρ)01I(ρtε)𝑑t).\displaystyle\frac{1}{\varepsilon^{2}}\int_{0}^{1}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt\leq\frac{1}{4}\left(I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\right).

With this inequality, (37) becomes the desired

Eπρ,ε[R(X,Y,ε)]\displaystyle\mathrm{E}_{\pi_{\rho,\varepsilon}{}}[R(X,Y,\varepsilon)] 12ε2(I(ρ)01I(ρtε)𝑑t)1/2(01Eρtε[𝒰2]𝑑t)1/2.\displaystyle\leq\frac{1}{2}\varepsilon^{2}\left(I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt\right)^{1/2}\left(\int_{0}^{1}\mathrm{E}_{\rho_{t}^{\varepsilon}}[\|\nabla\mathcal{U}\|^{2}]dt\right)^{1/2}.

Under Assumption 1 (2), Lemma 2 applies and shows that the rightmost factor is uniformly bounded for ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}), for any ε0>0\varepsilon_{0}>0. Thus, the nonnegativity of the LHS of (27) and Lemma 1 give the stated convergence as ε0\varepsilon\downarrow 0.

2.3. Discussion

Integrated Fisher Information. Under additional assumptions the integrated Fisher information along the entropic interpolation in (27) can be replaced with a potentially more manageable quantity. In certain settings (for instance, when ρCc(d)\rho\in C_{c}^{\infty}(\mathbb{R}^{d}) with finite entropy and Fisher information [GLRT20, Lemma 3.2]), there is a conserved quantity along the entropic interpolation called the energy, which we denote ε(ρ)\mathcal{E}_{\varepsilon}(\rho). With (vtε,ρtε)t[0,1](v_{t}^{\varepsilon},\rho_{t}^{\varepsilon})_{t\in[0,1]} as defined in (25), for any t[0,1]t\in[0,1]

(39) ε(ρ):=12ε2vtεL2(ρtε)218I(ρtε).\displaystyle\mathcal{E}_{\varepsilon}(\rho)\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{2\varepsilon^{2}}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}-\frac{1}{8}I(\rho_{t}^{\varepsilon}).

Now, evaluate (39) at t=1/2t=1/2. Then v1/2ε=0v_{1/2}^{\varepsilon}=0 and ε(ρ)=18I(ρ1/2ε)\mathcal{E}_{\varepsilon}(\rho)=-\frac{1}{8}I(\rho_{1/2}^{\varepsilon}). On the other hand, integrating (39) over t[0,1]t\in[0,1] also gives ε(ρ)\mathcal{E}_{\varepsilon}(\rho), and thus

18I(ρ1/2ε)\displaystyle-\frac{1}{8}I(\rho_{1/2}^{\varepsilon}) =12ε201vtεL2(ρtε)2𝑑t1801I(ρtε)𝑑t.\displaystyle=\frac{1}{2\varepsilon^{2}}\int_{0}^{1}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt-\frac{1}{8}\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt.

This in turn implies

1ε201vtεL2(ρtε)2𝑑t\displaystyle\frac{1}{\varepsilon^{2}}\int_{0}^{1}\|v_{t}^{\varepsilon}\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt =14(01I(ρtε)𝑑tI(ρ1/2ε))14(I(ρ)I(ρ1/2ε)).\displaystyle=\frac{1}{4}\left(\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt-I(\rho_{1/2}^{\varepsilon})\right)\leq\frac{1}{4}\left(I(\rho)-I(\rho_{1/2}^{\varepsilon})\right).

Thus, assuming the conservation of energy holds under Assumption 1, replacing the bound in (38) with the RHS of the above inequality rewrites Theorem 1 as

(40) H(ρ,ε|πρ,ε)+H(πρ,ε|ρ,ε)12ε2(I(ρ)I(ρ1/2ε))1/2(01Eρtε𝒰2𝑑t)1/2.\displaystyle H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\ell_{\rho,\varepsilon})\leq\frac{1}{2}\varepsilon^{2}\left(I(\rho)-I(\rho_{1/2}^{\varepsilon})\right)^{1/2}\left(\int_{0}^{1}\mathrm{E}_{\rho_{t}^{\varepsilon}}\|\nabla\mathcal{U}\|^{2}dt\right)^{1/2}.

Alternatively, (40) also holds when ρ\rho is a univariate Gaussian, as explicit computations of the entropic interpolation in this case from [GLR17] demonstrate that I(ρtε)I(\rho_{t}^{\varepsilon}) is minimized at t=1/2t=1/2.

Schrödinger Cost Expansion. In the course of proving Theorem 1, we recover a result consistent with the expansion of the Schrödinger cost about ε=0\varepsilon=0 developed in [CT21, Theorem 1.6]. Under the assumptions that ρ\rho is bounded with compact support and has finite entropy, [CT21, Theorem 1.6] states

(41) H(πρ,ε|μρ,ε)\displaystyle H(\pi_{\rho,\varepsilon}|\mu_{\rho,\varepsilon}) =ε8I(ρ)+o(ε).\displaystyle=\frac{\varepsilon}{8}I(\rho)+o(\varepsilon).

Now, from the Pythagorean Theorem of relative entropy [Csi75, Theorem 2.2]

H(ρ,ε|μρ,ε)\displaystyle H(\ell_{\rho,\varepsilon}|\mu_{\rho,\varepsilon}) H(ρ,ε|πρ,ε)+H(πρ,ε|μρ,ε).\displaystyle\geq H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})+H(\pi_{\rho,\varepsilon}|\mu_{\rho,\varepsilon}).

Recall from Step 2 of the proof of Theorem 1 that H(ρ,ε|μρ,ε)ε8I(ρ)H(\ell_{\rho,\varepsilon}|\mu_{\rho,\varepsilon})\leq\frac{\varepsilon}{8}I(\rho), so by Theorem 1 we obtain the following expansion of an upper bound of the Schrödinger cost

(42) H(πρ,ε|μρ,ε)\displaystyle H(\pi_{\rho,\varepsilon}|\mu_{\rho,\varepsilon}) H(ρ,ε|μρ,ε)H(ρ,ε|πρ,ε)=ε8I(ρ)o(ε2).\displaystyle\leq H(\ell_{\rho,\varepsilon}|\mu_{\rho,\varepsilon})-H(\ell_{\rho,\varepsilon}|\pi_{\rho,\varepsilon})=\frac{\varepsilon}{8}I(\rho)-o(\varepsilon^{2}).

Thus, in the same-marginal case (and under different regularity assumptions) the second order term in the Schrödinger expansion is zero.

Convergence of Potentials. Revisiting the inequality in (38) and recalling (25), observe that

01ψε(t,)ψε(1t,)L2(ρtε)2𝑑tI(ρ)01I(ρtε)𝑑t.\displaystyle\int_{0}^{1}\|\nabla\psi^{\varepsilon}(t,\cdot)-\nabla\psi^{\varepsilon}(1-t,\cdot)\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}dt\leq I(\rho)-\int_{0}^{1}I(\rho_{t}^{\varepsilon})dt.

By Lemma 1, for Lebesgue a.e. t[0,1]t\in[0,1] it holds that

(43) limε0ψε(t,)ψε(1t,)L2(ρtε)2=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0}\|\nabla\psi^{\varepsilon}(t,\cdot)-\nabla\psi^{\varepsilon}(1-t,\cdot)\|_{L^{2}(\rho_{t}^{\varepsilon})}^{2}=0.

This fact points toward a result on the convergence of rescaled gradients of entropic potentials. Suppose that the convergence in (43) holds at t=1t=1 or t=0t=0 (of course, a priori there is no way to guarantee such convergence at a specific tt). In this case,

ψε(1,)ψε(0,)\displaystyle\nabla\psi^{\varepsilon}(1,\cdot)-\nabla\psi^{\varepsilon}(0,\cdot) =logaεlogPεaε\displaystyle=\nabla\log a^{\varepsilon}-\nabla\log P_{\varepsilon}a^{\varepsilon}
=logaεlogρaε=2εfε+logρ.\displaystyle=\nabla\log a^{\varepsilon}-\nabla\log\frac{\rho}{a^{\varepsilon}}=\frac{2}{\varepsilon}\nabla f_{\varepsilon}+\nabla\log\rho.

As ρ0ε=ρ1ε=ρ\rho_{0}^{\varepsilon}=\rho_{1}^{\varepsilon}=\rho, (43) would imply that limε01εfε=12logρ\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\nabla f_{\varepsilon}=-\frac{1}{2}\nabla\log\rho in L2(ρ)L^{2}(\rho). This would be in accordance with [SABP22, Theorem 1], but now with fully supported densities.

3. One Step Analysis of Schrödinger Bridge Scheme

Fix ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) and recall the tangent space at ρ\rho, denoted by Tan(ρ)\text{Tan}(\rho), and defined in [AGS08, Definition 8.4.1] by

Tan(ρ):={φ:φCc(d)}¯L2(ρ).\displaystyle\text{Tan}(\rho)\mathrel{\mathop{\mathchar 58\relax}}=\overline{\{\nabla\varphi\mathrel{\mathop{\mathchar 58\relax}}\varphi\in C_{c}^{\infty}(\mathbb{R}^{d})\}}^{L^{2}(\rho)}.

Now, consider a vector field v=uTan(ρ)v=\nabla u\in\text{Tan}(\rho) such that the surrogate measure (12) is defined. Our objective in this section is to lay down conditions under which one step of the scheme (SB) below approximates the corresponding explicit Euler step to order o(ε)o(\varepsilon). More precisely, we show that

(44) limε01ε𝕎2(SBε(ρ),(Id+εv)#ρ)=0.\lim_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),(\operatorname{\mathrm{Id}}+\varepsilon v)_{\#}\rho)=0.

In fact, we will prove the above statement in a somewhat more general form in which the tangent vector field may not even be of gradient form. The general setting is as follows. Consider a collection of vector fields of gradient type (vε=ψε,ε>0)(v_{\varepsilon}=\nabla\psi_{\varepsilon},\varepsilon>0); we wish to tightly approximate the pushforwards (Id+εvε)#ρ(\operatorname{\mathrm{Id}}+\varepsilon v_{\varepsilon})_{\#}\rho with SBε(ρ)\text{SB}_{\varepsilon}(\rho). As a shorthand, we write

(45) Sε1(ρ):=(Id+εvε)#ρ.\displaystyle S_{\varepsilon}^{1}\left(\rho\right)\mathrel{\mathop{\mathchar 58\relax}}=(\operatorname{\mathrm{Id}}+\varepsilon v_{\varepsilon})_{\#}\rho.

Suppose that for each ε>0\varepsilon>0 there exists θε{0}\theta_{\varepsilon}\in\mathbb{R}\setminus\{0\} such that the surrogate measure defined in (12) exists. Set

(46) σε(x)\displaystyle\sigma_{\varepsilon}(x) =exp(2θεψε(x)Λ0(θε)),whereΛ0(θε):=loge2θεψε(y)dy.\displaystyle=\exp\left(2\theta_{\varepsilon}\psi_{\varepsilon}(x)-\Lambda_{0}(\theta_{\varepsilon})\right),\quad\text{where}\quad\Lambda_{0}(\theta_{\varepsilon})\mathrel{\mathop{\mathchar 58\relax}}=\log\int e^{2\theta_{\varepsilon}\psi_{\varepsilon}(y)}dy.
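The normalization in (46) can be carried out numerically on a grid. The sketch below is our own illustration with hypothetical choices: ψε=12logρ\psi_{\varepsilon}=-\tfrac{1}{2}\log\rho with ρ=N(0,1)\rho=N(0,1) and θε=1\theta_{\varepsilon}=-1, for which 2θεψε=logρ2\theta_{\varepsilon}\psi_{\varepsilon}=\log\rho and σε\sigma_{\varepsilon} recovers ρ\rho itself, as in the heat-flow example of Section 5.

```python
import numpy as np

# Numerical sketch of the normalization in (46) on a grid. Hypothetical choices:
# psi_eps(x) = -(1/2) log rho(x) with rho = N(0,1) and theta_eps = -1, so that
# 2 * theta_eps * psi_eps = log rho and sigma_eps should recover rho itself.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
rho = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
theta_eps = -1.0
psi_eps = -0.5 * np.log(rho)

# Lambda_0(theta_eps) = log \int exp(2 theta_eps psi_eps(y)) dy  (Riemann sum)
unnorm = np.exp(2 * theta_eps * psi_eps)
Lambda0 = np.log(unnorm.sum() * dx)
sigma_eps = np.exp(2 * theta_eps * psi_eps - Lambda0)

print(sigma_eps.sum() * dx)           # ≈ 1.0 (sigma_eps is a probability density)
print(np.abs(sigma_eps - rho).max())  # ≈ 0.0 (sigma_eps coincides with rho here)
```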

Redefine one step in the SB scheme as

(SB) SBε(ρ)\displaystyle\text{SB}_{\varepsilon}(\rho) =((1θε1)Id+θε1σε,ε)#ρ,\displaystyle=\left(\left(1-\theta_{\varepsilon}^{-1}\right)\operatorname{\mathrm{Id}}+\theta_{\varepsilon}^{-1}\mathcal{B}_{\sigma_{\varepsilon},\varepsilon}\right)_{\#}\rho,

where σε,ε\mathcal{B}_{\sigma_{\varepsilon},\varepsilon} is the barycentric projection defined in (4).
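The scheme (SB) is amenable to particle approximation via the Sinkhorn algorithm. The following is a minimal numerical sketch, not taken from the paper: it assumes the heat-flow setting of Section 5, where σε=ρ\sigma_{\varepsilon}=\rho and θε=1\theta_{\varepsilon}=-1, so that one step of (SB) reads x2xσε,ε(x)x\mapsto 2x-\mathcal{B}_{\sigma_{\varepsilon},\varepsilon}(x); the sample size, regularization, and iteration counts are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 400, 0.1
x = rng.normal(size=n)                    # particles from rho = N(0,1)

# Static Schrodinger bridge between the empirical measure and itself:
# Sinkhorn scaling iterations for the Gibbs kernel exp(-|x-y|^2/(2*eps)).
C = 0.5 * (x[:, None] - x[None, :]) ** 2
K = np.exp(-C / eps)
a = np.full(n, 1.0 / n)                   # uniform weights on both marginals
u = np.ones(n)
for _ in range(500):
    v = a / (K.T @ u)
    u = a / (K @ v)
P = u[:, None] * K * v[None, :]           # entropic coupling pi

# Barycentric projection B(x_i) = E_pi[Y | X = x_i], cf. (4)
B = (P @ x) / P.sum(axis=1)

# One step of (SB) with theta_eps = -1: x -> 2x - B(x)
x_new = 2 * x - B

# The heat flow spreads mass: the empirical variance grows by roughly eps
print(x.var(), x_new.var())
```

For a Gaussian sample the variance should increase by approximately ε\varepsilon after one step, consistent with running the heat flow for time ε\varepsilon.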

In Theorem 2 we show that, under appropriate assumptions,

(47) limε01ε𝕎2(SBε(ρ),Sε1(ρ))=0.\lim_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right))=0.

There are multiple reasons why one might be interested in this general setup. Firstly, we know that, for any vTan(ρ)v\in\text{Tan}(\rho), there exists a sequence of tangent elements of the type vε=ψεv_{\varepsilon}=\nabla\psi_{\varepsilon} such that limε0+vε=v\lim_{\varepsilon\rightarrow 0+}v_{\varepsilon}=v in 𝐋2(ρ)\mathbf{L}^{2}(\rho). In particular, by an obvious coupling,

(48) lim supε0+1ε𝕎2((Id+εvε)#ρ,(Id+εv)#ρ)limε0+vεv𝐋2(ρ)=0.\limsup_{\varepsilon\rightarrow 0+}\frac{1}{\varepsilon}\mathbb{W}_{2}((\operatorname{\mathrm{Id}}+\varepsilon v_{\varepsilon})_{\#}\rho,(\operatorname{\mathrm{Id}}+\varepsilon v)_{\#}\rho)\leq\lim_{\varepsilon\rightarrow 0+}\mathinner{\!\left\lVert v_{\varepsilon}-v\right\rVert}_{\mathbf{L}^{2}(\rho)}=0.

But, of course, if vv is not of the gradient form, one cannot construct a surrogate measure as in (12) to run the Schrödinger bridge scheme. A natural remedy is to show that (47) holds under appropriate assumptions. Then, combined with (48) we recover (44) even when vv is not a gradient.

There is one more reason why one might use the above generalized scheme. As explained in (11), we would like to iterate the Schrödinger bridge scheme to track the explicit Euler iterations of a Wasserstein gradient flow. However, the more natural discretization scheme of Wasserstein gradient flows is the implicit Euler or the JKO scheme. See (IE) in Section 4 for the details. In this case, however, the vector field vεv_{\varepsilon}, which is the Kantorovich potential between two successive steps of the JKO scheme, changes with ε\varepsilon. This necessitates our generalization. Hence, we continue with (46) and the generalized (SB) scheme.

The proof of (47) relies on the close approximation of the Langevin diffusion to the Schrödinger bridge provided in Theorem 1. To harness this approximation, we introduce an intermediary scheme based on the Langevin diffusion, which we denote LDε()\text{LD}_{\varepsilon}(\cdot) and define in the following manner. Let ε>0\varepsilon>0 and consider the symmetric Langevin diffusion (Xt, 0tε)\left(X_{t},\;0\leq t\leq\varepsilon\right) with stationary density σε\sigma_{\varepsilon} as defined in (46). The drift function for this diffusion is x12logσε(x)=θεψε(x)=θεvε(x)x\mapsto\frac{1}{2}\nabla\log\sigma_{\varepsilon}(x)=\theta_{\varepsilon}\nabla\psi_{\varepsilon}(x)=\theta_{\varepsilon}v_{\varepsilon}(x). In analogy with the definition of σε,ε()\mathcal{B}_{\sigma_{\varepsilon},\varepsilon}(\cdot) in (4), define σε,ε()\mathcal{L}_{\sigma_{\varepsilon},\varepsilon}(\cdot) by

(49) σε,ε(x)\displaystyle\mathcal{L}_{\sigma_{\varepsilon},\varepsilon}(x) =Eσε,ε[Y|X=x].\displaystyle=\mathrm{E}_{\ell_{\sigma_{\varepsilon},\varepsilon}}[Y|X=x].

Next, in analogy with the definition of SBε()\text{SB}_{\varepsilon}(\cdot) in (SB), define the Langevin diffusion update

(LD) LDε(ρ)\displaystyle\text{LD}_{\varepsilon}(\rho) =((1θε1)Id+θε1σε,ε)#ρ.\displaystyle=\left(\left(1-\theta_{\varepsilon}^{-1}\right)\operatorname{\mathrm{Id}}+\theta_{\varepsilon}^{-1}\mathcal{L}_{\sigma_{\varepsilon},\varepsilon}\right)_{\#}\rho.
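The map σε,ε\mathcal{L}_{\sigma_{\varepsilon},\varepsilon} can be estimated by Monte Carlo simulation of the diffusion. The sketch below is our own illustration in a case with a closed form: for σε=N(0,1)\sigma_{\varepsilon}=N(0,1) the Langevin diffusion is the Ornstein-Uhlenbeck process dXt=12Xtdt+dBtdX_{t}=-\tfrac{1}{2}X_{t}\,dt+dB_{t}, for which E[Xε|X0=x]=xeε/2\mathrm{E}[X_{\varepsilon}|X_{0}=x]=xe^{-\varepsilon/2}.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, n_paths, n_steps = 0.2, 20000, 200
dt = eps / n_steps
x0 = 1.5                                   # a fixed starting point

# Euler-Maruyama for dX = (1/2) grad log sigma(X) dt + dB with sigma = N(0,1),
# i.e. the Ornstein-Uhlenbeck diffusion dX = -(X/2) dt + dB
X = np.full(n_paths, x0)
for _ in range(n_steps):
    X += -0.5 * X * dt + np.sqrt(dt) * rng.normal(size=n_paths)

# Monte Carlo estimate of the conditional mean in (49) at x = x0
L_est = X.mean()
print(L_est, x0 * np.exp(-eps / 2))        # both close to 1.357
```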

In analogy with Section 2, for ε>0\varepsilon>0 set

𝒰ε\displaystyle\mathcal{U}_{\varepsilon} :=18logσε2+14Δlogσε.\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{8}\|\nabla\log\sigma_{\varepsilon}\|^{2}+\frac{1}{4}\Delta\log\sigma_{\varepsilon}.

We now make the following assumptions on the collection of densities (σε,ε>0)(\sigma_{\varepsilon},\varepsilon>0) defined in (46).

Assumption 2.

For each ε>0\varepsilon>0, σε\sigma_{\varepsilon} as defined in (46) satisfies Assumption 1. Additionally

  • (1)

    There exists σ0𝒫2ac(d)\sigma_{0}\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) such that (σε,ε>0)(\sigma_{\varepsilon},\varepsilon>0) converges weakly to σ0\sigma_{0} and I(σε)I(σ0)I(\sigma_{\varepsilon})\to I(\sigma_{0}) as ε0\varepsilon\downarrow 0. Moreover, Assumption 1 holds uniformly in ε\varepsilon in the following manner: there exists ε0>0\varepsilon_{0}>0 and C,D>0C,D>0, N1N\geq 1 such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}), 𝒰ε(x)2C(1+xN)\|\nabla\mathcal{U}_{\varepsilon}(x)\|^{2}\leq C(1+\|x\|^{N}) and for XεσεX_{\varepsilon}\sim\sigma_{\varepsilon}, Xεψ1D\|X_{\varepsilon}\|_{\psi_{1}}\leq D.

  • (2)

    The (σε,ε(0,ε0))(\sigma_{\varepsilon},\varepsilon\in(0,\varepsilon_{0})) are uniformly semi log-concave, that is, there exists λ\lambda\in\mathbb{R} such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}), 2logσελId-\nabla^{2}\log\sigma_{\varepsilon}\geq\lambda\operatorname{\mathrm{Id}}. Additionally, L:=lim supε0Eσε2logσεHS2<+L\mathrel{\mathop{\mathchar 58\relax}}=\limsup\limits_{\varepsilon\downarrow 0}\mathrm{E}_{\sigma_{\varepsilon}}\|\nabla^{2}\log\sigma_{\varepsilon}\|_{HS}^{2}<+\infty, where .HS\|.\|_{HS} is the Hilbert-Schmidt norm.

  • (3)

    χ:=lim supε0ρ/σε<+\chi\mathrel{\mathop{\mathchar 58\relax}}=\limsup\limits_{\varepsilon\downarrow 0}\|\rho/\sigma_{\varepsilon}\|_{\infty}<+\infty and θ:=lim infε0|θε|>0\theta\mathrel{\mathop{\mathchar 58\relax}}=\liminf\limits_{\varepsilon\downarrow 0}\mathinner{\!\left\lvert\theta_{\varepsilon}\right\rvert}>0.

Remark 1.

In all our examples, θε\theta_{\varepsilon} is either always +1+1 or always 1-1. Hence, this parameter simply needs to have a positive lower bound (in absolute value) and it does not scale with ε\varepsilon.

While these assumptions may appear stringent, they are indeed satisfied by a wide class of examples; see Section 5. In many of these examples, σε\sigma_{\varepsilon} takes the form σε=ρeVε\sigma_{\varepsilon}=\rho e^{V_{\varepsilon}}. Thus, condition (3) in Assumption 2 is satisfied if lim infε0infxdVε>\liminf\limits_{\varepsilon\downarrow 0}\inf\limits_{x\in\mathbb{R}^{d}}V_{\varepsilon}>-\infty. For instance, when using (SB) to approximate (EE) for the gradient flow of entropy, σε=ρ\sigma_{\varepsilon}=\rho for all ε>0\varepsilon>0 and thus Vε0V_{\varepsilon}\equiv 0. Similarly, for approximating the gradient flow of H(|N(0,1))H(\cdot|N(0,1)) from certain starting measures, Vε(x)=12x2+CV_{\varepsilon}(x)=\frac{1}{2}x^{2}+C for all ε>0\varepsilon>0, where CC is a normalizing constant. See Section 5 for a more detailed discussion.
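To indicate where this quadratic VεV_{\varepsilon} comes from, one may sketch the computation, assuming (as in the entropy example) θε=1\theta_{\varepsilon}=-1 and ψε=12log(ρ/γ)\psi_{\varepsilon}=-\tfrac{1}{2}\log(\rho/\gamma), with γ\gamma the standard Gaussian density in dimension one:

```latex
\sigma_{\varepsilon}(x)
  = \exp\bigl(2\theta_{\varepsilon}\psi_{\varepsilon}(x)
      - \Lambda_{0}(\theta_{\varepsilon})\bigr)
  = \frac{\rho(x)}{\gamma(x)}\, e^{-\Lambda_{0}(\theta_{\varepsilon})}
  = \rho(x)\,\exp\Bigl(\frac{1}{2}x^{2} + C\Bigr),
\qquad
C = \tfrac{1}{2}\log(2\pi) - \Lambda_{0}(\theta_{\varepsilon}).
```

Here Λ0\Lambda_{0} is the normalizing constant from (46), so VεV_{\varepsilon} is indeed of the stated quadratic form.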

We now provide a quantitative statement about the convergence stated in (47), where we take care to note the dependence on constants appearing in Assumption 2 as this upper bound must hold uniformly along iterations of this scheme in Section 4. Recall Sε1()S_{\varepsilon}^{1}\left(\cdot\right) as defined in (45) and SBε()\text{SB}_{\varepsilon}(\cdot) as defined in (SB).

Theorem 2.

Under Assumption 2, there exists a constant K>0K>0 depending on C,D,N,λ,L,χ,θC,D,N,\lambda,L,\chi,\theta, such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0})

𝕎2(SBε(ρ),Sε1(ρ))\displaystyle\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right)) Kε[(I(σε)01I((σε)sε)𝑑s)1/4+ε].\displaystyle\leq K\varepsilon\left[\left(I(\sigma_{\varepsilon})-\int_{0}^{1}I((\sigma_{\varepsilon})^{\varepsilon}_{s})ds\right)^{1/4}+\sqrt{\varepsilon}\right].

In particular,

limε01ε𝕎2(SBε(ρ),Sε1(ρ))=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right))=0.

Hence, from the discussion following (48) we immediately obtain:

Corollary 1.

Let vTan(ρ)v\in\text{Tan}(\rho), and let (ψε,ε>0)C(d)Tan(ρ)(\nabla\psi_{\varepsilon},\varepsilon>0)\subset C^{\infty}(\mathbb{R}^{d})\cap\text{Tan}(\rho) be such that limε0+ψεv𝐋2(ρ)=0\lim\limits_{\varepsilon\to 0+}\|\nabla\psi_{\varepsilon}-v\|_{\mathbf{L}^{2}(\rho)}=0 and the corresponding surrogate measures (σε,ε>0)(\sigma_{\varepsilon},\varepsilon>0) satisfy Assumption 2. Then

limε0+1ε𝕎2((Id+εv)#ρ,SBε(ρ))=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0+}\frac{1}{\varepsilon}\mathbb{W}_{2}((\operatorname{\mathrm{Id}}+\varepsilon v)_{\#}\rho,\text{SB}_{\varepsilon}(\rho))=0.
Proof of Theorem 2.

This result follows from Lemmas 3, 4, and 5 below. Lemmas 3 and 4 together prove an upper bound on 𝕎2(SBε(ρ),LDε(ρ))\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),\text{LD}_{\varepsilon}(\rho)) that shows, in particular,

limε01ε𝕎2(SBε(ρ),LDε(ρ))=0,\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\mathbb{W}_{2}(\text{SB}_{\varepsilon}(\rho),\text{LD}_{\varepsilon}(\rho))=0,

and Lemma 5 proves an upper bound on 𝕎2(LDε(ρ),Sε1(ρ))\mathbb{W}_{2}(\text{LD}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right)) which implies

limε01ε𝕎2(LDε(ρ),Sε1(ρ))=0.\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon}\mathbb{W}_{2}(\text{LD}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right))=0.

That is, the three updates SBε(ρ)\text{SB}_{\varepsilon}(\rho), LDε(ρ)\text{LD}_{\varepsilon}(\rho), and Sε1(ρ)S_{\varepsilon}^{1}\left(\rho\right) are within o(ε)o(\varepsilon) of each other in the Wasserstein-2 metric. Theorem 2 now follows from the triangle inequality and simplifying the choice of the constant KK. ∎

We begin by converting the relative entropy decay established in Theorem 1 between the Schrödinger bridge and Langevin diffusion into a decay rate between their respective barycentric projections in the Wasserstein-2 metric. The essential tool for this conversion is a Talagrand inequality from [CG14, Theorem 1.8(3)].

Lemma 3.

Let ε>0\varepsilon>0. Under Assumption 2, there exists a constant α>0\alpha>0 depending on ε\varepsilon and λ\lambda such that

𝕎22((σε,ε)#ρ,(σε,ε)#ρ)\displaystyle\mathbb{W}_{2}^{2}((\mathcal{B}_{\sigma_{\varepsilon},\varepsilon})_{\#}\rho,(\mathcal{L}_{\sigma_{\varepsilon},\varepsilon})_{\#}\rho) αρ/σεH(πσε,ε|σε,ε)\displaystyle\leq\alpha\|\rho/\sigma_{\varepsilon}\|_{\infty}H(\pi_{\sigma_{\varepsilon},\varepsilon}|\ell_{\sigma_{\varepsilon},\varepsilon})
𝕎22(SBε(ρ),LDε(ρ))\displaystyle\mathbb{W}_{2}^{2}(\text{SB}_{\varepsilon}(\rho),\text{LD}_{\varepsilon}(\rho)) αθε2ρ/σεH(πσε,ε|σε,ε).\displaystyle\leq\alpha\theta_{\varepsilon}^{-2}\|\rho/\sigma_{\varepsilon}\|_{\infty}H(\pi_{\sigma_{\varepsilon},\varepsilon}|\ell_{\sigma_{\varepsilon},\varepsilon}).

Moreover, α\alpha can be chosen to be increasing in ε\varepsilon.

Recall that θε{0}\theta_{\varepsilon}\in\mathbb{R}\setminus\{0\} is a constant chosen for integrability reasons outlined in (46); it is simply an O(1)O(1) term as ε0\varepsilon\downarrow 0. See Remark 1.

Proof of Lemma 3.

Fix T>0T>0 and ε(0,T)\varepsilon\in(0,T). To avoid unnecessary subscripts, set σ:=σε\sigma\mathrel{\mathop{\mathchar 58\relax}}=\sigma_{\varepsilon}. To start, we produce a coupling of σ,ε\ell_{\sigma,\varepsilon} and πσ,ε\pi_{\sigma,\varepsilon} by applying a gluing argument (as in [Vil03, Lemma 7.6], for example) to the optimal couplings between the corresponding transition densities for the Langevin diffusion and the ε\varepsilon-static Schrödinger bridge. From this coupling, we obtain a coupling of the desired barycentric projections of σ,ε\ell_{\sigma,\varepsilon} and πσ,ε\pi_{\sigma,\varepsilon}. We then use the chain rule of relative entropy alongside the Talagrand inequality stated in [CG14, Theorem 1.8(3)] to obtain the inequality stated in the Lemma.

Let (qt(,),t0)(q_{t}(\cdot,\cdot),t\geq 0) denote the transition densities of the stationary Langevin diffusion associated to σ\sigma. By [CG14, Theorem 1.8(3)], there is a constant α>0\alpha>0 depending on λ\lambda and TT such that the following Talagrand inequality holds for each xdx\in\mathbb{R}^{d}

𝕎22(η,qε(x,))\displaystyle\mathbb{W}_{2}^{2}(\eta,q_{\varepsilon}(x,\cdot)) αH(η|qε(x,)) for all η𝒫(d).\displaystyle\leq\alpha H(\eta|q_{\varepsilon}(x,\cdot))\text{ for all $\eta\in\mathcal{P}(\mathbb{R}^{d})$.}

Let (p(x,),xd)(p(x,\cdot),x\in\mathbb{R}^{d}) denote the conditional densities of the ε\varepsilon-static Schrödinger bridge with marginals σ\sigma. Construct a triplet of d\mathbb{R}^{d}-valued random variables (X,Y,Z)(X,Y,Z) in the following way. Let XρX\sim\rho, and let the conditional measures (Y|X=x,Z|X=x)(Y|X=x,Z|X=x) have the law of the quadratic cost optimal coupling between p(x,)p(x,\cdot) and qε(x,)q_{\varepsilon}(x,\cdot). Denote the law of (X,Y,Z)(X,Y,Z) on d×d×d\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{R}^{d} by ν\nu. Observe that (Eν[Y|X],Eν[Z|X])(\mathrm{E}_{\nu}[Y|X],\mathrm{E}_{\nu}[Z|X]) is a coupling of (σε,ε)#ρ(\mathcal{B}_{\sigma_{\varepsilon},\varepsilon})_{\#}\rho and (σε,ε)#ρ(\mathcal{L}_{\sigma_{\varepsilon},\varepsilon})_{\#}\rho. Moreover, observe that ((1θε1)X+θε1Eν[Y|X],(1θε1)X+θε1Eν[Z|X])((1-\theta_{\varepsilon}^{-1})X+\theta_{\varepsilon}^{-1}\mathrm{E}_{\nu}[Y|X],(1-\theta_{\varepsilon}^{-1})X+\theta_{\varepsilon}^{-1}\mathrm{E}_{\nu}[Z|X]) is a coupling of SBε(ρ)\text{SB}_{\varepsilon}(\rho) and LDε(ρ)\text{LD}_{\varepsilon}(\rho). Thus, to obtain an upper bound on 𝕎22(SBε(ρ),LDε(ρ))\mathbb{W}_{2}^{2}(\text{SB}_{\varepsilon}(\rho),\text{LD}_{\varepsilon}(\rho)), it suffices to bound EνEν[Y|X]Eν[Z|X]2\mathrm{E}_{\nu}\|\mathrm{E}_{\nu}[Y|X]-\mathrm{E}_{\nu}[Z|X]\|^{2}.

We first observe for each xdx\in\mathbb{R}^{d} that by two applications of the (conditional) Jensen’s inequality, the optimality of (Y|X=x,Z|X=x)(Y|X=x,Z|X=x), and the aforementioned Talagrand inequality

Eν[Y|X=x]Eν[Z|X=x]\displaystyle\|\mathrm{E}_{\nu}[Y|X=x]-\mathrm{E}_{\nu}[Z|X=x]\| 𝕎2(qε(x,),p(x,))αH(p(x,)|qε(x,)).\displaystyle\leq\mathbb{W}_{2}(q_{\varepsilon}(x,\cdot),p(x,\cdot))\leq\sqrt{\alpha H(p(x,\cdot)|q_{\varepsilon}(x,\cdot))}.

Hence, using the chain rule of relative entropy and the fact that the Langevin diffusion and Schrödinger bridge both start from σ\sigma

𝕎22\displaystyle\mathbb{W}_{2}^{2} ((σ,ε)#ρ,(σ,ε)#ρ)EνEν[Y|X]Eν[Z|X]2\displaystyle((\mathcal{B}_{\sigma,\varepsilon})_{\#}\rho,(\mathcal{L}_{\sigma,\varepsilon})_{\#}\rho)\leq\mathrm{E}_{\nu}\|\mathrm{E}_{\nu}[Y|X]-\mathrm{E}_{\nu}[Z|X]\|^{2}
αH(p(x,)|qε(x,))ρ(x)𝑑x=αH(p(x,)|qε(x,))σ(x)ρ(x)σ(x)𝑑x\displaystyle\leq\int\alpha H(p(x,\cdot)|q_{\varepsilon}(x,\cdot))\rho(x)dx=\int\alpha H(p(x,\cdot)|q_{\varepsilon}(x,\cdot))\sigma(x)\frac{\rho(x)}{\sigma(x)}dx
αρ/σεH(πσε,ε|σε,ε).\displaystyle\leq\alpha\|\rho/\sigma_{\varepsilon}\|_{\infty}H(\pi_{\sigma_{\varepsilon},\varepsilon}|\ell_{\sigma_{\varepsilon},\varepsilon}). ∎

Now, we revisit the bound from Theorem 1 and apply it to the relative entropy appearing in Lemma 3.

Lemma 4.

Under Assumption 2, there exists a constant K>0K>0 depending on NN, CC, and DD such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0})

(50) H(πσε,ε|σε,ε)+H(σε,ε|πσε,ε)ε22K(I(σε)01I((σε)sε)𝑑s)1/2\displaystyle H(\pi_{\sigma_{\varepsilon},\varepsilon}|\ell_{\sigma_{\varepsilon},\varepsilon})+H(\ell_{\sigma_{\varepsilon},\varepsilon}|\pi_{\sigma_{\varepsilon},\varepsilon})\leq\frac{\varepsilon^{2}}{2}K\left(I(\sigma_{\varepsilon})-\int_{0}^{1}I((\sigma_{\varepsilon})_{s}^{\varepsilon})ds\right)^{1/2}

In particular,

limε01ε2(H(πσε,ε|σε,ε)+H(σε,ε|πσε,ε))=0.\lim\limits_{\varepsilon\downarrow 0}\frac{1}{\varepsilon^{2}}\left(H(\pi_{\sigma_{\varepsilon},\varepsilon}|\ell_{\sigma_{\varepsilon},\varepsilon})+H(\ell_{\sigma_{\varepsilon},\varepsilon}|\pi_{\sigma_{\varepsilon},\varepsilon})\right)=0.
Proof.

Let ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) and, in analogy with Section 2, let ((σε)sε,s[0,1])((\sigma_{\varepsilon})_{s}^{\varepsilon},s\in[0,1]) denote the ε\varepsilon-entropic interpolation from σε\sigma_{\varepsilon} to itself and set

𝒰ε\displaystyle\mathcal{U}_{\varepsilon} =18logσε2+14Δlogσε.\displaystyle=\frac{1}{8}\|\nabla\log\sigma_{\varepsilon}\|^{2}+\frac{1}{4}\Delta\log\sigma_{\varepsilon}.

By the assumed uniform subexponential bound on the σε\sigma_{\varepsilon} and uniform polynomial growth of 𝒰ε\mathcal{U}_{\varepsilon}, there exists a constant K>0K>0 as specified in the Lemma statement such that

(51) supε(0,ε0),s[0,1]E(σε)sε[𝒰ε2]K.\displaystyle\sup\limits_{\varepsilon\in(0,\varepsilon_{0}),s\in[0,1]}\mathrm{E}_{(\sigma_{\varepsilon})_{s}^{\varepsilon}}[\|\nabla\mathcal{U}_{\varepsilon}\|^{2}]\leq K.
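To indicate how (51) is obtained: recalling that (σε)sε(\sigma_{\varepsilon})_{s}^{\varepsilon} is the law of sXε+(1s)Yε+εs(1s)ZsX_{\varepsilon}+(1-s)Y_{\varepsilon}+\sqrt{\varepsilon s(1-s)}Z with Xε,YεσεX_{\varepsilon},Y_{\varepsilon}\sim\sigma_{\varepsilon} and ZZ standard normal, one may sketch (with cNc_{N} a constant depending only on NN and ε0\varepsilon_{0})

```latex
\mathrm{E}_{(\sigma_{\varepsilon})_{s}^{\varepsilon}}\|\nabla\mathcal{U}_{\varepsilon}\|^{2}
  \le C\Bigl(1+\mathrm{E}\bigl\|sX_{\varepsilon}+(1-s)Y_{\varepsilon}
      +\sqrt{\varepsilon s(1-s)}\,Z\bigr\|^{N}\Bigr)
  \le C\Bigl(1+c_{N}\bigl(\mathrm{E}\|X_{\varepsilon}\|^{N}
      +\mathrm{E}\|Y_{\varepsilon}\|^{N}+\mathrm{E}\|Z\|^{N}\bigr)\Bigr),
```

and EXεN2N!DN\mathrm{E}\|X_{\varepsilon}\|^{N}\leq 2\,N!\,D^{N} by the uniform ψ1\psi_{1} bound, so the right side is bounded uniformly in ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) and s[0,1]s\in[0,1].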

Theorem 1 and (51) then establish (50) when ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}). It now remains to show that

(52) limε0[I(σε)01I((σε)sε)𝑑s]=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0}\left[I(\sigma_{\varepsilon})-\int_{0}^{1}I((\sigma_{\varepsilon})_{s}^{\varepsilon})ds\right]=0.

First, we claim that ((σε)sε,ε>0)((\sigma_{\varepsilon})_{s}^{\varepsilon},\varepsilon>0) converges weakly to σ0\sigma_{0} as ε0\varepsilon\downarrow 0 for all s[0,1]s\in[0,1]. As (σε,ε>0)(\sigma_{\varepsilon},\varepsilon>0) converges weakly to σ0\sigma_{0}, the collection {σ0}(σε,ε>0)\{\sigma_{0}\}\cup(\sigma_{\varepsilon},\varepsilon>0) is tight. This implies that (πσε,ε,ε>0)(\pi_{\sigma_{\varepsilon},\varepsilon},\varepsilon>0) is tight. Let π\pi^{*} denote a weak subsequential limit; we will not relabel the subsequence. It is clear that πΠ(σ0,σ0)\pi^{*}\in\Pi(\sigma_{0},\sigma_{0}). In the terminology of [BGN22], the support of each πσε,ε\pi_{\sigma_{\varepsilon},\varepsilon} is (c,ε)(c,\varepsilon)-cyclically invariant, where c(x,y)=12xy2c(x,y)=\frac{1}{2}\|x-y\|^{2}. As π\pi^{*} is a weak subsequential limit of the (πσε,ε,ε>0)(\pi_{\sigma_{\varepsilon},\varepsilon},\varepsilon>0), the exact same argument presented in [BGN22, Lemma 3.1, Lemma 3.2], modified only superficially to allow for the different marginal at each ε\varepsilon, establishes that π\pi^{*} is cc-cyclically monotone. That is, π=πσ0,0\pi^{*}=\pi_{\sigma_{0},0} is the quadratic cost optimal transport plan from σ0\sigma_{0} to itself. Moreover, this argument shows that along any subsequence of ε\varepsilon decreasing to 0, there is a further subsequence converging to πσ0,0\pi_{\sigma_{0},0}. Hence, the limit holds without passing to a subsequence. Construct a sequence of random variables ((Xε,Yε),ε0)((X_{\varepsilon},Y_{\varepsilon}),\varepsilon\geq 0) with Law(Xε,Yε)=πσε,ε\text{Law}(X_{\varepsilon},Y_{\varepsilon})=\pi_{\sigma_{\varepsilon},\varepsilon}, and let ZZ be a standard normal random variable independent of the collection. Fix s[0,1]s\in[0,1], and recall that Law(sXε+(1s)Yε+εs(1s)Z)=(σε)sε\text{Law}(sX_{\varepsilon}+(1-s)Y_{\varepsilon}+\sqrt{\varepsilon s(1-s)}Z)=(\sigma_{\varepsilon})_{s}^{\varepsilon}.
The Continuous Mapping Theorem gives that sXε+(1s)Yε+εs(1s)ZsX_{\varepsilon}+(1-s)Y_{\varepsilon}+\sqrt{\varepsilon s(1-s)}Z converges in distribution to sX0+(1s)Y0sX_{0}+(1-s)Y_{0} as ε0\varepsilon\downarrow 0. As Law(sX0+(1s)Y0)=σ0\text{Law}(sX_{0}+(1-s)Y_{0})=\sigma_{0}, this establishes the weak convergence of (σε)sε(\sigma_{\varepsilon})_{s}^{\varepsilon} to σ0\sigma_{0} as ε0\varepsilon\downarrow 0. By the lower semicontinuity of Fisher information with respect to weak convergence [Bob22, Proposition 14.2] and Fatou’s Lemma

I(σ0)\displaystyle I(\sigma_{0}) 01lim infε0I((σε)sε)dslim infε001I((σε)sε)𝑑s.\displaystyle\leq\int_{0}^{1}\liminf\limits_{\varepsilon\downarrow 0}I((\sigma_{\varepsilon})_{s}^{\varepsilon})ds\leq\liminf\limits_{\varepsilon\downarrow 0}\int_{0}^{1}I((\sigma_{\varepsilon})_{s}^{\varepsilon})ds.

Thus, (52) holds as Assumption 2(1) gives limε0I(σε)=I(σ0)\lim\limits_{\varepsilon\downarrow 0}I(\sigma_{\varepsilon})=I(\sigma_{0}). ∎

As the last piece of the proof, we show in the lemma below that LDε(ρ)\text{LD}_{\varepsilon}(\rho) is an o(ε)o(\varepsilon) approximation of the first step of the Euler iteration Sε1(ρ)S_{\varepsilon}^{1}\left(\rho\right).

Lemma 5.

For ρ=eg\rho=e^{-g} satisfying Assumption 2,

𝕎22(Sε1(ρ),LDε(ρ))2θε2ρ/σεε3Eσε2logσεHS2.\mathbb{W}_{2}^{2}(S_{\varepsilon}^{1}\left(\rho\right),\text{LD}_{\varepsilon}(\rho))\leq 2\theta_{\varepsilon}^{-2}\|\rho/\sigma_{\varepsilon}\|_{\infty}\varepsilon^{3}\mathrm{E}_{\sigma_{\varepsilon}}\|\nabla^{2}\log\sigma_{\varepsilon}\|^{2}_{HS}\,.
Proof.

The proof follows by producing an intuitive coupling of Sε1(ρ)S_{\varepsilon}^{1}\left(\rho\right) and LDε(ρ)\text{LD}_{\varepsilon}(\rho) and then applying a semigroup property stated in [BGL14, Equation (4.2.3)]. Let (Gsσε,s0)(G_{s}^{\sigma_{\varepsilon}},s\geq 0) be the semigroup associated to the Langevin diffusion corresponding to σε\sigma_{\varepsilon}; then for all ε>0\varepsilon>0, s0s\geq 0, and f:df\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R} in the domain of the Dirichlet form

(53) GsσεffL2(σε)22sfL2(σε)2.\displaystyle\|G_{s}^{\sigma_{\varepsilon}}f-f\|^{2}_{L^{2}(\sigma_{\varepsilon})}\leq 2s\|\nabla f\|_{L^{2}(\sigma_{\varepsilon})}^{2}.

Let XρX\sim\rho, then by Dynkin’s formula

σε,ε(X)\displaystyle\mathcal{L}_{\sigma_{\varepsilon},\varepsilon}(X) =X+0εGsσε(θεψε)(X)𝑑s\displaystyle=X+\int_{0}^{\varepsilon}G^{\sigma_{\varepsilon}}_{s}(\theta_{\varepsilon}\nabla\psi_{\varepsilon})(X)ds

and thus

LDε(ρ)\displaystyle\text{LD}_{\varepsilon}(\rho) =Law((1θε1)X+θε1(X+0εGsσε(θεψε)(X)𝑑s))\displaystyle=\text{Law}\left((1-\theta_{\varepsilon}^{-1})X+\theta_{\varepsilon}^{-1}\left(X+\int_{0}^{\varepsilon}G^{\sigma_{\varepsilon}}_{s}(\theta_{\varepsilon}\nabla\psi_{\varepsilon})(X)ds\right)\right)
=Law(X+0εGsσε(ψε)(X)𝑑s).\displaystyle=\text{Law}\left(X+\int_{0}^{\varepsilon}G^{\sigma_{\varepsilon}}_{s}(\nabla\psi_{\varepsilon})(X)ds\right).

Also by definition

Sε1(ρ)\displaystyle S_{\varepsilon}^{1}\left(\rho\right) =Law(X+εψε(X)).\displaystyle=\text{Law}\left(X+\varepsilon\nabla\psi_{\varepsilon}(X)\right).

Thus, we have the upper bound

𝕎22(LDε(ρ),Sε1(ρ))\displaystyle\mathbb{W}_{2}^{2}(\text{LD}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right)) E(X+0εGsσε(ψε)(X)ds)(X+εψε(X))2\displaystyle\leq\mathrm{E}\left\|\left(X+\int_{0}^{\varepsilon}G^{\sigma_{\varepsilon}}_{s}(\nabla\psi_{\varepsilon})(X)ds\right)-\left(X+\varepsilon\nabla\psi_{\varepsilon}(X)\right)\right\|^{2}
=E0ε(Gsσε(ψε)(X)ψε(X))𝑑s2\displaystyle=\mathrm{E}\left\|\int_{0}^{\varepsilon}\left(G_{s}^{\sigma_{\varepsilon}}(\nabla\psi_{\varepsilon})(X)-\nabla\psi_{\varepsilon}(X)\right)ds\right\|^{2}
ε0εEGsσε(ψε)(X)ψε(X)2𝑑s\displaystyle\leq\varepsilon\int_{0}^{\varepsilon}\mathrm{E}\|G_{s}^{\sigma_{\varepsilon}}(\nabla\psi_{\varepsilon})(X)-\nabla\psi_{\varepsilon}(X)\|^{2}ds
=ε201Gεsσε(ψε)ψεL2(ρ)2𝑑s,\displaystyle=\varepsilon^{2}\int_{0}^{1}\|G_{\varepsilon s}^{\sigma_{\varepsilon}}(\nabla\psi_{\varepsilon})-\nabla\psi_{\varepsilon}\|_{L^{2}(\rho)}^{2}ds,

where the penultimate inequality follows from Jensen’s inequality. Then, perform a change of measure and apply (53) to obtain

𝕎22(LDε(ρ),Sε1(ρ))\displaystyle\mathbb{W}_{2}^{2}(\text{LD}_{\varepsilon}(\rho),S_{\varepsilon}^{1}\left(\rho\right)) 2ρ/σεε3Eσε2ψεHS2.\displaystyle\leq 2\|\rho/\sigma_{\varepsilon}\|_{\infty}\varepsilon^{3}\mathrm{E}_{\sigma_{\varepsilon}}\|\nabla^{2}\psi_{\varepsilon}\|_{HS}^{2}.

As ψε=(2θε)1logσε\nabla\psi_{\varepsilon}=(2\theta_{\varepsilon})^{-1}\nabla\log\sigma_{\varepsilon} by (46), the right side is at most 2θε2ρ/σεε3Eσε2logσεHS22\theta_{\varepsilon}^{-2}\|\rho/\sigma_{\varepsilon}\|_{\infty}\varepsilon^{3}\mathrm{E}_{\sigma_{\varepsilon}}\|\nabla^{2}\log\sigma_{\varepsilon}\|_{HS}^{2}, which is the stated bound. ∎
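For intuition on the key semigroup estimate (53), it can be verified in closed form in a simple special case of our choosing: with σε=N(0,1)\sigma_{\varepsilon}=N(0,1) the semigroup is Ornstein-Uhlenbeck, and Gsf=es/2fG_{s}f=e^{-s/2}f for f(x)=xf(x)=x, so the inequality reduces to (1es/2)22s(1-e^{-s/2})^{2}\leq 2s.

```python
import numpy as np

# Check of (53) for the OU semigroup (sigma = N(0,1), f(x) = x):
#   ||G_s f - f||_{L^2(sigma)}^2 = (1 - e^{-s/2})^2   and   2s ||grad f||^2 = 2s.
s = np.linspace(1e-4, 5.0, 1000)
lhs = (1.0 - np.exp(-s / 2)) ** 2
rhs = 2.0 * s
print(bool(np.all(lhs <= rhs)))   # True
```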

4. Iterated Schrödinger Bridge Scheme

In this section we present conditions under which suitably interpolated iterations of (SB) converge to the gradient flow of entropy and relative entropy (i.e., solutions to the parabolic heat equation and Fokker-Planck equation). Our results rely on the close approximation of (SB) to the explicit Euler iterations that we define later on. We also present a collection of sufficient conditions for which iterations of (SB) converge to the gradient flow of geodesically convex functionals. Roughly put, our assumptions are that (1) the one-step approximation given in Theorem 2 holds along iterations of (SB) (which we call consistency) and (2) the error of this approximation does not accumulate too rapidly over a finite time horizon (which is implied by the condition we call contraction in the limit). Although these conditions seem reasonable (in Section 5 we show examples that satisfy them), it is not immediate how generally they hold.

4.1. Preliminaries

We set up some notation for subdifferential calculus in Wasserstein space. Let :𝒫2(d)(,]\mathcal{F}\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}_{2}(\mathbb{R}^{d})\to(-\infty,\infty] be a proper, lower-semicontinuous, and λ\lambda-geodesically convex functional for some λ\lambda\in\mathbb{R}. Let D()={ρ𝒫2(d):(ρ)<+}D(\mathcal{F})=\{\rho\in\mathcal{P}_{2}(\mathbb{R}^{d})\mathrel{\mathop{\mathchar 58\relax}}\mathcal{F}(\rho)<+\infty\}. The subdifferential set of \mathcal{F} at ρD()\rho\in D(\mathcal{F}) is denoted by (ρ){\partial}\mathcal{F}(\rho) (see [AGS08, Definition 10.1.1] for the definition of Fréchet subdifferentials). With D()={ρD():(ρ)}D(\partial\mathcal{F})=\{\rho\in D(\mathcal{F})\mathrel{\mathop{\mathchar 58\relax}}\partial\mathcal{F}(\rho)\neq\emptyset\}, we assume that D()𝒫2ac(d)D(\partial\mathcal{F})\subset\mathcal{P}_{2}^{\text{ac}}(\mathbb{R}^{d}), the subset of absolutely continuous probability measures in 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}). By Brenier's theorem [San15, Theorem 1.22], this implies that there exists a unique optimal transport map from any ρD()\rho\in D(\partial\mathcal{F}) to any ν𝒫2(d)\nu\in\mathcal{P}_{2}(\mathbb{R}^{d}). We denote this map by TρνT_{\rho}^{\nu}. Since \mathcal{F} is λ\lambda-geodesically convex, \mathcal{F} is a regular functional [AGS08, Definition 10.1.4], which implies that for all ρD()\rho\in D(\partial\mathcal{F}), the subdifferential set (ρ)\partial\mathcal{F}(\rho) contains a unique element of minimal L2(ρ)L^{2}(\rho) norm. This minimal selection, hereafter called the Wasserstein gradient and denoted by 𝕎(ρ)\nabla_{\mathbb{W}}\mathcal{F}(\rho), belongs to the tangent space Tan(ρ)\text{Tan}(\rho) [AGS08, Definition 8.4.1]. We further assume that there exists an ε0>0\varepsilon_{0}>0 such that the functional

ν𝒫2(d)(ν)+12ε𝕎22(ν,ρ)\nu\in\mathcal{P}_{2}(\mathbb{R}^{d})\mapsto\mathcal{F}(\nu)+\frac{1}{2\varepsilon}\mathbb{W}_{2}^{2}(\nu,\rho)

admits at least one minimum point for any ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) and ρ𝒫2(d)\rho\in\mathcal{P}_{2}(\mathbb{R}^{d}).

The two primary classes of examples that we consider later are (i) the (one half) entropy functional (ρ)=12Ent(ρ)\mathcal{F}(\rho)=\frac{1}{2}\mathrm{Ent}(\rho), for which the gradient flow is the probabilistic heat flow, and (ii) the (one half) relative entropy (also known as the Kullback-Leibler divergence) with respect to a log-concave probability measure, i.e., (ρ)=12H(ρ|ν)\mathcal{F}(\rho)=\frac{1}{2}H\left(\rho|\nu\right) for some log-concave density ν\nu. These are well-known examples for which the above assumptions are satisfied. In Section 5 we will cover examples that are not covered by our assumptions and yet, as we will show by direct computations, our iterated Schrödinger bridge scheme continues to perform well.

4.2. Gradient Flows and their Euler Approximations

Let (ρ(t),t[0,T])\left(\rho(t),\;t\in[0,T]\right) denote the Wasserstein gradient flow of \mathcal{F} on the time interval [0,T][0,T]. We assume that it exists as an absolutely continuous curve that is the unique solution of the continuity equation

tρ(t)+(v(t)ρ(t))=0,wherev(t)=𝕎(ρ(t))\partial_{t}\rho(t)+\nabla\cdot(v(t)\rho(t))=0,\quad\text{where}\;v(t)=-\nabla_{\mathbb{W}}\mathcal{F}(\rho(t))

is assumed to be in Tan(ρ(t))\text{Tan}(\rho(t)).
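For the two primary classes of examples from Section 4.1, this continuity equation specializes, via the standard computations 𝕎(12Ent)(ρ)=12logρ\nabla_{\mathbb{W}}(\tfrac{1}{2}\mathrm{Ent})(\rho)=\tfrac{1}{2}\nabla\log\rho and 𝕎(12H(|ν))(ρ)=12log(ρ/ν)\nabla_{\mathbb{W}}(\tfrac{1}{2}H(\cdot|\nu))(\rho)=\tfrac{1}{2}\nabla\log(\rho/\nu), to the heat equation and a Fokker-Planck equation:

```latex
\partial_{t}\rho(t) = \tfrac{1}{2}\Delta\rho(t)
\qquad\text{and}\qquad
\partial_{t}\rho(t)
  = \tfrac{1}{2}\nabla\cdot\Bigl(\rho(t)\,\nabla\log\frac{\rho(t)}{\nu}\Bigr)
  = \tfrac{1}{2}\Delta\rho(t) - \tfrac{1}{2}\nabla\cdot\bigl(\rho(t)\,\nabla\log\nu\bigr).
```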

Euler iterations of the Wasserstein gradient flow of \mathcal{F}, whether explicit or implicit, form a sequence of elements in the Wasserstein space (ρε(k),k=0,1,2,)\left(\rho_{\varepsilon}(k),\;k=0,1,2,\ldots\right), starting at some ρε(0)=ρ(0)\rho_{\varepsilon}(0)=\rho(0) and given by an iterative map Sε1():𝒫2(d)𝒫2(d)S_{\varepsilon}^{1}\left(\cdot\right)\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}_{2}(\mathbb{R}^{d})\rightarrow\mathcal{P}_{2}(\mathbb{R}^{d}) of the form (45), in the sense that ρε(k)=Sε1(ρε(k1))=Sεk(ρε(0))\rho_{\varepsilon}(k)=S_{\varepsilon}^{1}\left(\rho_{\varepsilon}(k-1)\right)=S_{\varepsilon}^{k}\left(\rho_{\varepsilon}(0)\right) for kk\in\mathbb{N}. The continuous-time approximation curve is given by the piecewise constant interpolation

(54) ρε(t)=ρε(k)for t[kε,(k+1)ε).\rho_{\varepsilon}(t)=\rho_{\varepsilon}(k)\quad\text{for }t\in[k\varepsilon,(k+1)\varepsilon).
Definition 1 (First order approximation).

The sequence (ρε(k),k)\left(\rho_{\varepsilon}(k),\;k\in\mathbb{N}\right), or, equivalently, its continuous time interpolation (ρε(t),t0)(\rho_{\varepsilon}(t),t\geq 0) is called a first-order approximation of a Wasserstein gradient flow (ρ(t),t0)(\rho(t),t\geq 0) if, for any T>0T>0,

limε0supt[0,T]𝕎2(ρε(t),ρ(t))=0.\lim_{\varepsilon\to 0}\sup_{t\in[0,T]}\mathbb{W}_{2}\left(\rho_{\varepsilon}(t),\rho(t)\right)=0.
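As a toy illustration of Definition 1 (our own example, not from the paper): for the gradient flow of 12Ent\tfrac{1}{2}\mathrm{Ent} started at N(0,1)N(0,1), one has ρ(t)=N(0,1+t)\rho(t)=N(0,1+t), the explicit Euler iterates remain Gaussian, and the Wasserstein distance between centered Gaussians is the difference of their standard deviations, so the sup error in Definition 1 can be computed exactly.

```python
import numpy as np

# Explicit Euler for the gradient flow of (1/2)Ent from N(0,1): the velocity
# field is v = -(1/2) grad log rho, i.e. x -> x + eps*x/(2 s_k^2) for N(0, s_k^2),
# which updates the standard deviation by s_{k+1} = s_k + eps/(2 s_k).
def sup_w2_error(eps, T=1.0):
    s, err = 1.0, 0.0
    for k in range(1, int(T / eps) + 1):
        s = s + eps / (2.0 * s)                           # one explicit Euler step
        err = max(err, abs(s - np.sqrt(1.0 + k * eps)))   # W2 error at t = k*eps
    return err

errs = [sup_w2_error(eps) for eps in (0.1, 0.05, 0.025)]
print(errs)   # decreasing to 0, as required of a first-order approximation
```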

We now wish to develop sufficient conditions under which Sε1()S_{\varepsilon}^{1}\left(\cdot\right) gives a first order approximation. We say that Sε1()S_{\varepsilon}^{1}\left(\cdot\right) is consistent if for any T>0T>0,

(55) lim supε0supt[0,Tε]1ε𝕎2(Sε1(ρ(t)),ρ(t+ε))=0,\displaystyle\limsup_{\varepsilon\to 0}\sup_{t\in[0,T-\varepsilon]}\frac{1}{\varepsilon}\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(\rho(t)\right),\rho({t+\varepsilon})\right)=0\,,

and Sε1()S_{\varepsilon}^{1}\left(\cdot\right) is a contraction in the limit if

(56) lim supε0supk[Nε]1ε(𝕎2(Sε1(Sεk(ρ(0))),Sε1(ρ(kε)))𝕎2(Sεk(ρ(0)),ρ(kε)))0.\limsup\limits_{\varepsilon\to 0}\sup\limits_{k\in[N_{\varepsilon}]}\frac{1}{\varepsilon}\left(\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(S_{\varepsilon}^{k}\left(\rho(0)\right)\right),S_{\varepsilon}^{1}\left(\rho(k\varepsilon)\right)\right)-\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho(k\varepsilon)\right)\right)\leq 0\,.
Remark 2.

Observe that if Sε1()S_{\varepsilon}^{1}\left(\cdot\right) is a contraction, i.e. 𝕎2(Sε1(ρ),Sε1(σ))𝕎2(ρ,σ)\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(\rho\right),S_{\varepsilon}^{1}\left(\sigma\right))\leq\mathbb{W}_{2}(\rho,\sigma), then it is a contraction in the limit as well. Thus, contraction in the limit is the more general of the two notions.

Theorem 3.

Suppose the map Sε1()S_{\varepsilon}^{1}\left(\cdot\right) satisfies the two conditions of consistency and contraction in the limit. Then ρε()\rho_{\varepsilon}(\cdot) is a first-order approximation of the gradient flow (ρ(t),t[0,T])(\rho(t),t\in[0,T]).

Proof.

For any t[kε,(k+1)ε)t\in[k\varepsilon,(k+1)\varepsilon), by the triangle inequality and absolute continuity of ρ()\rho(\cdot)

𝕎2(ρε(t),ρ(t))𝕎2(Sεk(ρ(0)),ρ(kε))+kεtv(s)L2(ρ(s))𝑑s.\mathbb{W}_{2}\left(\rho_{\varepsilon}(t),\rho(t)\right)\leq\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho({k\varepsilon})\right)+\int_{k\varepsilon}^{t}\mathinner{\!\left\lVert v(s)\right\rVert}_{L^{2}(\rho(s))}ds.

The second term above converges to 0 as ε0\varepsilon\to 0 because v(t)L2(ρ(t))Lloc1([0,+))\|v(t)\|_{L^{2}(\rho(t))}\in L^{1}_{\text{loc}}([0,+\infty)). For the first term, consider the decomposition

(57) 𝕎2(Sεk(ρ(0)),ρ(kε))𝕎2(Sεk(ρ(0)),Sε1(ρ((k1)ε)))+𝕎2(Sε1(ρ((k1)ε)),ρ(kε)).\displaystyle\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho\left(k\varepsilon\right)\right)\leq\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),S_{\varepsilon}^{1}\left(\rho\left((k-1)\varepsilon\right)\right)\right)+\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(\rho\left((k-1)\varepsilon\right)\right),\rho\left(k\varepsilon\right)\right)\,.

Let δ>0\delta>0, then by the assumptions of consistency and contraction in the limit, for all ε>0\varepsilon>0 small enough

𝕎2(Sε1(ρ((k1)ε)),ρ(kε))<δε,\displaystyle\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(\rho\left((k-1)\varepsilon\right)\right),\rho\left(k\varepsilon\right)\right)<\delta\varepsilon,

and

𝕎2(Sεk(ρ(0)),Sε1(ρ((k1)ε)))𝕎2(Sεk1(ρ(0)),ρ((k1)ε))<δε,\displaystyle\mathbb{W}_{2}\left({S_{\varepsilon}^{k}\left(\rho(0)\right)},S_{\varepsilon}^{1}\left(\rho((k-1)\varepsilon)\right)\right)-\mathbb{W}_{2}\left(S_{\varepsilon}^{k-1}\left(\rho(0)\right),\rho((k-1)\varepsilon)\right)<\delta\varepsilon,

respectively. Thus, (57) gives the following recursion for ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) and k[Nε]k\in[N_{\varepsilon}] with k1k\geq 1

𝕎2(Sεk(ρ(0)),ρ(kε))𝕎2(Sεk1(ρ(0)),ρ((k1)ε))+2δε.\displaystyle\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho\left(k\varepsilon\right)\right)\leq\mathbb{W}_{2}\left(S_{\varepsilon}^{k-1}\left(\rho(0)\right),\rho\left((k-1)\varepsilon\right)\right)+2\delta\varepsilon.

Hence, recursively,

\limsup\limits_{\varepsilon\to 0}\sup\limits_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho\left(k\varepsilon\right)\right)\leq\mathbb{W}_{2}\left(\rho(0),\rho(0)\right)+\frac{T}{\varepsilon}\cdot(2\delta\varepsilon)=2T\delta.

As δ>0\delta>0 was arbitrary, this completes the proof. ∎

Explicit Euler Scheme

The explicit Euler scheme in the Wasserstein space can be obtained as an iterative geodesic approximation of the Wasserstein gradient flow minimizing \mathcal{F} for the choice of

(EE) Sε1(ρ)=(Idε𝕎(ρ))#ρ.S_{\varepsilon}^{1}\left(\rho\right)=\left(\operatorname{\mathrm{Id}}-\varepsilon\nabla_{\mathbb{W}}\mathcal{F}\left(\rho\right)\right)_{\#}\rho.

The explicit Euler scheme need not always give a first-order approximation sequence to a Wasserstein gradient flow. However, the assumptions in the statements of Theorems 4 and 5 provide sufficient conditions under which (EE) satisfies consistency and contraction in the limit. Consequently, Theorem 3 gives the uniform convergence of the scheme.
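As an illustration, the explicit Euler step can be applied to a particle system whenever the Wasserstein gradient is available in closed form. The following minimal Python sketch (our own illustration, not from the paper) runs (EE) for the heat flow started from a one-dimensional Gaussian, where \nabla_{\mathbb{W}}\left(\frac{1}{2}\mathrm{Ent}\right)(\rho)=\frac{1}{2}\nabla\log\rho is known exactly and every iterate remains Gaussian.

```python
import math, random

def ee_step(particles, wgrad, eps):
    """One explicit Euler step (EE): x -> x - eps * wgrad(x), i.e. the
    pushforward of the particle cloud under Id - eps * grad_W F(rho)."""
    return [x - eps * wgrad(x) for x in particles]

# Heat flow: F = (1/2) Ent, so grad_W F(rho) = (1/2) grad log rho.
# Started from N(0, s2), each iterate stays Gaussian and its variance
# obeys s2 -> (1 + eps / (2 * s2))**2 * s2, tracked alongside the particles.
random.seed(0)
s2, eps = 1.0, 0.05
pts = [random.gauss(0.0, math.sqrt(s2)) for _ in range(20000)]
for _ in range(10):
    cur = s2  # current variance, used in the closed-form score
    pts = ee_step(pts, lambda x: -0.5 * x / cur, eps)
    s2 = (1 + eps / (2 * s2))**2 * s2
emp = sum(x * x for x in pts) / len(pts)
print(emp, s2)  # empirical variance tracks the closed-form recursion
```

In general the score \nabla\log\rho is not available and must be estimated; avoiding it is precisely the motivation for the Schrödinger bridge scheme.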

Implicit Euler Scheme

The implicit Euler scheme or, equivalently the JKO scheme [JKO98], in the Wasserstein space can be defined using the following map

(IE) Sε1(ρ)=argminν𝒫2(d)[(ν)+12ε𝕎22(ν,ρ)].S_{\varepsilon}^{1}\left(\rho\right)=\operatorname*{argmin}_{\nu\in\mathcal{P}_{2}(\mathbb{R}^{d})}\left[\mathcal{F}(\nu)+\frac{1}{2\varepsilon}\mathbb{W}_{2}^{2}\left(\nu,\rho\right)\right].

We quickly verify that when ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) and ε>0\varepsilon>0, there exists uε:du_{\varepsilon}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R} such that Sε1(ρ)=(Id+εuε)#ρS_{\varepsilon}^{1}\left(\rho\right)=(\operatorname{\mathrm{Id}}+\varepsilon\nabla u_{\varepsilon})_{\#}\rho. As ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}), by Brenier’s Theorem there exists a unique gradient of a convex function ψε\nabla\psi_{\varepsilon} that pushes ρ\rho forward to Sε1(ρ)S_{\varepsilon}^{1}\left(\rho\right). Setting uε(x):=1ε(ψε(x)12x2)u_{\varepsilon}(x)\mathrel{\mathop{\mathchar 58\relax}}=\frac{1}{\varepsilon}\left(\psi_{\varepsilon}(x)-\frac{1}{2}\|x\|^{2}\right), it holds that Sε1(ρ)=(Id+εuε)#ρS_{\varepsilon}^{1}\left(\rho\right)=(\operatorname{\mathrm{Id}}+\varepsilon\nabla u_{\varepsilon})_{\#}\rho, as desired. The implicit Euler scheme starting from ρ(0)D()\rho(0)\in D(\partial\mathcal{F}) is given by ρε(k)=Sεk(ρ(0))\rho_{\varepsilon}(k)=S_{\varepsilon}^{k}\left(\rho(0)\right). When \mathcal{F} is geodesically convex, the conclusion of Theorem 3 is known to hold by [AGS08, Theorem 4.0.7, Theorem 4.0.9, Theorem 4.0.10], so there is no need to verify the conditions of Definition 1 directly.
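For intuition, the JKO step (IE) can be evaluated in closed form in a simple Gaussian case. The sketch below is our own illustration; the restriction of the minimization to Gaussians is an assumption of the sketch, justified here by a first-order condition. Taking \mathcal{F}=\frac{1}{2}\mathrm{Ent} and ρ=N(0,s^{2}I), a candidate minimizer ν=N(0,τ^{2}I) satisfies τ=s+ε/(2τ), and the resulting step agrees with the heat-flow marginal \sqrt{s^{2}+\varepsilon} to first order in ε.

```python
import math

def jko_std(s, eps):
    """One implicit Euler / JKO step (IE) for F = (1/2) Ent started from
    N(0, s^2 I), assuming the minimizer is Gaussian N(0, tau^2 I): the
    first-order condition tau = s + eps / (2 * tau) is a quadratic in tau."""
    return (s + math.sqrt(s**2 + 2 * eps)) / 2

s = 1.0
for eps in [0.1, 0.01, 0.001]:
    exact = math.sqrt(s**2 + eps)  # heat-flow marginal std at time eps
    print(eps, abs(jko_std(s, eps) - exact))  # gap shrinks like eps^2
```

This matches the general picture: the implicit and explicit Euler steps differ from the exact flow, and from each other, only at second order in ε.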

In the next section we introduce our iterated Schrödinger bridge scheme that approximates the explicit scheme, under suitable assumptions.

4.3. Uniform Convergence

We now prove two theorems giving conditions under which iterations of (SB) provide a first order approximation to the gradient flows of 12Ent\frac{1}{2}\mathrm{Ent} and 12H(|ν)\frac{1}{2}H(\cdot|\nu). The proofs of both theorems rely on a general result, Lemma 6, that gives sufficient conditions for (SB) iterates to approximate Euler iterates. In principle, Lemma 6 can be applied with Sε1()S_{\varepsilon}^{1}\left(\cdot\right) given by implicit or explicit Euler iterations, but Theorems 4 and 5 both use only explicit Euler iterations.

To simplify the bookkeeping of the constants introduced in Assumptions 2, we define two classes of probability densities. For C,D,M>0C,D,M>0, N1N\geq 1, and λ\lambda\in\mathbb{R} set

(58) \Pi_{1}(C,D,N)\mathrel{\mathop{\mathchar 58\relax}}=\left\{\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})\mathrel{\mathop{\mathchar 58\relax}}\|\rho\|_{\psi_{1}}\leq D,\ \nabla\mathcal{U}_{\rho}\text{ exists, }\|\nabla\mathcal{U}_{\rho}(x)\|^{2}\leq C(1+\|x\|^{N})\right\},

and

(59) Π2(λ,M)\displaystyle\Pi_{2}(\lambda,M) :={ρ𝒫2ac(d)satisfying Assumption 12logρλId,Eρ2logρHS2M}.\displaystyle\mathrel{\mathop{\mathchar 58\relax}}=\{\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})\;\text{satisfying Assumption \ref{assumption:LD_SB_relative_entropy}, }-\nabla^{2}\log\rho\geq\lambda\operatorname{\mathrm{Id}},\mathrm{E}_{\rho}\|\nabla^{2}\log\rho\|_{HS}^{2}\leq M\}.

Lastly, recall the RHS of Theorems 1 and 2 and that (ρsε,s[0,1])(\rho_{s}^{\varepsilon},s\in[0,1]) is the entropic interpolation from ρ\rho to itself. We define

(60) Δε(ρ):=I(ρ)01I(ρsε)ds.\displaystyle\Delta_{\varepsilon}(\rho)\mathrel{\mathop{\mathchar 58\relax}}=I(\rho)-\int_{0}^{1}I(\rho_{s}^{\varepsilon})ds.

We now state the following theorem for the (probabilistic) heat flow:

(61) tρ(t)=12Δxρ(t)=12((logρ(t))ρ(t)),\displaystyle\partial_{t}\rho(t)=\frac{1}{2}\Delta_{x}\rho(t)=\frac{1}{2}\nabla\cdot\left((\nabla\log\rho(t))\rho(t)\right),

starting at ρ(0)\rho(0). Let Sε1()S_{\varepsilon}^{1}\left(\cdot\right) denote the explicit Euler iteration map for the heat flow. That is, for ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})

(62) Sε1(ρ):=(Idε2logρ)#ρ.\displaystyle S_{\varepsilon}^{1}\left(\rho\right)\mathrel{\mathop{\mathchar 58\relax}}=\left(\operatorname{\mathrm{Id}}-\frac{\varepsilon}{2}\nabla\log\rho\right)_{\#}\rho.
Theorem 4.

Assume that ρ(0)\rho(0) is such that

  1. (1)

    lim supε0supk[Nε]Δε(SBεk(ρ(0)))=0\limsup\limits_{\varepsilon\downarrow 0}\sup\limits_{k\in[N_{\varepsilon}]}\Delta_{\varepsilon}\left(\text{SB}_{\varepsilon}^{k}(\rho(0))\right)=0

  2. (2)

    There exists ε0>0\varepsilon_{0}>0 and constants C,D>0C,D>0, and N1N\geq 1 such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}), (Sεk(ρ(0)),k{0}[Nε])(SBεk(ρ(0)),k[Nε])Π1(C,D,N)(S_{\varepsilon}^{k}\left(\rho(0)\right),k\in\{0\}\cup[N_{\varepsilon}])\cup(\text{SB}_{\varepsilon}^{k}(\rho(0)),k\in[N_{\varepsilon}])\subset\Pi_{1}(C,D,N).

  3. (3)

    There exists λ\lambda\in\mathbb{R} and M>0M>0 such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) and k[Nε]k\in[N_{\varepsilon}], (SBεk(ρ(0)),k[Nε])Π2(λ,M)(\text{SB}_{\varepsilon}^{k}(\rho(0)),k\in[N_{\varepsilon}])\subset\Pi_{2}(\lambda,M).

Then (SBεk(ρ0),k[Nε])(\text{SB}_{\varepsilon}^{k}(\rho_{0}),k\in[N_{\varepsilon}]) is a first order approximation of the (probabilistic) heat flow.

Remark 3.

Theorem 4 lays down sufficient conditions under which [SABP22, Theorem 1] holds. Part of the assumptions are on the Schrödinger bridge iterates, which are assumed to be nice enough that our one-step approximations hold uniformly. In Section 5, we show that these assumptions are indeed satisfied when ρ(0)\rho(0) is Gaussian. We are currently unable to verify that these natural conditions hold under sufficiently stringent assumptions (say, strong log-concavity and enough smoothness) on the initial measure ρ(0)\rho(0).

We now state an analogous theorem for the Wasserstein gradient flow of relative entropy. Fix a reference measure ν=eh𝒫2(d)\nu=e^{-h}\in\mathcal{P}_{2}(\mathbb{R}^{d}) satisfying assumptions delineated in Theorem 5 below.

Let Sε1()S_{\varepsilon}^{1}\left(\cdot\right) denote the explicit Euler iteration for the Fokker-Planck equation

(63) tρ(t)=12((logρ(t)+h)ρ(t)).\displaystyle\partial_{t}\rho(t)=\frac{1}{2}\nabla\cdot\left((\nabla\log\rho(t)+\nabla h)\rho(t)\right).

That is, for ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})

(64) Sε1(ρ):=(Idε2logρε2h)#ρ.\displaystyle S_{\varepsilon}^{1}\left(\rho\right)\mathrel{\mathop{\mathchar 58\relax}}=\left(\operatorname{\mathrm{Id}}-\frac{\varepsilon}{2}\nabla\log\rho-\frac{\varepsilon}{2}\nabla h\right)_{\#}\rho.
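To see (64) in action, take h(x)=\|x\|^{2}/2, so that ν=N(0,I) and (63) describes an Ornstein-Uhlenbeck-type relaxation toward ν. For a Gaussian ρ=N(0,s^{2}I) the map in (64) is linear, and the variance of the exact flow solves \dot{v}=1-v, i.e. v(t)=1+(v(0)-1)e^{-t}. The Python sketch below (our own illustration, with this specific choice of hh assumed) checks that the Euler variance recursion tracks the closed form.

```python
import math

def euler_fp_var(s2, eps):
    """Explicit Euler step (64) with h(x) = |x|^2 / 2 acting on N(0, s2 I):
    the map is x -> x * (1 + eps/(2*s2) - eps/2), hence the variance update."""
    return (1 + eps / (2 * s2) - eps / 2)**2 * s2

s2, eps, T = 4.0, 0.001, 1.0
for _ in range(int(T / eps)):
    s2 = euler_fp_var(s2, eps)
exact = 1 + (4.0 - 1) * math.exp(-T)  # exact flow variance: v(t) = 1 + (v(0)-1)e^{-t}
print(s2, exact)  # the Euler iterates track the exact variance for small eps
```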

For ε>0\varepsilon>0 and k[Nε]k\in[N_{\varepsilon}], let σεk𝒫2ac(d)\sigma_{\varepsilon}^{k}\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) denote the surrogate measure obtained in the application of SBε1()\text{SB}_{\varepsilon}^{1}(\cdot) to SBεk1(ρ(0))\text{SB}_{\varepsilon}^{k-1}(\rho(0)), and let θεk{0}\theta_{\varepsilon}^{k}\in\mathbb{R}\setminus\{0\} denote the corresponding integrability constant from (46). That is, σεk\sigma_{\varepsilon}^{k} is the probability density proportional to (ρ1ν)θεk(\rho^{-1}\nu)^{\theta_{\varepsilon}^{k}}, for ρ=SBεk1(ρ(0))\rho=\text{SB}_{\varepsilon}^{k-1}(\rho(0)).

Theorem 5.

Let hh be convex, and assume that ρ(0)\rho(0) is such that

  1. (1)

    lim supε0supk[Nε]Δε(σεk)=0.\limsup\limits_{\varepsilon\downarrow 0}\sup\limits_{k\in[N_{\varepsilon}]}\Delta_{\varepsilon}(\sigma_{\varepsilon}^{k})=0.

  2. (2)

There exists C,D,M>0C,D,M>0, N1N\geq 1, and λ\lambda\in\mathbb{R} such that for all ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}), (σεk,k[Nε])Π1(C,D,N)Π2(λ,M)(\sigma_{\varepsilon}^{k},k\in[N_{\varepsilon}])\subset\Pi_{1}(C,D,N)\cap\Pi_{2}(\lambda,M), and SBεk(ρ(0))/σεk,(θεk)2M\|\text{SB}_{\varepsilon}^{k}(\rho(0))/\sigma_{\varepsilon}^{k}\|_{\infty},(\theta_{\varepsilon}^{k})^{-2}\leq M for all k[Nε]k\in[N_{\varepsilon}].

  3. (3)

    (Regularity of Fokker-Planck) The collection =(Sεk(ρ(0)),k{0}[Nε])(SBεk(ρ(0)),k[Nε])\mathscr{F}=(S_{\varepsilon}^{k}\left(\rho(0)\right),k\in\{0\}\cup[N_{\varepsilon}])\cup(\text{SB}_{\varepsilon}^{k}(\rho(0)),k\in[N_{\varepsilon}]) is such that

    limε0supρ1ε𝕎2(Sε1(ρ),ρ(ε))=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0}\sup\limits_{\rho\in\mathscr{F}}\frac{1}{\varepsilon}\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(\rho\right),\rho(\varepsilon))=0.

Then (SBεk(ρ0),k[Nε])(\text{SB}_{\varepsilon}^{k}(\rho_{0}),k\in[N_{\varepsilon}]) is a first order approximation of the gradient flow of (one-half) KL divergence, i.e. the solution to (63).

Remark 4.

The assumed regularity of the Fokker-Planck equation just requires that the convergence given in [AGS08, Proposition 8.4.6] holds uniformly along the measures in \mathscr{F}. That is, if we start the Fokker-Planck equation from any of the measures ρ\rho in \mathscr{F}, the explicit Euler iteration is a good enough approximation for small times, uniformly in ρ\rho. In Lemma 7 below we show that this statement is true for the heat flow since its corresponding transition kernel is a Gaussian convolution. It is also satisfied by the flow corresponding to the standard Ornstein-Uhlenbeck (OU) process, for which the transition kernel is again a Gaussian convolution. In fact, Lemma 7 continues to hold in this case with minor modifications to the proof.

Observe the difference in the assumptions required for Theorems 4 and 5. The new assumptions in Theorem 5 concern the surrogate measures σεk\sigma_{\varepsilon}^{k} corresponding to the Schrödinger iterates SBεk(ρ(0))\text{SB}_{\varepsilon}^{k}(\rho(0)). The reason for this difference is that for the heat flow σεk=SBεk(ρ(0))\sigma_{\varepsilon}^{k}=\text{SB}_{\varepsilon}^{k}(\rho(0)) and θεk=1\theta_{\varepsilon}^{k}=-1, for every k[Nε]k\in[N_{\varepsilon}] and any ε>0\varepsilon>0. Hence the assumptions simplify.

Before proving Theorems 4 and 5, we present general conditions under which the iterates of (SB), called the SB scheme, converge to the gradient flow of a functional \mathcal{F}. In the general case, it is difficult to show that a continuous-time interpolation of the SB scheme is a first order approximation of the gradient flow (Definition 1). The following general conditions ensure uniform convergence to the gradient flow via Euler iterations. We then restrict ourselves to the specific cases when \mathcal{F} is entropy and relative entropy.

Assumption 3.

Let :𝒫2(d)(,+]\mathcal{F}\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}_{2}(\mathbb{R}^{d})\to(-\infty,+\infty] be a geodesically convex functional and fix T>0T>0. With initial measure ρ(0)𝒫2ac(d)\rho(0)\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}), let (ρ(t),t0)(\rho(t),t\geq 0) denote the gradient flow of \mathcal{F}. Let (ρε(kε)=Sεk(ρ0),k[Nε])\left(\rho_{\varepsilon}(k\varepsilon)=S_{\varepsilon}^{k}\left(\rho_{0}\right),k\in\left[N_{\varepsilon}\right]\right) be a sequence of Euler iterations that gives rise to a first order approximation of the gradient flow of \mathcal{F} satisfying Definition 1. Moreover, assume that:

  • (1)

    there is consistency of Sε1()S_{\varepsilon}^{1}\left(\cdot\right) along the (SBεk(ρ0),k[Nε])\left(\text{SB}_{\varepsilon}^{k}(\rho_{0}),k\in[N_{\varepsilon}]\right), meaning

    (65) lim supε0supk[Nε]1ε𝕎2(SBεk(ρ0),Sε1(SBεk1(ρ0)))=0.\displaystyle\limsup\limits_{\varepsilon\downarrow 0}\sup\limits_{k\in[N_{\varepsilon}]}\frac{1}{\varepsilon}\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{k}(\rho_{0}),S_{\varepsilon}^{1}\left(\text{SB}_{\varepsilon}^{k-1}(\rho_{0})\right)\right)=0.
  • (2)

    There is contraction in the limit, meaning

    (66) lim supε0supk[Nε]1ε(𝕎2(Sεk(ρ0),Sε1(SBεk1(ρ0)))𝕎2(Sεk1(ρ0),SBεk1(ρ0)))\displaystyle\limsup\limits_{\varepsilon\to 0}\sup\limits_{k\in[N_{\varepsilon}]}\frac{1}{\varepsilon}\left(\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho_{0}\right),S_{\varepsilon}^{1}\left(\text{SB}_{\varepsilon}^{k-1}(\rho_{0})\right)\right)-\mathbb{W}_{2}\left(S_{\varepsilon}^{k-1}\left(\rho_{0}\right),\text{SB}_{\varepsilon}^{k-1}(\rho_{0})\right)\right) 0.\displaystyle\leq 0.

These assumptions are reminiscent of (55) and (56). Hence, the proof of the following lemma is very similar to that of Theorem 3.

Lemma 6.

Under Assumption 3,

limε0supk[Nε]𝕎2(ρ(kε),SBεk(ρ0))=0.\displaystyle\lim\limits_{\varepsilon\downarrow 0}\sup\limits_{k\in\left[N_{\varepsilon}\right]}\mathbb{W}_{2}(\rho(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho_{0}))=0.
Proof.

To start, observe that

(67) supk[Nε]𝕎2(SBεk(ρ0),ρ(kε))\displaystyle\sup_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}(\text{SB}_{\varepsilon}^{k}(\rho_{0}),\rho(k\varepsilon)) supk[Nε]𝕎2(SBεk(ρ0),Sεk(ρ0))+supk[Nε]𝕎2(Sεk(ρ0),ρ(kε)).\displaystyle\leq\sup_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}(\text{SB}_{\varepsilon}^{k}(\rho_{0}),S_{\varepsilon}^{k}\left(\rho_{0}\right){})+\sup_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}(S_{\varepsilon}^{k}\left(\rho_{0}\right){},\rho(k\varepsilon)).

By Theorem 3, the rightmost term vanishes as ε0\varepsilon\downarrow 0. For the remaining term, for k[Nε]k\in[N_{\varepsilon}] split it as

𝕎2(SBεk(ρ0),Sεk(ρ0))\displaystyle\mathbb{W}_{2}(\text{SB}_{\varepsilon}^{k}(\rho_{0}),S_{\varepsilon}^{k}\left(\rho_{0}\right)) 𝕎2(SBεk(ρ0),Sε1(SBεk1(ρ0)))+𝕎2(Sε1(SBεk1(ρ0)),Sεk(ρ0)).\displaystyle\leq\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{k}(\rho_{0}),S_{\varepsilon}^{1}\left(\text{SB}_{\varepsilon}^{k-1}(\rho_{0})\right)\right)+\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(\text{SB}_{\varepsilon}^{k-1}(\rho_{0})\right),S_{\varepsilon}^{k}\left(\rho_{0}\right){}\right).

By Assumption 3(2), one obtains the exact same recursion as in Theorem 3. Hence, the same argument gives that limε0supk[Nε]𝕎2(SBεk(ρ0),Sεk(ρ0))=0\lim\limits_{\varepsilon\downarrow 0}\sup\limits_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}(\text{SB}_{\varepsilon}^{k}(\rho_{0}),S_{\varepsilon}^{k}\left(\rho_{0}\right))=0, completing the proof of the lemma. ∎

We now begin the proof of Theorem 4, which follows as an application of Lemma 6 once we verify that Assumption 3 holds in this setting. The verification is a lengthy argument that relies on a tight small-time approximation between the explicit Euler step given in (62) and the heat flow, which we present in Lemma 7 below.

Lemma 7.

Let ρ=egΠ1(C,D,N)\rho=e^{-g}\in\Pi_{1}(C,D,N). Let Sε1()S_{\varepsilon}^{1}\left(\cdot\right) be as defined in (62), and (ρ(t),t0)(\rho(t),t\geq 0) the solution to (61) with ρ(0)=ρ\rho(0)=\rho. Fix ε0>0\varepsilon_{0}>0; then there is a constant K>0K>0 depending on CC, DD, and NN such that for any ε(0,ε0)\varepsilon\in(0,\varepsilon_{0})

𝕎2(Sε1(ρ),ρ(ε))Kε2.\displaystyle\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(\rho\right),\rho(\varepsilon))\leq K\varepsilon^{2}.
Proof.

We consider two evolving particle systems. Let (ρ(t,),t0)(\rho(t,\cdot),t\geq 0) denote the (probabilistic) heat flow starting from ρ\rho, and set g(t,x)=logρ(t,x)g(t,x)=-\log\rho(t,x). Let X,Y:[0,)×ddX,Y\mathrel{\mathop{\mathchar 58\relax}}[0,\infty)\times\mathbb{R}^{d}\to\mathbb{R}^{d} be two particle systems with the same initial configuration X0ρX_{0}\sim\rho solving the ODEs (letting X˙t\dot{X}_{t} denote the time derivative)

X˙t=12g(t,Xt) and Y˙t=12g(0,X0).\displaystyle\dot{X}_{t}=\frac{1}{2}\nabla g(t,X_{t})\text{ and }\dot{Y}_{t}=\frac{1}{2}\nabla g(0,X_{0}).

Observe that Xερ(ε)X_{\varepsilon}\sim\rho(\varepsilon) and YεSε1(ρ)Y_{\varepsilon}\sim S_{\varepsilon}^{1}\left(\rho\right). Now, define G:[0,+)×ddG\mathrel{\mathop{\mathchar 58\relax}}[0,+\infty)\times\mathbb{R}^{d}\to\mathbb{R}^{d} by G(t,x)=X(t,x)Y(t,x)G(t,x)=X(t,x)-Y(t,x). Observe that G(0,)=G˙(0,)=0G(0,\cdot)=\dot{G}(0,\cdot)=0, giving that

(68) G(t)\displaystyle G(t) =0t0sG¨(u)𝑑u𝑑s.\displaystyle=\int_{0}^{t}\int_{0}^{s}\ddot{G}(u)duds.

By the chain rule and the fact that (ρ(t,),t0)(\rho(t,\cdot),t\geq 0) solves the heat equation, observe that

\ddot{G}(t,X_{t})=\frac{1}{2}\frac{d}{dt}\nabla_{x}g(t,X_{t})=\frac{1}{2}\left(\nabla_{x}^{2}g(t,X_{t})\dot{X}_{t}+\partial_{t}\nabla_{x}g(t,X_{t})\right)
=\frac{1}{2}\left(\nabla_{x}^{2}g(t,X_{t})\dot{X}_{t}-\frac{1}{2}\nabla_{x}\left[\frac{\Delta_{x}\rho(t,X_{t})}{\rho(t,X_{t})}\right]\right),

where the last equality uses \partial_{t}g=-\partial_{t}\rho/\rho=-\frac{1}{2}\Delta_{x}\rho/\rho.

In analogy with Section 2, define

𝒰(t,u)\displaystyle\mathcal{U}(t,u) =18xg(t,u)214Δxg(t,u),\displaystyle=\frac{1}{8}\|\nabla_{x}g(t,u)\|^{2}-\frac{1}{4}\Delta_{x}g(t,u),

and compute that

14x[Δρ(t,u)ρ(t,u)]=x(18xg(t,u)2+𝒰(t,u))=14x2g(t,u)g(t,u)+x𝒰(t,u).\displaystyle\frac{1}{4}\nabla_{x}\left[\frac{\Delta\rho(t,u)}{\rho(t,u)}\right]=\nabla_{x}\left(\frac{1}{8}\|\nabla_{x}g(t,u)\|^{2}+\mathcal{U}(t,u)\right)=\frac{1}{4}\nabla_{x}^{2}g(t,u)\nabla g(t,u)+\nabla_{x}\mathcal{U}(t,u).

Altogether, it now holds that

(69) G¨(t,Xt)=x𝒰(t,Xt).\displaystyle\ddot{G}(t,X_{t})=-\nabla_{x}\mathcal{U}(t,X_{t}).

For a fixed ε0>0\varepsilon_{0}>0, there exist a uniform constant C>0C^{\prime}>0 and n1n^{\prime}\geq 1 (potentially slightly modified from CC and NN) such that for all t[0,ε0]t\in[0,\varepsilon_{0}], x𝒰(t,x)2C(1+xn)\|\nabla_{x}\mathcal{U}(t,x)\|^{2}\leq C^{\prime}(1+\|x\|^{n^{\prime}}). Let X0ρX_{0}\sim\rho and ZN(0,Id)Z\sim N(0,\operatorname{\mathrm{Id}}) be independent; then for t0t\geq 0 the random variable X0+tZX_{0}+\sqrt{t}Z has law ρ(t)\rho(t). Moreover, as X0+tZψ1D+tZψ1\|X_{0}+\sqrt{t}Z\|_{\psi_{1}}\leq D+\sqrt{t}\|Z\|_{\psi_{1}}, the ρ(t)\rho(t) have bounded subexponential norm over t[0,ε0]t\in[0,\varepsilon_{0}]. Hence, there exists a constant KK^{\prime} as described in the Lemma such that for t[0,ε0]t\in[0,\varepsilon_{0}], Eρ(t)x𝒰(t,X)2K\mathrm{E}_{\rho(t)}\|\nabla_{x}\mathcal{U}(t,X)\|^{2}\leq K^{\prime}. Thus, applying Jensen’s inequality twice to (68) gives

𝕎22(Sε1(ρ),ρ(ε))\displaystyle\mathbb{W}_{2}^{2}(S_{\varepsilon}^{1}\left(\rho\right),\rho(\varepsilon)) EG(ε,X0)20ε0sεsE[G¨(u,Xu)2]𝑑u𝑑sε43K,\displaystyle\leq\mathrm{E}\|G(\varepsilon,X_{0})\|^{2}\leq\int_{0}^{\varepsilon}\int_{0}^{s}\varepsilon s\mathrm{E}[\|\ddot{G}(u,X_{u})\|^{2}]du\,ds\leq\frac{\varepsilon^{4}}{3}K^{\prime},

establishing the Lemma. ∎
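The two particle systems in the proof can be simulated directly in a case where everything is explicit. For ρ=N(0,η^{2}) in one dimension, the heat-flow density gives \nabla g(t,x)=x/(\eta^{2}+t), so XX and YY can be integrated numerically; the sketch below (our own illustration) confirms that |X_{\varepsilon}-Y_{\varepsilon}| is of order ε^{2}, in line with the 𝕎2\mathbb{W}_{2} bound of Lemma 7.

```python
import math

def particle_gap(eta, eps, n=2000):
    """Integrate Xdot = 0.5*grad g(t, X) and Ydot = 0.5*grad g(0, X0) for the
    1-d heat flow started at N(0, eta^2), where grad g(t, x) = x/(eta^2 + t).
    Returns the largest relative gap |X(eps) - Y(eps)| / |x0| over a few starts."""
    dt = eps / n
    worst = 0.0
    for x0 in [-2.0, -0.5, 0.7, 1.5]:
        x, y, t = x0, x0, 0.0
        vy = 0.5 * x0 / eta**2        # frozen initial velocity for Y
        for _ in range(n):
            x += dt * 0.5 * x / (eta**2 + t)
            y += dt * vy
            t += dt
        worst = max(worst, abs(x - y) / abs(x0))
    return worst

eta = 1.0
for eps in [0.2, 0.1, 0.05]:
    print(eps, particle_gap(eta, eps) / eps**2)  # ratio stabilizes near 1/(8 eta^4)
```

Here the closed-form solutions are X_{t}=X_{0}\sqrt{(\eta^{2}+t)/\eta^{2}} and Y_{t}=X_{0}(1+t/(2\eta^{2})), whose difference is X_{0}\cdot O(t^{2}), consistent with the numerics.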

We now prove Theorem 4.

Proof of Theorem 4.

It suffices to verify that Assumption 3 holds in this setting.

Step 1: Explicit Euler Iterations are Consistent and Contractive in the Limit. Since ρ(0)\rho(0) is assumed to be in Π1(C,D,N)\Pi_{1}(C,D,N), by Lemma 7, along the heat flow starting at ρ(0)\rho(0), there is a uniform constant K>0K>0 such that, for any ε>0\varepsilon>0 small enough and for all t[0,Tε]t\in[0,T-\varepsilon], 𝕎2(Sε1(ρ(t)),ρ(t+ε))Kε2\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(\rho(t)\right),\rho(t+\varepsilon))\leq K\varepsilon^{2}.

Next, we establish contractivity in the limit. For a measure μ𝒫(d)\mu\in\mathcal{P}(\mathbb{R}^{d}) let [μ](t)[\mu](t) denote the time marginal of the heat flow (61) at time tt started from μ\mu. By contractivity of the heat flow with respect to initial data in the Wasserstein-2 metric [AGS08, Theorem 11.1.4],

𝕎2([Sεk(ρ(0))](ε),ρ((k+1)ε))𝕎2(Sεk(ρ(0)),ρ(kε)).\displaystyle\mathbb{W}_{2}([S_{\varepsilon}^{k}\left(\rho(0)\right)](\varepsilon),\rho((k+1)\varepsilon))\leq\mathbb{W}_{2}(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho(k\varepsilon)).

It then follows from triangle inequality

𝕎2(Sε1(Sεk(ρ(0))),Sε1(ρ(kε)))\displaystyle\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(S_{\varepsilon}^{k}\left(\rho(0)\right)\right),S_{\varepsilon}^{1}\left(\rho(k\varepsilon)\right)\right) 𝕎2(Sεk(ρ(0)),ρ(kε))+𝕎2(Sε1(Sεk(ρ(0))),[Sεk(ρ(0))](ε))\displaystyle\leq\mathbb{W}_{2}(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho(k\varepsilon))+\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(S_{\varepsilon}^{k}\left(\rho(0)\right)\right),[S_{\varepsilon}^{k}\left(\rho(0)\right)](\varepsilon))
+𝕎2(Sε1(ρ(kε)),ρ((k+1)ε)).\displaystyle+\mathbb{W}_{2}(S_{\varepsilon}^{1}\left(\rho(k\varepsilon)\right),\rho((k+1)\varepsilon)).

Hence, by assumptions in the theorem statement, for ε\varepsilon small enough there is a universal constant K>0K>0 such that for all k[Nε]k\in[N_{\varepsilon}]

(𝕎2(Sε1(Sεk(ρ(0))),Sε1(ρ(kε)))𝕎2(Sεk(ρ(0)),ρ(kε)))2Kε2,\displaystyle\left(\mathbb{W}_{2}\left(S_{\varepsilon}^{1}\left(S_{\varepsilon}^{k}\left(\rho(0)\right)\right),S_{\varepsilon}^{1}\left(\rho(k\varepsilon)\right)\right)-\mathbb{W}_{2}\left(S_{\varepsilon}^{k}\left(\rho(0)\right),\rho(k\varepsilon)\right)\right)\leq 2K\varepsilon^{2},

establishing contraction in the limit. This demonstrates that the explicit Euler iterations converge to the heat flow.

Step 2: SB Iterations are Consistent and Contractive in the Limit. To start, fix ε>0\varepsilon>0 and k1k\geq 1. To compute SBεk(ρ0)\text{SB}_{\varepsilon}^{k}(\rho_{0}) from SBεk1(ρ0)\text{SB}_{\varepsilon}^{k-1}(\rho_{0}), we must obtain the surrogate measure for (SB). Pick θε=1\theta_{\varepsilon}=-1, then the surrogate measure is the measure itself, that is, σε=SBεk1(ρ0)\sigma_{\varepsilon}=\text{SB}_{\varepsilon}^{k-1}(\rho_{0}). This in turn gives SBεk1(ρ0)/σε=1\|\text{SB}_{\varepsilon}^{k-1}(\rho_{0})/\sigma_{\varepsilon}\|_{\infty}=1. For consistency, by Theorem 2 and the assumptions in the statement of Theorem 4, for small enough ε>0\varepsilon>0 there is a universal constant K>0K>0 such that for all k[Nε]k\in[N_{\varepsilon}] with k1k\geq 1

𝕎2(SBεk(ρ(0)),Sε1(SBεk1(ρ(0))))\displaystyle\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{k}(\rho(0)),S_{\varepsilon}^{1}\left(\text{SB}_{\varepsilon}^{k-1}(\rho(0))\right)\right) εK(I(SBεk1(ρ(0)))01I((SBεk1(ρ(0)))sε)𝑑s)1/4.\displaystyle\leq\varepsilon K\left(I(\text{SB}_{\varepsilon}^{k-1}(\rho(0)))-\int_{0}^{1}I((\text{SB}_{\varepsilon}^{k-1}(\rho(0)))_{s}^{\varepsilon})ds\right)^{1/4}.

By the first assumption in the Theorem statement, consistency holds. Contractivity in the limit holds by the triangle inequality (as in the previous argument) and applying Lemma 7. ∎

The proof of Theorem 5 is completely analogous to the proof of Theorem 4 and thus omitted. The only modification is that we replace the role of Lemma 7 with the assumed regularity of Fokker-Planck equation in Theorem 5(3). See also Remark 4.

5. Examples

In this section we compute explicit examples of iterated Schrödinger bridge approximations to various gradient flows and other absolutely continuous curves on the Wasserstein space. Since Schrödinger bridges are known in closed form for Gaussian marginals ([JMPC20]), our computations are limited to Gaussian models. First, we consider gradient flows of two convex functionals: 1) entropy and 2) KL divergence. Second, although these are not gradient flows and fall outside our theoretical analysis in Section 4, we also consider time reversals of these gradient flows to showcase the curiously nice approximation properties of the (SB) scheme for these absolutely continuous curves in the Wasserstein space. These latter examples have recently become important in applications. For example, see [SSDK+20] for an application to diffusion based generative models in machine learning.

5.1. Explicit Euler discretization of the gradient flow of Entropy

We show via explicit calculations that (SB) iterations are close to the explicit Euler approximation of the heat flow (ρ(t),t[0,T])(\rho(t),t\in[0,T]) starting from a Gaussian density. Here =12Ent\mathcal{F}=\frac{1}{2}\mathrm{Ent} and the continuity equation for the gradient flow of \mathcal{F} is tρ(t)+(v(t)ρ(t))=0\partial_{t}\rho(t)+\nabla\cdot\left(v(t)\rho(t)\right)=0 where v(t)=12logρ(t)v(t)=-\frac{1}{2}\nabla\log\rho(t). The explicit Euler iterations, (ρε(k),k[Nε])(\rho_{\varepsilon}(k),k\in[N_{\varepsilon}]), where Nε:=Tε1N_{\varepsilon}\mathrel{\mathop{\mathchar 58\relax}}=\lfloor T\varepsilon^{-1}\rfloor, are given by ρε(k+1)=(Id+εvε(k))#ρε(k)\rho_{\varepsilon}(k+1)=\left(\operatorname{\mathrm{Id}}+\varepsilon v_{\varepsilon}(k)\right)_{\#}\rho_{\varepsilon}(k) where vε(k)=12logρε(k)v_{\varepsilon}(k)=-\frac{1}{2}\nabla\log\rho_{\varepsilon}(k).

For ρε(0)=ρ=𝒩(0,η2Id)\rho_{\varepsilon}(0)=\rho=\mathcal{N}(0,\eta^{2}I_{d}), η2>0\eta^{2}>0, we will calculate the (EE) and (SB) iterates and show that the SB step is an 𝒪(ε2)\mathcal{O}(\varepsilon^{2}) approximation of the explicit Euler step. Recall that the form of Schrödinger bridge with equal marginals ρ\rho is explicitly given by [JMPC20, Theorem 1] as

πρ,ε=𝒩((00),η2(IdCη2εIdCη2εIdId)) where Cη2ε=1η2(η4+ε24ε2).\pi_{\rho,\varepsilon}=\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\eta^{2}\begin{pmatrix}I_{d}&C^{\varepsilon}_{\eta^{2}}I_{d}\\ C^{\varepsilon}_{\eta^{2}}I_{d}&I_{d}\end{pmatrix}\right)\quad\text{ where }\quad C^{\varepsilon}_{\eta^{2}}=\frac{1}{\eta^{2}}\left(\sqrt{\eta^{4}+\frac{\varepsilon^{2}}{4}}-\frac{\varepsilon}{2}\right)\,.

From this the barycentric projection can be computed as

(70) ρ,ε(x)=𝔼πρ,ε[Y|X=x]=Cη2εx.\displaystyle\mathcal{B}_{\rho,\varepsilon}(x)=\mathbb{E}_{\pi_{\rho,\varepsilon}}\left[Y|X=x\right]=C^{\varepsilon}_{\eta^{2}}x\,.

For the choice of ρ\rho, SBε1(ρ)=(2Idρ,ε)#ρ=𝒩(0,(2Cη2ε)2η2Id)\text{SB}_{\varepsilon}^{1}(\rho)=(2\operatorname{\mathrm{Id}}-\mathcal{B}_{\rho,\varepsilon})_{\#}\rho=\mathcal{N}\left(0,\left(2-C^{\varepsilon}_{\eta^{2}}\right)^{2}\eta^{2}I_{d}\right), and the corresponding ρε(1)=(Idε2logρ)#ρ=𝒩(0,(1+ε2η2)2η2Id)\rho_{\varepsilon}(1)=\left(\operatorname{\mathrm{Id}}-\frac{\varepsilon}{2}\nabla\log\rho\right)_{\#}\rho=\mathcal{N}\left(0,\left(1+\frac{\varepsilon}{2\eta^{2}}\right)^{2}\eta^{2}I_{d}\right).

Now let’s check the sharpness of Theorem 2. By the explicit form of the 𝕎2\mathbb{W}_{2} distance between two Gaussians

\mathbb{W}_{2}\left(\rho_{\varepsilon}(1),\text{SB}_{\varepsilon}^{1}(\rho)\right)=\sqrt{d}\,\mathinner{\!\left\lvert(2-C^{\varepsilon}_{\eta^{2}})\eta-\left(1+\frac{\varepsilon}{2\eta^{2}}\right)\eta\right\rvert}=\sqrt{d}\,\eta\mathinner{\!\left\lvert\sqrt{1+\frac{\varepsilon^{2}}{4\eta^{4}}}-1\right\rvert}\,.

Therefore, 𝕎2(ρε(1),SBε1(ρ))=𝒪(ε2)\mathbb{W}_{2}\left(\rho_{\varepsilon}(1),\text{SB}_{\varepsilon}^{1}(\rho)\right)=\mathcal{O}(\varepsilon^{2}), which is an order of magnitude sharper than Theorem 2.
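This 𝒪(ε^{2}) rate is easy to confirm numerically from the closed forms above. The short check below (our own illustration) evaluates both one-step standard deviations and shows that the ratio \mathbb{W}_{2}/\varepsilon^{2} stabilizes as ε → 0.

```python
import math

def sb_std(eta, eps):
    """Std. dev. after one SB step from N(0, eta^2 I): (2 - C) * eta."""
    C = (math.sqrt(eta**4 + eps**2 / 4) - eps / 2) / eta**2
    return (2 - C) * eta

def euler_std(eta, eps):
    """Std. dev. after one explicit Euler step of the heat flow."""
    return (1 + eps / (2 * eta**2)) * eta

eta, d = 1.3, 2
for eps in [0.1, 0.05, 0.025]:
    w2 = math.sqrt(d) * abs(sb_std(eta, eps) - euler_std(eta, eps))
    print(eps, w2 / eps**2)  # ratio tends to sqrt(d) / (8 * eta**3)
```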

Since both the (SB) and (EE) schemes operate via linear pushforwards, all steps of these schemes are mean-zero Gaussians. For the SB scheme, the surrogate measure at each step is the same as the current measure, i.e. σε(k)=SBεk(ρ)\sigma_{\varepsilon}(k)=\text{SB}_{\varepsilon}^{k}(\rho) for all k[Nε]k\in[N_{\varepsilon}]. Let SBεk(ρ0)=𝒩(0,αk2Id)\text{SB}_{\varepsilon}^{k}(\rho_{0})=\mathcal{N}\left(0,\alpha_{k}^{2}I_{d}\right) and ρε(k)=𝒩(0,βk2Id)\rho_{\varepsilon}(k)=\mathcal{N}\left(0,\beta_{k}^{2}I_{d}\right); then the iterates evolve through the following recursive relationship

αk+12=(2Cαk2ε)2αk2 and βk+12=(1+ε2βk2)2βk2.\alpha_{k+1}^{2}=\left(2-C^{\varepsilon}_{\alpha_{k}^{2}}\right)^{2}\alpha_{k}^{2}\quad\text{ and }\quad\beta_{k+1}^{2}=\left(1+\frac{\varepsilon}{2\beta_{k}^{2}}\right)^{2}\beta_{k}^{2}\,.
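One can iterate these two recursions numerically and compare both against the exact heat-flow variance η^{2}+t (the heat flow started at \mathcal{N}(0,\eta^{2}I_{d}) is \mathcal{N}(0,(\eta^{2}+t)I_{d})). A minimal check (our own illustration):

```python
import math

def iterate(eta2, eps, T):
    """Run the SB (alpha^2) and explicit Euler (beta^2) variance recursions
    started from eta2, over floor(T / eps) steps."""
    a2 = b2 = eta2
    for _ in range(int(T / eps)):
        C = (math.sqrt(a2**2 + eps**2 / 4) - eps / 2) / a2  # C^eps with eta^2 = a2
        a2 = (2 - C)**2 * a2
        b2 = (1 + eps / (2 * b2))**2 * b2
    return a2, b2

eta2, T = 1.0, 0.5
exact = eta2 + T  # heat-flow variance at time T
for eps in [0.1, 0.01, 0.001]:
    a2, b2 = iterate(eta2, eps, T)
    print(eps, abs(a2 - exact), abs(b2 - exact))  # both errors shrink with eps
```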

To show that (ρε(k),k[Nε])(\rho_{\varepsilon}(k),k\in[N_{\varepsilon}]) and (SBεk(ρ),k[Nε])(\text{SB}_{\varepsilon}^{k}(\rho),k\in[N_{\varepsilon}]) are first-order approximations (Definition 1) of the Wasserstein gradient flow (ρ(t),t[0,T])(\rho(t),t\in[0,T]), we will show that the assumptions of Theorem 4 are satisfied.

First, let us consider assumption (1) of Theorem 4. The formula for the time marginals of the dynamic Schrödinger bridge between σ=𝒩(0,η2)\sigma=\mathcal{N}(0,\eta^{2}) and itself at time t=1/2t=1/2 is available from [GLR17] and given by

(71) (σ)1/2ε=𝒩(0,(δε24(1+δε)+1)η2) where δε=12(ε2+4+ε2).(\sigma)^{\varepsilon}_{1/2}=\mathcal{N}\left(0,\left(\frac{\delta_{\varepsilon}^{2}}{4(1+\delta_{\varepsilon})}+1\right)\eta^{2}\right)\quad\text{ where }\delta_{\varepsilon}=\frac{1}{2}\left(\varepsilon-2+\sqrt{4+\varepsilon^{2}}\right)\,.

Also, it is known that for σ=𝒩(0,η2Id)\sigma=\mathcal{N}(0,\eta^{2}I_{d}), the Fisher information I(σ)=dη2I(\sigma)=d\eta^{-2}. Therefore, for σε(k)=SBεk(ρ)=𝒩(0,αk2)\sigma_{\varepsilon}(k)=\text{SB}_{\varepsilon}^{k}(\rho)=\mathcal{N}(0,\alpha_{k}^{2}), we have

(72) I(SBεk(ρ))I((SBεk(ρ))1/2ε)=dαk2(1(δε24(1+δε)+1)1)=dδε2αk2(δε2+4(1+δε)).I(\text{SB}_{\varepsilon}^{k}(\rho))-I((\text{SB}_{\varepsilon}^{k}(\rho))_{1/2}^{\varepsilon})=\frac{d}{\alpha_{k}^{2}}\left(1-\left(\frac{\delta_{\varepsilon}^{2}}{4(1+\delta_{\varepsilon})}+1\right)^{-1}\right)=\frac{d\delta_{\varepsilon}^{2}}{\alpha_{k}^{2}\left(\delta_{\varepsilon}^{2}+4(1+\delta_{\varepsilon})\right)}\,.

Since Cν2ε1C^{\varepsilon}_{\nu^{2}}\leq 1 for any ν2>0\nu^{2}>0, (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]) is monotonically increasing and bounded from below by η2\eta^{2}. Hence,

supk[Nε](I(SBεk(ρ))I((SBεk(ρ))1/2ε))I(ρ)I(ρ1/2ε)=dδε2η2(δε2+4(1+δε)).\sup_{k\in[N_{\varepsilon}]}\left(I(\text{SB}_{\varepsilon}^{k}(\rho))-I((\text{SB}_{\varepsilon}^{k}(\rho))_{1/2}^{\varepsilon})\right)\leq I(\rho)-I(\rho_{1/2}^{\varepsilon})=\frac{d\delta_{\varepsilon}^{2}}{\eta^{2}\left(\delta_{\varepsilon}^{2}+4(1+\delta_{\varepsilon})\right)}\,.

Since the RHS above converges to zero as ε0+\varepsilon\rightarrow 0+, we satisfy Assumption (1) of Theorem 4 with rate 𝒪(ε2)\mathcal{O}(\varepsilon^{2}).
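As a quick check of this rate (a hedged numerical sketch, not part of the proof), one can evaluate the right-hand side of the bound above for a few values of ε and verify that halving ε roughly quarters the bound:

```python
import math

def fisher_gap_bound(eps, eta2, d=1.0):
    # d * delta^2 / (eta2 * (delta^2 + 4*(1 + delta))), with
    # delta = (eps - 2 + sqrt(4 + eps^2)) / 2 as in the display above.
    delta = 0.5 * (eps - 2.0 + math.sqrt(4.0 + eps**2))
    return d * delta**2 / (eta2 * (delta**2 + 4.0 * (1.0 + delta)))

ratios = [fisher_gap_bound(e, 1.0) / fisher_gap_bound(e / 2.0, 1.0)
          for e in (1e-1, 1e-2, 1e-3)]
# Each ratio is close to 4, consistent with the O(eps^2) decay of the gap.
```

Since δ_ε = ε/2 + O(ε²), the bound behaves like dε²/(16η²) for small ε, which the ratios confirm.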

Now, we consider assumption (2) of Theorem 4. One can check that in the Gaussian case, if both sequences (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]) and (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) are bounded from below and above, then assumption (2) is satisfied. Trivially, (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]) and (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) are bounded from below by η2\eta^{2} because they are monotonically increasing. Take εη2\varepsilon\leq\eta^{2}. Using the inequality 1x1+x21-x\leq\sqrt{1+x^{2}} for all x0x\geq 0, we have that 2Cαk2ε1+εαk22-C^{\varepsilon}_{\alpha_{k}^{2}}\leq 1+\varepsilon\alpha_{k}^{-2} and therefore αk+12αk2+2ε+ε2/αk2αk2+3ε\alpha_{k+1}^{2}\leq\alpha_{k}^{2}+2\varepsilon+\varepsilon^{2}/\alpha_{k}^{2}\leq\alpha_{k}^{2}+3\varepsilon. Consequently, (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]) is bounded from above by η2+3T\eta^{2}+3T. Again using εη2\varepsilon\leq\eta^{2}, we get βk+12βk2+5ε/4\beta_{k+1}^{2}\leq\beta_{k}^{2}+5\varepsilon/4, and therefore (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) is bounded from above by η2+5T/4\eta^{2}+5T/4. Step 1 of the proof of Theorem 4 gives uniform convergence of (ρε(k),k[Nε])(\rho_{\varepsilon}(k),k\in[N_{\varepsilon}]), and step 2 gives uniform convergence of (SBεk(ρ),k[Nε])(\text{SB}_{\varepsilon}^{k}(\rho),k\in[N_{\varepsilon}]).

5.2. Explicit Euler discretization of the gradient flow of Kullback-Leibler

Now let us consider the Kullback-Leibler (KL) divergence functional H(|ν)H\left(\cdot|\nu\right) where ν=eg\nu=e^{-g} with ∇g(x)=x\nabla g(x)=x, so that gg is quadratic and ν\nu is the standard Gaussian. The corresponding gradient flow is the Fokker-Planck equation of the Ornstein-Uhlenbeck process.

Now let us write the (SB) step at ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}). For this functional, the selection of θ{0}\theta\in\mathbb{R}\setminus\{0\} in (SB) becomes more delicate. The choice of θ\theta for the surrogate measure σ=σε(0)\sigma=\sigma_{\varepsilon}(0) is determined by the integrability of

exp(2θ(12logρ+12g))=(ρeg)θ.\displaystyle\exp\left(-2\theta\left(\frac{1}{2}\log\rho+\frac{1}{2}g\right)\right)=(\rho e^{g})^{-\theta}.

If ρ\rho is log-concave, then since gg is convex the sign of θ\theta depends on a comparison of the log-concavity of ρ\rho and ν\nu. Suppose ρ\rho is more log-concave than ν\nu, then θ\theta may be taken to be 1-1. Consequently, the (SB) step is SBε1(ρ)=(2Idσ,ε)#ρ\text{SB}_{\varepsilon}^{1}(\rho)=(2\operatorname{\mathrm{Id}}-\mathcal{B}_{\sigma,\varepsilon})_{\#}\rho. On the other hand, if ν\nu is more log-concave than the density ρ\rho, then θ\theta may be taken to be 11. In this case, the (SB) step is SBε1(ρ)=(σ,ε)#ρ\text{SB}_{\varepsilon}^{1}(\rho)=(\mathcal{B}_{\sigma,\varepsilon})_{\#}\rho.

For an initial ρ=𝒩(0,η2Id)\rho=\mathcal{N}(0,\eta^{2}I_{d}), η2>0\eta^{2}>0, we calculate the (SB) and the (EE) step and prove that the (SB) step is an 𝒪(ε2)\mathcal{O}(\varepsilon^{2}) approximation of the (EE) step. If η2>1\eta^{2}>1, then ρ\rho is less log-concave than ege^{-g}, and therefore σ=ρ1eg=𝒩(0,η2/(η21)Id)\sigma=\rho^{-1}e^{-g}=\mathcal{N}(0,\eta^{2}/(\eta^{2}-1)I_{d}). This surrogate measure satisfies Assumption 2. By (70), we have that SBε1(ρ)=𝒩(0,(Cη2/(η21)ε)2η2Id)\text{SB}_{\varepsilon}^{1}(\rho)=\mathcal{N}\left(0,(C_{\eta^{2}/(\eta^{2}-1)}^{\varepsilon})^{2}\eta^{2}I_{d}\right). Now if η2<1\eta^{2}<1, then ρ\rho is more log-concave than ege^{-g}, and therefore σ=ρeg=𝒩(0,η2/(1η2)Id)\sigma=\rho e^{g}=\mathcal{N}(0,\eta^{2}/(1-\eta^{2})I_{d}). Note that this choice satisfies Assumption 2 (1-2, 4) but the ratio ρσ=ρ2eg\frac{\rho}{\sigma}=\rho^{2}e^{g} is unbounded. Nevertheless, we calculate the (SB) update using this surrogate measure and demonstrate one-step convergence, indicating that our assumptions in this paper are stronger than required. By (70) we have SBε1(ρ)=𝒩(0,(2Cη2/(1η2)ε)2η2Id)\text{SB}_{\varepsilon}^{1}(\rho)=\mathcal{N}(0,(2-C_{\eta^{2}/(1-\eta^{2})}^{\varepsilon})^{2}\eta^{2}I_{d}). The corresponding (EE) update is ρε(1)=(Idε2(logρ+Id))#ρ=𝒩(0,(1+ε(1η2)2η2)2η2Id)\rho_{\varepsilon}(1)=\left(\operatorname{\mathrm{Id}}-\frac{\varepsilon}{2}\left(\nabla\log\rho+\operatorname{\mathrm{Id}}\right)\right)_{\#}\rho=\mathcal{N}\left(0,\left(1+\frac{\varepsilon(1-\eta^{2})}{2\eta^{2}}\right)^{2}\eta^{2}I_{d}\right).

Again, we check the sharpness of Theorem 2 by explicitly calculating the 𝕎2\mathbb{W}_{2} distance between ρε(1)\rho_{\varepsilon}(1) and SBε1(ρ)\text{SB}_{\varepsilon}^{1}(\rho). For any η2>0\eta^{2}>0,

𝕎2(ρε(1),SBε1(ρ0))\displaystyle\mathbb{W}_{2}\left(\rho_{\varepsilon}(1),\text{SB}_{\varepsilon}^{1}(\rho_{0})\right) =ηd|2Cη2/(1η2)ε1ε(1η2)2η2|=ηd|11+ε2(1η2)24η4|.\displaystyle=\eta d\mathinner{\!\left\lvert 2-C^{\varepsilon}_{\eta^{2}/(1-\eta^{2})}-1-\frac{\varepsilon(1-\eta^{2})}{2\eta^{2}}\right\rvert}=\eta d\mathinner{\!\left\lvert 1-\sqrt{1+\frac{\varepsilon^{2}(1-\eta^{2})^{2}}{4\eta^{4}}}\right\rvert}\,.

Therefore, again 𝕎2(ρε(1),SBε1(ρ0))=𝒪(ε2)\mathbb{W}_{2}\left(\rho_{\varepsilon}(1),\text{SB}_{\varepsilon}^{1}(\rho_{0})\right)=\mathcal{O}(\varepsilon^{2}), which is an order of magnitude better than Theorem 2. As mentioned above, this is an example of a fast convergence rate even when the surrogate measure σ\sigma does not satisfy Assumption 2(3).

Iterates of the (SB) scheme can be defined beyond one step in a similar manner, provided the sign convention remains the same throughout the iterative process. The following calculation is done for η2<1\eta^{2}<1, which is not covered by our theorem. A similar calculation is valid for η2>1\eta^{2}>1 and is omitted. We claim that if ρ\rho is more log-concave than ν=𝒩(0,Id)\nu=\mathcal{N}(0,I_{d}), then SBε1(ρ)\text{SB}_{\varepsilon}^{1}(\rho) is also more log-concave than ν\nu for small enough ε>0\varepsilon>0. This is because, using a Taylor expansion around ε=0\varepsilon=0, we have

(2Cη2/(1η2)ε)2η2\displaystyle\left(2-C^{\varepsilon}_{\eta^{2}/(1-\eta^{2})}\right)^{2}\eta^{2} =(2[1η2η2(η4(1η2)2+ε24ε2)])2η2\displaystyle=\left(2-\left[\frac{1-\eta^{2}}{\eta^{2}}\left(\sqrt{\frac{\eta^{4}}{(1-\eta^{2})^{2}}+\frac{\varepsilon^{2}}{4}}-\frac{\varepsilon}{2}\right)\right]\right)^{2}\eta^{2}
=η2+(1η2)εε2(1η2)28η2+𝒪(ε4).\displaystyle=\eta^{2}+(1-\eta^{2})\varepsilon-\frac{\varepsilon^{2}(1-\eta^{2})^{2}}{8\eta^{2}}+\mathcal{O}(\varepsilon^{4}).

Therefore, as long as ε<η2(1η2)\varepsilon<\frac{\eta^{2}}{(1-\eta^{2})}, the (SB) steps approach the minimizer 𝒩(0,Id)\mathcal{N}(0,I_{d}) monotonically and the sign convention remains the same. Again, since both the (SB) and (EE) schemes evolve via linear pushforwards, all iterates are mean-zero Gaussian. Denote SBεk(ρ0)=𝒩(0,αk2Id)\text{SB}_{\varepsilon}^{k}(\rho_{0})=\mathcal{N}(0,\alpha_{k}^{2}I_{d}) and ρε(k)=𝒩(0,βk2Id)\rho_{\varepsilon}(k)=\mathcal{N}(0,\beta_{k}^{2}I_{d}), where αk2\alpha_{k}^{2} and βk2\beta_{k}^{2} evolve via the recursions

αk+12=(2Cαk2/(1αk2)ε)2αk2andβk+12=(1+ε(1βk2)2βk2)2βk2.\alpha_{k+1}^{2}=\left(2-C^{\varepsilon}_{\alpha_{k}^{2}/(1-\alpha_{k}^{2})}\right)^{2}\alpha_{k}^{2}\quad\text{and}\quad\beta_{k+1}^{2}=\left(1+\frac{\varepsilon(1-\beta_{k}^{2})}{2\beta_{k}^{2}}\right)^{2}\beta_{k}^{2}.
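For a concrete check (a numerical sketch under the Gaussian closed forms; the (EE) factor is taken as 1+ε(1−β²)/(2β²), matching the one-step (EE) map computed earlier, and the exact Ornstein-Uhlenbeck variance 1−(1−η²)e^{−t} is standard), one can iterate both recursions and compare them against the exact flow:

```python
import math

def C(eps, v2):
    # C^eps_{v2} = sqrt(1 + eps^2/(4 v2^2)) - eps/(2 v2).
    return math.sqrt(1.0 + eps**2 / (4.0 * v2**2)) - eps / (2.0 * v2)

def iterate_kl(eta2, T, eps):
    """(SB) and (EE) recursions for the flow of (1/2)*H(.|N(0,1)), eta2 < 1."""
    n_steps = round(T / eps)
    a2 = b2 = eta2
    for _ in range(n_steps):
        a2 = (2.0 - C(eps, a2 / (1.0 - a2)))**2 * a2   # (SB) step
        b2 = (1.0 + eps * (1.0 - b2) / (2.0 * b2))**2 * b2  # (EE) step
    exact = 1.0 - (1.0 - eta2) * math.exp(-T)  # OU variance at time T
    return a2, b2, exact

a2, b2, exact = iterate_kl(eta2=0.5, T=1.0, eps=1e-3)
# Both iterates increase monotonically toward 1 and track the exact variance.
```

To first order in ε, both updates gain ε(1−variance) per step, i.e. both are explicit Euler discretizations of dσ²/dt = 1−σ².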

The corresponding surrogate measures (σε(k),k[Nε])\left(\sigma_{\varepsilon}(k),k\in[N_{\varepsilon}]\right) are given by σε(k)=𝒩(0,αk2/(1αk2)Id)\sigma_{\varepsilon}(k)=\mathcal{N}\left(0,{\alpha_{k}^{2}}/{(1-\alpha_{k}^{2})}I_{d}\right).

We again show that the assumptions of Theorem 5 are satisfied to prove that (ρε(k),k[Nε])(\rho_{\varepsilon}(k),k\in[N_{\varepsilon}]) and (SBεk(ρ),k[Nε])(\text{SB}_{\varepsilon}^{k}(\rho){,k\in[N_{\varepsilon}]}) are first-order approximation schemes of the Wasserstein gradient flow (ρ(t),t[0,T])(\rho(t),t\in[0,T]). Using (71), we have that

I(σε(k))I((σε(k))1/2ε)=(1αk2αk2)dδε2δε2+4(1+δε),I(\sigma_{\varepsilon}(k))-I((\sigma_{\varepsilon}(k))^{\varepsilon}_{1/2})=\left(\frac{1-\alpha_{k}^{2}}{\alpha_{k}^{2}}\right)\frac{d\delta_{\varepsilon}^{2}}{\delta_{\varepsilon}^{2}+4(1+\delta_{\varepsilon})}\,,

where δε=12(ε2+4+ε2)\delta_{\varepsilon}=\frac{1}{2}\left(\varepsilon-2+\sqrt{4+\varepsilon^{2}}\right). The sequence (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]) is monotonically increasing because

αk+12=(21+ε2(αk21)24αk4+ε(1αk2)2αk2)2αk2,\alpha_{k+1}^{2}=\left(2-\sqrt{1+\frac{\varepsilon^{2}(\alpha_{k}^{2}-1)^{2}}{4\alpha_{k}^{4}}}+\frac{\varepsilon(1-\alpha_{k}^{2})}{2\alpha_{k}^{2}}\right)^{2}\alpha_{k}^{2}\,,

and the prefactor of αk2\alpha_{k}^{2} can be easily checked to be greater than 11. Moreover, each αk2\alpha_{k}^{2} is bounded from below by its initial value η2\eta^{2}. Therefore,

supk[Nε](I(σε(k))I((σε(k))1/2ε))I(σ(0))I((σ(0))1/2ε)=(1η2η2)dδε2δε2+4(1+δε).\sup_{k\in[N_{\varepsilon}]}\left(I(\sigma_{\varepsilon}(k))-I((\sigma_{\varepsilon}(k))^{\varepsilon}_{1/2})\right)\leq I(\sigma(0))-I((\sigma(0))^{\varepsilon}_{1/2})=\left(\frac{1-\eta^{2}}{\eta^{2}}\right)\frac{d\delta_{\varepsilon}^{2}}{\delta_{\varepsilon}^{2}+4(1+\delta_{\varepsilon})}\,.

Since δε=𝒪(ε)\delta_{\varepsilon}=\mathcal{O}(\varepsilon), the RHS converges to zero as ε0+\varepsilon\to 0+ at the order 𝒪(ε2)\mathcal{O}(\varepsilon^{2}). Therefore, Assumption (1) of Theorem 5 is satisfied with rate 𝒪(ε2)\mathcal{O}(\varepsilon^{2}).

Now, we show that assumption (2) of Theorem 5 is satisfied. Again, in the Gaussian case, it suffices to show that (αk2/(1αk2),k[Nε])\left({\alpha_{k}^{2}}/(1-\alpha_{k}^{2}),k\in[N_{\varepsilon}]\right) and (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) are bounded from below and above. Since Cν2ε<1C^{\varepsilon}_{\nu^{2}}<1 for any ν2>0\nu^{2}>0, the sequence (αk2,k[Nε])(\alpha_{k}^{2},k\in[N_{\varepsilon}]), and equivalently (αk2/(1αk2),k[Nε])(\alpha_{k}^{2}/(1-\alpha_{k}^{2}),k\in[N_{\varepsilon}]), is monotonically increasing. Then trivially, (αk2/(1αk2),k[Nε])(\alpha_{k}^{2}/(1-\alpha_{k}^{2}),k\in[N_{\varepsilon}]) is bounded from below by η2/(1η2)\eta^{2}/(1-\eta^{2}) and (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) is bounded from below by η2\eta^{2}. Take ε<min(η21η2,13)\varepsilon<\min\left(\frac{\eta^{2}}{1-\eta^{2}},\frac{1}{3}\right). Using the inequality 1x1+x21-x\leq\sqrt{1+x^{2}} for all x0x\geq 0, we have that 2Cαk2/(1αk2)ε1+ε(1αk2)/αk22-C^{\varepsilon}_{\alpha_{k}^{2}/(1-\alpha_{k}^{2})}\leq 1+\varepsilon(1-\alpha_{k}^{2})/\alpha_{k}^{2}. Therefore, αk+12(1+ε(1αk2)/αk2)2αk2\alpha_{k+1}^{2}\leq\left(1+\varepsilon(1-\alpha_{k}^{2})/\alpha_{k}^{2}\right)^{2}\alpha_{k}^{2} and βk+12=(1+ε(1βk2)/(2βk2))2βk2\beta_{k+1}^{2}=\left(1+\varepsilon(1-\beta_{k}^{2})/(2\beta_{k}^{2})\right)^{2}\beta_{k}^{2} give recursive inequalities of a similar form. Since εαk2/(1αk2)\varepsilon\leq\alpha_{k}^{2}/(1-\alpha_{k}^{2}), we have αk+12αk2+3ε(1αk2)\alpha_{k+1}^{2}\leq\alpha_{k}^{2}+3\varepsilon(1-\alpha_{k}^{2}). Recursively, αNε2(13ε)Nεη2+1(13ε)Nε\alpha_{N_{\varepsilon}}^{2}\leq(1-3\varepsilon)^{N_{\varepsilon}}\eta^{2}+{1-(1-3\varepsilon)^{N_{\varepsilon}}}, which is bounded from above.
Therefore, (αk2(1αk2),k[Nε])\left(\frac{\alpha_{k}^{2}}{(1-\alpha_{k}^{2})},k\in[N_{\varepsilon}]\right) is bounded from above by η21η2+1(13ε)Nε(13ε)Nε(1η2)\frac{\eta^{2}}{1-\eta^{2}}+\frac{1-(1-3\varepsilon)^{N_{\varepsilon}}}{(1-3\varepsilon)^{N_{\varepsilon}}(1-\eta^{2})} and, similarly, (βk2,k[Nε])(\beta_{k}^{2},k\in[N_{\varepsilon}]) is bounded from above by (13ε)Nεη2+(1(13ε)Nε)(1-3\varepsilon)^{N_{\varepsilon}}\eta^{2}+\left(1-(1-3\varepsilon)^{N_{\varepsilon}}\right); both bounds remain bounded as ε0\varepsilon\downarrow 0. Consequently, the assumptions of Theorem 5 are satisfied. Step 1 of the proof gives uniform convergence of the (EE) scheme and step 2 gives uniform convergence of the (SB) scheme.

5.3. Time reversal of gradient flows

Let us consider the time reversal of gradient flows of the functionals in the previous two examples. Specifically, if (ρ(t),t[0,T])\left(\rho(t),t\in[0,T]\right) is the gradient flow of a functional :𝒫2(d)(,]\mathcal{F}\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}_{2}(\mathbb{R}^{d})\to(-\infty,\infty] with velocity field (v(t),t[0,T])(v(t),t\in[0,T]), then let (ρ¯(t),t[0,T])(\bar{\rho}(t),t\in[0,T]) denote the time reversed flow defined as ρ¯(t):=ρ(Tt)\bar{\rho}(t)\mathrel{\mathop{\mathchar 58\relax}}=\rho(T-t). Naturally, the velocity field in the reverse process is the negative of the velocity in the forward process, i.e. if (v¯(t),t[0,T])\left(\bar{v}(t),t\in[0,T]\right) denotes the velocity field of (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right), then v¯(t)=v(Tt)\bar{v}(t)=-v(T-t).

These absolutely continuous curves are not gradient flows and therefore do not benefit from the theoretical guarantees established in our theorems. Note that the corresponding PDEs, such as the backward heat equation, are well known to be ill-posed ([Olv13, page 129, eqn. 4.29]). Interestingly, however, we observe that, for the time reversed gradient flows of the entropy and KL divergence functionals with Gaussian marginals, the (SB) scheme approximates these curves with the same convergence rates. For both examples below, these convergence rates are derived by directly showing that SBε1()\text{SB}_{\varepsilon}^{1}(\cdot) is consistent (55) and a contraction (a stronger version of contraction in the limit (56)). Following that, we use Theorem 3 to prove that the (SB) iterates give a first-order approximation of the time reversed flow (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right). Through the examples, we also highlight a key issue with approximating time reversed gradient flows: to have a valid first-order approximation (Definition 1) of (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right), the step size ε\varepsilon must be small enough, with the bound on ε\varepsilon depending on properties of the iterates encountered in the future. We include these calculations below for the reader's curiosity and for their practical importance in score-based generative modeling using diffusions [SSDK+20, LYB+24].

First, we consider the time reversed gradient flow of =12Ent\mathcal{F}=\frac{1}{2}\mathrm{Ent}. Then, v¯(t)=12logρ(Tt)=12logρ¯(t)\bar{v}(t)=\frac{1}{2}\nabla\log\rho(T-t)=\frac{1}{2}\nabla\log\bar{\rho}(t). Now we calculate the first (SB) step. For ρ=ρ(0)\rho=\rho(0), the surrogate measure is σ=σ(0)=eθlogρ¯\sigma=\sigma(0)=e^{\theta\log\bar{\rho}} where θ\theta is chosen to ensure integrability of eθlogρ¯e^{\theta\log\bar{\rho}}. Therefore, we may choose θ=1\theta=1, which implies σε(0)=σ=ρ\sigma_{\varepsilon}(0)=\sigma=\rho and the (SB) step is SBε1(ρ)=(ρ,ε)#ρ\text{SB}_{\varepsilon}^{1}(\rho)=\left(\mathcal{B}_{{\rho},\varepsilon}\right)_{\#}\rho. Notice that the sign of θ\theta is flipped compared to Section 5.1. This change ensures that the (SB) step for the reverse flow moves in the direction opposite to the (SB) step of the forward flow. This is interesting, as we can approximate both the forward and the reverse time gradient flow of the entropy functional using the appropriate direction of (SB) steps, computed using only the current measure.

Now for an initial ρ(0)=𝒩(0,η2Id)\rho(0)=\mathcal{N}(0,\eta^{2}I_{d}), we know that the gradient flow (ρ(t),t[0,T])(\rho(t),t\in[0,T]) is given by ρ(t)=𝒩(0,(η2+t)Id)\rho(t)=\mathcal{N}(0,(\eta^{2}+t)I_{d}). Then, we have ρ=ρ¯(0)=𝒩(0,(η2+T)Id)\rho=\bar{\rho}(0)=\mathcal{N}(0,(\eta^{2}+T)I_{d}) and ρ¯(ε)=𝒩(0,(η2+Tε)Id)\bar{\rho}(\varepsilon)=\mathcal{N}(0,(\eta^{2}+T-\varepsilon)I_{d}). Now we show that SBε1(ρ)\text{SB}_{\varepsilon}^{1}(\rho) is an 𝒪(ε2)\mathcal{O}(\varepsilon^{2}) approximation of ρ¯(ε)\bar{\rho}(\varepsilon). For any η2>0\eta^{2}>0 and ε<η2+T\varepsilon<\eta^{2}+T,

𝕎2(ρ¯(ε),SBε1(ρ))\displaystyle\mathbb{W}_{2}\left(\bar{\rho}(\varepsilon),\text{SB}_{\varepsilon}^{1}(\rho)\right) =d|Cη2+Tεη2+Tη2+Tε|\displaystyle=d\mathinner{\!\left\lvert C^{\varepsilon}_{\eta^{2}+T}\sqrt{\eta^{2}+T}-\sqrt{\eta^{2}+T-\varepsilon}\right\rvert}
=dη2+T|1+ε24(η2+T)2ε2(η2+T)1εη2+T|.\displaystyle=d\sqrt{\eta^{2}+T}\mathinner{\!\left\lvert\sqrt{1+\frac{\varepsilon^{2}}{4(\eta^{2}+T)^{2}}}-\frac{\varepsilon}{2(\eta^{2}+T)}-\sqrt{1-\frac{\varepsilon}{\eta^{2}+T}}\right\rvert}\,.

Denote x=ε/(η2+T)x=\varepsilon/(\eta^{2}+T); then the term containing ε\varepsilon above is cε=1+x2/4(x/2)1xc_{\varepsilon}=\sqrt{1+x^{2}/4}-(x/2)-\sqrt{1-x}. Using the identity ab=(ab)/(a+b)\sqrt{a}-\sqrt{b}=(a-b)/(\sqrt{a}+\sqrt{b}) with a=1+x2/4a=1+x^{2}/4 and b=((x/2)+1x)2b=\left((x/2)+\sqrt{1-x}\right)^{2}, one can check that cε>0c_{\varepsilon}>0. Further, using the inequalities 1+a<1+a/2\sqrt{1+a}<1+a/2 for a>0a>0 and 1b>1b/2\sqrt{1-b}>1-b/2 for 0<b<10<b<1, we have that cε<x2/4=ε24(η2+T)2c_{\varepsilon}<x^{2}/4=\frac{\varepsilon^{2}}{4(\eta^{2}+T)^{2}}. Therefore, we have the consistency (55) result

(73) 𝕎2(ρ¯(ε),SBε1(ρ))<dε24(η2+T)3/2.\mathbb{W}_{2}\left(\bar{\rho}(\varepsilon),\text{SB}_{\varepsilon}^{1}(\rho)\right)<\frac{d\varepsilon^{2}}{4(\eta^{2}+T)^{3/2}}\,.

Again, since the (SB) scheme evolves via linear pushforwards, the (SB) iterates are SBεk(ρ)=𝒩(0,αk2Id)\text{SB}_{\varepsilon}^{k}(\rho)=\mathcal{N}(0,\alpha_{k}^{2}I_{d}), where αk2\alpha_{k}^{2} follows the recursion αk+12=(Cαk2ε)2αk2\alpha_{k+1}^{2}=\left(C^{\varepsilon}_{\alpha_{k}^{2}}\right)^{2}\alpha_{k}^{2}. Define the piecewise constant interpolation

SBε(t)=SBεk(ρ)t[kε,(k+1)ε).\text{SB}_{\varepsilon}(t)=\text{SB}_{\varepsilon}^{k}(\rho)\quad t\in[k\varepsilon,(k+1)\varepsilon)\,.

Unlike the previous two examples, since (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right) is not a gradient flow, we do not invoke Theorem 6 to prove the uniform convergence of the (SB) scheme. We instead directly show that (SBε(t),t[0,T])\left(\text{SB}_{\varepsilon}(t),t\in[0,T]\right) is a first-order approximation (Definition 1) of (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right). We have already shown consistency in (73). Now we show that SBε1()\text{SB}_{\varepsilon}^{1}(\cdot), restricted to Gaussian measures, is Lipschitz with Lipschitz constant of order 1+𝒪(ε)1+\mathcal{O}(\varepsilon). Take ρ1=𝒩(0,σ12Id)\rho_{1}=\mathcal{N}(0,\sigma_{1}^{2}I_{d}) and ρ2=𝒩(0,σ22Id)\rho_{2}=\mathcal{N}(0,\sigma_{2}^{2}I_{d}). Then, 𝕎2(SBε1(ρ1),SBε1(ρ2))\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\rho_{1}),\text{SB}_{\varepsilon}^{1}(\rho_{2})\right) can be bounded above by the following string of inequalities

=d|σ11+ε24σ14ε2σ1σ21+ε24σ24+ε2σ2|\displaystyle=d\mathinner{\!\left\lvert\sigma_{1}\sqrt{1+\frac{\varepsilon^{2}}{4\sigma_{1}^{4}}}-\frac{\varepsilon}{2\sigma_{1}}-\sigma_{2}\sqrt{1+\frac{\varepsilon^{2}}{4\sigma_{2}^{4}}}+\frac{\varepsilon}{2\sigma_{2}}\right\rvert}
dε2σ1σ2|σ1σ2|+d|σ12+ε24σ12σ22+ε24σ22|\displaystyle\leq\frac{d\varepsilon}{2\sigma_{1}\sigma_{2}}\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert}+d\mathinner{\!\left\lvert\sqrt{\sigma_{1}^{2}+\frac{\varepsilon^{2}}{4\sigma_{1}^{2}}}-\sqrt{\sigma_{2}^{2}+\frac{\varepsilon^{2}}{4\sigma_{2}^{2}}}\right\rvert} ( triangle inequality)\displaystyle\left(\text{ triangle inequality}\right)
=dε2σ1σ2|σ1σ2|+d|σ12σ22||1ε24σ12σ22|σ12+ε24σ12+σ22+ε24σ22\displaystyle=\frac{d\varepsilon}{2\sigma_{1}\sigma_{2}}\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert}+d\frac{\mathinner{\!\left\lvert\sigma_{1}^{2}-\sigma_{2}^{2}\right\rvert}\mathinner{\!\left\lvert 1-\frac{\varepsilon^{2}}{4\sigma_{1}^{2}\sigma_{2}^{2}}\right\rvert}}{{\sqrt{\sigma_{1}^{2}+\frac{\varepsilon^{2}}{4\sigma_{1}^{2}}}+\sqrt{\sigma_{2}^{2}+\frac{\varepsilon^{2}}{4\sigma_{2}^{2}}}}} (ab=aba+b)\displaystyle\left(\text{$\sqrt{a}-\sqrt{b}=\frac{a-b}{\sqrt{a}+\sqrt{b}}$}\right)
dε2σ1σ2|σ1σ2|+d|1ε24σ12σ22||σ1σ2|\displaystyle\leq\frac{d\varepsilon}{2\sigma_{1}\sigma_{2}}\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert}+d\mathinner{\!\left\lvert 1-\frac{\varepsilon^{2}}{4\sigma_{1}^{2}\sigma_{2}^{2}}\right\rvert}\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert} (σi2+ε2/4σi2>σi for i{1,2}).\displaystyle\left(\text{$\sqrt{\sigma_{i}^{2}+\varepsilon^{2}/4\sigma_{i}^{2}}>\sigma_{i}$ for $i\in\{1,2\}$}\right)\,.

For ε<2σ1σ2\varepsilon<2\sigma_{1}\sigma_{2}, (1ε24σ12σ22)>0\left(1-\frac{\varepsilon^{2}}{4\sigma_{1}^{2}\sigma_{2}^{2}}\right)>0 and hence we have the contraction

(74) 𝕎2(SBε1(ρ1),SBε1(ρ2))<(1+ε2σ1σ2)𝕎2(ρ1,ρ2)\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\rho_{1}),\text{SB}_{\varepsilon}^{1}(\rho_{2})\right)<\left(1+\frac{\varepsilon}{2\sigma_{1}\sigma_{2}}\right)\mathbb{W}_{2}(\rho_{1},\rho_{2})

The sequence (αk2,k[Nε])\left(\alpha_{k}^{2},k\in[N_{\varepsilon}]\right) decreases monotonically because

αk2=(Cαk12ε)2αk12=(1+ε24αk14ε2αk12)2αk12.\alpha_{k}^{2}=\left(C^{\varepsilon}_{\alpha_{k-1}^{2}}\right)^{2}\alpha_{k-1}^{2}=\left(\sqrt{1+\frac{\varepsilon^{2}}{4\alpha_{k-1}^{4}}}-\frac{\varepsilon}{2\alpha_{k-1}^{2}}\right)^{2}\alpha_{k-1}^{2}\,.

and 1+x2x<1\sqrt{1+x^{2}}-x<1 for all x>0x>0. Now we show that (αk2,k[Nε])\left(\alpha_{k}^{2},k\in[N_{\varepsilon}]\right) is bounded from below for small enough ε\varepsilon. Using 1+x1+x/2\sqrt{1+x}\leq 1+x/2,

αk2\displaystyle\alpha_{k}^{2} =αk12ε1+ε24αk14+ε22αk12αk12ε+ε22αk12(1ε4αk12).\displaystyle=\alpha_{k-1}^{2}-\varepsilon\sqrt{1+\frac{\varepsilon^{2}}{4\alpha_{k-1}^{4}}}+\frac{\varepsilon^{2}}{2\alpha_{k-1}^{2}}\geq\alpha_{k-1}^{2}-\varepsilon+\frac{\varepsilon^{2}}{2\alpha_{k-1}^{2}}\left(1-\frac{\varepsilon}{4\alpha_{k-1}^{2}}\right)\,.

If ε\varepsilon is uniformly less then 4αk124\alpha_{k-1}^{2} for all k[Nε]k\in[N_{\varepsilon}], then αk2αk12ε\alpha_{k}^{2}\geq\alpha_{k-1}^{2}-\varepsilon. Consequently, the lowest value is αNε2(η2+T)εNε=η2\alpha_{N_{\varepsilon}}^{2}\geq(\eta^{2}+T)-\varepsilon N_{\varepsilon}=\eta^{2}. Therefore, choosing ε<η2\varepsilon<\eta^{2}, we get that (αk2,k[Nε])\left(\alpha_{k}^{2},k\in[N_{\varepsilon}]\right) is bounded from below by η2\eta^{2}. This highlights an inherent issue with approximating time-reversed gradient flows: a valid approximation depends on the choice of ϵ\epsilon, which in turn depends on the variance of the flow at a future time. As a result, if we run an SB scheme with ϵ\epsilon greater than the variance of ρ¯(T)\bar{\rho}(T), we might encounter the problem of an SB step with a negative variance.

By the triangle inequality

𝕎2(ρ¯((k+1)ε),SBεk+1(ρ))\displaystyle\mathbb{W}_{2}\left(\bar{\rho}((k+1)\varepsilon),\text{SB}_{\varepsilon}^{k+1}(\rho)\right) 𝕎2(ρ¯((k+1)ε),SBε1(ρ¯(kε)))+𝕎2(SBε1(ρ¯(kε)),SBεk+1(ρ)).\displaystyle\leq\mathbb{W}_{2}\left(\bar{\rho}((k+1)\varepsilon),\text{SB}_{\varepsilon}^{1}(\bar{\rho}(k\varepsilon))\right)+\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\bar{\rho}(k\varepsilon)),\text{SB}_{\varepsilon}^{k+1}(\rho)\right)\,.

Using consistency (73),

𝕎2(ρ¯((k+1)ε),SBε1(ρ¯(kε)))<dε24(η2+Tkε)3/2,\mathbb{W}_{2}\left(\bar{\rho}((k+1)\varepsilon),\text{SB}_{\varepsilon}^{1}(\bar{\rho}(k\varepsilon))\right)<\frac{d\varepsilon^{2}}{4(\eta^{2}+T-k\varepsilon)^{3/2}}\,,

and using Lipschitzness (74),

𝕎2(SBε1(ρ¯(kε)),SBεk+1(ρ))<(1+ε2(η2+Tkε)αk)𝕎2(ρ¯(kε),SBεk(ρ)).\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\bar{\rho}(k\varepsilon)),\text{SB}_{\varepsilon}^{k+1}(\rho)\right)<\left(1+\frac{\varepsilon}{2(\eta^{2}+T-k\varepsilon)\alpha_{k}}\right)\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)\,.

Since for all k[Nε]k\in[N_{\varepsilon}], αk2η2\alpha_{k}^{2}\geq\eta^{2}, we have

𝕎2(ρ¯((k+1)ε),SBεk+1(ρ))<dε24η3+(1+ε2η2)𝕎2(ρ¯(kε),SBεk(ρ)).\mathbb{W}_{2}\left(\bar{\rho}((k+1)\varepsilon),\text{SB}_{\varepsilon}^{k+1}(\rho)\right)<\frac{d\varepsilon^{2}}{4\eta^{3}}+\left(1+\frac{\varepsilon}{2\eta^{2}}\right)\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)\,.

Recursively,

𝕎2(ρ¯(kε),SBεk(ρ))dε2η(1+ε2η2)Nε.\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)\leq\frac{d\varepsilon}{2\eta}\left(1+\frac{\varepsilon}{2\eta^{2}}\right)^{N_{\varepsilon}}\,.

Since Nε=Tε1N_{\varepsilon}=\lfloor T\varepsilon^{-1}\rfloor, we conclude that limε0supk[Nε]𝕎2(ρ¯(kε),SBεk(ρ))=0\lim_{\varepsilon\to 0}\sup_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)=0, proving the uniform convergence of (SBεk(ρ),k[Nε])\left(\text{SB}_{\varepsilon}^{k}(\rho),k\in[N_{\varepsilon}]\right) to the curve (ρ¯(t),t[0,T])\left(\bar{\rho}(t),t\in[0,T]\right) as ε0+\varepsilon\to 0+.

Finally, we consider the time reversed gradient flow of =12H(|ν)\mathcal{F}=\frac{1}{2}H\left(\cdot|\nu\right) with ν=eg=𝒩(0,Id)\nu=e^{-g}=\mathcal{N}(0,I_{d}). Since this case is very similar to the previous examples on the gradient flow of KL divergence and the time reversed gradient flow of entropy, we keep the explicit calculations brief. The velocity field is v¯(t)=12(Id+logρ¯(t))\bar{v}(t)=\frac{1}{2}\left(\operatorname{\mathrm{Id}}+\nabla\log\bar{\rho}(t)\right). For an initial choice ρ(0)=𝒩(0,η2)\rho(0)=\mathcal{N}(0,\eta^{2}), the time reversed gradient flow, starting from time t=Tt=T, is ρ¯(t)=𝒩(0,(η2e(Tt)+(1e(Tt)))Id)\bar{\rho}(t)=\mathcal{N}\left(0,\left(\eta^{2}e^{-(T-t)}+(1-e^{-(T-t)})\right)I_{d}\right). For brevity of notation, let ρ=ρ¯(0)=𝒩(0,τ2)\rho=\bar{\rho}(0)=\mathcal{N}(0,\tau^{2}) where τ2:=η2eT+(1eT)\tau^{2}\mathrel{\mathop{\mathchar 58\relax}}=\eta^{2}e^{-T}+(1-e^{-T}). Now we will calculate the (SB) step. Again, we choose θ\theta such that the surrogate measure σ=σ(0)=(ρeg)θ\sigma=\sigma(0)=\left(\rho e^{g}\right)^{\theta} is integrable. Using the same logic as in Section 5.2, if η2>1\eta^{2}>1 (equivalently τ2>1\tau^{2}>1), then θ\theta may be taken to be 1-1, giving σ=ρ1eg=𝒩(0,(τ2/(τ21))Id)\sigma=\rho^{-1}e^{-g}=\mathcal{N}\left(0,(\tau^{2}/(\tau^{2}-1))I_{d}\right) and SBε1(ρ)=(2Idσ,ε)#ρ\text{SB}_{\varepsilon}^{1}(\rho)=\left(2\operatorname{\mathrm{Id}}-\mathcal{B}_{\sigma,\varepsilon}\right)_{\#}\rho. Whereas if η2<1\eta^{2}<1 (equivalently τ2<1\tau^{2}<1), then θ\theta can be taken to be 11, giving σ=𝒩(0,(τ2/(1τ2))Id)\sigma=\mathcal{N}(0,(\tau^{2}/(1-\tau^{2}))I_{d}) and SBε1(ρ)=(σ,ε)#ρ\text{SB}_{\varepsilon}^{1}(\rho)=\left(\mathcal{B}_{\sigma,\varepsilon}\right)_{\#}\rho. We will present calculations for the η2<1\eta^{2}<1 case. Similar calculations follow for η2>1\eta^{2}>1 and are omitted.

Now we prove that the (SB) step is an 𝒪(ε2)\mathcal{O}(\varepsilon^{2}) approximation of ρ¯(ε)\bar{\rho}(\varepsilon). Note that for τ2<1\tau^{2}<1 and ε<log((1τ2)1)\varepsilon<\log\left((1-\tau^{2})^{-1}\right), SBε1(ρ)=𝒩(0,(Cτ2/(1τ2)ε)2τ2)\text{SB}_{\varepsilon}^{1}(\rho)=\mathcal{N}\left(0,\left(C^{\varepsilon}_{\tau^{2}/(1-\tau^{2})}\right)^{2}\tau^{2}\right) and ρ¯(ε)=𝒩(0,τ2+(τ21)(eε1))\bar{\rho}(\varepsilon)=\mathcal{N}\left(0,\tau^{2}+(\tau^{2}-1)(e^{\varepsilon}-1)\right). Then

𝕎2(ρ¯(ε),SBε1(ρ))\displaystyle\mathbb{W}_{2}\left(\bar{\rho}(\varepsilon),\text{SB}_{\varepsilon}^{1}(\rho)\right) =dτ|1+ε2(1τ2)24τ4ε(1τ2)2τ211τ2τ2(eε1)|.\displaystyle=d\tau\mathinner{\!\left\lvert\sqrt{1+\frac{\varepsilon^{2}(1-\tau^{2})^{2}}{4\tau^{4}}}-\frac{\varepsilon(1-\tau^{2})}{2\tau^{2}}-\sqrt{1-\frac{1-\tau^{2}}{\tau^{2}}(e^{\varepsilon}-1)}\right\rvert}.

Denote x=ε(1τ2)τ2x=\frac{\varepsilon(1-\tau^{2})}{\tau^{2}}; then the term containing ε\varepsilon above is cε=1+x24x21x(eε1ε)c_{\varepsilon}=\sqrt{1+\frac{x^{2}}{4}}-\frac{x}{2}-\sqrt{1-x\left(\frac{e^{\varepsilon}-1}{\varepsilon}\right)}. It is easy to see that cε>0c_{\varepsilon}>0. Further, using the inequalities 1+a<1+a/2\sqrt{1+a}<1+a/2 for a>0a>0 and 1b>1b/2\sqrt{1-b}>1-b/2 for 0<b<10<b<1, we have that cε<x2(eε1ε1)+x28c_{\varepsilon}<\frac{x}{2}\left(\frac{e^{\varepsilon}-1}{\varepsilon}-1\right)+\frac{x^{2}}{8}. Therefore, we have the consistency (55) result

(75) 𝕎2(ρ¯(ε),SBε1(ρ))<dε2((1τ2)28τ3+C(1τ2)2τ)for some constant C.\mathbb{W}_{2}\left(\bar{\rho}(\varepsilon),\text{SB}_{\varepsilon}^{1}(\rho)\right)<d\varepsilon^{2}\left(\frac{(1-\tau^{2})^{2}}{8\tau^{3}}+\frac{C(1-\tau^{2})}{2\tau}\right)\quad\text{for some constant }C.

Again, since the (SB) scheme evolves via linear pushforwards, SBεk(ρ)\text{SB}_{\varepsilon}^{k}(\rho) is a Gaussian measure for all k[Nε]k\in[N_{\varepsilon}]; let SBεk(ρ)=𝒩(0,αk2Id)\text{SB}_{\varepsilon}^{k}(\rho)=\mathcal{N}(0,\alpha_{k}^{2}I_{d}). Then αk2\alpha_{k}^{2} evolves via the recursion αk+12=(Cαk2/(1αk2)ε)2αk2\alpha_{k+1}^{2}=\left(C^{\varepsilon}_{\alpha_{k}^{2}/(1-\alpha_{k}^{2})}\right)^{2}\alpha_{k}^{2}. We prove uniform convergence of the (SB) scheme in the same way as for the time-reversed flow of the entropy functional. Consistency is shown above, and now we show that the (SB) step is a contraction for Gaussian measures. Take ρ1=𝒩(0,σ12Id)\rho_{1}=\mathcal{N}(0,\sigma_{1}^{2}I_{d}) and ρ2=𝒩(0,σ22Id)\rho_{2}=\mathcal{N}(0,\sigma_{2}^{2}I_{d}). Then, exactly as in the case of the time-reversed gradient flow of entropy,

𝕎2(SBε1(ρ1),SBε1(ρ2))\displaystyle\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\rho_{1}),\text{SB}_{\varepsilon}^{1}(\rho_{2})\right) <dε2(1+1σ1σ2)|σ1σ2|+d|1ε24(1σ12σ22σ12σ22)||σ1σ2|.\displaystyle<\frac{d\varepsilon}{2}\left(1+\frac{1}{\sigma_{1}\sigma_{2}}\right)\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert}+d\mathinner{\!\left\lvert 1-\frac{\varepsilon^{2}}{4}\left(\frac{1-\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}\sigma_{2}^{2}}\right)\right\rvert}\mathinner{\!\left\lvert\sigma_{1}-\sigma_{2}\right\rvert}.

For ε<2σ1σ21σ12σ22\varepsilon<\frac{2\sigma_{1}\sigma_{2}}{\sqrt{1-\sigma_{1}^{2}\sigma_{2}^{2}}}, (1ε24(1σ12σ22σ12σ22))>0\left(1-\frac{\varepsilon^{2}}{4}\left(\frac{1-\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}\sigma_{2}^{2}}\right)\right)>0 and hence we have the contraction

(76) 𝕎2(SBε1(ρ1),SBε1(ρ2))<(1+ε2(1+1σ1σ2))𝕎2(ρ1,ρ2).\mathbb{W}_{2}\left(\text{SB}_{\varepsilon}^{1}(\rho_{1}),\text{SB}_{\varepsilon}^{1}(\rho_{2})\right)<\left(1+\frac{\varepsilon}{2}\left(1+\frac{1}{\sigma_{1}\sigma_{2}}\right)\right)\mathbb{W}_{2}\left(\rho_{1},\rho_{2}\right)\,.

Because Cη2ε1C^{\varepsilon}_{\eta^{2}}\leq 1 for all η2>0\eta^{2}>0, the sequence (αk2;k[Nε])\left(\alpha_{k}^{2};k\in[N_{\varepsilon}]\right) is monotonically decreasing and bounded from above by its initial value τ2\tau^{2}. As for the time-reversed gradient flow of entropy, the variance sequence (αk2;k[Nε])(\alpha_{k}^{2};k\in[N_{\varepsilon}]) is bounded from below if ε\varepsilon is less than 4η21η2\frac{4\eta^{2}}{1-\eta^{2}}, a bound which depends on the variance at time t=Tt=T. Using the triangle inequality argument from Theorem 3, for any k[Nε]k\in[N_{\varepsilon}],

𝕎2(ρ¯(kε),SBεk(ρ))dε(1+4η2C)4η(η2+1)(1+(1+η2)ε2)Nε.\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)\leq d\varepsilon\frac{(1+4\eta^{2}C)}{4\eta(\eta^{2}+1)}\left(1+\frac{\left(1+\eta^{-2}\right)\varepsilon}{2}\right)^{N_{\varepsilon}}\,.

Since NεεTN_{\varepsilon}\varepsilon\leq T, the factor (1+(1+η2)ε/2)Nε\left(1+(1+\eta^{-2})\varepsilon/2\right)^{N_{\varepsilon}} is bounded by e(1+η2)T/2e^{(1+\eta^{-2})T/2} uniformly in ε\varepsilon, so the right-hand side above is of order ε\varepsilon. Therefore, limε0supk[Nε]𝕎2(ρ¯(kε),SBεk(ρ))=0\lim_{\varepsilon\to 0}\sup_{k\in[N_{\varepsilon}]}\mathbb{W}_{2}\left(\bar{\rho}(k\varepsilon),\text{SB}_{\varepsilon}^{k}(\rho)\right)=0, proving the uniform convergence of (SBεk(ρ);k[Nε])\left(\text{SB}_{\varepsilon}^{k}(\rho);k\in[N_{\varepsilon}]\right) to the curve (ρ¯(t);t[0,T])\left(\bar{\rho}(t);t\in[0,T]\right) as ε0+\varepsilon\to 0+.
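As a numerical sanity check, the variance recursion αk+12=(Cαk2/(1αk2)ε)2αk2\alpha_{k+1}^{2}=\left(C^{\varepsilon}_{\alpha_{k}^{2}/(1-\alpha_{k}^{2})}\right)^{2}\alpha_{k}^{2} can be iterated as follows. The coefficient `C_hypothetical` below is a placeholder satisfying 0<C10<C\leq 1, which is the only property used in the monotonicity claim; it is not the paper's actual coefficient Cη2εC^{\varepsilon}_{\eta^{2}}.

```python
# Sketch of the (SB) variance recursion for centered isotropic Gaussians.
# C_hypothetical is a PLACEHOLDER coefficient with 0 < C <= 1; the paper's
# actual coefficient C^eps_{eta^2} is defined elsewhere in the text.

def iterate_variances(alpha0_sq, eps, n_steps, C):
    """Iterate alpha_{k+1}^2 = C(alpha_k^2 / (1 - alpha_k^2), eps)^2 * alpha_k^2."""
    seq = [alpha0_sq]
    a = alpha0_sq
    for _ in range(n_steps):
        eta_sq = a / (1.0 - a)  # requires alpha_k^2 < 1
        a = C(eta_sq, eps) ** 2 * a
        seq.append(a)
    return seq

def C_hypothetical(eta_sq, eps):
    # Any coefficient in (0, 1] yields a monotonically decreasing,
    # positive variance sequence, as claimed in the text.
    return 1.0 - eps / (2.0 * (1.0 + eta_sq))

variances = iterate_variances(0.5, 0.01, 100, C_hypothetical)
```

With any such C1C\leq 1 the sequence is decreasing and bounded above by its initial value; the lower bound on the variances additionally uses the specific form of Cη2εC^{\varepsilon}_{\eta^{2}}.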

References

  • [AGS08] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, second edition, 2008.
  • [BGL14] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348 of Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Cham, 2014.
  • [BGN22] Espen Bernton, Promit Ghosal, and Marcel Nutz. Entropic optimal transport: geometry and large deviations. Duke Math. J., 171(16):3363–3400, 2022.
  • [Bob22] Sergey G. Bobkov. Upper bounds for Fisher information. Electron. J. Probab., 27:Paper No. 115, 44, 2022.
  • [CCGT22] Alberto Chiarini, Giovanni Conforti, Giacomo Greco, and Luca Tamanini. Gradient estimates for the Schrödinger potentials: convergence to the Brenier map and quantitative stability, 2022.
  • [CG14] P. Cattiaux and A. Guillin. Semi log-concave Markov diffusions. In Séminaire de Probabilités XLVI, volume 2123 of Lecture Notes in Math., pages 231–292. Springer, Cham, 2014.
  • [Csi75] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probability, 3:146–158, 1975.
  • [CT21] Giovanni Conforti and Luca Tamanini. A formula for the time derivative of the entropic cost and applications. J. Funct. Anal., 280(11):Paper No. 108964, 48, 2021.
  • [Cut13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
  • [CVR18] Giovanni Conforti and Max Von Renesse. Couplings, gradient estimates and logarithmic Sobolev inequality for Langevin bridges. Probab. Theory Related Fields, 172(1-2):493–524, 2018.
  • [DGW04] H. Djellout, A. Guillin, and H. Wu. Transportation cost-information inequalities and applications to random dynamical systems and diffusions. Ann. Probab., 32(3B):2702–2732, 2004.
  • [GLPR24] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. arXiv preprint arXiv:2305.05465, 2024.
  • [GLR17] Ivan Gentil, Christian Léonard, and Luigia Ripani. About the analogy between optimal transport and minimal entropy. Ann. Fac. Sci. Toulouse Math. (6), 26(3):569–601, 2017.
  • [GLRT20] Ivan Gentil, Christian Léonard, Luigia Ripani, and Luca Tamanini. An entropic interpolation proof of the HWI inequality. Stochastic Process. Appl., 130(2):907–923, 2020.
  • [JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal., 29(1):1–17, 1998.
  • [JMPC20] Hicham Janati, Boris Muzellec, Gabriel Peyré, and Marco Cuturi. Entropic optimal transport between unbalanced Gaussian measures has a closed form. In Advances in Neural Information Processing Systems (NeurIPS 2020). Curran Associates Inc., Red Hook, NY, USA, 2020.
  • [KS91] Ioannis Karatzas and Steven E. Shreve. Brownian motion and stochastic calculus, volume 113 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1991.
  • [Léo12] Christian Léonard. From the Schrödinger problem to the Monge-Kantorovich problem. J. Funct. Anal., 262(4):1879–1920, 2012.
  • [Léo14] Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Contin. Dyn. Syst., 34(4):1533–1574, 2014.
  • [LK93] B. C. Levy and A. J. Krener. Dynamics and kinematics of reciprocal diffusions. Journal of mathematical physics, 34(5):1846–1875, 1993.
  • [LYB+24] Sungbin Lim, Eun Bi Yoon, Taehyun Byun, Taewon Kang, Seungwoo Kim, Kyungjae Lee, and Sungjoon Choi. Score-based generative modeling through stochastic evolution equations in Hilbert spaces. Advances in Neural Information Processing Systems, 36, 2024.
  • [Olv13] P.J. Olver. Introduction to Partial Differential Equations. Undergraduate Texts in Mathematics. Springer International Publishing, 2013.
  • [Pal19] Soumik Pal. On the difference between entropic cost and the optimal transport cost, 2019. arXiv:1905.12206. To appear in The Annals of Applied Probability.
  • [PNW22] Aram-Alexandre Pooladian and Jonathan Niles-Weed. Entropic estimation of optimal transport maps, 2022.
  • [Roy07] Gilles Royer. An initiation to logarithmic Sobolev inequalities, volume 14 of SMF/AMS Texts and Monographs. American Mathematical Society, Providence, RI; Société Mathématique de France, Paris, 2007. Translated from the 1999 French original by Donald Babbitt.
  • [RY04] D. Revuz and M. Yor. Continuous Martingales and Brownian Motion. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2004.
  • [SABP22] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 3515–3530. PMLR, 28–30 Mar 2022.
  • [San15] Filippo Santambrogio. Optimal transport for applied mathematicians: Calculus of variations, PDEs, and modeling, volume 87. Springer, 2015.
  • [SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020. arXiv:2011.13456.
  • [Ver18] Roman Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. An introduction with applications in data science, With a foreword by Sara van de Geer.
  • [Vil03] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2003.

Notations

Symbol Explanation
Cd[0,)C^{d}[0,\infty) Collection of continuous paths ω:[0,)d\omega\mathrel{\mathop{\mathchar 58\relax}}[0,\infty)\to\mathbb{R}^{d}
Ck(d)C^{k}(\mathbb{R}^{d}) kk-times continuously differentiable functions d\mathbb{R}^{d}\to\mathbb{R}
𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}) Space of probability measures on d\mathbb{R}^{d} with finite second moment
𝒫2ac(d)\mathcal{P}_{2}^{ac}(\mathbb{R}^{d}) Space of probability measures on d\mathbb{R}^{d} with finite second moment that are absolutely continuous with respect to Lebesgue measure
Cc(d)C_{c}^{\infty}(\mathbb{R}^{d}) Space of smooth and compactly supported functions in d\mathbb{R}^{d}
𝕎2(μ,ν)\mathbb{W}_{2}(\mu,\nu) Wasserstein-2 distance between measures μ\mu and ν\nu in 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d})
TμνT_{\mu}^{\nu} Optimal transport map from μ\mu to ν\nu (if it exists)
(πt,t0)(\pi_{t},t\geq 0) Coordinatewise projections on Cd[0,)C^{d}[0,\infty), πt(ω)=ωt\pi_{t}(\omega)=\omega_{t}
μρ,ε\mu_{\rho,\varepsilon} μρ,ε(dx,dy)=(2πε)d/2ρ(x)exp(xy2/2ε)dxdy\mu_{\rho,\varepsilon}(dx,dy)=(2\pi\varepsilon)^{-d/2}\rho(x)\exp\left(-\|x-y\|^{2}/2\varepsilon\right)dxdy
(Wx,xd)(W_{x},x\in\mathbb{R}^{d}) Wiener measure on Cd[0,)C^{d}[0,\infty) with initial condition δx\delta_{x}
WW Reversible Wiener measure on Cd[0,)C^{d}[0,\infty)
RεR_{\varepsilon} Law of reversible Brownian motion with diffusion ε\varepsilon
(Pt,t0)(P_{t},t\geq 0) Brownian semigroup
(pt(,),t>0)(p_{t}(\cdot,\cdot),t>0) Brownian transition densities
QQ Law of stationary Langevin diffusion on Cd[0,)C^{d}[0,\infty)
ρ,ε\ell_{\rho,\varepsilon} Joint law of Langevin diffusion at time 0 and ε\varepsilon, i.e. (π0,πε)#Q(\pi_{0},\pi_{\varepsilon})_{\#}Q
(qt(,),t>0)(q_{t}(\cdot,\cdot),t>0) Transition densities for Langevin diffusion
LL Extended generator for the Langevin diffusion
(Gt,t0)(G_{t},t\geq 0) Langevin semigroup
πρ,ε\pi_{\rho,\varepsilon} ε\varepsilon-static Schrödinger bridge with marginals equal to ρ\rho
(fε,ε>0)(f_{\varepsilon},\varepsilon>0) Entropic potentials associated to (πρ,ε,ε>0)(\pi_{\rho,\varepsilon},\varepsilon>0)
ρ,ε\mathcal{B}_{\rho,\varepsilon} Barycentric projection associated with πρ,ε\pi_{\rho,\varepsilon}, ρ,ε(x)=Eπρ,ε[Y|X=x]\mathcal{B}_{\rho,\varepsilon}(x)=\mathrm{E}_{\pi_{\rho,\varepsilon}}[Y|X=x]
Sρ,εS_{\rho,\varepsilon} Law of ε\varepsilon-dynamic Schrödinger bridge on Cd[0,1]C^{d}[0,1] with initial and terminal distribution ρ\rho
(ρtε,t[0,1])(\rho_{t}^{\varepsilon},t\in[0,1]) Entropic interpolation from ρ\rho to itself, ρtε=(πt)#Sρ,ε\rho_{t}^{\varepsilon}=(\pi_{t})_{\#}S_{\rho,\varepsilon}
ψ1\|\cdot\|_{\psi_{1}} Subexponential norm of a random variable, [Ver18, Definition 2.7.5]
Tan(μ)\text{Tan}(\mu) Tangent bundle of 𝒫2(d)\mathcal{P}_{2}(\mathbb{R}^{d}) at μ\mu
Ent(μ)\mathrm{Ent}(\mu) Entropy of μ𝒫2ac(d)\mu\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})
H(μ|ν)H(\mu|\nu) Kullback-Leibler divergence between μ\mu and ν\nu
I(ρ)I(\rho) Fisher information of ρ\rho, I(ρ)=Eρlogρ2I(\rho)=\mathrm{E}_{\rho}\|\nabla\log\rho\|^{2} for ρ𝒫2ac(d)\rho\in\mathcal{P}_{2}^{ac}(\mathbb{R}^{d})
SBεk(μ0)\text{SB}_{\varepsilon}^{k}(\mu_{0}) kkth iterate of Schrödinger bridge scheme starting from μ0\mu_{0} with stepsize ε\varepsilon
Sεk(μ0)S_{\varepsilon}^{k}\left(\mu_{0}\right) kkth iterate of pushforward of form (Id+εvε)#μ0(\operatorname{\mathrm{Id}}+\varepsilon v_{\varepsilon})_{\#}\mu_{0}, where each vε:ddv_{\varepsilon}\mathrel{\mathop{\mathchar 58\relax}}\mathbb{R}^{d}\to\mathbb{R}^{d} is a vector field specified in context