
Improved Discretization Analysis for Underdamped
Langevin Monte Carlo

Matthew Zhang       Sinho Chewi       Mufan (Bill) Li
Krishnakumar Balasubramanian   Murat A. Erdogdu
Department of Computer Science, University of Toronto, and Vector Institute; matthew.zhang@mail.utoronto.ca
Department of Mathematics, Massachusetts Institute of Technology; schewi@mit.edu
Department of Statistical Sciences, University of Toronto, and Vector Institute; mufan.li@mail.utoronto.ca
Department of Statistics, University of California, Davis; kbala@ucdavis.edu
Department of Computer Science, University of Toronto, and Vector Institute; erdogdu@cs.toronto.edu
Abstract

Underdamped Langevin Monte Carlo (ULMC) is an algorithm used to sample from unnormalized densities by leveraging the momentum of a particle moving in a potential well. We provide a novel analysis of ULMC, motivated by two central questions: (1) Can we obtain improved sampling guarantees beyond strong log-concavity? (2) Can we achieve acceleration for sampling?

For (1), prior results for ULMC only hold under a log-Sobolev inequality together with a restrictive Hessian smoothness condition. We relax these assumptions by removing the Hessian smoothness condition and by considering distributions that satisfy a Poincaré inequality. Our analysis achieves the state-of-the-art dimension dependence, and is also flexible enough to handle weakly smooth potentials. As a byproduct, we also obtain the first KL divergence guarantees for ULMC without Hessian smoothness under strong log-concavity, based on a new result on the log-Sobolev constant along the underdamped Langevin diffusion.

For (2), the recent breakthrough of Cao, Lu, and Wang (2020) established the first accelerated result for sampling in continuous time via PDE methods. Our discretization analysis translates their result into an algorithmic guarantee, which indeed enjoys better condition number dependence than prior works on ULMC, although we leave open the question of full acceleration in discrete time.

Both (1) and (2) necessitate Rényi discretization bounds, which are more challenging than the typically used Wasserstein coupling arguments. We address this using a flexible discretization analysis based on Girsanov’s theorem that easily extends to more general settings.

1 Introduction

The problem of sampling from a high-dimensional distribution \pi\propto\exp(-U) on \mathbb{R}^{d}, when the normalizing constant is unknown and only the potential U is given, is increasingly relevant in a number of application domains, including economics, physics, and scientific computing [JP10, Von11, KPB20]. Recent progress on this problem has been driven by a strong connection with the field of optimization, starting from the seminal work of [JKO98]; see [Che23] for an exposition.

Given the success of momentum-based algorithms for optimization [Nes83], it is natural to investigate momentum-based algorithms for sampling. The hope is that such methods can improve the dependence of the convergence estimates on key problem parameters, such as the condition number \kappa, the dimension d, and the error tolerance \epsilon. One such method is underdamped Langevin Monte Carlo (ULMC), which is a discretization of the underdamped Langevin diffusion (ULD):

\begin{aligned} \mathrm{d}x_{t} &= v_{t}\,\mathrm{d}t\,, \\ \mathrm{d}v_{t} &= -\gamma v_{t}\,\mathrm{d}t - \nabla U(x_{t})\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_{t}\,, \end{aligned} \qquad (1)

where \{B_{t}\}_{t\geq 0} is a standard d-dimensional Brownian motion. The stationary distribution of ULD is \mu(x,v)\propto\exp(-U(x)-\lVert v\rVert^{2}/2); in particular, the x-marginal of \mu is the desired target distribution \pi. Therefore, by taking a small step size for the discretization and a large number of iterations, ULMC yields an approximate sample from \pi.
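As a quick sanity check (a standard computation, recorded here for convenience), stationarity of \mu can be verified by substituting it into the kinetic Fokker–Planck equation associated with ULD: writing \mu\propto e^{-H} with H(x,v) = U(x)+\lVert v\rVert^{2}/2, so that \nabla_{x}\mu = -\nabla U\,\mu and \nabla_{v}\mu = -v\mu,

\begin{aligned} \partial_{t}\mu &= -v\cdot\nabla_{x}\mu + \nabla_{v}\cdot\bigl((\gamma v+\nabla U(x))\,\mu\bigr) + \gamma\,\Delta_{v}\mu \\ &= \underbrace{v\cdot\nabla U\,\mu - \nabla U\cdot v\,\mu}_{=\,0} + \gamma\,\nabla_{v}\cdot(v\mu + \nabla_{v}\mu) = 0\,, \end{aligned}

since v\mu + \nabla_{v}\mu = 0.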

We also note that in the limiting case \gamma = 0, ULMC closely resembles the Hamiltonian Monte Carlo algorithm, which is known to achieve acceleration and better discretization error in some limited settings [Vis21, AGS22, BM22, WW22].

While there is currently no analysis of ULMC that yields acceleration for sampling (i.e., square-root dependence on the condition number \kappa), ULMC is known to improve the dependence on other parameters such as the dimension d and the error tolerance \epsilon [Che+18, Che+18a, DR20], at least for guarantees in the Wasserstein metric. However, compared to the extensive literature on the simpler (overdamped) Langevin Monte Carlo (LMC) algorithm, existing analyses of ULMC are not easily extended to stronger performance metrics such as the \mathsf{KL} and Rényi divergences. In turn, this limits the scope of the results for ULMC; see the discussion in Section 1.1.

In light of these shortcomings, in this work, we ask the following two questions:

  1. Can we obtain sampling guarantees beyond the strongly log-concave case via ULMC?

  2. Can we obtain accelerated convergence guarantees for sampling via ULMC?

1.1 Our Contributions

We address the two questions above by providing a new Girsanov discretization bound for ULMC. Our bound holds in the strong Rényi divergence metric and applies under general assumptions (in particular, it does not require strong log-concavity of the target \pi, and it allows for weakly smooth potentials). Consequently, it leads to the following new state-of-the-art results for ULMC:

  • We obtain an \epsilon^{2}-guarantee in KL divergence with iteration complexity \widetilde{\mathcal{O}}(\kappa^{3/2} d^{1/2} \epsilon^{-1}) for strongly log-concave and log-smooth distributions, which removes the Lipschitz Hessian assumption of [Ma+21]; here, \kappa is the condition number of the distribution.

  • We obtain an \epsilon-guarantee in TV distance with iteration complexity \widetilde{\mathcal{O}}(C_{\mathsf{LSI}}^{3/2} L^{3/2} d^{1/2} \epsilon^{-1}) under a log-Sobolev inequality (LSI) and an L-smooth potential, again without assuming a Lipschitz Hessian. This is the state-of-the-art guarantee for this class of distributions with regard to dimension dependence.

  • We obtain \epsilon^{2}-guarantees in the stronger Rényi divergence metric of any order in [1,2) with iteration complexity \widetilde{\mathcal{O}}(C_{\mathsf{PI}}^{3/2} L^{3/2} d^{2} \epsilon^{-1}) under a Poincaré inequality and an L-smooth potential, which improves to \widetilde{\mathcal{O}}(C_{\mathsf{PI}} L d^{2} \epsilon^{-1}) under log-concavity. These are the first known guarantees for ULMC in these settings, and they substantially improve upon the corresponding results for LMC [Che+21].

  • In the Poincaré case, we also consider weakly smooth potentials (i.e., Hölder continuous gradients with exponent s\in(0,1]), which more realistically reflect the delicate smoothness properties of distributions satisfying a Poincaré inequality.

We now discuss our results in more detail in the context of the existing literature.

Guarantees under Weaker Assumptions.

Prior works [Che+18a, DR20, GT20] require strong log-concavity of the target. Among works which operate under isoperimetric assumptions, we are only aware of [Ma+21], which further assumes a restrictive Lipschitz Hessian condition on the potential. In contrast, we make no such assumption on the Hessian of U, and we obtain results under a log-Sobolev inequality (LSI), or under the even weaker assumption of a Poincaré inequality (PI), for which the sampling analysis is known to be challenging [Che+21].

As noted above, our result for sampling from distributions satisfying LSI and smoothness assumptions is state-of-the-art with regard to the dimension dependence (d^{1/2}); in contrast, the previous best results had linear dependence on d [Che+21, Che+22]. Moreover, in the Poincaré case, we can also handle weakly smooth potentials, which have not been previously considered in the context of ULMC.

Guarantees in Stronger Metrics.

Key to achieving these results is our discretization analysis in the Rényi divergence metric. Indeed, the continuous-time convergence results for ULD under LSI or PI hold in the KL or Rényi divergence metrics, and translating these guarantees to the ULMC algorithm necessitates studying the discretization in Rényi divergence. This is the main technical challenge, as we can no longer rely on the Wasserstein coupling arguments which are standard in the literature [Che+18a, DR20]. Two notable exceptions are the Rényi discretization argument of [GT20], which incurs suboptimal dependence on \epsilon, and the KL divergence argument of [Ma+21], which requires stringent smoothness assumptions.

In this work, we provide the first KL divergence guarantee for sampling from strongly log-concave and log-smooth distributions via ULMC without Hessian smoothness, based on a new LSI along the trajectory (discussed further below).

Towards Acceleration in Sampling.

Our work is also motivated by the breakthrough result of [CLW20], which achieves for the first time an accelerated convergence guarantee for ULD in continuous time. Our discretization bound allows us to convert this result into an algorithmic guarantee which indeed improves the dependence on the condition number \kappa for ULMC, whereas prior results incurred a dependence of at least \kappa^{3/2}; our dependence is linear in \kappa in the log-concave case. (Under a Poincaré inequality, the condition number is \kappa \coloneqq C_{\mathsf{PI}} L, which is consistent with the definition in the strongly log-concave case.) While this still falls short of proving full acceleration for sampling (i.e., an improvement to \kappa^{1/2}), our result provides further hope for achieving acceleration via ULMC.

A New Log-Sobolev Inequality along the ULD Trajectory.

Finally, en route to proving the KL divergence guarantee in the strongly log-concave case, we establish a new log-Sobolev inequality along ULD (Proposition 14), which is of independent interest. While such a result was previously known for the overdamped Langevin diffusion, to the best of our knowledge it is new for the underdamped version.

1.2 More Related Work

Langevin Monte Carlo.

For the standard LMC algorithm, non-asymptotic rate estimates in \mathcal{W}_{2} were first demonstrated in [Dal17] for the class of strongly log-concave measures. Guarantees in \mathsf{KL} divergence under a log-Sobolev inequality were obtained by [VW19], which developed an appealing continuous-time framework for analyzing LMC under functional inequalities. This result was later extended, with some difficulty, to Rényi divergences by [GT20, EHZ22]. At the same time, a body of literature studied convergence in \mathsf{KL} divergence under tail-growth conditions such as dissipativity [RRT17, EMS18, EH21, Mou+22], which usually imply functional inequalities.

Most related to the current work, [Che+21] extended the continuous-time approach from [VW19] to Rényi divergences, and moreover introduced a novel discretization analysis using Girsanov’s theorem, which also holds for weakly smooth potentials. The present work builds upon the Girsanov techniques introduced in [Che+21] to study ULMC.

Underdamped Langevin Diffusion.

ULMC is a discretization of the underdamped Langevin diffusion (1), first studied by [Kol34] and [Hör67] in their pioneering works on hypoellipticity. It was quickly understood that establishing quantitative convergence to stationarity is technically challenging, let alone capturing any acceleration phenomenon. The seminal works [Vil02, Vil09] developed the hypocoercivity approach, providing the first convergence guarantees under functional inequalities; see also [Hér06, DMS09, DMS15, RS18]. We also refer to [Ber+22] and references therein for a comprehensive discussion of qualitative and quantitative convergence results for ULD.

As mentioned earlier, the recent breakthrough of [CLW20] achieved acceleration in continuous time in \chi_{2}-divergence when the target distribution \pi is log-concave. This work builds on an approach using the dual Sobolev space \mathcal{H}^{-1} [Alb+19]. However, since this method relies on the duality of the L^{2} space and its connections to the Poincaré inequality, it is difficult to extend to L^{p} spaces or to other functional inequalities.

Other Discretizations.

Many alternative discretization schemes have since been proposed in this setting [SL19, Li+19, HBE20, FLO21, Mon21, FRS22, JLS23], albeit all of the analyses up to this point were limited to the \mathcal{W}_{2} distance and did not achieve acceleration in terms of the condition number \kappa.

1.3 Organization

The remainder of this paper is organized as follows. In Section 2, we review the required definitions and assumptions. In Section 3, we state our main results. In Section 4, we highlight several implications of our theorems through examples. In Section 5, we briefly sketch the proofs of our main results, before concluding in Section 6 with a discussion of future directions.

2 Background

2.1 Notation

Hereafter, we use \lVert\cdot\rVert to denote the 2-norm on vectors. In general, we only work with measures that admit densities on \mathbb{R}^{d}, and we abuse notation slightly to conflate a measure with its density for convenience. The notation a = \mathcal{O}(b) signifies that there exists an absolute constant C>0 such that a\leq Cb, and \widetilde{\mathcal{O}}(\cdot) hides logarithmic factors. Similarly, we write a = \Theta(b) if there exist constants c,C>0 such that cb\leq a\leq Cb, and \widetilde{\Theta}(\cdot) hides logarithmic factors. The stationary measure (in the position coordinate) is \pi\propto\exp(-U), and U is referred to as the potential. We use L^{2}(\pi) to denote the space of test functions f with \mathbb{E}_{\pi}\,f^{2}<\infty, and \mathcal{H}^{1}(\pi) to denote the space of weakly differentiable L^{2}(\pi) functions f with \partial_{x_{i}}f\in L^{2}(\pi) for each i. Finally, the notations \lesssim, \gtrsim, \asymp represent \leq, \geq, = up to absolute constants. Further notation is introduced in subsequent sections.

2.2 Definitions and Assumptions

In this subsection, we will define the relevant processes, divergences, and isoperimetric inequalities. Firstly, we define the ULMC algorithm by the following stochastic differential equation (SDE):

\begin{aligned} \mathrm{d}x_{t} &= v_{t}\,\mathrm{d}t\,, \\ \mathrm{d}v_{t} &= -\gamma v_{t}\,\mathrm{d}t - \nabla U(x_{kh})\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_{t}\,, \end{aligned} \qquad (ULMC)

where t\in[kh,(k+1)h) for some step size h>0. We note that this formulation of ULMC can be integrated in closed form (see Appendix A).
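To make the update concrete, the following is a minimal numpy sketch of one exact ULMC step, transcribing the mean and covariance formulas (A.3)–(A.4) from Appendix A; the function and variable names are ours.

    import numpy as np

    def ulmc_step(x, v, grad_U, gamma, h, rng):
        # One exact update of (ULMC): with the gradient frozen at x_{kh},
        # the SDE is linear in (x, v) and integrates in closed form.
        g = grad_U(x)
        eta = np.exp(-gamma * h)
        # Conditional means, cf. (A.3)-(A.4) in Appendix A.
        mean_x = x + (1 - eta) / gamma * v - (h - (1 - eta) / gamma) / gamma * g
        mean_v = eta * v - (1 - eta) / gamma * g
        # Per-coordinate 2x2 covariance of the Gaussian increment (W^x, W^v).
        cov_xx = 2 / gamma * (h - 2 * (1 - eta) / gamma + (1 - eta**2) / (2 * gamma))
        cov_xv = (1 - eta) ** 2 / gamma
        cov_vv = 1 - eta**2
        C = np.linalg.cholesky(np.array([[cov_xx, cov_xv], [cov_xv, cov_vv]]))
        w = C @ rng.standard_normal((2, x.shape[0]))
        return mean_x + w[0], mean_v + w[1]

    # Usage: sample from a standard Gaussian target, U(x) = ||x||^2 / 2.
    rng = np.random.default_rng(0)
    x, v = np.zeros(10), rng.standard_normal(10)
    for _ in range(1000):
        x, v = ulmc_step(x, v, lambda y: y, gamma=2.0, h=0.1, rng=rng)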

Next, we define a few measures of distance between two probability distributions μ\mu and π\pi on d\mathbb{R}^{d}. We define the total variation distance as

\lVert\mu-\pi\rVert_{\mathsf{TV}} \coloneqq \sup_{A} |\mu(A)-\pi(A)|\,, \qquad (2.1)

where the supremum is taken over Borel measurable sets A\subset\mathbb{R}^{d}. We further define the \mathsf{KL} divergence as

\mathsf{KL}(\mu\mathbin{\|}\pi) \coloneqq \int \frac{\mathrm{d}\mu}{\mathrm{d}\pi} \log\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\,\mathrm{d}\pi\,, \qquad (2.2)

and \mathsf{KL}(\mu\mathbin{\|}\pi) \coloneqq +\infty if \mu is not absolutely continuous with respect to \pi. Finally, we define the Rényi divergence of order q>1 as

\mathcal{R}_{q}(\mu\mathbin{\|}\pi) \coloneqq \frac{1}{q-1}\log\int \Bigl|\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\Bigr|^{q}\,\mathrm{d}\pi\,,

and similarly \mathcal{R}_{q}(\mu\mathbin{\|}\pi) \coloneqq +\infty if \mu \not\ll \pi. The Rényi divergence upper bounds \mathsf{KL}, i.e., \mathsf{KL}(\mu\mathbin{\|}\pi) \leq \mathcal{R}_{q}(\mu\mathbin{\|}\pi) for any order q>1, and \mathcal{R}_{q} is monotonically increasing in q. In particular, the case q = 2 recovers the \chi_{2} divergence via \chi_{2}(\mu\mathbin{\|}\pi) = \exp(\mathcal{R}_{2}(\mu\mathbin{\|}\pi)) - 1.
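As a concrete illustration of these relations, the following small numpy sketch (grid and names ours) evaluates the divergences for two one-dimensional Gaussians, \mu = \mathcal{N}(1,1) and \pi = \mathcal{N}(0,1), for which \mathsf{KL} = 1/2 and \mathcal{R}_{q} = q/2 exactly.

    import numpy as np

    # Quadrature grid; mu = N(1, 1), pi = N(0, 1).
    x = np.linspace(-12.0, 12.0, 200001)
    dx = x[1] - x[0]
    mu = np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)
    pi = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    r = mu / pi  # density ratio d(mu)/d(pi)

    kl = np.sum(r * np.log(r) * pi) * dx          # KL(mu || pi) = 0.5
    def renyi(q):
        return np.log(np.sum(r**q * pi) * dx) / (q - 1)
    chi2 = np.sum((r - 1) ** 2 * pi) * dx         # chi_2(mu || pi)

    print(kl <= renyi(1.5) <= renyi(2.0))         # monotonicity in q: True
    print(np.isclose(chi2, np.exp(renyi(2.0)) - 1))  # chi_2 identity: True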

Our primary results are provided under the following smoothness conditions.

Definition 1 (Smoothness).

The potential U is (L,s)-weakly smooth if U is differentiable and \nabla U is s-Hölder continuous, i.e.,

\lVert\nabla U(x)-\nabla U(y)\rVert \leq L\,\lVert x-y\rVert^{s}\,, \qquad (2.3)

for all x,y\in\mathbb{R}^{d} and some L\geq 0, s\in(0,1]. In the particular case s = 1, we say that the potential is L-smooth, or that \nabla U is L-Lipschitz.

We conduct three lines of analysis. The first assumes strong convexity of the potential, i.e.:

Definition 2 (Strong Convexity).

The potential U is m-strongly convex for some m\geq 0 if for all x,y\in\mathbb{R}^{d}:

\langle \nabla U(x)-\nabla U(y),\, x-y \rangle \geq \frac{m}{2}\,\lVert x-y\rVert^{2}\,.

In the case m = 0 above, we say that U is convex. If a potential U is (strongly) convex, then we say the distribution \pi\propto\exp(-U) is (strongly) log-concave.

A second, strictly more general assumption is the log-Sobolev inequality.

Definition 3 (Log-Sobolev Inequality).

A measure \pi satisfies a log-Sobolev inequality (LSI) with parameter C_{\mathsf{LSI}}>0 if for all g\in\mathcal{H}^{1}(\pi):

\mathsf{ent}_{\pi}(g^{2}) \leq 2 C_{\mathsf{LSI}}\,\operatorname{\mathbb{E}}_{\pi}[\lVert\nabla g\rVert^{2}]\,, \qquad (LSI)

where \mathsf{ent}_{\pi}(g^{2}) \coloneqq \operatorname{\mathbb{E}}_{\pi}[g^{2}\log(g^{2}/\operatorname{\mathbb{E}}_{\pi}[g^{2}])].

An m-strongly convex potential is known to satisfy (LSI) with constant m^{-1} [BGL14]. More generally, we can consider the following weaker isoperimetric inequality, which corresponds to a linearization of (LSI).

Definition 4 (Poincaré Inequality).

A measure \pi satisfies a Poincaré inequality with parameter C_{\mathsf{PI}}>0 if for all g\in\mathcal{H}^{1}(\pi):

\mathsf{var}_{\pi}(g) \leq C_{\mathsf{PI}}\,\operatorname{\mathbb{E}}_{\pi}[\lVert\nabla g\rVert^{2}]\,, \qquad (PI)

where \mathsf{var}_{\pi}(g) = \operatorname{\mathbb{E}}_{\pi}[\lvert g-\operatorname{\mathbb{E}}_{\pi}[g]\rvert^{2}].

Conditions (LSI) and (PI) are standard assumptions on the stationary distribution in the theory of Markov diffusions as well as in sampling [BGL14, VW19, Che+21, Che23]. They are known to be satisfied by a broad class of targets, such as log-concave distributions or certain mixture distributions [Che21, CCN21].

We define the condition number for an m-strongly log-concave target with (L,s)-weakly smooth potential as \kappa \triangleq L/m. In the case where, instead of strong convexity, the target only satisfies (LSI) (respectively (PI)), the condition number is instead \kappa \triangleq C_{\mathsf{LSI}} L (respectively \kappa \triangleq C_{\mathsf{PI}} L).
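For example, for a Gaussian target \pi = \mathcal{N}(0,\Sigma) with potential U(x) = \frac{1}{2}\,x^{\top}\Sigma^{-1}x, we have m = \lambda_{\min}(\Sigma^{-1}), L = \lambda_{\max}(\Sigma^{-1}), and C_{\mathsf{LSI}} = C_{\mathsf{PI}} = \lambda_{\max}(\Sigma), so all three definitions agree:

\kappa = \frac{L}{m} = C_{\mathsf{LSI}} L = C_{\mathsf{PI}} L = \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}\,.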

Finally, we collect several mild assumptions to simplify computing the bounds below, which have also appeared in prior work; see in particular the discussion in [Che+21, Appendix A].

Assumption 1.

The expectation of the norm (in the position coordinate) is quantitatively bounded: \operatorname{\mathbb{E}}_{\pi}[\lVert\cdot\rVert] \leq \mathfrak{m} = \widetilde{\mathcal{O}}(d) for some constant \mathfrak{m}<\infty. (This holds, for instance, when U(x) = \lVert x\rVert^{\alpha} for 1\leq\alpha\leq 2.) Furthermore, we assume that \nabla U(0) = 0 (without loss of generality), and that U(0)-\min U = \widetilde{\mathcal{O}}(d).

3 Main Theorems

In the sequel, we always take the initial distribution of the momentum \rho_{0} to be equal to the stationary distribution \rho \propto \exp(-\lVert\cdot\rVert^{2}/2). Then, under Assumption 1, we can find an initial distribution \pi_{0} for the position which is a centered Gaussian with variance specified in Appendix D, such that \pi_{0} has an appropriately bounded initial divergence (e.g., \mathsf{KL}, \mathcal{R}_{q}) with respect to \pi. Lastly, we initialize ULMC by sampling from the product distribution \mu_{0}(x,v) = \pi_{0}(x) \times \rho_{0}(v), i.e., with x and v independent.

3.1 Convergence in \mathsf{KL} and \mathsf{TV}

In order to state our results for ULMC in \mathsf{KL} and \mathsf{TV}, we leverage the following continuous-time result from [Ma+21], which relies on an entropic hypocoercivity argument, after a time change of the coordinates (see Appendix B.1 for a proof).

Lemma 5 (Adapted from [Ma+21, Proposition 1]).

Define the Lyapunov functional

\mathcal{F}(\mu'\mathbin{\|}\mu) \triangleq \mathsf{KL}(\mu'\mathbin{\|}\mu) + \operatorname{\mathbb{E}}_{\mu'}\bigl[\bigl\lVert \mathfrak{M}^{1/2}\,\nabla\log\frac{\mu'}{\mu} \bigr\rVert^{2}\bigr]\,, \quad\text{where}\quad \mathfrak{M} = \begin{bmatrix} \frac{1}{4L} & \frac{1}{\sqrt{2L}} \\ \frac{1}{\sqrt{2L}} & 4 \end{bmatrix} \otimes I_{d}\,. \qquad (3.1)

For targets \pi that are L-smooth and satisfy (LSI) with parameter C_{\mathsf{LSI}}, let \gamma = 2\sqrt{2L}. Then the law \mu_{t} of ULD satisfies

\partial_{t}\mathcal{F}(\mu_{t}\mathbin{\|}\mu) \leq -\frac{1}{10 C_{\mathsf{LSI}}\,\sqrt{2L}}\,\mathcal{F}(\mu_{t}\mathbin{\|}\mu)\,.
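Integrating this differential inequality (Grönwall's lemma) and using \mathsf{KL} \leq \mathcal{F} yields the continuous-time rate that we discretize below:

\mathsf{KL}(\mu_{t}\mathbin{\|}\mu) \leq \mathcal{F}(\mu_{t}\mathbin{\|}\mu) \leq \exp\Bigl(-\frac{t}{10 C_{\mathsf{LSI}}\sqrt{2L}}\Bigr)\,\mathcal{F}(\mu_{0}\mathbin{\|}\mu)\,,

so that a horizon T = \widetilde{\Theta}(C_{\mathsf{LSI}}\sqrt{L}) suffices for the continuous dynamics.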

We now proceed to state our main results more precisely. First, we obtain the following KL divergence guarantee under strong log-concavity and smoothness.

Theorem 6 (Convergence in \mathsf{KL} under Strong Log-Concavity).

Let the potential U be m-strongly convex and L-smooth, and additionally satisfy Assumption 1. Then, for

h = \widetilde{\Theta}\Bigl(\frac{\epsilon m^{1/2}}{L d^{1/2}}\Bigr) \quad\text{and}\quad \gamma \asymp \sqrt{L}\,,

the following holds for \hat{\mu}_{Nh}, the law of the N-th iterate of ULMC initialized at a centered Gaussian (with variance specified in Appendix D):

\mathsf{KL}(\hat{\mu}_{Nh}\mathbin{\|}\mu) \leq \epsilon^{2} \qquad\text{after}\qquad N = \widetilde{\Theta}\Bigl(\frac{\kappa^{3/2}\,d^{1/2}}{\epsilon}\Bigr) \quad\text{iterations}\,.
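To see roughly where the iteration count comes from, note that by the continuous-time rate above (with C_{\mathsf{LSI}} = 1/m), a horizon T = \widetilde{\Theta}(\sqrt{L}/m) suffices, so

N = \frac{T}{h} = \widetilde{\Theta}\Bigl(\frac{\sqrt{L}}{m} \cdot \frac{L d^{1/2}}{\epsilon m^{1/2}}\Bigr) = \widetilde{\Theta}\Bigl(\frac{\kappa^{3/2}\,d^{1/2}}{\epsilon}\Bigr)\,.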

Here, we justify the choice of error tolerance \epsilon^{2} for \mathsf{KL}. By Pinsker's and Talagrand's transport inequalities, \mathsf{KL} is on the order of \mathsf{TV}^{2} and \mathcal{W}_{2}^{2}. Hence, this allows for a fair comparison of convergence guarantees in \mathsf{KL} with those in \mathsf{TV} and \mathcal{W}_{2}. Weakening the strong convexity assumption to (LSI), we obtain a result in \mathsf{TV}.

Theorem 7 (Convergence in \mathsf{TV} under (LSI)).

Let the potential be L-smooth, satisfy (LSI) with constant C_{\mathsf{LSI}}, and satisfy Assumption 1. Then, for

h = \widetilde{\Theta}\Bigl(\frac{\epsilon}{C_{\mathsf{LSI}}^{1/2} L d^{1/2}}\Bigr) \quad\text{and}\quad \gamma \asymp \sqrt{L}\,,

the following holds for \hat{\mu}_{Nh}, the law of the N-th iterate of ULMC initialized at a centered Gaussian (with variance specified in Appendix D):

\lVert\hat{\mu}_{Nh}-\mu\rVert_{\mathsf{TV}} \leq \epsilon \qquad\text{after}\qquad N = \widetilde{\Theta}\Bigl(\frac{C_{\mathsf{LSI}}^{3/2}\,L^{3/2}\,d^{1/2}}{\epsilon}\Bigr) \quad\text{iterations}\,.

3.2 Convergence in \mathcal{R}_{q} and Improving the Dependence on \kappa

To state our convergence results in \mathcal{R}_{q}, we additionally inherit the following technical assumption from [CLW20].

Assumption 2.

\mathcal{H}^{1}(\mu) \hookrightarrow L^{2}(\mu) is a compact embedding. Secondly, assume that U is twice continuously differentiable, and that for all x\in\mathbb{R}^{d}, we have

\lVert\nabla^{2}U(x)\rVert \leq \mathfrak{L}\,(1+\lVert\nabla U(x)\rVert)\,.
Remark.

[Hoo81, Theorem 3.1] shows that the first part of this assumption is always satisfied if the potential has super-linear tail growth, i.e., U(x) \propto \lVert x\rVert^{\alpha} for \alpha>1 and large \lVert x\rVert. In the case where the tail growth is strictly linear, we can instead construct an arbitrarily close approximation with super-linear tails; thus, the assumption generically holds for all targets we consider in this work. As also remarked in [CLW20], it is required solely for technical reasons and is likely not a necessary condition.

The second part of the assumption is satisfied under L-smoothness of the potential with the same constant. In the convex case, or when \nabla^{2}U is bounded below, the constant \mathfrak{L} does not show up in the bounds. As a result, for weakly smooth potentials in this setting, we can approximate the potential by twice differentiable potentials to obtain a rate estimate.

In light of the above discussion, we emphasize that this additional assumption largely does not hinder the applicability of our results. Under this assumption, [CLW20] established the following guarantee for (1) in continuous time.

Lemma 8 (Rapid Convergence in L^{2}; Adapted from [CLW20, Theorem 1]).

Under Assumption 2, and if \pi additionally satisfies (PI) with constant C_{\mathsf{PI}}, then the following holds for the law \mu_{t} of ULD initialized at \mu_{0}, where C_{0}>0 is an absolute constant:

\chi_{2}(\mu_{t}\mathbin{\|}\mu) \leq C_{0} \exp\bigl(-\mathfrak{q}(\gamma)\,t\bigr)\,\chi_{2}(\mu_{0}\mathbin{\|}\mu)\,,

where the coefficient inside the exponent is

\mathfrak{q}(\gamma) \coloneqq \frac{C_{\mathsf{PI}}^{-1}\gamma}{C_{0}\,(C_{\mathsf{PI}}^{-1}+R^{2}+\gamma^{2})}\,, \qquad (3.2)

and the constant R is

R = \begin{cases} 0 & \text{if } U \text{ is convex}\,, \\ \sqrt{K} & \text{if } \inf_{x\in\mathbb{R}^{d}} \nabla^{2}U(x) \succeq -K I_{d}\,, \\ \mathfrak{L}\sqrt{d} & \text{if } \lVert\nabla^{2}U(x)\rVert_{\mathrm{op}} \leq \mathfrak{L}\,(1+\lVert\nabla U(x)\rVert) \text{ for all } x\in\mathbb{R}^{d}\,. \end{cases}
Remark.

In the strongly log-concave case, Lemma 8 actually yields a better decay rate, of order \sqrt{m}, than Lemma 5, which has rate of order m/\sqrt{L}.
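The dependence on \gamma in Theorem 9 below comes from maximizing the rate \mathfrak{q}(\gamma) in (3.2): since \gamma \mapsto \gamma/(C_{\mathsf{PI}}^{-1}+R^{2}+\gamma^{2}) is maximized at \gamma = \sqrt{C_{\mathsf{PI}}^{-1}+R^{2}},

\max_{\gamma>0} \mathfrak{q}(\gamma) = \frac{C_{\mathsf{PI}}^{-1}}{2C_{0}\,\sqrt{C_{\mathsf{PI}}^{-1}+R^{2}}}\,,

which in the convex case (R = 0) gives \mathfrak{q} \asymp 1/\sqrt{C_{\mathsf{PI}}} at \gamma \asymp 1/\sqrt{C_{\mathsf{PI}}}.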

Our final result leverages the above accelerated convergence guarantees of ULD, and establishes the first bound for ULMC in Rényi divergence with an improved condition number dependence.

Theorem 9 (Convergence in \mathcal{R}_{q} under (PI)).

Let the potential be (L,s)-weakly smooth, satisfy (PI) with constant C_{\mathsf{PI}}, and satisfy Assumption 1, together with the additional technical condition in Assumption 2. Then, for \xi\in(0,1) and

h = \widetilde{\Theta}\Bigl(\frac{\gamma^{1/(2s)}\,\epsilon^{1/s}\,\xi^{1/s}\,\mathfrak{q}(\gamma)^{1/(2s)}}{L^{1/s}\,d^{1/2}\,(L\vee d)^{1/(2s)}}\Bigr)\,,

the following holds for \hat{\mu}_{Nh}, the law of the N-th iterate of ULMC initialized at a centered Gaussian (variance specified in Appendix D), for q = 2-\xi \in [1,2) and with \mathfrak{q} defined in (3.2):

\mathcal{R}_{q}(\hat{\mu}_{Nh}\mathbin{\|}\mu) \leq \epsilon^{2} \qquad\text{after}\qquad N = \widetilde{\Theta}\Bigl(\frac{L^{1/s}\,d^{1/2}\,(L\vee d)^{1+1/(2s)}}{\gamma^{1/(2s)}\,\epsilon^{1/s}\,\xi^{1/s}\,\mathfrak{q}(\gamma)^{1+1/(2s)}}\Bigr) \quad\text{iterations}\,.
Remark.

The optimal choice is to take \gamma \asymp \sqrt{C_{\mathsf{PI}}^{-1}+R^{2}}. If the potential U is convex, then we set \gamma \asymp 1/\sqrt{C_{\mathsf{PI}}}, so that \mathfrak{q}(\gamma) \asymp 1/\sqrt{C_{\mathsf{PI}}}, which is known to be an optimal choice [CLW20]. As a result, in the convex and smooth case, the iteration complexity has linear dependence on the condition number \kappa, which improves upon the \kappa^{2} dependence seen in [Che+21]. The dependence on the dimension d and the error tolerance \epsilon is also improved.

4 Examples

Example 10.

We consider the potential U(x) = \sqrt{1+\lVert x\rVert^{2}}, which satisfies (PI) with constant \mathcal{O}(d) [Bob03] and is (1,1)-smooth. Assuming the compact embedding condition of Assumption 2, Theorem 9 gives a complexity of \widetilde{\mathcal{O}}(d^{3}\xi^{-1}\epsilon^{-1}) for \epsilon^{2}-guarantees in \mathcal{R}_{2-\xi} after optimizing over \gamma, since in this case the potential is log-concave. Here, the dimension dependence matches that of the proximal sampler with rejection sampling [Che+22, Corollary 8], which is \widetilde{\mathcal{O}}(d^{3}), and it surpasses [Che+21, Theorem 8], which only obtains \widetilde{\mathcal{O}}(d^{4}\epsilon^{-2}) for the same guarantees. However, it is important to note that the latter two works obtain these guarantees for Rényi divergences of any order, whereas our results for ULMC are presently limited to order q = 2-\xi < 2.

Example 11.

Consider an m-strongly log-concave and L-log-smooth distribution. Non-trivial examples arise in Bayesian regression (see, e.g., [Dal17a, Section 6]); we examine the first one, where \pi(x) \propto \exp(-\lVert x-\boldsymbol{a}\rVert^{2}/2) + \exp(-\lVert x+\boldsymbol{a}\rVert^{2}/2) for some \boldsymbol{a}\in\mathbb{R}^{d} with \lVert\boldsymbol{a}\rVert = 1/3. Here, our Theorem 6 gives a complexity of N = \widetilde{\mathcal{O}}(d^{1/2}\epsilon^{-1}) to obtain an \epsilon^{2}-guarantee in \mathsf{KL} divergence. In contrast, the Hessian is \nabla^{2}U(x) = I_{d} - 4\boldsymbol{a}\boldsymbol{a}^{\top}\exp(2x^{\top}\boldsymbol{a})/(1+\exp(2x^{\top}\boldsymbol{a}))^{2}, which has L_{\mathsf{H}} \asymp d, where L_{\mathsf{H}} is the Lipschitz constant of the Hessian in the Frobenius norm. Consequently, [Ma+21, Theorem 1], which is stated as N = \widetilde{\mathcal{O}}(d^{1/2} L_{\mathsf{H}} m^{-2} \epsilon^{-1}), gives N = \widetilde{\mathcal{O}}(d^{3/2}\epsilon^{-1}) for the same \epsilon^{2}-accuracy guarantee, which is worse in the dimension dependence. Finally, the discretization bound of [GT20, Theorem 28], combined with our continuous-time results (using the same proof technique as Theorem 6), yields N = \widetilde{\mathcal{O}}(d^{1/2}\epsilon^{-2}) iterations, which is suboptimal in the order of \epsilon but has the same dimension dependence.

Example 12.

We can also analyze L-smooth distributions satisfying a log-Sobolev inequality with parameter C_{\mathsf{LSI}}. One such instance arises from bounded perturbations of strongly convex potentials. Let U_{\boldsymbol{a}} be the potential of the target in Example 11, and consider a target with modified potential U_{\boldsymbol{a}} + f, where \sup_{x} \lvert f(x)\rvert \vee \lVert\nabla f(x)\rVert \vee \lVert\nabla^{2}f(x)\rVert_{\mathsf{op}} \leq \mathfrak{B} for some \mathfrak{B}<\infty, and \nabla^{2}f is \mathcal{O}(d)-Lipschitz in the Frobenius norm. We can bound the log-Sobolev constant of this potential using the Holley–Stroock lemma [HS87]. Let this new potential have condition number \kappa. We achieve \epsilon-accuracy in \mathsf{TV} distance with N = \widetilde{\mathcal{O}}(\kappa^{3/2} d^{1/2} \epsilon^{-1}). For comparison, the previous bound [Ma+21, Theorem 1] gives N = \widetilde{\mathcal{O}}(\kappa^{2} d^{3/2} \epsilon^{-1}) for the same guarantee in \mathsf{TV}, which is worse in the dimension. However, note that the guarantees in [Ma+21, Theorem 1] are in \mathsf{KL}, which is stronger than \mathsf{TV}. Finally, we note that [GT20] requires strong log-concavity, and hence cannot provide a guarantee in this setting.

Example 13.

Consider a (1,s)-weakly log-smooth target that is log-concave and satisfies a Poincaré inequality with C_{\mathsf{PI}} = \mathcal{O}(d). Theorem 9 then yields N = \widetilde{\mathcal{O}}(d^{2+1/s}\xi^{-1/s}\epsilon^{-1/s}) to obtain \epsilon^{2}-guarantees in \mathcal{R}_{2-\xi}, whereas [Che+21, Theorem 7] yields N = \widetilde{\mathcal{O}}(d^{3+2/s}\epsilon^{-2/s}) for the same guarantees, which is worse in both parameters. On the other hand, take the specific case of a distribution with potential U(x) = \lVert x\rVert^{\alpha}, which has C_{\mathsf{PI}} = \mathcal{O}(d^{2/\alpha-1}) [Bob03], is log-concave, and is (1,\alpha-1)-weakly log-smooth. Theorem 9 then yields N = \widetilde{\mathcal{O}}(d^{\alpha/(\alpha-1)}\xi^{-1/(\alpha-1)}\epsilon^{-1/(\alpha-1)}) for \epsilon^{2}-accuracy guarantees in \mathcal{R}_{2-\xi} divergence. This is worse by a factor of d than the rate estimate obtained in [Che+21, Example 9], as the latter leverages a stronger class of functional inequalities interpolating between (PI) and (LSI), which our analysis cannot capture. Our guarantee is still better in terms of the \epsilon-dependence.

5 Proof Sketches

5.1 Continuous Time Results

For results under both the Poincaré and log-Sobolev inequalities, we leverage the existing continuous-time results of [CLW20, Ma+21], presented above as Lemmas 5 and 8. These allow us to bound \chi_{2}(\mu_{t}\mathbin{\|}\mu) and \mathsf{KL}(\mu_{t}\mathbin{\|}\mu) by exponentially decaying quantities.

With the additional assumption of strong convexity, we can obtain a contraction in an alternate system of coordinates (\phi,\psi) \triangleq \mathcal{M}(x,v) \triangleq (x, x+\frac{2}{\gamma}\,v) (see Appendix B). This allows us to consider the distributions of the continuous-time iterates and the target in these alternate coordinates, denoted \mu_{t}^{\mathcal{M}} and \mu^{\mathcal{M}} respectively. From this, we obtain the following proposition.

Proposition 14 (Log-Sobolev Inequality Along the Trajectory).

Suppose U is m-strongly convex and L-smooth. Let \mu_{t}^{\mathcal{M}} denote the law of the continuous-time underdamped Langevin diffusion with \gamma = c\sqrt{L} for c \geq \sqrt{2}, in the (\phi,\psi) coordinates. If the initial distribution \mu_{0} satisfies (LSI) (in the altered coordinates) with constant C_{\mathsf{LSI}}(\mu_{0}^{\mathcal{M}}), then \{\mu_{t}^{\mathcal{M}}\}_{t\geq 0} satisfies (LSI) with constant uniformly bounded as

C_{\mathsf{LSI}}(\mu_{t}^{\mathcal{M}}) \leq \exp\Bigl(-m\sqrt{\frac{2}{L}}\,t\Bigr)\,C_{\mathsf{LSI}}(\mu_{0}^{\mathcal{M}}) + \frac{2}{m}\,.

The main idea behind the proof of this proposition is to analyze the discretization (ULMC) of the underdamped Langevin diffusion in the coordinates (\phi,\psi). The update can be written in the following form, for some matrix \overline{\Sigma}\in\mathbb{R}^{2d\times 2d} and function \bar{F}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d}\times\mathbb{R}^{d}:

(\phi_{(k+1)h}, \psi_{(k+1)h}) \stackrel{d}{=} \bar{F}(\phi_{kh},\psi_{kh}) + \mathcal{N}(0,\overline{\Sigma})\,.

This is the composition of a deterministic map \bar{F}, giving the mean of the next ULMC iterate started at (\phi,\psi), followed by the addition of a Gaussian, giving the variance of the resulting iterate. In particular, we show that for the coordinates (\phi(x,v),\psi(x,v)) \triangleq (x, x+\frac{2}{\gamma}v), the map \bar{F} is a strict contraction, in the sense that

\lVert\bar{F}\rVert_{\mathrm{Lip}} \leq 1 - \frac{m}{\sqrt{2L}}\,h + \mathcal{O}(Lh^{2})\,,

where, by abuse of notation, \bar{F}:\mathbb{R}^{2d}\to\mathbb{R}^{2d}, and the seminorm \lVert g\rVert_{\mathrm{Lip}} of a function g:\mathbb{R}^{2d}\to\mathbb{R}^{2d} denotes its Lipschitz constant.

Since \bar{F} is a contraction for small enough h, each pushforward improves the log-Sobolev constant by a multiplicative factor [VW19, Lemma 19]. At the same time, a Gaussian convolution can only worsen the log-Sobolev constant by an additive constant [Cha04, Corollary 3.1]. Consequently, the log-Sobolev constants of the iterates form a (truncated) geometric sum, which can be bounded by the corresponding infinite series; incidentally, this also bounds the log-Sobolev constant of the ULMC iterates (see the schematic recursion at the end of this subsection). Taking the limit h\to 0 while keeping Nh = t fixed, we arrive at the bound stated in the proposition. Considering the decomposition of the \mathsf{KL} divergence, a simple application of Cauchy–Schwarz then tells us that

\begin{aligned} \mathsf{KL}(\hat{\mu}_{t}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}}) &= \int \log\frac{\hat{\mu}_{t}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\,\mathrm{d}\hat{\mu}_{t}^{\mathcal{M}} = \mathsf{KL}(\hat{\mu}_{t}^{\mathcal{M}}\mathbin{\|}\mu_{t}^{\mathcal{M}}) + \int \log\frac{\mu_{t}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\,\mathrm{d}\hat{\mu}_{t}^{\mathcal{M}} \\ &\leq \mathsf{KL}(\hat{\mu}_{t}^{\mathcal{M}}\mathbin{\|}\mu_{t}^{\mathcal{M}}) + \mathsf{KL}(\mu_{t}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}}) + \sqrt{\chi^{2}(\hat{\mu}_{t}^{\mathcal{M}}\mathbin{\|}\mu_{t}^{\mathcal{M}}) \times \mathsf{var}_{\mu_{t}^{\mathcal{M}}}\Bigl(\log\frac{\mu_{t}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\Bigr)}\,. \end{aligned}

The log-Sobolev inequality for \mu_{t}^{\mathcal{M}} implies a Poincaré inequality, which allows us to bound the variance term by the Fisher information \mathsf{FI}(\mu_{t}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}}) = \operatorname{\mathbb{E}}_{\mu_{t}^{\mathcal{M}}} \lVert\nabla\log(\mu_{t}^{\mathcal{M}}/\mu^{\mathcal{M}})\rVert^{2}. The Fisher information can in turn be bounded via the same entropic hypocoercivity argument from [Ma+21] used for our \mathsf{TV} bounds, while the remaining two terms are handled by the discretization analysis and, again, the entropic hypocoercivity argument, respectively.
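Schematically, writing C_{k} for the log-Sobolev constant of the k-th iterate in the (\phi,\psi) coordinates, \alpha = \lVert\bar{F}\rVert_{\mathrm{Lip}} for the contraction factor above, and \sigma_{h}^{2} = \lVert\overline{\Sigma}\rVert_{\mathrm{op}}, the two lemmas quoted above give the recursion

C_{k+1} \leq \alpha^{2} C_{k} + \sigma_{h}^{2} \qquad\Longrightarrow\qquad C_{N} \leq \alpha^{2N} C_{0} + \frac{\sigma_{h}^{2}}{1-\alpha^{2}}\,,

and taking h\to 0 with Nh = t fixed recovers the exponential-plus-constant bound of Proposition 14.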

5.2 Discretization Analysis

The main result we use to control the discretization error can be found below.

Proposition 15.

Let (\hat{\mu}_{t})_{t\geq 0} denote the law of (ULMC) and let (\mu_{t})_{t\geq 0} denote the law of the continuous-time underdamped Langevin diffusion (1), both initialized at some \mu_{0}. Assume that the potential U is (L,s)-weakly smooth. If the step size h satisfies

h = \widetilde{\mathcal{O}}_{s}\Bigl(\frac{\gamma^{1/(2s)}\,\epsilon^{1/s}}{L^{1/s}\,T^{1/(2s)}\,(d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)}))^{1/2}}\Bigr)\,, \qquad (5.1)

where the notation \widetilde{\mathcal{O}}_{s} hides constants depending on s as well as polylogarithmic factors including \log N, and \mu^{(a)} is a modified target distribution (see Appendix C.3 for details), then

\mathcal{R}_{q}(\hat{\mu}_{T}\mathbin{\|}\mu_{T}) \leq \epsilon^{2}\,.
Remark.

The condition on h depends on N only through logarithmic factors. Secondly, the proposition holds under generic assumptions and can be combined with continuous-time results in \mathcal{R}_{q} in any setting, such as under the log-Sobolev or Latała–Oleszkiewicz inequalities considered in [Che+21].

We outline the proof of this result below. Similarly to [Che+21], we first invoke the data-processing inequality, which bounds the Rényi divergence between the time marginals of the iterates by the Rényi divergence between the path measures:

\mathcal{R}_{q}(\hat{\mu}_{T}\mathbin{\|}\mu_{T}) \leq \mathcal{R}_{q}(P_{T}\mathbin{\|}Q_{T})\,,

where P_{T}, Q_{T} are the probability measures of (ULMC) and (1), respectively, on the space of paths C([0,T],\mathbb{R}^{2d}). Subsequently, we invoke Girsanov's theorem, which allows us to bound the pathwise divergence by the difference between the drifts of the two processes:

\mathcal{R}_{2q}(P_{T}\mathbin{\|}Q_{T}) \lesssim \log \operatorname{\mathbb{E}} \exp\Bigl(\frac{4q^{2}}{\gamma} \int_{0}^{T} \lVert\nabla U(x_{t}) - \nabla U(x_{\lfloor t/h\rfloor h})\rVert^{2}\,\mathrm{d}t\Bigr)\,.

It remains to bound the term inside the expectation. We achieve this by conditioning on the event that \sup_{t\in[0,T]} \lVert x_{t} - x_{\lfloor t/h\rfloor h}\rVert^{2} is bounded by a quantity vanishing as h\to 0, which we must show occurs with sufficiently high probability. To do so, we begin with a single-step analysis, i.e., we bound the above for T\leq h. Compared to LMC, the main gain is that the SDEs (1) and (ULMC) match exactly in the position coordinate, while the difference between the drifts appears only in the momentum. After integrating the momentum, the order of the error in the position coordinate is better: the dominant term is \mathcal{O}(dh^{2}), compared to the \mathcal{O}(dh) seen in [Che+21, Lemma 24].
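Heuristically, the improvement can be seen from the position increment: for t\in[kh,(k+1)h), we have x_{t} - x_{kh} = \int_{kh}^{t} v_{r}\,\mathrm{d}r, so when the momentum is near its stationary Gaussian (\operatorname{\mathbb{E}}\lVert v_{kh}\rVert^{2} \asymp d),

\operatorname{\mathbb{E}}\,\lVert x_{t} - x_{kh}\rVert^{2} \lesssim h^{2}\,\operatorname{\mathbb{E}}\,\lVert v_{kh}\rVert^{2} + h^{4}\,\operatorname{\mathbb{E}}\,\lVert\nabla U(x_{kh})\rVert^{2} + \gamma d h^{3} = \mathcal{O}(dh^{2})\,,

whereas the corresponding LMC increment contains a Brownian term of variance 2(t-kh)\,d = \mathcal{O}(dh).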

The technique for extending this analysis from a single step to the full time interval closely follows [Che+21]. In particular, we obtain a dependence of \lVert x_{t}\rVert on \lVert x_{kh}\rVert for t in the interval [kh,(k+1)h). Controlling the latter is quite delicate when the potential satisfies only a Poincaré inequality, since it is equivalent to showing sub-Gaussian tail bounds on the iterates, while the target itself is not sub-Gaussian in the position coordinate. By comparing against an auxiliary potential, we show that for our choice of initialization, the iterates remain sub-Gaussian for all iterations up to N (albeit with a growing constant). This yields the discretization result in the proposition above.

6 Conclusion

This work provides state-of-the-art convergence guarantees for the underdamped Langevin Monte Carlo algorithm in several regimes. Our discretization analysis (Proposition 15) is generic and extends to any order of Rényi divergence, under various conditions on the potential (Latała–Oleszkiewicz inequalities, weak smoothness, etc.). Consequently, our results serve as a key step towards a complete understanding of the ULMC algorithm. However, limitations of the current continuous-time techniques do not permit us to obtain stronger iteration complexity results. More specifically, it is not understood how to analyze Rényi divergences of order greater than 2, or whether hypercontractive decay is possible when the potential satisfies a log-Sobolev inequality. Secondly, our discretization approach via Girsanov's theorem is currently suboptimal in the condition number (a fact noted in [Che+21]), and thus does not obtain the expected dependence of \sqrt{\kappa} after discretization. An improvement in the proof techniques would be necessary to sharpen this result. We believe the results and techniques developed in this work will stimulate future research.

Acknowledgements

We thank Jason M. Altschuler, Alain Durmus, and Aram-Alexandre Pooladian for helpful conversations. KB was supported by NSF grant DMS-2053918. SC was supported by the NSF TRIPODS program (award DMS-2022448). MAE was supported by NSERC Grant [2019-06167], the Connaught New Researcher Award, the CIFAR AI Chairs program, and the CIFAR AI Catalyst grant. ML was supported by the Ontario Graduate Scholarship and Vector Institute.


References

  • [AGS22] Simon Apers, Sander Gribling and Dániel Szilágyi “Hamiltonian Monte Carlo for efficient Gaussian sampling: long and random steps” In arXiv preprint arXiv:2209.12771, 2022
  • [Alb+19] Dallas Albritton, Scott Armstrong, Jean-Christophe Mourrat and Matthew Novack “Variational methods for the kinetic Fokker–Planck equation” In arXiv preprint arXiv:1902.04037, 2019
  • [Ber+22] Etienne Bernard, Max Fathi, Antoine Levitt and Gabriel Stoltz “Hypocoercivity with Schur complements” In Annales Henri Lebesgue 5, 2022, pp. 523–557
  • [BGL14] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” 348, Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] Springer, Cham, 2014, pp. xx+552
  • [BLM13] Stéphane Boucheron, Gábor Lugosi and Pascal Massart “Concentration inequalities” A nonasymptotic theory of independence, With a foreword by Michel Ledoux Oxford University Press, Oxford, 2013, pp. x+481
  • [BM22] Nawaf Bou-Rabee and Milo Marsden “Unadjusted Hamiltonian MCMC with stratified Monte Carlo time integration” In arXiv preprint arXiv:2211.11003, 2022
  • [Bob03] Sergey G. Bobkov “Spectral gap and concentration for some spherically symmetric probability measures” In Geometric aspects of functional analysis 1807, Lecture Notes in Math. Springer, Berlin, 2003, pp. 37–43
  • [CCN21] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11, 2021, pp. 109236
  • [CD76] Jagdish Chandra and Paul W. Davis “Linear generalizations of Gronwall’s inequality” In Proceedings of the American Mathematical Society 60.1, 1976, pp. 157–160
  • [Cha04] Djalil Chafai “Entropies, convexity, and functional inequalities: on Φ\Phi-entropies and Φ\Phi-Sobolev inequalities” In J. Math. Kyoto Univ. 44.2, 2004, pp. 325–363
  • [Che+18] Xiang Cheng et al. “Sharp convergence rates for Langevin dynamics in the nonconvex setting” In arXiv preprint arXiv:1805.01648, 2018
  • [Che+18a] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett and Michael I. Jordan “Underdamped Langevin MCMC: a non-asymptotic analysis” In Conference on Learning Theory, 2018, pp. 300–323 PMLR
  • [Che+21] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev” In arXiv preprint arXiv:2112.12662, 2021
  • [Che+22] Yongxin Chen, Sinho Chewi, Adil Salim and Andre Wibisono “Improved analysis for a proximal algorithm for sampling” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2984–3014
  • [Che21] Yuansi Chen “An almost constant lower bound of the isoperimetric coefficient in the KLS conjecture” In Geom. Funct. Anal. 31.1, 2021, pp. 34–61
  • [Che23] Sinho Chewi “Log-concave sampling” Book draft available at https://chewisinho.github.io/, 2023
  • [CLW20] Yu Cao, Jianfeng Lu and Lihan Wang “On explicit L2L^{2}-convergence rate estimate for underdamped Langevin dynamics” In arXiv e-prints, 2020
  • [Dal17] Arnak S. Dalalyan “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 678–689
  • [Dal17a] Arnak S. Dalalyan “Theoretical guarantees for approximate sampling from smooth and log-concave densities” In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79.3 Wiley Online Library, 2017, pp. 651–676
  • [DMS09] Jean Dolbeault, Clément Mouhot and Christian Schmeiser “Hypocoercivity for kinetic equations with linear relaxation terms” In Comptes Rendus Mathematique 347.9-10 Elsevier, 2009, pp. 511–516
  • [DMS15] Jean Dolbeault, Clément Mouhot and Christian Schmeiser “Hypocoercivity for linear kinetic equations conserving mass” In Transactions of the American Mathematical Society 367.6, 2015, pp. 3807–3828
  • [DR20] Arnak S. Dalalyan and Lionel Riou-Durand “On sampling from a log-concave density using kinetic Langevin diffusions” In Bernoulli 26.3, Bernoulli Society for Mathematical Statistics and Probability, 2020, pp. 1956–1988
  • [EH21] Murat A. Erdogdu and Rasa Hosseinzadeh “On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 1776–1822
  • [EHZ22] Murat A. Erdogdu, Rasa Hosseinzadeh and Shunshi Zhang “Convergence of Langevin Monte Carlo in chi-squared and Rényi divergence” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics 151, Proceedings of Machine Learning Research PMLR, 2022, pp. 8151–8175
  • [EMS18] Murat A Erdogdu, Lester Mackey and Ohad Shamir “Global non-convex optimization with discretized diffusions” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 9694–9703
  • [FLO21] James Foster, Terry Lyons and Harald Oberhauser “The shifted ODE method for underdamped Langevin MCMC” In arXiv preprint arXiv:2101.03446, 2021
  • [FRS22] James Foster, Goncalo dos Reis and Calum Strange “High order splitting methods for SDEs satisfying a commutativity condition” In arXiv preprint arXiv:2210.17543, 2022
  • [GT20] Arun Ganesh and Kunal Talwar “Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 7222–7233
  • [HBE20] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “On the ergodicity, bias and asymptotic normality of randomized midpoint sampling method” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 7366–7376
  • [Hér06] Frédéric Hérau “Hypocoercivity and exponential time decay for the linear inhomogeneous relaxation Boltzmann equation” In Asymptotic Analysis 46.3-4 IOS Press, 2006, pp. 349–359
  • [Hoo81] James G. Hooton “Compact Sobolev imbeddings on finite measure spaces” In Journal of Mathematical Analysis and Applications 83.2 Elsevier, 1981, pp. 570–581
  • [Hör67] Lars Hörmander “Hypoelliptic second order differential equations” In Acta Mathematica 119 Institut Mittag-Leffler, 1967, pp. 147–171
  • [HS87] Richard Holley and Daniel Stroock “Logarithmic Sobolev inequalities and stochastic Ising models” In J. Statist. Phys. 46.5-6, 1987, pp. 1159–1194
  • [JKO98] Richard Jordan, David Kinderlehrer and Felix Otto “The variational formulation of the Fokker–Planck equation” In SIAM Journal on Mathematical Analysis 29.1 SIAM, 1998, pp. 1–17
  • [JLS23] Tim Johnston, Iosif Lytras and Sotirios Sabanis “Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity–the Strongly Convex Case” In arXiv preprint arXiv:2301.08039, 2023
  • [JP10] Michael Johannes and Nicholas Polson “MCMC methods for continuous-time financial econometrics” In Handbook of financial econometrics: applications Elsevier, 2010, pp. 1–72
  • [Kol34] Andrey Kolmogorov “Zufallige bewegungen (zur theorie der Brownschen bewegung)” In Annals of Mathematics JSTOR, 1934, pp. 116–117
  • [KPB20] Ivan Kobyzev, Simon JD Prince and Marcus A. Brubaker “Normalizing flows: an introduction and review of current methods” In IEEE Transactions on Pattern Analysis and Machine Intelligence 43.11 IEEE, 2020, pp. 3964–3979
  • [Li+19] Xuechen Li, Yi Wu, Lester Mackey and Murat A. Erdogdu “Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond” In Advances in Neural Information Processing Systems 32, 2019
  • [Ma+21] Yi-An Ma et al. “Is there an analog of Nesterov acceleration for gradient-based MCMC?” In Bernoulli 27.3, Bernoulli Society for Mathematical Statistics and Probability, 2021, pp. 1942–1992
  • [Mir17] Ilya Mironov “Rényi differential privacy” In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), 2017, pp. 263–275 IEEE
  • [Mon21] Pierre Monmarché “High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion” In Electronic Journal of Statistics 15.2, Institute of Mathematical Statistics and Bernoulli Society, 2021, pp. 4117–4166
  • [Mou+22] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright and Peter L. Bartlett “Improved bounds for discretization of Langevin diffusions: near-optimal rates without convexity” In Bernoulli 28.3, 2022, pp. 1577–1601
  • [Nes83] Yurii E. Nesterov “A method of solving a convex programming problem with convergence rate O(\frac{1}{k^{2}})” In Doklady Akademii Nauk 269.3, Russian Academy of Sciences, 1983, pp. 543–547
  • [Oks13] Bernt Oksendal “Stochastic differential equations: an introduction with applications” Springer Science & Business Media, 2013
  • [RRT17] Maxim Raginsky, Alexander Rakhlin and Matus Telgarsky “Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 1674–1703
  • [RS18] Julien Roussel and Gabriel Stoltz “Spectral methods for Langevin dynamics and associated error estimates” In ESAIM: Mathematical Modelling and Numerical Analysis 52.3 EDP Sciences, 2018, pp. 1051–1083
  • [SL19] Ruoqi Shen and Yin Tat Lee “The randomized midpoint method for log-concave sampling” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
  • [Vil02] Cédric Villani “Limites hydrodynamiques de l’équation de Boltzmann” In Astérisque, SMF 282, 2002, pp. 365–405
  • [Vil09] Cédric Villani “Hypocoercivity” In Mem. Amer. Math. Soc. 202.950, 2009, pp. iv+141
  • [Vis21] Nisheeth K Vishnoi “An introduction to Hamiltonian Monte Carlo method for sampling” In arXiv preprint arXiv:2108.12107, 2021
  • [Von11] Udo Von Toussaint “Bayesian inference in physics” In Reviews of Modern Physics 83.3 APS, 2011, pp. 943
  • [VW19] Santosh Vempala and Andre Wibisono “Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
  • [WW22] Jun-Kun Wang and Andre Wibisono “Accelerating Hamiltonian Monte Carlo via Chebyshev integration time” In arXiv preprint arXiv:2207.02189, 2022

Appendix A Explicit Form for the Underdamped Langevin Diffusion

Recall that we evolve (x_{t},v_{t}) for time t\in[kh,(k+1)h) explicitly according to the SDE (ULMC), which we repeat here for convenience:

\begin{aligned} \mathrm{d}x_{t} &\triangleq v_{t}\,\mathrm{d}t\,, \qquad &(A.1) \\ \mathrm{d}v_{t} &\triangleq -\gamma v_{t}\,\mathrm{d}t - \nabla U(x_{kh})\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_{t}\,. \qquad &(A.2) \end{aligned}

Consequently, since we fix the position x_{kh} in the nonlinear term, this permits the explicit solution

\begin{aligned} x_{(k+1)h} &= x_{kh} + \gamma^{-1}\,(1-\exp(-\gamma h))\,v_{kh} - \gamma^{-1}\,(h-\gamma^{-1}\,(1-\exp(-\gamma h)))\,\nabla U(x_{kh}) + W_{k}^{x}\,, \qquad &(A.3) \\ v_{(k+1)h} &= \exp(-\gamma h)\,v_{kh} - \gamma^{-1}\,(1-\exp(-\gamma h))\,\nabla U(x_{kh}) + W_{k}^{v}\,, \qquad &(A.4) \end{aligned}

where (W_{k}^{x}, W_{k}^{v})_{k\in\mathbb{N}} is an independent sequence of pairs of random variables, each pair having the joint distribution

\begin{bmatrix} W_{k}^{x} \\ W_{k}^{v} \end{bmatrix} \sim \mathcal{N}\biggl(0, \begin{bmatrix} \frac{2}{\gamma}\,(h - \frac{2}{\gamma}\,(1-\exp(-\gamma h)) + \frac{1}{2\gamma}\,(1-\exp(-2\gamma h))) & * \\ \frac{1}{\gamma}\,(1-2\exp(-\gamma h)+\exp(-2\gamma h)) & 1-\exp(-2\gamma h) \end{bmatrix}\biggr)\,,

where * is identical to the bottom left entry.
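For concreteness, the update (A.3)–(A.4) is directly implementable. Below is a minimal NumPy sketch of one exact ULMC step; the gradient oracle grad_U and the standard Gaussian target in the usage example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def ulmc_step(x, v, grad_U, gamma, h, rng):
    """One exact ULMC step following (A.3)-(A.4); the noise (W^x, W^v) is drawn
    coordinatewise from the 2x2 covariance displayed above."""
    a = np.exp(-gamma * h)
    g = grad_U(x)
    x_mean = x + (1 - a) / gamma * v - (h - (1 - a) / gamma) / gamma * g
    v_mean = a * v - (1 - a) / gamma * g
    Sxx = (2 / gamma) * (h - 2 * (1 - a) / gamma + (1 - a**2) / (2 * gamma))
    Sxv = (1 - 2 * a + a**2) / gamma
    Svv = 1 - a**2
    chol = np.linalg.cholesky(np.array([[Sxx, Sxv], [Sxv, Svv]]))
    W = chol @ rng.standard_normal((2, x.size))
    return x_mean + W[0], v_mean + W[1]

# Illustrative usage: sample from pi \propto exp(-||x||^2 / 2), so grad_U(x) = x.
rng = np.random.default_rng(0)
x, v = np.zeros(50), np.zeros(50)
for _ in range(1000):
    x, v = ulmc_step(x, v, lambda y: y, gamma=2.0, h=0.1, rng=rng)
```

Sampling the pair (W_{k}^{x},W_{k}^{v}) jointly, rather than independently, is what makes the step exact for the frozen-gradient SDE.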

Appendix B Continuous-Time Results

B.1 Entropic Hypocoercivity

Our proof of Lemma 5 is based on adapting the argument on the decay of a Lyapunov function from [Ma+21] (based on entropic hypocoercivity, see [Vil09]) and combining it with a time change argument [DR20, Lemma 1]. We provide the details below for completeness.

Proof. [Proof of Lemma 5] First, note that the variables (x_{t},v_{t}) of the underdamped Langevin diffusion with \gamma=2\sqrt{2L} can be transformed, via the time change (\tilde{x}_{t},\tilde{v}_{t})=(x_{t\sqrt{\xi}},\frac{1}{\sqrt{\xi}}\,v_{t\sqrt{\xi}}), into a process satisfying

dx~t\displaystyle\mathrm{d}\tilde{x}_{t} =ξv~tdt,\displaystyle=\xi\tilde{v}_{t}\,\mathrm{d}t\,,
dv~t\displaystyle\mathrm{d}\tilde{v}_{t} =ξγ~v~tdtU(x~t)dt+2γ~dBt,\displaystyle=-\xi\tilde{\gamma}\tilde{v}_{t}\,\mathrm{d}t-\nabla U(\tilde{x}_{t})\,\mathrm{d}t+\sqrt{2\tilde{\gamma}}\,\mathrm{d}B_{t}\,,

with \tilde{\gamma}=2 and \xi=2L, which are precisely the parameters required by [Ma+21, Proposition 1]. From that proposition, the Lyapunov functional given by

~(μ~μ~)=𝖪𝖫(μ~μ~)+𝔼μ~[𝔑1/2logμ~μ~2],where𝔑=1L[1/41/21/22]Id,\displaystyle\tilde{\mathcal{F}}(\tilde{\mu}^{\prime}\mathbin{\|}\tilde{\mu})=\mathsf{KL}(\tilde{\mu}^{\prime}\mathbin{\|}\tilde{\mu})+\operatorname{\mathbb{E}}_{\tilde{\mu}^{\prime}}\bigl{[}\bigl{\lVert}\mathfrak{N}^{1/2}\,\nabla\log\frac{\tilde{\mu}^{\prime}}{\tilde{\mu}}\bigr{\rVert}^{2}\bigr{]}\,,\qquad\text{where}~{}\mathfrak{N}=\frac{1}{L}\,\begin{bmatrix}1/4&1/2\\ 1/2&2\end{bmatrix}\otimes I_{d}\,,

decays as \partial_{t}\tilde{\mathcal{F}}(\tilde{\mu}_{t}\mathbin{\|}\tilde{\mu})\leq-\frac{1}{10C_{\mathsf{LSI}}}\,\tilde{\mathcal{F}}(\tilde{\mu}_{t}\mathbin{\|}\tilde{\mu}). Here, the \mathsf{LSI} constant is unchanged by our coordinate transform, but now \tilde{\mu}_{t} denotes the joint law of (\tilde{x}_{t},\tilde{v}_{t}), while the stationary measure takes the form \tilde{\mu}(\tilde{x},\tilde{v})\propto\pi(\tilde{x})\exp(-\xi\,\lVert\tilde{v}\rVert^{2}/2). The statement of the lemma follows by reversing the change of variables, which scales the gradients in the momentum coordinate up by \xi^{1/2} and the time down by \xi^{1/2}. ∎

B.2 Contraction of ULMC

In this section, we prove a contraction result for ULMC and use this to deduce a log-Sobolev inequality along the trajectory of the underdamped Langevin diffusion. The mean of the next iterate of ULMC started at (x,v)(x,v) is given by

F(x,v)\displaystyle F(x,v) (x+1exp(γh)γvhγ1(1exp(γh))γU(x),\displaystyle\coloneqq\Bigl{(}x+\frac{1-\exp(-\gamma h)}{\gamma}\,v-\frac{h-\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}\,\nabla U(x),
exp(γh)v1exp(γh)γU(x)).\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\exp(-\gamma h)\,v-\frac{1-\exp(-\gamma h)}{\gamma}\,\nabla U(x)\Bigr{)}\,.

We will use the change of coordinates

(ϕ,ψ)\displaystyle(\phi,\psi) (x,v)(x,x+2γv).\displaystyle\coloneqq\mathcal{M}(x,v)\coloneqq\bigl{(}x,x+\frac{2}{\gamma}\,v\bigr{)}\,.

In these new coordinates, the mean of the next iterate of ULMC started at (ϕ,ψ)(\phi,\psi) is F¯(ϕ,ψ)\bar{F}(\phi,\psi), where F¯=F1\bar{F}=\mathcal{M}\circ F\circ\mathcal{M}^{-1}. Since 1(ϕ,ψ)=(ϕ,γ2(ψϕ))\mathcal{M}^{-1}(\phi,\psi)=(\phi,\frac{\gamma}{2}\,(\psi-\phi)), we can explicitly write

F¯(ϕ,ψ)\displaystyle\bar{F}(\phi,\psi) =(ϕ+1exp(γh)2(ψϕ)hγ1(1exp(γh))γU(ϕ),\displaystyle=\Bigl{(}\phi+\frac{1-\exp(-\gamma h)}{2}\,(\psi-\phi)-\frac{h-\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}\,\nabla U(\phi),
ϕ+1+exp(γh)2(ψϕ)h+γ1(1exp(γh))γU(ϕ)).\displaystyle\qquad\phi+\frac{1+\exp(-\gamma h)}{2}\,(\psi-\phi)-\frac{h+\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}\,\nabla U(\phi)\Bigr{)}\,.
Lemma 16.

Consider the mapping F¯:d×dd×d\bar{F}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d}\times\mathbb{R}^{d} defined above. Assume that mId2ULIdmI_{d}\preceq\nabla^{2}U\preceq LI_{d}. Then, for h1h\lesssim 1 and γ=cL\gamma=c\sqrt{L} for some c2c\geq\sqrt{2}, F¯\bar{F} is a contraction with parameter

F¯Lip\displaystyle\lVert\bar{F}\rVert_{\operatorname{Lip}} 1m2Lh+O(Lh2).\displaystyle\leq 1-\frac{m}{\sqrt{2L}}\,h+O(Lh^{2})\,.

Proof.  We compute the partial derivatives

ϕF¯(ϕ,ψ)ϕ\displaystyle\partial_{\phi}{\bar{F}(\phi,\psi)}_{\phi} =1+exp(γh)2Idhγ1(1exp(γh))γ2U(ϕ),\displaystyle=\frac{1+\exp(-\gamma h)}{2}\,I_{d}-\frac{h-\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}\,\nabla^{2}U(\phi)\,,
ϕF¯(ϕ,ψ)ψ\displaystyle\partial_{\phi}{\bar{F}(\phi,\psi)}_{\psi} =1exp(γh)2Idh+γ1(1exp(γh))γ2U(ϕ),\displaystyle=\frac{1-\exp(-\gamma h)}{2}\,I_{d}-\frac{h+\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}\,\nabla^{2}U(\phi)\,,
ψF¯(ϕ,ψ)ϕ\displaystyle\partial_{\psi}{\bar{F}(\phi,\psi)}_{\phi} =1exp(γh)2Id,\displaystyle=\frac{1-\exp(-\gamma h)}{2}\,I_{d}\,,
ψF¯(ϕ,ψ)ψ\displaystyle\partial_{\psi}{\bar{F}(\phi,\psi)}_{\psi} =1+exp(γh)2Id.\displaystyle=\frac{1+\exp(-\gamma h)}{2}\,I_{d}\,.

Let aexp(γh)a\coloneqq\exp(-\gamma h) and b2γ(h+γ1(1exp(γh)))b\coloneqq\frac{2}{\gamma}\,(h+\gamma^{-1}\,(1-\exp(-\gamma h))). Since

hγ1(1exp(γh))γ=O(h2),\displaystyle\frac{h-\gamma^{-1}\,(1-\exp(-\gamma h))}{\gamma}=O(h^{2})\,,

we have

F¯(ϕ,ψ)op\displaystyle\lVert\nabla\bar{F}(\phi,\psi)\rVert_{\rm op} 12[(1+a)Id(1a)Idb2U(ϕ)(1a)Id(1+a)Id]Aop+O(Lh2).\displaystyle\leq\frac{1}{2}\,\Bigl{\lVert}\underbrace{\begin{bmatrix}(1+a)\,I_{d}&(1-a)\,I_{d}-b\,\nabla^{2}U(\phi)\\ (1-a)\,I_{d}&(1+a)\,I_{d}\end{bmatrix}}_{\eqqcolon A}\Bigr{\rVert}_{\rm op}+O(Lh^{2})\,.

Then,

AA𝖳\displaystyle AA^{\mathsf{T}} =[(1+a)2Id+((1a)Idb2U(ϕ))22(1a2)Id(1+a)b2U(ϕ){(1a)2+(1+a)2}Id],\displaystyle=\begin{bmatrix}{(1+a)}^{2}\,I_{d}+{((1-a)\,I_{d}-b\,\nabla^{2}U(\phi))}^{2}&*\\ 2\,(1-a^{2})\,I_{d}-(1+a)\,b\,\nabla^{2}U(\phi)&\{{(1-a)}^{2}+{(1+a)}^{2}\}\,I_{d}\end{bmatrix}\,,

where the upper right entry is determined by symmetry. Since 1a=Θ(γh)1-a=\Theta(\gamma h) and b=O(h/γ)b=O(h/\gamma), one can simplify this as follows:

AA𝖳2[(1+a2)Id(1a2)Idb2U(ϕ)(1a2)Idb2U(ϕ)(1+a2)Id]Bop\displaystyle\Bigl{\lVert}AA^{\mathsf{T}}-2\,\underbrace{\begin{bmatrix}(1+a^{2})\,I_{d}&(1-a^{2})\,I_{d}-b\,\nabla^{2}U(\phi)\\ (1-a^{2})\,I_{d}-b\,\nabla^{2}U(\phi)&(1+a^{2})\,I_{d}\end{bmatrix}}_{\eqqcolon B}\Bigr{\rVert}_{\rm op}
O(L2h2γ2+Lh2).\displaystyle\qquad\leq O\bigl{(}\frac{L^{2}h^{2}}{\gamma^{2}}+Lh^{2}\bigr{)}\,.

One can check that the eigenvalues of the matrix BB are 1+a2±(1a2bλ)1+a^{2}\pm(1-a^{2}-b\lambda), where λ\lambda ranges over the eigenvalues of 2U(ϕ)\nabla^{2}U(\phi). Hence, we can bound

Bop\displaystyle\lVert B\rVert_{\rm op} max{2a2+Lb,2bm}.\displaystyle\leq\max\{2a^{2}+Lb,2-bm\}\,.

We note that

2a2+Lb\displaystyle 2a^{2}+Lb =2exp(2γh)+2L(h+γ1(1exp(γh)))γ\displaystyle=2\exp(-2\gamma h)+\frac{2L\,(h+\gamma^{-1}\,(1-\exp(-\gamma h)))}{\gamma}
=2{12γh+2Lhγ+O(γ2h2+Lh2)}.\displaystyle=2\,\Bigl{\{}1-2\gamma h+\frac{2Lh}{\gamma}+O(\gamma^{2}h^{2}+Lh^{2})\Bigr{\}}\,.

In order for this to be strictly smaller than 22, we must take γ>L\gamma>\sqrt{L}. We choose γ=cL\gamma=c\sqrt{L} for c2c\geq\sqrt{2}, in which case

Bop\displaystyle\lVert B\rVert_{\rm op} 2max{1cLh, 1m2Lh}+O(Lh2)\displaystyle\leq 2\max\Bigl{\{}1-c\sqrt{L}\,h,\;1-m\sqrt{\frac{2}{L}}\,h\Bigr{\}}+O(Lh^{2})
=2(1m2Lh)+O(Lh2).\displaystyle=2\,\Bigl{(}1-m\sqrt{\frac{2}{L}}\,h\Bigr{)}+O(Lh^{2})\,.

We deduce that

AA𝖳op\displaystyle\lVert AA^{\mathsf{T}}\rVert_{\rm op} 4(1m2Lh)+O(Lh2)\displaystyle\leq 4\,\Bigl{(}1-m\sqrt{\frac{2}{L}}\,h\Bigr{)}+O(Lh^{2})

and therefore

\lVert\nabla\bar{F}(\phi,\psi)\rVert_{\rm op}\leq\sqrt{1-m\sqrt{\frac{2}{L}}\,h}+O(Lh^{2})\leq 1-\frac{m}{\sqrt{2L}}\,h+O(Lh^{2})\,.

This proves the claimed contraction. ∎
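As a numerical sanity check on Lemma 16 (a sketch under illustrative parameter values, not part of the analysis), one can evaluate \lVert\nabla\bar{F}\rVert_{\rm op} for a one-dimensional quadratic potential, where \nabla^{2}U\equiv\lambda ranges over [m,L]:

```python
import numpy as np

def jac_Fbar(lam, gamma, h):
    """Jacobian of the twisted mean map for U(x) = lam * x^2 / 2."""
    a = np.exp(-gamma * h)
    cm = (h - (1 - a) / gamma) / gamma   # coefficient of the Hessian in the phi-component
    cp = (h + (1 - a) / gamma) / gamma   # coefficient of the Hessian in the psi-component
    return np.array([[(1 + a) / 2 - cm * lam, (1 - a) / 2],
                     [(1 - a) / 2 - cp * lam, (1 + a) / 2]])

m, L, h = 1.0, 10.0, 1e-3
gamma = np.sqrt(2 * L)
worst = max(np.linalg.norm(jac_Fbar(lam, gamma, h), 2)
            for lam in np.linspace(m, L, 200))
# The worst case should not exceed 1 - m*h/sqrt(2L), up to the O(L h^2) slack.
print(worst, 1 - m * h / np.sqrt(2 * L))
```

The tightest direction occurs at \lambda=m, matching the contraction rate m/\sqrt{2L} in the lemma.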

The ULMC iterate is

(x(k+1)h,v(k+1)h)\displaystyle(x_{(k+1)h},v_{(k+1)h}) =dF(xkh,vkh)+𝒩(0,Σ),\displaystyle\stackrel{{\scriptstyle d}}{{=}}F(x_{kh},v_{kh})+\mathcal{N}(0,\Sigma)\,,

where \Sigma is the covariance of the Gaussian random vector in the ULMC update (see Appendix A). In the new coordinates, this iteration can be written

(ϕ(k+1)h,ψ(k+1)h)\displaystyle(\phi_{(k+1)h},\psi_{(k+1)h}) =dF¯(ϕkh,ψkh)+𝒩(0,Σ𝖳).\displaystyle\stackrel{{\scriptstyle d}}{{=}}\bar{F}(\phi_{kh},\psi_{kh})+\mathcal{N}(0,\mathcal{M}\Sigma\mathcal{M}^{\mathsf{T}})\,.

Writing Σ𝖳=Σ¯Id\mathcal{M}\Sigma\mathcal{M}^{\mathsf{T}}=\bar{\Sigma}\otimes I_{d}, we can compute

Σ¯1,1\displaystyle\bar{\Sigma}_{1,1} =2hγ3γ2+4exp(γh)γ2exp(2γh)γ2=O(γh3),\displaystyle=\frac{2h}{\gamma}-\frac{3}{\gamma^{2}}+\frac{4\exp(-\gamma h)}{\gamma^{2}}-\frac{\exp(-2\gamma h)}{\gamma^{2}}=O(\gamma h^{3})\,,
Σ¯1,2\displaystyle\bar{\Sigma}_{1,2} =2hγ1γ2+exp(2γh)γ2=O(h2),\displaystyle=\frac{2h}{\gamma}-\frac{1}{\gamma^{2}}+\frac{\exp(-2\gamma h)}{\gamma^{2}}=O(h^{2})\,,
\bar{\Sigma}_{2,2}=\frac{2h}{\gamma}+\frac{5}{\gamma^{2}}-\frac{4\exp(-\gamma h)}{\gamma^{2}}-\frac{\exp(-2\gamma h)}{\gamma^{2}}=\frac{8h}{\gamma}+O(h^{2})\,.

We conclude that

\lVert\bar{\Sigma}\rVert_{\rm op}\leq\frac{8h}{\gamma}+O(h^{2})\,.

Hence, C_{\mathsf{LSI}}(\mathcal{N}(0,\mathcal{M}\Sigma\mathcal{M}^{\mathsf{T}}))\leq\frac{8h}{\gamma}+O(h^{2}), since the log-Sobolev constant of a Gaussian is the operator norm of its covariance.
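These expansions can be checked numerically by conjugating the one-step covariance from Appendix A by \mathcal{M}; a short sketch with illustrative values:

```python
import numpy as np

L_, h = 10.0, 1e-3
gamma = np.sqrt(2 * L_)
a = np.exp(-gamma * h)
# One-step covariance of (W^x, W^v) from Appendix A (per coordinate).
Sigma = np.array([
    [2 / gamma * (h - 2 * (1 - a) / gamma + (1 - a**2) / (2 * gamma)), (1 - a)**2 / gamma],
    [(1 - a)**2 / gamma, 1 - a**2],
])
M = np.array([[1.0, 0.0], [1.0, 2.0 / gamma]])   # (phi, psi) = (x, x + 2v/gamma)
Sigma_bar = M @ Sigma @ M.T
print(Sigma_bar[1, 1] / (h / gamma))   # tends to 8 as h -> 0, matching the leading term
```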

Proposition 17.

Let μ^tlaw(ϕt,ψt)\hat{\mu}_{t}^{\mathcal{M}}\coloneqq\operatorname{law}(\phi_{t},\psi_{t}). Then, for all ε>0\varepsilon>0, for all sufficiently small h>0h>0 (depending on ε\varepsilon), one has

C_{\mathsf{LSI}}(\hat{\mu}_{Nh}^{\mathcal{M}})\leq\Bigl(1-\bigl(m\sqrt{\frac{2}{L}}-\varepsilon\bigr)\,h\Bigr)^{N}\,C_{\mathsf{LSI}}(\hat{\mu}_{0}^{\mathcal{M}})+\frac{8}{2m-\varepsilon\sqrt{2L}}+O\bigl(\frac{h\sqrt{L}}{m}\bigr)\,.

Proof.  The LSI constant evolves according to

C_{\mathsf{LSI}}(\hat{\mu}_{(k+1)h}^{\mathcal{M}})\leq\lVert\bar{F}\rVert_{\operatorname{Lip}}^{2}\,C_{\mathsf{LSI}}(\hat{\mu}_{kh}^{\mathcal{M}})+C_{\mathsf{LSI}}\bigl(\mathcal{N}(0,\mathcal{M}\Sigma\mathcal{M}^{\mathsf{T}})\bigr)\leq\Bigl(1-m\sqrt{\frac{2}{L}}\,h+O(Lh^{2})\Bigr)\,C_{\mathsf{LSI}}(\hat{\mu}_{kh}^{\mathcal{M}})+\frac{8h}{\gamma}+O(h^{2})\,.

For hh sufficiently small, we have

C_{\mathsf{LSI}}(\hat{\mu}_{(k+1)h}^{\mathcal{M}})\leq\Bigl(1-\bigl(m\sqrt{\frac{2}{L}}-\varepsilon\bigr)\,h\Bigr)\,C_{\mathsf{LSI}}(\hat{\mu}_{kh}^{\mathcal{M}})+\frac{8h}{\gamma}+O(h^{2})\,.

Iterating,

C_{\mathsf{LSI}}(\hat{\mu}_{Nh}^{\mathcal{M}})\leq\Bigl(1-\bigl(m\sqrt{\frac{2}{L}}-\varepsilon\bigr)\,h\Bigr)^{N}\,C_{\mathsf{LSI}}(\hat{\mu}_{0}^{\mathcal{M}})+\frac{8}{2m-\varepsilon\sqrt{2L}}+O\bigl(\frac{h\sqrt{L}}{m}\bigr)\,.

This completes the proof. ∎

Corollary 18.

Let μt\mu_{t}^{\mathcal{M}} now denote the law of the continuous-time underdamped Langevin diffusion with γ=cL\gamma=c\sqrt{L} for c2c\geq\sqrt{2} in the (ϕ,ψ)(\phi,\psi) coordinates. Then,

C_{\mathsf{LSI}}(\mu_{t}^{\mathcal{M}})\leq\exp\Bigl(-m\sqrt{\frac{2}{L}}\,t\Bigr)\,C_{\mathsf{LSI}}(\mu_{0}^{\mathcal{M}})+\frac{4}{m}\,.

Proof.  In the preceding proposition, let h0h\searrow 0 while NhtNh\to t, and then let ε0\varepsilon\searrow 0. ∎

Appendix C Discretization Analysis

We consider the discretization used in [Ma+21], with the following differential form:

\mathrm{d}\hat{x}_{t}=\hat{v}_{t}\,\mathrm{d}t\,,
\mathrm{d}\hat{v}_{t}=-\gamma\hat{v}_{t}\,\mathrm{d}t-\nabla U(\hat{x}_{kh})\,\mathrm{d}t+\sqrt{2\gamma}\,\mathrm{d}B_{t}\,,

and we define the variable w^t\hat{w}_{t} as the tuple (x^t,v^t)(\hat{x}_{t},\hat{v}_{t}), for t[kh,(k+1)h]t\in[kh,(k+1)h].

C.1 Technical Lemmas

Theorem 19 (Girsanov’s Theorem, Adapted from [Oks13, Theorem 8.6.8]).

Consider stochastic processes (xt)t0{(x_{t})}_{t\geq 0}, (btP)t0{(b_{t}^{P})}_{t\geq 0}, (btQ)t0{(b_{t}^{Q})}_{t\geq 0} adapted to the same filtration, and σd×d\sigma\in\mathbb{R}^{d\times d} any constant, possibly degenerate, matrix. Let PTP_{T} and QTQ_{T} be probability measures on the path space C([0,T];d)C([0,T];\mathbb{R}^{d}) such that (wt)t0{(w_{t})}_{t\geq 0} evolves according to

dwt\displaystyle\mathrm{d}w_{t} =btPdt+σdBtPunderPT,\displaystyle=b_{t}^{P}\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}^{P}\qquad\text{under}~{}P_{T}\,,
dwt\displaystyle\mathrm{d}w_{t} =btQdt+σdBtQunderQT,\displaystyle=b_{t}^{Q}\,\mathrm{d}t+\sigma\,\mathrm{d}B_{t}^{Q}\qquad\text{under}~{}Q_{T}\,,

where BPB^{P} is a PTP_{T}-Brownian motion and BQB^{Q} is a QTQ_{T}-Brownian motion. Furthermore, suppose there exists a process (ut)t0(u_{t})_{t\geq 0} such that

σut=btPbtQ,\sigma\,u_{t}=b^{P}_{t}-b^{Q}_{t}\,,

and that Novikov's condition holds:

\mathbb{E}^{Q_{T}}\exp\Bigl(\frac{1}{2}\int_{0}^{T}\lVert u_{s}\rVert^{2}\,\mathrm{d}s\Bigr)<\infty\,.

Defining \sigma^{\dagger} as the Moore–Penrose pseudo-inverse of \sigma, the relation above gives u_{t}=\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q}). Then,

dPTdQT\displaystyle\frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}} =exp(0Tσ(btPbtQ),dBtQ120Tσ(btPbtQ)2dt).\displaystyle=\exp\Bigl{(}\int_{0}^{T}\langle\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q}),\mathrm{d}B_{t}^{Q}\rangle-\frac{1}{2}\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr{)}\,.

In fact, we will only need the following corollary.

Corollary 20.

For any event \mathcal{E} and q1q\geq 1,

\operatorname{\mathbb{E}}^{Q_{T}}\bigl[\bigl(\frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}}\bigr)^{q}\operatorname{\mathbbm{1}}_{\mathcal{E}}\bigr]\leq\sqrt{\operatorname{\mathbb{E}}^{Q_{T}}\Bigl[\exp\Bigl(2q^{2}\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr)\operatorname{\mathbbm{1}}_{\mathcal{E}}\Bigr]}\,.

Proof.  Using the expression for \frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}} from Theorem 19 and then Cauchy–Schwarz, we find

𝔼QT[(dPTdQT)q𝟙]=𝔼QT[exp(q0Tσ(btPbtQ),dBtQq20Tσ(btPbtQ)2dt)𝟙]\displaystyle\operatorname{\mathbb{E}}^{Q_{T}}\bigl{[}\bigl{(}\frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}}\bigr{)}^{q}\operatorname{\mathbbm{1}}_{\mathcal{E}}\bigr{]}=\operatorname{\mathbb{E}}^{Q_{T}}\Bigl{[}\exp\Bigl{(}q\int_{0}^{T}\langle\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q}),\mathrm{d}B_{t}^{Q}\rangle-\frac{q}{2}\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr{)}\operatorname{\mathbbm{1}}_{\mathcal{E}}\Bigr{]}
𝔼QT[exp((2q2q)0Tσ(btPbtQ)2dt)𝟙]\displaystyle\qquad\leq\sqrt{\operatorname{\mathbb{E}}^{Q_{T}}\Bigl{[}\exp\Bigl{(}(2q^{2}-q)\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr{)}\operatorname{\mathbbm{1}}_{\mathcal{E}}\Bigr{]}}
\qquad\qquad{}\times\underset{\leq 1}{\underbrace{\sqrt{\operatorname{\mathbb{E}}^{Q_{T}}\Bigl[\exp\Bigl(2q\int_{0}^{T}\langle\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q}),\mathrm{d}B_{t}^{Q}\rangle-2q^{2}\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr)\operatorname{\mathbbm{1}}_{\mathcal{E}}\Bigr]}}}
𝔼QT[exp(2q20Tσ(btPbtQ)2dt)𝟙].\displaystyle\qquad\leq\sqrt{\operatorname{\mathbb{E}}^{Q_{T}}\Bigl{[}\exp\Bigl{(}2q^{2}\int_{0}^{T}\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert^{2}\,\mathrm{d}t\Bigr{)}\operatorname{\mathbbm{1}}_{\mathcal{E}}\Bigr{]}}\,.

Here, we used the fact that t\mapsto\exp(\int_{0}^{t}\langle u_{\tau},\mathrm{d}B_{\tau}\rangle-\frac{1}{2}\int_{0}^{t}\lVert u_{\tau}\rVert^{2}\,\mathrm{d}\tau) is a non-negative local martingale, hence a supermartingale, so the underbraced expectation is at most 1 (applied with u=2q\,\sigma^{\dagger}(b^{P}-b^{Q}), together with \mathbbm{1}_{\mathcal{E}}\leq 1). ∎

We can identify the following for the process (xt,vt)(x_{t},v_{t}):

σ=[0002γId],btP=[vtγvtU(xt)],btQ=[vtγvtU(xt/hh)].\displaystyle\sigma=\begin{bmatrix}0&0\\ 0&\sqrt{2\gamma}\,I_{d}\end{bmatrix}\,,\qquad b_{t}^{P}=\begin{bmatrix}v_{t}\\ -\gamma v_{t}-\nabla U(x_{t})\end{bmatrix}\,,\qquad b_{t}^{Q}=\begin{bmatrix}v_{t}\\ -\gamma v_{t}-\nabla U(x_{\lfloor t/h\rfloor h})\end{bmatrix}\,.

In this case, σ(btPbtQ)12γU(xt/hh)U(xt)\lVert\sigma^{\dagger}\,(b_{t}^{P}-b_{t}^{Q})\rVert\equiv\frac{1}{\sqrt{2\gamma}}\,\lVert\nabla U(x_{\lfloor t/h\rfloor h})-\nabla U(x_{t})\rVert.
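In other words, the discretization error is controlled by exponential moments of the accumulated squared gradient mismatch along the interpolated dynamics. The following sketch estimates this Girsanov energy along a single interpolated path via Euler substeps (all names and parameter values are illustrative, and the substep integration approximates, rather than exactly simulates, the interpolation):

```python
import numpy as np

def girsanov_energy(grad_U, gamma, h, n_steps, substeps, d, rng):
    """Approximate (1/(2*gamma)) * int ||grad_U(x_{floor(t/h)h}) - grad_U(x_t)||^2 dt
    along the interpolated ULMC dynamics, using Euler substeps within each step."""
    dt = h / substeps
    x, v = np.zeros(d), np.zeros(d)
    energy = 0.0
    for _ in range(n_steps):
        g_anchor = grad_U(x)   # gradient frozen at the step's left endpoint
        for _ in range(substeps):
            energy += np.sum((g_anchor - grad_U(x)) ** 2) / (2 * gamma) * dt
            x = x + v * dt
            v = v - (gamma * v + g_anchor) * dt + np.sqrt(2 * gamma * dt) * rng.standard_normal(d)
    return energy

rng = np.random.default_rng(0)
print(girsanov_energy(lambda y: y, gamma=2.0, h=0.05, n_steps=100, substeps=50, d=10, rng=rng))
```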

We also adapt the following lemmas from [Che+21], which we state without proof.

Lemma 21 (Change of Measure, from [Che+21, Lemma 21]).

Let μ\mu, ν\nu be probability measures and let EE be any event. Then,

\mu(E)\leq\nu(E)+\sqrt{\chi^{2}(\mu\mathbin{\|}\nu)\,\nu(E)}\,.

In particular, if μ\mu and ν\nu are probability measures on d\mathbb{R}^{d} and

ν{R0+η}Cexp(cη2)for allη0,\displaystyle\nu\{\lVert\cdot\rVert\geq R_{0}+\eta\}\leq C\exp(-c\eta^{2})\qquad\text{for all}~{}\eta\geq 0\,,

where C1C\geq 1, then

μ{R0+1c2(μν)+η}2Cexp(cη22)for allη0.\displaystyle\mu\Bigl{\{}\lVert\cdot\rVert\geq R_{0}+\sqrt{\frac{1}{c}\,\mathcal{R}_{2}(\mu\mathbin{\|}\nu)}+\eta\Bigr{\}}\leq 2C\exp\bigl{(}-\frac{c\eta^{2}}{2}\bigr{)}\qquad\text{for all}~{}\eta\geq 0\,.
Lemma 22.

Let (Bt)t0{(B_{t})}_{t\geq 0} be a standard Brownian motion in d\mathbb{R}^{d}. Then, if λ0\lambda\geq 0 and h1/(4λ)h\leq 1/(4\lambda),

𝔼exp(λsupt[0,h]Bt2)\displaystyle\operatorname{\mathbb{E}}\exp\bigl{(}\lambda\sup_{t\in[0,h]}{\lVert B_{t}\rVert^{2}}\bigr{)} exp(6dhλ).\displaystyle\leq\exp(6dh\lambda)\,.

In particular, for all η0\eta\geq 0,

{supt[0,h]Btη}\displaystyle\mathbb{P}\bigl{\{}\sup_{t\in[0,h]}{\lVert B_{t}\rVert}\geq\eta\bigr{\}} 3exp(η26dh).\displaystyle\leq 3\exp\bigl{(}-\frac{\eta^{2}}{6dh}\bigr{)}\,.
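A crude Monte Carlo check of this (deliberately non-sharp) tail bound, with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n_steps, n_paths = 5, 0.01, 100, 10000
incr = rng.standard_normal((n_paths, n_steps, d)) * np.sqrt(h / n_steps)
sup_norm = np.linalg.norm(np.cumsum(incr, axis=1), axis=2).max(axis=1)
for eta in (0.4, 0.6, 0.8):
    # Empirical tail probability versus the bound 3*exp(-eta^2 / (6*d*h)).
    print(eta, (sup_norm >= eta).mean(), 3 * np.exp(-eta**2 / (6 * d * h)))
```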
Lemma 23 ([GT20, Lemma 14]).

Let Y>0Y>0 be a random variable. Assume that for all 0<δ<1/20<\delta<1/2 there exists an event δ\mathcal{E}_{\delta} with probability at least 1δ1-\delta such that

𝔼[Y2δ]vδξ\displaystyle\operatorname{\mathbb{E}}[Y^{2}\mid\mathcal{E}_{\delta}]\leq\frac{v}{\delta^{\xi}}

for some ξ<1\xi<1. Then, 𝔼Y4v\operatorname{\mathbb{E}}Y\leq 4\sqrt{v}.

Lemma 24 (Matrix Grönwall Inequality).

Let x:+dx:\mathbb{R}_{+}\to\mathbb{R}^{d}, and cdc\in\mathbb{R}^{d}, Ad×dA\in\mathbb{R}^{d\times d}, where AA has non-negative entries. Suppose that the following inequality is satisfied componentwise:

x(t)c+0tAx(s)ds,for allt0.\displaystyle x(t)\leq c+\int_{0}^{t}Ax(s)\,\mathrm{d}s\,,\qquad\text{for all}~{}t\geq 0\,. (C.1)

Then, the following inequality holds, where Idd×dI_{d}\in\mathbb{R}^{d\times d} is the dd-dimensional identity matrix:

x(t)(AAeAtAA+Id)c,\displaystyle x(t)\leq(AA^{\dagger}\,e^{At}-AA^{\dagger}+I_{d})\,c\,, (C.2)

where AA^{\dagger} is the Moore–Penrose pseudo-inverse of AA (when AA is invertible, this is equivalent to the standard inverse).

Proof.  This is a special case of [CD76, Main Theorem]. ∎
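For the rank-one matrices A arising below, the bound (C.2) admits a simple closed form, since A^{2}=\operatorname{tr}(A)\,A implies e^{At}=I+A\,(e^{\operatorname{tr}(A)\,t}-1)/\operatorname{tr}(A). A numerical sketch with illustrative placeholder values, checking the closed form used in the proof of Lemma 26:

```python
import numpy as np

L_, gamma, h, v0 = 10.0, np.sqrt(20.0), 1e-2, 1.3
A = np.array([[L_ * h, gamma * h], [L_, gamma]])      # the rank-one matrix of Lemma 26
theta = L_ * h + gamma                                # its unique non-zero eigenvalue
expAh = np.eye(2) + A * np.expm1(theta * h) / theta   # exact, since A @ A == theta * A
c2 = np.array([h * v0, 0.0])
bound = (A @ np.linalg.pinv(A) @ (expAh - np.eye(2)) + np.eye(2)) @ c2
closed_form = (L_ * h * np.exp(theta * h) + gamma) / theta * h * v0
print(bound[0], closed_form)   # the first components agree
```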

C.2 Movement Bound for ULMC

We next prove a movement bound for the continuous-time underdamped Langevin diffusion. The following lemma is a standard fact about the concentration of the norm of a Gaussian vector; see, e.g., [BLM13, Theorem 5.5].

Lemma 25 (Concentration of the Norm).

Let \rho=\mathcal{N}(0,I_{d}) denote the standard Gaussian measure on \mathbb{R}^{d}. Then, for all \eta\geq 0,

ρ(d+η)exp(η22).\displaystyle\rho(\lVert\cdot\rVert\geq\sqrt{d}+\eta)\leq\exp\Bigl{(}-\frac{\eta^{2}}{2}\Bigr{)}\,.

Note that vtv0\lVert v_{t}-v_{0}\rVert is of size 𝒪(dt)\mathcal{O}(\sqrt{dt}), due to the Brownian motion component of the momentum variable vv; this is the same order as the size of the increment of the overdamped Langevin diffusion. However, if we consider the increment in the xx-coordinate only, we obtain the following bound.

Lemma 26.

Let (xt,vt)t0{(x_{t},v_{t})}_{t\geq 0} denote the continuous-time underdamped Langevin diffusion started at (x0,v0)(x_{0},v_{0}), and assume that the gradient U\nabla U of the potential satisfies U(0)=0\nabla U(0)=0 and is Hölder continuous (satisfies (2.3)). Also, assume that hL1/2γ1h\lesssim L^{-1/2}\wedge\gamma^{-1} and 0λ1γsdsh3s0\leq\lambda\lesssim\frac{1}{\gamma^{s}d^{s}h^{3s}}. Then,

log𝔼exp(λsupt[0,h]xtx02s)(L2sh4s(1+x02s2)+h2sv02s+γsdsh3s)λ.\displaystyle\log\operatorname{\mathbb{E}}\exp\bigl{(}\lambda\sup_{t\in[0,h]}{\lVert x_{t}-x_{0}\rVert}^{2s}\bigr{)}\lesssim\bigl{(}L^{2s}h^{4s}\,(1+\lVert x_{0}\rVert^{2s^{2}})+h^{2s}\,\lVert v_{0}\rVert^{2s}+\gamma^{s}d^{s}h^{3s}\bigr{)}\,\lambda\,.

Proof.  We will use the matrix Grönwall inequality (Lemma 24), starting from the following bound for x:

xtx0\displaystyle\lVert x_{t}-x_{0}\rVert 0tvτdτhv0+0t(vτv0)dτ\displaystyle\leq\Bigl{\lVert}\int_{0}^{t}v_{\tau}\,\mathrm{d}\tau\Bigr{\rVert}\leq h\,\lVert v_{0}\rVert+\Bigl{\lVert}\int_{0}^{t}(v_{\tau}-v_{0})\,\mathrm{d}\tau\Bigr{\rVert}
hv0+0t0τγvτdτdτ+0t0τU(xτ)dτdτ\displaystyle\leq h\,\lVert v_{0}\rVert+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\gamma v_{\tau^{\prime}}\,\mathrm{d}\tau^{\prime}\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\nabla U(x_{\tau^{\prime}})\,\mathrm{d}\tau^{\prime}\,\mathrm{d}\tau\Bigr{\rVert}
+0t0τ2γdBτdτ\displaystyle\qquad{}+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\sqrt{2\gamma}\,\mathrm{d}B_{\tau^{\prime}}\,\mathrm{d}\tau\Bigr{\rVert}
hv0+γh(hv0+0tvτv0dτ)+Lh2\displaystyle\leq h\,\lVert v_{0}\rVert+\gamma h\,\Bigl{(}h\,\lVert v_{0}\rVert+\int_{0}^{t}\lVert v_{\tau}-v_{0}\rVert\,\mathrm{d}\tau\Bigr{)}+Lh^{2}
+Lh(hx0s+0txτx0dτ)+2γhsupt[0,h]Bt.\displaystyle\qquad{}+Lh\,\Bigl{(}h\,\lVert x_{0}\rVert^{s}+\int_{0}^{t}\lVert x_{\tau}-x_{0}\rVert\,\mathrm{d}\tau\Bigr{)}+\sqrt{2\gamma}\,h\sup_{t\in[0,h]}{\lVert B_{t}\rVert}\,.

Here we use the Hölder property of U\nabla U along with xs1+x\lVert x\rVert^{s}\leq 1+\lVert x\rVert. Likewise for vv:

vtv0\displaystyle\lVert v_{t}-v_{0}\rVert 0tγvτdτ+0tU(xτ)dτ+0t2γdBτ\displaystyle\leq\Bigl{\lVert}\int_{0}^{t}\gamma v_{\tau}\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\nabla U(x_{\tau})\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\sqrt{2\gamma}\,\mathrm{d}B_{\tau}\Bigr{\rVert}
γ(hv0+0tvτv0dτ)+Lh+L(hx0s+0txτx0dτ)\displaystyle\leq\gamma\,\Bigl{(}h\,\lVert v_{0}\rVert+\int_{0}^{t}\lVert v_{\tau}-v_{0}\rVert\,\mathrm{d}\tau\Bigr{)}+Lh+L\,\Bigl{(}h\,\lVert x_{0}\rVert^{s}+\int_{0}^{t}\lVert x_{\tau}-x_{0}\rVert\,\mathrm{d}\tau\Bigr{)}
+2γsupt[0,h]Bt.\displaystyle\qquad{}+\sqrt{2\gamma}\sup_{t\in[0,h]}{\lVert B_{t}\rVert}\,.

Consequently, we can apply the matrix form of Grönwall's inequality (Lemma 24), decomposing c=c_{1}+c_{2} with c_{1},c_{2} specified below. First, for c_{1}:

A=[LhγhLγ],c1=[Lh2x0s+γh2v0+Lh2+2γhsupt[0,h]BtLhx0s+γhv0+Lh+2γsupt[0,h]Bt].\displaystyle A=\begin{bmatrix}Lh&\gamma h\\ L&\gamma\end{bmatrix}\,,\qquad c_{1}=\begin{bmatrix}Lh^{2}\,\lVert x_{0}\rVert^{s}+\gamma h^{2}\,\lVert v_{0}\rVert+Lh^{2}+\sqrt{2\gamma}\,h\,\sup_{t\in[0,h]}\lVert B_{t}\rVert\\ Lh\,\lVert x_{0}\rVert^{s}+\gamma h\,\lVert v_{0}\rVert+Lh+\sqrt{2\gamma}\sup_{t\in[0,h]}\lVert B_{t}\rVert\end{bmatrix}\,.

Noting that c1c_{1} lies in the image space of AA so that AAc1=c1AA^{\dagger}c_{1}=c_{1}, and similarly observing that exp(At)c1\exp(At)\,c_{1} belongs to the image space of AA (using the power series representation of the matrix exponential), we obtain for this first component:

supt[0,h]x0xt\displaystyle\sup_{t\in[0,h]}{\lVert x_{0}-x_{t}\rVert}
hexp((Lh+γ)h)(γhv0+Lhx0s+Lh+2γsupt[0,h]Bt)+c2term\displaystyle\qquad\leq h\exp\bigl{(}(Lh+\gamma)\,h\bigr{)}\,\bigl{(}\gamma h\,\lVert v_{0}\rVert+Lh\,\lVert x_{0}\rVert^{s}+Lh+\sqrt{2\gamma}\sup_{t\in[0,h]}{\lVert B_{t}\rVert}\bigr{)}+c_{2}~{}\text{term}
2h(γhv0+Lhx0s+Lh+2γsupt[0,h]Bt)+c2term,\displaystyle\qquad\leq 2h\,\bigl{(}\gamma h\,\lVert v_{0}\rVert+Lh\,\lVert x_{0}\rVert^{s}+Lh+\sqrt{2\gamma}\sup_{t\in[0,h]}{\lVert B_{t}\rVert}\bigr{)}+c_{2}~{}\text{term}\,,

where in the second line we take h1L+γh\lesssim\frac{1}{\sqrt{L}+\gamma}. Now, taking

c2=[hv00],\displaystyle c_{2}=\begin{bmatrix}h\,\lVert v_{0}\rVert\\ 0\end{bmatrix}\,,

we find the following (where 𝐯(1)\mathbf{v}_{(1)} denotes the first component of a vector 𝐯\mathbf{v}):

((AA(eAhI2d)+I2d)c2)(1)=Lhe(Lh+γ)h+γLh+γhv0.\displaystyle((AA^{\dagger}\,(e^{Ah}-I_{2d})+I_{2d})\,c_{2})_{(1)}=\frac{Lhe^{(Lh+\gamma)\,h}+\gamma}{Lh+\gamma}\,h\,\lVert v_{0}\rVert.

Finally, for h\lesssim\frac{1}{\sqrt{L}+\gamma}, this term is bounded by 2h\,\lVert v_{0}\rVert. Combining the two contributions and controlling the Brownian supremum with Lemma 22 yields the claimed exponential moment bound, completing the proof. ∎

C.3 Sub-Gaussianity of the Iterates

Similarly to [Che+21], we introduce a modified potential in order to prove sub-Gaussianity of the iterates of ULMC. First, we consider a modified distribution in the x-coordinate, with parameter a\triangleq(\beta,S) for some S,\beta\geq 0:

π(a)exp(U(a)),U(a)(x)U(x)+β2(xS)+2.\displaystyle\pi^{(a)}\propto\exp(-U^{(a)})\,,\qquad U^{(a)}(x)\triangleq U(x)+\frac{\beta}{2}\,(\lVert x\rVert-S)^{2}_{+}\,. (C.3)
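In code, the modified potential and its gradient take the following form; this is a sketch in which U and grad_U stand in for oracles for the original potential (illustrative names, not from the paper):

```python
import numpy as np

def U_mod(x, U, beta, S):
    """Modified potential (C.3): quadratic penalty outside the ball of radius S."""
    return U(x) + 0.5 * beta * max(np.linalg.norm(x) - S, 0.0) ** 2

def grad_U_mod(x, grad_U, beta, S):
    """Gradient of (C.3): grad U(x) + beta * (||x|| - S)_+ * x / ||x||."""
    r = np.linalg.norm(x)
    penalty = beta * max(r - S, 0.0) / r * x if r > 0 else np.zeros_like(x)
    return grad_U(x) + penalty
```

Note that \lVert\nabla U^{(a)}(x)-\nabla U(x)\rVert=\beta\,(\lVert x\rVert-S)_{+}, which is exactly the quantity that appears in the Girsanov bound of Proposition 29 below.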

The modified potential satisfies the following properties.

Lemma 27 (Properties of the Modified Potential, [Che+21, Lemma 23]).

Consider π(a)\pi^{(a)} and U(a)U^{(a)} defined as in (C.3). Assume that U(0)=0\nabla U(0)=0 and that U\nabla U satisfies (2.3). Then, the following assertions hold.

  1. (sub-Gaussian tail bound) Assume that S is chosen so that \pi(B(0,S))\geq 1/2. Then, for all \eta\geq 0,

     \pi^{(a)}\{\lVert\cdot\rVert\geq S+\eta\}\leq 2\exp\bigl(-\frac{\beta\eta^{2}}{2}\bigr)\,.

  2. (gradient growth) The gradient \nabla U^{(a)} satisfies

     \lVert\nabla U^{(a)}(x)\rVert\leq L+(\beta+L)\,\lVert x\rVert\,.

Then, letting {(xt(a),vt(a))}t0\{(x^{(a)}_{t},v^{(a)}_{t})\}_{t\geq 0} be the solution to the underdamped Langevin diffusion with potential U(a)U^{(a)} and μ(a)π(a)ρ\mu^{(a)}\coloneqq\pi^{(a)}\otimes\rho, the following lemma holds:

Lemma 28.

Assume that h(β+L)1/2γ1d1/2h\lesssim(\beta+L)^{-1/2}\wedge\gamma^{-1}\wedge d^{-1/2}, and β1\beta\leq 1. Then, for all δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta,

suptNhxt(a)S\displaystyle\sup_{t\leq Nh}{\lVert x_{t}^{(a)}\rVert}-S (β+L)Sh2+1β2(μ0(a)μ(a))+1βlog16Nδ.\displaystyle\lesssim(\beta+L)\,Sh^{2}+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{1}{\beta}\log\frac{16N}{\delta}}\,.

Proof.  We can use the change of measure lemma (Lemma 21) together with the sub-Gaussian tail bounds in Lemmas 25, 27 to see that with probability at least 1δ1-\delta, the following events hold simultaneously:

maxkNxkh(a)\displaystyle\max_{k\leq N}{\lVert x^{(a)}_{kh}\rVert} S+2β2(μ0(a)μ(a))+4βlog8Nδ\displaystyle\leq S+\sqrt{\frac{2}{\beta}\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{4}{\beta}\log\frac{8N}{\delta}}\,
maxkNvkh(a)\displaystyle\max_{k\leq N}{\lVert v_{kh}^{(a)}\rVert} d+22(μ0(a)μ(a))+4log4Nδ.\displaystyle\leq\sqrt{d}+\sqrt{2\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{4\log\frac{4N}{\delta}}\,.

Here we use a union bound together with the monotonicity of t2(μt(a)μ(a))t\mapsto\mathcal{R}_{2}(\mu_{t}^{(a)}\mathbin{\|}\mu^{(a)}) in tt.

For the interpolant times, we will use Grönwall’s matrix inequality, with the following inequality for xx:

xkh(a)xkh+t(a)\displaystyle\lVert x^{(a)}_{kh}-x^{(a)}_{kh+t}\rVert hvkh(a)+0t0τγvkh+τ(a)dτdτ+0t0τU(a)(xkh+τ(a))dτdτ\displaystyle\leq h\,\lVert v^{(a)}_{kh}\rVert+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\gamma v^{(a)}_{kh+\tau^{\prime}}\,\mathrm{d}\tau^{\prime}\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\nabla U^{(a)}(x^{(a)}_{kh+\tau^{\prime}})\,\mathrm{d}\tau^{\prime}\,\mathrm{d}\tau\Bigr{\rVert}
+0t0τ2γdBkh+τdτ\displaystyle\qquad+\Bigl{\lVert}\int_{0}^{t}\int_{0}^{\tau}\sqrt{2\gamma}\,\mathrm{d}B_{kh+\tau^{\prime}}\,\mathrm{d}\tau\Bigr{\rVert}
hvkh(a)+γh(hvkh(a)+0tvkh+τ(a)vkh(a)dτ)+Lh2\displaystyle\leq h\,\lVert v^{(a)}_{kh}\rVert+\gamma h\,\Bigl{(}h\,\lVert v^{(a)}_{kh}\rVert+\int_{0}^{t}\lVert v^{(a)}_{kh+\tau}-v^{(a)}_{kh}\rVert\,\mathrm{d}\tau\Bigr{)}+Lh^{2}
+(β+L)h(hxkh(a)+0txkh+τ(a)xkh(a)dτ)\displaystyle\qquad{}+(\beta+L)\,h\,\Bigl{(}h\,\lVert x^{(a)}_{kh}\rVert+\int_{0}^{t}\lVert x^{(a)}_{kh+\tau}-x^{(a)}_{kh}\rVert\,\mathrm{d}\tau\Bigr{)}
+2γhsupτ[0,h]Bkh+τBkh.\displaystyle\qquad{}+\sqrt{2\gamma}\,h\sup_{\tau\in[0,h]}{\lVert B_{kh+\tau}-B_{kh}\rVert}\,.

Likewise,

vkh(a)vkh+t(a)\displaystyle\lVert v^{(a)}_{kh}-v^{(a)}_{kh+t}\rVert 0tγvkh+τ(a)dτ+0tU(a)(xkh+τ(a))dτ+0t2γdBkh+τ\displaystyle\leq\Bigl{\lVert}\int_{0}^{t}\gamma v^{(a)}_{kh+\tau}\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\nabla U^{(a)}(x^{(a)}_{kh+\tau})\,\mathrm{d}\tau\Bigr{\rVert}+\Bigl{\lVert}\int_{0}^{t}\sqrt{2\gamma}\,\mathrm{d}B_{kh+\tau}\Bigr{\rVert}
γ(hvkh(a)+0tvkh+τ(a)vkh(a)dτ)+Lh\displaystyle\leq\gamma\,\Bigl{(}h\,\lVert v^{(a)}_{kh}\rVert+\int_{0}^{t}\lVert v^{(a)}_{kh+\tau}-v^{(a)}_{kh}\rVert\,\mathrm{d}\tau\Bigr{)}+Lh
+(β+L)(hxkh(a)+0txkh+τ(a)xkh(a)dτ)\displaystyle\qquad{}+(\beta+L)\,\Bigl{(}h\,\lVert x^{(a)}_{kh}\rVert+\int_{0}^{t}\lVert x^{(a)}_{kh+\tau}-x^{(a)}_{kh}\rVert\,\mathrm{d}\tau\Bigr{)}
+2γsupτ[0,h]Bkh+τBkh.\displaystyle\qquad{}+\sqrt{2\gamma}\sup_{\tau\in[0,h]}{\lVert B_{kh+\tau}-B_{kh}\rVert}\,.

Consequently, we can apply the matrix Grönwall inequality (Lemma 24) exactly as in the proof of Lemma 26, with c=c_{1}+c_{2} and the following choices:

A\displaystyle A =[(β+L)hγh(β+L)γ],\displaystyle=\begin{bmatrix}(\beta+L)\,h&\gamma h\\ (\beta+L)&\gamma\end{bmatrix}\,,
c1\displaystyle c_{1} =[(β+L)h2xkh(a)+γh2vkh(a)+Lh2+2γhsupt[0,h]Bkh+tBkh(β+L)hxkh(a)+γhvkh(a)+Lh+2γsupt[0,h]Bkh+tBkh],\displaystyle=\begin{bmatrix}(\beta+L)\,h^{2}\,\lVert x^{(a)}_{kh}\rVert+\gamma h^{2}\,\lVert v^{(a)}_{kh}\rVert+Lh^{2}+\sqrt{2\gamma}\,h\sup_{t\in[0,h]}\lVert B_{kh+t}-B_{kh}\rVert\\ (\beta+L)\,h\,\lVert x^{(a)}_{kh}\rVert+\gamma h\,\lVert v^{(a)}_{kh}\rVert+Lh+\sqrt{2\gamma}\sup_{t\in[0,h]}\lVert B_{kh+t}-B_{kh}\rVert\end{bmatrix}\,,
c2\displaystyle c_{2} =[hvkh(a)0].\displaystyle=\begin{bmatrix}h\,\lVert v^{(a)}_{kh}\rVert\\ 0\end{bmatrix}\,.

Note that c_{1} here is again in the range of A, so that (AA^{\dagger}-I_{2})\,c_{1}=0. Finally, after calculating the matrix exponential we find

supthxkh(a)xkh+t(a)\displaystyle\sup_{t\leq h}{\lVert x^{(a)}_{kh}-x^{(a)}_{kh+t}\rVert} hexp((β+L)h2+γh)((β+L)hxkh(a)+γhvkh(a)+Lh\displaystyle\leq h\exp\bigl{(}(\beta+L)\,h^{2}+\gamma h\bigr{)}\,\Bigl{(}(\beta+L)\,h\,\lVert x^{(a)}_{kh}\rVert+\gamma h\,\lVert v^{(a)}_{kh}\rVert+Lh
+2γsupthBkh+tBkh)\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad{}+\sqrt{2\gamma}\sup_{t\leq h}{\lVert B_{kh+t}-B_{kh}\rVert}\Bigr{)}
+h(β+L)exp((β+L)h2+γh)h+γ(β+L)h+γvkh(a)\displaystyle\qquad{}+h\,\frac{(\beta+L)\exp\bigl{(}(\beta+L)\,h^{2}+\gamma h\bigr{)}h+\gamma}{(\beta+L)\,h+\gamma}\,\lVert v^{(a)}_{kh}\rVert
2h((β+L)hxkh(a)+vkh(a)+Lh+2γsupthBkh+tBkh),\displaystyle\leq 2h\,\Bigl{(}(\beta+L)\,h\,\lVert x^{(a)}_{kh}\rVert+\lVert v^{(a)}_{kh}\rVert+Lh+\sqrt{2\gamma}\sup_{t\leq h}{\lVert B_{kh+t}-B_{kh}\rVert}\Bigr{)}\,,

where in the second line we take h1(β+L)1/21γh\lesssim\frac{1}{(\beta+L)^{1/2}}\wedge\frac{1}{\gamma}. Note that this is also entirely analogous to the calculation in Lemma 26.

Subsequently, we can take a union bound to obtain for any S1,S2S_{1},S_{2},

{supt[0,Nh]xt(a)η}\displaystyle\mathbb{P}\Bigl{\{}\sup_{t\in[0,Nh]}{\lVert x^{(a)}_{t}\rVert}\geq\eta\Bigr{\}}
{maxk=0,1,N1xkh(a)S1}+{maxk=0,1,N1vkh(a)S2}\displaystyle\qquad\leq\mathbb{P}\Bigl{\{}\max_{k=0,1,\ldots N-1}{\lVert x^{(a)}_{kh}\rVert}\geq S_{1}\Bigr{\}}+\mathbb{P}\Bigl{\{}\max_{k=0,1,\ldots N-1}{\lVert v^{(a)}_{kh}\rVert}\geq S_{2}\Bigr{\}}
+k=0N1{supt[0,h]xkh+t(a)xkh(a)ηS1}\displaystyle\qquad\qquad{}+\sum_{k=0}^{N-1}\mathbb{P}\Bigl{\{}\sup_{t\in[0,h]}{\lVert x_{kh+t}^{(a)}-x_{kh}^{(a)}\rVert}\geq\eta-S_{1}\Bigr{\}}
{maxk=0,1,N1xkh(a)S1}+{maxk=0,1,N1vkh(a)S2}\displaystyle\qquad\leq\mathbb{P}\Bigl{\{}\max_{k=0,1,\ldots N-1}{\lVert x^{(a)}_{kh}\rVert}\geq S_{1}\Bigr{\}}+\mathbb{P}\Bigl{\{}\max_{k=0,1,\ldots N-1}{\lVert v^{(a)}_{kh}\rVert}\geq S_{2}\Bigr{\}}
+k=0N1{supt[0,h]2γBkh+tBkhηS12h(β+L)S1hS2Lh}.\displaystyle\qquad\qquad{}+\sum_{k=0}^{N-1}\mathbb{P}\Bigl{\{}\sup_{t\in[0,h]}\sqrt{2\gamma}\,\lVert B_{kh+t}-B_{kh}\rVert\geq\frac{\eta-S_{1}}{2h}-(\beta+L)\,S_{1}h-S_{2}-Lh\Bigr{\}}\,.

Subsequently, taking respectively S1=S+2β2(μ0(a)μ(a))+4βlog8NδS_{1}=S+\sqrt{\frac{2}{\beta}\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{4}{\beta}\log\frac{8N}{\delta}}, S2=d+22(μ0(a)μ(a))+4log4NδS_{2}=\sqrt{d}+\sqrt{2\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{4\log\frac{4N}{\delta}}, we use the Brownian motion tail bound (Lemma 22) to get with probability 12δ1-2\delta:

suptNhxt(a)S1\displaystyle\sup_{t\leq Nh}{\lVert x_{t}^{(a)}\rVert}-S_{1} (β+L)S1h2+S2h+Lh2+γdh3log3Nδ.\displaystyle\lesssim(\beta+L)\,S_{1}h^{2}+S_{2}h+Lh^{2}+\sqrt{\gamma dh^{3}\log\frac{3N}{\delta}}\,.

If we assume that \beta\leq 1 and h\lesssim\frac{1}{\sqrt{d}}, then, after replacing \delta by \delta/2, we can further simplify this bound to yield

\sup_{t\leq Nh}{\lVert x_{t}^{(a)}\rVert}-S\lesssim(\beta+L)\,Sh^{2}+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}^{(a)}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{1}{\beta}\log\frac{16N}{\delta}}\,.

This concludes the proof. ∎

To transfer this sub-Gaussianity to the original underdamped Langevin process, we consider the following bound on the chi-squared divergence between these two processes.

Proposition 29.

Let Q_{T},Q_{T}^{(a)} denote, respectively, the laws on path space of the original and modified diffusions, under the same initialization \mu_{0}. If \beta\lesssim\frac{\gamma}{T}\wedge L and h\lesssim(\beta+L)^{-1/2}\wedge\gamma^{-1}\wedge d^{-1/2}, then

2(QTQT(a))β2L2S2Th4γ+βTγ(2(μ0μ(a))+logN).\displaystyle\mathcal{R}_{2}(Q_{T}\mathbin{\|}Q_{T}^{(a)})\lesssim\frac{\beta^{2}L^{2}S^{2}Th^{4}}{\gamma}+\frac{\beta T}{\gamma}\,\bigl{(}\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log N\bigr{)}\,.

Proof.  Condition on the event in Lemma 28, which we denote by \mathcal{E}_{\delta} for some \delta\leq 1/2. Using Girsanov's theorem (Corollary 20), we get (for h sufficiently small that Novikov's condition is satisfied)

log𝔼[(dQTdQT(a))4 1δ]\displaystyle\log\operatorname{\mathbb{E}}\Bigl{[}\Bigl{(}\frac{\mathrm{d}Q_{T}}{\mathrm{d}Q_{T}^{(a)}}\Bigr{)}^{4}\,\mathbbm{1}_{\mathcal{E}_{\delta}}\Bigr{]} 12log𝔼[exp(16γ0TU(xt(a))U(a)(xt(a))2dt) 1δ]\displaystyle\leq\frac{1}{2}\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{16}{\gamma}\int_{0}^{T}\lVert\nabla U(x_{t}^{(a)})-\nabla U^{(a)}(x_{t}^{(a)})\rVert^{2}\,\mathrm{d}t\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta}}\Bigr{]}
12log𝔼[exp(16β2γ0T(xt(a)S)+2dt) 1δ]\displaystyle\leq\frac{1}{2}\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{16\beta^{2}}{\gamma}\int_{0}^{T}\bigl{(}\lVert x_{t}^{(a)}\rVert-S\bigr{)}^{2}_{+}\,\mathrm{d}t\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta}}\Bigr{]}
β2Tγ{(β+L)2S2h4+1β2(μ0μ(a))+1βlog16Nδ}.\displaystyle\lesssim\frac{\beta^{2}T}{\gamma}\,\Bigl{\{}(\beta+L)^{2}\,S^{2}h^{4}+\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\frac{1}{\beta}\log\frac{16N}{\delta}\Bigr{\}}\,.

If we take \beta\lesssim\gamma/T and \beta\leq L, then we can use Lemma 23 to get

2(QTQT(a))\displaystyle\mathcal{R}_{2}(Q_{T}\mathbin{\|}Q_{T}^{(a)}) =log𝔼[(dQTdQT(a))2]β2L2S2Th4γ+βTγ(2(μ0μ(a))+logN).\displaystyle=\log\operatorname{\mathbb{E}}\Bigl{[}\Bigl{(}\frac{\mathrm{d}Q_{T}}{\mathrm{d}Q_{T}^{(a)}}\Bigr{)}^{2}\Bigr{]}\lesssim\frac{\beta^{2}L^{2}S^{2}Th^{4}}{\gamma}+\frac{\beta T}{\gamma}\,\bigl{(}\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log N\bigr{)}\,.

This concludes the proof. ∎

Proposition 30.

Consider the continuous-time diffusion {(x_{t},v_{t})}_{t\geq 0} initialized at \mu_{0}. For h\lesssim(\beta+L)^{-1/2}\wedge\gamma^{-1}\wedge d^{-1/2}, S\asymp\mathfrak{m}, and \beta\asymp\frac{\gamma}{T}, the following holds for any \delta\in(0,1/2) with probability at least 1-\delta:

maxk=0,,N1xkh\displaystyle\max_{k=0,\dotsc,N-1}{\lVert x_{kh}\rVert} 𝔪+Tγ(2(μ0μ(a))+logNδ),\displaystyle\lesssim\mathfrak{m}+\sqrt{\frac{T}{\gamma}\,\bigl{(}\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr{)}}\,,
maxk=0,,N1vkh\displaystyle\max_{k=0,\dotsc,N-1}{\lVert v_{kh}\rVert} d+2(μ0μ(a))+logNδ.\displaystyle\lesssim\sqrt{d}+\sqrt{\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}}\,.

Proof.  Recall from the proof of Lemma 28 that with probability 1δ1-\delta,

maxk=0,N1xkh(a)S+1β2(μ0μ(a))+1βlog8Nδ.\displaystyle\max_{k=0,\ldots N-1}{\lVert x_{kh}^{(a)}\rVert}\lesssim S+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{1}{\beta}\log\frac{8N}{\delta}}\,.

In particular, this immediately implies that the following holds: for η0\eta\geq 0,

(maxk=0,N1xkh(a)S+1β2(μ0μ(a))+1βlog8Nδ+η)\displaystyle\mathbb{P}\Bigl{(}\max_{k=0,\ldots N-1}{\lVert x_{kh}^{(a)}\rVert}\gtrsim S+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{1}{\beta}\log\frac{8N}{\delta}}+\eta\Bigr{)} Nexp(cβη2),\displaystyle\lesssim N\exp(-c\beta\eta^{2})\,,

for a universal constant c>0c>0.

Then, using the change of measure (Lemma 21) together with the bound in Proposition 29, choosing S𝔪S\asymp\mathfrak{m}, we get with probability 1δ1-\delta

maxk=0,N1xkh\displaystyle\max_{k=0,\ldots N-1}{\lVert x_{kh}\rVert} S+1β2(μ0μ(a))+1β2(QTQT(a))+1βlogNδ\displaystyle\lesssim S+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})}+\sqrt{\frac{1}{\beta}\,\mathcal{R}_{2}(Q_{T}\mathbin{\|}Q_{T}^{(a)})}+\sqrt{\frac{1}{\beta}\log\frac{N}{\delta}}
𝔪+1β(2(μ0μ(a))+logNδ)+βL2Th4𝔪2γ.\displaystyle\lesssim\mathfrak{m}+\sqrt{\frac{1}{\beta}\,\bigl{(}\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr{)}+\frac{\beta L^{2}Th^{4}\mathfrak{m}^{2}}{\gamma}}\,.

We choose βγ/T\beta\asymp\gamma/T so that

maxk=0,N1xkh\displaystyle\max_{k=0,\ldots N-1}{\lVert x_{kh}\rVert} 𝔪+Tγ(2(μ0μ(a))+logNδ).\displaystyle\lesssim\mathfrak{m}+\sqrt{\frac{T}{\gamma}\,\bigl{(}\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr{)}}\,.

Finally, combining this with a union bound to control vkh\lVert v_{kh}\rVert from Lemma 25, we get the Proposition. ∎

C.4 Completing the Discretization Proof

We proceed by following the proof of [Che+21].

Proof. [Proof of Proposition 15] Let {xt}t0\{x_{t}\}_{t\geq 0} follow the continuous-time process. Let PT,QTP_{T},Q_{T} denote the measures on the path space corresponding to the interpolated process and the continuous-time diffusion respectively, with both being initialized at μ0=π0𝒩(0,Id)\mu_{0}=\pi_{0}\otimes\mathcal{N}(0,I_{d}). Then, define

Gt12γ0tU(xτ)U(xτ/hh),dBτ14γ0tU(xτ)U(xτ/hh)2dτ.\displaystyle G_{t}\triangleq\frac{1}{\sqrt{2\gamma}}\int_{0}^{t}\langle\nabla U(x_{\tau})-\nabla U(x_{\lfloor\tau/h\rfloor h}),\mathrm{d}B_{\tau}\rangle-\frac{1}{4\gamma}\int_{0}^{t}\lVert\nabla U(x_{\tau})-\nabla U(x_{\lfloor\tau/h\rfloor h})\rVert^{2}\,\mathrm{d}\tau.

From Girsanov’s theorem (Theorem 19), we obtain immediately using Itô’s formula

𝔼QT[(dPTdQT)q]1\displaystyle\operatorname{\mathbb{E}}_{Q_{T}}\Bigl{[}\Bigl{(}\frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}}\Bigr{)}^{q}\Bigr{]}-1 =𝔼exp(qGT)1\displaystyle=\operatorname{\mathbb{E}}\exp(qG_{T})-1
=q(q1)4γ𝔼0Texp(qGt)U(xt)U(xt/hh)2dt\displaystyle=\frac{q\,(q-1)}{4\gamma}\operatorname{\mathbb{E}}\int_{0}^{T}\exp(qG_{t})\,\lVert\nabla U(x_{t})-\nabla U(x_{\lfloor t/h\rfloor h})\rVert^{2}\,\mathrm{d}t
q24γ0T𝔼[exp(2qGt)]𝔼[U(xt)U(xt/hh)4]dt.\displaystyle\leq\frac{q^{2}}{4\gamma}\int_{0}^{T}\sqrt{\operatorname{\mathbb{E}}[\exp(2qG_{t})]\operatorname{\mathbb{E}}[\lVert\nabla U(x_{t})-\nabla U(x_{\lfloor t/h\rfloor h})\rVert^{4}]}\,\mathrm{d}t\,.

Bounding these terms individually, we first use Corollary 20 and (2.3) to get

𝔼exp(2qGt)\displaystyle\operatorname{\mathbb{E}}\exp(2qG_{t}) 𝔼exp(4q2γ0tU(xr)U(xr/hh)2dr)\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}\exp\Bigl{(}\frac{4q^{2}}{\gamma}\int_{0}^{t}\lVert\nabla U(x_{r})-\nabla U(x_{\lfloor r/h\rfloor h})\rVert^{2}\,\mathrm{d}r\Bigr{)}}
𝔼exp(4L2q2γ0txrxr/hh2sdr).\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}\exp\Bigl{(}\frac{4L^{2}q^{2}}{\gamma}\int_{0}^{t}\lVert x_{r}-x_{\lfloor r/h\rfloor h}\rVert^{2s}\,\mathrm{d}r\Bigr{)}}\,.

Let us now condition on the event

\mathcal{E}_{\delta,kh}\triangleq\Bigl\{\max_{j=0,1,\dotsc,k-1}{\lVert x_{jh}\rVert}\leq R_{\delta}^{x},\;\max_{j=0,1,\dotsc,k-1}{\lVert v_{jh}\rVert}\leq R_{\delta}^{v}\Bigr\}\,.

By Proposition 30, we can have (δ,kh)1δ\mathbb{P}(\mathcal{E}_{\delta,kh})\geq 1-\delta while choosing

R_{\delta}^{x}\lesssim\mathfrak{m}+\sqrt{\frac{T}{\gamma}\,\bigl(\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr)}\,,
Rδv\displaystyle R_{\delta}^{v} d+2(μ0μ(a))+logNδ.\displaystyle\lesssim\sqrt{d}+\sqrt{\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}}\,.

We proceed to bound our desired quantity through some careful steps.

One step error. Consider first the error on a single interval [0,h][0,h]. If we presume that the step size satisfies h(γ1s/(L2dsq2))1/(1+3s)h\lesssim(\gamma^{1-s}/(L^{2}d^{s}q^{2}))^{1/(1+3s)}, Lemma 26 implies

\log\operatorname{\mathbb{E}}\exp\Bigl(\frac{8L^{2}q^{2}}{\gamma}\int_{0}^{h}\lVert x_{t}-x_{0}\rVert^{2s}\,\mathrm{d}t\Bigr)\leq\log\operatorname{\mathbb{E}}\exp\Bigl(\frac{8L^{2}hq^{2}}{\gamma}\sup_{t\in[0,h]}{\lVert x_{t}-x_{0}\rVert}^{2s}\Bigr)
L2+2sh1+4sq2γ(1+x02s2)+L2h1+2sq2γv02s\displaystyle\lesssim\frac{L^{2+2s}h^{1+4s}q^{2}}{\gamma}\,(1+\lVert x_{0}\rVert^{2s^{2}})+\frac{L^{2}h^{1+2s}q^{2}}{\gamma}\,\lVert v_{0}\rVert^{2s}
+L2dsh1+3sq2γ1s.\displaystyle\qquad{}+\frac{L^{2}d^{s}h^{1+3s}q^{2}}{\gamma^{1-s}}\,.

Iteration. If we let \{\mathcal{F}_{t}\}_{t\geq 0} denote the filtration, then, writing H_{t}=\int_{0}^{t}\lVert x_{r}-x_{\lfloor r/h\rfloor h}\rVert^{2s}\,\mathrm{d}r, we can condition on \mathcal{F}_{(N-1)h} and iterate our one-step bound:

log𝔼[exp(8L2q2γHNh) 1δ,Nh]\displaystyle\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{Nh}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,Nh}}\Bigr{]}
log𝔼[exp(8L2q2γH(N1)h\displaystyle\qquad\leq\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{(N-1)h}
+𝒪(L2+2sh1+4sq2γ(1+x(N1)h2s2)\displaystyle\qquad\qquad\qquad\qquad\qquad{}+\mathcal{O}\bigl{(}\frac{L^{2+2s}h^{1+4s}q^{2}}{\gamma}\,(1+\lVert x_{(N-1)h}\rVert^{2s^{2}})
+L2h1+2sq2γv(N1)h2s+L2dsh1+3sq2γ1s)) 1δ,Nh]\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad{}+\frac{L^{2}h^{1+2s}q^{2}}{\gamma}\,\lVert v_{(N-1)h}\rVert^{2s}+\frac{L^{2}d^{s}h^{1+3s}q^{2}}{\gamma^{1-s}}\bigr{)}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,Nh}}\Bigr{]}
log𝔼[exp(8L2q2γH(N1)h) 1δ,(N1)h]\displaystyle\qquad\leq\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{(N-1)h}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,(N-1)h}}\Bigr{]}
+𝒪(L2+2sh1+4sq2γ(Rδx)2s2+L2h1+2sq2γ(Rδv)2s+L2dsh1+3sq2γ1s).\displaystyle\qquad\qquad{}+\mathcal{O}\Bigl{(}\frac{L^{2+2s}h^{1+4s}q^{2}}{\gamma}\,(R_{\delta}^{x})^{2s^{2}}+\frac{L^{2}h^{1+2s}q^{2}}{\gamma}\,(R_{\delta}^{v})^{2s}+\frac{L^{2}d^{s}h^{1+3s}q^{2}}{\gamma^{1-s}}\Bigr{)}\,.

We now make additional simplifying assumptions to obtain more interpretable bounds: we assume γ/T1\gamma/T\leq 1 and h1L(1d1/2𝔪s)h\lesssim\frac{1}{L}\,(1\wedge\frac{d^{1/2}}{\mathfrak{m}^{s}}). With these assumptions,

log𝔼[exp(8L2q2γHNh) 1δ,Nh]\displaystyle\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{Nh}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,Nh}}\Bigr{]}
log𝔼[exp(8L2q2γH(N1)h) 1δ,(N1)h]\displaystyle\qquad\leq\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{(N-1)h}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,(N-1)h}}\Bigr{]}
+𝒪(L2h1+2sq2γ(d+2(μ0μ(a))+logNδ)s).\displaystyle\qquad\qquad{}+\mathcal{O}\Bigl{(}\frac{L^{2}h^{1+2s}q^{2}}{\gamma}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr{)}^{s}\Bigr{)}\,.

Completing this iteration yields

log𝔼[exp(8L2q2γHNh) 1δ,Nh]\displaystyle\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{8L^{2}q^{2}}{\gamma}\,H_{Nh}\Bigr{)}\,\mathbbm{1}_{\mathcal{E}_{\delta,Nh}}\Bigr{]} L2Th2sq2γ(d+2(μ0μ(a))+logNδ)s.\displaystyle\lesssim\frac{L^{2}Th^{2s}q^{2}}{\gamma}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log\frac{N}{\delta}\bigr{)}^{s}\,.

Finally, applying Lemma 23 when

hsγ1/(2s)L1/sT1/(2s)q1/s\displaystyle h\lesssim_{s}\frac{\gamma^{1/(2s)}}{L^{1/s}T^{1/(2s)}q^{1/s}} (C.4)

(where s\lesssim_{s} hides an ss-dependent constant), we find

log𝔼[exp(4L2q2γHNh)]\displaystyle\log\operatorname{\mathbb{E}}\Bigl{[}\exp\Bigl{(}\frac{4L^{2}q^{2}}{\gamma}\,H_{Nh}\Bigr{)}\Bigr{]} 1+L2Th2sq2γ(d+2(μ0μ(a))+logN)s.\displaystyle\lesssim 1+\frac{L^{2}Th^{2s}q^{2}}{\gamma}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log N\bigr{)}^{s}\,.

It remains to choose the appropriate step size hh which makes this whole quantity 1\lesssim 1. In particular, it suffices to choose

h𝒪~s(γ1/(2s)L1/sT1/(2s)q1/s(d+2(μ0μ(a)))1/2).\displaystyle h\lesssim\widetilde{\mathcal{O}}_{s}\Bigl{(}\frac{\gamma^{1/(2s)}}{L^{1/s}T^{1/(2s)}q^{1/s}\,(d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)}))^{1/2}}\Bigr{)}\,. (C.5)

Second term. It remains to bound the other term in our original expression. From Lemma 26, we obtain

𝔼[exp(λxkh+txkh2s)wkh]1,\displaystyle\operatorname{\mathbb{E}}[\exp(\lambda\,\lVert x_{kh+t}-x_{kh}\rVert^{2s})\mid w_{kh}]\lesssim 1\,,

so long as λ\lambda is chosen to be appropriately small, i.e.,

λ1γsdsh3s1L2sh4s(1+xkh)2s21h2svkh2s.\displaystyle\lambda\asymp\frac{1}{\gamma^{s}d^{s}h^{3s}}\wedge\frac{1}{L^{2s}h^{4s}\,(1+\lVert x_{kh}\rVert)^{2s^{2}}}\wedge\frac{1}{h^{2s}\,\lVert v_{kh}\rVert^{2s}}\,.

This immediately implies a tail bound: for η0\eta\geq 0,

{xkh+txkh4sηwkh}exp(λη).\displaystyle\mathbb{P}\{\lVert x_{kh+t}-x_{kh}\rVert^{4s}\geq\eta\mid w_{kh}\}\lesssim\exp(-\lambda\sqrt{\eta})\,.

Integrating, we get

𝔼[U(xt)U(xkh)4]L2𝔼[xtxkh4s]L2𝔼1λ2\displaystyle\sqrt{\operatorname{\mathbb{E}}[\lVert\nabla U(x_{t})-\nabla U(x_{kh})\rVert^{4}]}\leq L^{2}\sqrt{\operatorname{\mathbb{E}}[\lVert x_{t}-x_{kh}\rVert^{4s}]}\lesssim L^{2}\sqrt{\operatorname{\mathbb{E}}\frac{1}{\lambda^{2}}}
L2γsdsh3s+L2+2sh4s1+𝔼[xkh4s2]+L2h2s𝔼[vkh4s].\displaystyle\qquad\lesssim L^{2}\gamma^{s}d^{s}h^{3s}+L^{2+2s}h^{4s}\sqrt{1+\operatorname{\mathbb{E}}[\lVert x_{kh}\rVert^{4s^{2}}]}+L^{2}h^{2s}\sqrt{\operatorname{\mathbb{E}}[\lVert v_{kh}\rVert^{4s}]}\,.

We can estimate the expectations by integration of our previous tail bound (Proposition 30):

\sqrt{\operatorname{\mathbb{E}}[\lVert\nabla U(x_{t})-\nabla U(x_{kh})\rVert^{4}]}\lesssim L^{2}\gamma^{s}d^{s}h^{3s}+L^{2+2s}h^{4s}\,\Bigl(\mathfrak{m}+\sqrt{\frac{T}{\gamma}\,\bigl(\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log N\bigr)}\Bigr)^{2s^{2}}
+L2h2s(d+2(μ0μ(a))+logN)s\displaystyle\qquad{}+L^{2}h^{2s}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})+\log N\bigr{)}^{s}
𝒪~(L2h2s(d+2(μ0μ(a)))s),\displaystyle\leq\widetilde{\mathcal{O}}\Bigl{(}L^{2}h^{2s}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})\bigr{)}^{s}\Bigr{)}\,,

provided that h𝒪~(1L(d1/2𝔪s2(γ2T)s/2))h\leq\widetilde{\mathcal{O}}(\frac{1}{L}\,(\frac{d^{1/2}}{\mathfrak{m}^{s}}\wedge\mathcal{R}_{2}\,(\frac{\gamma\mathcal{R}_{2}}{T})^{s/2})), where 2=2(μ0μ(a))\mathcal{R}_{2}=\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)}). In our applications, this condition is not dominant and can be disregarded.

Combining the bounds. Finally, we can combine each of these steps to find that, provided (C.5) for the step size holds,

𝔼QT[(dPTdQT)q]1\displaystyle\operatorname{\mathbb{E}}_{Q_{T}}\Bigl{[}\Bigl{(}\frac{\mathrm{d}P_{T}}{\mathrm{d}Q_{T}}\Bigr{)}^{q}\Bigr{]}-1 𝒪~(Tq2γL2h2s(d+2(μ0μ(a)))s).\displaystyle\leq\widetilde{\mathcal{O}}\Bigl{(}\frac{Tq^{2}}{\gamma}\,L^{2}h^{2s}\,\bigl{(}d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)})\bigr{)}^{s}\Bigr{)}\,.

Finally, the following step size condition suffices to bound the Rényi divergence by ϵ2\epsilon^{2}:

h𝒪~s(γ1/(2s)ϵ1/sL1/sT1/(2s)q1/s(d+2(μ0μ(a)))1/2).\displaystyle h\lesssim\widetilde{\mathcal{O}}_{s}\Bigl{(}\frac{\gamma^{1/(2s)}\epsilon^{1/s}}{L^{1/s}T^{1/(2s)}q^{1/s}\,(d+\mathcal{R}_{2}(\mu_{0}\mathbin{\|}\mu^{(a)}))^{1/2}}\Bigr{)}\,.

This completes the proof. ∎

Appendix D Proof of the Main Results

First, we collect some results on feasible initializations from [Che+21]. Recall that \pi^{(a)} is the modified distribution introduced in Appendix C.3. Let

π0=𝒩(0,ςId),\pi_{0}=\mathcal{N}(0,\varsigma I_{d}),

where \varsigma=(2L+\beta)^{-1} is the per-coordinate variance of the Gaussian, and \beta\asymp 1/T is the parameter appearing in the modified potential. The choice of T depends on which assumption is in force; we collect the conditions below:

T={Θ~(L+d𝔮(γ))π satisfies (PI)Θ~(LC𝖫𝖲𝖨)π satisfies (LSI), or is strongly log-concave,\displaystyle T=\begin{cases}\widetilde{\Theta}\bigl{(}\frac{L+d}{\mathfrak{q}(\gamma)}\bigr{)}&\text{$\pi$ satisfies \eqref{eq:pi}}\\ \widetilde{\Theta}(\sqrt{L}C_{\mathsf{LSI}})&\text{$\pi$ satisfies \eqref{eq:LSI}, or is strongly log-concave,}\end{cases}

where 𝔮(γ)\mathfrak{q}(\gamma) is defined in (3.2).

Lemma 31 (Adapted from [Che+21, Appendix A]).

Suppose that π\pi satisfies (PI) and the Hölder continuity condition (2.3), as well as U(0)=0\nabla U(0)=0, U(0)minUdU(0)-\min U\lesssim d. Then the following two properties hold for π0=𝒩(0,(2L+β)1Id)\pi_{0}=\mathcal{N}(0,(2L+\beta)^{-1}I_{d}), where β\beta is the parameter appearing in the modified potential:

q(π0π)\displaystyle\mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi) 𝒪~(β+L+d),\displaystyle\leq\widetilde{\mathcal{O}}(\beta+L+d)\,,
q(π0π(a))\displaystyle\mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi^{(a)}) 𝒪~(β+L+d).\displaystyle\leq\widetilde{\mathcal{O}}(\beta+L+d)\,.

Proof.  Apply either [Che+21, Lemma 30] or [Che+21, Lemma 31]. ∎

From our analysis we take \beta\lesssim L, and if moreover L\lesssim d, then \mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi),\mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi^{(a)})\leq\widetilde{\mathcal{O}}(d). Let \mu_{0}=\pi_{0}\otimes\rho, so that \mathcal{R}_{q}(\mu_{0}\mathbin{\|}\mu)=\mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi), and similarly \mathcal{R}_{q}(\mu_{0}\mathbin{\|}\mu^{(a)})=\mathcal{R}_{q}(\pi_{0}\mathbin{\|}\pi^{(a)}).

The following lemma gives a bound on the value of the Fisher information at initialization.

Lemma 32.

Under the conditions of the previous lemma, the initialization \mu_{0}=\pi_{0}\otimes\rho also satisfies \mathsf{FI}(\mu_{0}\mathbin{\|}\mu)\lesssim Ld+L^{2-s}d^{s}.

Proof.  Note that as U(0)=0\nabla U(0)=0, logπ(x)2=U(x)2L2x2s\lVert\nabla\log\pi(x)\rVert^{2}=\lVert\nabla U(x)\rVert^{2}\leq L^{2}\,\lVert x\rVert^{2s}. Secondly, π0\pi_{0} satisfies 𝔼xπ0[x2]d/L\operatorname{\mathbb{E}}_{x\sim\pi_{0}}[\lVert x\rVert^{2}]\lesssim d/L. Hence,

\mathsf{FI}(\mu_{0}\mathbin{\|}\mu)=\operatorname{\mathbb{E}}_{\pi_{0}}\Bigl[\Bigl\lVert\nabla\log\frac{\pi_{0}}{\pi}\Bigr\rVert^{2}\Bigr]=\operatorname{\mathbb{E}}_{x\sim\pi_{0}}[\lVert\nabla U(x)-(2L+\beta)\,x\rVert^{2}]\lesssim L^{2}\operatorname{\mathbb{E}}_{x\sim\pi_{0}}[\lVert x\rVert^{2}+\lVert x\rVert^{2s}]\lesssim Ld+L^{2-s}d^{s}\,,

where we used Jensen’s inequality in the last step. ∎

D.1 Poincaré Inequality

Proof. [Proof of Theorem 9] The continuous-time result from Lemma 8 states that

T\gtrsim\frac{1}{\mathfrak{q}(\gamma)}\log\frac{\chi^{2}(\mu_{0}\mathbin{\|}\mu)}{\epsilon^{2}}\implies\chi^{2}(\mu_{T}\mathbin{\|}\mu)\leq\epsilon^{2}\,.

Since there exists a feasible initialization such that \log\chi^{2}(\mu_{0}\mathbin{\|}\mu)\leq\widetilde{\mathcal{O}}(L+d), this is satisfied by choosing T=\widetilde{\mathcal{O}}(\frac{1}{\mathfrak{q}(\gamma)}\,(L+d+\log\frac{1}{\epsilon})). This also shows that \mathcal{R}_{2}(\mu_{T}\mathbin{\|}\mu)=\log(1+\chi^{2}(\mu_{T}\mathbin{\|}\mu))\lesssim\epsilon^{2} for \epsilon\lesssim 1.

Note the following decomposition (weak triangle inequality) for the Rényi divergence; see, e.g., [Mir17, Proposition 11]:

q(P1P2)q1/𝔠q1𝔠q(P1P3)+𝔡(q1/𝔠)(P3P2),\displaystyle\mathcal{R}_{q}(P_{1}\mathbin{\|}P_{2})\leq\frac{q-1/\mathfrak{c}}{q-1}\,\mathcal{R}_{\mathfrak{c}q}(P_{1}\mathbin{\|}P_{3})+\mathcal{R}_{\mathfrak{d}(q-1/\mathfrak{c})}(P_{3}\mathbin{\|}P_{2}),

for any valid Hölder conjugate pair 𝔠,𝔡\mathfrak{c},\mathfrak{d}, i.e., 1𝔠+1𝔡=1\frac{1}{\mathfrak{c}}+\frac{1}{\mathfrak{d}}=1, 𝔠,𝔡>1\mathfrak{c},\mathfrak{d}>1, and any three probability distributions P1,P2,P3P_{1},P_{2},P_{3}.

In our case, we let q=2-\xi and \mathfrak{d}\,(q-1/\mathfrak{c})=2. Substituting 1/\mathfrak{d}=1-1/\mathfrak{c} into the second constraint gives 1/\mathfrak{c}=2-q=\xi, so that \mathfrak{c}q=2/\xi-1\leq 2/\xi and \frac{q-1/\mathfrak{c}}{q-1}=\frac{2-2\xi}{1-\xi}=2. Together with the monotonicity of \mathcal{R}_{\alpha} in the order \alpha, this yields the following for \xi\leq 1/2:

2ξ(P1P2)\displaystyle\mathcal{R}_{2-\xi}(P_{1}\mathbin{\|}P_{2}) 22/ξ(P1P3)+2(P3P2).\displaystyle\leq 2\mathcal{R}_{2/\xi}(P_{1}\mathbin{\|}P_{3})+\mathcal{R}_{2}(P_{3}\mathbin{\|}P_{2})\,.

Consequently, taking P_{1}=\hat{\mu}_{Nh}, P_{2}=\mu, P_{3}=\mu_{Nh} and combining this result with the discretization bound of Proposition 15, we obtain

2ξ(μ^Nhμ)2/ξ(μ^NhμNh)+2(μNhμ)ϵ2,\displaystyle\mathcal{R}_{2-\xi}(\hat{\mu}_{Nh}\mathbin{\|}\mu)\lesssim\mathcal{R}_{2/\xi}(\hat{\mu}_{Nh}\mathbin{\|}\mu_{Nh})+\mathcal{R}_{2}(\mu_{Nh}\mathbin{\|}\mu)\lesssim\epsilon^{2}\,,

so long as

h\displaystyle h =Θ~(γ1/(2s)ϵ1/sξ1/s𝔮(γ)1/(2s)L1/sd1/2(Ld)1/(2s)),\displaystyle=\widetilde{\Theta}\Bigl{(}\frac{\gamma^{1/(2s)}\epsilon^{1/s}\xi^{1/s}\mathfrak{q}(\gamma)^{1/(2s)}}{L^{1/s}d^{1/2}\,{(L\vee d)}^{1/(2s)}}\Bigr{)}\,,
N\displaystyle N =Θ~(L1/sd1/2(Ld)1+1/(2s)γ1/(2s)ϵ1/sξ1/s𝔮(γ)1+1/(2s)).\displaystyle=\widetilde{\Theta}\Bigl{(}\frac{L^{1/s}d^{1/2}\,{(L\vee d)}^{1+1/(2s)}}{\gamma^{1/(2s)}\epsilon^{1/s}\xi^{1/s}\mathfrak{q}(\gamma)^{1+1/(2s)}}\Bigr{)}\,.

This completes the proof. ∎

D.2 Log-Sobolev Inequality

D.2.1 KL Divergence

Proof. [Proof of Theorem 6] We work in the twisted coordinates (\phi,\psi) used in Lemma 14. Consider the following decomposition of the \mathsf{KL} divergence, obtained via Cauchy–Schwarz:

𝖪𝖫(μ^Tμ)\displaystyle\mathsf{KL}(\hat{\mu}_{T}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}}) =logμ^Tμdμ^T\displaystyle=\int\log\frac{\hat{\mu}_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\,\mathrm{d}\hat{\mu}_{T}^{\mathcal{M}}
=𝖪𝖫(μ^TμT)+logμTμdμ^T\displaystyle=\mathsf{KL}(\hat{\mu}_{T}^{\mathcal{M}}\mathbin{\|}\mu_{T}^{\mathcal{M}})+\int\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\,\mathrm{d}\hat{\mu}_{T}^{\mathcal{M}}
=𝖪𝖫(μ^TμT)+𝖪𝖫(μTμ)+logμTμd(μ^TμT)\displaystyle=\mathsf{KL}(\hat{\mu}_{T}^{\mathcal{M}}\mathbin{\|}\mu_{T}^{\mathcal{M}})+\mathsf{KL}(\mu_{T}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}})+\int\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\,\mathrm{d}(\hat{\mu}_{T}^{\mathcal{M}}-\mu_{T}^{\mathcal{M}})
𝖪𝖫(μ^TμT)+𝖪𝖫(μTμ)+χ2(μ^TμT)×𝗏𝖺𝗋μT(logμTμ).\displaystyle\leq\mathsf{KL}(\hat{\mu}_{T}^{\mathcal{M}}\mathbin{\|}\mu_{T}^{\mathcal{M}})+\mathsf{KL}(\mu_{T}^{\mathcal{M}}\mathbin{\|}\mu^{\mathcal{M}})+\sqrt{\chi^{2}(\hat{\mu}_{T}^{\mathcal{M}}\mathbin{\|}\mu_{T}^{\mathcal{M}})\times\mathsf{var}_{\mu_{T}^{\mathcal{M}}}\Bigl{(}\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\Bigr{)}}\,.

Using the log-Sobolev inequality of the iterates via Lemma 14, we find (through the implication that a log-Sobolev inequality implies a Poincaré inequality with the same constant)

𝗏𝖺𝗋μT(logμTμ)C𝖫𝖲𝖨(μT)𝔼μT[logμTμ2],\displaystyle\mathsf{var}_{\mu_{T}^{\mathcal{M}}}\Bigl{(}\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\Bigr{)}\leq C_{\mathsf{LSI}}(\mu_{T}^{\mathcal{M}})\operatorname{\mathbb{E}}_{\mu_{T}^{\mathcal{M}}}\Bigl{[}\Bigl{\lVert}\nabla\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\Bigr{\rVert}^{2}\Bigr{]}\,,

where we substitute \log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}} for the test function in (PI). Here, C_{\mathsf{LSI}}(\mu_{T}^{\mathcal{M}})\lesssim 1/m for all T\geq 0.

Since \mu^{\mathcal{M}}=\mathcal{M}_{\#}\mu and \mathcal{M} is linear and invertible, we have \mu^{\mathcal{M}}(\phi,\psi)\propto\mu(\mathcal{M}^{-1}(\phi,\psi)). Therefore,

\nabla\log\mu^{\mathcal{M}} = (\mathcal{M}^{-1})^{\mathsf{T}}\,\nabla\log\mu \circ \mathcal{M}^{-1}\,,

and similarly for $\nabla\log\mu_{T}^{\mathcal{M}}$. This yields the expression

\mathbb{E}_{\mu_{T}^{\mathcal{M}}}\Bigl[\Bigl\lVert \nabla\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}} \Bigr\rVert^{2}\Bigr] = \mathbb{E}_{\mu_{T}}\Bigl[\Bigl\lVert (\mathcal{M}^{-1})^{\mathsf{T}}\,\nabla\log\frac{\mu_{T}}{\mu} \Bigr\rVert^{2}\Bigr]\,.
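The gradient identity above is simply the change of variables for a pushforward under an invertible linear map: if $\nu^{\mathcal{M}} = \mathcal{M}_{\#}\nu$, then $\nu^{\mathcal{M}}(y) = \nu(\mathcal{M}^{-1}y)/\lvert\det\mathcal{M}\rvert$, so

\nabla \log \nu^{\mathcal{M}}(y) = (\mathcal{M}^{-1})^{\mathsf{T}}\,\nabla\log\nu(\mathcal{M}^{-1}y)\,,

since the constant $-\log\lvert\det\mathcal{M}\rvert$ drops out upon taking gradients; in any case, the determinants cancel in the ratio $\mu_{T}^{\mathcal{M}}/\mu^{\mathcal{M}}$.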

Also, one has

\mathcal{M}^{-1}\,(\mathcal{M}^{-1})^{\mathsf{T}} = \begin{bmatrix} 1 & -\gamma/2 \\ -\gamma/2 & \gamma^{2}/2 \end{bmatrix}\,.
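As a consistency check (the precise convention for $\mathcal{M}$ is the one fixed in Lemma 14, which we do not restate here), this is the Gram matrix of the lower-triangular factor

\mathcal{M}^{-1} = \begin{bmatrix} 1 & 0 \\ -\gamma/2 & \gamma/2 \end{bmatrix}, \qquad \text{since} \qquad \begin{bmatrix} 1 & 0 \\ -\gamma/2 & \gamma/2 \end{bmatrix} \begin{bmatrix} 1 & -\gamma/2 \\ 0 & \gamma/2 \end{bmatrix} = \begin{bmatrix} 1 & -\gamma/2 \\ -\gamma/2 & \gamma^{2}/2 \end{bmatrix}\,,

and this particular factor would correspond to the twist $(\phi,\psi) = \mathcal{M}(x,v) = (x,\, x + \frac{2}{\gamma}v)$.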

For $c_{0}>0$ and $\mathfrak{M}$ defined in Appendix B.1, we have

L\,\mathfrak{M} - c_{0}\,\mathcal{M}^{-1}\,(\mathcal{M}^{-1})^{\mathsf{T}} = \begin{bmatrix} 1/4 - c_{0} & \sqrt{L}\,(1/\sqrt{2} + c_{0}\sqrt{2}) \\ \sqrt{L}\,(1/\sqrt{2} + c_{0}\sqrt{2}) & L\,(4 - c_{0}) \end{bmatrix}\,.

The determinant is $L\,((\frac{1}{4}-c_{0})\,(4-c_{0})-(\frac{1}{\sqrt{2}}+c_{0}\sqrt{2})^{2})$, which equals $L/2$ at $c_{0}=0$ and hence remains positive for $c_{0}>0$ sufficiently small; since the diagonal entries are also positive in this regime, the matrix is positive semidefinite. This shows that $\mathcal{M}^{-1}\,(\mathcal{M}^{-1})^{\mathsf{T}} \preceq c_{0}^{-1}L\,\mathfrak{M}$, and therefore
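For the skeptical reader, here is a quick symbolic sanity check of this determinant computation (purely illustrative, not part of the proof), confirming positivity for $c_{0}$ below a small explicit threshold:

    # Sanity check (not part of the proof): the determinant above, divided
    # by L, is positive for all sufficiently small c0 > 0.
    import sympy as sp

    c0 = sp.symbols('c0', real=True)
    det_over_L = (sp.Rational(1, 4) - c0) * (4 - c0) \
        - (1 / sp.sqrt(2) + c0 * sp.sqrt(2)) ** 2

    print(sp.expand(det_over_L))   # -c0**2 - 25*c0/4 + 1/2
    print(det_over_L.subs(c0, 0))  # 1/2, so positive at c0 = 0
    # positive root of the quadratic: det_over_L > 0 for 0 < c0 < ~0.079
    print([r.evalf(3) for r in sp.solve(det_over_L, c0) if r > 0])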

\mathbb{E}_{\mu_{T}^{\mathcal{M}}}\Bigl[\Bigl\lVert \nabla\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}} \Bigr\rVert^{2}\Bigr] \lesssim L\,\mathsf{FI}_{\mathfrak{M}}(\mu_{T} \mathbin{\|} \mu)\,.

Here we define

\mathsf{FI}_{\mathfrak{M}}(\mu' \mathbin{\|} \mu) \triangleq \mathbb{E}_{\mu'}\Bigl[\Bigl\lVert \mathfrak{M}^{1/2}\,\nabla\log\frac{\mu'}{\mu} \Bigr\rVert^{2}\Bigr]\,,

which reduces to the usual relative Fisher information when $\mathfrak{M}$ is the identity.

The decay of the Fisher information via Lemma 5 allows us to set

T \gtrsim C_{\mathsf{LSI}}\sqrt{L}\,\log\Bigl(\frac{\kappa}{\epsilon^{2}}\,\bigl(\mathsf{KL}(\mu_{0} \mathbin{\|} \mu) + \mathsf{FI}_{\mathfrak{M}}(\mu_{0} \mathbin{\|} \mu)\bigr)\Bigr) \implies \mathsf{var}_{\mu_{T}^{\mathcal{M}}}\Bigl(\log\frac{\mu_{T}^{\mathcal{M}}}{\mu^{\mathcal{M}}}\Bigr) \lesssim \epsilon^{2}\,.

The same choice of $T$ also ensures that $\mathsf{KL}(\mu_{T}^{\mathcal{M}} \mathbin{\|} \mu^{\mathcal{M}}) \leq \epsilon^{2}$. From our initialization (Lemma 32), we have the naive estimates

\mathsf{FI}_{\mathfrak{M}}(\mu_{0} \mathbin{\|} \mu) \lesssim \frac{1}{L}\,\mathsf{FI}(\pi_{0} \mathbin{\|} \pi) \lesssim d\,,

and $\mathsf{KL}(\mu_{0} \mathbin{\|} \mu) \lesssim d\log\kappa$, so that our condition on $T$ is (with $C_{\mathsf{LSI}} \leq m^{-1}$)

T \gtrsim \frac{\sqrt{L}}{m}\,\log\frac{\kappa d}{\epsilon^{2}}\,.

Recall as well that this requires $\gamma \asymp \sqrt{L}$. For the remaining $\chi^{2}(\hat{\mu}_{T} \mathbin{\|} \mu_{T})$ and $\mathsf{KL}(\hat{\mu}_{T} \mathbin{\|} \mu_{T})$ terms, we invoke Proposition 15 with the specified value of $T = Nh$, target accuracy $\epsilon$, $q=2$, and $s=1$, which yields

h = \widetilde{\Theta}\Bigl(\frac{\epsilon\,m^{1/2}}{L\,d^{1/2}}\Bigr)\,,

with

N = \widetilde{\Theta}\Bigl(\frac{\kappa^{3/2}\,d^{1/2}}{\epsilon}\Bigr)

(using $N = T/h$, since $\frac{\sqrt{L}}{m} \cdot \frac{L\,d^{1/2}}{\epsilon\,m^{1/2}} = \frac{\kappa^{3/2}\,d^{1/2}}{\epsilon}$). ∎
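To illustrate how this guarantee scales, here is a minimal, purely illustrative Python sketch packaging the parameter choices from the proof above; the function name is ours, and all absolute constants and logarithmic factors are deliberately dropped, so everything is only up to $\widetilde{\Theta}$:

    import math

    def ulmc_kl_parameters(L, m, d, eps):
        """Step size h, iteration count N, and total time T = N * h from
        Theorem 6, up to absolute constants and logarithmic factors."""
        kappa = L / m                                # condition number
        h = eps * math.sqrt(m) / (L * math.sqrt(d))  # eps m^{1/2} / (L d^{1/2})
        N = kappa ** 1.5 * math.sqrt(d) / eps        # kappa^{3/2} d^{1/2} / eps
        return h, N, N * h                           # N * h = sqrt(L) / m

    # Example: L = 10, m = 1, d = 1000, eps = 0.1 gives T = N * h ~ sqrt(10).
    print(ulmc_kl_parameters(10.0, 1.0, 1000.0, 0.1))

Note that the product $N h$ indeed returns the total integration time $\sqrt{L}/m$, matching the requirement on $T$ above.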

D.2.2 TV Distance

Proof. [Proof of Theorem 7] Notice first that the $\mathsf{TV}$ distance is a metric, and therefore satisfies the triangle inequality. Combining this with two applications of Pinsker's inequality,

\begin{aligned}
\lVert \hat{\mu}_{Nh} - \mu \rVert_{\mathsf{TV}} &\leq \lVert \hat{\mu}_{Nh} - \mu_{Nh} \rVert_{\mathsf{TV}} + \lVert \mu_{Nh} - \mu \rVert_{\mathsf{TV}} \\
&\lesssim \sqrt{\mathsf{KL}(\hat{\mu}_{Nh} \mathbin{\|} \mu_{Nh})} + \sqrt{\mathsf{KL}(\mu_{Nh} \mathbin{\|} \mu)}\,.
\end{aligned}
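In particular, it suffices to drive each $\mathsf{KL}$ term below $\epsilon^{2}$:

\sqrt{\mathsf{KL}(\hat{\mu}_{Nh} \mathbin{\|} \mu_{Nh})} + \sqrt{\mathsf{KL}(\mu_{Nh} \mathbin{\|} \mu)} \lesssim \epsilon \qquad \text{provided} \qquad \mathsf{KL}(\hat{\mu}_{Nh} \mathbin{\|} \mu_{Nh}) \vee \mathsf{KL}(\mu_{Nh} \mathbin{\|} \mu) \lesssim \epsilon^{2}\,.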

These terms can be bounded separately. Analogously to the proof of the previous theorem, using Lemma 5, it suffices to take

T \gtrsim C_{\mathsf{LSI}}\sqrt{L}\,\log\frac{d}{\epsilon^{2}}\,,

and for the other term, it suffices to invoke Proposition 15 with any value of $q$ and with $\gamma \asymp \sqrt{L}$, which, combined with the requirement on $T$, yields

h = \widetilde{\Theta}\Bigl(\frac{\epsilon}{C_{\mathsf{LSI}}^{1/2}\,L\,d^{1/2}}\Bigr)\,,

with

N = \widetilde{\Theta}\Bigl(\frac{C_{\mathsf{LSI}}^{3/2}\,L^{3/2}\,d^{1/2}}{\epsilon}\Bigr)

(using $N = T/h$, since $C_{\mathsf{LSI}}\sqrt{L} \cdot \frac{C_{\mathsf{LSI}}^{1/2}\,L\,d^{1/2}}{\epsilon} = \frac{C_{\mathsf{LSI}}^{3/2}\,L^{3/2}\,d^{1/2}}{\epsilon}$). ∎