Improved Discretization Analysis for Underdamped
Langevin Monte Carlo
Abstract
Underdamped Langevin Monte Carlo (ULMC) is an algorithm used to sample from unnormalized densities by leveraging the momentum of a particle moving in a potential well. We provide a novel analysis of ULMC, motivated by two central questions: (1) Can we obtain improved sampling guarantees beyond strong log-concavity? (2) Can we achieve acceleration for sampling?
For (1), prior results for ULMC only hold under a log-Sobolev inequality together with a restrictive Hessian smoothness condition. Here, we relax these assumptions by removing the Hessian smoothness condition and by considering distributions satisfying a Poincaré inequality. Our analysis achieves the state-of-the-art dimension dependence, and is also flexible enough to handle weakly smooth potentials. As a byproduct, we also obtain the first KL divergence guarantees for ULMC without Hessian smoothness under strong log-concavity, which are based on a new result on the log-Sobolev constant along the underdamped Langevin diffusion.
For (2), the recent breakthrough of Cao, Lu, and Wang (2020) established the first accelerated result for sampling in continuous time via PDE methods. Our discretization analysis translates their result into an algorithmic guarantee, which indeed enjoys better condition number dependence than prior works on ULMC, although we leave open the question of full acceleration in discrete time.
Both (1) and (2) necessitate Rényi discretization bounds, which are more challenging than the typically used Wasserstein coupling arguments. We address this using a flexible discretization analysis based on Girsanov’s theorem that easily extends to more general settings.
1 Introduction
The problem of sampling from a high-dimensional distribution $\pi \propto \exp(-f)$ on $\mathbb{R}^d$, when the normalizing constant is unknown and only the potential $f$ is given, is increasingly relevant in a number of application domains, including economics, physics, and scientific computing [JP10, Von11, KPB20]. Recent progress on this problem has been driven by a strong connection with the field of optimization, starting from the seminal work of [JKO98]; see [Che23] for an exposition.
Given the success of momentum-based algorithms for optimization [Nes83], it is natural to investigate momentum-based algorithms for sampling. The hope is that such methods can improve the dependence of the convergence estimates on key problem parameters, such as the condition number $\kappa$, the dimension $d$, and the error tolerance $\varepsilon$. One such method is underdamped Langevin Monte Carlo (ULMC), which is a discretization of the underdamped Langevin diffusion (ULD):

$\mathrm{d}X_t = V_t\,\mathrm{d}t\,, \qquad \mathrm{d}V_t = -\gamma V_t\,\mathrm{d}t - \nabla f(X_t)\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t\,,$   (1)

where $(B_t)_{t \ge 0}$ is a standard $d$-dimensional Brownian motion and $\gamma > 0$ is a friction parameter. The stationary distribution of ULD is $\pi(x, v) \propto \exp(-f(x) - \tfrac12\,\lVert v\rVert^2)$, and in particular, the $x$-marginal of $\pi$ is the desired target distribution $\pi_X \propto \exp(-f)$. Therefore, by taking a small step size for the discretization and a large number of iterations, ULMC will yield an approximate sample from $\pi_X$.
We also note that in the small-friction limit $\gamma \to 0$, ULMC closely resembles the Hamiltonian Monte Carlo algorithm, which is known to achieve acceleration and better discretization error in some limited settings [Vis21, AGS22, BM22, WW22].
While there is currently no analysis of ULMC that yields acceleration for sampling (i.e., $\sqrt{\kappa}$ dependence on the condition number), ULMC is known to improve the dependence on other parameters such as the dimension $d$ and the error tolerance $\varepsilon$ [Che+18, Che+18a, DR20], at least for guarantees in the Wasserstein metric. However, compared to the extensive literature on the simpler (overdamped) Langevin Monte Carlo (LMC) algorithm, existing analyses of ULMC do not extend easily to stronger performance metrics such as the KL and Rényi divergences. In turn, this limits the scope of the results for ULMC; see the discussion in Section 1.1.
In light of these shortcomings, in this work, we ask the following two questions:
1. Can we obtain sampling guarantees beyond the strongly log-concave case via ULMC?
2. Can we obtain accelerated convergence guarantees for sampling via ULMC?
1.1 Our Contributions
We address the two questions above by providing a new Girsanov discretization bound for ULMC. Our bound holds in the strong Rényi divergence metric and applies under general assumptions (in particular, it does not require strong log-concavity of the target , and it allows for weakly smooth potentials). Consequently, it leads to the following new state-of-the-art results for ULMC:
- We obtain an $\varepsilon^2$-guarantee in KL divergence, with an explicit iteration complexity, for strongly log-concave and log-smooth distributions, which removes the Lipschitz Hessian assumption of [Ma+21]; here, $\kappa$ denotes the condition number of the distribution.
- We obtain an $\varepsilon$-guarantee in TV distance, with an explicit iteration complexity, under a log-Sobolev inequality (LSI) and an $L$-smooth potential, again without assuming a Lipschitz Hessian. This is the state-of-the-art guarantee for this class of distributions with regard to dimension dependence.
- We obtain $\varepsilon$-guarantees in the stronger Rényi divergence metric, with an explicit iteration complexity, under a Poincaré inequality and an $L$-smooth potential; the complexity improves further under log-concavity. These are the first guarantees known for ULMC in these settings, and they substantially improve upon the corresponding results for LMC [Che+21].
- In the Poincaré case, we also consider weakly smooth potentials (i.e., Hölder-continuous gradients with exponent $s \in (0, 1]$), which more realistically reflect the delicate smoothness properties of distributions satisfying a Poincaré inequality.
We now discuss our results in more detail in the context of the existing literature.
Guarantees under Weaker Assumptions.
Prior works [Che+18a, DR20, GT20] require strong log-concavity of the target. Among works that operate under isoperimetric assumptions, we are only aware of [Ma+21], which further assumes a restrictive Lipschitz Hessian condition on the potential. In contrast, we make no such assumption on the Hessian of $f$, and we obtain results under a log-Sobolev inequality (LSI), or under the even weaker assumption of a Poincaré inequality (PI), for which sampling analyses are known to be challenging [Che+21].
As noted above, our results for sampling from distributions satisfying LSI and smoothness assumptions are state-of-the-art with regard to the dimension dependence; in contrast, the previous best results had linear dependence on the dimension [Che+21, Che+22]. Moreover, in the Poincaré case, we can also handle weakly smooth potentials, which have not previously been considered in the context of ULMC.
Guarantees in Stronger Metrics.
Key to achieving these results is our discretization analysis in the Rényi divergence metric. Indeed, the continuous-time convergence results for ULD under LSI or PI hold in the KL or Rényi divergence metrics, and translating these guarantees to the ULMC algorithm necessitates studying the discretization in Rényi. This is the main technical challenge, as we can no longer rely on Wasserstein coupling arguments which are standard in the literature [Che+18a, DR20]. Two notable exceptions are the Rényi discretization argument of [GT20], which incurs suboptimal dependence on , and the KL divergence argument of [Ma+21], which requires stringent smoothness assumptions.
In this work, we provide the first KL divergence guarantee for sampling from strongly log-concave and log-smooth distributions via ULMC without Hessian smoothness, based on a new LSI along the trajectory (discussed further below).
Towards Acceleration in Sampling.
Our work is also motivated by the breakthrough result of [CLW20], which achieves for the first time an accelerated convergence guarantee for ULD in continuous time. Our discretization bound allows us to convert this result into an algorithmic guarantee which indeed improves the dependence on the condition number $\kappa$ for ULMC (in the case of a Poincaré inequality, the condition number is defined consistently with the strongly log-concave case; see Section 2.2), whereas prior results incurred a strictly worse dependence; our dependence is linear in $\kappa$ in the log-concave case. While this still falls short of proving full acceleration for sampling (i.e., an improvement to $\sqrt{\kappa}$), our result provides further hope for achieving acceleration via ULMC.
A New Log-Sobolev Inequality along the ULD Trajectory.
Finally, en route to proving the KL divergence guarantee in the strongly log-concave case, we establish a new log-Sobolev inequality along ULD (Proposition 14), which is of independent interest. While such a result was previously known for the overdamped Langevin diffusion, to the best of our knowledge it is new for the underdamped version.
1.2 More Related Work
Langevin Monte Carlo.
For the standard LMC algorithm, non-asymptotic rate estimates were first demonstrated in [Dal17] for the class of strongly log-concave measures. Guarantees in KL divergence under a log-Sobolev inequality were obtained by [VW19], which developed an appealing continuous-time framework for analyzing LMC under functional inequalities. With some difficulty, this result was extended to Rényi divergences by [GT20, EHZ22]. At the same time, a body of literature studied convergence under tail-growth conditions such as dissipativity [RRT17, EMS18, EH21, Mou+22], which usually imply functional inequalities.
Most related to the current work, [Che+21] extended the continuous-time approach from [VW19] to Rényi divergences, and moreover introduced a novel discretization analysis using Girsanov’s theorem, which also holds for weakly smooth potentials. The present work builds upon the Girsanov techniques introduced in [Che+21] to study ULMC.
Underdamped Langevin Diffusion.
ULMC is a discretization of the underdamped Langevin diffusion (1), first studied by [Kol34] and [Hör67] in their pioneering works on hypoellipticity. It was quickly understood that establishing quantitative convergence to stationarity is technically challenging, let alone capturing any acceleration phenomenon. The seminal works [Vil02, Vil09] developed the hypocoercivity approach, providing the first convergence guarantees under functional inequalities; see also [Hér06, DMS09, DMS15, RS18]. We also refer to [Ber+22] and references therein for a comprehensive discussion of qualitative and quantitative convergence results for ULD.
As mentioned earlier, the recent breakthrough of [CLW20] achieved acceleration in continuous time in $\chi^2$-divergence when the target distribution is log-concave. This work builds on a variational approach in a dual Sobolev space [Alb+19]. However, since this method relies on the duality of that space and its connection to the Poincaré inequality, it is difficult to extend to other spaces or to other functional inequalities.
Other Discretizations.
1.3 Organization
The remainder of this paper is organized as follows. In Section 2, we review the required definitions and assumptions. In Section 3, we state our main results. In Section 4, we highlight several implications of our theorems through examples. In Section 5, we briefly sketch the proofs of our main results, before concluding in Section 6 with a discussion of future directions.
2 Background
2.1 Notation
Hereafter, $\lVert\cdot\rVert$ denotes the Euclidean norm on vectors. In general, we will only work with measures that admit densities on $\mathbb{R}^d$, and we will abuse notation slightly to conflate a measure with its density for convenience. The notation $a = O(b)$ signifies that there exists an absolute constant $c > 0$ such that $a \le c\,b$, and $\widetilde{O}$ additionally hides logarithmic factors. Similarly, we write $a = \Omega(b)$ if there exists an absolute constant $c > 0$ such that $a \ge c\,b$, and $\widetilde{\Omega}$ additionally hides logarithmic factors. The stationary measure (in the position coordinate) is $\pi \propto \exp(-f)$, and $f$ will be referred to as the potential. We will use $\psi$ to denote test functions, taken to be smooth (or weakly differentiable) with the relevant integrals below finite. Finally, the notations $\lesssim$, $\gtrsim$, $\asymp$ represent $\le$, $\ge$, $=$ up to absolute constants. Further notation is introduced in subsequent sections.
2.2 Definitions and Assumptions
In this subsection, we will define the relevant processes, divergences, and isoperimetric inequalities. Firstly, we define the ULMC algorithm by the following stochastic differential equation (SDE):
$\mathrm{d}X_t = V_t\,\mathrm{d}t\,, \qquad \mathrm{d}V_t = -\gamma V_t\,\mathrm{d}t - \nabla f(X_{kh})\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t\,,$   (ULMC)

where $t \in [kh, (k+1)h)$ for some step size $h > 0$ and iteration index $k \in \mathbb{N}$. We note that this formulation of ULMC can be integrated in closed form (see Appendix A).
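To make the closed-form integration concrete, here is a minimal one-step sketch in Python, assuming the parameterization above (friction $\gamma$, unit mass, gradient frozen at the start of each step); the function name `ulmc_step` and its interface are ours and not taken from the paper.

```python
import numpy as np

def ulmc_step(x, v, grad_f, gamma, h, rng):
    """One ULMC step: ULD with the gradient frozen at x, integrated exactly over a step of length h.

    Assumes dX = V dt, dV = (-gamma * V - grad_f(X_0)) dt + sqrt(2 * gamma) dB (unit mass).
    """
    g = grad_f(x)
    eta = np.exp(-gamma * h)
    # Conditional means of (X_h, V_h): the dynamics are linear once the gradient is frozen.
    mean_v = eta * v - (1.0 - eta) / gamma * g
    mean_x = x + (1.0 - eta) / gamma * v - (h - (1.0 - eta) / gamma) / gamma * g
    # Per-coordinate covariance of the Gaussian fluctuations (from the Ito integrals).
    var_v = 1.0 - eta ** 2
    var_x = (2.0 / gamma) * (h - 2.0 * (1.0 - eta) / gamma + (1.0 - eta ** 2) / (2.0 * gamma))
    cov_xv = (1.0 - eta) ** 2 / gamma
    chol = np.linalg.cholesky(np.array([[var_x, cov_xv], [cov_xv, var_v]]))
    noise_x, noise_v = chol @ rng.standard_normal((2, x.shape[0]))
    return mean_x + noise_x, mean_v + noise_v
```

Iterating `x, v = ulmc_step(x, v, grad_f, gamma, h, rng)`, with the momentum initialized from a standard Gaussian as in Section 3, reproduces the algorithm analyzed in this work (up to the parameterization assumptions just stated).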
Next, we define a few measures of distance between two probability distributions $\mu$ and $\nu$ on $\mathbb{R}^d$. We define the total variation distance as

$\lVert \mu - \nu \rVert_{\mathrm{TV}} := \sup_{A} \lvert \mu(A) - \nu(A) \rvert\,,$   (2.1)

where the supremum is taken over Borel measurable sets $A \subseteq \mathbb{R}^d$. We further define the $\chi^2$ divergence as

$\chi^2(\mu \,\Vert\, \nu) := \mathbb{E}_\nu\Big[\Big(\frac{\mathrm{d}\mu}{\mathrm{d}\nu} - 1\Big)^2\Big]\,,$   (2.2)
and $\chi^2(\mu \,\Vert\, \nu) := \infty$ if $\mu$ is not absolutely continuous with respect to $\nu$. Finally, we define the Rényi divergence of order $q > 1$ as

$\mathcal{R}_q(\mu \,\Vert\, \nu) := \frac{1}{q-1}\,\ln \mathbb{E}_\nu\Big[\Big(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Big)^{q}\Big]\,,$

and similarly $\mathcal{R}_q(\mu \,\Vert\, \nu) := \infty$ if $\mu \not\ll \nu$. The Rényi divergence upper bounds the KL divergence for every order, i.e., $\mathsf{KL}(\mu \,\Vert\, \nu) \le \mathcal{R}_q(\mu \,\Vert\, \nu)$ for any order $q > 1$, and it is non-decreasing in $q$. In particular, when $q = 2$, we also recover the $\chi^2$ divergence, i.e., $\mathcal{R}_2(\mu \,\Vert\, \nu) = \ln\big(1 + \chi^2(\mu \,\Vert\, \nu)\big)$.
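For intuition on the order parameter, a standard closed-form example (not specific to this paper): for two Gaussians sharing the covariance $\sigma^2 I_d$,

$\mathcal{R}_q\big(\mathcal{N}(\mu_1, \sigma^2 I_d) \,\Vert\, \mathcal{N}(\mu_2, \sigma^2 I_d)\big) = \frac{q\,\lVert \mu_1 - \mu_2\rVert^2}{2\sigma^2}\,,$

which grows linearly in $q$ and recovers the KL divergence as $q \to 1$.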
Our primary results are provided under the following smoothness conditions.
Definition 1 (Smoothness).
The potential $f$ is $(L, s)$-weakly smooth if $f$ is differentiable and $\nabla f$ is $s$-Hölder continuous, satisfying

$\lVert \nabla f(x) - \nabla f(y)\rVert \le L\,\lVert x - y\rVert^{s}$   (2.3)

for all $x, y \in \mathbb{R}^d$ and some $L > 0$, $s \in (0, 1]$. In the particular case where $s = 1$, we say that the potential is $L$-smooth, or that $\nabla f$ is $L$-Lipschitz.
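As an illustration (a standard example, not one of the paper's examples): for $s \in (0, 1)$, the potential $f(x) = \lVert x\rVert^{1+s}$ is $(L, s)$-weakly smooth for a dimension-free constant $L$, since the map $x \mapsto \lVert x\rVert^{s-1} x$ is $s$-Hölder continuous; however, it is not smooth in the sense above, as its Hessian blows up like $\lVert x\rVert^{s-1}$ near the origin.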
We conduct three lines of analysis. The first assumes strong convexity of the potential, i.e.:
Definition 2 (Strong Convexity).
The potential $f$ is $\alpha$-strongly convex for some $\alpha > 0$ if for all $x, y \in \mathbb{R}^d$:

$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\,\lVert y - x\rVert^2\,.$

In the case $\alpha = 0$ above, we say that $f$ is convex. If a potential function is (strongly) convex, then we say the corresponding distribution is (strongly) log-concave.
A second, strictly more general assumption is the log-Sobolev inequality.
Definition 3 (Log-Sobolev Inequality).
A measure $\pi$ satisfies a log-Sobolev inequality (LSI) with parameter $C_{\mathrm{LSI}} > 0$ if for all test functions $\psi$:

$\operatorname{Ent}_\pi(\psi^2) \le 2\,C_{\mathrm{LSI}}\,\mathbb{E}_\pi\big[\lVert\nabla\psi\rVert^2\big]\,,$   (LSI)

where $\operatorname{Ent}_\pi(\psi^2) := \mathbb{E}_\pi\big[\psi^2 \ln\tfrac{\psi^2}{\mathbb{E}_\pi[\psi^2]}\big]$.
An $\alpha$-strongly convex potential is known to satisfy (LSI) with constant $C_{\mathrm{LSI}} \le 1/\alpha$ [BGL14]. More generally, we can consider the following weaker isoperimetric inequality, which corresponds to a linearization of (LSI).
Definition 4 (Poincaré Inequality).
A measure $\pi$ satisfies a Poincaré inequality (PI) with parameter $C_{\mathrm{PI}} > 0$ if for all test functions $\psi$:

$\operatorname{Var}_\pi(\psi) \le C_{\mathrm{PI}}\,\mathbb{E}_\pi\big[\lVert\nabla\psi\rVert^2\big]\,,$   (PI)

where $\operatorname{Var}_\pi(\psi) := \mathbb{E}_\pi\big[(\psi - \mathbb{E}_\pi[\psi])^2\big]$.
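To see the linearization claim concretely (a standard computation, stated under the normalization conventions above): apply (LSI) to $\psi = 1 + \epsilon g$ with $\mathbb{E}_\pi[g] = 0$ and expand as $\epsilon \to 0$. One finds $\operatorname{Ent}_\pi\big((1+\epsilon g)^2\big) = 2\epsilon^2 \operatorname{Var}_\pi(g) + o(\epsilon^2)$, while $\mathbb{E}_\pi\big[\lVert\nabla(1+\epsilon g)\rVert^2\big] = \epsilon^2\,\mathbb{E}_\pi\big[\lVert\nabla g\rVert^2\big]$, so (LSI) with constant $C_{\mathrm{LSI}}$ implies (PI) with $C_{\mathrm{PI}} \le C_{\mathrm{LSI}}$.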
Conditions (LSI) and (PI) are standard assumptions made on the stationary distribution in the theory of Markov diffusions as well as sampling [BGL14, VW19, Che+21, Che23]. They are known to be satisfied by a broad class of targets such as log-concave distributions or certain mixture distributions [Che21, CCN21].
We define the condition number $\kappa$ for an $\alpha$-strongly log-concave target with an $(L, s)$-weakly smooth potential in the natural way; for $s = 1$, this is $\kappa = L/\alpha$. In the case where, instead of strong convexity, the target only satisfies (LSI) (respectively (PI)), the role of $1/\alpha$ in the condition number is played by $C_{\mathrm{LSI}}$ (respectively $C_{\mathrm{PI}}$).
Finally, we collect several mild assumptions to simplify computing the bounds below, which have also appeared in prior work; see in particular the discussion in [Che+21, Appendix A].
Assumption 1.
The expectation of the norm (in the position coordinate) under $\pi$ is quantitatively bounded by some constant. (This holds, for instance, under mild growth conditions on the potential.) Furthermore, we make two additional normalizing assumptions, which hold without loss of generality.
3 Main Theorems
In the sequel, we always take the initial distribution of the momentum to be equal to the stationary momentum distribution $\mathcal{N}(0, I_d)$. Then, under Assumption 1, we can find an initial distribution for the position which is a centered Gaussian with variance specified in Appendix D, such that it has appropriately bounded initial Rényi divergence with respect to the target. Lastly, we initialize ULMC by sampling the position and the momentum independently from these two distributions.
3.1 Convergence in KL Divergence and TV Distance
In order to state our results for ULMC in KL divergence and TV distance, we leverage the following continuous-time result from [Ma+21], which relies on an entropic hypocoercivity argument, after a time change of the coordinates (see Appendix B.1 for a proof).
Lemma 5 (Adapted from [Ma+21, Proposition 1]).
Define the Lyapunov functional
(3.1)
For targets that are -smooth and satisfy (LSI) with parameter , let . Then the law of ULD satisfies
We now proceed to state our main results more precisely. First, we obtain the following KL divergence guarantee under strong log-concavity and smoothness.
Theorem 6 (Convergence in KL Divergence under Strong Log-Concavity).
Here, we justify the choice of error tolerance $\varepsilon^2$ for the KL divergence. By Pinsker's and Talagrand's transport inequalities, a KL divergence of order $\varepsilon^2$ corresponds to a TV distance and a (rescaled) Wasserstein distance of order $\varepsilon$. Hence, this allows for a fair comparison of convergence guarantees across these metrics. Weakening the strong convexity assumption to (LSI), we obtain a result in TV distance.
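Concretely, by Pinsker's inequality and, under (LSI) with constant $C_{\mathrm{LSI}}$, Talagrand's transport inequality (stated here under the normalization conventions above),

$\lVert \mu - \pi\rVert_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2}\,\mathsf{KL}(\mu\,\Vert\,\pi)}\,, \qquad W_2(\mu, \pi) \le \sqrt{2\,C_{\mathrm{LSI}}\,\mathsf{KL}(\mu\,\Vert\,\pi)}\,,$

so a guarantee $\mathsf{KL}(\mu\,\Vert\,\pi) \le \varepsilon^2$ implies $\lVert\mu - \pi\rVert_{\mathrm{TV}} \lesssim \varepsilon$ and $W_2(\mu, \pi) \lesssim \sqrt{C_{\mathrm{LSI}}}\,\varepsilon$.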
3.2 Convergence in Rényi Divergence and Improving the Dependence on the Condition Number
To state our convergence results in Rényi divergence, we additionally inherit the following technical assumption from [CLW20].
Assumption 2.
The embedding $H^1(\pi) \hookrightarrow L^2(\pi)$ is compact. Secondly, assume that $f$ is twice continuously differentiable, and that for all $x \in \mathbb{R}^d$, we have a uniform bound on $\nabla^2 f(x)$.
Remark.
[Hoo81, Theorem 3.1] shows that the first part of this assumption is always satisfied if the potential has super-linear tail growth, i.e., $f(x) \gtrsim \lVert x\rVert^{1+\delta}$ for some $\delta > 0$ and all sufficiently large $\lVert x\rVert$. In the case where the tail growth is exactly linear, we can instead construct an arbitrarily close approximation with super-linear tails; thus, the assumption generically holds for all targets we consider in this work. As also remarked in [CLW20], the above assumption is required solely for technical reasons and is likely not necessary.
The second part of the assumption is satisfied under $L$-smoothness of the gradient, with the same constant. In the convex case, or the case where the Hessian is bounded below, this constant does not show up in the bounds. As a result, for weakly smooth potentials in this setting, we can approximate the potential by twice continuously differentiable potentials to obtain a rate estimate.
In light of the above discussion, we emphasize that this additional assumption largely does not hinder the applicability of our results. Under this assumption, [CLW20] established the following guarantee on (1) in continuous time.
Lemma 8 (Rapid Convergence in $\chi^2$; Adapted from [CLW20, Theorem 1]).
Remark.
Our final result leverages the above accelerated convergence guarantees of ULD, and establishes the first bound for ULMC in Rényi divergence with an improved condition number dependence.
Theorem 9 (Convergence in Rényi Divergence under (PI)).
Let the potential be $(L, s)$-weakly smooth, satisfy (PI) with constant $C_{\mathrm{PI}}$, and satisfy Assumption 1. Let it also satisfy the additional technical condition in Assumption 2. Then, for
the following holds for , the law of the -th iterate of ULMC initialized at a centered Gaussian (variance specified in Appendix D) for and with defined in (3.2):
Remark.
The optimal choice is to take . If the potential is convex, then we set , which is known to be an optimal choice [CLW20]. As a result, in the convex and smooth case, the iteration complexity has the condition number dependence , which improves upon the dependence seen in [Che+21]. The dependence on dimension and error tolerance are also improved.
4 Examples
Example 10.
We consider a potential which satisfies (PI) with an explicit constant [Bob03] and is smooth. Assuming the compact embedding condition of Assumption 2, Theorem 9 gives an explicit complexity for $\varepsilon$-guarantees after optimizing over the friction parameter, since in this case the potential is log-concave. The resulting dimension dependence matches that of the proximal sampler with rejection sampling [Che+22, Corollary 8] and surpasses [Che+21, Theorem 8] for the same guarantees. However, it is important to note that the latter two works obtain such guarantees for any order of the Rényi divergence, whereas our results for ULMC are presently limited in the order.
Example 11.
Consider an $\alpha$-strongly log-concave and $L$-log-smooth distribution. Non-trivial examples of this can be found in Bayesian regression (see, e.g., [Dal17a, Section 6]); we examine the first one there. Here, our Theorem 6 gives an explicit complexity to obtain the desired accuracy guarantee for the KL divergence. In contrast, [Ma+21, Theorem 1] depends on the Lipschitz constant of the Hessian in the Frobenius norm, which in this case yields a complexity that is worse in the dimension dependence for the same accuracy guarantee. Finally, it is possible to compare with the discretization bounds achieved in [GT20, Theorem 28], which, in combination with our continuous-time results (using the same proof technique as Theorem 6), yield a complexity that is suboptimal in one of the parameters but has the same dimension dependence.
Example 12.
We can analyze -smooth distributions satisfying a log-Sobolev inequality with parameter . One such instance arises when considering any bounded perturbation of a strongly convex potential. In this case, let be the potential of the target in Example 11. Then consider a target with modified potential , with for some , and let be -Frobenius Lipschitz. We can bound the log-Sobolev constant of this potential using the Holley–Stroock Lemma [HS87]. Let this new potential have condition number . We achieve -accuracy in distance with . For comparison, the previous bound [Ma+21, Theorem 1] gives to arrive at the same guarantee in , which is worse in the dimension. However, note that the guarantees in [Ma+21, Theorem 1] are in , which is stronger than . Finally, we note that [GT20] requires strong log-concavity, and hence cannot provide a guarantee in this setting.
Example 13.
Consider an $(L, s)$-weakly log-smooth target that is log-concave and satisfies a Poincaré inequality. Consequently, Theorem 9 yields an explicit complexity for the desired guarantees, whereas [Che+21, Theorem 7] yields a complexity that is worse in both parameters for the same guarantees. On the other hand, take the specific case of a distribution whose potential satisfies (PI) with an explicit constant [Bob03], is convex, and is weakly log-smooth. In this case, Theorem 9 yields an explicit complexity for the accuracy guarantees in Rényi divergence. This is worse than the rate estimate obtained in [Che+21, Example 9], as they leverage a stronger class of functional inequalities that interpolate between (PI) and (LSI), whereas our analysis cannot capture this improvement. Our convergence guarantee is nevertheless better in its dependence on one of the problem parameters.
5 Proof Sketches
5.1 Continuous Time Results
For results under both the Poincaré and log-Sobolev inequalities, we leverage the existing continuous-time results of [CLW20, Ma+21], which we presented as Lemmas 8 and 5, respectively. These allow us to bound the divergence to the target by exponentially decaying quantities.
With the additional assumption of strong convexity, we can obtain a contraction in an alternate system of coordinates (see Appendix B). This allows us to consider the distributions of the continuous time iterates and the target in these alternate coordinates respectively. From this, we obtain the following proposition.
Proposition 14 (Log-Sobolev Inequality Along the Trajectory).
Suppose $f$ is $\alpha$-strongly convex and $L$-smooth, and consider the law of the continuous-time underdamped Langevin diffusion in the twisted coordinates of Appendix B. If the initial distribution satisfies (LSI) (in the altered coordinates) with a finite constant, then the law at every time $t \ge 0$ satisfies (LSI) with a constant that can be uniformly upper bounded.
The main idea behind the proof of this proposition is to analyze the discretization (ULMC) of the underdamped Langevin diffusion in the coordinates . Note that this can be written in the following form, for some matrix and function ,
This is the composition of a deterministic function giving the mean of the next iterate of ULMC started at , followed by addition with a Gaussian distribution giving the variance of the resulting iterate. In particular, we show that for coordinates , we can find an almost sure strict contraction under in the sense that
where by abuse of notation , and the seminorm of a function refers to the Lipschitz constant of the function.
Since this map is a contraction for sufficiently small step size, each pushforward improves the log-Sobolev constant by a multiplicative factor [VW19, Lemma 19]. At the same time, a Gaussian convolution can only worsen the log-Sobolev constant by an additive term [Cha04, Corollary 3.1]. Subsequently, the log-Sobolev constant at each iterate forms a (truncated) geometric sum, and therefore can be bounded by the corresponding infinite series (see the sketch at the end of this subsection); incidentally, this can also be used to bound the log-Sobolev constant of the ULMC iterates. Taking an appropriate limit of the step size to zero while keeping the total elapsed time fixed, we arrive at the stated bound in the proposition. Consequently, considering the decomposition of the KL divergence, a simple application of Cauchy–Schwarz tells us that
The log-Sobolev inequality for implies a Poincaré inequality, which allows us to bound the variance term by the Fisher information . This can be bounded by the same entropic hypocoercivity argument from [Ma+21] that is used to generate our bounds, while the remaining two terms are handled respectively via the discretization analysis and again the entropic hypocoercivity argument.
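To make the geometric-sum step concrete, here is the recursion in a simplified form (a sketch under the normalization conventions above; the constants in the actual proof differ). If the law of $Z$ satisfies (LSI) with constant $C$, if $T$ is $c$-Lipschitz with $c < 1$, and if $\xi \sim \mathcal{N}(0, \sigma^2 I)$ is independent noise, then

$C_{\mathrm{LSI}}\big(\mathrm{law}(T(Z) + \xi)\big) \le c^2\,C_{\mathrm{LSI}}\big(\mathrm{law}(Z)\big) + \sigma^2\,,$

so iterating $N$ times gives $C_N \le c^{2N} C_0 + \sigma^2 \sum_{j=0}^{N-1} c^{2j} \le C_0 + \frac{\sigma^2}{1 - c^2}$, uniformly in $N$.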
5.2 Discretization Analysis
The main result we use to control the discretization error can be found below.
Proposition 15.
Let denote the law of (ULMC) and let denote the law of the continuous-time underdamped Langevin diffusion (1), both initialized at some . Assume that the potential is -weakly smooth. If the step size satisfies
(5.1)
where the notation hides constants depending on as well as polylogarithmic factors including , and is a modified target distribution (see Appendix C.3 for details), then
Remark.
The condition on the step size depends on the corresponding quantity only through logarithmic factors. Secondly, this bound is shown under generic assumptions, and can be combined with continuous-time results in Rényi divergence in any setting, such as under the log-Sobolev or Latała–Oleszkiewicz inequalities considered in [Che+21].
We outline the proof of this result below. Similarly to [Che+21], we first invoke the data-processing inequality, which allows us to bound the Rényi divergence between the time-marginal distributions of the iterates by the Rényi divergence between the path measures,
where the two path measures are the laws of (ULMC) and of (1), respectively, on the space of paths over the sampling horizon. Subsequently, we invoke Girsanov's theorem, which allows us to express the pathwise divergence exactly in terms of the difference between the drifts of the two processes.
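For orientation, the order-one (KL) instance of this computation reads as follows (a sketch under the parameterization of (1); the Rényi case requires additional care, e.g., conditioning on high-probability events). Writing $\mathbb{Q}$ for the law of (ULMC) and $\mathbb{P}$ for that of (1), with $N$ steps of size $h$,

$\mathsf{KL}(\mathbb{Q}\,\Vert\,\mathbb{P}) = \frac{1}{4\gamma}\,\mathbb{E}_{\mathbb{Q}}\Big[\int_0^{Nh} \big\lVert \nabla f(X_t) - \nabla f(X_{\lfloor t/h\rfloor h})\big\rVert^2\,\mathrm{d}t\Big]\,,$

since the drifts differ only in the momentum coordinate, where the diffusion coefficient is $\sqrt{2\gamma}$.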
It remains to bound the term inside the expectation. We achieve this by conditioning on the event that is bounded by a vanishing quantity as , which we must demonstrate occurs with sufficiently high probability. To show this, we begin with a single-step analysis, i.e., we bound the above for . Compared to LMC, the main gain in this analysis is that the SDEs (1) and (ULMC) match exactly in the position coordinate, while the difference between the drifts manifests solely in the momentum. After integration of the momentum, the order of error is better in the position coordinate (the dominant term is compared to seen in [Che+21, Lemma 24]).
The technique for extending this analysis from a single step to the full time interval follows closely that seen in [Che+21]. In particular, we obtain a dependence for on in the interval . Controlling the latter is quite complicated when the potential satisfies only a Poincaré inequality, since it is equivalent to showing sub-Gaussian tail bounds on the iterates, while the target itself is not sub-Gaussian in the position coordinate. By comparing against an auxiliary potential, we can show that for our choice of initialization, the iterates remain sub-Gaussian for all iterations up to (albeit with a growing constant). Finally, this allows us to recover our discretization result in the proposition above.
6 Conclusion
This work provides state-of-the-art convergence guarantees for the underdamped Langevin Monte Carlo algorithm in several regimes. Our discretization analysis (Proposition 15) in particular is generic and can be extended to any order of the Rényi divergence, under various conditions on the potential (Latała–Oleszkiewicz inequalities, weak smoothness, etc.). Consequently, our results serve as a key step towards a complete understanding of the ULMC algorithm. However, limitations of the current continuous-time techniques do not permit us to obtain stronger iteration complexity results. More specifically, it is not understood how to analyze Rényi divergences of higher order in this setting, or whether hypercontractive decay is possible when the potential satisfies a log-Sobolev inequality. Secondly, our discretization approach via Girsanov's theorem is currently suboptimal in the condition number (a fact also noted in [Che+21]), and thus does not obtain the expected $\sqrt{\kappa}$ dependence after discretization. An improvement in the proof techniques would be necessary to sharpen this result. We believe the results and techniques developed in this work will stimulate future research.
Acknowledgements
We thank Jason M. Altschuler, Alain Durmus, and Aram-Alexandre Pooladian for helpful conversations. KB was supported by NSF grant DMS-2053918. SC was supported by the NSF TRIPODS program (award DMS-2022448). MAE was supported by NSERC Grant [2019-06167], the Connaught New Researcher Award, the CIFAR AI Chairs program, and the CIFAR AI Catalyst grant. ML was supported by the Ontario Graduate Scholarship and Vector Institute.
References
- [AGS22] Simon Apers, Sander Gribling and Dániel Szilágyi “Hamiltonian Monte Carlo for efficient Gaussian sampling: long and random steps” In arXiv preprint arXiv:2209.12771, 2022
- [Alb+19] Dallas Albritton, Scott Armstrong, Jean-Christophe Mourrat and Matthew Novack “Variational methods for the kinetic Fokker–Planck equation” In arXiv preprint arXiv:1902.04037, 2019
- [Ber+22] Etienne Bernard, Max Fathi, Antoine Levitt and Gabriel Stoltz “Hypocoercivity with Schur complements” In Annales Henri Lebesgue 5, 2022, pp. 523–557
- [BGL14] Dominique Bakry, Ivan Gentil and Michel Ledoux “Analysis and geometry of Markov diffusion operators” 348, Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] Springer, Cham, 2014, pp. xx+552
- [BLM13] Stéphane Boucheron, Gábor Lugosi and Pascal Massart “Concentration inequalities” A nonasymptotic theory of independence, With a foreword by Michel Ledoux Oxford University Press, Oxford, 2013, pp. x+481
- [BM22] Nawaf Bou-Rabee and Milo Marsden “Unadjusted Hamiltonian MCMC with stratified Monte Carlo time integration” In arXiv preprint arXiv:2211.11003, 2022
- [Bob03] Sergey G. Bobkov “Spectral gap and concentration for some spherically symmetric probability measures” In Geometric aspects of functional analysis 1807, Lecture Notes in Math. Springer, Berlin, 2003, pp. 37–43
- [CCN21] Hong-Bin Chen, Sinho Chewi and Jonathan Niles-Weed “Dimension-free log-Sobolev inequalities for mixture distributions” In Journal of Functional Analysis 281.11, 2021, pp. 109236
- [CD76] Jagdish Chandra and Paul W. Davis “Linear generalizations of Gronwall’s inequality” In Proceedings of the American Mathematical Society 60.1, 1976, pp. 157–160
- [Cha04] Djalil Chafaï “Entropies, convexity, and functional inequalities: on $\Phi$-entropies and $\Phi$-Sobolev inequalities” In J. Math. Kyoto Univ. 44.2, 2004, pp. 325–363
- [Che+18] Xiang Cheng et al. “Sharp convergence rates for Langevin dynamics in the nonconvex setting” In arXiv preprint arXiv:1805.01648, 2018
- [Che+18a] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett and Michael I. Jordan “Underdamped Langevin MCMC: a non-asymptotic analysis” In Conference on Learning Theory, 2018, pp. 300–323 PMLR
- [Che+21] Sinho Chewi et al. “Analysis of Langevin Monte Carlo from Poincaré to log-Sobolev” In arXiv preprint arXiv:2112.12662, 2021
- [Che+22] Yongxin Chen, Sinho Chewi, Adil Salim and Andre Wibisono “Improved analysis for a proximal algorithm for sampling” In Proceedings of Thirty Fifth Conference on Learning Theory 178, Proceedings of Machine Learning Research PMLR, 2022, pp. 2984–3014
- [Che21] Yuansi Chen “An almost constant lower bound of the isoperimetric coefficient in the KLS conjecture” In Geom. Funct. Anal. 31.1, 2021, pp. 34–61
- [Che23] Sinho Chewi “Log-concave sampling” Book draft available at https://chewisinho.github.io/, 2023
- [CLW20] Yu Cao, Jianfeng Lu and Lihan Wang “On explicit $L^2$-convergence rate estimate for underdamped Langevin dynamics” In arXiv e-prints, 2020
- [Dal17] Arnak S. Dalalyan “Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 678–689
- [Dal17a] Arnak S. Dalalyan “Theoretical guarantees for approximate sampling from smooth and log-concave densities” In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79.3 Wiley Online Library, 2017, pp. 651–676
- [DMS09] Jean Dolbeault, Clément Mouhot and Christian Schmeiser “Hypocoercivity for kinetic equations with linear relaxation terms” In Comptes Rendus Mathematique 347.9-10 Elsevier, 2009, pp. 511–516
- [DMS15] Jean Dolbeault, Clément Mouhot and Christian Schmeiser “Hypocoercivity for linear kinetic equations conserving mass” In Transactions of the American Mathematical Society 367.6, 2015, pp. 3807–3828
- [DR20] Arnak S. Dalalyan and Lionel Riou-Durand “On sampling from a log-concave density using kinetic Langevin diffusions” In Bernoulli 26.3 Bernoulli Society for Mathematical StatisticsProbability, 2020, pp. 1956–1988
- [EH21] Murat A. Erdogdu and Rasa Hosseinzadeh “On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness” In Proceedings of Thirty Fourth Conference on Learning Theory 134, Proceedings of Machine Learning Research PMLR, 2021, pp. 1776–1822
- [EHZ22] Murat A. Erdogdu, Rasa Hosseinzadeh and Shunshi Zhang “Convergence of Langevin Monte Carlo in chi-squared and Rényi divergence” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics 151, Proceedings of Machine Learning Research PMLR, 2022, pp. 8151–8175
- [EMS18] Murat A Erdogdu, Lester Mackey and Ohad Shamir “Global non-convex optimization with discretized diffusions” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 9694–9703
- [FLO21] James Foster, Terry Lyons and Harald Oberhauser “The shifted ODE method for underdamped Langevin MCMC” In arXiv preprint arXiv:2101.03446, 2021
- [FRS22] James Foster, Goncalo dos Reis and Calum Strange “High order splitting methods for SDEs satisfying a commutativity condition” In arXiv preprint arXiv:2210.17543, 2022
- [GT20] Arun Ganesh and Kunal Talwar “Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 7222–7233
- [HBE20] Ye He, Krishnakumar Balasubramanian and Murat A. Erdogdu “On the ergodicity, bias and asymptotic normality of randomized midpoint sampling method” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 7366–7376
- [Hér06] Frédéric Hérau “Hypocoercivity and exponential time decay for the linear inhomogeneous relaxation Boltzmann equation” In Asymptotic Analysis 46.3-4 IOS Press, 2006, pp. 349–359
- [Hoo81] James G. Hooton “Compact Sobolev imbeddings on finite measure spaces” In Journal of Mathematical Analysis and Applications 83.2 Elsevier, 1981, pp. 570–581
- [Hör67] Lars Hörmander “Hypoelliptic second order differential equations” In Acta Mathematica 119 Institut Mittag-Leffler, 1967, pp. 147–171
- [HS87] Richard Holley and Daniel Stroock “Logarithmic Sobolev inequalities and stochastic Ising models” In J. Statist. Phys. 46.5-6, 1987, pp. 1159–1194
- [JKO98] Richard Jordan, David Kinderlehrer and Felix Otto “The variational formulation of the Fokker–Planck equation” In SIAM Journal on Mathematical Analysis 29.1 SIAM, 1998, pp. 1–17
- [JLS23] Tim Johnston, Iosif Lytras and Sotirios Sabanis “Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity–the Strongly Convex Case” In arXiv preprint arXiv:2301.08039, 2023
- [JP10] Michael Johannes and Nicholas Polson “MCMC methods for continuous-time financial econometrics” In Handbook of financial econometrics: applications Elsevier, 2010, pp. 1–72
- [Kol34] Andrey Kolmogorov “Zufallige bewegungen (zur theorie der Brownschen bewegung)” In Annals of Mathematics JSTOR, 1934, pp. 116–117
- [KPB20] Ivan Kobyzev, Simon JD Prince and Marcus A. Brubaker “Normalizing flows: an introduction and review of current methods” In IEEE Transactions on Pattern Analysis and Machine Intelligence 43.11 IEEE, 2020, pp. 3964–3979
- [Li+19] Xuechen Li, Yi Wu, Lester Mackey and Murat A. Erdogdu “Stochastic Runge–Kutta accelerates Langevin Monte Carlo and beyond” In Advances in Neural Information Processing Systems 32, 2019
- [Ma+21] Yi-An Ma et al. “Is there an analog of Nesterov acceleration for gradient-based MCMC?” In Bernoulli 27.3 Bernoulli Society for Mathematical StatisticsProbability, 2021, pp. 1942–1992
- [Mir17] Ilya Mironov “Rényi differential privacy” In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), 2017, pp. 263–275 IEEE
- [Mon21] Pierre Monmarché “High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion.” In Electronic Journal of Statistics 15.2 Institute of Mathematical StatisticsBernoulli Society, 2021, pp. 4117–4166
- [Mou+22] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright and Peter L. Bartlett “Improved bounds for discretization of Langevin diffusions: near-optimal rates without convexity” In Bernoulli 28.3, 2022, pp. 1577–1601
- [Nes83] Yurii E. Nesterov “A method of solving a convex programming problem with convergence rate $O(1/k^2)$” In Doklady Akademii Nauk 269.3, 1983, pp. 543–547 Russian Academy of Sciences
- [Oks13] Bernt Oksendal “Stochastic differential equations: an introduction with applications” Springer Science & Business Media, 2013
- [RRT17] Maxim Raginsky, Alexander Rakhlin and Matus Telgarsky “Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 1674–1703
- [RS18] Julien Roussel and Gabriel Stoltz “Spectral methods for Langevin dynamics and associated error estimates” In ESAIM: Mathematical Modelling and Numerical Analysis 52.3 EDP Sciences, 2018, pp. 1051–1083
- [SL19] Ruoqi Shen and Yin Tat Lee “The randomized midpoint method for log-concave sampling” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
- [Vil02] Cédric Villani “Limites hydrodynamiques de l’équation de Boltzmann” In Astérisque, SMF 282, 2002, pp. 365–405
- [Vil09] Cédric Villani “Hypocoercivity” In Mem. Amer. Math. Soc. 202.950, 2009, pp. iv+141
- [Vis21] Nisheeth K Vishnoi “An introduction to Hamiltonian Monte Carlo method for sampling” In arXiv preprint arXiv:2108.12107, 2021
- [Von11] Udo Von Toussaint “Bayesian inference in physics” In Reviews of Modern Physics 83.3 APS, 2011, pp. 943
- [VW19] Santosh Vempala and Andre Wibisono “Rapid convergence of the unadjusted Langevin algorithm: isoperimetry suffices” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
- [WW22] Jun-Kun Wang and Andre Wibisono “Accelerating Hamiltonian Monte Carlo via Chebyshev integration time” In arXiv preprint arXiv:2207.02189, 2022
Appendix A Explicit Form for the Underdamped Langevin Diffusion
Recall that within each iteration we evolve for time $h$ according to the SDE (ULMC), which we repeat here for convenience: for $t \in [kh, (k+1)h)$,

$\mathrm{d}X_t = V_t\,\mathrm{d}t\,,$   (A.1)
$\mathrm{d}V_t = -\gamma V_t\,\mathrm{d}t - \nabla f(X_{kh})\,\mathrm{d}t + \sqrt{2\gamma}\,\mathrm{d}B_t\,.$   (A.2)

Consequently, since we fix the position in the non-linear term $\nabla f$, this permits an explicit solution:

$X_{(k+1)h} = X_{kh} + \frac{1 - e^{-\gamma h}}{\gamma}\,V_{kh} - \frac{1}{\gamma}\Big(h - \frac{1 - e^{-\gamma h}}{\gamma}\Big)\,\nabla f(X_{kh}) + \xi_k\,,$   (A.3)
$V_{(k+1)h} = e^{-\gamma h}\,V_{kh} - \frac{1 - e^{-\gamma h}}{\gamma}\,\nabla f(X_{kh}) + \xi_k'\,,$   (A.4)

where $(\xi_k, \xi_k')_{k \ge 0}$ is an independent sequence of pairs of centered Gaussian variables, where each pair has the joint distribution (coordinatewise)

$(\xi_k, \xi_k') \sim \mathcal{N}\left(0,\; \begin{pmatrix} \frac{2}{\gamma}\big(h - \frac{2\,(1 - e^{-\gamma h})}{\gamma} + \frac{1 - e^{-2\gamma h}}{2\gamma}\big) & \ast \\ \frac{(1 - e^{-\gamma h})^2}{\gamma} & 1 - e^{-2\gamma h} \end{pmatrix}\right),$

where $\ast$ is identical to the bottom left entry.
Appendix B Continuous-Time Results
B.1 Entropic Hypocoercivity
Our proof of Lemma 5 is based on adapting the argument on the decay of a Lyapunov function from [Ma+21] (based on entropic hypocoercivity, see [Vil09]) and combining it with a time change argument [DR20, Lemma 1]. We provide the details below for completeness.
Proof of Lemma 5. First note that variables following (ULMC) can be changed into rescaled variables which satisfy the process considered in [Ma+21, Proposition 1], with the corresponding choice of parameters. From that proposition, we know that the associated Lyapunov functional decays at an exponential rate. Here the constant does not change under our coordinate transform, but the functional is now evaluated at the joint law of the rescaled variables, while the stationary measure takes the correspondingly transformed form. The statement of our theorem follows immediately by reversing our change of variables, which amounts to scaling up the gradients in the momenta by a constant factor, while the time is scaled down accordingly. ∎
B.2 Contraction of ULMC
In this section, we prove a contraction result for ULMC and use this to deduce a log-Sobolev inequality along the trajectory of the underdamped Langevin diffusion. The mean of the next iterate of ULMC started at is given by
We will use the change of coordinates
In these new coordinates, the mean of the next iterate of ULMC started at is , where . Since , we can explicitly write
Lemma 16.
Consider the mapping defined above. Assume that . Then, for and for some , is a contraction with parameter
Proof. We compute the partial derivatives
Let and . Since
we have
Then,
where the upper right entry is determined by symmetry. Since and , one can simplify this as follows:
One can check that the eigenvalues of the matrix are , where ranges over the eigenvalues of . Hence, we can bound
We note that
In order for this to be strictly smaller than , we must take . We choose for , in which case
We deduce that
and therefore
∎
The ULMC iterate is
where the covariance is that of the Gaussian random vector in the ULMC update. In the new coordinates, this iteration can be written
Writing , we can compute
We conclude that
Hence, .
Proposition 17.
Let . Then, for all , for all sufficiently small (depending on ), one has
Proof. The LSI constant evolves according to
For sufficiently small, we have
Iterating,
This completes the proof. ∎
Corollary 18.
Let now denote the law of the continuous-time underdamped Langevin diffusion with for in the coordinates. Then,
Proof. In the preceding proposition, let while , and then let . ∎
Appendix C Discretization Analysis
We consider the discretization used in [Ma+21], with the following differential form:
and we define the variable as the tuple , for .
C.1 Technical Lemmas
Theorem 19 (Girsanov’s Theorem, Adapted from [Oks13, Theorem 8.6.8]).
Consider stochastic processes , , adapted to the same filtration, and any constant, possibly degenerate, matrix. Let and be probability measures on the path space such that evolves according to
where is a -Brownian motion and is a -Brownian motion. Furthermore, suppose there exists a process such that
and
Consequently, if we define as the Moore–Penrose pseudo-inverse of , then by the previous supposition we have . Then,
In fact, we will only need the following corollary.
Corollary 20.
For any event and ,
Proof. Using Cauchy–Schwarz, and then Itô’s Lemma, we find
Here, we used the fact that is a local martingale. ∎
We can identify the following for the process :
In this case, .
We also adapt the following Lemmas without proof from [Che+21].
Lemma 21 (Change of Measure, from [Che+21, Lemma 21]).
Let , be probability measures and let be any event. Then,
In particular, if and are probability measures on and
where , then
Lemma 22.
Let be a standard Brownian motion in . Then, if and ,
In particular, for all ,
Lemma 23 ([GT20, Lemma 14]).
Let be a random variable. Assume that for all there exists an event with probability at least such that
for some . Then, .
Lemma 24 (Matrix Grönwall Inequality).
Let , and , , where has non-negative entries. Suppose that the following inequality is satisfied componentwise:
(C.1)
Then, the following inequality holds, where is the -dimensional identity matrix:
(C.2)
where is the Moore–Penrose pseudo-inverse of (when is invertible, this is equivalent to the standard inverse).
Proof. This is a special case of [CD76, Main Theorem]. ∎
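For orientation (a standard special case, not restated from [CD76]): when $n = 1$ and $a(\cdot) \equiv a$ is a nonnegative constant, the hypothesis reads $u(t) \le a + b \int_0^t u(s)\,\mathrm{d}s$ for all $t \ge 0$ with $b \ge 0$, and the conclusion reduces to the classical Grönwall bound

$u(t) \le a\,e^{b t}\,.$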
C.2 Movement Bound for ULMC
We next prove a movement bound for the continuous-time Langevin diffusion. The following lemma is a standard fact about the concentration of the norm of a Gaussian vector (see, e.g., [BLM13, Theorem 5.5]).
Lemma 25 (Concentration of the Norm).
The following concentration holds: if $Z \sim \mathcal{N}(0, I_d)$, then for all $t \ge 0$,

$\mathbb{P}\big(\lVert Z\rVert \ge \sqrt{d} + t\big) \le e^{-t^2/2}\,.$
Note that the one-step increment of the process is of the same order as the increment of the overdamped Langevin diffusion, due to the Brownian motion component of the momentum variable $V$. However, if we consider the increment in the $x$-coordinate only, we obtain the following bound.
Lemma 26.
Let denote the continuous-time underdamped Langevin diffusion started at , and assume that the gradient of the potential satisfies and is Hölder continuous (satisfies (2.3)). Also, assume that and . Then,
Proof. For the interpolant times, we will use Grönwall’s matrix inequality (Lemma 24), with the following equation for :
Here we use the Hölder property of along with . Likewise for :
Consequently, we can use the matrix form of Grönwall’s inequality (Lemma 24). While applying that Lemma, let with to be given. First, for :
Noting that lies in the image space of so that , and similarly observing that belongs to the image space of (using the power series representation of the matrix exponential), we obtain for this first component:
where in the second line we take . Now, taking
we find the following (where denotes the first component of a vector ):
Finally, for , this can be bounded by . Using Lemma 22 and plugging this into the expression completes the proof. ∎
C.3 Sub-Gaussianity of the Iterates
Similarly to [Che+21], we introduce a modified potential in order to prove sub-Gaussianity of the iterates of ULMC. Firstly, we consider a modified distribution in the -coordinate, with parameter for some :
(C.3)
The modified potential satisfies the following properties.
Lemma 27 (Properties of the Modified Potential, [Che+21, Lemma 23]).
Then, letting be the solution to the underdamped Langevin diffusion with potential and , the following lemma holds:
Lemma 28.
Assume that , and . Then, for all , with probability at least ,
Proof. We can use the change of measure lemma (Lemma 21) together with the sub-Gaussian tail bounds in Lemmas 25, 27 to see that with probability at least , the following events hold simultaneously:
Here we use a union bound together with the monotonicity of in .
For the interpolant times, we will use Grönwall’s matrix inequality, with the following inequality for :
Likewise,
Consequently, we can apply the matrix Grönwall inequality analogously to how we did in Lemma 24 with denoting the following matrices:
Note that here is again in the image space of , so that . Finally, after calculating the matrix exponential we find
where in the second line we take . Note that this is also entirely analogous to the calculation in Lemma 26.
Subsequently, we can take a union bound to obtain for any ,
Subsequently, taking respectively , , we use the Brownian motion tail bound (Lemma 22) to get with probability :
If we assume that and , then we can further simplify this bound to yield
This concludes the proof. ∎
To transfer this sub-Gaussianity to the original underdamped Langevin process, we consider the following bound on the chi-squared divergence between these two processes.
Proposition 29.
Let represent respectively the laws on the path space of the original and modified diffusions, under the same initialization . Then, if and , then
Proof. Conditioning on the event in Lemma 28, which we denote by for some , then using Girsanov’s theorem (Corollary 20) we get (for some sufficiently small so that Novikov’s condition is satisfied)
If we take and that , we can use Lemma 23 to get
This concludes the proof. ∎
Proposition 30.
Consider the continuous time diffusion initialized at . For , , and , for , the following holds with probability :
Proof. Recall from the proof of Lemma 28 that with probability ,
In particular, this immediately implies that the following holds: for ,
for a universal constant .
C.4 Completing the Discretization Proof
We proceed by following the proof of [Che+21].
Proof. [Proof of Proposition 15] Let follow the continuous-time process. Let denote the measures on the path space corresponding to the interpolated process and the continuous-time diffusion respectively, with both being initialized at . Then, define
From Girsanov’s theorem (Theorem 19), we obtain immediately using Itô’s formula
Bounding these terms individually, we first use Corollary 20 and (2.3) to get
Let us now condition on the event
By Proposition 30, we can have while choosing
We proceed to bound our desired quantity through some careful steps.
One step error. Consider first the error on a single interval . If we presume that the step size satisfies , Lemma 26 implies
Iteration. If we let denote the filtration, then writing , we can condition on and iterate our one step bound.
We now make additional simplifying assumptions to obtain more interpretable bounds: we assume and . With these assumptions,
Completing this iteration yields
Finally, applying Lemma 23 when
(C.4)
(where hides an -dependent constant), we find
It remains to choose the appropriate step size which makes this whole quantity . In particular, it suffices to choose
(C.5)
Second term. It remains to bound the other term in our original expression. From Lemma 26, we obtain
so long as is chosen to be appropriately small, i.e.,
This immediately implies a tail bound: for ,
Integrating, we get
We can estimate the expectations by integration of our previous tail bound (Proposition 30):
provided that , where . In our applications, this condition is not dominant and can be disregarded.
Combining the bounds. Finally, we can combine each of these steps to find that, provided (C.5) for the step size holds,
Finally, the following step size condition suffices to bound the Rényi divergence by :
This completes the proof. ∎
Appendix D Proof of the Main Results
Firstly, we collect some results on feasible initializations from [Che+21]. Recall that is the modified distribution introduced in Appendix C.3. Let
where is the variance of the Gaussian, and is the parameter appearing in the modified potential. The choice of will be assumption dependent, and we collect the conditions below under our main assumptions:
where is defined in (3.2).
Lemma 31 (Adapted from [Che+21, Appendix A]).
From our analysis we take , and if moreover then it is reasonable to expect that . Let , so that , and similarly .
The following lemma gives a bound on the value of the Fisher information at initialization.
Lemma 32.
Under the conditions of the previous lemma, the initialization also satisfies .
Proof. Note that as , . Secondly, satisfies . Hence,
where we used Jensen’s inequality in the last step. ∎
D.1 Poincaré Inequality
Proof. [Proof of Theorem 9] The continuous-time result from Lemma 8 states that
Noting that there exists a feasible initialization such that , then this is satisfied if we choose . This also shows that for .
Note the following decomposition (weak triangle inequality) for the Rényi divergence (see, e.g., [Mir17, Proposition 11]):
for any valid Hölder conjugate pair , i.e., , , and any three probability distributions .
In our case, we let and , so that after solving for , we get the following for :
Consequently, let , , , and combining this result with the discretization bound of Proposition 15, we then obtain
so long as
This completes the proof. ∎
D.2 Log-Sobolev Inequality
D.2.1 KL Divergence
Proof. [Proof of Theorem 6] We prove the theorem in the twisted coordinates which were used in Proposition 14. Consider the decomposition of the KL divergence using Cauchy–Schwarz:
Using the log-Sobolev inequality for the iterates via Proposition 14, we find (through the implication that a log-Sobolev inequality implies a Poincaré inequality with the same constant)
where we substitute for the function in (PI). Here, for all .
Since , then . Therefore,
and similarly for . This yields the expression
Also, one has
For and defined in Appendix B.1, we have
The determinant is for sufficiently small. This shows that , and therefore
Here we define
The decay of the Fisher information via Lemma 5 allows us to set
The same choice of also ensures that . From our initialization (Lemma 32), we can naively estimate using that
and , so that our condition on is (with )
Recall as well that this requires . For the remaining and terms, we invoke Proposition 15 with the value of specified and desired accuracy , and with and , which consequently yields
with
(using ). ∎
D.2.2 TV Distance
Proof. [Proof of Theorem 7] Notice first that the TV distance is a proper metric, and therefore satisfies the triangle inequality. Subsequently, by two applications of Pinsker's inequality,