When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint
Abstract
There has been a surge of works bridging MCMC sampling and optimization, with a specific focus on translating non-asymptotic convergence guarantees for optimization problems into the analysis of Langevin algorithms in MCMC sampling. A conspicuous distinction between the convergence analysis of Langevin sampling and that of optimization is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for optimization are dimension-free for convex problems. Whether a dimension independent convergence rate can be achieved by the Langevin algorithm is thus a long-standing open problem. This paper provides an affirmative answer to this problem for large classes of either Lipschitz or smooth convex problems with normal priors. By viewing the Langevin algorithm as composite optimization, we develop a new analysis technique that leads to dimension independent convergence rates for such problems.
1 Introduction
Two of the major themes in machine learning are point prediction and uncertainty quantification. Computationally, they manifest in two types of algorithms: optimization and Markov chain Monte Carlo (MCMC). While both strategies have developed relatively separately for decades, there is a recent trend in relating the two strands of research and translating nonasymptotic convergence guarantees for gradient-based optimization methods to those in MCMC Dalalyan (2017); Durmus and Moulines (2019); Dalalyan and Karagulyan (2017); Cheng et al. (2018b); Cheng and Bartlett (2018); Dwivedi et al. (2018); Mangoubi and Smith (2017); Mangoubi and Vishnoi (2018); Bou-Rabee et al. (2018); Chatterji et al. (2018); Vempala and Wibisono (2019); Wibisono (2018). In particular, the Langevin sampling algorithm Rossky et al. (1978); Roberts and Stramer (2002); Durmus and Moulines (2017) has been shown to be a form of gradient descent on the space of probabilities Jordan et al. (1998); Wibisono (2018); Bernton (2018); Durmus et al. (2019). Many convergence rates for the Langevin algorithm have emerged since then, based on different assumptions on the posterior distribution (e.g., Dalalyan and Riou-Durand, 2018; Ma et al., 2019; Lee et al., 2018; Cheng et al., 2018a; Shen and Lee, 2019; Ma et al., 2021; Mou et al., 2021; Zou and Gu, 2019, to list a few).
A conspicuous distinction between the convergence analysis of Langevin sampling and that of gradient descent is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for gradient descent are dimension-free for convex problems. This prompts us to ask:
Can the Langevin algorithm achieve a dimension independent convergence rate under the usual convexity assumptions?
In order to answer this question formally, we make two assumptions on the negative log-likelihood function. One is that the negative log-likelihood is convex. The other is that the negative log-likelihood is either Lipschitz continuous or Lipschitz smooth. We also employ a known and tractable prior distribution that is strongly log-concave (often taken to be a normal distribution) to serve as a parallel to the regularizer in gradient descent.
Under such assumptions, we answer the highlighted question in the affirmative. In particular, we prove that a Langevin algorithm converges at a rate analogous to that of convex optimization for this class of problems. In the analysis, we observe that the number of gradient queries required for the algorithm to converge does not depend on the dimensionality of the problem, both for Lipschitz continuous log-likelihoods and for Lipschitz smooth log-likelihoods equipped with a ridge separable structure.
To obtain this result, we first follow recent works (Durmus et al. (2019) in particular) and formulate the posterior sampling problem as minimizing the Kullback-Leibler (KL) divergence, which is composed of two terms: the (regularized) entropy and the cross entropy. We then decompose the Langevin algorithm into two steps, each optimizing one part of the objective function. With a strongly convex and tractable prior, we explicitly integrate the diffusion along the prior distribution, optimizing the regularized entropy, whereas gradient descent over the convex negative log-likelihood optimizes the cross entropy. Via analyzing an intermediate quantity in this composite optimization procedure, we achieve a tight convergence bound that corresponds to gradient descent's convergence for convex optimization on the Euclidean space. This dimension independent convergence time for Lipschitz continuous log-likelihoods and Lipschitz smooth log-likelihoods endowed with a ridge separable structure carries over to the stochastic versions of the Langevin algorithm.
2 Preliminaries
2.1 Two Problem Classes
We consider sampling from a posterior distribution over parameter $x \in \mathbb{R}^d$, given the data set $\{z_i\}_{i=1}^{n}$:
$$ p^*(x) \;\propto\; \exp\!\left(-\frac{U(x)}{\tau}\right), $$
where the potential function $U$ decomposes into two parts: $U(x) = f(x) + g(x)$.
While the formulation is general, in the machine learning setting, $f$ usually corresponds to the negative log-likelihood, and $g$ corresponds to the negative log-prior. The parameter $\tau > 0$ is the temperature, which often takes the value of $1/n$ in machine learning, where $n$ is the number of training data. The key motivation for considering this decomposition is the assumption that $g$ is "simple", so that an SDE involving $g$ can be solved to high precision. We will take advantage of this assumption in our algorithm design.
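As a concrete (and purely hypothetical) instance of this decomposition, the following sketch sets up Bayesian logistic regression with a Gaussian prior; the data `A`, labels `b`, and the constant `mu` are placeholder names, not quantities from our analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.normal(size=(n, d))           # placeholder data points (rows a_i^T)
b = rng.choice([-1.0, 1.0], size=n)   # placeholder binary labels
mu = 1.0                              # strong convexity of the prior

def f(x):
    # Negative log-likelihood of logistic regression:
    # f(x) = sum_i log(1 + exp(-b_i a_i^T x)); convex, and Lipschitz
    # continuous since its gradient is bounded when the a_i are bounded.
    return np.sum(np.logaddexp(0.0, -b * (A @ x)))

def grad_f(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))  # = sigmoid(-b_i a_i^T x)
    return A.T @ (-b * s)

def g(x):
    # Negative log-prior of the Gaussian N(0, I/mu); mu-strongly convex,
    # and the diffusion along g can be integrated in closed form.
    return 0.5 * mu * np.dot(x, x)
```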
Assumption on function $g$:
- A0: Function $g$ is $\mu$-strongly convex (i.e., $g(x) - \frac{\mu}{2}\|x\|_2^2$ is convex; we also say that the density proportional to $e^{-g(x)/\tau}$ is strongly log-concave in this case), and the diffusion along $g$ can be explicitly integrated.
Assumption on function
We assume that function $f$ is convex (Assumption A1) and consider two cases regarding its regularity:
- $f$ is Lipschitz continuous (the Lipschitz convex case of Section 5);
- $f$ is Lipschitz smooth (the smooth convex case of Section 6).
The first case stems from Bayesian classification problems, where one has a simple strongly log-concave prior and a log-concave and log-Lipschitz likelihood that encodes the complexity of the data. Examples include Bayesian neural networks for classification tasks Neal (1996), Bayesian logistic regression Gelman et al. (2004), as well as other Bayesian classification problems Sollich (2002) with Gaussian or Bayesian elastic net priors. The second case corresponds to regression type problems, where the entire posterior is strongly log-concave and log-Lipschitz smooth. In this case, one can separate the negative log-posterior $U$ into two parts: $g(x) = \frac{\mu}{2}\|x\|_2^2$ and $f(x) = U(x) - \frac{\mu}{2}\|x\|_2^2$, which is convex and Lipschitz smooth. We therefore directly let $g(x) = \frac{\mu}{2}\|x\|_2^2$ in Section 6.
2.2 Objective Functional and Convergence Criteria
We take the KL divergence to be our objective functional and solve the following optimization problem:
$$ \min_{p} \; F(p) := \int_{\mathbb{R}^d} p(x)\,\frac{U(x)}{\tau}\,dx + \int_{\mathbb{R}^d} p(x)\log p(x)\,dx. \tag{1} $$
The minimizer that solves the optimization problem (1) is the posterior distribution:
$$ p^*(x) = \frac{\exp(-U(x)/\tau)}{\int_{\mathbb{R}^d} \exp(-U(y)/\tau)\,dy}. \tag{2} $$
We further define the entropy functional as
$$ H(p) = \int_{\mathbb{R}^d} p(x)\log p(x)\,dx, $$
so that the objective functional decomposes into the regularized entropy plus the cross entropy:
$$ F(p) = \underbrace{H(p) + \int p(x)\,\frac{g(x)}{\tau}\,dx}_{\text{regularized entropy}} \;+\; \underbrace{\int p(x)\,\frac{f(x)}{\tau}\,dx}_{\text{cross entropy}}. $$
With this definition of the objective functional, the suboptimality gap in $F$ equals the KL divergence to the posterior.
Proposition 1.
Let $p^*$ be the solution of (1), and let $p$ be another distribution on $\mathbb{R}^d$. We have
$$ F(p) - F(p^*) = \mathrm{KL}(p \,\|\, p^*). $$
This result establishes that convergence in the objective is equivalent to convergence in the KL divergence. Therefore our analysis will focus on the convergence of $F$.
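For intuition, the identity behind Proposition 1 follows from a one-line computation; writing $Z = \int e^{-U(y)/\tau}\,dy$ for the normalizing constant in (2),
$$ F(p) = \int p(x)\,\frac{U(x)}{\tau}\,dx + \int p(x)\log p(x)\,dx = \int p(x)\log\frac{p(x)}{e^{-U(x)/\tau}}\,dx = \mathrm{KL}(p \,\|\, p^*) - \log Z, $$
and since $F(p^*) = -\log Z$, subtracting the two evaluations gives $F(p) - F(p^*) = \mathrm{KL}(p \,\|\, p^*)$.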
We also define the 2-Wasserstein distance between two distributions, which will become useful in our analysis.
Definition 1.
Given two probability distributions $p$ and $q$ on $\mathbb{R}^d$, let $\Gamma(p, q)$ be the class of joint distributions $\gamma$ on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $p$ and $q$. The 2-Wasserstein distance between $p$ and $q$ is defined as
$$ W_2(p, q) = \left( \inf_{\gamma \in \Gamma(p, q)} \int \|x - y\|_2^2 \, d\gamma(x, y) \right)^{1/2}. $$
A celebrated relationship between the KL divergence and the 2-Wasserstein distance is known as the Talagrand inequality Otto and Villani (2000).
Proposition 2.
Assume that the probability density $q$ is $\mu$-strongly log-concave, and that $p$ is another distribution on $\mathbb{R}^d$. Then $q$ satisfies the log-Sobolev inequality with constant $\mu$ (this fact follows from the Bakry-Emery criterion Bakry and Emery (1985)), and the following Talagrand inequality holds:
$$ \frac{\mu}{2}\, W_2(p, q)^2 \;\le\; \mathrm{KL}(p \,\|\, q). $$
3 Related Works
Some previous works have aimed to sample from posteriors of a similar kind and to obtain convergence in the KL divergence or the squared 2-Wasserstein distance.
In the Lipschitz continuous case,
where the negative log-likelihood is convex and $G$-Lipschitz continuous, composed with a strongly convex and Lipschitz smooth negative log-prior, the convergence time established in Corollary 22 of Durmus et al. (2019) scales polynomially with the dimension $d$. Similarly, Chatterji et al. (2020) use Gaussian smoothing to obtain a convergence time (in their Theorem 3.4) that improves the dependence on the accuracy $\varepsilon$ but retains the dimension dependence. In Mou et al. (2019), the Metropolis-adjusted Langevin algorithm is leveraged with a proximal sampling oracle to remove the polynomial dependence on the accuracy (in total variation distance) for a related composite posterior distribution. Unfortunately, an additional dimension dependent factor is always introduced into the overall convergence rate. This work demonstrates that if the $\mu$-strongly convex regularizer is explicitly integrable, then the convergence time for the Langevin algorithm to reach accuracy $\varepsilon$ in the KL divergence is dimension independent. This is proven in Theorem 1 for the full gradient Langevin algorithm, and in Theorem 2 for the stochastic gradient Langevin algorithm. Using Proposition 2, the result implies a dimension independent bound for convergence in the squared 2-Wasserstein distance as well.
In the Lipschitz smooth case,
where the negative log-posterior is $\mu$-strongly convex and $L$-Lipschitz smooth, the overdamped Langevin algorithm has been shown to converge in a number of gradient queries that scales linearly with the dimension $d$ Dalalyan (2017); Dalalyan and Karagulyan (2017); Cheng and Bartlett (2018); Durmus and Moulines (2017, 2019); Durmus et al. (2019), while the underdamped Langevin algorithm converges with a $\sqrt{d}$ dependence Cheng et al. (2018b); Ma et al. (2021); Dalalyan and Riou-Durand (2018), to ensure convergence in the KL divergence and in the squared 2-Wasserstein distance. Using a randomized midpoint integration method for the underdamped Langevin dynamics, this dimension dependence can be reduced further for convergence in the squared 2-Wasserstein distance Shen and Lee (2019). This paper establishes that for the overdamped Langevin algorithm, the dimension dependence of the convergence time can be sharpened so that it enters only through $\operatorname{tr}(H)$, where the matrix $H$ is an upper bound for the Hessian of the function $f$.
Previous works have also studied the ridge separable potential functions considered in this work. One line of work requires incoherence conditions on the data vectors and/or high-order smoothness conditions on the component functions to obtain convergence guarantees for Hamiltonian Monte Carlo methods Mangoubi and Smith (2017); Mangoubi and Vishnoi (2018). Making the further assumption that the differential equation of the Hamiltonian dynamics is close to the span of a small number of basis functions, this bound can be improved Lee et al. (2018). Another thread of work alleviates these assumptions and obtains improved convergence times for general ridge separable potential functions via higher order Langevin dynamics and integration schemes Mou et al. (2021). We follow this general ridge separable setting and assume that each individual log-likelihood is Lipschitz smooth. Under this assumption, we demonstrate in this paper, by instantiating the bound for the general Lipschitz smooth case, that the Langevin algorithm converges in a dimension-independent number of gradient queries (see Corollary 1 and Corollary 2).
4 Langevin Algorithms
We consider the following variant of the Langevin algorithm (Algorithm 1). Each iteration $k$ first integrates the diffusion along the prior exactly over a time interval of length $\delta_k$,
$$ y_k = \tilde{x}_{\delta_k}, \quad \text{where} \quad d\tilde{x}_t = -\nabla g(\tilde{x}_t)\,dt + \sqrt{2\tau}\,dB_t, \quad \tilde{x}_0 = x_k, \tag{3} $$
and then takes a gradient descent step on the negative log-likelihood,
$$ x_{k+1} = y_k - \delta_k \nabla f(y_k). \tag{4} $$
In this method, we assume that the prior diffusion equation (3) can be solved efficiently. When the prior distribution is a standard normal distribution, where $g(x) = \frac{1}{2}\|x\|_2^2$ on $\mathbb{R}^d$, we can instantiate equation (3) in closed form:
$$ y_k = e^{-\delta_k}\, x_k \;+\; \sqrt{\tau\left(1 - e^{-2\delta_k}\right)}\;\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I_d). \tag{5} $$
In general, the diffusion equation (3) can also be solved numerically for separable $g$ of the form
$$ g(x) = \sum_{j=1}^{d} g_j(x_j), $$
where each $g_j: \mathbb{R} \to \mathbb{R}$. In this case, we only need to solve $d$ one-dimensional problems, which are relatively simple. For example, this includes the regularization arising from the Bayesian elastic net Li and Lin (2010),
$$ g_j(x_j) = \lambda_1 |x_j| + \lambda_2 x_j^2, $$
among other priors that decompose coordinate-wise.
We will also consider the stochastic version of Algorithm 1, the stochastic gradient Langevin dynamics (SGLD) method, with a strongly convex function $g$. Assume that function $f$ decomposes into $f(x) = \mathbb{E}_{z \sim \mathcal{D}}\,[f(x; z)]$, where $\mathcal{D}$ is the distribution over the dataset such that the expectation over it provides an unbiased estimate of the full gradient: $\mathbb{E}_{z \sim \mathcal{D}}\,[\nabla f(x; z)] = \nabla f(x)$. Then the new algorithm takes the following form and can be instantiated in the same way as Algorithm 1:
$$ y_k = \tilde{x}_{\delta_k}, \quad \text{where} \quad d\tilde{x}_t = -\nabla g(\tilde{x}_t)\,dt + \sqrt{2\tau}\,dB_t, \quad \tilde{x}_0 = x_k, \tag{6} $$
$$ x_{k+1} = y_k - \delta_k \nabla f(y_k; z_k), \qquad z_k \sim \mathcal{D}. \tag{7} $$
This algorithm becomes the streaming SGLD method when in each iteration we take one data point $z_k$.
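Combining the two updates, here is a minimal sketch of the streaming SGLD loop (reusing `prior_diffusion_step` from the sketch above; the unbiased per-sample gradient `grad_f_i`, the step sizes `deltas`, and the generator `rng` are hypothetical placeholders):

```python
def streaming_sgld(x0, grad_f_i, data, deltas, tau, rng):
    """Streaming SGLD: each iteration applies the exact prior diffusion (6)
    and then a stochastic gradient step (7) on a single data point.
    `grad_f_i(x, z)` should return an unbiased estimate of grad f(x)."""
    x = x0.copy()
    samples = []
    for delta in deltas:
        y = prior_diffusion_step(x, delta, tau, rng)  # diffusion step (6)
        z = data[rng.integers(len(data))]             # draw one data point
        x = y - delta * grad_f_i(y, z)                # gradient step (7)
        samples.append(x.copy())
    return samples
```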
In the analysis of Algorithm 1, we will use $p_k$ to denote the distribution of $x_k$, and $q_k$ to denote the distribution of $y_k$, where the randomness includes all random sampling in the algorithm. We also denote by $P_k$ and $Q_k$ the measures associated with the random variables $x_k$ and $y_k$, respectively. When using samples along the Markov chain to estimate expectations of a function $\phi$, we take a weighted average, so that
$$ \hat{\mathbb{E}}[\phi] = \frac{\sum_{k=1}^{K} w_k\, \phi(x_k)}{\sum_{k=1}^{K} w_k}, $$
which is equivalent, in expectation, to the expectation with respect to the weighted averaged distribution:
$$ \bar{p}_K = \frac{\sum_{k=1}^{K} w_k\, p_k}{\sum_{k=1}^{K} w_k}. $$
5 Langevin Algorithms in Lipschitz Convex Case
For the posterior $p^* \propto \exp(-(f + g)/\tau)$, we assume that function $f$ satisfies the following two conditions, common in convex analysis.
Assumptions for the Lipschitz Convex Case:
- A1: Function $f$ is convex.
- A2: Function $f$ is $G$-Lipschitz continuous on $\mathbb{R}^d$: $\|\nabla f(x)\|_2 \le G$ for all $x \in \mathbb{R}^d$.
We also assume that function $g$ is $\mu$-strongly convex (Assumption A0). Note that we have assumed that the gradient of function $f$ exists, but have not assumed that function $f$ is smooth.
5.1 Full Gradient Langevin Algorithm Convergence in Lipschitz Convex Case
Our main result for the full gradient Langevin algorithm in the case that $f$ is Lipschitz can be stated as follows.
Theorem 1.
Assume that function $f$ satisfies the convexity and Lipschitz continuity Assumptions A1 and A2. Further assume that function $g$ satisfies Assumption A0. Then the iterates $x_k$ of the Langevin Algorithm 1 satisfy:
By the convexity of the KL divergence, this leads to a dimension independent convergence time
for the averaged distribution $\bar{p}_K$ to converge to accuracy $\varepsilon$ in the KL divergence $\mathrm{KL}(\bar{p}_K \,\|\, p^*)$.
We devote the rest of this section to prove Theorem 1.
Proof of Theorem 1.
We take a composite optimization approach and analyze the convergence of the Langevin algorithm in two steps. First we characterize the decrease of the regularized entropy along the diffusion step (3).
Lemma 1 (For Regularized Entropy).
We then capture the decrease of the cross entropy along the gradient descent step (4). This result parallels the standard convergence analysis of gradient descent (see Zinkevich, 2003; Zhang, 2004, for example).
Lemma 2.
We then combine the two steps to prove the overall convergence rate for the Langevin algorithm. It is worth noting that by aligning the diffusion step (3) and the gradient descent step (4), the intermediate terms from the two lemmas cancel out perfectly, and we achieve the same convergence rate as that of stochastic gradient descent in optimization.
Proposition 3.
Set for some and . Then
Choosing and , we have
(8)
The learning rate schedule of $\delta_k \propto 1/(\mu k)$ was introduced to the SGD analysis for strongly convex objectives in Shalev-Shwartz et al. (2011), which leads to a similar rate as that of Proposition 3, but with an extra logarithmic term compared to (8). The use of weighted averaging has been adopted in the more recent literature on SGD analysis as an effort to avoid this logarithmic term (for example, see Lacoste-Julien et al. (2012)). The resulting bound in the SGD analysis becomes identical to that of Proposition 3, and this rate is optimal for nonsmooth strongly convex optimization Rakhlin et al. (2012). In addition, it is possible to implement a similar scheme for the Langevin algorithm using moving averaging, as discussed in Shamir and Zhang (2013).
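To make the two schedules concrete, here is a minimal sketch; the constants follow the cited SGD literature (Shalev-Shwartz et al., 2011; Lacoste-Julien et al., 2012) rather than the tuning in Proposition 3, whose constants are problem dependent:

```python
import numpy as np

mu, K = 1.0, 1000  # placeholder strong convexity constant and iteration budget

# Shalev-Shwartz et al. (2011): delta_k = 1/(mu k) with uniform averaging;
# the resulting bound carries an extra log K factor.
deltas_pegasos = 1.0 / (mu * np.arange(1, K + 1))

# Lacoste-Julien et al. (2012): delta_k = 2/(mu (k + 1)) with weights
# proportional to k; the weighted average removes the log K factor.
deltas_weighted = 2.0 / (mu * (np.arange(1, K + 1) + 1.0))
weights = np.arange(1, K + 1, dtype=float)
weights /= weights.sum()  # normalized weights for the averaged iterate
```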
It can be observed that taking a large step size grants rapid convergence. The largest admissible choice of the schedule effectively initializes the algorithm from the prior distribution $p_0 \propto e^{-g/\tau}$. With this choice, we can bound the initial error via the Talagrand inequality in Proposition 2 and the log-Sobolev inequality Bakry and Emery (1985); Ledoux (2000) for the $\mu$-strongly log-concave distribution $p^*$:
Plugging this bound and the step size choice into equation (8), we arrive at our result that
(9)
∎
5.2 Streaming SGLD Convergence in Lipschitz Convex Case
To analyze the streaming stochastic gradient Langevin algorithm, we assume that function $f$ decomposes as
$$ f(x) = \mathbb{E}_{z \sim \mathcal{D}}\,[f(x; z)], $$
where $\mathcal{D}$ is the distribution over the data samples. In this case, we modify Assumption A2 and assume that the individual log-likelihood satisfies the Lipschitz condition.
Assumption on the individual loss:
- A2′: Function $f(\cdot\,; z)$ is $G$-Lipschitz continuous on $\mathbb{R}^d$: $\|\nabla f(x; z)\|_2 \le G$, for all $x \in \mathbb{R}^d$ and all $z$.
In the case that $f$ is Lipschitz, our main result for SGLD is the following counterpart of Theorem 1.
Theorem 2.
Assume that function $f$ satisfies the convexity Assumption A1 and the Lipschitz continuity Assumption A2′ for the individual log-likelihood $f(\cdot\,; z)$. Further assume that function $g$ satisfies Assumption A0. Then the iterates $x_k$ of the streaming SGLD Algorithm 2 satisfy:
leading to a dimension independent convergence time
for the averaged distribution $\bar{p}_K$ to converge to accuracy $\varepsilon$ in the KL divergence $\mathrm{KL}(\bar{p}_K \,\|\, p^*)$.
We devote the rest of this section to prove Theorem 2.
Proof of Theorem 2.
As in the previous section, the convergence of the regularized entropy along equation (6) follows from Lemma 1.
For the convergence of the cross entropy along equation (7), the following lemma follows from the standard analysis of SGD.
Lemma 3.
Adopt Assumption A2′ that $f(\cdot\,; z)$ is $G$-Lipschitz for all $z$, and Assumption A1 that $f$ is convex. We have for all $k$:
(12)
It implies the following bound, which modifies Lemma 2.
Lemma 4.
Given any probability density $q$ on $\mathbb{R}^d$, define
then we have
Initializing from the prior distribution, we can follow the same proof as in Proposition 3 and obtain a convergence rate similar to that of the non-stochastic case.
Proposition 4.
Set for some and . Then
We can choose , and then for , we have
(13)
Following the same steps as in the full gradient case, we arrive at the result. ∎
6 Langevin Algorithms in Smooth Convex Case
For the posterior $p^* \propto \exp(-(f + g)/\tau)$, we make the following assumptions on function $f$.
Assumptions for the smooth convex case:
- A1: Function $f$ is convex and positive.
- A3: Function $f$ is $L$-Lipschitz smooth on $\mathbb{R}^d$: $\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2$ for all $x, y \in \mathbb{R}^d$.
We also assume that function $g$ is $\mu$-strongly convex. Note that this setting covers the case where we simply assume the entire negative log-posterior $U$ to be $\mu$-strongly convex and Lipschitz smooth: one can separate the negative log-posterior into two parts, $g(x) = \frac{\mu}{2}\|x\|_2^2$ and $f(x) = U(x) - \frac{\mu}{2}\|x\|_2^2$, which is convex and Lipschitz smooth. We therefore directly let $g(x) = \frac{\mu}{2}\|x\|_2^2$ in what follows.
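A minimal sketch of this reduction, assuming access to the gradient of the negative log-posterior `grad_U` (a placeholder name) and the strong convexity parameter `mu`:

```python
def split_potential(grad_U, mu):
    """Split a mu-strongly convex, Lipschitz smooth potential U into
    g(x) = (mu/2) ||x||^2 (explicitly integrable diffusion) and
    f = U - g (convex and Lipschitz smooth), as described above."""
    def grad_f(x):
        # grad f(x) = grad U(x) - grad g(x) = grad U(x) - mu x
        return grad_U(x) - mu * x
    return grad_f
```

For general $\mu$, the exact diffusion step for this $g$ is the Ornstein-Uhlenbeck transition with mean $e^{-\mu\delta_k}x_k$ and covariance $\frac{\tau}{\mu}\left(1 - e^{-2\mu\delta_k}\right)I$, generalizing equation (5).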
6.1 Full Gradient Langevin Algorithm Convergence in Smooth Convex Case
Our main result for the full gradient Langevin algorithm in the case that $f$ is smooth can be stated as follows. Compared to Theorem 1, the result of Theorem 3 is useful for loss functions, such as the least squares loss, that are smooth but not Lipschitz continuous.
Theorem 3.
Assume that function $f$ satisfies the convexity and Lipschitz smoothness Assumptions A1 and A3. Also assume that $\nabla^2 f(x) \preceq H$ for all $x$. Further let $g(x) = \frac{\mu}{2}\|x\|_2^2$. Then the iterates $x_k$ of Algorithm 1, initialized from the prior distribution, satisfy:
leading to a convergence time whose dimension dependence enters only through $\operatorname{tr}(H)$,
for the averaged distribution $\bar{p}_K$ to converge to accuracy $\varepsilon$ in the KL divergence $\mathrm{KL}(\bar{p}_K \,\|\, p^*)$.
Ridge Separable Case
Assume that function $f$ decomposes into the following ridge-separable form:
$$ f(x) = \frac{1}{n}\sum_{i=1}^{n} \phi_i\!\left(a_i^{\top} x\right). \tag{14} $$
We make some assumptions on the activation functions $\phi_i$ and the data points $a_i$.
Assumptions in the ridge separable case:
- R1: For all $i$, the one-dimensional activation function $\phi_i$ has a bounded second derivative: $|\phi_i''(s)| \le C$ for any $s \in \mathbb{R}$.
- R2: For all $i$, the data point $a_i$ has a bounded norm: $\|a_i\|_2 \le R$.
Assumptions R1 and R2 combine to give a Lipschitz smoothness constant of $C R^2$ for the individual log-likelihood.
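As an illustration of the ridge-separable structure (14), the following sketch evaluates the gradient and records why R1 and R2 give this smoothness constant; for simplicity it uses a common activation for all data points, and the names `A` (rows $a_i^\top$) and `phi_prime` are placeholders:

```python
import numpy as np

def ridge_separable_grad(x, A, phi_prime):
    """Gradient of f(x) = (1/n) sum_i phi(a_i^T x):
    grad f(x) = (1/n) sum_i phi'(a_i^T x) a_i.

    Under R1 (|phi''| <= C) and R2 (||a_i||_2 <= R), the Hessian
    (1/n) sum_i phi''(a_i^T x) a_i a_i^T has operator norm <= C R^2."""
    n = A.shape[0]
    return A.T @ phi_prime(A @ x) / n

# Example activation: phi(s) = log(1 + e^s), with |phi''| <= 1/4 (so C = 1/4).
phi_prime = lambda s: 1.0 / (1.0 + np.exp(-s))
```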
Corollary 1.
We make the convexity Assumption A1 on function $f$ and let it take the ridge-separable form (14) (also let $g(x) = \frac{\mu}{2}\|x\|_2^2$). Further adopt Assumptions R1 and R2 on the activation functions and the data points, respectively. Then Algorithm 1, initialized from the prior distribution with the step sizes of Theorem 3, has a dimension independent convergence time for the averaged distribution $\bar{p}_K$ to converge to accuracy $\varepsilon$ in the KL divergence $\mathrm{KL}(\bar{p}_K \,\|\, p^*)$.
Proof.
We devote the rest of this section to the proof of Theorem 3.
Proof of Theorem 3.
For the decrease of the cross entropy along the gradient descent step (4), we use the following derivation for $L$-Lipschitz smooth $f$. For $p_{k+1}$ being the density of $x_{k+1}$ following equation (4), and for $q$ being another probability density,
where $\gamma$ is the optimal coupling between the distributions with densities $p_{k+1}$ and $q$.
With , we have
(15)
We also have the following lemma.
Lemma 5.
Let $\gamma$ be the optimal coupling of the two distributions above, and let $p^*$ be the solution of (1). Then we have
Next we bound the last term of equation (16) at $q = p^*$.
Lemma 6.
With these lemmas, we are ready to establish the convergence time of the Langevin Algorithm 1.
We note that the shrinking step size schedule satisfies:
Using this inequality and combining Lemma 6 with equation (16) at $q = p^*$, we obtain that
Summing over $k$,
Denote and take . Then for and for ,
or
(17)
Inspired by the Lipschitz continuous case, we initialize from the prior distribution. Then by the Talagrand and log-Sobolev inequalities,
Applying Lemma 6 to the above inequality bounds the initial error. Then, with the step size choice above, we obtain that the weighted-averaged KL divergence is upper bounded:
(18)
It follows that the weighted-averaged KL divergence is at most $\varepsilon$ if
Plugging in the preceding bound gives the final result.
∎
6.2 SGLD Convergence in Smooth Convex Case
Similar to the Lipschitz continuous case, we assume that function $f$ decomposes as
$$ f(x) = \mathbb{E}_{z \sim \mathcal{D}}\,[f(x; z)], $$
where $\mathcal{D}$ is the distribution over the data samples. Making the following assumptions, which modify Assumption A3 so that the individual log-likelihood satisfies the Lipschitz smoothness condition, yields the convergence rate for the SGLD method.
Assumptions on the individual loss:
- A3′: Function $f(\cdot\,; z)$ is $L$-Lipschitz smooth on $\mathbb{R}^d$: $\|\nabla f(x; z) - \nabla f(y; z)\|_2 \le L\,\|x - y\|_2$, for all $x, y \in \mathbb{R}^d$ and all $z$.
- A4: The stochastic gradient variance at the mode $x^* = \arg\min_x U(x)$ is bounded: $\mathbb{E}_{z \sim \mathcal{D}}\,\|\nabla f(x^*; z) - \nabla f(x^*)\|_2^2 \le \sigma^2$.
Assumption A3′ ensures that the stochastic estimates of $f$ are Lipschitz smooth.
Under the above assumptions, we obtain in what follows the convergence rate for the SGLD method with minibatch size $b$. This result is the counterpart of its full gradient version in Theorem 3.
Ridge Separable Case
Assume that the individual components take the following form, so that function $f$ becomes ridge-separable:
$$ f(x; z_i) = \phi_i\!\left(a_i^{\top} x\right). \tag{19} $$
To ensure bounded stochastic gradient variance at the mode of the posterior, we additionally assume that at the mode $x^*$, the derivatives of the activation functions are bounded.
Assumption in the ridge separable case on bounded variance:
- R3: For all $i$, the derivative at the mode is bounded, $|\phi_i'(a_i^{\top} x^*)| \le D$, where $x^* = \arg\min_x U(x)$.
Assumption R3 ensures that the stochastic gradient variance is bounded at the mode. Then we have the following corollary, instantiating Theorem 4.
Corollary 2.
Proof of Corollary 2.
We devote the rest of this section to the proof of Theorem 4.
Proof of Theorem 4.
We first note that, because each $f(\cdot\,; z)$ is $L$-Lipschitz smooth, the stochastic estimate of function $f$ is also $L$-Lipschitz smooth:
We thereby invoke the next lemma.
Lemma 7.
Assume that function $f$ is convex, and that its stochastic estimate $f(\cdot\,; z)$ is $L$-Lipschitz smooth. Then
where the expectation is over the stochastic gradient, and $\gamma$ is the optimal coupling between the two distributions in question.
Taking the step sizes as before and combining Lemma 1 and Lemma 7, we obtain that
(20)
We then adapt Lemma 6 to the stochastic gradient method.
Lemma 8 (Stochastic Gradient Counterpart of Lemma 6).
Assume that the stochastic gradient variance at the mode is bounded (Assumption A4). Let
and let $p^*$ be the solution of (2). Then, for the $L$-Lipschitz smooth function $f$,
(21)
For the last piece of the argument, we bound the variance of the stochastic gradient at the mode $x^*$. For minibatch samples that are i.i.d. draws from the data set, giving unbiased estimators of $\nabla f$, we have
leading to the bound that
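For completeness, the standard computation behind this bound, sketched for a minibatch $\{z_1, \dots, z_b\}$ of i.i.d. draws under the bounded variance Assumption A4:
$$ \mathbb{E}\,\Big\| \frac{1}{b}\sum_{j=1}^{b} \nabla f(x^*; z_j) - \nabla f(x^*) \Big\|_2^2 = \frac{1}{b}\,\mathbb{E}\,\big\| \nabla f(x^*; z) - \nabla f(x^*) \big\|_2^2 \le \frac{\sigma^2}{b}, $$
since the centered summands are independent with mean zero, so the cross terms vanish in expectation.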
7 Proofs of the Supporting Lemmas
7.1 Proofs of Lemmas in the Lipschitz Continuous Case
7.1.1 Proof of Lemma 1
Before proving Lemma 1, we first state a result (Theorem 23.9 of Villani, 2009) that establishes the strong subdifferential of the Wasserstein-2 distance.
Lemma 9.
Assume that two curves of probability measures solve the following continuity equations
Then
where the two velocity fields are the optimal transport vector fields:
Writing $q_t$ and $\nu$ as the density functions of the two measures, we take them so that $q_t$ follows the Fokker-Planck equation associated with process (3) and $\nu$ is a constant measure. This leads to the following equation
For a probability measure associated with its density, define the relative entropy accordingly. We can then use the fact that the relative entropy is geodesically strongly convex (see Proposition 9.3.2 of Ambrosio et al. (2008)) to prove the following lemma.
Lemma 10.
For $q_t$ being the density of the diffusion at time $t$,
Proof.
Let $\rho_t$ be the geodesic between the two distributions. Geodesic strong convexity of the relative entropy states that (see Proposition 9.3.2 of Ambrosio et al. (2008)):
and consequently
By the definition of the subdifferential (cf. Villani, 2009, Theorem 23.14), we also have, along the diffusion process defined by equation (3):
Taking the limit, we obtain the result. ∎
Proof of Lemma 1.
7.1.2 Proof of Lemma 2
Proof of Lemma 2.
We first state a point-wise result along the gradient descent step (4):
(23)
This is because
where the last step follows from the convexity and Lipschitz continuity of $f$.
We then denote the measures corresponding to the random variables $y_k$ and $x_{k+1}$; from the definitions, we know that they have densities $q_k$ and $p_{k+1}$.
Denote by $\gamma$ an optimal coupling between $q_k$ and the stationary measure with density $p^*$. We then take expectations over $\gamma$ on both sides of equation (23):
From the relationship $x_{k+1} = y_k - \delta_k \nabla f(y_k)$, we know the induced joint distribution of the pair. Note that it also defines a coupling, and therefore
Therefore,
∎
7.1.3 Proofs of Lemmas 3 and 4 for the streaming SGLD Algorithm 2
Proof of Lemma 3.
By the definitions of $x_{k+1}$ and $y_k$,
We now take expectation with respect to the data sample $z_k$, conditioned on $y_k$, to obtain
The last step follows from the convexity of $f(\cdot\,; z)$. Therefore, the desired bound follows. ∎
Proof of Lemma 4.
We first denote the measures corresponding to the random variables $y_k$ and $x_{k+1}$; from the definitions, we know that they have densities $q_k$ and $p_{k+1}$.
Denote by $\gamma$ an optimal coupling between $q_k$ and the stationary measure with density $p^*$. We then take expectations over $\gamma$ on both sides of Eq. (12):
From the relationship $x_{k+1} = y_k - \delta_k \nabla f(y_k; z_k)$, we know that, conditional on $z_k$, the induced joint distribution of the pair also defines a coupling, and therefore
Plugging this result and the Lipschitz assumption on $f(\cdot\,; z)$ in, we obtain that
∎
7.2 Proofs of Lemmas in the Lipschitz Smooth Case
7.2.1 Proofs of Lemmas 5 and 6 for the full gradient Langevin Algorithm 1
Proof of Lemma 5.
By the geodesic convexity of the entropy functional,
where the map is the optimal transport from one distribution to the other. Using the optimal coupling,
In addition, the convexity of $f$ and $g$ implies that
and
Adding the above three inequalities, and noting that the following holds point-wise,
we obtain that
and that
(24)
Proof of Lemma 6.
We have
In what follows we bound the remaining expectation. We first note that the posterior density
is strongly log-concave. By Hargé (2004), for any convex function,
Taking , we obtain that
On the other hand, by the celebrated relation between the mean and the mode of 1-unimodal distributions (see, e.g., Basu and DasGupta, 1996),
where $\Sigma$ is the covariance of $p^*$. This results in the following bound
Combining the two bounds, we obtain that
∎
7.2.2 Proofs of Lemmas 7 and 8 for the SGLD Algorithm 2
Proof of Lemma 7.
By the definitions of and ,
We now take expectation with respect to , conditioned on , to obtain
(26)
We then upper bound the conditional expectation by introducing a variable distributed according to the stationary distribution that couples optimally with the law of the current iterate:
(27)
For function $f(\cdot\,; z)$ being $L$-Lipschitz smooth,
Taking expectation over the randomness of minibatch assignment on both sides leads to the fact that
Combining this equation with equation (24), we adapt Lemma 5 to the stochastic gradient method:
Applying this result to equation (27) and taking expectations on both sides, we obtain:
Therefore,
leading to the final result that
∎
Proof of Lemma 8.
We have
Taking expectation on both sides, we obtain that
(28)
We now upper bound this term. Let $q$ be an appropriately chosen normal distribution, and define
(29)
(30)
then it can be expressed as
which is the solution of
Therefore
(31)
We then decompose the error term:
Since $x^*$ is the minimizer of $U$, we have $\nabla U(x^*) = 0$. Hence
(32)
8 Conclusion
This paper investigated the convergence of Langevin algorithms with strongly log-concave posteriors. We assume that the strongly log-concave posterior can be decomposed into two parts, with one part being simple and explicitly integrable with respect to the underlying SDE. This is analogous to the situation of proximal gradient methods in convex optimization. Using a new analysis technique that mimics the corresponding analysis in convex optimization, we obtain convergence results for Langevin algorithms that are independent of the dimension, both for Lipschitz and for a large class of smooth convex problems in machine learning. Our result addresses a long-standing puzzle with respect to the convergence of Langevin algorithms. We note that the current work focused on the standard Langevin algorithm, and the resulting convergence rate in terms of the accuracy dependency is inferior to the best known results leveraging underdamped or even higher order Langevin dynamics, such as Cheng et al. (2018b); Dalalyan and Riou-Durand (2018); Shen and Lee (2019); Ma et al. (2021); Mou et al. (2021), which correspond to accelerated methods in optimization. It thus remains open to investigate whether dimension independent bounds can be combined with these accelerated methods to improve the accuracy dependence as well as the condition number dependence.
References
- Ambrosio et al. (2008) L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2nd edition, 2008.
- Bakry and Emery (1985) D. Bakry and M. Emery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177–206. 1985.
- Basu and DasGupta (1996) S. Basu and A. DasGupta. The mean, median, and mode of unimodal distributions: a characterization. Teor. Veroyatnost. i Primenen., 41:336–352, 1996.
- Bernton (2018) E. Bernton. Langevin Monte Carlo and JKO splitting. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 1777–1798, 2018.
- Bou-Rabee et al. (2018) N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv:1805.00452, 2018.
- Chatterji et al. (2018) N. Chatterji, N. Flammarion, Y.-A. Ma, P. Bartlett, and M. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 764–773, 2018.
- Chatterji et al. (2020) N. S. Chatterji, J. Diakonikolas, M. I. Jordan, and P. L. Bartlett. Langevin Monte Carlo without smoothness. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, 2020.
- Cheng and Bartlett (2018) X. Cheng and P. L. Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT), pages 186–211, 2018.
- Cheng et al. (2018a) X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648, 2018a.
- Cheng et al. (2018b) X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 300–323, 2018b.
- Dalalyan (2017) A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Royal Stat. Soc. B, 79(3):651–676, 2017.
- Dalalyan and Karagulyan (2017) A. S. Dalalyan and A. G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095, 2017.
- Dalalyan and Riou-Durand (2018) A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv:1807.09382, 2018.
- Durmus and Moulines (2017) A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 06 2017.
- Durmus and Moulines (2019) A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
- Durmus et al. (2019) Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
- Dwivedi et al. (2018) R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv:1801.02309, 2018.
- Gelman et al. (2004) A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, New York, 2004.
- Hargé (2004) G. Hargé. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probab. Theory Related Fields, 130:415–440, 2004.
- Jordan et al. (1998) R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal., 29(1):1–17, January 1998.
- Lacoste-Julien et al. (2012) S. Lacoste-Julien, M. Schmidt, and F. R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.
- Ledoux (2000) M. Ledoux. The geometry of Markov diffusion generators. Ann Fac Sci Toulouse Math, 9(6):305–366, 2000.
- Lee et al. (2018) Y.-T. Lee, Z. Song, and S. S. Vempala. Algorithmic theory of ODEs and sampling from well-conditioned logconcave densities. arXiv:1812.06243, 2018.
- Li and Lin (2010) Q. Li and N. Lin. The Bayesian elastic net. Bayesian Analysis, 5(1):151–170, 2010.
- Ma et al. (2019) Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci. U.S.A., 116:20881–20885, 2019.
- Ma et al. (2021) Y.-A. Ma, N. S. Chatterji, X. Cheng, N. Flammarion, P. L. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? Bernoulli, 27(3):1942–1992, 2021.
- Mangoubi and Smith (2017) O. Mangoubi and A. Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv:1708.07114, 2017.
- Mangoubi and Vishnoi (2018) O. Mangoubi and N. K. Vishnoi. Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. arXiv:1802.08898, 2018.
- Mou et al. (2021) W. Mou, Y.-A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22:1–41, 2021.
- Mou et al. (2019) Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. arXiv:1910.00551, 2019.
- Neal (1996) R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer-Verlag, New York, 1996.
- Otto and Villani (2000) F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000.
- Rakhlin et al. (2012) A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1571–1578, 2012.
- Roberts and Stramer (2002) G. O. Roberts and O. Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab., 4:337–357, 2002.
- Rossky et al. (1978) P. J. Rossky, J. D. Doll, and H. L. Friedman. Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys., 69(10):4628, 1978.
- Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
- Shamir and Zhang (2013) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 71–79, 2013.
- Shen and Lee (2019) R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pages 2098–2109, 2019.
- Sollich (2002) P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46:21–52, 2002.
- Vempala and Wibisono (2019) S. S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Log-Sobolev suffices. arXiv:1903.08568, 2019.
- Villani (2009) C. Villani. Optimal Transport, Old and New, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009.
- Wibisono (2018) A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 2093–3027, 2018.
- Zhang (2004) T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 116, 2004.
- Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.
- Zou and Gu (2019) D. Zou and Q. Gu. Stochastic gradient Hamiltonian Monte Carlo methods with recursive variance reduction. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.