This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint

Yoav Freund University of California, San Diego Yi-An Ma University of California, San Diego Tong Zhang Google Research The Hong Kong University of Science and Technology
Abstract

There has been a surge of works bridging MCMC sampling and optimization, with a specific focus on translating non-asymptotic convergence guarantees for optimization problems into the analysis of Langevin algorithms in MCMC sampling. A conspicuous distinction between the convergence analysis of Langevin sampling and that of optimization is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for optimization are dimension-free for convex problems. Whether a dimension independent convergence rate can be achieved by Langevin algorithm is thus a long-standing open problem. This paper provides an affirmative answer to this problem for large classes of either Lipschitz or smooth convex problems with normal priors. By viewing Langevin algorithm as composite optimization, we develop a new analysis technique that leads to dimension independent convergence rates for such problems.

1 Introduction

Two of the major themes in machine learning are point prediction and uncertainty quantification. Computationally, they manifest in two types of algorithms: optimization and Markov chain Monte Carlo (MCMC). While both strategies have developed relatively separately for decades, there is a recent trend in relating both strands of research and translating nonasymptotic convergence guarantees in gradient based optimization methods to those in MCMC Dalalyan (2017); Durmus and Moulines (2019); Dalalyan and Karagulyan (2017); Cheng et al. (2018b); Cheng and Bartlett (2018); Dwivedi et al. (2018); Mangoubi and Smith (2017); Mangoubi and Vishnoi (2018); Bou-Rabee et al. (2018); Chatterji et al. (2018); Vempala and Wibisono (2019); Wibisono (2018). In particular, the Langevin sampling algorithm Rossky et al. (1978); Roberts and Stramer (2002); Durmus and Moulines (2017) has been shown to be a form of gradient descent on the space of probabilities Jordan et al. (1998); Wibisono (2018); Bernton (2018); Durmus et al. (2019). Many convergence rates on Langevin algorithm have emerged thenceforward, based on different assumptions on the posterior distribution (e.g., Dalalyan and Riou-Durand, 2018; Ma et al., 2019; Lee et al., 2018; Cheng et al., 2018a; Shen and Lee, 2019; Ma et al., 2021; Mou et al., 2021; Zou and Gu, 2019, to list a few).

A conspicuous distinction between the convergence analysis of Langevin sampling and that of gradient descent is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for gradient descent are dimension-free for convex problems. This prompts us to ask:

Can Langevin algorithm achieve dimension independent convergence rate under the usual convex assumptions?

In order to answer this question formally, we make two assumptions on the negative log-likelihood function. One is that the negative log-likelihood is convex. Another is that the negative log-likelihood is either Lipschitz continuous or Lipschitz smooth. We also employ a known and tractable prior distribution that is strongly log-concave—often times taken to be a normal distribution—to serve as a parallel to the L2L_{2} regularizer in gradient descent.

Under such assumptions, we answer the above highlighted question in the affirmative. In particular, we prove that a Langevin algorithm converges similarly as convex optimization for this class of problems. In the analysis, we observe that the number of gradient queries required for the algorithm to converge does not depend on the dimensionality of the problem for either the Lipschitz continuous log-likelihood or the Lipschitz smooth log-likelihood equipped with a ridge separable structure.

To obtain this result, we first follow recent works (reference Durmus et al. (2019) in particular) and formulate the posterior sampling problem as optimizing over the Kullback-Leibler (KL) divergence, which is composed of two terms: (regularized) entropy and cross entropy. We then decompose the Langevin algorithm into two steps, each optimizing one part of the objective function. With a strongly convex and tractable prior, we explicitly integrate the diffusion along the prior distribution, optimizing the regularized entropy; whereas gradient descent over the convex negative log-likelihood optimizes the cross entropy. Via analyzing an intermediate quantity in this composite optimization procedure, we achieve a tight convergence bound that corresponds to the gradient descent’s convergence for convex optimization on the Euclidean space. This dimension independent convergence time for Lipschitz continuous log-likelihood and Lipschitz smooth log-likelihood endowed with a ridge separable structure carries over to the stochastic versions of the Langevin algorithm.

2 Preliminaries

2.1 Two Problem Classes

We consider sampling from a posterior distribution over parameter wdw\in{\mathbb{R}}^{d}, given the data set 𝐳{\mathbf{z}}:

p(w|𝐳)p(𝐳|w)π(w)exp(U(w)),p(w|{\mathbf{z}})\propto p({\mathbf{z}}|w)\pi(w)\propto\exp\left(-U(w)\right),

where the potential function UU decomposes into two parts: U(w)=β1(f(w)+g(w))U(w)=\beta^{-1}\left(f(w)+g(w)\right).

While the formulation is general, in the machine learning setting, f(w)f(w) usually corresponds to the negative log-likelihood, and g(w)g(w) corresponds to the negative log-prior. The parameter β\beta is the temperature, which often takes the value of 1/n1/n in machine learning, where nn is the number of training data. The key motivation to consider this decomposition is that we assume that gg is “simple” so that an SDE involving gg can be solved to high precision. We will take advantage of this assumption in our algorithm design.

Assumption on function gg
  1. A0

    We assume that function gg is mm-strongly convex (g(w)m2w2g(w)-\frac{m}{2}\left\|w\right\|^{2} is convex) 111We also say that the density proportional to exp(β1g(w))\exp\left(-\beta^{-1}g(w)\right) is β1m\beta^{-1}m-strongly log-concave in this case. and can be explicitly integrated.

Assumption on function ff

We assume that function ff is convex (Assumption A1) and consider two cases regarding its regularity.

  • In the first case, we assume that function ff is GG-Lipschitz continuous (Assumption A2L\rm{A2_{L}}).

  • In the second case, we assume that function ff is LL-Lipschitz smooth (Assumption A2S\rm{A2_{S}}). We then instantiate the result by endowing it with a ridge separable structure (Assumptions R1 and R2).

The first case stems from Bayesian classification problems, where one has a simple strongly log-concave prior and a log-concave and log-Lipschitz likelihood that encodes the complexity of the data. Examples include Bayesian neural networks for classification tasks Neal (1996), Bayesian logistic regression Gelman et al. (2004), as well as other Bayesian classification problems Sollich (2002) with Gaussian or Bayesian elastic net priors. The second case corresponds to the regression type problems, where the entire posterior is strongly log-concave and log-Lipschitz smooth. In this case, one can separate the negative log-posterior into two parts: β1g(w)=β1m2w2\beta^{-1}g(w)=\frac{\beta^{-1}m}{2}\left\|w\right\|^{2} and β1f(w)=(logp(w|𝐳)β1m2w2)\beta^{-1}f(w)=\left(-\log p(w|{\mathbf{z}})-\frac{\beta^{-1}m}{2}\left\|w\right\|^{2}\right), which is convex and β1L\beta^{-1}L-Lipschitz smooth. We therefore directly let g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2} in Section 6.

2.2 Objective Functional and Convergence Criteria

We take the KL divergence β1Q(p)\beta^{-1}Q(p) to be our objective functional and solve the following optimization problem:

p=\displaystyle p_{*}= argminpQ(p),\displaystyle\arg\min_{p}Q(p), (1)
Q(p)=p(w)lnpp(w|𝐳)dw=𝔼wp[f(w)+g(w)+βlnp(w)].\displaystyle Q\left(p\right)=\int p(w)\ln\frac{p}{p(w|{\mathbf{z}})}dw={\mathbb{E}}_{w\sim p}\left[f(w)+g(w)+\beta\ln p(w)\right].

The minimizer that solves the optimization problem (1) is the posterior distribution:

p(w)exp(β1(f(w)+g(w))).\displaystyle p_{*}(w)\propto\exp\left(-\beta^{-1}(f(w)+g(w))\right). (2)

We further define the entropy functional as

H(p)=β𝔼wplnp(w),H(p)=\beta\;{\mathbb{E}}_{w\sim p}\ln p(w),

so that the objective functional decomposes into the regularized entropy plus cross entropy:

Q(p)=(H(p)+𝔼wp[g(w)])+𝔼wp[f(w)].Q(p)=\big{(}H(p)+{\mathbb{E}}_{w\sim p}\left[g(w)\right]\big{)}+{\mathbb{E}}_{w\sim p}\left[f(w)\right].

With this definition of the objective function, we state that the difference in QQ leads to the KL divergence.

Proposition 1.

Let pp be the solution of (1), and pp^{\prime} be another distribution on ww. We have

KL(pp)=β1[Q(p)Q(p)].{\mathrm{KL}}(p^{\prime}\|p)=\beta^{-1}[Q(p^{\prime})-Q(p)].

This result establishes that the convergence in the objective β1Q(p)\beta^{-1}Q(p^{\prime}) is equivalent to the convergence in KL{\mathrm{KL}}-divergence. Therefore our analysis will focus on the convergence of β1Q(p)\beta^{-1}Q(p^{\prime}).

We also define the 22-Wasserstein distance between two distributions that will become useful in our analysis.

Definition 1.

Given two probability distributions p(x)p(x) and p(y)p^{\prime}(y) on d{\mathbb{R}}^{d}, and let Π(p,p)\Pi(p,p^{\prime}) be the class of distributions q(x,y)q(x,y) on d×d{\mathbb{R}}^{d}\times{\mathbb{R}}^{d} so that the marginals q(x)=p(x)q(x)=p(x) and q(y)=p(y)q(y)=p^{\prime}(y). The W2W_{2} Wasserstein distance of pp and pp^{\prime} is defined as

W2(p,p)2=minqΠ(p,p)𝔼(x,y)qxy22.W_{2}(p,p^{\prime})^{2}=\min_{q\in\Pi(p,p^{\prime})}{\mathbb{E}}_{(x,y)\sim q}\|x-y\|_{2}^{2}.

A celebrated relationship between the KL{\mathrm{KL}}-divergence and the 22-Wasserstein distance is known as the Talagrand inequality Otto and Villani (2000).

Proposition 2.

Assume that probability density pp_{*} is m^\widehat{m}-strongly log-concave, and pp^{\prime} defines another distribution on d{\mathbb{R}}^{d}. Then pp_{*} satisfies the log-Sobolev inequality with constant m^/2\widehat{m}/2 222This fact follows from the Bakry-Emery criterion Bakry and Emery (1985)., and yields the following Talagrand inequality:

W22(p,p)m^1KL(pp).W_{2}^{2}(p_{*},p^{\prime})\leq\widehat{m}^{-1}{\mathrm{KL}}(p_{*}\|p^{\prime}).

3 Related Works

Some previous works have aimed to sample from posteriors of the similar kind and obtain convergence in the KL divergence or the squared 22-Wasserstein distance.

In the Lipschitz continuous case,

where the negative log-likelihood is convex and G^\widehat{G}-Lipschitz continuous, composed with an m^\widehat{m}-strongly convex and MM-Lipschitz smooth negative log-prior, the convergence time to achieve W22(p~T,p)ϵW_{2}^{2}(\widetilde{p}_{T},p_{*})\leq\epsilon is Ω~(dM+G^2m^2ϵ2)\widetilde{\Omega}\left(\frac{dM+\widehat{G}^{2}}{\widehat{m}^{2}\epsilon^{2}}\right) (Corollary 22 of Durmus et al., 2019). Similarly, Chatterji et al. (2020) uses Gaussian smoothing to obtain a convergence time of Ω~(d(M+G^2)m^2ϵ)\widetilde{\Omega}\left(\frac{d(M+\widehat{G}^{2})}{\widehat{m}^{2}\epsilon}\right) (in Theorem 3.4), which improves the dependence on accuracy ϵ\epsilon. In Mou et al. (2019), the Metropolis-adjusted Langevin algorithm is levaraged with a proximal sampling oracle to remove the polynomial dependence on the accuracy ϵ\epsilon (in total variation distance) and achieve a Ω~(dlog(1ϵ))\widetilde{\Omega}\left(d\log(\frac{1}{\epsilon})\right) convergence time for a related composite posterior distribution. Unfortunately, an additional dimension dependent factor is always introduced into the overall convergence rate. This work demonstrates that if the mm-strongly convex regularizer is explicitly integrable, then the convergence time for the Langevin algorithm to achieve KL(p~Tp)ϵ\mathrm{KL}(\widetilde{p}_{T}\|p_{*})\leq\epsilon is dimension independent: T=Ω(G^2m^ϵ)T=\Omega\left(\frac{\widehat{G}^{2}}{\widehat{m}\epsilon}\right). This is proven in Theorem 1 for the full gradient Langevin algorithm, and in Theorem 2 for the stochastic gradient Langevin algorithm. Using Proposition 2, the result implies a bound of T=Ω(G^2m^2ϵ)T=\Omega\left(\frac{\widehat{G}^{2}}{\widehat{m}^{2}\epsilon}\right) to achieve W22(p~T,p)ϵW_{2}^{2}(\widetilde{p}_{T},p_{*})\leq\epsilon.

In the Lipschitz smooth case,

where the negative log-posterior UU is m^\widehat{m}-strongly convex and L^\widehat{L}-Lipschitz smooth, the overdamped Langevin algorithm has been shown to converge in Ω~(L^2m^2dϵ)\widetilde{\Omega}\left(\frac{\widehat{L}^{2}}{\widehat{m}^{2}}\frac{d}{\epsilon}\right) number of gradient queries Dalalyan (2017); Dalalyan and Karagulyan (2017); Cheng and Bartlett (2018); Durmus and Moulines (2017, 2019); Durmus et al. (2019), while the underdamped Langevin algorithm converges in Ω~(L^3/2m^2dϵ)\widetilde{\Omega}\left(\frac{\widehat{L}^{3/2}}{\widehat{m}^{2}}\sqrt{\frac{d}{\epsilon}}\right) gradient queries Cheng et al. (2018b); Ma et al. (2021); Dalalyan and Riou-Durand (2018), to ensure that KL(p~Tp)ϵ{\mathrm{KL}}(\widetilde{p}_{T}\|p_{*})\leq\epsilon and W22(p~T,p)ϵW_{2}^{2}(\widetilde{p}_{T},p_{*})\leq\epsilon. Using a randomized midpoint integration method for the underdamped Langevin dynamics, this convergence time can be reduced to Ω~(L^m^4/3(dϵ)1/3)\widetilde{\Omega}\left(\frac{\widehat{L}}{\widehat{m}^{4/3}}\left(\frac{d}{\epsilon}\right)^{1/3}\right) for convergence in squared 22-Wasserstein distance Shen and Lee (2019). This paper establishes that for overdamped Langevin algorithm, the convergence time can be sharpened to Ω(trace(H^2)m^2ϵ)\Omega\left(\frac{{\mathrm{trace}}(\widehat{H}^{2})}{\widehat{m}^{2}\epsilon}\right) to achieve KL(p~Tp)ϵ{\mathrm{KL}}(\widetilde{p}_{T}\|p_{*})\leq\epsilon, where matrix H^\widehat{H} is an upper bound for the Hessian of function UU.

Previous works have also focused on the ridge separable potential functions studied in this work. There is a literature that requires incoherence conditions on the data vectors and/or high-order smoothness conditions on the component functions to achieve a Ω~((dϵ)1/4)\widetilde{\Omega}\left(\left(\frac{d}{\epsilon}\right)^{1/4}\right) convergence time for W22(p~T,p)ϵW_{2}^{2}(\widetilde{p}_{T},p_{*})\leq\epsilon using Hamiltonian Monte Carlo methods Mangoubi and Smith (2017); Mangoubi and Vishnoi (2018). Making further assumptions that the differential equation of the Hamiltonian dynamics is close to the span of a small number of basis functions, this bound can be improved to polynomial in log(d)\log(d) Lee et al. (2018). Another thread of work alleviates these assumptions and achieves the Ω~((dϵ)1/4)\widetilde{\Omega}\left(\left(\frac{d}{\epsilon}\right)^{1/4}\right) convergence time for the general ridge separable potential functions via higher order Langevin dynamics and integration schemes Mou et al. (2021). We follow this general ridge separable setting and assume that each individual log-likelihood is Lipschitz smooth. Under this assumption, we demonstrate in this paper, by instantiating the bound for the general Lipschitz smooth case, that the Langevin algorithm converges in Ω(1ϵ)\Omega\left(\frac{1}{\epsilon}\right) number of gradient queries to achieve KL(p~Tp)ϵ{\mathrm{KL}}(\widetilde{p}_{T}\|p_{*})\leq\epsilon (see Corollary 1 and Corollary 2).

4 Langevin Algorithms

We consider the following variant of the Langevin Algorithm 1.

Input: Initial distribution p0p_{0} on d{\mathbb{R}}^{d}, stepsize ηt\eta_{t}, β=1\beta=1
1 Draw w0w_{0} from p0p_{0}
2 for t=1,2,,Tt=1,2,\ldots,T do
3       Sample w~t\widetilde{w}_{t} from w~t(ηt)\widetilde{w}_{t}(\eta_{t}) with the following SDE on d{\mathbb{R}}^{d} and initial value w~t(0)=wt1\widetilde{w}_{t}(0)=w_{t-1}
w~t(ηt)=wt10ηtg(w~t(s))𝑑s+2β0ηt𝑑Bs,\displaystyle\widetilde{w}_{t}(\eta_{t})=w_{t-1}-\int_{0}^{\eta_{t}}\nabla g(\widetilde{w}_{t}(s))ds+\sqrt{2\beta}\int_{0}^{\eta_{t}}dB_{s}, (3)
where dBsdB_{s} is the standard Brownian motion on d{\mathbb{R}}^{d}.
4      Let
wt=w~tη~tf(w~t){w}_{t}=\widetilde{w}_{t}-\widetilde{\eta}_{t}\nabla f(\widetilde{w}_{t}) (4)
5 end for
return w~T\widetilde{w}_{T}
Algorithm 1 Langevin Algorithm with Prior Diffusion

In this method, we assume that the prior diffusion equation (3) can be solved efficiently. When the prior distribution is a standard normal distribution where g(w)=m2w22g(w)=\frac{m}{2}\left\|w\right\|_{2}^{2} on d{\mathbb{R}}^{d}, we can instantiate equation (3) to be:

Samplew~t𝒩(emηtwt1,1e2mηtmβI).\displaystyle\text{Sample}\quad\widetilde{w}_{t}\sim\mathcal{N}\left(e^{-m\eta_{t}}w_{t-1},\frac{1-e^{-2m\eta_{t}}}{m}\beta{\mathrm{I}}\right). (5)

In general, the diffusion equation (3) can also be solved numerically for separable g(w)g(w) of the form

g(w)=j=1dgj(wj),g(w)=\sum_{j=1}^{d}g_{j}(w_{j}),

where w=[w1,,wd]w=[w_{1},\ldots,w_{d}]. In this case, we only need to solve dd one-dimensional problems, which are relatively simple. For example, this includes the L1L2L_{1}-L_{2} regularization arising from the Bayesian elastic net Li and Lin (2010),

g(w)=m2w22+αw1,g(w)=\frac{m}{2}\|w\|_{2}^{2}+\alpha\|w\|_{1},

among other priors that decompose coordinate-wise.

We will also consider the stochastic version of Algorithm 1, the stochastic gradient Langevin dynamics (SGLD) method, with a strongly convex function g(w)g(w). Assume that function ff decomposes into f(w)=1ni=1n(w,zi)f(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i}). Let DD be the distribution over the dataset Ω\Omega such that expectation over it provides the unbiased estimate of the full gradient: 𝔼zDw(w,z)=f(w){\mathbb{E}}_{z\sim D}\nabla_{w}\ell(w,z)=\nabla f(w). Then the new algorithm takes the following form and can be instantiated in the same way as Algorithm 1.

Input: Initial distribution p0p_{0} on d{\mathbb{R}}^{d}, stepsize ηt\eta_{t}, β=1/n\beta=1/n
1 Draw w0w_{0} from p0p_{0}
2 for t=1,2,,Tt=1,2,\ldots,T do
3       Sample w~t\widetilde{w}_{t} from w~t(ηt)\widetilde{w}_{t}(\eta_{t}) with the following SDE on d{\mathbb{R}}^{d} and initial value w~t(0)=wt1\widetilde{w}_{t}(0)=w_{t-1}
w~t(ηt)=wt10ηtg(w~t(s))𝑑s+2β0ηt𝑑Bs,\displaystyle\widetilde{w}_{t}(\eta_{t})=w_{t-1}-\int_{0}^{\eta_{t}}\nabla g(\widetilde{w}_{t}(s))ds+\sqrt{2\beta}\int_{0}^{\eta_{t}}dB_{s}, (6)
where dBsdB_{s} is the standard Bronwian motion on d{\mathbb{R}}^{d}.
4      Draw minibatch 𝒮\mathcal{S} where each zi𝒮z_{i}\in\mathcal{S} are i.i.d. draws: ziDz_{i}\sim D. Let
wt=w~tη~t1|𝒮|zi𝒮w(w~t,zi).{w}_{t}=\widetilde{w}_{t}-\widetilde{\eta}_{t}\frac{1}{|\mathcal{S}|}\sum_{z_{i}\in\mathcal{S}}\nabla_{w}\ell(\widetilde{w}_{t},z_{i}). (7)
5 end for
return w~T\widetilde{w}_{T}
Algorithm 2 Stochastic Gradient Langevin Algorithm with Prior Diffusion

This algorithm becomes the streaming SGLD method where in each iteration we take one data point zDz\sim D.

In the analysis of Algorithm 1, we will use pt1p_{t-1} to denote the distribution of wt1w_{t-1}, and p~t\widetilde{p}_{t} to denote the distribution of w~t\widetilde{w}_{t}, where the randomness include all random sampling in the algorithm. We also denote μt\mu_{t} and μ~t\widetilde{\mu}_{t} as the measures associated with random variables wtw_{t} and w~t\widetilde{w}_{t}, respectively. When using samples along the Markov chain to estimate expectations over function ϕ()\phi(\cdot), we take a weighted average, so that

ϕ^(p)=1s=1Tηst=1Tηtϕ(w~t),\hat{\phi}(p)=\frac{1}{\sum_{s=1}^{T}\eta_{s}}\sum_{t=1}^{T}\eta_{t}\phi(\widetilde{w}_{t}),

which is equivalent to the expectation with respect to the weighted averaged distribution:

μ¯T=1s=1Tηst=1Tηtμ~t.\bar{\mu}_{T}=\frac{1}{\sum_{s=1}^{T}\eta_{s}}\sum_{t=1}^{T}\eta_{t}\widetilde{\mu}_{t}.

We prove in what follows the convergence of the distribution p~t\widetilde{p}_{t} along the updates of (3) and (4) towards the posterior distribution (2).

5 Langevin Algorithms in Lipschitz Convex Case

For the posterior p(w|𝐳)(β1(f(w)+g(w)))p(w|{\mathbf{z}})\propto\left(-\beta^{-1}(f(w)+g(w))\right), we assume that function ff satisfies the following two conditions common to convex analysis.

Assumptions for the Lipschitz Convex Case:
  1. A1

    Function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}} is convex.

  1. A2L\rm{A2_{L}}

    Function ff is GG-Lipschitz continuous on d{\mathbb{R}}^{d}: f(w)2G\|\nabla f(w)\|_{2}\leq G.

We also assume that function g:dg:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}} is mm-strongly convex. Note that we have assumed that the gradient of function ff exists but have not assumed that function ff is smooth.

5.1 Full Gradient Langevin Algorithm Convergence in Lipschitz Convex Case

Our main result for Full Gradient Langevin Algorithm in the case that ff is Lipshitz can be stated as follows.

Theorem 1.

Assume that function ff satisfies the convex and Lipschitz continuous Assumptions A1 and A2L\rm{A2_{L}}. Further assume that function g(w)g(w) satisfies Assumption 1. Then for p~T\widetilde{p}_{T} following the Langevin Algorithm 1, it satisfies ( for η~t=(1emηt)/m=2/(m(t+2))\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2/\left(m(t+2)\right)):

t=1T1+0.5tT+0.25T(T+1)β1[Q(p~t)Q(p)]5G2βmT.\sum_{t=1}^{T}\frac{1+0.5t}{T+0.25T(T+1)}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{5G^{2}}{\beta mT}.

By the convexity of the KL divergence, this leads to the convergence time of

T=5G2βmϵ,T=\frac{5G^{2}}{\beta m\epsilon},

for the averaged distribution to convergence to ϵ\epsilon accuracy in the KL-divergence β1Q\beta^{-1}Q.

We devote the rest of this section to prove Theorem 1.

Proof of Theorem 1.

We take a composite optimization approach and analyze the convergence of the Langevin algorithm in two steps. First we characterize the decrease of the regularized entropy 𝔼wp[g(w)+H(p)]{\mathbb{E}}_{w\sim p}\left[g(w)+H(p)\right] along the diffusion step (3).

Lemma 1 (For Regularized Entropy).

We generalize Lemma 5 of Durmus et al. (2019) and have for p~t\widetilde{p}_{t} being the density of w~t\widetilde{w}_{t} following equation (3) and pp being another probability density,

2m(1emηt)(𝔼wp~t[g(w)+H(p~t)]𝔼wp[g(w)+H(p)])emηtW22(pt1,p)W22(p~t,p),\frac{2}{m}(1-e^{-m\eta_{t}})\left({\mathbb{E}}_{w\sim\widetilde{p}_{t}}[g(w)+H(\widetilde{p}_{t})]-{\mathbb{E}}_{w\sim p}[g(w)+H(p)]\right)\leq e^{-m\eta_{t}}W_{2}^{2}({p}_{t-1},p)-W_{2}^{2}(\widetilde{p}_{t},p),

where mm is the strong convexity of g(w)g(w).

We then capture the decrease of the cross entropy 𝔼wp[f(w)]{\mathbb{E}}_{w\sim p}\left[f(w)\right] along the gradient descent step (4). This result follows and parallels the standard convergence analysis of gradient descent (see Zinkevich, 2003; Zhang, 2004, for example).

Lemma 2.

Given probability density pp on d{\mathbb{R}}^{d}. Define

f(p)=𝔼wpf(w),f(p)={\mathbb{E}}_{w\sim p}f(w),

then we have for ptp_{t} being the density of wtw_{t} following equation (4):

2η~t[f(p~t)f(p)]W22(p~t,p)W22(pt,p)+η~t2G2.2\widetilde{\eta}_{t}[f(\widetilde{p}_{t})-f(p)]\leq W_{2}^{2}(\widetilde{p}_{t},p)-W_{2}^{2}(p_{t},p)+\widetilde{\eta}_{t}^{2}G^{2}.

We then combine the two steps to prove the overall convergence rate for the Langevin algorithm. It is worth noting that by aligning the diffusion step (3) and the gradient descent step (4) at p~t\widetilde{p}_{t}, we combine 𝔼wp~t[g(w)+H(p~t)]{\mathbb{E}}_{w\sim\widetilde{p}_{t}}[g(w)+H(\widetilde{p}_{t})] with f(p~t)f(\widetilde{p}_{t}) and cancel out W22(p~t,p)W_{2}^{2}(\widetilde{p}_{t},p) perfectly and achieve the same convergence rate as that of stochastic gradient descent in optimization.

Proposition 3.

Set η~t=(1emηt)/m=τ(τ/η~0+mt)1\widetilde{\eta}_{t}=(1-e^{-m\eta_{t}})/m=\tau\cdot(\tau/\widetilde{\eta}_{0}+mt)^{-1} for some τ1\tau\geq 1 and η~0>0\widetilde{\eta}_{0}>0. Then

t=1Tη~t1τ[Q(p~t)Q(p)]η~0τW22(p0,p)+G2t=1Tη~t2τ.\sum_{t=1}^{T}\widetilde{\eta}_{t}^{1-\tau}[Q(\widetilde{p}_{t})-Q(p)]\leq\widetilde{\eta}_{0}^{-\tau}W_{2}^{2}(p_{0},p)+G^{2}\sum_{t=1}^{T}\widetilde{\eta}_{t}^{2-\tau}.

Choosing τ=2\tau=2 and p=pp=p_{*}, we have

t=1T1+0.5tT+0.25T(T+1)β1[Q(p~t)Q(p)]4βmη~02T(T+1)W22(p0,p)+4G2βm(T+1).\sum_{t=1}^{T}\frac{1+0.5t}{T+0.25T(T+1)}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{4}{\beta m\widetilde{\eta}_{0}^{2}T(T+1)}W_{2}^{2}(p_{0},p_{*})+\frac{4G^{2}}{\beta m(T+1)}. (8)

The learning rate schedule of ηt=1/mt\eta_{t}=1/mt (with τ=1\tau=1) was introduced to SGD analysis for strongly convex objectives in Shalev-Shwartz et al. (2011), which leads to a similar rate as that of Proposition 3, but with an extra log(T)\log(T) term than (9). The use of τ>1\tau>1 has been adopted in more recent literature of SGD analysis, as an effort to avoid the log(T)\log(T) term (for example, see Lacoste-Julien et al. (2012)). The resulting bound in the SGD analysis becomes identical to that of Proposition 3, and this rate is optimal for nonsmooth strongly convex optimization Rakhlin et al. (2012). In addition, it is possible to implement for Langevin algorithm a similar scheme using moving averaging, as discussed in Shamir and Zhang (2013).

It can be observed that taking a large step size η~0\widetilde{\eta}_{0} will grant rapid convergence. The largest one can take is to choose η0=+\eta_{0}=+\infty and consequently η~0=1/m\widetilde{\eta}_{0}=1/m, leading to a learning rate schedule of η~t=2/(m(t+2))\widetilde{\eta}_{t}=2/(m\cdot(t+2)). In this case, we are effectively initializing from p~1exp(β1g(w))\widetilde{p}_{1}\propto\exp\left(-\beta^{-1}g(w)\right). Choosing the same p0exp(β1g(w))p_{0}\propto\exp\left(-\beta^{-1}g(w)\right), we can bound the initial error W22(p0,p)W_{2}^{2}(p_{0},p_{*}) via the Talagrand inequality in Proposition 1 and the log-Sobolev inequality Bakry and Emery (1985); Ledoux (2000) for the β1m\beta^{-1}m-strongly log-concave distribution pp_{*}:

W22(p0,p)βmKL(pp0)β22m2𝔼p[logpp02]G22m2,W_{2}^{2}(p_{0},p_{*})\leq\frac{\beta}{m}{\mathrm{KL}}(p_{*}\|p_{0})\leq\frac{\beta^{2}}{2m^{2}}{\mathbb{E}}_{p_{*}}\left[\left\|\nabla\log\frac{p_{*}}{p_{0}}\right\|^{2}\right]\leq\frac{G^{2}}{2m^{2}},

since logpp0(w)=β1f(w)β1G\left\|\nabla\log\frac{p_{*}}{p_{0}}(w)\right\|=\left\|\beta^{-1}\nabla f(w)\right\|\leq\beta^{-1}G. Plugging this bound and η~0=1/m\widetilde{\eta}_{0}=1/m into equation (9), and noting that T1T\geq 1, we arrive at our result that

t=1T1+0.5tT+0.25T(T+1)β1[Q(p~t)Q(p)]5G2βmT.\sum_{t=1}^{T}\frac{1+0.5t}{T+0.25T(T+1)}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{5G^{2}}{\beta mT}. (9)

Proof of Proposition 3.

We can add the inequalities in Lemma 1 and Lemma 2 to obtain:

η~t[Q(p~t)Q(p)]emηtW2(pt1,p)2W2(pt,p)2+η~t2G2.\displaystyle\widetilde{\eta}_{t}[Q(\widetilde{p}_{t})-Q(p)]\leq e^{-m\eta_{t}}W_{2}({p}_{t-1},p)^{2}-W_{2}(p_{t},p)^{2}+\widetilde{\eta}_{t}^{2}G^{2}.

This is equivalent to

η~t1τ[Q(p~t)Q(p)](1mη~t)η~tτW2(pt1,p)2η~tτW2(pt,p)2+η~t2τG2.\widetilde{\eta}_{t}^{1-\tau}[Q(\widetilde{p}_{t})-Q(p)]\leq(1-m\widetilde{\eta}_{t})\widetilde{\eta}_{t}^{-\tau}W_{2}({p}_{t-1},p)^{2}-\widetilde{\eta}_{t}^{-\tau}W_{2}(p_{t},p)^{2}+\widetilde{\eta}_{t}^{2-\tau}G^{2}. (10)

We first show that

(1mη~t)η~tτη~t1τ.(1-m\widetilde{\eta}_{t})\widetilde{\eta}_{t}^{-\tau}\leq\widetilde{\eta}_{t-1}^{-\tau}. (11)

Let s=t+τ/(mη0)1s=t+\tau/(m\eta_{0})\geq 1 for t1t\geq 1, η~t=τ/(ms)\tilde{\eta}_{t}=\tau/(ms) and η~t=τ/(m(s1))\tilde{\eta}_{t}=\tau/(m(s-1)). Therefore (11) is equivalent to

(1τ/s)sτ(s1)τ.(1-\tau/s)s^{\tau}\leq(s-1)^{\tau}.

This inequality follows from the fact that for z=1/s[0,1]z=1/s\in[0,1] and τ1\tau\geq 1: ψ(z)=(1z)τ\psi(z)=(1-z)^{\tau} is convex in zz, and thus (1τz)=ψ(0)+ψ(0)zψ(z)=(1z)τ(1-\tau z)=\psi(0)+\psi^{\prime}(0)z\leq\psi(z)=(1-z)^{\tau}.

By combining (10) and (11), we obtain

η~t1τ[Q(p~t)Q(p)]η~t1τW2(pt1,p)2η~tτW2(pt,p)2+η~t2τG2.\displaystyle\widetilde{\eta}_{t}^{1-\tau}[Q(\widetilde{p}_{t})-Q(p)]\leq\widetilde{\eta}_{t-1}^{-\tau}W_{2}({p}_{t-1},p)^{2}-\widetilde{\eta}_{t}^{-\tau}W_{2}(p_{t},p)^{2}+\widetilde{\eta}_{t}^{2-\tau}G^{2}.

By summing over t=1t=1 to t=Tt=T, we obtain the bound.

5.2 Streaming SGLD Convergence in Lipschitz Convex Case

To analyze the streaming stochastic gradient Langevin algorithm, we assume that function ff decomposes:

f(w)=1ni=1n(w,zi)=𝔼zD[(w,z)],f(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i})={\mathbb{E}}_{z\sim D}[\ell(w,z)],

where DD is the distribution over the data samples. In this case, we modify Assumption A2L\rm{A2_{L}} and assume that the individual log-likelihood satisfies the Lipschitz condition.

Assumptions on individual loss \ell
  1. A2LSG\rm{A2_{L}^{SG}}

    Function \ell is GG_{\ell}-Lipschitz continuous on d{\mathbb{R}}^{d}: (w,z)2G\|\nabla\ell(w,z)\|_{2}\leq G_{\ell}, zΩ\forall z\in\Omega.

In the case that (w,z)\ell(w,z) is Lipschitz, our main result for SGLD is the following counterpart of Theorem 1.

Theorem 2.

Assume that function ff satisfies the convex assumption A1 and the Lipschitz continuous assumption for the individual log-likelihood A2L𝑆𝐺\rm{A2_{L}^{SG}}. Further assume that function g(w)g(w) satisfies Assumption 1. Then for p~T\widetilde{p}_{T} following the streaming SGLD Algorithm 2, it satisfies ( for η~t=(1emηt)/m=2/(m(t+2))\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2/\left(m(t+2)\right)):

t=1T1+0.5tT+0.25T(T+1)β1[Q(p~t)Q(p)]5G2βmT.\sum_{t=1}^{T}\frac{1+0.5t}{T+0.25T(T+1)}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{5G_{\ell}^{2}}{\beta mT}.

leading to the convergence time of

T=5G2βmϵ,T=\frac{5G_{\ell}^{2}}{\beta m\epsilon},

for the averaged distribution to convergence to ϵ\epsilon accuracy in the KL-divergence β1Q\beta^{-1}Q.

We devote the rest of this section to prove Theorem 2.

Proof of Theorem 2.

Same as in the previous section, convergence of the regularized entropy 𝔼wp[g(w)]+H(p){\mathbb{E}}_{w\sim p}\left[g(w)\right]+H(p) along equation (6) follows Lemma 1.

For the convergence of the cross entropy 𝔼wp[f(w)]{\mathbb{E}}_{w\sim p}\left[f(w)\right] along equation (7), the following Lemma follows the standard analysis of SGD.

Lemma 3.

Adopt Assumption A2L𝑆𝐺\rm{A2_{L}^{SG}} that (w,z)\ell(w,z) is GG_{\ell}-Lipschitz for all zΩz\in\Omega. Also adopt Assumption A1 that f(w)=𝔼zD(w,z)f(w)={\mathbb{E}}_{z\sim D}\ell(w,z) is convex. We have for all wdw\in{\mathbb{R}}^{d}:

2η~t𝔼zD[(w~t,z)(w,z)]w~tw22𝔼wt|w~twtw22+η~t2G2.2\widetilde{\eta}_{t}{\mathbb{E}}_{z\sim D}[\ell(\widetilde{w}_{t},z)-\ell(w,z)]\leq\|\widetilde{w}_{t}-w\|_{2}^{2}-{\mathbb{E}}_{w_{t}|\widetilde{w}_{t}}\|w_{t}-w\|_{2}^{2}+\widetilde{\eta}_{t}^{2}G_{\ell}^{2}. (12)

It implies the following bound, which modifies Lemma 2.

Lemma 4.

Given any probability density qq on d{\mathbb{R}}^{d}. Define

(q)=𝔼wq𝔼zD(w,z),\ell(q)={\mathbb{E}}_{w\sim q}{\mathbb{E}}_{z\sim D}\ell(w,z),

then we have

2η~t[(p~t)(p)]W2(p~t,p)2W2(pt,p)2+η~t2G2.2\widetilde{\eta}_{t}[\ell(\widetilde{p}_{t})-\ell(p)]\leq W_{2}(\widetilde{p}_{t},p)^{2}-W_{2}(p_{t},p)^{2}+\widetilde{\eta}_{t}^{2}G_{\ell}^{2}.

Initializing from the prior distribution, we can follow the same proof as in Proposition 3 and obtain a similar convergence rate as in the non-stochastic case.

Proposition 4.

Set η~t=(1emηt)/m=τ(τ/η~0+mt)1\widetilde{\eta}_{t}=(1-e^{-m\eta_{t}})/m=\tau\cdot(\tau/\widetilde{\eta}_{0}+mt)^{-1} for some τ1\tau\geq 1 and η~0>0\widetilde{\eta}_{0}>0. Then

t=1Tη~t1τ[Q(p~t)Q(p)]η0τW2(p0,p)2+G2t=1Tη~t2τ.\sum_{t=1}^{T}\widetilde{\eta}_{t}^{1-\tau}[Q(\widetilde{p}_{t})-Q(p)]\leq\eta_{0}^{-\tau}W_{2}(p_{0},p)^{2}+G_{\ell}^{2}\sum_{t=1}^{T}\widetilde{\eta}_{t}^{2-\tau}.

We can choose τ=2\tau=2, and then for p=pp=p_{*}, we have

t=1T1+0.5tT+0.25T(T+1)β1[Q(p~t)Q(p)]4βmη~02T(T+1)W2(p0,p)2+4G2βm(T+1).\sum_{t=1}^{T}\frac{1+0.5t}{T+0.25T(T+1)}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{4}{\beta m\widetilde{\eta}_{0}^{2}T(T+1)}W_{2}(p_{0},p_{*})^{2}+\frac{4G_{\ell}^{2}}{\beta m(T+1)}. (13)

Following the same steps as in the full gradient case, we arrive at the result. ∎

6 Langevin Algorithms in Smooth Convex Case

For the posterior p(w|𝐳)(β1(f(w)+g(w)))p(w|{\mathbf{z}})\propto\left(-\beta^{-1}(f(w)+g(w))\right), we make the following assumptions on function ff.

Assumptions for the smooth convex case:
  1. A1

    Function f:df:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}} is convex and positive.

  1. A2S\rm{A2_{S}}

    Function ff is LL-Smooth on d{\mathbb{R}}^{d}: f(w)f(w)2Lww2\|\nabla f(w)-\nabla f(w^{\prime})\|_{2}\leq L\|w-w^{\prime}\|_{2}.

We also assume that function g:dg:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}} is mm-strongly convex. Note that this is equivalent to the cases where we simply assume the entire negative log-posterior to be β1m\beta^{-1}m-strongly convex and (β1(L+m))\left(\beta^{-1}(L+m)\right)-Lipschitz smooth: one can separate the negative log-posterior into two parts: β1m2w2\frac{\beta^{-1}m}{2}\left\|w\right\|^{2} and (logp(w|𝐳)β1m2w2)\left(-\log p(w|{\mathbf{z}})-\frac{\beta^{-1}m}{2}\left\|w\right\|^{2}\right), which is convex and β1L\beta^{-1}L-Lipschitz smooth. We therefore directly let g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2} in what follows.

6.1 Full Gradient Langevin Algorithm Convergence in Smooth Convex Case

Our main result for Full Gradient Langevin Algorithm in the case that ff is smooth can be stated as follows. Compared to Theorem 1, the result of Theorem 3 is useful for loss functions such as least squares loss that are smooth but not Lipschitz continuous.

Theorem 3.

Assume that function ff satisfies the convex and Lipschitz smooth Assumptions A1 and A2S\rm{A2_{S}}. Also assume that 2f(w)H\nabla^{2}f(w)\preceq H. Further let function g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2}. Then for p~T\widetilde{p}_{T} following Algorithm 1 and initializing from p0exp(β1g)p_{0}\propto\exp(-\beta^{-1}g), it satisfies ( for η~t=(1emηt)/m=2(8L+mt)1\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2\cdot\left(8L+mt\right)^{-1}):

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]64L2m2T(T+1)(8m2trace(H2)+2U(0))+16T+1(8m2trace(H2)+2U(0)).\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\\ \leq\frac{64L^{2}}{m^{2}T(T+1)}\cdot\left(\frac{8}{m^{2}}{\mathrm{trace}}\left(H^{2}\right)+2U(0)\right)+\frac{16}{T+1}\cdot\left(\frac{8}{m^{2}}{\mathrm{trace}}\left(H^{2}\right)+2U(0)\right).

leading to the convergence time of

T=64max{8trace(H2)m2ϵ,2U(0)ϵ},T=64\cdot\max\left\{\frac{8{\mathrm{trace}}\left(H^{2}\right)}{m^{2}\epsilon},\frac{2U(0)}{\epsilon}\right\},

for the averaged distribution to convergence to ϵ1\epsilon\leq 1 accuracy in the KL-divergence β1Q\beta^{-1}Q.

Ridge Separable Case

Assume that function ff decomposes into the following ridge-separable form:

f(w)=1ni=1nsi(wzi),\displaystyle f(w)=\frac{1}{n}\sum_{i=1}^{n}s_{i}(w^{\top}z_{i}), (14)

We make some assumptions on the activation function sis_{i} and the data points ziz_{i}.

Assumptions in ridge separable case
  1. R1

    i{1,,n}\forall i\in\{1,\dots,n\}, the one dimensional activation function si()s_{i}(\cdot) has a bounded second derivative: |si′′(x)|Ls\left|s_{i}^{\prime\prime}(x)\right|\leq L_{s}, for any xx\in{\mathbb{R}}.

  2. R2

    i{1,,n}\forall i\in\{1,\dots,n\}, data point ziz_{i} has a bounded norm: zi2Rz\left\|z_{i}\right\|^{2}\leq R_{z}.

Assumptions R1 and R2 combines to give a Lipschitz smoothness constant of LsRzL_{s}R_{z} for the individual log-likelihood.

Corollary 1.

We make the convexity Assumption A1 on function ff and let it take the ridge-separable form (14) (also let function g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2}). Further adopt Assumptions R1 and R2 on the activation functions and the data points, respectively. Then the convergence time of Algorithm 1 initializing from p0exp(β1g)p_{0}\propto\exp(-\beta^{-1}g) (with step size η~t=(1emηt)/m=2(8LsRz+mt)1\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2\left(8L_{s}R_{z}+mt\right)^{-1}) is

T=64max{8Ls2Rz2m2ϵ,2U(0)ϵ},T=64\cdot\max\left\{\frac{8L_{s}^{2}R_{z}^{2}}{m^{2}\epsilon},\frac{2U(0)}{\epsilon}\right\},

for the averaged distribution to convergence to ϵ\epsilon accuracy in the KL-divergence β1Q\beta^{-1}Q.

Proof.

From Assumptions R1 and R2, we know that 2f(w)1nLsZZ=H\nabla^{2}f(w)\preceq\frac{1}{n}L_{s}ZZ^{\top}=H, where

trace(H2)=Ls21n2trace(ZZZZ)=Ls21n2ZZF2=Ls21n2i,j=1n(zizj)2Ls21ni=1nzi4Ls2Rz2.{\mathrm{trace}}\left(H^{2}\right)=L_{s}^{2}\frac{1}{n^{2}}{\mathrm{trace}}\left(Z^{\top}ZZ^{\top}Z\right)=L_{s}^{2}\frac{1}{n^{2}}\left\|Z^{\top}Z\right\|_{F}^{2}=L_{s}^{2}\frac{1}{n^{2}}\sum_{i,j=1}^{n}\left(z_{i}^{\top}z_{j}\right)^{2}\leq L_{s}^{2}\frac{1}{n}\sum_{i=1}^{n}\left\|z_{i}\right\|^{4}\leq L_{s}^{2}R_{z}^{2}.

Plugging this into Theorem 3 leads to the convergence time of

T=64max{8Ls2Rz2m2ϵ,2U(0)ϵ}.T=64\cdot\max\left\{\frac{8L_{s}^{2}R_{z}^{2}}{m^{2}\epsilon},\frac{2U(0)}{\epsilon}\right\}.

We devote the rest of this section to the proof of Theorem 3.

Proof of Theorem 3.

Same as in Section 5.1, convergence of the regularized entropy 𝔼wp[g(w)]+H(p){\mathbb{E}}_{w\sim p}\left[g(w)\right]+H(p) along equation (6) follows Lemma 1.

For the decrease of the cross entropy 𝔼wp[f(w)]{\mathbb{E}}_{w\sim p}\left[f(w)\right] along the gradient descent step (4), we use the following derivation for LL-Lipschitz smooth ff. For ptp_{t} being the density of wtw_{t} following equation (4) and for pp being another probability density,

2η~t[𝔼wp~tf(w)𝔼wpf(w)]\displaystyle 2\widetilde{\eta}_{t}[{\mathbb{E}}_{w\sim\widetilde{p}_{t}}f(w)-{\mathbb{E}}_{w^{\prime}\sim p}f(w^{\prime})]
\displaystyle\leq [W2(p~t,p)2W2(pt,p)2]+ηt2𝔼wp~tf(w)22\displaystyle[W_{2}(\widetilde{p}_{t},p)^{2}-W_{2}({p}_{t},p)^{2}]+\eta_{t}^{2}{\mathbb{E}}_{w\sim\widetilde{p}_{t}}\|\nabla f(w)\|_{2}^{2}
\displaystyle\leq [W2(p~t,p)2W2(pt,p)2]+2η~t2𝔼(w,w)γtf(w)f(w)22+2η~t2𝔼wpf(w)22,\displaystyle[W_{2}(\widetilde{p}_{t},p)^{2}-W_{2}({p}_{t},p)^{2}]+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{(w,w^{\prime})\sim\gamma_{t}}\|\nabla f(w)-\nabla f(w^{\prime})\|_{2}^{2}+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{w^{\prime}\sim p}\|\nabla f(w^{\prime})\|_{2}^{2},

where γtΓopt(p~t,p)\gamma_{t}\in\Gamma_{\text{opt}}(\widetilde{p}_{t},p) is the optimal coupling between distributions with densities p~t\widetilde{p}_{t} and pp.

With η~t=(1emηt)/m\widetilde{\eta}_{t}=(1-e^{-m\eta_{t}})/m, we have

2η~t[Q(p~t)Q(p)]\displaystyle 2\widetilde{\eta}_{t}[Q(\widetilde{p}_{t})-Q(p)] (1mη~t)W2(p~t1,p)2W2(pt,p)2\displaystyle\leq(1-m\widetilde{\eta}_{t})W_{2}(\widetilde{p}_{t-1},p)^{2}-W_{2}({p}_{t},p)^{2}
+2η~t2𝔼(w,w)γtf(w)f(w)22+2η~t2𝔼wpf(w)22.\displaystyle+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{(w,w^{\prime})\sim\gamma_{t}}\|\nabla f(w)-\nabla f(w^{\prime})\|_{2}^{2}+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{w^{\prime}\sim p}\|\nabla f(w^{\prime})\|_{2}^{2}. (15)

We also have the following lemma.

Lemma 5.

Let γtΓopt(p~t,p)\gamma_{t}\in\Gamma_{\text{opt}}(\widetilde{p}_{t},p) be the optimal coupling of p~t\widetilde{p}_{t} and pp, and let pp the solution of (1). Then we have

𝔼(w,w)γtf(w)f(w)222L[Q(p~t)Q(p)].{\mathbb{E}}_{(w,w^{\prime})\sim\gamma_{t}}\|\nabla f(w)-\nabla f(w^{\prime})\|_{2}^{2}\leq 2L[Q(\widetilde{p}_{t})-Q(p)].

We note that Lemma 5 and equation (15) imply that

2η~t(12Lη~t)[Q(p~t)Q(p)](1mη~t)W2(pt1,p)2W2(pt,p)2+2η~t2𝔼wpf(w)22,\displaystyle 2\widetilde{\eta}_{t}(1-2L\widetilde{\eta}_{t})[Q(\widetilde{p}_{t})-Q(p)]\leq(1-m\widetilde{\eta}_{t})W_{2}(p_{t-1},p)^{2}-W_{2}(p_{t},p)^{2}+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{w\sim p}\|\nabla f(w)\|_{2}^{2}, (16)

where pp satisfies (2).

Next we bound the last term of equation (16) at pp_{*}: 𝔼wpf(w)22{\mathbb{E}}_{w\sim p_{*}}\|\nabla f(w)\|_{2}^{2}.

Lemma 6.

Assume that

2f(w)H,g(w)=m2w22.\nabla^{2}f(w)\preceq H,\quad g(w)=\frac{m}{2}\|w\|_{2}^{2}.

Let

w=argminw[f(w)+g(w)],w_{*}=\arg\min_{w}\left[f(w)+g(w)\right],

and pexp(β1(f(w)+g(w)))p_{*}\propto\exp(-\beta^{-1}(f(w)+g(w))) be the solution of (2). Then

𝔼wpf(w)2216βmtrace(H2)+2m2w22.{\mathbb{E}}_{w\sim p_{*}}\|\nabla f(w)\|_{2}^{2}\leq 16\frac{\beta}{m}{\mathrm{trace}}\left(H^{2}\right)+2m^{2}\|w_{*}\|_{2}^{2}.

With these lemmas, we are ready to prove the convergence time of the Langevin algorithm 1.

We note that the shrinking step size scheduling of η~t=τ(ττ~0+mt)1\widetilde{\eta}_{t}=\tau\cdot\left(\frac{\tau}{\widetilde{\tau}_{0}}+mt\right)^{-1} satisfies:

η~tτη~t1τ(1mη~t).\frac{\widetilde{\eta}_{t}^{\tau}}{\widetilde{\eta}_{t-1}^{\tau}}\leq\left(1-m\widetilde{\eta}_{t}\right).

Using this inequality and combining Lemma 6 and equation (16) at p=pp=p_{*}, we obtain that

2η~t1τ(12Lη~t)[Q(p~t)Q(p)]\displaystyle 2\widetilde{\eta}_{t}^{1-\tau}\left(1-2L\widetilde{\eta}_{t}\right)[Q(\widetilde{p}_{t})-Q(p_{*})]
\displaystyle\leq η~t1τW2(pt1,p)2η~tτW2(pt,p)2+4η~t2τ[8(β/m)trace(H2)+m2w2].\displaystyle\widetilde{\eta}_{t-1}^{-\tau}W_{2}(p_{t-1},p_{*})^{2}-\widetilde{\eta}_{t}^{-\tau}W_{2}(p_{t},p_{*})^{2}+4\widetilde{\eta}_{t}^{2-\tau}\left[8\left(\beta/m\right){\mathrm{trace}}\left(H^{2}\right)+m^{2}\left\|w_{*}\right\|^{2}\right].

Summing over t=1,,Tt=1,\dots,T,

2t=1Tη~t1τ(12Lη~t)[Q(p~t)Q(p)]\displaystyle 2\sum_{t=1}^{T}\widetilde{\eta}_{t}^{1-\tau}\left(1-2L\widetilde{\eta}_{t}\right)[Q(\widetilde{p}_{t})-Q(p_{*})]
\displaystyle\leq η~0τW2(p0,p)2+4[8(β/m)trace(H2)+m2w2]t=1Tη~t2τ.\displaystyle\widetilde{\eta}_{0}^{-\tau}W_{2}(p_{0},p_{*})^{2}+4\left[8\left(\beta/m\right){\mathrm{trace}}\left(H^{2}\right)+m^{2}\left\|w_{*}\right\|^{2}\right]\sum_{t=1}^{T}\widetilde{\eta}_{t}^{2-\tau}.

Denote Δ=4[8(β/m)trace(H2)+m2w2]\Delta=4\left[8\left(\beta/m\right){\mathrm{trace}}\left(H^{2}\right)+m^{2}\left\|w_{*}\right\|^{2}\right] and take η~t=τ(τη~0+mt)1\widetilde{\eta}_{t}=\tau\cdot\left(\frac{\tau}{\widetilde{\eta}_{0}}+mt\right)^{-1}. Then for η~t141L\widetilde{\eta}_{t}\leq\frac{1}{4}\frac{1}{L} and for τ=2\tau=2,

mt=1T(1/(mη~0)+t/2)[Q(p~t)Q(p)]1η~02W2(p0,p)2+ΔT,m\sum_{t=1}^{T}\left(1/\left(m\widetilde{\eta}_{0}\right)+t/2\right)[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{1}{\widetilde{\eta}_{0}^{2}}W_{2}(p_{0},p_{*})^{2}+\Delta\cdot T,

or

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4[Q(p~t)Q(p)]4mη~02T(T+1)W2(p0,p)2+4Δm(T+1).\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\frac{4}{m\widetilde{\eta}_{0}^{2}T(T+1)}W_{2}(p_{0},p_{*})^{2}+\frac{4\Delta}{m(T+1)}. (17)

Inspired by the Lipschitz continuous case, we take p0(w)exp(β1g(w))p_{0}(w)\propto\exp\left(-\beta^{-1}g(w)\right). Then by the Talagrand and log-Sobolev inequalities,

W2(p0,p)2βmKL(pp0)β22m2𝔼p[logpp02]=β22m2𝔼p[β1f(w)2].W_{2}(p_{0},p_{*})^{2}\leq\frac{\beta}{m}\mathrm{KL}\left(p_{*}\|p_{0}\right)\leq\frac{\beta^{2}}{2m^{2}}{\mathbb{E}}_{p_{*}}\left[\left\|\nabla\log\frac{p_{*}}{p_{0}}\right\|^{2}\right]=\frac{\beta^{2}}{2m^{2}}{\mathbb{E}}_{p_{*}}\left[\left\|\beta^{-1}\nabla f(w)\right\|^{2}\right].

Applying Lemma 6 to the above inequality, we obtain that W2(p0,p)21m2(8(β/m)trace(H2)+m2w2)W_{2}(p_{0},p_{*})^{2}\leq\frac{1}{m^{2}}\left(8\left(\beta/m\right){\mathrm{trace}}\left(H^{2}\right)+m^{2}\left\|w_{*}\right\|^{2}\right). Then taking η~0=141L\widetilde{\eta}_{0}=\frac{1}{4}\frac{1}{L}, we obtain that the weighted-averaged KL divergence is upper bounded:

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]64L2m2T(T+1)(8m2trace(H2)+β1mw2)+16T+1(8m2trace(H2)+β1mw2).\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\\ \leq\frac{64L^{2}}{m^{2}T(T+1)}\cdot\left(\frac{8}{m^{2}}{\mathrm{trace}}\left(H^{2}\right)+\beta^{-1}m\left\|w_{*}\right\|^{2}\right)+\frac{16}{T+1}\cdot\left(\frac{8}{m^{2}}{\mathrm{trace}}\left(H^{2}\right)+\beta^{-1}m\left\|w_{*}\right\|^{2}\right). (18)

Since L2trace(H2)L^{2}\leq{\mathrm{trace}}\left(H^{2}\right), ϵ1\forall\epsilon\leq 1, the weighted-averaged KL divergence t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]ϵ\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\epsilon if

T64max{8trace(H2)m2ϵ,β1mw2ϵ}.T\geq 64\cdot\max\left\{\frac{8{\mathrm{trace}}\left(H^{2}\right)}{m^{2}\epsilon},\frac{\beta^{-1}m\left\|w_{*}\right\|^{2}}{\epsilon}\right\}.

Plugging in the bound that mw22f(0)=2βU(0)m\left\|w_{*}\right\|^{2}\leq 2f(0)=2\beta U(0) gives the final result.

6.2 SGLD Convergence in Smooth Convex Case

Similar to the Lipschitz continuous case, we assume that function ff decomposes:

f(w)=1ni=1n(w,zi)=𝔼zD[(w,z)],f(w)=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i})={\mathbb{E}}_{z\sim D}[\ell(w,z)],

where DD is the distribution over the data samples. Making the following assumption, which modifies Assumption A2S\rm{A2_{S}}, that the individual log-likelihood satisfies the Lipschitz smooth condition yields the convergence rate for the SGLD method.

Assumptions on individual loss \ell
  1. A2SSG\rm{A{{2}}_{S}^{SG}}

    Function \ell is LL_{\ell}-Lipschitz smooth on d{\mathbb{R}}^{d}: (y,z)(x,z)+(x,z)(yx)+12L(y,z)(x,z)2\ell(y,z)\geq\ell(x,z)+\nabla\ell(x,z)^{\top}(y-x)+\frac{1}{2L_{\ell}}\left\|\nabla\ell(y,z)-\nabla\ell(x,z)\right\|^{2}, zΩ\forall z\in\Omega.

  2. A3SSG\rm{A{{3}}_{S}^{SG}}

    The stochastic gradient variance at the mode ww_{*} is bounded: 𝔼zD[(w,z)f(w)2]b2{\mathbb{E}}_{z\sim D}\left[\left\|\nabla\ell(w_{*},z)-\nabla f(w_{*})\right\|^{2}\right]\leq b^{2}.

Assumption A2SSG\rm{A{{2}}_{S}^{SG}} ensures that the stochastic estimates of ff are LL_{\ell} Lipschitz smooth.

Under the above assumptions, we obtain in what follows the convergence rate for the SGLD method with minibatch size |𝒮||\mathcal{S}|. This result is the counterpart of its full gradient version in Theorem 3.

Theorem 4.

We make the convexity Assumptions A1 on function ff and the regularity Assumptions A2S𝑆𝐺\rm{A{{2}}_{S}^{SG}} and A3S𝑆𝐺\rm{A{{3}}_{S}^{SG}} on its components \ell. Also assume that 2f(w)H\nabla^{2}f(w)\preceq H. Let function g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2}. Then taking η~t=(1emηt)/m=2(8LsRz+mt)1\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2\cdot\left(8L_{s}R_{z}+mt\right)^{-1}, the convergence time of the SGLD Algorithm 2 initializing from p0exp(β1g)p_{0}\propto\exp(-\beta^{-1}g) is

T=Ω(max{Ltrace(H)m2ϵ,U(0)ϵ,n|𝒮|b2mϵ}),\displaystyle T=\Omega\left(\max\left\{\frac{L_{\ell}{\mathrm{trace}}(H)}{m^{2}\epsilon},\frac{U(0)}{\epsilon},\frac{n}{|\mathcal{S}|}\frac{b^{2}}{m\epsilon}\right\}\right),

to achieve an accuracy of

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]ϵ.\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\epsilon.
Ridge Separable Case

Assume that the individual component \ell take the following form so that function ff becomes ridge-separable:

(w,zi)=si(wzi).\displaystyle\ell(w,z_{i})=s_{i}(w^{\top}z_{i}). (19)

To ensure bounded stochastic gradient variance at the mode of the posterior, we additionally assume that at the mode ww_{*}, the derivatives of the activation functions are bounded.

Assumption in ridge separable case on bounded variance
  1. R3SG\rm{R3^{SG}}

    bs>0\exists b_{s}>0, so that |si(wzi)|bs\left|s_{i}^{\prime}(w_{*}^{\top}z_{i})\right|\leq b_{s}, i{1,,n}\forall i\in\{1,\dots,n\}, where w=argminw[f(w)+g(w)]w_{*}=\arg\min_{w}\left[f(w)+g(w)\right].

Assumption R3SG\rm{R3^{SG}} ensures that the stochastic gradient variance is bounded at the mode. Then we have the following corollary instantiating Theorem 4.

Corollary 2.

We make the convexity Assumption A1 on function ff and let it take the ridge-separable form (14) (also let function g(w)=m2w2g(w)=\frac{m}{2}\left\|w\right\|^{2}). Further adopt Assumptions R1, R2, and R3𝑆𝐺\rm{R3^{SG}}. Then taking η~t=(1emηt)/m=2(8LsRz+mt)1\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m=2\cdot\left(8L_{s}R_{z}+mt\right)^{-1}, the convergence time of Algorithm 2 initializing from p0exp(β1g)p_{0}\propto\exp(-\beta^{-1}g) is

T=Ω(max{Ls2Rz2m2ϵ,U(0)ϵ,n|𝒮|Rzbs2mϵ}),T=\Omega\left(\max\left\{\frac{L_{s}^{2}R_{z}^{2}}{m^{2}\epsilon},\frac{U(0)}{\epsilon},\frac{n}{|\mathcal{S}|}\frac{R_{z}b_{s}^{2}}{m\epsilon}\right\}\right),

to achieve an accuracy of

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]ϵ.\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\epsilon.
Proof of Corollary 2.

Since function \ell takes form (19), we can compute that 2(w,zi)=si(wzi)zizi\nabla^{2}\ell(w,z_{i})=s_{i}(w^{\top}z_{i})z_{i}z_{i}^{\top}. Using Assumptions R1 and R2, the Lipschitz smoothness L=LsRzL_{\ell}=L_{s}R_{z}. Same as in Corollary 1, we know that 2f(w)1nLsZZ=H\nabla^{2}f(w)\preceq\frac{1}{n}L_{s}ZZ^{\top}=H. Therefore,

trace(H)=Ls1ntrace(ZZ)LsRz,{\mathrm{trace}}\left(H\right)=L_{s}\cdot\frac{1}{n}{\mathrm{trace}}\left(ZZ^{\top}\right)\leq L_{s}R_{z},

leading to the fact that Ltrace(H)Ls2Rz2L_{\ell}\cdot{\mathrm{trace}}\left(H\right)\leq L_{s}^{2}R_{z}^{2}.

For the stochastic gradient bound bb at ww_{*}, we apply Assumptions R2 and R3SG\rm{R3^{SG}} to obtain

(w,zi)(w,zj)=si(wzi)zisj(wzj)zj2Rzbs.\left\|\nabla\ell(w_{*},z_{i})-\nabla\ell(w_{*},z_{j})\right\|=\left\|s^{\prime}_{i}(w_{*}^{\top}z_{i})z_{i}-s^{\prime}_{j}(w_{*}^{\top}z_{j})z_{j}\right\|\leq 2\sqrt{R_{z}}b_{s}.

We thus have

(w,z)f(w)=(w,zi)1nj=1n(w,zj)2Rzbs,\left\|\nabla\ell(w_{*},z)-\nabla f(w_{*})\right\|=\Big{\|}\nabla\ell(w,z_{i})-\frac{1}{n}\sum_{j=1}^{n}\nabla\ell(w,z_{j})\Big{\|}\leq 2\sqrt{R_{z}}b_{s},

leading to the fact that

𝔼zD[(w,z)f(w)2]4Rzbs2.{\mathbb{E}}_{z\sim D}\left[\left\|\nabla\ell(w_{*},z)-\nabla f(w_{*})\right\|^{2}\right]\leq 4R_{z}b_{s}^{2}.

Therefore, the stochastic gradient variance bound in Assumption A3SSG\rm{A{{3}}_{S}^{SG}}, b=2Rzbsb=2\sqrt{R_{z}}b_{s}. Plugging these bounds into Theorem 4 proves the corollary. ∎

We devote the rest of this section to the proof of Theorem 4.

Proof of Theorem 4.

We first note that because each (,zi)\ell(\cdot,z_{i}) is LL_{\ell}-Lipschitz smooth, the stochastic estimate of function ff, f~=1|𝒮|zi𝒮(w~t,zi)\widetilde{f}=\frac{1}{|\mathcal{S}|}\sum_{z_{i}\in\mathcal{S}}\ell(\widetilde{w}_{t},z_{i}) is LL_{\ell}-Lipschitz smooth:

1|𝒮|zi𝒮(y,zi)(x,zi)2\displaystyle\left\|\frac{1}{|\mathcal{S}|}\sum_{z_{i}\in\mathcal{S}}\nabla\ell(y,z_{i})-\nabla\ell(x,z_{i})\right\|^{2} 1|𝒮|2(zi𝒮(y,zi)(x,zi))2\displaystyle\leq\frac{1}{|\mathcal{S}|^{2}}\left(\sum_{z_{i}\in\mathcal{S}}\left\|\nabla\ell(y,z_{i})-\nabla\ell(x,z_{i})\right\|\right)^{2}
1|𝒮|zi𝒮(y,zi)(x,zi)2\displaystyle\leq\frac{1}{|\mathcal{S}|}\sum_{z_{i}\in\mathcal{S}}\left\|\nabla\ell(y,z_{i})-\nabla\ell(x,z_{i})\right\|^{2}
2L|𝒮|zi𝒮((y,zi)(x,zi)(x,zi)(yx))\displaystyle\leq\frac{2L_{\ell}}{|\mathcal{S}|}\sum_{z_{i}\in\mathcal{S}}\left(\ell(y,z_{i})-\ell(x,z_{i})-\nabla\ell(x,z_{i})^{\top}(y-x)\right)
=2L(f~(y)f~(x)+f~(x)(yx)).\displaystyle=2L_{\ell}\left(\widetilde{f}(y)-\widetilde{f}(x)+\nabla\widetilde{f}(x)^{\top}(y-x)\right).

We thereby invoke the next lemma.

Lemma 7.

Assume that function ff is convex, and that its stochastic estimate f~\widetilde{f} is LL_{\ell}-Lipschitz smooth. Then

W22(pt,p)\displaystyle W_{2}^{2}(p_{t},p) W22(p~t,p)2η~t(f(p~t)f(p))\displaystyle\leq W_{2}^{2}(\widetilde{p}_{t},p)-2\widetilde{\eta}_{t}\left(f(\widetilde{p}_{t})-f(p)\right)
+η~t2(4L[Q(p~t)Q(p)]+2𝔼(w~t,w)γt[𝔼𝒮|w~tf~(w,𝒮)2]),\displaystyle+\widetilde{\eta}_{t}^{2}\left(4L_{\ell}[Q(\widetilde{p}_{t})-Q(p)]+2{\mathbb{E}}_{(\widetilde{w}_{t},w^{\prime})\sim\gamma_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\widetilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]\right),

where f(q)=𝔼wqf(w)f(q)={\mathbb{E}}_{w\sim q}f(w), and γtΓopt(p~t,p)\gamma_{t}\in\Gamma_{\mathrm{opt}}(\widetilde{p}_{t},p) is the optimal coupling between p~t\widetilde{p}_{t} and pp.

Taking η~t=(1emηt)/m\widetilde{\eta}_{t}=\left(1-e^{-m\eta_{t}}\right)/m and combining Lemma 1 and Lemma 7, we obtain that

2η~t(12η~tL)(Q(p~t)Q(p))emηtW22(pt1,p)W22(pt,p)+2η~t2𝔼(w~t,w)γt[𝔼𝒮|w~tf~(w,𝒮)2].\displaystyle 2\widetilde{\eta}_{t}\left(1-2\widetilde{\eta}_{t}L_{\ell}\right)\left(Q(\widetilde{p}_{t})-Q(p)\right)\leq e^{-m\eta_{t}}W_{2}^{2}(p_{t-1},p)-W_{2}^{2}(p_{t},p)+2\widetilde{\eta}_{t}^{2}{\mathbb{E}}_{(\widetilde{w}_{t},w^{\prime})\sim\gamma_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\widetilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]. (20)

We then adapt Lemma 6 to the stochastic gradient method.

Lemma 8 (Stochastic Gradient Counterpart of Lemma 6).

Assume that

2f(w)H,g(w)=m2w22.\nabla^{2}f(w)\preceq H,\quad g(w)=\frac{m}{2}\|w\|_{2}^{2}.

Let

w=argminw[f(w)+g(w)],w_{*}=\arg\min_{w}\left[f(w)+g(w)\right],

and pp be the solution of (2). Then for LL_{\ell}-Lipschitz smooth function f~\widetilde{f}, at p=pp=p_{*} and consequently γtΓopt(p~t,p)\gamma_{t}\in\Gamma_{\mathrm{opt}}(\widetilde{p}_{t},p_{*}),

𝔼(w~t,w)γt[𝔼𝒮|w~tf~(w,𝒮)2]2βLmtrace(H)+4m2w2+4𝔼𝒮f~(w,𝒮)f(w)2.\displaystyle{\mathbb{E}}_{(\widetilde{w}_{t},w^{\prime})\sim\gamma_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\widetilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]\leq\frac{2\beta L_{\ell}}{m}{\mathrm{trace}}(H)+4m^{2}\left\|w_{*}\right\|^{2}+4{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})-\nabla f(w_{*})\right\|^{2}. (21)

For the last piece of information, we establish the variance of the stochastic gradient at the mode, 𝔼𝒮f~(w,𝒮)f(w)22{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})-\nabla f(w_{*})\right\|_{2}^{2}. For samples ziz_{i} that are i.i.d. draws from the data set and are unbiased estimators of f(w)=1nj=1n(w,zj)\nabla f(w_{*})=\frac{1}{n}\sum_{j=1}^{n}\nabla\ell(w_{*},z_{j}), we have

𝔼𝒮[zi𝒮((w,zi)1nj=1n(w,zj))2]=|𝒮|𝔼zD[(w,z)1nj=1n(w,zj)2]|𝒮|b2,\displaystyle{\mathbb{E}}_{\mathcal{S}}\left[\left\|\sum_{z_{i}\in\mathcal{S}}\left(\nabla\ell(w_{*},z_{i})-\frac{1}{n}\sum_{j=1}^{n}\nabla\ell(w_{*},z_{j})\right)\right\|^{2}\right]=|\mathcal{S}|\cdot{\mathbb{E}}_{z\sim D}\left[\left\|\nabla\ell(w_{*},z)-\frac{1}{n}\sum_{j=1}^{n}\nabla\ell(w_{*},z_{j})\right\|^{2}\right]\leq|\mathcal{S}|\cdot b^{2},

Leading to the bound that

𝔼𝒮f~(w,𝒮)f(w)22=1|𝒮|2𝔼𝒮[zi𝒮((w,zi)1nj=1n(w,zj))2]b2|𝒮|.\displaystyle{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})-\nabla f(w_{*})\right\|_{2}^{2}=\frac{1}{|\mathcal{S}|^{2}}{\mathbb{E}}_{\mathcal{S}}\left[\left\|\sum_{z_{i}\in\mathcal{S}}\left(\nabla\ell(w_{*},z_{i})-\frac{1}{n}\sum_{j=1}^{n}\nabla\ell(w_{*},z_{j})\right)\right\|^{2}\right]\leq\frac{b^{2}}{|\mathcal{S}|}.

Plugging this result and Lemma 8 into equation (6.2) at p=pp=p_{*}, we obtain the final bound that

2η~t(12η~tLsRz)(Q(p~t)Q(p))\displaystyle 2\widetilde{\eta}_{t}\left(1-2\widetilde{\eta}_{t}L_{s}R_{z}\right)\left(Q(\widetilde{p}_{t})-Q(p_{*})\right) (1mη~t)W22(pt1,p)W22(pt,p)\displaystyle\leq\left(1-m\widetilde{\eta}_{t}\right)W_{2}^{2}(p_{t-1},p_{*})-W_{2}^{2}(p_{t},p_{*})
+4η~t2(βLmtrace(H)+2m2w2+2b2|𝒮|).\displaystyle+4\widetilde{\eta}_{t}^{2}\left(\beta\frac{L_{\ell}}{m}{\mathrm{trace}}(H)+2m^{2}\left\|w_{*}\right\|^{2}+2\frac{b^{2}}{|\mathcal{S}|}\right).

This leads to a convergence rate, similar to the full gradient case, of

T=Ω(max(Ltrace(H)m2ϵ,β1mw2ϵ,β1|𝒮|b2mϵ)),\displaystyle T=\Omega\left(\max\left(\frac{L_{\ell}{\mathrm{trace}}(H)}{m^{2}\epsilon},\frac{\beta^{-1}m\left\|w_{*}\right\|^{2}}{\epsilon},\frac{\beta^{-1}}{|\mathcal{S}|}\frac{b^{2}}{m\epsilon}\right)\right),

so that the weighted-averaged KL divergence:

t=1T1/(mη~0)+t/2T/(mη~0)+T(T+1)/4β1[Q(p~t)Q(p)]ϵ.\sum_{t=1}^{T}\frac{1/\left(m\widetilde{\eta}_{0}\right)+t/2}{T/\left(m\widetilde{\eta}_{0}\right)+T(T+1)/4}\beta^{-1}[Q(\widetilde{p}_{t})-Q(p_{*})]\leq\epsilon.

Since β=1/n\beta=1/n, and that mw22f(0)=2βU(0)m\left\|w_{*}\right\|^{2}\leq 2f(0)=2\beta U(0),

T=Ω(max(Ltrace(H)m2ϵ,U(0)ϵ,n|𝒮|b2mϵ)).\displaystyle T=\Omega\left(\max\left(\frac{L_{\ell}{\mathrm{trace}}(H)}{m^{2}\epsilon},\frac{U(0)}{\epsilon},\frac{n}{|\mathcal{S}|}\frac{b^{2}}{m\epsilon}\right)\right).

7 Proofs of the Supporting Lemmas

7.1 Proofs of Lemmas in the Lipschitz Continuous Case

7.1.1 Proof of Lemma 1

Before proving Lemma 1, we first state a result in (Theorem 23.9 of Villani, 2009) that establishes the strong subdifferential of the Wasserstein-2 distance.

Lemma 9.

Assume that μt,μ^t\mu_{t},\hat{\mu}_{t} solve the following continuity equations

μtt+(ξtμt)=0,μ^tt+(ξ^tμ^t)=0.\frac{\partial\mu_{t}}{\partial t}+\nabla\cdot(\xi_{t}\mu_{t})=0,\qquad\frac{\partial\hat{\mu}_{t}}{\partial t}+\nabla\cdot(\hat{\xi}_{t}\hat{\mu}_{t})=0.

Then

12ddtW22(μt,μ^t)=~ψt,ξt𝑑μt~ψ^t,ξ^t𝑑μ^t,\frac{1}{2}\frac{d}{dt}W_{2}^{2}(\mu_{t},\hat{\mu}_{t})=-\int\left\langle\tilde{\nabla}\psi_{t},\xi_{t}\right\rangle d\mu_{t}-\int\left\langle\tilde{\nabla}\hat{\psi}_{t},\hat{\xi}_{t}\right\rangle d\hat{\mu}_{t},

where ψt\psi_{t} and ψ^t\hat{\psi}_{t} are the optimal transport vector fields:

exp(~ψt)#μt=μt^,exp(~ψ^t)#μ^t=μt.\exp(\tilde{\nabla}\psi_{t})_{\#}\mu_{t}=\hat{\mu_{t}},\qquad\exp(\tilde{\nabla}\hat{\psi}_{t})_{\#}\hat{\mu}_{t}=\mu_{t}.

Writing ptp_{t} and p^t\hat{p}_{t} as the density functions of μt\mu_{t} and μ^t\hat{\mu}_{t}, we take ξt=βlogptg\xi_{t}=-\beta\nabla\log p_{t}-\nabla g and ξ^t=0\hat{\xi}_{t}=0 so that μt\mu_{t} follows the Fokker-Planck equation equation associated with process (3) and μ^t=ν\hat{\mu}_{t}=\nu is a constant measure. This leads to the following equation

12ddsW22(μs,ν)=βlogps+g,(~ψ)μsν𝑑μs.\frac{1}{2}\frac{d}{ds}W_{2}^{2}(\mu_{s},\nu)=\int\left\langle\beta\nabla\log p_{s}+\nabla g,(\tilde{\nabla}\psi)_{\mu_{s}}^{\nu}\right\rangle d\mu_{s}.

For μ\mu being the probability measure associated with its density pp, define relative entropy β1F(μ)\beta^{-1}F(\mu), where F(μ)=𝔼wp[g(w)]+H(p)F(\mu)=\mathbb{E}_{w\sim p}\left[g(w)\right]+H(p). We can then use the fact that the relative entropy β1F\beta^{-1}F is β1m\beta^{-1}m-geodesically strongly convex (see Proposition 9.3.2 of Ambrosio et al. (2008)) to prove the following Lemma.

Lemma 10.

For pp being the density of μ\mu,

F(ν)F(μ)m2W22(μ,ν)βlogp+g,(~ψ)μν𝑑μ.F(\nu)-F(\mu)-\frac{m}{2}W_{2}^{2}(\mu,\nu)\geq\int\left\langle\beta\nabla\log p+\nabla g,(\tilde{\nabla}\psi)_{\mu}^{\nu}\right\rangle d\mu.
Proof.

Let μt\mu_{t} be the geodesic between μ\mu and ν\nu. β1m\beta^{-1}m-geodesic strong convexity of β1F\beta^{-1}F states that (see Proposition 9.3.2 of Ambrosio et al. (2008)):

β1F(μt)tβ1F(ν)+(1t)β1F(μ)β1m2t(1t)W22(μ,ν),\beta^{-1}F(\mu_{t})\leq t\beta^{-1}F(\nu)+(1-t)\beta^{-1}F(\mu)-\frac{\beta^{-1}m}{2}t(1-t)W_{2}^{2}(\mu,\nu),

and consequently

F(μt)F(μ)tF(ν)F(μ)m2(1t)W22(μ,ν).\displaystyle\frac{F(\mu_{t})-F(\mu)}{t}\leq F(\nu)-F(\mu)-\frac{m}{2}(1-t)W_{2}^{2}(\mu,\nu).

By the definition of subdifferential (c.f. Villani, 2009, Theorem 23.14) we also have along the diffusion process defined by equation (3):

liminft0F(μt)F(μ)tβlogp+g,(~ψ)μν𝑑μ.\lim\inf_{t\downarrow 0}\frac{F(\mu_{t})-F(\mu)}{t}\geq\int\left\langle\beta\nabla\log p+\nabla g,(\tilde{\nabla}\psi)_{\mu}^{\nu}\right\rangle d\mu.

Taking the limit of t0t\rightarrow 0, we obtain the result. ∎

Proof of Lemma 1.

Combining Lemma 9 and 10, we obtain that

12ddsW22(μs,ν)=βlogps+g,(~ψ)μsν𝑑μsF(ν)F(μs)m2W22(μs,ν).\displaystyle\frac{1}{2}\frac{d}{ds}W_{2}^{2}(\mu_{s},\nu)=\int\left\langle\beta\nabla\log p_{s}+\nabla g,(\tilde{\nabla}\psi)_{\mu_{s}}^{\nu}\right\rangle d\mu_{s}\leq F(\nu)-F(\mu_{s})-\frac{m}{2}W_{2}^{2}(\mu_{s},\nu). (22)

Along the Fokker-Planck equation associated with process (3), ddsF(μs)=𝔼ps[βlogps+g22]0\frac{d}{ds}F(\mu_{s})=-\mathbb{E}_{p_{s}}\left[\left\|\beta\nabla\log p_{s}+\nabla g\right\|_{2}^{2}\right]\leq 0, meaning that F(μs)F(\mu_{s}) is monotonically decreasing. We obtain from equation (22) for s[0,t]s\in[0,t],

12ddsW22(μs,ν)sups[0,t][F(ν)F(μs)]m2W22(μs,ν)=F(ν)F(μt)m2W22(μs,ν).\displaystyle\frac{1}{2}\frac{d}{ds}W_{2}^{2}(\mu_{s},\nu)\leq\sup_{s\in[0,t]}\left[F(\nu)-F(\mu_{s})\right]-\frac{m}{2}W_{2}^{2}(\mu_{s},\nu)=F(\nu)-F(\mu_{t})-\frac{m}{2}W_{2}^{2}(\mu_{s},\nu).

Applying the Gronwall’s inequality, we arrive at the conclusion that

2m(1emΔt)(F(μt)F(ν))emΔtW22(μ0,ν)W22(μt,ν).\displaystyle\frac{2}{m}\left(1-e^{-m\Delta t}\right)\left(F(\mu_{t})-F(\nu)\right)\leq e^{-m\Delta t}W_{2}^{2}(\mu_{0},\nu)-W_{2}^{2}(\mu_{t},\nu).

Taking dμt=p~tdxd\mu_{t}=\tilde{p}_{t}dx, dμ0=pt1dxd\mu_{0}=p_{t-1}dx, dν=pdxd\nu=pdx, and Δt=ηt{\Delta t}=\eta_{t} finishes the proof. ∎

7.1.2 Proof of Lemma 2

Proof of Lemma 2.

We first state a point-wise result along the gradient descent step (4):

2ηt(f(w~t)f(w))w~tw22wtw22+ηt2G2.\displaystyle 2\eta_{t}\left(f(\widetilde{w}_{t})-f(w)\right)\leq\left\|\widetilde{w}_{t}-w\right\|_{2}^{2}-\left\|w_{t}-w\right\|_{2}^{2}+\eta_{t}^{2}G^{2}. (23)

This is because

wtw22\displaystyle\left\|w_{t}-w\right\|_{2}^{2} =w~tηtf(w~t)w22\displaystyle=\left\|\tilde{w}_{t}-\eta_{t}\nabla f(\tilde{w}_{t})-w\right\|_{2}^{2}
=w~tw222ηtf(w~t),w~tw+ηt2f(w~t)22\displaystyle=\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left\langle\nabla f(\tilde{w}_{t}),\tilde{w}_{t}-w\right\rangle+\eta_{t}^{2}\left\|\nabla f(\tilde{w}_{t})\right\|_{2}^{2}
w~tw222ηt(f(w~t)f(w))+ηt2G2,\displaystyle\leq\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left(f(\tilde{w}_{t})-f(w)\right)+\eta_{t}^{2}G^{2},

where the last step follows from the convexity and Lipschitz smoothness of ff.

We then denote the measures corresponding to random variables wtw_{t} and w~t\tilde{w}_{t} to be: wtμtw_{t}\sim\mu_{t} and w~tμ~t\tilde{w}_{t}\sim\tilde{\mu}_{t}. From the definitions, we know that they have densities ptp_{t} and p~t\tilde{p}_{t}.

Denote an optimal coupling between μ~t\tilde{\mu}_{t} and μ\mu (where measure μ\mu has density pp, which is the stationary distribution) to be γΓopt(μ~t,μ)\gamma\in\Gamma_{opt}(\tilde{\mu}_{t},\mu). We then take expectations over γ(w~t,w)\gamma(\tilde{w}_{t},w) on both sides of equation (23):

2ηt(f(p~t)f(p))\displaystyle 2\eta_{t}\left(f(\tilde{p}_{t})-f(p)\right) =2ηt𝔼(w~t,w)γ[f(w~t)f(w)]\displaystyle=2\eta_{t}\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[f(\tilde{w}_{t})-f(w)\right]
𝔼(w~t,w)γ[w~tw22]𝔼(w~t,w)γ[wtw22]+ηt2G2\displaystyle\leq\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\|\tilde{w}_{t}-w\|_{2}^{2}\right]-\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\|w_{t}-w\|_{2}^{2}\right]+\eta_{t}^{2}G^{2}
=W22(p~t,p)𝔼(w~t,w)γ[wtw22]+ηt2G2.\displaystyle=W_{2}^{2}(\tilde{p}_{t},p)-\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\|w_{t}-w\|_{2}^{2}\right]+\eta_{t}^{2}G^{2}.

From the relationship wt=w~tηtf(w~t)w_{t}=\tilde{w}_{t}-\eta_{t}\nabla f(\tilde{w}_{t}), we know that the joint distribution of (wt,w)(w_{t},w) is (idηtf,id)#γ\left(\mathrm{id}-\eta_{t}\nabla f,\,\mathrm{id}\right)_{\#}\gamma. Note that γ~=(idηtf,id)#γ\tilde{\gamma}=\left(\mathrm{id}-\eta_{t}\nabla f,\,\mathrm{id}\right)_{\#}\gamma also defines a coupling, and therefore

𝔼(w~t,w)γ[wtw22]\displaystyle\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\|w_{t}-w\|_{2}^{2}\right] =𝔼(wt,w)γ~[wtw22]\displaystyle=\mathbb{E}_{(w_{t},w)\sim\tilde{\gamma}}\left[\|w_{t}-w\|_{2}^{2}\right]
infγ^Γ(μt,μ)𝔼(wt,w)γ^[wtw22]=W22(pt,p).\displaystyle\geq\inf_{\hat{\gamma}\in\Gamma(\mu_{t},\mu)}\mathbb{E}_{(w_{t},w)\sim\hat{\gamma}}\left[\|w_{t}-w\|_{2}^{2}\right]=W_{2}^{2}(p_{t},p).

Therefore,

2ηt(f(p~t)f(p))W22(p~t,p)W22(pt,p)+ηt2G2.\displaystyle 2\eta_{t}\left(f(\tilde{p}_{t})-f(p)\right)\leq W_{2}^{2}(\tilde{p}_{t},p)-W_{2}^{2}(p_{t},p)+\eta_{t}^{2}G^{2}.

7.1.3 Proofs of Lemma 12 and 4 for the streaming SGLD algorithm 2

Proof of Lemma 12.

By the definitions of wtw_{t} and w~t\tilde{w}_{t},

wtw22\displaystyle\left\|w_{t}-w\right\|_{2}^{2} =w~tηt(w~t,zt)w22\displaystyle=\left\|\tilde{w}_{t}-\eta_{t}\nabla\ell(\tilde{w}_{t},z_{t})-w\right\|_{2}^{2}
=w~tw222ηt(w~t,zt),w~tw+ηt2(w~t,zt)22.\displaystyle=\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left\langle\nabla\ell(\tilde{w}_{t},z_{t}),\tilde{w}_{t}-w\right\rangle+\eta_{t}^{2}\left\|\nabla\ell(\tilde{w}_{t},z_{t})\right\|_{2}^{2}.

We now take expectation with respect to ztz_{t}, conditioned on w~t\tilde{w}_{t}, to obtain

𝔼zt|w~twtw22\displaystyle{\mathbb{E}}_{z_{t}|\tilde{w}_{t}}\left\|w_{t}-w\right\|_{2}^{2} w~tw222ηtf(w~t),w~tw+ηt2G2\displaystyle\leq\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left\langle\nabla f(\tilde{w}_{t}),\tilde{w}_{t}-w\right\rangle+\eta_{t}^{2}G_{\ell}^{2}
w~tw222ηt(f(w~t)f(w))+ηt2G2.\displaystyle\leq\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left(f(\tilde{w}_{t})-f(w)\right)+\eta_{t}^{2}G_{\ell}^{2}.

The last step follows from the convexity of ff. Therefore, the desired bound follows. ∎

Proof of Lemma 4.

We first denote the measures corresponding to random variables wtw_{t} and w~t\tilde{w}_{t} to be: wtμtw_{t}\sim\mu_{t} and w~tμ~t\tilde{w}_{t}\sim\tilde{\mu}_{t}. From the definitions, we know that they have densities ptp_{t} and p~t\tilde{p}_{t}.

Denote an optimal coupling between μ~t\tilde{\mu}_{t} and μ\mu (where measure μ\mu has density pp, which is the stationary distribution) to be γΓopt(μ~t,μ)\gamma\in\Gamma_{opt}(\tilde{\mu}_{t},\mu). We then take expectations over γ(w~t,w)\gamma(\tilde{w}_{t},w) on both sides of Eq. (12), zΩ\forall z\in\Omega:

2ηt𝔼(w~t,w)γ[𝔼z[(w~t,z)(w,z)]]\displaystyle 2\eta_{t}\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\mathbb{E}_{z}\left[\ell(\tilde{w}_{t},z)-\ell(w,z)\right]\right]
𝔼(w~t,w)γ[w~tw22]𝔼(w~t,w)γ[𝔼wt[wtw22|w~t]]+ηt2G2\displaystyle\leq\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[\|\tilde{w}_{t}-w\|_{2}^{2}\right]-\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[{\mathbb{E}}_{w_{t}}\left[\|w_{t}-w\|_{2}^{2}|\widetilde{w}_{t}\right]\right]+\eta_{t}^{2}G_{\ell}^{2}
=W22(p~t,p)𝔼(w~t,w)γ[𝔼wt[wtw22|w~t]]+ηt2G2.\displaystyle=W_{2}^{2}(\tilde{p}_{t},p)-\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[{\mathbb{E}}_{w_{t}}\left[\|w_{t}-w\|_{2}^{2}|\widetilde{w}_{t}\right]\right]+\eta_{t}^{2}G_{\ell}^{2}.

From the relationship wt=w~tηt(w~t,zt)w_{t}=\tilde{w}_{t}-\eta_{t}\nabla\ell(\tilde{w}_{t},z_{t}), we know that conditional on ztz_{t}, the joint distribution of (wt,w)(w_{t},w) is (idηt,id)#γ\left(\mathrm{id}-\eta_{t}\nabla\ell,\,\mathrm{id}\right)_{\#}\gamma. Note that γ~=(idηt,id)#γ\tilde{\gamma}=\left(\mathrm{id}-\eta_{t}\nabla\ell,\,\mathrm{id}\right)_{\#}\gamma also defines a coupling, and therefore

𝔼(w~t,w)γ[𝔼wt[wtw22|w~t]]\displaystyle\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[{\mathbb{E}}_{w_{t}}\left[\|w_{t}-w\|_{2}^{2}|\widetilde{w}_{t}\right]\right]
=𝔼zt[𝔼(w~t,w)γ[𝔼wt[wtw22|w~t,zt]]]\displaystyle=\mathbb{E}_{z_{t}}\left[\mathbb{E}_{(\tilde{w}_{t},w)\sim\gamma}\left[{\mathbb{E}}_{w_{t}}\left[\|w_{t}-w\|_{2}^{2}|\widetilde{w}_{t},z_{t}\right]\right]\right]
=𝔼zt[𝔼(wt,w)γ~[wtw22|zt]]\displaystyle=\mathbb{E}_{z_{t}}\left[\mathbb{E}_{(w_{t},w)\sim\tilde{\gamma}}\left[\|w_{t}-w\|_{2}^{2}|z_{t}\right]\right]
infγ^Γ(μt,μ)𝔼(wt,w)γ^[wtw22]=W22(pt,p).\displaystyle\geq\inf_{\hat{\gamma}\in\Gamma(\mu_{t},\mu)}\mathbb{E}_{(w_{t},w)\sim\hat{\gamma}}\left[\|w_{t}-w\|_{2}^{2}\right]=W_{2}^{2}(p_{t},p).

Plugging this result and the Lipschitz assumption on \ell in, we obtain that

2η~t[(p~t)(p)]W2(p~t,p)2W2(pt,p)2+η~t2G2.2\widetilde{\eta}_{t}[\ell(\widetilde{p}_{t})-\ell(p)]\leq W_{2}(\widetilde{p}_{t},p)^{2}-W_{2}(p_{t},p)^{2}+\widetilde{\eta}_{t}^{2}G_{\ell}^{2}.

7.2 Proofs of Lemmas in the Lipschitz Smooth Case

7.2.1 Proofs of Lemmas 5 and 6 for the full gradient Langevin algorithm 1

Proof of Lemma 5.

By the geodesic convexity of the entropy function H(p)=β𝔼wp[lnp(w)]H(p)=\beta{\mathbb{E}}_{w\sim p}\left[\ln p(w)\right],

H(pt~)H(p)βlnp(w),(Tppid)(w)p(w)𝑑w.H(\widetilde{p_{t}})-H(p)\geq\beta\int\left\langle\nabla\ln p(w^{\prime}),\left(T_{p^{\prime}}^{p}-\mathrm{id}\right)(w^{\prime})\right\rangle p(w^{\prime})\ dw^{\prime}.

where Tpp~tT_{p}^{\widetilde{p}_{t}} is the optimal transport from pp to p~t\widetilde{p}_{t}. Using optimal coupling μtΠ(p~,p)\mu_{t}\in\Pi(\widetilde{p},p),

H(pt~)H(p)β𝔼w,wμt[lnp(w),ww].H(\widetilde{p_{t}})-H(p)\geq\beta{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}\left[\left\langle\nabla\ln p(w^{\prime}),w-w^{\prime}\right\rangle\right].

In addition, convexity of ff and gg implies that

𝔼w,wμt[g(w)g(w)]𝔼w,wμt[g(w),ww]{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}[g(w)-g(w^{\prime})]\geq{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}\left[\left\langle\nabla g(w^{\prime}),w-w^{\prime}\right\rangle\right]

and

𝔼w,wμt[f(w)f(w)]𝔼w,wμt[f(w),ww].{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}[f(w)-f(w^{\prime})]\geq{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}\left[\left\langle\nabla f(w^{\prime}),w-w^{\prime}\right\rangle\right].

Adding the above three inequalities, and note that the following holds point-wise

βlnp(w)+g(w)+f(w)=0,\beta\nabla\ln p(w^{\prime})+\nabla g(w^{\prime})+\nabla f(w^{\prime})=0,

we obtain that

H(pt~)H(p)+𝔼w,wμt[g(w)g(w)]𝔼w,wμt[f(w),ww],H(\widetilde{p_{t}})-H(p)+{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}[g(w)-g(w^{\prime})]\geq-{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}\left[\left\langle\nabla f(w^{\prime}),w-w^{\prime}\right\rangle\right],

and that

Q(pt~)Q(p)𝔼w,wμt[f(w)f(w)]𝔼w,wμt[f(w),ww].\displaystyle Q(\widetilde{p_{t}})-Q(p)\geq{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}[f(w)-f(w^{\prime})]-{\mathbb{E}}_{w,w^{\prime}\sim\mu_{t}}\left[\left\langle\nabla f(w^{\prime}),w-w^{\prime}\right\rangle\right]. (24)

Since the potential function f(w)f(w) is convex and LL-smooth,

f(w)f(w)+f(w)(ww)+12Lf(w)f(w)22.f(w)\geq f(w^{\prime})+\nabla f(w^{\prime})^{\top}(w-w^{\prime})+\frac{1}{2L}\|\nabla f(w^{\prime})-\nabla f(w)\|_{2}^{2}. (25)

Combining equations (24) and (25), we obtain the desired bound. ∎

Proof of Lemma 6.

We have

𝔼wpf(w)22\displaystyle{\mathbb{E}}_{w\sim p}\|\nabla f(w)\|_{2}^{2} 2𝔼wpf(w)f(w)22+2f(w)22\displaystyle\leq 2{\mathbb{E}}_{w\sim p}\|\nabla f(w)-\nabla f(w_{*})\|_{2}^{2}+2\|\nabla f(w_{*})\|_{2}^{2}
2𝔼wp(012f(τw+(1τ)w)𝑑τ)(ww)2+2m2w22\displaystyle\leq 2{\mathbb{E}}_{w\sim p}\left\|\left(\int_{0}^{1}\nabla^{2}f\left(\tau w+(1-\tau)w_{*}\right)d\tau\right)\left(w-w_{*}\right)\right\|^{2}+2m^{2}\|w_{*}\|_{2}^{2}
2𝔼wpH(ww)2+2m2w22.\displaystyle\leq 2{\mathbb{E}}_{w\sim p}\left\|H(w-w_{*})\right\|^{2}+2m^{2}\|w_{*}\|_{2}^{2}.

In what follows we bound 𝔼wpH(ww)2{\mathbb{E}}_{w\sim p}\left\|H(w-w_{*})\right\|^{2}. We first note that the posterior density:

pexp(β1(f(w)+g(w)))p\propto\exp(-\beta^{-1}(f(w)+g(w)))

is mβ\frac{m}{\beta} strongly log-concave. By Hargé (2004), for any convex function hh,

𝔼wp[h(w𝔼wp[w])]𝔼w~𝒩(0,β/mI)[h(w~)].{\mathbb{E}}_{w\sim p}\left[h\left(w-{\mathbb{E}}_{w\sim p}\left[w\right]\right)\right]\leq{\mathbb{E}}_{\tilde{w}\sim\mathcal{N}\left(0,\beta/m\cdot{\mathrm{I}}\right)}\left[h(\tilde{w})\right].

Taking h(w)=Hw2h(w)=\left\|Hw\right\|^{2}, we obtain that

𝔼wpH(w𝔼wp[w])2𝔼w~𝒩(0,β/mI)Hw~2=βmtrace(H2).{\mathbb{E}}_{w\sim p}\left\|H\left(w-{\mathbb{E}}_{w\sim p}\left[w\right]\right)\right\|^{2}\leq{\mathbb{E}}_{\tilde{w}\sim\mathcal{N}\left(0,\beta/m\cdot{\mathrm{I}}\right)}\left\|H\tilde{w}\right\|^{2}=\frac{\beta}{m}{\mathrm{trace}}\left(H^{2}\right).

On the other hand, by the celebrated relation between mean and mode for 1-unimodal distributions (see, e.g., Theorem 77 of Basu and DasGupta, 1996),

(w𝔼wp[w])Σ1(w𝔼wp[w])3,\left(w_{*}-{\mathbb{E}}_{w\sim p}[w]\right)^{\top}\Sigma^{-1}\left(w_{*}-{\mathbb{E}}_{w\sim p}[w]\right)\leq 3,

where Σ\Sigma is the covariance of pp. This results in the following bound

H(w𝔼wp[w])23βmH22.\left\|H\left(w_{*}-{\mathbb{E}}_{w\sim p}[w]\right)\right\|^{2}\leq 3\frac{\beta}{m}\left\|H\right\|_{2}^{2}.

Combining the two bounds, we obtain that

𝔼wpH(ww)2\displaystyle{\mathbb{E}}_{w\sim p}\left\|H(w-w_{*})\right\|^{2} 2𝔼wpH(w𝔼wp[w])2+2H(w𝔼wp[w])2\displaystyle\leq 2{\mathbb{E}}_{w\sim p}\left\|H\left(w-{\mathbb{E}}_{w\sim p}\left[w\right]\right)\right\|^{2}+2\left\|H\left(w_{*}-{\mathbb{E}}_{w\sim p}[w]\right)\right\|^{2}
2βmtrace(H2)+6βmH228βmtrace(H2).\displaystyle\leq 2\frac{\beta}{m}{\mathrm{trace}}\left(H^{2}\right)+6\frac{\beta}{m}\left\|H\right\|_{2}^{2}\leq 8\frac{\beta}{m}{\mathrm{trace}}\left(H^{2}\right).

7.2.2 Proofs of Lemma 7 and 8 for the SGLD algorithm 2

Proof of Lemma 7.

By the definitions of wtw_{t} and w~t\tilde{w}_{t},

wtw22\displaystyle\left\|w_{t}-w\right\|_{2}^{2} =w~tηtf~(w~t,𝒮)w22\displaystyle=\left\|\tilde{w}_{t}-\eta_{t}\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})-w\right\|_{2}^{2}
=w~tw222ηtf~(w~t,𝒮),w~tw+ηt2f~(w~t,𝒮)22.\displaystyle=\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left\langle\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S}),\tilde{w}_{t}-w\right\rangle+\eta_{t}^{2}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2}.

We now take expectation with respect to 𝒮\mathcal{S}, conditioned on w~t\tilde{w}_{t}, to obtain

𝔼wt|w~twtw22\displaystyle{\mathbb{E}}_{w_{t}|\tilde{w}_{t}}\left\|w_{t}-w\right\|_{2}^{2} =w~tw222ηtf(w~t),w~tw+ηt2𝔼𝒮|w~tf~(w~t,𝒮)22\displaystyle=\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left\langle\nabla f(\tilde{w}_{t}),\tilde{w}_{t}-w\right\rangle+\eta_{t}^{2}{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2}
w~tw222ηt(f(w~t)f(w))+ηt2𝔼𝒮|w~tf~(w~t,𝒮)22.\displaystyle\leq\left\|\tilde{w}_{t}-w\right\|_{2}^{2}-2\eta_{t}\left(f(\tilde{w}_{t})-f(w)\right)+\eta_{t}^{2}{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2}. (26)

We then upper bound 𝔼𝒮|w~tf~(w~t,𝒮)22{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2} by introducing variable ww^{\prime} that is distributed according to pp and couples optimally with the law of w~t\tilde{w}_{t}:

𝔼𝒮|w~tf~(w~t,𝒮)222𝔼𝒮|w~tf~(w~t,𝒮)f~(w,𝒮)2+2𝔼𝒮|w~tf~(w,𝒮)2.\displaystyle{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2}\leq 2{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})-\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}+2{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}. (27)

For function f~\widetilde{f} being LL_{\ell}-Lipschitz smooth,

f~(w~t,𝒮)f~(w,𝒮)+f~(w,𝒮)(ww)+12Lf~(w,𝒮)f~(w~t,𝒮)22.\widetilde{f}(\tilde{w}_{t},\mathcal{S})\geq\widetilde{f}(w^{\prime},\mathcal{S})+\nabla\widetilde{f}(w^{\prime},\mathcal{S})^{\top}(w-w^{\prime})+\frac{1}{2L_{\ell}}\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})-\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\|_{2}^{2}.

Taking expectation over the randomness of minibatch assignment 𝒮\mathcal{S} on both sides leads to the fact that

f(w~t)f(w)+f(w)(ww)+12L𝔼𝒮|w~tf~(w,𝒮)f~(w~t,𝒮)22.f(\tilde{w}_{t})\geq f(w^{\prime})+\nabla f(w^{\prime})^{\top}(w-w^{\prime})+\frac{1}{2L_{\ell}}{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})-\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\|_{2}^{2}.

Combining this equation with equation (24), we adapt Lemma 5 to the stochastic gradient method:

𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)f~(w~t,𝒮)22]2L[Q(p~t)Q(p)].{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})-\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\|_{2}^{2}\right]\leq 2L_{\ell}[Q(\widetilde{p}_{t})-Q(p)].

Applying this result to equation (27) and taking expectation of (w~t,w)μt(\tilde{w}_{t},w^{\prime})\sim\mu_{t} on both sides, we obtain:

𝔼w~tp~t[𝔼𝒮|w~tf~(w~t,𝒮)22]4L[Q(p~t)Q(p)]+2𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)2].{\mathbb{E}}_{\tilde{w}_{t}\sim\widetilde{p}_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(\tilde{w}_{t},\mathcal{S})\right\|_{2}^{2}\right]\leq 4L_{\ell}[Q(\widetilde{p}_{t})-Q(p)]+2{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right].

Therefore,

𝔼(w~t,w)μt[𝔼wt|w~twtw22]\displaystyle{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{w_{t}|\tilde{w}_{t}}\left\|w_{t}-w\right\|_{2}^{2}\right] 𝔼(w~t,w)μt[w~tw22]2ηt(f(p~t)f(p))\displaystyle\leq{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[\left\|\tilde{w}_{t}-w\right\|_{2}^{2}\right]-2\eta_{t}\left(f(\widetilde{p}_{t})-f(p)\right)
+ηt2(4L[Q(p~t)Q(p)]+2𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)2]),\displaystyle+\eta_{t}^{2}\left(4L_{\ell}[Q(\widetilde{p}_{t})-Q(p)]+2{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]\right),

leading to the final result that

W22(pt,p)\displaystyle W_{2}^{2}(p_{t},p) 𝔼(w~t,w)μt[𝔼wt|w~twtw22]\displaystyle\leq{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{w_{t}|\tilde{w}_{t}}\left\|w_{t}-w\right\|_{2}^{2}\right]
W22(p~t,p)2ηt(f(p~t)f(p))\displaystyle\leq W_{2}^{2}(\widetilde{p}_{t},p)-2\eta_{t}\left(f(\widetilde{p}_{t})-f(p)\right)
+ηt2(4L[Q(p~t)Q(p)]+2𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)2]).\displaystyle+\eta_{t}^{2}\left(4L_{\ell}[Q(\widetilde{p}_{t})-Q(p)]+2{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]\right).

Proof of Lemma 8.

We have

𝔼𝒮|w~tf~(w,𝒮)2\displaystyle{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\leq 2𝔼𝒮|w~tf~(w,𝒮)f~(w,𝒮)22+2𝔼𝒮|w~tf~(w,𝒮)22\displaystyle 2{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})-\nabla\widetilde{f}(w_{*},\mathcal{S})\|_{2}^{2}+2{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\|\nabla\widetilde{f}(w_{*},\mathcal{S})\|_{2}^{2}
\displaystyle\leq 4L[f(w)f(w)f(w)(ww)]+2𝔼𝒮|w~tf~(w,𝒮)22.\displaystyle 4L_{\ell}\left[f(w)-f(w_{*})-\nabla f(w_{*})^{\top}(w-w_{*})\right]+2{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\|\nabla\widetilde{f}(w_{*},\mathcal{S})\|_{2}^{2}.

Taking expectation on both sides, we obtain that

𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)2]\displaystyle{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right] 4L𝔼wp[f(w)f(w)f(w)(ww)]\displaystyle\leq 4L_{\ell}{\mathbb{E}}_{w\sim p}\left[f(w)-f(w_{*})-\nabla f(w_{*})^{\top}(w-w_{*})\right]
+2𝔼𝒮f~(w,𝒮)2.\displaystyle+2{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}. (28)

We now upper bound 𝔼wp[f(w)f(w)f(w)(ww)]{\mathbb{E}}_{w\sim p}\left[f(w)-f(w_{*})-\nabla f(w_{*})^{\top}(w-w_{*})\right]. Let p0p_{0} be the normal distribution N(w,(β/m)I)N(w_{*},(\beta/m)I), and define

Δf(w)\displaystyle\Delta f(w) =f(w)f(w)f(w)(ww),\displaystyle=f(w)-f(w_{*})-\nabla f(w_{*})^{\top}(w-w_{*}), (29)
Δg(w)\displaystyle\Delta g(w) =g(w)g(w)g(w)(ww),\displaystyle=g(w)-g(w_{*})-\nabla g(w_{*})^{\top}(w-w_{*}), (30)

then pp can be expressed as

pexp(β1(Δf(w)+Δg(w))),p\propto\exp(-\beta^{-1}(\Delta f(w)+\Delta g(w))),

which is the solution of

p=argminp𝔼wp[Δf(w)+βlnp(w)p0(w)].p=\arg\min_{p}{\mathbb{E}}_{w\sim p}\left[\Delta f(w)+\beta\ln\frac{p(w)}{p_{0}(w)}\right].

Therefore

𝔼wpΔf(w)\displaystyle{\mathbb{E}}_{w\sim p}\Delta f(w)\leq 𝔼wp[Δf(w)+βlnp(w)p0(w)]\displaystyle{\mathbb{E}}_{w\sim p}\left[\Delta f(w)+\beta\ln\frac{p(w)}{p_{0}(w)}\right]
=\displaystyle= βln𝔼wp0exp(β1Δf(w))\displaystyle-\beta\ln{\mathbb{E}}_{w\sim p_{0}}\exp(-\beta^{-1}\Delta f(w))
\displaystyle\leq βln𝔼wp0exp(0.5β1(ww)H(ww))\displaystyle-\beta\ln{\mathbb{E}}_{w\sim p_{0}}\exp(-0.5\beta^{-1}(w-w_{*})^{\top}H(w-w_{*}))
=\displaystyle= 0.5βln|I+H/m|0.5βtrace(H/m).\displaystyle 0.5\beta\ln|I+H/m|\leq 0.5\beta{\mathrm{trace}}(H/m). (31)

We then decompose 𝔼𝒮f~(w,𝒮)2{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}:

𝔼𝒮f~(w,𝒮)22f(w)2+2𝔼𝒮f(w)f~(w,𝒮)2.{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}\leq 2\left\|\nabla f(w_{*})\right\|^{2}+2{\mathbb{E}}_{\mathcal{S}}\left\|\nabla f(w_{*})-\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}.

Since ww_{*} is the minimum of f+gf+g, f(w)=g(w)=mw\nabla f(w_{*})=-\nabla g(w_{*})=-mw_{*}. Hence

𝔼𝒮f~(w,𝒮)22m2w2+2𝔼𝒮f(w)f~(w,𝒮)2.\displaystyle{\mathbb{E}}_{\mathcal{S}}\left\|\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}\leq 2m^{2}\left\|w_{*}\right\|^{2}+2{\mathbb{E}}_{\mathcal{S}}\left\|\nabla f(w_{*})-\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}. (32)

Plugging inequalities (31) and (32) into inequality (28) proves the desired result that

𝔼(w~t,w)μt[𝔼𝒮|w~tf~(w,𝒮)2]2βLmtrace(H)+4m2w2+4𝔼𝒮f(w)f~(w,𝒮)2.\displaystyle{\mathbb{E}}_{(\tilde{w}_{t},w^{\prime})\sim\mu_{t}}\left[{\mathbb{E}}_{\mathcal{S}|\tilde{w}_{t}}\left\|\nabla\widetilde{f}(w^{\prime},\mathcal{S})\right\|^{2}\right]\leq\frac{2\beta L_{\ell}}{m}{\mathrm{trace}}(H)+4m^{2}\left\|w_{*}\right\|^{2}+4{\mathbb{E}}_{\mathcal{S}}\left\|\nabla f(w_{*})-\nabla\widetilde{f}(w_{*},\mathcal{S})\right\|^{2}. (33)

8 Conclusion

This paper investigated the convergence of Langevin algorithms with strongly log-concave posteriors. We assume that the strongly log-concave posterior can be decomposed into two parts, with one part being simple and explicitly integrable with respect to the underlying SDE. This is analogous to the situation of proximal gradient methods in convex optimization. Using a new analysis technique which mimics the corresponding analysis of convex optimization, we obtain convergence results for Langenvin algorithms that are independent of dimension, both for Lipschitz and for a large class of smooth convex problems in machine learning. Our result addresses a long-standing puzzle with respect to the convergence of the Langevin algorithms. We note that the current work focused on the standard Langevin algorithm, and the resulting convergence rate in terms of ϵ\epsilon dependency is inferior to the best known results leveraging underdamped or even higher order Langevin dynamics such as Cheng et al. (2018b); Dalalyan and Riou-Durand (2018); Shen and Lee (2019); Ma et al. (2021); Mou et al. (2021), which corresponds to accelerated methods in optimization. It thus remains open to investigate whether dimension independent bounds can be combined with these accelerated methods to improve ϵ\epsilon dependence as well as condition number dependence.

References

  • Ambrosio et al. (2008) L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2nd edition, 2008.
  • Bakry and Emery (1985) D. Bakry and M. Emery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177–206. 1985.
  • Basu and DasGupta (1996) S. Basu and A. DasGupta. The mean, median, and mode of unimodal distributions: a characterization. Teor. Veroyatnost. i Primenen., 41:336–352, 1996.
  • Bernton (2018) E. Bernton. Langevin Monte Carlo and JKO splitting. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 1777–1798, 2018.
  • Bou-Rabee et al. (2018) N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv:1805.00452, 2018.
  • Chatterji et al. (2018) N. Chatterji, N. Flammarion, Y.-A. Ma, P. Bartlett, and M. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 764–773, 2018.
  • Chatterji et al. (2020) Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin monte carlo without smoothness. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), volume 108, 2020.
  • Cheng and Bartlett (2018) X. Cheng and P. L. Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT), pages 186–211, 2018.
  • Cheng et al. (2018a) X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648, 2018a.
  • Cheng et al. (2018b) X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 300–323, 2018b.
  • Dalalyan (2017) A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. Royal Stat. Soc. B, 79(3):651–676, 2017.
  • Dalalyan and Karagulyan (2017) A. S. Dalalyan and A. G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095, 2017.
  • Dalalyan and Riou-Durand (2018) A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. arXiv:1807.09382, 2018.
  • Durmus and Moulines (2017) A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 06 2017.
  • Durmus and Moulines (2019) A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • Durmus et al. (2019) Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
  • Dwivedi et al. (2018) R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv:1801.02309, 2018.
  • Gelman et al. (2004) J. B. Gelman, A.and Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, New York, 2004.
  • Hargé (2004) G. Hargé. A convex/log-concave correlation inequality for Gaussian measure and an application to abstract Wiener spaces. Probab. Theory Related Fields, 130:415–440, 2004.
  • Jordan et al. (1998) R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal., 29(1):1–17, January 1998.
  • Lacoste-Julien et al. (2012) Simon Lacoste-Julien, Mark Schmidt, and Francis R. Bach. A simpler approach to obtaining an o(1/t) convergence rate for the projected stochastic subgradient method. CoRR, abs/1212.2002, 2012.
  • Ledoux (2000) M. Ledoux. The geometry of Markov diffusion generators. Ann Fac Sci Toulouse Math, 9(6):305–366, 2000.
  • Lee et al. (2018) Y.-T. Lee, Z. Song, and S. S. Vempala. Algorithmic theory of ODEs and sampling from well-conditioned logconcave densities. arXiv:1812.06243, 2018.
  • Li and Lin (2010) Q. Li and N. Lin. The Bayesian elastic net. Bayesian Analysis, 5(1):151–170, 2010.
  • Ma et al. (2019) Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci. U.S.A., 116:20881–20885, 2019.
  • Ma et al. (2021) Y.-A. Ma, N. S. Chatterji, X. Cheng, N. Flammarion, P. L. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? Bernoulli, 27(3):1942–1992, 2021.
  • Mangoubi and Smith (2017) O. Mangoubi and A. Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv:1708.07114, 2017.
  • Mangoubi and Vishnoi (2018) O. Mangoubi and N. K. Vishnoi. Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. arXiv:1802.08898, 2018.
  • Mou et al. (2021) W. Mou, Y.-A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22:1–41, 2021.
  • Mou et al. (2019) Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. arXiv:1910.00551, 2019.
  • Neal (1996) R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer-Verlag, New York, 1996.
  • Otto and Villani (2000) F. Otto and C. Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal., 173(2):361–400, 2000.
  • Rakhlin et al. (2012) Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International conference on machine learning, pages 1571–1578, 2012.
  • Roberts and Stramer (2002) G. O. Roberts and O. Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab., 4:337–357, 2002.
  • Rossky et al. (1978) P. J. Rossky, J. D. Doll, and H. L. Friedman. Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys., 69(10):4628, 1978.
  • Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
  • Shamir and Zhang (2013) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79. PMLR, 2013.
  • Shen and Lee (2019) R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pages 2098–2109, 2019.
  • Sollich (2002) P. Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46:21–52, 2002.
  • Vempala and Wibisono (2019) S. S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Log-Sobolev suffices. arXiv:1903.08568, 2019.
  • Villani (2009) C. Villani. Optimal Transport, Old and New, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009.
  • Wibisono (2018) A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 2093–3027, 2018.
  • Zhang (2004) Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116, 2004.
  • Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.
  • Zou and Gu (2019) Difan Zou and Quanquan Gu. Stochastic gradient hamiltonian monte carlo methods with recursive variance reduction. In Advances in Neural Information Processing Systems (NeurIPS) 32. 2019.