
Preconditioned Stochastic Gradient Langevin Dynamics for
Deep Neural Networks

Chunyuan Li1, Changyou Chen1, David Carlson2 and Lawrence Carin1
1Department of Electrical and Computer Engineering, Duke University
2Department of Statistics and Grossman Center, Columbia University
    chunyuan.li@duke.edu, cchangyou@gmail.com, david.edwin.carlson@gmail.com, lcarin@duke.edu
Abstract

Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.

Introduction

Deep Neural Networks (DNNs) have recently generated significant interest, largely due to their state-of-the-art performance on a wide variety of tasks, such as image classification (?) and language modeling (?). Despite this significant empirical success, it remains a challenge to effectively train DNNs. This is due to two main problems: (i) the function under consideration is often difficult to optimize, making it hard to find a good local minimum; it is believed that this is in large part due to the pathological curvature and highly non-convex nature of the function to be optimized (?). (ii) Standard optimization techniques lead to overfitting, typically addressed through early stopping (?).

A Bayesian approach for learning neural networks incorporates uncertainty into model learning, and can reduce overfitting (?). In fact, it is possible to view the standard dropout technique (?) as a form of Bayesian approximation that incorporates uncertainty (??). Many other recent works (???) advocate incorporation of uncertainty estimates during model training to help improve robustness and performance.

While a Bayesian approach can ameliorate the overfitting issue in these complicated models, exact Bayesian inference in DNNs is generally intractable. Recently, several approaches have been proposed to approximate a Bayesian posterior for DNNs, including a stochastic variational inference (SVI) method “Bayes by Backprop” (BBB) (?) and an online expectation propagation method (OEP) “probabilistic backpropagation” (PBP) (?). These methods assume the posterior is composed of separable Gaussian distributions. While this is a good choice for computational reasons, it can lead to unreasonable approximation errors and underestimation of model uncertainty.

A popular alternative to SVI and OEP is to use Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods to generate posterior samples (????). One of the most common SG-MCMC methods is the Stochastic Gradient Langevin Dynamics (SGLD) algorithm (?). One merit of this approach is that it is highly scalable; it requires only the gradient on a small mini-batch of data, as in the optimization method Stochastic Gradient Descent (SGD). It has been shown that these MCMC approaches converge to the true posterior by using a slowly-decreasing sequence of step sizes (??). Costly Metropolis-Hastings steps are not required.

Unfortunately, DNNs often exhibit pathological curvature and saddle points (?), which render existing SG-MCMC methods inefficient. In the optimization literature, numerous approaches have been proposed to overcome this problem, including methods based on adapting a preconditioning matrix in SGD to the local geometry (???). These approaches estimate second-order information with trivial per-iteration overhead, have improved risk bounds in convex problems compared to SGD, and demonstrate improved empirical performance in DNNs. The idea of using geometry in SG-MCMC has been explored in many contexts (???) and includes second-order approximations. Often, these approaches use the expected Fisher information, adding significant computational overhead. These methods lack the scalability necessary for learning DNNs, as discussed further below.

We combine adaptive preconditioners from optimization with SGLD, to improve SGLD efficacy. To note the distinction from SGLD, we refer to this as the Preconditioned SGLD method (pSGLD). This procedure is simple and adds trivial per-iteration overhead. We first show theoretical properties of this method, including bounds on risk and asymptotic convergence properties. We then establish the improved efficiency of pSGLD via an enhanced bias-variance tradeoff of the estimator on small problems. We further empirically demonstrate its application to several models and large datasets, including deep neural networks. In the DNN experiments, pSGLD outperforms the results based on standard SGLD from (?), both in terms of convergence speed and test-set performance. Further, pSGLD gives state-of-the-art performance on the examples tested.

Related Work

Various regularization schemes have been developed to prevent overfitting in neural networks, such as early stopping, weight decay, dropout (?), and dropconnect (?). Bayesian methods are appealing due to their ability to avoid overfitting by capturing uncertainty during learning (?). MCMC methods work by producing Monte Carlo approximations to the posterior, with asymptotic consistency (?). Traditional MCMC methods use the full dataset, which does not scale to large data problems. A pioneering work in combining stochastic optimization with MCMC was presented in (?), based on Langevin dynamics (?). This method was referred to as Stochastic Gradient Langevin Dynamics (SGLD), and required only the gradient on mini-batches of data. The per-iteration cost of SGLD is nearly identical to that of SGD. Unlike SGD, SGLD can generate samples from the posterior by injecting noise into the dynamics. This encourages the algorithm to explore the full posterior, instead of simply converging to a maximum a posteriori (MAP) solution. Later, SGLD was extended by (?), (?) and (?). Furthermore, higher-order versions of SGLD with momentum have also been proposed, including stochastic gradient Hamiltonian Monte Carlo (SGHMC) (?) and stochastic gradient Nosé-Hoover Thermostats (SGNHT) (?).

It has been shown that incorporating higher-order gradient information helps train neural networks when employing optimization methods (?). However, calculation of higher-order information is often cumbersome in most models of interest. Methods such as quasi-Newton, and those approximating second-order gradient information, have shown promising results (?). An alternative to full quasi-Newton methods is to rescale parameters so that the loss function has similar curvature along all directions. This strategy has shown improved performance in the Adagrad (?), Adadelta (?), Adam (?) and RMSprop (?) algorithms. Recently, RMSprop has been explained as a diagonal preconditioner in (?). While relatively mature in optimization, these techniques have not been developed for sampling methods. In this paper, we show that rescaling the parameter updates according to geometry information can also improve SG-MCMC, in terms of both training speed and predictive accuracy.

Preliminaries

Given data $\mathcal{D}=\{\bm{d}_i\}_{i=1}^{N}$, the posterior of model parameters $\bm{\theta}$ with prior $p(\bm{\theta})$ and likelihood $\prod_{i=1}^{N}p(\bm{d}_i|\bm{\theta})$ is computed as $p(\bm{\theta}|\mathcal{D})\propto p(\bm{\theta})\prod_{i=1}^{N}p(\bm{d}_i|\bm{\theta})$. In the optimization literature, the prior plays the role of a penalty that regularizes parameters, while the likelihood constitutes the loss function to be optimized. The task in optimization is to find the MAP estimate, $\bm{\theta}_{\text{MAP}}=\arg\max_{\bm{\theta}}\log p(\bm{\theta}|\mathcal{D})$. Let $\Delta\bm{\theta}_t$ denote the change in the parameters at time $t$. Stochastic optimization methods such as Stochastic Gradient Descent (SGD) (for maximization this is Stochastic Gradient Ascent; we abuse notation here because SGD is the more common term) update $\bm{\theta}$ using the following rule:

$$\Delta\bm{\theta}_t=\epsilon_t\Big(\nabla_{\bm{\theta}}\log p(\bm{\theta}_t)+\frac{N}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p(\bm{d}_{t_i}|\bm{\theta}_t)\Big) \qquad (1)$$

where $\{\epsilon_t\}$ is a sequence of step sizes, and $\mathcal{D}^t=\{\bm{d}_{t_1},\cdots,\bm{d}_{t_n}\}$ is a subset of $n<N$ data items randomly chosen from $\mathcal{D}$ at iteration $t$. The convergence of SGD has been established (?).
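As a concrete illustration, the update in Eq. (1) amounts to a single ascent step on the stochastic estimate of the log-posterior. The following is a minimal Python sketch, not the authors' code: the callables grad_log_prior and grad_log_lik are assumptions (user-supplied gradient functions, with grad_log_lik returning the mean gradient over the minibatch).

def sgd_step(theta, eps, grad_log_prior, grad_log_lik, minibatch, N):
    # Eq. (1): rescale the mean minibatch gradient by N to estimate
    # the full-data log-likelihood gradient, then take an ascent step.
    g_bar = grad_log_lik(theta, minibatch)  # mean gradient over n items
    return theta + eps * (grad_log_prior(theta) + N * g_bar)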

For DNNs, the gradient is calculated by backpropagation (?). One data item $\bm{d}_i\triangleq(x_i,y_i)$ may consist of input $x_i\in\mathbb{R}^D$ and output $y_i\in\mathcal{Y}$, with $\mathcal{Y}$ being the output space (e.g., a discrete label space in classification). In the testing stage, the Bayesian predictive estimate for input $x$ is given by $p(y|x,\mathcal{D})=\mathbb{E}_{p(\bm{\theta}|\mathcal{D})}[p(y|x,\bm{\theta})]$. The MAP estimate simply approximates this expectation as $p(y|x,\mathcal{D})\approx p(y|x,\bm{\theta}_{\text{MAP}})$, ignoring parameter uncertainty.

Stochastic sampling methods such as SGLD incorporate uncertainty into predictive estimates. SGLD samples $\bm{\theta}$ from the posterior distribution via a Markov chain with steps:

$$\Delta\bm{\theta}_t\sim\mathcal{N}\left(\frac{\epsilon_t}{2}\Big(\nabla_{\bm{\theta}}\log p(\bm{\theta}_t)+\frac{N}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p(\bm{d}_{t_i}|\bm{\theta}_t)\Big),\epsilon_t\mathbf{I}\right) \qquad (2)$$

with $\mathbf{I}$ denoting the identity matrix. SGLD also uses mini-batches to take gradient steps at each iteration. Rates of convergence are proven rigorously in (?). Given a set of samples from the update rule (2), the posterior distribution can be approximated via a Monte Carlo approximation as $p(y|x,\mathcal{D})\approx\frac{1}{T}\sum_{t=1}^{T}p(y|x,\bm{\theta}_t)$, where $T$ is the number of samples.
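Compared with Eq. (1), one SGLD transition halves the drift and injects Gaussian noise with variance $\epsilon_t$. A minimal sketch under the same assumed helper functions as above (an illustration, not the authors' implementation):

import numpy as np

def sgld_step(theta, eps, grad_log_prior, grad_log_lik, minibatch, N, rng=None):
    rng = rng or np.random.default_rng()
    g_bar = grad_log_lik(theta, minibatch)               # mean minibatch gradient
    drift = 0.5 * eps * (grad_log_prior(theta) + N * g_bar)
    noise = np.sqrt(eps) * rng.standard_normal(theta.shape)  # N(0, eps*I) draw
    return theta + drift + noise                         # one step of Eq. (2)

The predictive estimate is then simply the average of $p(y|x,\bm{\theta}_t)$ over the collected samples.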

Both stochastic optimization and stochastic sampling approaches require that the step sizes satisfy the following assumption (the requirement for SGLD can be relaxed; see (??) for more details).

Assumption 1

The step sizes $\{\epsilon_t\}$ are decreasing, i.e., $0<\epsilon_{t+1}<\epsilon_t$, with 1) $\sum_{t=1}^{\infty}\epsilon_t=\infty$; and 2) $\sum_{t=1}^{\infty}\epsilon_t^2<\infty$.

If this assumption is not satisfied in stochastic optimization, there is no guarantee of convergence, because the gradient estimation noise is not eliminated. Likewise, in stochastic sampling, decreasing step sizes are necessary for asymptotic consistency with the true posterior, where the approximation error is dominated by the natural stochasticity of Langevin dynamics (?).

Preconditioned Stochastic Gradient Langevin Dynamics

As noted in the previous section, standard SGLD updates all parameters with the same step size. This can lead to slow mixing when the components of $\bm{\theta}$ have different curvature, which is generally the case in DNNs due to the composition of nonlinear functions at multiple layers. A potential solution is to employ a user-chosen preconditioning matrix $G(\bm{\theta})$ in SGLD (?). The intuition is to consider the family of probability distributions $p(\bm{d}|\bm{\theta})$, parameterized by $\bm{\theta}$, as lying on a Riemannian manifold; one can use the non-Euclidean geometry implied by this manifold to guide the random walk of a sampler. For any probability distribution, the expected Fisher information matrix $\mathcal{I}_{\bm{\theta}}$ defines a natural Riemannian metric tensor. To further scale the method to a general online framework, stochastic gradient Riemannian Langevin dynamics (SGRLD) was suggested in (?). At position $\bm{\theta}_t$, it gives the step (the update form in (?) is more complicated and seemingly different from (3), but the two can be shown to be equivalent),

$$\Delta\bm{\theta}_t\sim\frac{\epsilon_t}{2}\Big[G(\bm{\theta}_t)\Big(\nabla_{\bm{\theta}}\log p(\bm{\theta}_t)+\frac{N}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p(\bm{d}_{t_i}|\bm{\theta}_t)\Big)+\Gamma(\bm{\theta}_t)\Big]+G^{\frac{1}{2}}(\bm{\theta}_t)\,\mathcal{N}(0,\epsilon_t\mathbf{I}) \qquad (3)$$

where $\Gamma_i(\bm{\theta})=\sum_j\frac{\partial G_{i,j}(\bm{\theta})}{\partial\theta_j}$ describes how the preconditioner changes with respect to $\bm{\theta}_t$. This term vanishes in SGLD because the preconditioner of SGLD is the constant $\mathbf{I}$. Both the direction and variance in Eq. (3) depend on the geometry of $G(\bm{\theta}_t)$. The natural gradient in the SGRLD step takes the direction of steepest descent on a manifold. Convergence to the posterior is guaranteed (??) as long as the step sizes satisfy Assumption 1.

Unfortunately, for many models of interest, the expected Fisher information is intractable. However, we note that any positive definite matrix defines a valid Riemannian manifold metric, so we are not restricted to the exact expected Fisher information. Preconditioning aims to construct a local transform such that the rate of curvature is equal in all directions. Following this, we propose to use the same preconditioner as in RMSprop. This preconditioner is updated sequentially using only current gradient information, and estimates only a diagonal matrix. It is given sequentially as,

$$G(\bm{\theta}_{t+1})=\text{diag}\left(\mathbf{1}\oslash\big(\lambda\mathbf{1}+\sqrt{V(\bm{\theta}_{t+1})}\big)\right) \qquad (4)$$
$$V(\bm{\theta}_{t+1})=\alpha V(\bm{\theta}_t)+(1-\alpha)\,\bar{g}(\bm{\theta}_t;\mathcal{D}^t)\odot\bar{g}(\bm{\theta}_t;\mathcal{D}^t), \qquad (5)$$

where, for notational simplicity, $\bar{g}(\bm{\theta}_t;\mathcal{D}^t)=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p(\bm{d}_{t_i}|\bm{\theta}_t)$ is the sample mean of the gradient on mini-batch $\mathcal{D}^t$, and $\alpha\in[0,1]$. The operators $\odot$ and $\oslash$ represent element-wise matrix product and division, respectively.

RMSprop utilizes magnitudes of recent gradients to construct a preconditioner. Flatter landscapes have smaller gradients while curved landscapes have larger gradients. Gradient information is usually only locally consistent. Therefore, two equivalent interpretations for Eq. (3) can be reached intuitively: i) the preconditioner equalizes the gradient so that a constant stepsize is adequate for all dimensions. ii) the stepsizes are adaptive, in that flat directions have larger stepsizes while curved directions have smaller stepsizes.

In DNNs, saddle points are the most prevalent critical points, and they can considerably slow down training (?), mostly because the parameter space tends to be flat in many directions and ill-conditioned in the neighborhood of these saddle points. Standard SGLD escapes saddle points slowly, due to the typical oscillations along directions of high positive curvature. By transforming the landscape to be more equally curved, it is possible for the sampler to move much faster.

In addition, there are two tuning parameters: $\lambda$ controls the extremes of the curvature in the preconditioner (default $\lambda=10^{-5}$), and $\alpha$ balances the weights of historical and current gradients. We use a default value of $\alpha=0.99$ to construct an exponentially decaying sequence. Our Preconditioned SGLD with RMSprop is outlined in Algorithm 1.

Algorithm 1 Preconditioned SGLD with RMSprop
Inputs: $\{\epsilon_t\}_{t=1:T}$, $\lambda$, $\alpha$
Outputs: $\{\bm{\theta}_t\}_{t=1:T}$
Initialize: $V_0\leftarrow\mathbf{0}$, random $\bm{\theta}_1$
for $t\leftarrow 1:T$ do
  Sample a minibatch of size $n$, $\mathcal{D}^t=\{\bm{d}_{t_1},\dots,\bm{d}_{t_n}\}$
  Estimate gradient $\bar{g}(\bm{\theta}_t;\mathcal{D}^t)=\frac{1}{n}\sum_{i=1}^{n}\nabla\log p(\bm{d}_{t_i}|\bm{\theta}_t)$
  $V(\bm{\theta}_t)\leftarrow\alpha V(\bm{\theta}_{t-1})+(1-\alpha)\,\bar{g}(\bm{\theta}_t;\mathcal{D}^t)\odot\bar{g}(\bm{\theta}_t;\mathcal{D}^t)$
  $G(\bm{\theta}_t)\leftarrow\text{diag}\left(\mathbf{1}\oslash\big(\lambda\mathbf{1}+\sqrt{V(\bm{\theta}_t)}\big)\right)$
  $\bm{\theta}_{t+1}\leftarrow\bm{\theta}_t+\frac{\epsilon_t}{2}\Big[G(\bm{\theta}_t)\Big(\nabla_{\bm{\theta}}\log p(\bm{\theta}_t)+N\bar{g}(\bm{\theta}_t;\mathcal{D}^t)\Big)+\Gamma(\bm{\theta}_t)\Big]+\mathcal{N}(0,\epsilon_t G(\bm{\theta}_t))$
end for
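A runnable counterpart to Algorithm 1 may help make the update concrete. The Python sketch below is our own illustrative code, not the authors' implementation: it folds Eqs. (4)-(5) into a single step and, following the practical technique discussed in the next section, drops the $\Gamma(\bm{\theta}_t)$ term; grad_log_prior and grad_log_lik are assumed user-supplied gradient functions.

import numpy as np

def psgld_step(theta, V, eps, grad_log_prior, grad_log_lik, minibatch, N,
               alpha=0.99, lam=1e-5, rng=None):
    rng = rng or np.random.default_rng()
    g_bar = grad_log_lik(theta, minibatch)        # mean minibatch gradient
    V = alpha * V + (1 - alpha) * g_bar ** 2      # Eq. (5), elementwise
    G = 1.0 / (lam + np.sqrt(V))                  # diagonal of Eq. (4)
    drift = 0.5 * eps * G * (grad_log_prior(theta) + N * g_bar)
    noise = np.sqrt(eps * G) * rng.standard_normal(theta.shape)  # N(0, eps*G)
    return theta + drift + noise, V               # Gamma term omitted

Because $G$ is diagonal, the whole update costs the same order of work as an RMSprop step, which is the source of the trivial per-iteration overhead claimed above.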

Preconditioned SGLD Algorithms in Practice

This section first analyzes the finite-time convergence properties of pSGLD, then proposes a more efficient variant for practical use. We note that prior work gave similar theoretical results (?), and we extend the theory to consider the use of preconditioners.

Finite-time Error Analysis

For a bounded function $\phi(\bm{\theta})$, we are often interested in its true posterior expectation $\bar{\phi}=\int_{\mathcal{X}}\phi(\bm{\theta})p(\bm{\theta}|\mathcal{D})d\bm{\theta}$; for example, the class distribution of a data point in DNNs. In our SG-MCMC based algorithm, this intractable integral is approximated by a weighted sample average $\hat{\phi}=\frac{1}{S_T}\sum_{t=1}^{T}\epsilon_t\phi(\bm{\theta}_t)$ at time $S_T=\sum_{t=1}^{T}\epsilon_t$, with stepsizes $\{\epsilon_t\}$. These samples are generated from an MCMC algorithm with a numerical integrator (e.g., our pSGLD algorithm) that discretizes the continuous-time Langevin dynamics. The precision of the MCMC approximation to the true posterior average is characterized by the expected difference between $\bar{\phi}$ and $\hat{\phi}$. We analyze the pSGLD algorithm by extending the work of (??) to include adaptive preconditioners. We first show the asymptotic convergence properties of our algorithm in Theorem 1 by means of the mean squared error (MSE); this differs from the optimization literature, where the regret is studied, a notion that is not straightforward in the MCMC framework. To obtain the convergence result, some mild assumptions on the smoothness and boundedness of $\psi$, the solution functional of $\mathcal{L}\psi=\phi(\bm{\theta}_t)-\bar{\phi}$, are needed, where $\mathcal{L}$ is the generator of the stochastic differential equation corresponding to pSGLD. We discuss these conditions and prove the Theorem in Appendix A.

Theorem 1

Define the operator $\Delta V_t=(N\bar{g}(\bm{\theta}_t;\mathcal{D}^t)-g(\bm{\theta}_t;\mathcal{D}^t))^{\top}G(\bm{\theta}_t)\nabla_{\bm{\theta}}$. Under Assumption 1, for a test function $\phi$, the MSE of pSGLD at finite time $S_T$ is bounded, for some $C>0$ independent of $\{\epsilon_t\}$, as:

$$\text{MSE}:\quad\mathbb{E}\left[\left(\hat{\phi}-\bar{\phi}\right)^{2}\right]\leq\mathcal{B}_{\text{mse}}\triangleq C\left(\sum_{t}\frac{\epsilon_{t}^{2}}{S_{T}^{2}}\mathbb{E}\left\|\Delta V_{t}\right\|^{2}+\frac{1}{S_{T}}+\frac{(\sum_{t=1}^{T}\epsilon_{t}^{2})^{2}}{S_{T}^{2}}\right). \qquad (6)$$

MSE is a common measure of the quality of an estimator, reflecting the precision of an approximate algorithm. Theorem 1 shows that the finite-time approximation error of pSGLD is bounded by $\mathcal{B}_{\text{mse}}$, consisting of two factors: (i) estimation error from stochastic gradients, $\sum_t\frac{\epsilon_t^2}{S_T^2}\mathbb{E}\|\Delta V_t\|^2$, and (ii) discretization error inherited from the numerical integrator, $\frac{1}{S_T}+\frac{(\sum_{t=1}^{T}\epsilon_t^2)^2}{S_T^2}$. These terms asymptotically approach $0$ under Assumption 1, meaning that decreasing-step-size pSGLD is asymptotically consistent with the true posterior expectation.

Practical Techniques

Considering the practical issue of limited computation time, we now interpret the above finite-time error using the framework of the risk of an estimator, which provides practical guidance for implementation. Following (?), the predictive risk $R$ of an algorithm is defined as the MSE above, and can be decomposed as $R=\mathbb{E}[(\bar{\phi}-\hat{\phi})^2]=B^2+V$, where $B$ is the bias and $V$ is the variance. Denote by $\bar{\phi}_\eta=\int_{\mathcal{X}}\phi(\bm{\theta})\rho_\eta(\bm{\theta})d\bm{\theta}$ the ergodic average under the invariant measure $\rho_\eta(\bm{\theta})$ of pSGLD. After burn-in, it can be shown that

$$\text{Bias}:\quad B=\bar{\phi}_\eta-\bar{\phi} \qquad (7)$$
$$\text{Variance}:\quad V=\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2]\approx\frac{A(0)}{M_\eta} \qquad (8)$$

where $A(0)$ is the variance of $\phi$ with respect to $\rho_\eta(\bm{\theta})$ (i.e., $\mathbb{E}_{\rho_\eta(\bm{\theta})}[(\phi-\hat{\phi})^2]$), which is a constant (further details are given in Appendix D). $M_\eta$ is the effective sample size (ESS), defined as

$$\text{ESS}:\quad M_\eta=\frac{T}{1+2\sum_{t=1}^{\infty}\frac{A(t)}{A(0)}}=\frac{T}{2\tau} \qquad (9)$$

where $A(t)=\mathbb{E}[(\bar{\phi}_\eta-\phi(\bm{\theta}_0))(\bar{\phi}_\eta-\phi(\bm{\theta}_t))]$ is the autocovariance function, which measures how strongly two samples with time lag $t$ are correlated. The term $\tau=\frac{1}{2}+\sum_{t=1}^{\infty}\frac{A(t)}{A(0)}$ is the integrated autocorrelation time (ACT), which measures the interval between independent samples.
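Since ESS drives the variance term, it is useful in practice to estimate it from a chain of scalar test-function values $\phi(\bm{\theta}_t)$. A minimal sketch follows (our own code; the truncation of the autocovariance sum at the first non-positive lag is a common heuristic and an assumption on our part, not specified in the paper):

import numpy as np

def effective_sample_size(phi_chain):
    # Estimate M_eta = T / (1 + 2 * sum_t A(t)/A(0)) as in Eq. (9).
    phi = np.asarray(phi_chain, dtype=float)
    T = len(phi)
    c = phi - phi.mean()
    A0 = np.mean(c ** 2)                  # A(0): variance of phi under the chain
    acf_sum = 0.0
    for t in range(1, T // 2):
        At = np.mean(c[:-t] * c[t:])      # empirical autocovariance at lag t
        if At <= 0:                       # truncate once correlations die out
            break
        acf_sum += At / A0
    return T / (1.0 + 2.0 * acf_sum)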

In practice, there is always a tradeoff between bias and variance. With infinite computation time, the traditional MCMC setting can reduce both bias and variance to zero. In practice, however, time is limited; obtaining more effective samples can reduce the total risk (Eq. (6)), even if bias is introduced. In the following, we provide two model-independent practical techniques to further speed up the proposed pSGLD.

Excluding the $\Gamma(\bm{\theta}_t)$ term

Though the evaluation of $\Gamma(\bm{\theta}_t)$ is manageable in our case due to its diagonal nature, we propose to remove it during sampling to reduce computation. Interestingly, in our case ignoring $\Gamma(\bm{\theta}_t)$ produces a bias on the MSE that is controlled by $\alpha$.

Corollary 2

Assume the 1st-order and 2nd-order gradients are bounded. Under the same assumptions as Theorem 1, the MSE when ignoring the $\Gamma(\bm{\theta}_t)$ term in the algorithm can be bounded as $\mathbb{E}\left[\left(\hat{\phi}-\bar{\phi}\right)^2\right]\leq\mathcal{B}_{\text{mse}}+O\left(\frac{(1-\alpha)^2}{\alpha^3}\right)$, where $\mathcal{B}_{\text{mse}}$ is the bound defined in Theorem 1.

Omitting $\Gamma(\bm{\theta}_t)$ introduces an extra term in the bound that is controlled by the parameter $\alpha$; the proof is in Appendix B. Since $\alpha$ is always set to a value very close to $1$, the term $(1-\alpha)^2/\alpha^3\approx 0$, making the effect of $\Gamma(\bm{\theta}_t)$ negligible. In addition, more samples per unit time are generated when $\Gamma(\bm{\theta}_t)$ is ignored, resulting in a smaller variance in the prediction. Note that the term $\Gamma(\bm{\theta}_t)$ is also heuristically ignored in (?), but that scheme only approximates the true posterior in the case of infinite data, which is not required in our algorithm.

Thinning samples

Making predictions using a whole ensemble of models is cumbersome, and may be too computationally expensive to allow deployment to a large number of users, especially when the models are large neural nets. One practical technique is to average models using a thinned version of the samples. Thinning reduces the total number of samples, but the thinned samples have a lower autocorrelation time and can therefore yield a similar ESS. We can also guarantee that the MSE keeps the same form under the thinning scheme; the proof is in Appendix C.

Corollary 3

By thinning samples from our pSGLD algorithm, the MSE keeps the same form as in Theorem 1, and asymptotically approaches $0$.

Experiments

Our experiments focus on the effectiveness of preconditioners in pSGLD, and present results in four parts: a simple simulation, Bayesian Logistic Regression (BLR), and two widely used DNN models, Feedforward Neural Networks (FNN) and Convolutional Neural Networks (CNN).

The proposed algorithm using the discussed practical techniques is denoted pSGLD. The prior on the parameters is set to $p(\bm{\theta})=\mathcal{N}(0,\sigma^2\mathbf{I})$. Unless specifically mentioned otherwise, the default setting for the DNN experiments is as follows: $\sigma^2=1$, minibatch size 100, thinning interval 100, and burn-in 300. We employ a block decay strategy for the stepsize: it decreases by half after every $L$ epochs.
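For concreteness, the block decay schedule amounts to the following one-liner (a sketch; eps0 denotes the initial stepsize and L follows the notation above):

def stepsize(eps0, epoch, L):
    # halve the initial stepsize after every L epochs (block decay)
    return eps0 * 0.5 ** (epoch // L)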

Simulation

We first demonstrate pSGLD on a simple 2D Gaussian example, $\mathcal{N}\!\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}0.16&0\\0&1\end{bmatrix}\right)$. Given posterior samples, the goal is to estimate the covariance matrix. A diagonal covariance matrix is used to show that the algorithm can adjust the stepsize in each dimension.

We first compare SGLD and pSGLD over a large range of stepsizes $\epsilon$; $2\times 10^5$ samples are collected for each algorithm. Reconstruction errors and autocorrelation times are shown in Fig. 1 (a). pSGLD dominates the “vanilla” SGLD in that it consistently shows a lower error and autocorrelation time, particularly at larger stepsizes. When the stepsize is small enough, the sampler does not move much, and the performance of the two algorithms becomes similar. The first $600$ samples of both methods for $\epsilon=0.3$ are shown in Fig. 1 (b). Because the step sizes in pSGLD are adaptive, even if the covariance matrix of a target distribution is mildly rescaled, a new stepsize is unnecessary for pSGLD; in contrast, the stepsize of vanilla SGLD needs to be fine-tuned in order to obtain decent samples. See Appendix E for further details.

Figure 1: Simulation results on a 2D Gaussian. (a) Error and ACT; (b) samples.

Bayesian Logistic Regression

To demonstrate that our pSGLD is applicable to general Bayesian posterior sampling, we present results on BLR. A small Australian dataset (?) is used first, with $N=690$ and dimension $D=14$. We choose a minibatch size of $5$, prior variance $\sigma^2=100$, and $5\times 10^3$ iterations. For both pSGLD and SGLD, we test stepsizes $\epsilon$ ranging from $1\times 10^{-7}$ to $1\times 10^{-4}$, with 50 runs for each algorithm.

Figure 2: BLR on the Australian dataset. (a) Variance; (b) parameter estimation.

Following (?), we report the time per minimum effective sample ($\propto 1/\text{ESS}$) in Fig. 2 (a), which is proportional to the variance. pSGLD generates a much larger ESS compared to SGLD, especially when the stepsize is large. Meanwhile, Fig. 2 (b) shows that pSGLD provides smaller error in estimating the weights, where the “ground truth” is obtained from $10^6$ samples of HMC with Metropolis-Hastings. Therefore, the overall risk is reduced.

We then test BLR on the large-scale Adult dataset, $\mathtt{a9a}$ (?), with $N_{\text{train}}=32561$, $N_{\text{test}}=16281$, and $D=123$. The minibatch size is set to 50, and the prior variance is $\sigma^2=10$. The thinning interval is 50, burn-in is 500, and $T=1.5\times 10^4$ iterations are used, with stepsize $\epsilon=5\times 10^{-2}$ for both pSGLD and SGLD. Test errors are compared in Table 1, and learning curves are shown in Fig. 3. Both SG-MCMC methods outperform the recently proposed doubly stochastic variational Bayes (DSVI) (?) and higher-order variational autoencoder methods (L-BFGS-SGVI, HFSGVI) (?). Furthermore, pSGLD converges in fewer than $4\times 10^3$ iterations, while SGLD needs at least twice as long to reach this accuracy.

Method Test error
pSGLD 14.85%
SGLD 14.85%
DSVI 15.20%
L-BFGS-SGVI 14.91%
HFSGVI 15.16%
Table 1: BLR on 𝚊𝟿𝚊\mathtt{a9a}.
Figure 3: Learning curves.

Feedforward Neural Networks

The first DNN model we study is the Feedforward Neural Network (FNN), or multilayer perceptron (MLP), with rectified linear unit (ReLU) activations. A two-layer model, 784-X-X-10, is employed, where X is the number of hidden units in each layer. 100 epochs are used, with $L=20$. We compare our proposed method, pSGLD, with representative stochastic optimization methods: SGD, RMSprop and RMSspectral (?). After tuning, the optimal stepsizes are $\epsilon=5\times 10^{-4}$ for pSGLD and RMSprop, and $\epsilon=5\times 10^{-1}$ for SGLD and SGD.

We test the algorithms on the standard MNIST dataset, consisting of $28\times 28$ images (thus the 784-dimensional input vector) from $10$ different classes (0 to 9), with $60{,}000$ training and $10{,}000$ test samples. Test classification errors for network (X-X) sizes 400-400, 800-800 and 1200-1200 are shown in Table 2. The results of the stochastic sampling methods are better than those of their corresponding stochastic optimization counterparts, indicating that incorporating weight uncertainty can improve performance. By increasing the prior variance $\sigma^2$ of pSGLD from $1$ to $100$, more uncertainty is introduced into the model from the prior, and higher performance is obtained. Figure 4 (a) displays histograms of the weights at the last training iteration of the 1200-1200 model. A smaller prior variance imposes lower uncertainty by making the weights concentrate around 0, while a larger prior variance leads to a wider range of weight choices and thus higher uncertainty.

Table 2: Classification error of FNN on MNIST. [ {\diamond} ] indicates results taken from (?)
Figure 4: FNN of size 1200-1200 on MNIST. (a) Weights distribution; (b) learning curves.

We also compare to other techniques developed to prevent overfitting (dropout) and to capture weight uncertainty (BBB, with Gaussian and scale-mixture priors). pSGLD provides state-of-the-art test accuracy for FNNs. We further note that pSGLD gives increasing performance with increasing network size, whereas BBB and SGD with dropout do not. This is probably because overfitting is harder to deal with in large neural networks using pure optimization techniques.

Finally, learning curves for the 1200-1200 network configuration are plotted in Fig. 4 (b) (RMSspectral is not shown because it uses larger batch sizes, and is thus difficult to compare on this scale). We empirically find that pSGLD and SGLD take fewer iterations to converge, and the results are more stable than their optimization counterparts. Moreover, pSGLD consistently converges faster, and to a better point, than SGLD. Learning curves for other network sizes are provided in Appendix F. While the ensemble of samples requires more computation than a single FNN at test time, it shows significantly improved performance. Moreover, (?) showed that learning a single FNN that approximates the model-average result gives nearly the same performance. We employ this idea and suggest a fast version, distilled pSGLD; its results for $\sigma^2=1$ show that it can maintain good performance.

Convolutional Neural Networks

Our next DNN is the popular CNN model. We use a standard network configuration with 2 convolutional layers followed by 2 fully-connected layers (?). Both convolutional layers use a $5\times 5$ filter size with 32 and 64 channels, respectively; $2\times 2$ max pooling is used after each convolutional layer. The fully-connected layers have 200-200 hidden nodes with ReLU nonlinearities, 20 epochs are used, and $L$ is set to 10. The stepsizes for pSGLD and RMSprop are set to $\epsilon=\{1,2\}\times 10^{-3}$ via grid search; for SGLD and SGD, $\epsilon=\{1,2\}\times 10^{-1}$. Additional results with CNNs are in Appendix G.
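For reference, the architecture described above can be written out as follows. This is a hypothetical PyTorch sketch: the paper does not specify a framework, and the padding of 2 is our assumption, chosen so the layer arithmetic works out for the $28\times 28$ MNIST inputs.

import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 200), nn.ReLU(),  # 28 -> 14 -> 7 after two poolings
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),                     # 10-class output
)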

The same MNIST dataset is used. A comparison of test errors is shown in Table 3, with the corresponding learning curves in Fig. 5. We emphasize that the purpose of this experiment is to compare methods on the same model architecture, not to achieve overall state-of-the-art results. The CNN trained with traditional SGD gives an error of 0.82%; pSGLD shows a significant improvement, with an error of 0.45%. This result is also comparable with some recent state-of-the-art CNN-based systems, which have much more complex architectures, including stochastic pooling (?), Network in Network (NIN) (?) and Maxout Network (MN) (?).

Method Test error
pSGLD 0.45%
SGLD 0.71%
RMSprop 0.65%
RMSspectral 0.78%
SGD 0.82%
Stochastic Pooling 0.47%
NIN + Dropout 0.47%
MN + Dropout 0.45%
Table 3: Test error.
Figure 5: Learning curves.

Conclusion

A preconditioned SGLD is developed based on the RMSprop algorithm, with controllable finite-time approximation error. We apply the algorithm to DNNs to overcome their notorious problems of overfitting and pathological curvature. Extensive experiments show that pSGLD can adapt to the local geometry, allowing improved effective sampling rates and performance. It provides sample-based uncertainty in DNNs, and achieves state-of-the-art performance on FNN and CNN models. Interesting future directions include exploring applications to latent variable models and recurrent neural networks (?).

Acknowledgements

This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.

References

  • [Ahn, Korattikara, and Welling 2012] Ahn, S.; Korattikara, A.; and Welling, M. 2012. Bayesian posterior sampling via stochastic gradient fisher scoring. In ICML.
  • [Bakhturin 2001] Bakhturin, Y. A. 2001. Campbell–Hausdorff formula. Encyclopedia of Mathematics, Springer.
  • [Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.
  • [Bottou 2004] Bottou, L. 2004. Stochastic learning. Advanced Lectures on Machine Learning 146–168.
  • [Carlson et al. 2015] Carlson, D.; Collins, E.; Hsieh, Y. P.; Carin, L.; and Cevher, V. 2015. Preconditioned spectral descent for deep learning. In NIPS.
  • [Chen, Ding, and Carin 2015] Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS.
  • [Chen, Fox, and Guestrin 2014] Chen, T.; Fox, E. B.; and Guestrin, C. 2014. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
  • [Lee et al. 2015] Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. AISTATS.
  • [Dauphin et al. 2014] Dauphin, Y. N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; and Bengio, Y. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS.
  • [Dauphin, de Vries, and Bengio 2015] Dauphin, Y. N.; de Vries, H.; and Bengio, Y. 2015. Equilibrated adaptive learning rates for non-convex optimization. In NIPS.
  • [Ding et al. 2014] Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.
  • [Duchi, Hazan, and Singer 2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
  • [Fan et al. 2015] Fan, K.; Wang, Z.; Beck, J.; Kwok, J.; and Heller, J. 2015. Fast second-order stochastic backpropagation for variational inference. In NIPS.
  • [Gal and Ghahramani 2015] Gal, Y., and Ghahramani, Z. 2015. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142.
  • [Gamerman and Lopes 2006] Gamerman, D., and Lopes, H. F. 2006. Markov chain Monte Carlo: stochastic simulation for Bayesian inference. CRC Press.
  • [Gan et al. 2015] Gan, Z.; Li, C.; Henao, R.; Carlson, D.; and Carin, L. 2015. Deep temporal sigmoid belief networks for sequence modeling. NIPS.
  • [Girolami and Calderhead 2011] Girolami, M., and Calderhead, B. 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. In JRSS: Series B.
  • [Goodfellow et al. 2013] Goodfellow, I.; Warde-farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout networks. In ICML.
  • [Hernández-Lobato and Adams 2015] Hernández-Lobato, J. M., and Adams, R. P. 2015. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML.
  • [Jarrett et al. 2009] Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the best multi-stage architecture for object recognition? In ICCV.
  • [Kingma and Ba 2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
  • [Kingma, Salimans, and Welling 2015] Kingma, D. P.; Salimans, T.; and Welling, M. 2015. Variational dropout and the local reparameterization trick. NIPS.
  • [Korattikara et al. 2015] Korattikara, A.; Rathod, V.; Murphy, K.; and Welling, M. 2015. Bayesian dark knowledge. NIPS.
  • [Korattikara, Chen, and Welling 2014] Korattikara, A.; Chen, Y.; and Welling, M. 2014. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. ICML.
  • [Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
  • [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
  • [Li et al. 2016] Li, C.; Chen, C.; Fan, K.; and Carin, L. 2016. High-order stochastic gradient thermostats for Bayesian learning of deep models. In AAAI.
  • [Lin, Chen, and Yan 2014] Lin, M.; Chen, Q.; and Yan, S. 2014. Network in network. ICLR.
  • [Lin, Weng, and Keerthi 2008] Lin, C.-J.; Weng, R. C.; and Keerthi, S. S. 2008. Trust region Newton method for logistic regression. JMLR.
  • [MacKay 1992] MacKay, D. J. C. 1992. A practical Bayesian framework for backpropagation networks. Neural Computation.
  • [Mattingly, Stuart, and Tretyakov 2010] Mattingly, J. C.; Stuart, A. M.; and Tretyakov, M. V. 2010. Construction of numerical time-average and stationary measures via Poisson equations. SIAM J. NUMER. ANAL. 48(2):552–577.
  • [Neal 1995] Neal, R. M. 1995. Bayesian learning for neural networks. PhD thesis, University of Toronto.
  • [Neal 2011] Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
  • [Ngiam et al. 2011] Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; Le, Q. V.; and Ng, A. Y. 2011. On optimization methods for deep learning. In ICML.
  • [Patterson and Teh 2013] Patterson, S., and Teh, Y. W. 2013. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS.
  • [Rumelhart, Hinton, and Williams 1986] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. 1986. Learning representations by back-propagating errors. Nature.
  • [Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.
  • [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • [Teh, Thiéry, and Vollmer 2014] Teh, Y. W.; Thiéry, A. H.; and Vollmer, S. J. 2014. Consistency and fluctuations for stochastic gradient Langevin dynamics.
  • [Tieleman and Hinton 2012] Tieleman, T., and Hinton, G. E. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning.
  • [Titsias and Lázaro-Gredilla 2014] Titsias, M., and Lázaro-Gredilla, M. 2014. Doubly stochastic variational Bayes for non-conjugate inference. In ICML.
  • [Wan et al. 2013] Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; and Fergus, R. 2013. Regularization of neural networks using dropconnect. In ICML.
  • [Welling and Teh 2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  • [Zeiler and Fergus 2013] Zeiler, M., and Fergus, R. 2013. Stochastic pooling for regularization of deep convolutional neural networks. ICLR.
  • [Zeiler 2012] Zeiler, M. D. 2012. Adadelta: An adaptive learning rate method. arXiv:1212.5701.

Supplementary Material of
Preconditioned Stochastic Gradient Langevin Dynamics for
Deep Neural Networks

Appendix A. The proof for the main theorem

In (?), the authors provide convergence properties for general SG-MCMC. Here we follow their assumptions and proof techniques, with specific treatment of the 1st-order numerical integrator and of the preconditioned case.

Details on the assumptions

Before the proof, we detail the assumptions needed for Theorem 1. The Stochastic Differential Equation (SDE) associated with pSGLD has an invariant measure $\rho(\bm{\theta})$, and the posterior average is defined as $\bar{\phi}\triangleq\int_{\mathcal{X}}\phi(\bm{\theta})\rho(\bm{\theta})\mathrm{d}\bm{\theta}$ for some test function $\phi(\bm{\theta})$ of interest. Given samples $(\bm{\theta}_t)_{t=1}^{T}$ from pSGLD, we use the sample average $\hat{\phi}$ to approximate $\bar{\phi}$. In the analysis, we define a functional $\psi$ that solves the following Poisson equation:

$$\mathcal{L}\psi(\bm{\theta}_t)=\phi(\bm{\theta}_t)-\bar{\phi}. \qquad (10)$$

The solution functional $\psi(\bm{\theta}_t)$ characterizes the difference between $\phi(\bm{\theta}_t)$ and the posterior average $\bar{\phi}$ for every $\bm{\theta}_t$; it typically possesses a unique solution, which is at least as smooth as $\phi$ under elliptic or hypoelliptic settings (?). Since the domain of $\bm{\theta}_t$ is unbounded, to keep the presentation simple we follow (?) and make certain assumptions on the solution functional $\psi$ of the Poisson equation (10), which are used in the detailed proofs.

The mild assumptions of smoothness and boundedness made in the main paper are detailed as follows.

Assumption 2

$\psi$ and its derivatives up to 3rd order, $\mathcal{D}^k\psi$, are bounded by a function $\mathcal{V}$, i.e., $\|\mathcal{D}^k\psi\|\leq C_k\mathcal{V}^{p_k}$ for $k=(0,1,2,3)$ and some $C_k,p_k>0$. Furthermore, the expectation of $\mathcal{V}$ on $\{\bm{\theta}_t\}$ is bounded, $\sup_t\mathbb{E}\mathcal{V}^p(\bm{\theta}_t)<\infty$, and $\mathcal{V}$ is smooth, such that $\sup_{s\in(0,1)}\mathcal{V}^p\left(s\bm{\theta}+(1-s)Y\right)\leq C\left(\mathcal{V}^p(\bm{\theta})+\mathcal{V}^p(Y)\right)$ for all $\bm{\theta},Y$ and $p\leq\max\{2p_k\}$, for some $C>0$.

Proof of Theorem 1

Based on Assumption 2, we prove the main theorem.

  •  Proof

    First let us denote

$$\tilde{\mathcal{L}}_t=\left(G(\bm{\theta}_t)\Big(\nabla_{\bm{\theta}}\log p(\bm{\theta}_t)+\frac{N}{n}\sum_{i=1}^{n}\nabla_{\bm{\theta}}\log p(\bm{d}_{t_i}|\bm{\theta}_t)\Big)+\Gamma(\bm{\theta}_t)\right)\cdot\nabla_{\bm{\theta}}+\frac{1}{2}G(\bm{\theta})G(\bm{\theta})^{\top}:\nabla_{\bm{\theta}}\nabla_{\bm{\theta}}^{\top}, \qquad (11)$$

    the local generator of our proposed pSGLD with stochastic gradients, where $\bm{a}\cdot\bm{b}\triangleq\bm{a}^{\top}\bm{b}$ is the vector inner product and $\mathbf{A}:\mathbf{B}\triangleq\mbox{tr}\{\mathbf{A}^{\top}\mathbf{B}\}$ is the matrix double dot product. Furthermore, let $\mathcal{L}$ be the true generator of the Langevin dynamics corresponding to pSGLD, i.e., $\tilde{\mathcal{L}}_t$ with the stochastic gradient replaced by the true gradient. As a result, we have the relation:

$$\tilde{\mathcal{L}}_t=\mathcal{L}+\Delta V_t, \qquad (12)$$

    where $\Delta V_t\triangleq(N\bar{g}(\bm{\theta}_t;\mathcal{D}^t)-g(\bm{\theta}_t;\mathcal{D}^t))^{\top}G(\bm{\theta}_t)\nabla_{\bm{\theta}}$, $g(\bm{\theta}_t;\mathcal{D}^t)$ is the full gradient, and $\bar{g}(\bm{\theta}_t;\mathcal{D}^t)$ is the stochastic gradient calculated from the $t$-th minibatch.

    In pSGLD, we use the Euler integrator, which is a first-order integrator. As a result, according to (?), for a test function $\phi$ we can decompose the update as:

$$\mathbb{E}[\psi(\bm{\theta}_t)]=e^{\epsilon_t\tilde{\mathcal{L}}_t}\psi(\bm{\theta}_{t-1})+O(\epsilon_t^2)=\left(\mathbb{I}+\epsilon_t\tilde{\mathcal{L}}_t\right)\psi(\bm{\theta}_{t-1})+O(\epsilon_t^2), \qquad (13)$$

    where $\mathbb{I}$ is the identity map, i.e., $\mathbb{I}f(x)=f(x)$.

    According to the assumptions, there exists a functional $\psi$ that solves the following Poisson equation:

$$\mathcal{L}\psi(\bm{\theta}_t)=\phi(\bm{\theta}_t)-\bar{\phi}, \qquad (14)$$

    where $\bar{\phi}$ is defined in the main text.

    Summing over $t=1,\cdots,T$ in the above equation, taking expectations on both sides, and using the Poisson equation (14) and the relation $\tilde{\mathcal{L}}_t=\mathcal{L}+\Delta V_t$ to expand the first-order term, we obtain

$$\sum_{t=1}^{T}\mathbb{E}\left(\psi(\bm{\theta}_t)\right)=\sum_{t=1}^{T}\psi(\bm{\theta}_{t-1})+\sum_{t=1}^{T}\epsilon_t\mathcal{L}\psi(\bm{\theta}_{t-1})+\sum_{t=1}^{T}\epsilon_t\Delta V_t\psi(\bm{\theta}_{t-1})+C\sum_{t=1}^{T}\epsilon_t^2. \qquad (15)$$

    Dividing both sides by $S_T$, we have

$$\hat{\phi}-\bar{\phi}=\frac{\mathbb{E}\psi(\bm{\theta}_T)-\psi(\bm{\theta}_0)}{S_T}+\frac{1}{S_T}\sum_{t=1}^{T-1}\left(\mathbb{E}\psi(\bm{\theta}_t)-\psi(\bm{\theta}_t)\right)+\sum_{t=1}^{T}\frac{\epsilon_t}{S_T}\Delta V_t\psi(\bm{\theta}_{t-1})+C\frac{\sum_{t=1}^{T}\epsilon_t^2}{S_T}. \qquad (16)$$

    As a result, there exists some positive constant $C$ such that:

$$\left(\hat{\phi}-\bar{\phi}\right)^2\leq C\left(\frac{1}{S_T^2}\underbrace{\left(\psi(\bm{\theta}_0)-\mathbb{E}\psi(\bm{\theta}_T)\right)^2}_{A_1}+\underbrace{\frac{1}{S_T^2}\sum_{t=1}^{T}\left(\mathbb{E}\psi(\bm{\theta}_{t-1})-\psi(\bm{\theta}_{t-1})\right)^2}_{A_2}+\sum_{t=1}^{T}\frac{\epsilon_t^2}{S_T^2}\left\|\Delta V_t\right\|^2+\left(\frac{\sum_{t=1}^{T}\epsilon_t^2}{S_T}\right)^2\right) \qquad (17)$$

    $A_1$ can be bounded by the assumptions, and $A_2$ can easily be shown to be bounded by $O(\sqrt{\epsilon_t})$ due to the Gaussian noise. The resulting terms are of higher order than the other terms, and can thus be ignored in the expression below. After some simplifications, (17) is bounded by:

$$\mathbb{E}\left(\hat{\phi}-\bar{\phi}\right)^2\lesssim\sum_t\frac{\epsilon_t^2}{S_T^2}\mathbb{E}\left\|\Delta V_t\right\|^2+\frac{1}{S_T}+\frac{1}{S_T^2}+\left(\frac{\sum_{t=1}^{T}\epsilon_t^2}{S_T}\right)^2=C\left(\sum_t\frac{\epsilon_t^2}{S_T^2}\mathbb{E}\left\|\Delta V_t\right\|^2+\frac{1}{S_T}+\frac{(\sum_{t=1}^{T}\epsilon_t^2)^2}{S_T^2}\right) \qquad (18)$$

    for some $C>0$. It is easy to show that, under the assumptions, all terms in the above bound approach zero. This completes the proof of the theorem. $\blacksquare$

Appendix B. The proof for Corollary 2

To prove Corollary 2, we first show the following result.

Lemma 4

Assume that the 1st-order and 2nd-order gradients are bounded. Then there exists some constant $M$ such that, for the $k$-th component of $\Gamma(\bm{\theta}_t)$, we have

$$\left|\sum_{t=1}^{T}\Gamma_k(\bm{\theta}_t)\right|\leq MT\frac{(1-\alpha)}{\alpha^{\frac{3}{2}}}. \qquad (19)$$
  •  Proof

Since $\Gamma(\bm{\theta})$ is a diagonal matrix, we focus on one of its elements and thus omit the index $k$ in the following.

First, the iterative form of the exponential moving average can be written as a function of the gradients at all previous timesteps:

$$V(\theta_t)=\alpha V(\theta_{t-1})+(1-\alpha)\bar{g}^2(\theta_t) \qquad (20)$$
$$=(1-\alpha)\sum_{i=1}^{t}\alpha^{t-i}\bar{g}^2(\theta_i) \qquad (21)$$

Based on this, for each component of $\Gamma(\bm{\theta}_t)$, we have

$$\left|\sum_{t=1}^{T}\Gamma(\theta_t)\right|=\left|\sum_{t=1}^{T}(1-\alpha)V^{-\frac{3}{2}}(\theta_t)\bar{g}(\theta_t)\frac{\partial\bar{g}(\theta_t)}{\partial\theta_t}\right| \qquad (22)$$
$$=\left|\sum_{t=1}^{T}(1-\alpha)\frac{\bar{g}(\theta_t)}{\big(\alpha V(\theta_{t-1})+(1-\alpha)\bar{g}^2(\theta_t)\big)^{\frac{3}{2}}}\frac{\partial\bar{g}(\theta_t)}{\partial\theta_t}\right| \qquad (23)$$
$$\ll\left|\sum_{t=1}^{T}\frac{(1-\alpha)}{\alpha^{\frac{3}{2}}V^{\frac{3}{2}}(\theta_{t-1})}\frac{\partial\bar{g}(\theta_t)}{\partial\theta_t}\right| \qquad (24)$$

Under the assumption that the 1st-order and 2nd-order gradients are bounded, we have $\left|V^{-\frac{3}{2}}(\theta_{t-1})\frac{\partial\bar{g}(\theta_t)}{\partial\theta_t}\right|\leq M$, where $M$ is a constant independent of $\{\epsilon_t\}$. Therefore, $\left|\sum_{t=1}^{T}\Gamma(\theta_t)\right|\ll MT(1-\alpha)/\alpha^{\frac{3}{2}}$. $\blacksquare$

Based on Lemma 4, we now proceed to the proof of Corollary 2.

  •  Proof

    By dropping the $\Gamma(\bm{\theta}_t)$ terms, we get a modified version of the local generator corresponding to the SDE of pSGLD, defined as

$$\tilde{\mathcal{L}}_t=\mathcal{L}+\Delta\tilde{V}_t,$$

    where $\Delta\tilde{V}_t=\Delta V_t+\Gamma(\bm{\theta}_t)\cdot\nabla_{\bm{\theta}}$, with $\Delta V_t$ defined in the proof of Theorem 1.

    Following the proof of Theorem 1, we can derive the bound for $(\hat{\phi}-\bar{\phi})^2$, which is (17) with one extra term:

$$\left(\hat{\phi}-\bar{\phi}\right)^2\leq C\left(\frac{1}{S_T^2}\underbrace{\left(\psi(\bm{\theta}_0)-\mathbb{E}\psi(\bm{\theta}_T)\right)^2}_{A_1}+\underbrace{\frac{1}{S_T^2}\sum_{t=1}^{T}\left(\mathbb{E}\psi(\bm{\theta}_{t-1})-\psi(\bm{\theta}_{t-1})\right)^2}_{A_2}+\sum_{t=1}^{T}\frac{\epsilon_t^2}{S_T^2}\left\|\Delta V_t\right\|^2+\underbrace{\left\|\sum_{t=1}^{T}\frac{\epsilon_t}{S_T}\Gamma(\bm{\theta}_t)\right\|^2}_{A_3}+\left(\frac{\sum_{t=1}^{T}\epsilon_t^2}{S_T}\right)^2\right) \qquad (25)$$

    We can further bound $A_3$ above as:

$$A_3\leq\left(\sum_k\left|\sum_{t=1}^{T}\frac{\epsilon_t}{S_T}\Gamma_k(\bm{\theta}_t)\right|\right)^2\leq\left(\sum_k\frac{\epsilon_1}{T\epsilon_T}\left|\sum_{t=1}^{T}\Gamma_k(\bm{\theta}_t)\right|\right)^2\leq O\left(\frac{(1-\alpha)^2}{\alpha^3}\right), \qquad (26)$$

    where the last inequality follows by using the bound from Lemma 4. Taking expectation on both sides, we arrive at the MSE:

$$\mathbb{E}\left(\hat{\phi}-\bar{\phi}\right)^2\leq C\left(\sum_t\frac{\epsilon_t^2}{S_T^2}\mathbb{E}\left\|\Delta V_t\right\|^2+\frac{1}{S_T}+\frac{(\sum_{t=1}^{T}\epsilon_t^2)^2}{S_T^2}+\mathbb{E}\left\|\sum_{t=1}^{T}\frac{\epsilon_t}{S_T}\Gamma(\bm{\theta}_t)\right\|^2\right)\leq\mathcal{B}_{\text{mse}}+O\left(\frac{(1-\alpha)^2}{\alpha^3}\right). \qquad (27)$$

    for some C>0C>0. \blacksquare

Appendix C. The proof for Corollary 3

  •  Proof

    By thinning samples from pSGLD, we obtain a subsequence $\{\bm{\theta}_{t_1},\cdots,\bm{\theta}_{t_m}\}$ of the original samples $\{\bm{\theta}_1,\cdots,\bm{\theta}_n\}$, where $m\leq n$ and $(t_1,\cdots,t_m)$ is a subsequence of $(1,2,\cdots,n)$. Since we use the 1st-order Euler integrator, based on the definition in (?), we have for the original samples:

$$\tilde{P}_l f(\bm{\theta}_l)\triangleq\mathbb{E}f(\bm{\theta}_l)=e^{\epsilon_l\tilde{\mathcal{L}}_l}f(\bm{\theta}_l)+O(\epsilon_l^2), \qquad (28)$$

    where $\tilde{P}_l$ denotes the Kolmogorov operator. Now for the samples between $t_i$ and $t_j$, i.e., $\{\bm{\theta}_{t_i},\cdots,\bm{\theta}_{t_j}\}$, we have

$$\tilde{P}_{t_j}f(\bm{\theta}_i)=\tilde{P}_{t_j}\circ\cdots\circ\tilde{P}_{t_i}f(\bm{\theta}_i), \qquad (29)$$

    where $A\circ B$ denotes the composition of the two operators $A$ and $B$, i.e., $A$ is evaluated on the output of $B$. Substituting (28) into (29) and using the Baker-Campbell-Hausdorff formula (?) for commutators, we have

$$\tilde{P}_{t_j}f(\bm{\theta}_i)=e^{\sum_{l=i}^{j}\epsilon_l\tilde{\mathcal{L}}_l}f(\bm{\theta}_i)+O\Big(\sum_{l=i}^{j}\epsilon_l^2\Big)\leq e^{S_{ij}\tilde{\mathcal{L}}_{ij}}f(\bm{\theta}_i)+O(S_{ij}^2), \qquad (30)$$

    where $S_{ij}\triangleq\sum_{l=i}^{j}\epsilon_l$ and $\tilde{\mathcal{L}}_{ij}\triangleq\sum_{l=i}^{j}\frac{\epsilon_l}{S_{ij}}\tilde{\mathcal{L}}_l$. This means that, after thinning the samples, going from $\bm{\theta}_i$ to $\bm{\theta}_j$ corresponds to a 1st-order local integrator with stepsize $S_{ij}$ and a modified generator $\tilde{\mathcal{L}}_{ij}$ of the corresponding SDE, which has the same form as the original generator $\mathcal{L}$.

    By performing the same derivation with the new generator ~ij\tilde{\mathcal{L}}_{ij}, we obtain the same MSE as in Theorem 1 in the main text. \blacksquare

Appendix D. The proof for the bias-variance tradeoff

Bias-variance decomposition

$$\text{Risk}:\quad R=\mathbb{E}[(\bar{\phi}-\hat{\phi})^2]=B^2+V \qquad (31)$$
  •  Proof
$$R=\mathbb{E}[(\bar{\phi}-\hat{\phi})^2]=\mathbb{E}[(\bar{\phi}-\bar{\phi}_\eta+\bar{\phi}_\eta-\hat{\phi})^2]=\mathbb{E}[(\bar{\phi}-\bar{\phi}_\eta)^2+(\bar{\phi}_\eta-\hat{\phi})^2+2(\bar{\phi}-\bar{\phi}_\eta)(\bar{\phi}_\eta-\hat{\phi})]$$
$$=\mathbb{E}[(\bar{\phi}-\bar{\phi}_\eta)^2]+\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2]+2\mathbb{E}[(\bar{\phi}-\bar{\phi}_\eta)(\bar{\phi}_\eta-\hat{\phi})]=(\bar{\phi}-\bar{\phi}_\eta)^2+\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2]=B^2+V$$

The cross term vanishes because $\bar{\phi}-\bar{\phi}_\eta$ is a constant and, after burn-in, $\mathbb{E}[\hat{\phi}]=\bar{\phi}_\eta$.

where

$$\text{Bias}:\quad B=\bar{\phi}_\eta-\bar{\phi} \qquad (32)$$
$$\text{Variance}:\quad V=\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2] \qquad (33)$$

\blacksquare

Variance term in risk of estimator

$$\text{Variance}:\quad V=\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2]\approx\frac{A(0)}{M} \qquad (34)$$
  •  Proof
$$V=\mathbb{E}[(\bar{\phi}_\eta-\hat{\phi})^2]=\mathbb{E}\Big[\Big(\bar{\phi}_\eta-\frac{1}{T}\sum_{i=1}^{T}\phi(\bm{\theta}_i)\Big)^2\Big]=\frac{1}{T^2}\mathbb{E}\Big[\sum_{i=1}^{T}\sum_{j=1}^{T}\Big(\bar{\phi}_\eta-\phi(\bm{\theta}_i)\Big)\Big(\bar{\phi}_\eta-\phi(\bm{\theta}_j)\Big)\Big]=\frac{1}{T^2}\sum_{i=1}^{T}\sum_{j=1}^{T}A(|i-j|)$$
$$=\frac{1}{T^2}\sum_{i=1}^{T}\left(\sum_{t=-\infty}^{\infty}A(|t|)-\sum_{|t|>2T}A(|t|)\right)\approx\frac{1}{T^2}\sum_{i=1}^{T}\sum_{t=-\infty}^{\infty}A(|t|)=\frac{1}{T}\Big(A(0)+2\sum_{t=1}^{\infty}A(t)\Big)=\frac{A(0)}{T}\Big(1+2\sum_{t=1}^{\infty}\frac{A(t)}{A(0)}\Big)$$

    where the term $\sum_{|t|>2T}A(|t|)$ is omitted in the approximation step, as it is usually small by the decay of the autocovariance function.

    We repeat some definitions from the main paper (?).

$$A(t)=\mathbb{E}[(\bar{\phi}_\eta-\phi(\bm{\theta}_0))(\bar{\phi}_\eta-\phi(\bm{\theta}_t))] \qquad (41)$$

    is the autocovariance function, which measures how strongly two samples with time lag $t$ are correlated. Its normalized version

$$\text{ACF}:\quad\gamma(t)=\frac{A(t)}{A(0)} \qquad (42)$$

    is called the autocorrelation function (ACF).

$$\text{ACT}:\quad\tau=\frac{1}{2}+\sum_{t=1}^{\infty}\gamma(t) \qquad (43)$$

    is the integrated autocorrelation time (ACT), which measures the interval between independent samples.

    Note that effective sample size (ESS) is defined as

$$\text{ESS}:\quad M=\frac{T}{1+2\sum_{t=1}^{\infty}\frac{A(t)}{A(0)}} \qquad (44)$$

    Plugging this definition into the derivation of the variance, we have

$$V\approx\frac{A(0)}{T}\Big(1+2\sum_{t=1}^{\infty}\frac{A(t)}{A(0)}\Big)=\frac{A(0)}{M} \qquad (45)$$

    \blacksquare

Appendix E. More results on the simulation

We demonstrate pSGLD on a simple 2D Gaussian example, $\mathcal{N}\!\left(\begin{bmatrix}0\\0\end{bmatrix},\begin{bmatrix}0.16&0\\0&a\end{bmatrix}\right)$. The first 600 samples of both methods, for different $a$ and $\epsilon$, are shown in Fig. 6.

Comparing the results for different stepsizes $\epsilon$ at the same $a$, it can be seen that pSGLD adapts the stepsizes according to the manifold geometry of the different dimensions.

When $a$ is rescaled from $0.5$ to $2$, the stepsize $\epsilon=0.1$ is appropriate for SGLD at $a=0.5$, but not a good choice at $a=2$, because the space is not fully explored. This also implies that even if the covariance matrix of a target distribution is mildly rescaled, we do not have to choose a new stepsize for pSGLD. In contrast, the stepsize of standard SGLD needs to be fine-tuned in order to obtain decent samples.

Figure 6: Simulation. (a) $a=2,\epsilon=0.3$; (b) $a=2,\epsilon=0.1$; (c) $a=0.5,\epsilon=0.3$; (d) $a=0.5,\epsilon=0.1$.

Appendix F. More results on Feedforward Neural Networks

Learning curves for network sizes 400-400 and 800-800 on MNIST are provided in Fig. 7 (a) and (b), respectively. Similar to the results for network size 1200-1200 in the main paper, the stochastic sampling methods take fewer iterations to converge, and the results are more stable than their optimization counterparts. Moreover, pSGLD consistently converges faster, and to better solutions, than SGLD and the other methods.

Figure 7: Learning curves of FNN at different network sizes. (a) 400-400; (b) 800-800.

Appendix G. More results on Convolutional Neural Networks

We use another fairly standard network configuration containing 2 convolutional layers on the MNIST dataset, followed by a single fully-connected layer (?) containing 500 hidden nodes with ReLU. Both convolutional layers use a $5\times 5$ filter size with 32 and 64 channels, respectively, and $2\times 2$ max pooling is used after each convolutional layer. 100 epochs are used, and $L$ is set to 20. The stepsizes for pSGLD and RMSprop are set to $\epsilon=\{1,2\}\times 10^{-3}$ via grid search; for SGLD and SGD, $\epsilon=\{1,2\}\times 10^{-1}$.

A comparison of test errors is shown in Table 4, with the corresponding learning curves in Fig. 8. Again, under the same network architecture, the CNN trained with traditional SGD gives an error of 0.81%, while pSGLD shows a significant improvement, with an error of 0.56%.

Method Test error
pSGLD 0.56%
SGLD 0.76%
RMSprop 0.64%
SGD 0.81%

Table 4: Results of CNN.
Figure 8: Learning curves.

We also tested a similar 3-layer CNN with 32-32-64 channels on the Cifar-10 RGB image dataset (?), which consists of $50{,}000$ training samples and $10{,}000$ test samples. No data augmentation is employed. We keep the same settings for pSGLD and SGLD as on MNIST, and show the comparison on Cifar-10 in Fig. 9. pSGLD converges faster and reaches a lower error.

Figure 9: Test learning curves of CNN on Cifar-10 dataset.