
Provable Compressed Sensing with Generative Priors via Langevin Dynamics

Thanh V. Nguyen, Gauri Jagatap, Chinmay Hegde Email: thanhng@iastate.edu, gbj221@nyu.edu, chinmay.h@nyu.edu. This work was partially done while TN was with the Electrical and Computer Engineering Department at Iowa State University. GJ and CH are currently with the Tandon School of Engineering at New York University. This work was supported in part by NSF grants CCF-2005804 and CCF-1815101.
Abstract

Deep generative models have emerged as a powerful class of priors for signals in various inverse problems such as compressed sensing, phase retrieval and super-resolution. Here, we assume an unknown signal to lie in the range of some pre-trained generative model. A popular approach for signal recovery is via gradient descent in the low-dimensional latent space. While gradient descent has achieved good empirical performance, its theoretical behavior is not well understood. In this paper, we introduce the use of stochastic gradient Langevin dynamics (SGLD) for compressed sensing with a generative prior. Under mild assumptions on the generative model, we prove the convergence of SGLD to the true signal. We also demonstrate empirical performance competitive with standard gradient descent.

1 Introduction

We consider the familiar setting of inverse problems where the goal is to recover an $n$-dimensional signal $x^*$ that is indirectly observed via a linear measurement operation $y = Ax^*$. The measurement vector can be noisy, and its dimension $m$ may be less than $n$. Several practical applications fit this setting, including super-resolution (Dong et al., 2016), in-painting, denoising Vincent et al. (2010), and compressed sensing Donoho (2006); Chang et al. (2017).

Since such an inverse problem is ill-posed in general, the recovery of $x^*$ from $y$ often requires assuming a low-dimensional structure or prior on $x^*$. Choices of good priors have been extensively explored in the past three decades, including sparsity Chen et al. (2001); Needell & Tropp (2009), structured sparsity Baraniuk et al. (2010), end-to-end training via convolutional neural networks Chang et al. (2017); Mousavi & Baraniuk (2017), pre-trained generative priors Bora et al. (2017a), as well as untrained deep image priors Ulyanov et al. (2018); Jagatap & Hegde (2019).

In this paper, we focus on a powerful class of priors based on deep generative models. The setup is the following: the unknown signal $x^*$ is assumed to lie in the range of some pre-trained generator network, obtained from (say) a generative adversarial network (GAN) or a variational autoencoder (VAE). That is, $x^* = G(z^*)$ for some $z^*$ in the latent space. The task is again to recover $x^*$ from (noisy) linear measurements.

Such generative priors have been shown to achieve high empirical success Chang et al. (2017); Bora et al. (2017a); Y. Wu (2019). However, progress on the theoretical side for inverse problems with generative priors has been much more modest. On the one hand, the seminal work of Bora et al. (2017b) established the first statistical upper bounds (in terms of measurement complexity) for compressed sensing for fairly general generative priors, which was later shown in Liu & Scarlett (2020) to be nearly optimal. On the other hand, provable algorithmic guarantees for recovery using generative priors are only available in very restrictive cases. The paper Hand & Voroninski (2018) proves the convergence of (a variant of) gradient descent for shallow generative priors whose weights obey a distributional assumption. The paper Shah & Hegde (2018) proves the convergence of projected gradient descent (PGD) under the assumption that the range of the (possibly deep) generative model $G$ admits a polynomial-time oracle projection. To our knowledge, the most general algorithmic result in this line of work is by Latorre et al. (2019). Here, the authors show that under rather mild and intuitive assumptions on $G$, a linearized alternating direction method of multipliers (ADMM) applied to a regularized mean-squared error loss converges to a (potentially large) neighborhood of $x^*$.

The main barrier to obtaining guarantees for recovery algorithms based on gradient descent is the non-convexity of the recovery problem induced by the generator network. Therefore, in this paper we sidestep traditional gradient descent-style optimization methods, and instead show that a very good estimate of $x^*$ can also be obtained by performing stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011); Raginsky et al. (2017); Zhang et al. (2017); Zou et al. (2020). We show that this dynamics amounts to sampling from a Gibbs distribution whose energy function is precisely the reconstruction loss. (While preparing this manuscript, we became aware of concurrent work by Jalal et al. (2020) which also pursues a similar Langevin-style approach for solving compressed sensing problems; however, they do not theoretically analyze its dynamics.)

As a stochastic version of gradient descent, SGLD is simple to implement. However, care must be taken in constructing the additive stochastic perturbation to each gradient update step. Nevertheless, the sampling viewpoint enables us to achieve finite-time convergence guarantees for compressed sensing recovery. To the best of our knowledge, this is the first such result for solving compressed sensing problems with generative neural network priors. Moreover, our analysis succeeds under (slightly) weaker assumptions on the generator network than those made in Latorre et al. (2019). Our specific contributions are as follows:

  1. We propose a provable compressed sensing recovery algorithm for generative priors based on stochastic gradient Langevin dynamics (SGLD).

  2. We prove polynomial-time convergence of our proposed recovery algorithm to the true underlying solution, under assumptions of smoothness and near-isometry of $G$. These are technically weaker than the mild assumptions made in Latorre et al. (2019). We emphasize that these conditions are valid for a wide range of generator networks. Section 3 describes them in greater detail.

  3. We provide several empirical results and demonstrate that our approach is competitive with existing (heuristic) methods based on gradient descent.

2 Prior work

We briefly review the literature on compressed sensing with deep generative models. For a thorough survey on deep learning for inverse problems, see Ongie et al. (2020).

In Bora et al. (2017a), the authors provide sufficient conditions under which the solution of the inverse problem is a minimizer of the (possibly non-convex) program:

$$\min_{x=G(z)}\|Ax-y\|_{2}^{2}. \tag{2.1}$$

Specifically, they show that if $A$ satisfies the so-called set-Restricted Eigenvalue Condition (REC), then the solution to (2.1) equals the unknown vector $x^*$. They also show that if the generator $G$ has a latent dimension $k$ and is $L$-Lipschitz, then a matrix $A\in\mathbb{R}^{m\times n}$ populated with i.i.d. Gaussian entries satisfies the REC, provided $m = O(k\log L)$. However, they propose gradient descent as a heuristic to solve (2.1), but do not analyze its convergence. In Shah & Hegde (2018), the authors show that projected gradient descent (PGD) for (2.1) converges at a linear rate under the REC, but only if there exists a tractable projection oracle that can compute $\operatorname{arg\,min}_{z}\|x-G(z)\|$ for any $x$. The recent work Lei et al. (2019) provides sufficient conditions under which such a projection can be approximately computed. In Latorre et al. (2019), a provable recovery scheme based on ADMM is established, but guarantees recovery only up to a neighborhood around $x^*$.
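For intuition, a minimal sketch of such a PGD scheme is given below. The exact projection oracle $\operatorname{arg\,min}_{z}\|x-G(z)\|$ is replaced by a few inner gradient steps on $z$; the step sizes, iteration counts, and this inner-loop approximation are illustrative assumptions, not the precise procedure of the cited works.

```python
import torch

def pgd_recover(G, A, y, d, outer_iters=100, inner_iters=50, step_x=0.5, step_z=0.01):
    """Projected gradient descent for min_x ||Ax - y||^2 subject to x in range(G).

    The projection onto the range of G is approximated by inner gradient
    descent on ||x - G(z)||^2 (a stand-in for the oracle assumed by
    Shah & Hegde (2018))."""
    z = torch.zeros(d, requires_grad=True)
    x = G(z).detach()
    for _ in range(outer_iters):
        # Gradient step on x for the data-fidelity term ||Ax - y||^2.
        x = x - step_x * A.T @ (A @ x - y)
        # Approximate projection of x onto the range of G.
        for _ in range(inner_iters):
            loss = ((G(z) - x) ** 2).sum()
            grad_z, = torch.autograd.grad(loss, z)
            z = (z - step_z * grad_z).detach().requires_grad_(True)
        x = G(z).detach()
    return x, z
```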

Note that all the above works assume mild conditions on the weights of the generator, use variations of gradient descent to update the estimate for $x$, and require the forward matrix $A$ to satisfy the REC over the range of $G$. Hand & Voroninski (2018, 2019) showed global convergence for gradient descent, but under the (strong) assumption that the weights of the trained generator are Gaussian distributed.

Generator networks trained with GANs are most commonly studied. However, Whang et al. (2020); Asim et al. (2019) have recently advocated using invertible generative models, which use real-valued non-volume preserving (NVP) transformations Dinh et al. (2017). An alternate strategy for sampling images consistent with linear forward models was proposed in Lindgren et al. (2020), where the authors assume an invertible generative mapping and sample the latent vector $z$ from a second generative invertible prior.

Our proposed approach also traces its roots to Bayesian compressed sensing Ji & Carin (2007), where instead of modeling the problem as estimating a (deterministic) sparse vector, one models the signal $x$ to be sampled from a sparsity-promoting distribution, such as a Laplace prior. One can then derive the maximum a posteriori (MAP) estimate of $x$ under the constraint that the measurements $y = Ax$ are consistent. Our motivation is similar, except that we model the distribution of $x$ as being supported on the range of a generative prior.

3 Recovery via Langevin dynamics

In the rest of the paper, $x \wedge y$ denotes $\min\{x,y\}$ and $x \vee y$ denotes $\max\{x,y\}$. Given a distribution $\mu$ and a set $\mathcal{A}$, we denote by $\mu(\mathcal{A})$ the probability measure of $\mathcal{A}$ with respect to $\mu$. $\|\mu-\nu\|_{TV}$ is the total variation distance between two distributions $\mu$ and $\nu$. Finally, we use standard big-O notation in our analysis.

3.1 Preliminaries

We focus on the problem of recovering a signal $x^*\in\mathbb{R}^{n}$ from a set of linear measurements $y\in\mathbb{R}^{m}$, where

$$y = Ax^* + \varepsilon.$$

To keep our analysis and results simple, we consider zero measurement noise, i.e., $\varepsilon = 0$. (We note in passing that our analysis techniques succeed for any vector $\varepsilon$ with bounded $\ell_2$ norm.) Here, $A\in\mathbb{R}^{m\times n}$ is a matrix populated with i.i.d. Gaussian entries with mean 0 and variance $1/m$. We assume that $x^*$ belongs to the range of a known generative model $G\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$; that is,

$$x^* = G(z^*)\quad\text{for some}\quad z^*\in\mathcal{D}.$$

Following Bora et al. (2017b), we restrict $z$ to belong to a $d$-dimensional Euclidean ball, i.e., $\mathcal{D} = \mathcal{B}(0,R)$. Then, given the measurements $y$, our goal is to recover $x^*$. Again following Bora et al. (2017b), we do so by solving the usual optimization problem:

$$\min_{z\in\mathcal{D}} F(z) \triangleq \|y - AG(z)\|^{2}. \tag{3.1}$$

Hereon, unless otherwise stated, $\|\cdot\|$ denotes the $\ell_2$-norm. The most popular approach to solving (3.1) is to use gradient descent Bora et al. (2017b). For generative models $G(z)$ defined by deep neural networks, the function $F(z)$ is highly non-convex, and as such, one cannot in general guarantee global signal recovery using regular (projected) gradient descent.
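As a concrete reference point for the notation above, the following sketch sets up the measurement model and the loss $F(z)$ of (3.1). The generator, its latent dimension, and the signal dimensions are placeholders (a small random network stands in for a pre-trained $G$); this is purely illustrative.

```python
import torch

torch.manual_seed(0)
n, m, d = 784, 200, 20  # ambient, measurement, and latent dimensions (illustrative)

# A stand-in for a pre-trained generator G : R^d -> R^n.
G = torch.nn.Sequential(
    torch.nn.Linear(d, 256), torch.nn.ELU(), torch.nn.Linear(256, n), torch.nn.Tanh()
)

A = torch.randn(m, n) / m ** 0.5   # i.i.d. N(0, 1/m) measurement matrix
z_star = torch.randn(d)            # unknown latent code
x_star = G(z_star).detach()        # x* = G(z*)
y = A @ x_star                     # noiseless measurements

def F(z):
    """Reconstruction loss F(z) = ||y - A G(z)||^2 from (3.1)."""
    return ((y - A @ G(z)) ** 2).sum()
```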

We adopt a slightly more refined approach. Starting from an initial point $z_0 \sim \mu_0$, our algorithm computes stochastic gradient updates of the form:

$$z_{k+1} = z_k - \eta\nabla_{z}F(z_k) + \sqrt{2\eta\beta^{-1}}\,\xi_k,\qquad k = 0,1,2,\dots \tag{3.2}$$

where $\xi_k$ is a unit Gaussian random vector in $\mathbb{R}^{d}$, $\eta$ is the step size and $\beta$ is an inverse temperature parameter. This update rule is known as stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011) and has been widely studied both in theory and practice Raginsky et al. (2017); Zhang et al. (2017). Intuitively, (3.2) is an Euler discretization of the continuous-time diffusion equation:

$$\mathrm{d}Z(t) = -\nabla_{z}F(Z(t))\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B(t),\qquad t\geq 0, \tag{3.3}$$

where $Z(0)\sim\mu_0$. Under standard regularity conditions on $F(z)$, one can show that the above diffusion has a unique invariant Gibbs measure.

We refine the standard SGLD to account for the boundedness of $z$. Specifically, we require an additional Metropolis-like accept/reject step to ensure that $z_{k+1}$ always belongs to the support $\mathcal{D}$, and also is not too far from $z_k$ of the previous iteration. We study this variant for theoretical analysis; in practice we have found that this is not necessary. Algorithm 1 (CS-SGLD) shows the detailed algorithm. Note that we can use a stochastic (mini-batch) gradient instead of the full gradient $\nabla_{z}F(z)$.
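For concreteness, here is a minimal sketch of one possible implementation of the CS-SGLD iteration (the update (3.2) together with the accept/reject step of Algorithm 1 below), reusing the loss `F` from the earlier setup snippet; the particular values of $\eta$, $\beta$, $r$ and $R$ are illustrative.

```python
import torch

def cs_sgld(F, d, R=10.0, eta=1e-3, beta=100.0, iters=2000):
    """SGLD on the latent variable z, with the rejection step of Algorithm 1:
    a proposal is kept only if it stays within distance r of the current
    iterate and inside the ball D = B(0, R)."""
    r = (10 * eta * d / beta) ** 0.5      # radius r = sqrt(10*eta*d/beta), as in Lemma 4.1
    z = torch.zeros(d, requires_grad=True)
    for _ in range(iters):
        loss = F(z)
        grad, = torch.autograd.grad(loss, z)
        xi = torch.randn(d)
        proposal = z.detach() - eta * grad + (2 * eta / beta) ** 0.5 * xi
        # Accept only if the proposal stays local and inside the domain D.
        if (proposal - z.detach()).norm() <= r and proposal.norm() <= R:
            z = proposal.requires_grad_(True)
    return z.detach()
```

With the earlier setup, `z_hat = cs_sgld(F, d)` returns the latent estimate and `G(z_hat)` the reconstructed signal.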

We wish to derive sufficient conditions on the convergence (in distribution) of the random process in Algorithm 1 to the target distribution $\pi$, denoted by:

$$\pi(\mathrm{d}z) \propto \exp(-\beta F(z))\,\mathbf{1}(z\in\mathcal{D}), \tag{3.4}$$

and study its consequence in recovering the true signal $x^*$. This leads to the first guarantees of a stochastic gradient-like method for compressed sensing with generative priors. In order to do so, we make the following three assumptions on the generator network $G(z)$.

Algorithm 1 CS-SGLD
  Input: step size $\eta$; inverse temperature parameter $\beta$; radius $r$ and Lipschitz constant $L$ of $F(z)$.
  Draw $z_0$ from $\mu_0 = \mathcal{N}(0,\frac{1}{2L\beta}\mathbb{I})$ truncated on $\mathcal{D}$.
  for $k = 0,1,\ldots,$ do
     Randomly sample $\xi_k \sim \mathcal{N}(0,\mathbb{I})$.
     $z_{k+1} = z_k - \eta\nabla_z F(z_k) + \sqrt{2\eta/\beta}\,\xi_k$
     if $z_{k+1} \not\in \mathcal{B}(z_k,r)\cap\mathcal{D}$ then
        $z_{k+1} = z_k$
     end if
  end for
  Output: $\widehat{z} = z_k$.
  1. (A.1) Boundedness. For all $z\in\mathcal{D}$, we have that $\|G(z)\| \leq B$ for some $B > 0$.

  2. (A.2) Near-isometry. $G(z)$ is a near-isometric mapping if there exist $0 < \iota_G \leq \kappa_G$ such that the following holds for any $z, z'\in\mathcal{D}$:

     $$\iota_G\|z - z'\| \leq \|G(z) - G(z')\| \leq \kappa_G\|z - z'\|.$$

  3. (A.3) Lipschitz gradients. The Jacobian of $G(z)$ is $M$-Lipschitz, i.e., for any $z, z'\in\mathcal{D}$, we have

     $$\|\nabla_z G(z) - \nabla_z G(z')\| \leq M\|z - z'\|,$$

     where $\nabla_z G(z) = \frac{\partial G(z)}{\partial z}$ is the Jacobian of the mapping $G(\cdot)$ with respect to $z$.

All three assumptions are justifiable. Assumption (A.1) is reasonable due to the bounded domain $\mathcal{D}$ and for well-trained generative models $G(z)$ whose target data distribution is normalized. Assumption (A.2) is reminiscent of the ubiquitous restricted isometry property (RIP) used in compressed sensing analysis (Candes & Tao, 2005) and was recently adopted in Latorre et al. (2019). Finally, Assumption (A.3) is needed so that the loss function $F(z)$ is smooth, following typical analyses of Markov processes.

Next, we introduce a new concept of smoothness for generative networks. This concept is a weaker version of a condition on $G(\cdot)$ introduced in Latorre et al. (2019).

Definition 3.1 (Strong smoothness).

The generator network $G(z)$ is $(\alpha,\gamma)$-strongly smooth if there exist $\alpha > 0$ and $\gamma \geq 0$ such that for any $z, z'\in\mathcal{D}$, we have

$$\langle G(z) - G(z'),\, \nabla_z G(z)(z - z')\rangle \geq \alpha\|z - z'\|^2 - \gamma. \tag{3.5}$$

Following Latorre et al. (2019) (Assumption 2), we call this property “strong smoothness”. However, our definition of strong smoothness requires two parameters instead of one, and is weaker since we allow for an additive slack parameter $\gamma \geq 0$.

Definition 3.1 can be closely linked to the following property of the loss function $F(z)$ that turns out to be crucial in establishing convergence results for CS-SGLD.

Definition 3.2 (Dissipativity (Hale, 1990)).

A differentiable function $F(z)$ on $\mathcal{D}$ is $(\alpha,\gamma)$-dissipative around $z^*$ if for constants $\alpha > 0$ and $\gamma \geq 0$, we have

$$\langle z - z^*,\, \nabla_z F(z)\rangle \geq \alpha\|z - z^*\|^2 - \gamma. \tag{3.6}$$

It is straightforward to see that (3.6) essentially recovers the strong smoothness condition (3.5) if the measurement matrix $A$ is taken to be the identity. In compressed sensing, $A$ is typically a (sub)Gaussian matrix; given a sufficient number of measurements, together with Assumptions (A.1), (A.2) and (A.3), the dissipativity of $F(z)$ for such an $A$ can still be established.
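To see the connection concretely, recall from Appendix B that (up to an immaterial constant factor) $\nabla_z F(z) = (\nabla_z G(z))^\top A^\top A\,(G(z) - G(z^*))$, so that

$$\langle z - z^*,\, \nabla_z F(z)\rangle = \big\langle A\big(G(z) - G(z^*)\big),\; A\,\nabla_z G(z)(z - z^*)\big\rangle.$$

When $A = \mathbb{I}$, the right hand side is precisely the left hand side of (3.5) with $z' = z^*$, so $(\alpha,\gamma)$-strong smoothness of $G$ immediately yields $(\alpha,\gamma)$-dissipativity of $F$.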

Once $F$ is shown to be dissipative, the machinery of Raginsky et al. (2017); Zhang et al. (2017); Zou et al. (2020) can be adapted to show the convergence of CS-SGLD. The majority of the remainder of the paper is devoted to proving this series of technical claims.

3.2 Main results

We first show that a very broad class of generator networks satisfies the assumptions made above. The following proposition is an extension of a result in Latorre et al. (2019).

Proposition 3.1.

Suppose $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $\mathcal{D}$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G(z)$ satisfies Assumptions (A.2) & (A.3) with constants $\iota_G, \kappa_G, M$, and if $2\iota_G^2 > M\kappa_G$, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.

This proposition merits a thorough discussion. First, architectures with increasing layer sizes are common; many generative models (such as GANs) assume architectures of this sort. Observe that the non-decreasing layer size condition is much milder than the expansivity ratios of successive layers assumed in related work Hand & Voroninski (2018); Asim et al. (2019).

Second, the compactness assumption on the domain of $G$ is mild, and traces its provenance to earlier related works Bora et al. (2017b); Latorre et al. (2019). Moreover, common empirical techniques for training generative models (such as GANs) indeed assume that the latent vectors $z$ lie on the surface of a sphere White (2016).

Third, common activation functions such as the sigmoid, or the Exponential Linear Unit (ELU) are continuously differentiable and monotonic. Note that the standard Rectified Linear Unit (ReLU) activation does not satisfy these conditions, and establishing similar results for ReLU networks is deferred to future work.

The key to our theoretical analysis, as discussed above, is Definition 3.1, and establishing it requires Proposition 3.1. Interestingly, however, in Section 5 below we provide empirical evidence that strong smoothness holds for generative adversarial networks with ReLU activation trained on the MNIST and CIFAR-10 image datasets.

We now obtain a measurement complexity result by deriving a bound on the number of measurements required for $F$ to be dissipative.

Lemma 3.1.

Let $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ be a feed-forward neural network that satisfies the conditions in Proposition 3.1. Let $\kappa_G$ be its Lipschitz constant. Suppose the number of measurements $m$ satisfies:

$$m = \Omega\left(\frac{d}{\delta^2}\log(\kappa_G/\gamma)\right),$$

for some small constant $\delta > 0$. If the elements of $A$ are drawn according to $\mathcal{N}(0,\frac{1}{m})$, then the loss function $F(z)$ is $(1-\delta,\gamma)$-dissipative with probability at least $1 - \exp(-\Omega(m\delta^2))$.

The above result can be derived using covering number arguments, similar to the treatment in Bora et al. (2017b). Observe that the number of measurements scales linearly with the dimension of the latent vector $z$ instead of the ambient dimension, in keeping with the flavor of results in standard compressed sensing. Recent lower bounds reported in Liu & Scarlett (2020) have also shown that the scaling of $m$ with respect to $d$ and $\log L$ might be tight for compressed sensing recovery in several natural parameter regimes.

We need two more quantities before we can state our convergence guarantee. Both definitions are widely used in the convergence analysis of MCMC methods. The first quantity measures the goodness of an initial distribution $\mu_0$ with respect to the target distribution $\pi$.

Definition 3.3 ($\lambda$-warm start, Zou et al. (2020)).

Let $\nu$ be a distribution on $\mathcal{D}$. An initial distribution $\mu_0$ is a $\lambda$-warm start with respect to $\nu$ if

$$\sup_{\mathcal{A}\colon\mathcal{A}\subseteq\mathcal{D}}\frac{\mu_0(\mathcal{A})}{\nu(\mathcal{A})} \leq \lambda.$$

The next quantity is the Cheeger constant, which connects the geometry of the objective function to the hitting time of SGLD to a particular set in the domain Zhang et al. (2017).

Definition 3.4 (Cheeger constant).

Let $\mu$ be a probability measure on $\mathcal{D}$. We say $\mu$ satisfies the isoperimetric inequality with Cheeger constant $\rho$ if for any $\mathcal{A}\subset\mathcal{D}$,

$$\liminf_{h\rightarrow 0^{+}}\frac{\mu(\mathcal{A}_h) - \mu(\mathcal{A})}{h} \geq \rho\min\big\{\mu(\mathcal{A}),\, 1 - \mu(\mathcal{A})\big\},$$

where $\mathcal{A}_h = \{u\in\mathcal{D}\colon\exists v\in\mathcal{A},\ \|u - v\|_2\leq h\}$.

Putting all the above ingredients together, our main theoretical result describing the convergence of Algorithm 1 (CS-SGLD) for compressed sensing recovery is given as follows.

Theorem 1 (Convergence of CS-SGLD).

Assume that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. Consider a signal $x^* = G(z^*)$, and assume that it is measured with $m$ (sub)Gaussian measurements such that $m = \Omega(d\log\kappa_G/\gamma)$. Choose an inverse temperature $\beta$ and precision parameter $\epsilon > 0$. Then, after $k$ iterations of SGLD in Algorithm 1, we obtain a latent vector $z_k$ such that

$$\mathbb{E}\left[F(z_k)\right] \leq \epsilon + O\left(\frac{d}{\beta}\log\left(\frac{\beta}{d}\right)\right), \tag{3.7}$$

provided the step size $\eta$ and the number of iterations $k$ are chosen such that:

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^3\beta^2}{\rho^4\epsilon^2}\right).$$

In words, if we choose a high enough inverse temperature and appropriate step size, CS-SGLD converges (in expectation) to a signal estimate with very low loss within a polynomial number of iterations.

Let us parse the above result further. First, observe that the right hand side of (3.7) consists of two terms. The first term can be made arbitrarily small, at the price of greater computational cost, since $\eta$ decreases (and $k$ grows) as $\epsilon$ shrinks. The second term represents the irreducible expected error of the exact sampling algorithm on the Gibbs measure $\pi(\mathrm{d}z)$, which is worse than the optimal loss obtained at $z = z^*$.

Second, suppose the right hand side of (3.7) is upper bounded by $\epsilon'$. Once SGLD finds an $\epsilon'$-approximate minimizer of the loss, in the regime of sufficient compressed sensing measurements (as specified by Lemma 3.1), we can invoke Theorem 1.1 in Bora et al. (2017b) along with Jensen's inequality to immediately obtain a recovery guarantee, i.e.,

$$\mathbb{E}\left[\left\lVert x^* - G(z_k)\right\rVert\right] \leq \sqrt{\epsilon'}.$$
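For intuition, one way to see this chain of inequalities (under the S-REC of Definition 4.2 with slack $o = 0$, and ignoring constant factors) is

$$\mathbb{E}\big[\|x^* - G(z_k)\|\big] \;\leq\; \frac{1}{\tau}\,\mathbb{E}\big[\|A(x^* - G(z_k))\|\big] \;=\; \frac{1}{\tau}\,\mathbb{E}\big[\sqrt{F(z_k)}\big] \;\leq\; \frac{1}{\tau}\sqrt{\mathbb{E}[F(z_k)]} \;\leq\; \frac{\sqrt{\epsilon'}}{\tau},$$

where the middle equality uses the noiseless model $y = AG(z^*)$ and the second inequality is Jensen's inequality applied to the concave square root.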

Third, the convergence rate of CS-SGLD can be slow. In particular, SGLD may require a polynomial number of iterations to recover the true signal, while linearized ADMM Latorre et al. (2019) converges within a logarithmic number of iterations up to a neighborhood of the true signal. Obtaining an improved characterization of CS-SGLD convergence (or perhaps devising a new linearly convergent algorithm) is an important direction for future work.

Fourth, the above result is for noiseless measurements. A rather similar result can be derived for noisy measurements with bounded noise (say, $\|\varepsilon\| \leq \sigma$). This quantity (times a constant depending on $A$) will affect (3.7) up to an additive term that scales with $\sigma$. This is precisely in line with most compressed sensing recovery results, and for simplicity we omit such a derivation.

4 Proof outline

In this section, we provide a brief proof sketch of Theorem 1, while relegating details to the appendix. At a high level, our analysis is an adaptation of the framework of Zhang et al. (2017); Zou et al. (2020) specialized to the problem of compressed sensing recovery using generative priors. The basic ingredient in the proof is the use of conductance analysis to show the convergence of CS-SGLD to the target distribution in total variation distance.

Let $\mu_k$ denote the probability measure of $Z_k$ generated by Algorithm 1 and $\pi$ denote the target distribution in (3.4). The proof of Theorem 1 consists of three main steps:

  1. First, we construct an auxiliary Metropolis-Hastings Markov process to show that $\mu_k$ converges to $\pi$ in total variation for a sufficiently large $k$ and a “good” initial distribution $\mu_0$.

  2. Next, we construct an initial distribution $\mu_0$ that serves as a $\lambda$-warm start with respect to $\pi$.

  3. Finally, we show that a random draw from $\pi$ is a near-minimizer of $F(z)$, proving that CS-SGLD recovers the signal to high fidelity.

We proceed with a characterization of the evolution of the distribution of $z_k$ in Algorithm 1, which basically follows Zou et al. (2020).

4.1 Construction of Metropolis-Hastings SGLD

Let $g(z) = \nabla_z F(z)$, and let $u$ and $w$ respectively be the points before and after one iteration of Algorithm 1; the Markov chain is written as $u\rightarrow v\rightarrow w$, where $v\sim\mathcal{N}(u - \eta g(u), \frac{2\eta}{\beta}I)$ with the following density:

$$P(v\,|\,u) = \frac{1}{(4\pi\eta/\beta)^{d/2}}\exp\bigg(-\frac{\|v - u + \eta g(u)\|_2^2}{4\eta/\beta}\bigg). \tag{4.1}$$

Without the correction step, $P(v|u)$ is exactly the transition probability of the standard Langevin dynamics. Note also that one can construct a similar density with a stochastic (mini-batch) gradient. The process $v\rightarrow w$ is

$$w = \begin{cases} v & v\in\mathcal{B}(u,r)\cap\mathcal{D};\\ u & \text{otherwise}.\end{cases} \tag{4.2}$$

Let $p(u) = \mathbb{P}_{v\sim P(\cdot|u)}[v\in\mathcal{B}(u,r)\cap\mathcal{D}]$ be the probability of accepting $v$. The conditional density $Q(w|u)$ is

$$Q(w|u) = (1 - p(u))\,\delta_u(w) + P(w|u)\cdot\mathbf{1}\big[w\in\mathcal{B}(u,r)\cap\mathcal{D}\big],$$

where $\delta_u(\cdot)$ is the Dirac delta function at $u$. Similar to Zou et al. (2020); Zhang et al. (2017), we consider the $1/2$-lazy version of the above Markov process, with the transition distribution

$$\mathcal{T}_u(w) = \frac{1}{2}\delta_u(w) + \frac{1}{2}Q(w|u), \tag{4.3}$$

and construct an auxiliary Markov process by adding an extra Metropolis accept/reject step. While proving the ergodicity of the Markov process with transition distribution $\mathcal{T}_u(w)$ is difficult, the auxiliary chain does indeed converge to a unique stationary distribution $\pi\propto e^{-\beta F(z)}\cdot\mathbf{1}(z\in\mathcal{D})$ due to the Metropolis-Hastings correction step.

The auxiliary Markov chain is given as follows: starting from $u$, let $w$ be the state generated from $\mathcal{T}_u(\cdot)$. The Metropolis-Hastings SGLD accepts $w$ with probability

$$\alpha_u(w) = \min\bigg\{1,\ \frac{\mathcal{T}_w(u)}{\mathcal{T}_u(w)}\cdot\exp\big[-\beta\big(F(w) - F(u)\big)\big]\bigg\}.$$

Let $\mathcal{T}^{\star}_u(\cdot)$ denote the transition distribution of the auxiliary Markov process, such that

$$\mathcal{T}^{\star}_u(w) = (1 - \alpha_u(w))\,\delta_u(w) + \alpha_u(w)\,\mathcal{T}_u(w).$$
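Purely for intuition, the sketch below implements one step of a Metropolis-adjusted variant of this kind. It is a simplification: the acceptance ratio uses the Gaussian proposal density $P(\cdot|\cdot)$ of (4.1) directly, whereas the analysis works with the lazy kernel $\mathcal{T}_u$, which additionally accounts for the probability mass of rejected proposals; the step size, temperature and domain radius are placeholders.

```python
import torch

def mh_sgld_step(F, z, eta=1e-3, beta=100.0, R=10.0):
    """One Metropolis-adjusted Langevin step on F over D = B(0, R)."""
    def grad(v):
        v = v.detach().requires_grad_(True)
        g, = torch.autograd.grad(F(v), v)
        return g

    def log_P(v, u):
        # log of the Gaussian proposal density N(u - eta*grad(u), (2*eta/beta) I) at v, up to a constant
        return -((v - u + eta * grad(u)) ** 2).sum() / (4 * eta / beta)

    w = z - eta * grad(z) + (2 * eta / beta) ** 0.5 * torch.randn_like(z)
    if w.norm() > R:                      # reject proposals leaving the domain D
        return z
    log_ratio = log_P(z, w) - log_P(w, z) - beta * (F(w) - F(z))
    if torch.rand(1).item() < torch.exp(log_ratio).clamp(max=1.0).item():
        return w
    return z
```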

Below, we establish the connection between $\mathcal{T}_u(\cdot)$ and $\mathcal{T}^{\star}_u(\cdot)$, as well as the convergence of the original chain in Algorithm 1, through a conductance analysis on $\mathcal{T}^{\star}_u(\cdot)$.

Lemma 4.1.

Suppose that Assumptions (A.1)–(A.3) hold, so that $F(z)$ is $L$-smooth and satisfies $\|\nabla_z F(z)\| \leq D$ for $z\in\mathcal{D}$. For $r = \sqrt{10\eta d/\beta}$, the transition distribution of the chain in Algorithm 1 is $\delta$-close to the auxiliary chain, i.e., for any set $\mathcal{A}\subseteq\mathcal{D}$,

$$(1 - \delta)\,\mathcal{T}_u^{\star}(\mathcal{A}) \leq \mathcal{T}_u(\mathcal{A}) \leq (1 + \delta)\,\mathcal{T}_u^{\star}(\mathcal{A}),$$

where $\delta = 10Ld\eta + 10LDd^{1/2}\beta^{1/2}\eta^{3/2}$.

In Appendix B, we show that $F(z)$ is $L$-smooth with $L = (MB + \kappa_G^2)\|A^\top A\|$, and that its gradient is bounded on $\mathcal{D}$, with $\|\nabla_z F(z)\| \leq \kappa_G^2\|A^\top A\|\|z - z^*\|$.

One can verify that $\mathcal{T}^{\star}_u(\cdot)$ is time-reversible (Zhang et al., 2017). Moreover, following Lovász et al. (1993); Lovász & Vempala (2007), the convergence of a time-reversible Markov chain to its stationary distribution depends on its conductance, which is defined as follows:

Definition 4.1 (Restricted conductance).

The conductance of a time-reversible Markov chain with transition distribution $\mathcal{T}^{\star}_u(\cdot)$ and stationary distribution $\pi$ is defined by

$$\phi \triangleq \inf_{\mathcal{A}\colon\mathcal{A}\subseteq\mathcal{D},\,\pi(\mathcal{A})\in(0,1)}\frac{\int_{\mathcal{A}}\mathcal{T}^{\star}_u(\mathcal{D}\backslash\mathcal{A})\,\pi(\mathrm{d}u)}{\min\{\pi(\mathcal{A}),\,\pi(\mathcal{D}\backslash\mathcal{A})\}}.$$

Using the conductance parameter $\phi$ and the closeness $\delta$ between $\mathcal{T}_u(\cdot)$ and $\mathcal{T}^{\star}_u(\cdot)$, we can derive the convergence of $\mathcal{T}_u(\cdot)$ in total variation distance.

Lemma 4.2 (Zou et al. (2020)).

Assume the conditions of Lemma 4.1 hold. If $\mathcal{T}_u(\cdot)$ is $\delta$-close to $\mathcal{T}^{\star}_u(\cdot)$ with $\delta\leq\min\{1 - \sqrt{2}/2,\ \phi/16\}$, and the initial distribution $\mu_0$ serves as a $\lambda$-warm start with respect to $\pi$, then

$$\|\mu_k - \pi\|_{TV} \leq \lambda\big(1 - \phi^2/8\big)^k + 16\delta/\phi.$$

We will further give a lower bound on $\phi$ in order to establish an explicit convergence rate.

Lemma 4.3 (Zou et al. (2020)).

Under the same conditions as Lemma 4.1 and with step size $\eta\leq\frac{1}{30Ld}\wedge\frac{d}{25\beta D^2}$, there exists a constant $c_0$ such that

$$\phi \geq c_0\rho\sqrt{\eta/\beta}.$$

4.2 Convergence of $\mu_k$ to the target distribution $\pi$

Armed with these tools, we formally establish the first step of the proof.

Theorem 2.

Suppose that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. Set $\eta = O\big(d^{-1}\wedge\rho^2\beta^{-1}d^{-2}\big)$ and $r = \sqrt{10\eta d/\beta}$. Then for any $\lambda$-warm start with respect to $\pi$, the output of Algorithm 1 satisfies

$$\|\mu_k - \pi\|_{TV} \leq \lambda(1 - C_0\eta)^k + C_1\eta^{1/2},$$

where $\rho$ is the Cheeger constant of $\pi$, $C_0 = \widetilde{O}\big(\rho^2\beta^{-1}\big)$, and $C_1 = \widetilde{O}\big(d\beta^{1/2}\rho^{-1}\big)$. In particular, if the step size and the number of iterations satisfy

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^2\beta^2\log(\lambda)}{\rho^4\epsilon^2}\right),$$

then $\|\mu_k - \pi\|_{TV}\leq\epsilon$ for any $\epsilon > 0$.

The convergence rate is polynomial in the Cheeger constant $\rho$, whose lower bound is difficult to obtain in general. A rough bound $\rho = e^{-\widetilde{O}(d)}$ can be derived using the Poincaré constant of the distribution $\pi$, under the smoothness assumption. See Bakry et al. (2008) for details.

Proof outline of Theorem 2.

To prove the result, we find a sufficient condition on $\eta$ under which the requirements of Lemmas 4.1, 4.2 and 4.3 hold. For $\eta\leq\frac{d}{25\beta D^2}$, we have

$$\delta = 10Ld\eta + 10LDd^{1/2}\beta^{1/2}\eta^{3/2} \leq 12Ld\eta.$$

Moreover, Lemma 4.2 requires $\delta\leq\min\{1 - \sqrt{2}/2,\ \phi/16\}$, while $\phi\geq c_0\rho\sqrt{\eta/\beta}$ by Lemma 4.3, so we can set

$$\eta = \min\biggl\{\frac{1}{30Ld},\ \frac{d}{25\beta D^2},\ \frac{c_0^2\rho^2}{(156Ld)^2\beta}\biggr\}$$

for these conditions to hold. Putting it all together, we obtain

$$\|\mu_k - \pi\|_{TV} \leq \lambda\big(1 - \phi^2/8\big)^k + \frac{16\delta}{\phi} \leq \lambda(1 - C_0\eta)^k + C_1\eta^{1/2},$$

where $C_0 = c_0^2\rho^2/8\beta$ and $C_1 = 156Ld\beta^{1/2}\rho^{-1}/c_0$. Therefore, we have proved the first part.

For the second part, to achieve $\epsilon$-sampling error, it suffices to choose $\eta$ and $k$ such that

$$\lambda(1 - C_0\eta)^k \leq \frac{\epsilon}{2},\quad\text{and}\quad C_1\eta^{1/2} \leq \frac{\epsilon}{2}.$$

Plugging in $C_0, C_1$ above, we can choose

$$\eta = O\bigg(\frac{\rho^2\epsilon^2}{d^2\beta}\bigg)\quad\text{and}\quad k = O\bigg(\frac{\log(\lambda/\epsilon)}{C_0\eta}\bigg) = \widetilde{O}\bigg(\frac{d^2\beta^2\log(\lambda)}{\rho^4\epsilon^2}\bigg)$$

such that $\|\mu_k - \pi\|_{TV}\leq\epsilon$, which completes the proof. ∎

4.3 Existence of a $\lambda$-warm start initial distribution $\mu_0$

Apart from the step size and the number of iterations, the convergence depends on $\lambda$, the goodness of the initial distribution $\mu_0$. In this part, we specify a particular choice of $\mu_0$ and establish this property.

Definition 4.2 (Set-Restricted Eigenvalue Condition, (Bora et al., 2017b)).

For some parameters $\tau > 0$ and $o\geq 0$, $A\in\mathbb{R}^{m\times n}$ is called $\text{S-REC}(\tau, o)$ if for all $z, z'\in\mathcal{D}$,

$$\|A(G(z) - G(z'))\| \geq \tau\|G(z) - G(z')\| - o.$$
Lemma 4.4.

Suppose that $G(z)$ satisfies the near-isometry property in Assumption (A.2), and $F(z)$ is $L$-smooth. If $A$ is $\text{S-REC}(\tau, 0)$, then the Gaussian distribution $\mathcal{N}(0,\frac{1}{2\beta L}\mathbb{I})$ supported on $\mathcal{D}$ is a $\lambda$-warm start with respect to $\pi$ with $\lambda = e^{O(d)}$.

Proof.

Let $\mu_0$ denote the truncated Gaussian distribution $\mathcal{N}(0,\frac{1}{2\beta L}\mathbb{I})$ on $\mathcal{D}$, whose measure is

$$\mu_0(\mathrm{d}z) = e^{-\beta L\|z\|_2^2}\,\mathbf{1}(z\in\mathcal{D})\,\mathrm{d}z\,/\,\Gamma,$$

where $\Gamma = \int_{\mathcal{D}}e^{-\beta L\|z\|_2^2}\,\mathrm{d}z$ is the normalization constant.

Comparing with the target measure $\pi$, we can easily verify that

$$\frac{\mu_0(\mathrm{d}z)}{\pi(\mathrm{d}z)} \leq \frac{\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z}{\Gamma}\cdot e^{-\beta L\|z\|_2^2 + \beta F(z)}.$$

Our goal is to bound the right hand side. Using the smoothness of $F$ and the simple fact $F(z^*) = 0$, we have

$$F(z) \leq \frac{L}{2}\|z - z^*\|_2^2 \leq L\|z^*\|_2^2 + L\|z\|_2^2,$$

which implies that $e^{-\beta L\|z\|_2^2 + \beta F(z)} \leq e^{\beta L\|z^*\|_2^2}$.

To bound $\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z$, we use the S-REC property of $A$ as well as the near-isometry of $G(z)$. Recall the objective function:

$$F(z) = \|y - AG(z)\|^2 = \|A(G(z) - G(z^*))\|^2 \geq \tau^2\|G(z) - G(z^*)\|^2 - o \geq \tau^2\iota_G^2\|z - z^*\|^2,$$

where we have dropped $o$ for simplicity. Therefore,

$$\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z \leq \int_{\mathcal{D}}e^{-\beta\tau^2\iota_G^2\|z - z^*\|^2}\,\mathrm{d}z \leq \left(\frac{\pi}{\beta\tau^2\iota_G^2}\right)^{d/2}.$$

Putting the above results together, we can get

$$\lambda \leq \max_{z\in\mathcal{D}}\frac{\mu_0(\mathrm{d}z)}{\pi(\mathrm{d}z)} \leq \bigg(\frac{\pi}{\beta\tau^2\iota_G^2}\bigg)^{d/2}\frac{e^{\beta L\|z^*\|_2^2}}{\Gamma} = e^{O(d)},$$

and conclude the proof. ∎

4.4 Completing the proof

Proof of Theorem 1.

Consider a random draw $\widehat{Z}$ from $\mu_k$ and another $\widehat{Z}^*$ from $\pi$. We have

$$\mathbb{E}[F(\widehat{Z})] = \left(\mathbb{E}[F(\widehat{Z})] - \mathbb{E}[F(\widehat{Z}^*)]\right) + \mathbb{E}[F(\widehat{Z}^*)].$$

We will first give a crude bound for the second term $\mathbb{E}[F(\widehat{Z}^*)]$, following the idea from Raginsky et al. (2017):

$$\mathbb{E}[F(\widehat{Z}^*)] = \int_{\mathcal{D}}F(z)\,\pi(\mathrm{d}z) \leq \mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta}{d}\right).$$

The detailed proof is given in Appendix D. The first term is related to the convergence of $\mu_k$ to $\pi$ in total variation shown in Theorem 2. Notice that $F(z)\leq 2R\|A\|\kappa_G$ for all $z\in\mathcal{D}$, due to the Lipschitz property of the generative network $G$. Moreover, by Theorem 2, we have $\|\mu_k - \pi\|_{TV}\leq\epsilon'$ for any $\epsilon' > 0$ and a sufficiently large $k$. Hence, the first term is upper bounded by

$$\left|\int_{\mathcal{D}}F(z)\,\mu_k(\mathrm{d}z) - \int_{\mathcal{D}}F(z)\,\pi(\mathrm{d}z)\right| \leq 2R\|A\|\kappa_G\,\|\mu_k - \pi\|_{TV} \leq 2R\|A\|\kappa_G\,\epsilon'.$$

Given the target error $\epsilon$, choose $\epsilon' = \epsilon/(2R\|A\|\kappa_G)$. By Lemma 4.4, we have $\lambda = e^{O(d)}$. Then, for

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^3\beta^2}{\rho^4\epsilon^2}\right),$$

we have

$$\mathbb{E}[F(\widehat{Z})] \leq \epsilon + \mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta}{d}\right).$$

Therefore, we complete the proof of our main result. ∎

5 Experimental results

While we emphasize that the primary focus of our paper is theoretical, we corroborate our theory with representative experimental results on MNIST and CIFAR-10. Even though our analysis requires the latent vector to lie in a bounded domain (a $d$-dimensional Euclidean ball), empirical results demonstrate that our approach works without this restriction.

Figure 1: [MNIST] selected base digit $G(z^*)$, evaluating (a) dissipativity, (b) condition (5.1); [CIFAR] selected base image $G(z^*)$, evaluating (c) dissipativity, (d) condition (5.1).

5.1 Validation of strong smoothness

As mentioned above, our theory relies on the assumption that the following condition holds for some constants $\alpha > 0$, $\gamma\geq 0$ and all $z, z'\in\mathcal{D}$ for a domain $\mathcal{D}$. Here, we take $\mathcal{D} = \mathbb{R}^d$.

$$\langle G(z) - G(z'),\, \nabla_z G(z)(z - z')\rangle \geq \alpha\|z - z'\|^2 - \gamma.$$

To estimate these constants, we generate samples $z$ and $z'$ from $\mathcal{N}(0,\mathbb{I})$. To establish $\alpha$ and $\gamma$, we perform experiments on two different datasets: (i) MNIST (Net1) and (ii) CIFAR10 (Net2). For both datasets, we compute the terms $u(z,z') = \langle\nabla_z G(z)^\top(G(z) - G(z')),\, z - z'\rangle$ and $v(z,z') = \|z - z'\|^2$ for 500 different instantiations of $z$ and $z'$. We then plot these pairs of $(\alpha v - \gamma,\, u)$ samples for different $z$'s and $z'$'s and compute the values of $\alpha$ and $\gamma$ by a simple linear program. We do this experiment for two DCGAN generators trained on MNIST (Figure 1 (a)) as well as on CIFAR10 (Figure 1 (c)).
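A sketch of this check for a PyTorch generator `G` (assumed here to map a flat latent vector to a flat output; images would be flattened first) is shown below. The vector-Jacobian product computes $\nabla_z G(z)^\top(G(z) - G(z'))$, and, as a crude stand-in for the linear program used to fit $(\alpha, \gamma)$, the helper simply reports the lower-envelope slope $\min_i u_i/v_i$ with $\gamma = 0$.

```python
import torch

def dissipativity_samples(G, d, num_pairs=500):
    """Collect (u, v) = (<J_G(z)^T (G(z) - G(z')), z - z'>, ||z - z'||^2) pairs."""
    us, vs = [], []
    for _ in range(num_pairs):
        z = torch.randn(d, requires_grad=True)
        z_prime = torch.randn(d)
        diff_out = (G(z) - G(z_prime)).detach()
        # Vector-Jacobian product: J_G(z)^T (G(z) - G(z')).
        vjp, = torch.autograd.grad(G(z), z, grad_outputs=diff_out)
        us.append(torch.dot(vjp, (z - z_prime).detach()).item())
        vs.append(((z - z_prime) ** 2).sum().item())
    return torch.tensor(us), torch.tensor(vs)

def estimate_alpha(G, d):
    """Crude estimate: gamma = 0 and alpha = min_i u_i / v_i
    (the paper instead fits (alpha, gamma) via a linear program)."""
    u, v = dissipativity_samples(G, d)
    return (u / v).min().item()
```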

Similarly, for the compressed sensing case, we also derive values $\alpha_A$ and $\gamma_A$, where a compressive matrix $A$ acts on the output of the generator $G$. Here, we have picked the number of measurements $m = 0.1n$, where $n$ is the signal dimension. This is encapsulated in the following equation:

$$\langle\nabla_z(AG(z))^\top(AG(z) - AG(z')),\, z - z'\rangle \geq \alpha_A\|z - z'\|^2 - \gamma_A \tag{5.1}$$

for all the sampled Gaussian matrices $A$ and different instantiations of $z$ and $z'$. Here, we capture the left side of the inequality in $u(z,z') = \langle\nabla_z(AG(z))^\top(AG(z) - AG(z')),\, z - z'\rangle$. We similarly plot points $(\alpha_A v(z,z') - \gamma_A,\, u(z,z'))$ for all pairs $(z, z')$. The scatter plot is generated for 500 different instantiations of $z$ and $z'$ and 5 different instantiations of $A$. We do this experiment for two DCGAN generators, one trained on MNIST (Figure 1 (b)) and the other trained on CIFAR10 (Figure 1 (d)). These experiments indicate that the dissipativity constant $\alpha$ is positive in all cases.

5.2 Comparison of SGLD against GD

Figure 2: [MNIST] Comparing the recovery performance of SGLD and GD at $m = 0.2n$ measurements. Panels: (a) Ground truth, (b) Initial, (c) GD (MSE = 0.0447), (d) SGLD (MSE = 0.0275).

We test SGLD reconstruction using the update rule in (3.2), and compare it against optimizing the updates of $z$ using standard gradient descent as in Bora et al. (2017b). For all experiments, we use a pre-trained DCGAN generator with the following network configuration: the generator consists of four layers, each comprising a transposed convolution, batch normalization and ReLU activation, followed by a final layer with a transposed convolution and $\tanh$ activation Radford et al. (2015).
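For reference, a generator of this type can be written in PyTorch roughly as follows; the channel widths, kernel sizes and the 64×64 output resolution are illustrative defaults in the spirit of Radford et al. (2015), not the exact configuration used in our experiments. The latent vector is reshaped to a $(d, 1, 1)$ tensor before being fed to the first transposed convolution.

```python
import torch.nn as nn

def dcgan_generator(latent_dim=100, base_channels=64, out_channels=3):
    """DCGAN-style generator: four (ConvTranspose2d, BatchNorm2d, ReLU) blocks
    followed by a ConvTranspose2d + Tanh output layer (64x64 images)."""
    c = base_channels
    return nn.Sequential(
        nn.ConvTranspose2d(latent_dim, c * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(c * 8), nn.ReLU(True),
        nn.ConvTranspose2d(c * 8, c * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(c * 4), nn.ReLU(True),
        nn.ConvTranspose2d(c * 4, c * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(c * 2), nn.ReLU(True),
        nn.ConvTranspose2d(c * 2, c, 4, 2, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(True),
        nn.ConvTranspose2d(c, out_channels, 4, 2, 1, bias=False), nn.Tanh(),
    )
```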

We display the reconstructions on MNIST in Figure 2. Note that the implementation in Bora et al. (2017b) requires 10 random restarts for CS reconstruction, and the authors report the results corresponding to the best reconstruction. This suggests that the standard implementation is likely to get stuck in bad local minima or saddle points. For the sake of fair comparison, we fix the same random initialization of the latent vector $z$ for both GD and SGLD, with no restarts. We select $m = 0.2n$. In Figure 2 we show reconstructions for 16 different examples, which were all reconstructed at once using the same $k = 2000$ steps, learning rate $\eta = 0.02$ and inverse temperature $\beta = 1$ for both approaches. The only difference is the additional noise term in SGLD (Figure 2, part (d)). Notice that this additional noise component helps achieve better reconstruction performance overall, as compared to simple gradient descent.

Phase transition plots scanning a range of compression ratios $m/n$, as well as example reconstructions on CIFAR-10 images, can be found below (Section 5.3). More thorough empirical comparisons with PGD-based approaches Shah & Hegde (2018); Raj et al. (2019) are deferred to future work.

5.3 Reconstructions for CIFAR10

We display the reconstructions on CIFAR10 in Figure 3. As with the implementation for MNIST, for the sake of fair comparison, we fix the same random initialization of the latent vector $z$ for both GD and SGLD, with no restarts. We select $m = 0.3n$. In Figure 3 we show reconstructions for 16 different examples from CIFAR10, which were all reconstructed at once using the same $k = 2000$ steps, learning rate $\eta = 0.05$ and inverse temperature $\beta = 1$ for both approaches. The only difference is the additional noise term in SGLD (Figure 3, part (d)). Similar to our experiments on MNIST, we notice that this additional noise component helps achieve better reconstruction performance overall, as compared to simple gradient descent.

Figure 3: [CIFAR10] Comparing the recovery performance of SGLD and GD at $m = 0.3n$ measurements. Panels: (a) Ground truth, (b) Initial, (c) GD (MSE = 0.0248), (d) SGLD (MSE = 0.0246).

Next, we plot phase transition diagrams by scanning the compression ratio $f = m/n = [0.2, 0.4, 0.6, 0.8, 1.0]$ for the MNIST dataset in Figure 4. For this experiment, we have chosen 5 different instantiations of the sampling matrix $A$ for each compression ratio $f$. In Figure 4 we report the average mean squared error (MSE) of the reconstruction, $\|\hat{x} - x\|^2$, over the 5 different instances of $A$.

Figure 4: Phase transition plots representing the average MSE (log scale) of the reconstructed image versus the compression ratio $f$, using gradient descent and stochastic gradient Langevin dynamics.

We conclude that SGLD gives improved reconstruction quality as compared to GD.

References

  • Asim et al. (2019) Asim, M., Ahmed, A., and Hand, P. Invertible generative models for inverse problems: mitigating representation error and dataset bias. arXiv preprint arXiv:1905.11672, 2019.
  • Bakry et al. (2008) Bakry, D., Barthe, F., Cattiaux, P., Guillin, A., et al. A simple proof of the poincaré inequality for a large class of probability measures. Electronic Communications in Probability, 13:60–66, 2008.
  • Baraniuk et al. (2010) Baraniuk, R., Cevher, V., Duarte, M., and Hegde, C. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982–2001, 2010.
  • Bora et al. (2017a) Bora, A., Jalal, A., Price, E., and Dimakis, A. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  537–546. JMLR. org, 2017a.
  • Bora et al. (2017b) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  537–546. JMLR. org, 2017b.
  • Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
  • Chang et al. (2017) Chang, J., Li, C., Póczos, B., and Kumar, B. One network to solve them all—solving linear inverse problems using deep projection models. In 2017 IEEE International Conference on Computer Vision (ICCV), pp.  5889–5898. IEEE, 2017.
  • Chen et al. (2001) Chen, S., Donoho, D., and Saunders, M. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, 2017.
  • Dong et al. (2016) Dong, C., Loy, C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
  • Donoho (2006) Donoho, D. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • Hale (1990) Hale, J. Asymptotic behavior of dissipative systems. Bull. Am. Math. Soc, 22:175–183, 1990.
  • Hand & Voroninski (2018) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. In Conference On Learning Theory, pp.  970–978, 2018.
  • Hand & Voroninski (2019) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory, 66(1):401–418, 2019.
  • Jagatap & Hegde (2019) Jagatap, G. and Hegde, C. Algorithmic guarantees for inverse imaging with untrained network priors. In Advances in Neural Information Processing Systems, 2019.
  • Jalal et al. (2020) Jalal, A., Karmalkar, S., Dimakis, A. G., and Price, E. Compressed sensing with approximate priors via conditional resampling. Preprint, 2020.
  • Ji & Carin (2007) Ji, S. and Carin, L. Bayesian compressive sensing and projection optimization. In Proceedings of the 24th international conference on Machine learning, pp.  377–384, 2007.
  • Latorre et al. (2019) Latorre, F., Eftekhari, A., and Cevher, V. Fast and provable admm for learning with generative priors. In Advances in Neural Information Processing Systems, pp. 12004–12016, 2019.
  • Lee & Vempala (2018) Lee, Y. T. and Vempala, S. S. Convergence rate of riemannian hamiltonian monte carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp.  1115–1121, 2018.
  • Lei et al. (2019) Lei, Q., Jalal, A., Dhillon, I. S., and Dimakis, A. G. Inverting deep generative models, one layer at a time. In Advances in Neural Information Processing Systems, pp. 13910–13919, 2019.
  • Lindgren et al. (2020) Lindgren, E. M., Whang, J., and Dimakis, A. G. Conditional sampling from invertible generative models with applications to inverse problems. arXiv preprint arXiv:2002.11743, 2020.
  • Liu & Scarlett (2020) Liu, Z. and Scarlett, J. Information-theoretic lower bounds for compressive sensing with generative models. IEEE Journal on Selected Areas in Information Theory, 2020.
  • Lovász & Vempala (2007) Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
  • Lovász et al. (1993) Lovász, L. et al. Random walks on graphs: A survey. Combinatorics, Paul erdos is eighty, 2(1):1–46, 1993.
  • Mousavi & Baraniuk (2017) Mousavi, A. and Baraniuk, R. Learning to invert: Signal recovery via deep convolutional networks. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.  2272–2276. IEEE, 2017.
  • Needell & Tropp (2009) Needell, D. and Tropp, J. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3):301–321, 2009.
  • Ongie et al. (2020) Ongie, G., Jalal, A., Metzler, C., Baraniuk, R., Dimakis, A., and Willett, R. Deep learning techniques for inverse problems in imaging. arXiv preprint arXiv:2005.06001, 2020.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
  • Raj et al. (2019) Raj, A., Li, Y., and Bresler, Y. Gan-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp.  5602–5611, 2019.
  • Shah & Hegde (2018) Shah, V. and Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  4609–4613. IEEE, 2018.
  • Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  9446–9454, 2018.
  • Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688, 2011.
  • Whang et al. (2020) Whang, J., Lei, Q., and Dimakis, A. Compressed sensing with invertible generative models and dependent noise. arXiv preprint arXiv:2003.08089, 2020.
  • White (2016) White, T. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.
  • Y. Wu (2019) Wu, Y., Rosca, M., and Lillicrap, T. Deep compressed sensing. arXiv preprint arXiv:1905.06723, 2019.
  • Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pp.  1980–2022. PMLR, 2017.
  • Zou et al. (2020) Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient langevin dynamics for non-log-concave sampling. arXiv preprint arXiv:2010.09597, 2020.

Appendix A Conditions on the generator network

Proposition A.1.

Suppose $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $\mathcal{D}$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G(z)$ satisfies Assumptions (A.2) & (A.3) with constants $\iota_G, \kappa_G, M$, and if $2\iota_G^2 > M\kappa_G$, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.

Proof.

The proof proceeds similarly to Latorre et al. (2019), Appendix B. Since $G(z)$ is a composition of linear maps followed by $C^1$ activation functions, $G(z)$ is continuously differentiable. As a result, the Jacobian $\nabla_z G$ is a continuous matrix-valued function, and its restriction to the compact domain $\mathcal{D}\subseteq\mathbb{R}^{d}$ is Lipschitz-continuous. Therefore, there exists $M\geq 0$ such that

$$\|\nabla_z G(z) - \nabla_z G(z')\| \leq M\|z - z'\|,\qquad\forall z, z'\in\mathcal{D}. \tag{A.1}$$

Thus, Assumption (A.3) holds. Assumption (A.2) is also satisfied according to Latorre et al. (2019), Lemma 5. To show strong smoothness, we use the fundamental theorem of calculus together with the Lipschitzness of $G(z)$ obtained from Assumption (A.2). For every $z, z'\in\mathcal{D}$, and $u(t) = tz + (1 - t)z'$:

$$\begin{aligned}
\langle G(z) - G(z'),\, &\nabla_z G(z)(z - z')\rangle\\
&= \|G(z) - G(z')\|^2 - \langle G(z) - G(z'),\, G(z) - G(z') - \nabla_z G(z)(z - z')\rangle\\
&= \|G(z) - G(z')\|^2 - \int_0^1\langle G(z) - G(z'),\, \big(\nabla_z G(u(t)) - \nabla_z G(z)\big)(z - z')\rangle\,\mathrm{d}t\\
&\geq \iota_G^2\|z - z'\|^2 - \kappa_G M\|z - z'\|^2\int_0^1(1 - t)\,\mathrm{d}t\\
&= \Big(\iota_G^2 - \frac{\kappa_G M}{2}\Big)\|z - z'\|^2,
\end{aligned}$$

where in the last step we use the near-isometry and the Lipschitzness of $\nabla_z G(z)$ obtained above. Consequently, $G(z)$ is $(\iota_G^2 - \frac{\kappa_G M}{2},\, 0)$-strongly smooth if $\iota_G^2 > \frac{\kappa_G M}{2}$. ∎

Lemma A.1 (Measurement complexity).

Let G(z):𝒟dnG(z)\mathrel{\mathop{\mathchar 58\relax}}\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n} be a feed-forward neural network that satisfies the conditions in Proposition 3.1. Let LL be its Lipschitz constant. If the number of measurements mm satisfies:

m=Ω(dδ2log(κG/γ)),m=\Omega\left(\frac{d}{\delta^{2}}\log(\kappa_{G}/\gamma)\right)\,,

for some small constant \delta>0, and that the entries of A are drawn i.i.d. according to \mathcal{N}(0,\frac{1}{m}). Then the loss function F(z) is (\alpha-\delta\kappa_{G}^{2},\gamma)-dissipative with probability at least 1-\exp(-\Omega(m\delta^{2})).

Proof.

By Proposition A.1, there exist \alpha>0 and \gamma\geq 0 such that G(z) is (\alpha,\gamma)-strongly smooth. Now, note that the left-hand side of (3.6) simplifies to

zz,zF(z)\displaystyle\langle z-z^{*},\nabla_{z}F(z)\rangle =A(G(z)G(z)),AzG(z)(zz),\displaystyle=\left\langle A(G(z)-G(z^{*})),A\nabla_{z}G(z)(z-z^{*})\right\rangle, (A.2)

Denote u=G(z)G(z)u=G(z)-G(z^{*}) and v=zG(z)(zz)v=\nabla_{z}G(z)(z-z^{*}), then

zz,zF(z)=Au,Av=u,v(𝕀AA)u,v.\langle z-z^{*},\nabla_{z}F(z)\rangle=\langle Au,Av\rangle=\langle u,v\rangle-\langle(\mathbb{I}-A^{\top}A)u,v\rangle.

Using a standard result in random matrix theory, we have P(\|\mathbb{I}-A^{\top}A\|\geq\delta)\leq\exp(-m\delta^{2}). Also, \|u\|,\|v\|\leq\kappa_{G}\|z-z^{*}\|. Therefore,

\langle z-z^{*},\nabla_{z}F(z)\rangle\geq\langle u,v\rangle-\delta\kappa_{G}^{2}\|z-z^{*}\|^{2}.

For m=\Omega\left(\frac{d}{\delta^{2}}\log(\kappa_{G}/\gamma)\right), we then have

\langle z-z^{*},\nabla_{z}F(z)\rangle\geq(\alpha-\delta\kappa_{G}^{2})\|z-z^{*}\|^{2}-\gamma,

with probability at least 1-\exp(-\Omega(m\delta^{2})). That is, the loss function F(z) is (\alpha-\delta\kappa_{G}^{2},\gamma)-dissipative with the claimed probability. ∎
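
The dissipativity inequality can likewise be probed empirically for a random Gaussian measurement matrix. The following sketch is an illustration only, with assumed toy dimensions and a hypothetical tanh generator rather than the networks used in our experiments; it evaluates \langle z-z^{*},\nabla_{z}F(z)\rangle/\|z-z^{*}\|^{2} for F(z)=\frac{1}{2}\|y-AG(z)\|^{2} over random latent points.

```python
import numpy as np

# Empirical look at the dissipativity quantity <z - z*, grad F(z)> for
# F(z) = 0.5 * ||y - A G(z)||^2 with Gaussian A. Illustration only.
rng = np.random.default_rng(1)
d, h, n, m = 5, 20, 40, 100
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(n, h)) / np.sqrt(h)
A = rng.normal(size=(m, n)) / np.sqrt(m)   # entries drawn as N(0, 1/m)

def G(z):
    return W2 @ np.tanh(W1 @ z)

def jac_G(z):
    s = 1.0 - np.tanh(W1 @ z) ** 2
    return W2 @ (s[:, None] * W1)

z_star = rng.uniform(-1, 1, size=d)
y = A @ G(z_star)                          # noiseless measurements

def grad_F(z):
    # grad F(z) = -J_G(z)^T A^T (y - A G(z))
    return -jac_G(z).T @ A.T @ (y - A @ G(z))

vals = []
for _ in range(2000):
    z = rng.uniform(-1, 1, size=d)
    vals.append(np.dot(z - z_star, grad_F(z)) / np.linalg.norm(z - z_star) ** 2)

# Dissipativity predicts this ratio stays above a constant, up to a
# gamma / ||z - z*||^2 offset, with high probability over A.
print("min of <z - z*, grad F(z)> / ||z - z*||^2:", min(vals))
```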

Appendix B Properties of F(z)F(z)

In this part, we establish some key properties of the loss function F(z). We use Assumptions (A.1)–(A.3) on boundedness, near-isometry, and Lipschitz gradients to obtain an upper bound on \|\nabla_{z}F(z)\| and the smoothness of F(z).

Lemma B.1 (Lipschitzness of F(z)F(z)).

We have zF(z)κG2AAzz\|\nabla_{z}F(z)\|\leq\kappa_{G}^{2}\|A^{\top}A\|\|z-z^{*}\| for any z𝒟dz\in\mathcal{D}\subset\mathbb{R}^{d}.

Proof.

Recall the gradient of F(z)F(z):

zF(z)\displaystyle\nabla_{z}F(z) =(zG(z))A(yAG(z))=(zG(z))AA(G(z)G(z)).\displaystyle=-(\nabla_{z}G(z))^{\top}A^{\top}(y-AG(z))=-(\nabla_{z}G(z))^{\top}A^{\top}A(G(z^{*})-G(z)).

It follows from the Lipschitz assumption (A.2) that \|G(z^{*})-G(z)\|\leq\kappa_{G}\|z-z^{*}\| and \|\nabla_{z}G(z)\|\leq\kappa_{G}. Therefore,

\|\nabla_{z}F(z)\|\leq\kappa_{G}^{2}\|A^{\top}A\|\|z-z^{*}\|. ∎
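
Since the remaining lemmas rely on this closed-form gradient, a finite-difference comparison is a cheap way to validate an implementation. The sketch below is our own illustration with a hypothetical toy generator and placeholder names; it checks \nabla_{z}F(z)=-(\nabla_{z}G(z))^{\top}A^{\top}(y-AG(z)) against central differences.

```python
import numpy as np

# Finite-difference check of grad F(z) = -J_G(z)^T A^T (y - A G(z))
# for F(z) = 0.5 * ||y - A G(z)||^2. Toy generator, illustration only.
rng = np.random.default_rng(2)
d, h, n, m = 4, 10, 20, 30
W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(n, h))
A = rng.normal(size=(m, n)) / np.sqrt(m)

def G(z):
    return W2 @ np.tanh(W1 @ z)

def jac_G(z):
    s = 1.0 - np.tanh(W1 @ z) ** 2
    return W2 @ (s[:, None] * W1)

z_star = rng.normal(size=d)
y = A @ G(z_star)

def F(z):
    r = y - A @ G(z)
    return 0.5 * r @ r

def grad_F(z):
    return -jac_G(z).T @ A.T @ (y - A @ G(z))

z = rng.normal(size=d)
eps = 1e-6
num = np.array([(F(z + eps * e) - F(z - eps * e)) / (2 * eps) for e in np.eye(d)])
print("max abs error vs central differences:", np.max(np.abs(num - grad_F(z))))
```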

Lemma B.2 (Smoothness of F(z)F(z)).

For any z,z𝒟dz,z^{\prime}\in\mathcal{D}\subset\mathbb{R}^{d}, we have

zF(z)zF(z)(MB+κG2)AAzz.\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\|\leq(MB+\kappa_{G}^{2})\|A^{\top}A\|\|z-z^{\prime}\|.
Proof.

We use the boundedness assumption (A.1), which gives \|G(z^{*})\|\leq B, together with the Lipschitzness and smoothness assumptions on G(z). We decompose the difference as

zF(z)zF(z)\displaystyle\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\| (zG(z)zG(z))AAG(z)\displaystyle\leq\|(\nabla_{z}G(z^{\prime})-\nabla_{z}G(z))^{\top}A^{\top}AG(z^{*})\|
+(zG(z))AA(G(z)G(z))\displaystyle\qquad+\|(\nabla_{z}G(z))^{\top}A^{\top}A(G(z)-G(z^{\prime}))\|
+(zG(z)zG(z))AAG(z)\displaystyle\qquad+\|(\nabla_{z}G(z)-\nabla_{z}G(z^{\prime}))^{\top}A^{\top}AG(z^{\prime})\|

Then, using the boundedness, Lipschitzness and smoothness, we arrive at:

\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\|\leq(MB+\kappa_{G}^{2})\|A^{\top}A\|\|z-z^{\prime}\|.

Therefore, F(z)F(z) is LL-smooth, with L=(MB+κG2)AAL=(MB+\kappa_{G}^{2})\|A^{\top}A\|. ∎

Appendix C Conductance Analysis

In this section, we provide the proofs of Lemmas 4.1 and 4.3 based on the conductance analysis laid out in Zhang et al. (2017) and, similarly, in Zou et al. (2020). The proof of Lemma 4.2 follows directly from Lemma 6.3 of Zou et al. (2020).

Proof of Lemma 4.1.

We use the same idea as in Lemma 3 of Zhang et al. (2017) and Lemma 6.1 of Zou et al. (2020). The main difference is that Algorithm 1 uses the full gradient \nabla_{z}F(z) instead of a stochastic mini-batch gradient, which slightly simplifies the proof of this lemma.

We consider two cases: u\not\in\mathcal{A} and u\in\mathcal{A}. Once the first case is proved, the second follows easily by splitting \mathcal{A} into \{u\} and \mathcal{A}\backslash\{u\} and applying the result of the first case. For a detailed treatment of the latter case, we refer the reader to the proof of Lemma 6.1 in Zou et al. (2020).

Suppose now that u\notin\mathcal{A}. Then we have

𝒯u(𝒜)=𝒜(u,r)𝒯u(w)dw=𝒜(u,r)αu(w)𝒯u(w)dw.\displaystyle\mathcal{T}^{\star}_{u}(\mathcal{A})=\int_{\mathcal{A}\cap\mathcal{B}(u,r)}\mathcal{T}^{\star}_{u}(w){\mathrm{d}}w=\int_{\mathcal{A}\cap\mathcal{B}(u,r)}\alpha_{u}(w)\mathcal{T}_{u}(w){\mathrm{d}}w. (C.1)

where \alpha_{u}(w) is the Metropolis–Hastings acceptance ratio. It suffices to show that \alpha_{u}(w)\geq 1-\delta/2 for all w\in\mathcal{D}\cap\mathcal{B}(u,r), which implies

(1δ/2)𝒯u(𝒜)𝒯u(𝒜)𝒯u(𝒜).\displaystyle(1-\delta/2)\mathcal{T}_{u}(\mathcal{A})\leq\mathcal{T}^{\star}_{u}(\mathcal{A})\leq\mathcal{T}_{u}(\mathcal{A}).

The right-hand inequality is immediate from the definition of \alpha_{u}(w) (an acceptance ratio is at most one), and we can ensure \delta\leq 1/2 by choosing \eta sufficiently small. It remains to show that

𝒯w(u)𝒯u(w)exp(β(F(w)F(u)))1δ/2.\displaystyle\frac{\mathcal{T}_{w}(u)}{\mathcal{T}_{u}(w)}\cdot\exp(-\beta(F(w)-F(u)))\geq 1-\delta/2. (C.2)

Using the definition of \mathcal{T}_{u}(w), this inequality becomes

exp(wu+ηg(u)224η/βuw+ηg(w)224η/β)exp(β(F(w)F(u)))1δ/2.\displaystyle\exp\bigg{(}\frac{\|w-u+\eta g(u)\|_{2}^{2}}{4\eta/\beta}-\frac{\|u-w+\eta g(w)\|_{2}^{2}}{4\eta/\beta}\bigg{)}\exp(-\beta(F(w)-F(u)))\geq 1-\delta/2.

Note that g(z)=\nabla_{z}F(z). Expanding the squares in the first exponent and combining with the second exponent, the logarithm of the left-hand side becomes

β(F(w)F(u)12wu,zF(w)+zF(u))+ηβ4(zF(u)2zF(w)2).\displaystyle-\beta\left(F(w)-F(u)-\frac{1}{2}\langle w-u,\nabla_{z}F(w)+\nabla_{z}F(u)\rangle\right)+\frac{\eta\beta}{4}(\|\nabla_{z}F(u)\|^{2}-\|\nabla_{z}F(w)\|^{2}). (C.3)

To lower bound this quantity, we appeal to the smoothness of F(z). Specifically, by Lemmas B.1 and B.2, F(z) is L-smooth and \|\nabla_{z}F(z)\|\leq D on the bounded domain \mathcal{D}, with L=(MB+\kappa_{G}^{2})\|A^{\top}A\| and D=\kappa_{G}^{2}\|A^{\top}A\| (up to the diameter of \mathcal{D}). Then,

F(w)\displaystyle F(w) F(u)+wu,F(u)+Lwu222,\displaystyle\leq F(u)+\langle w-u,\nabla F(u)\rangle+\frac{L\|w-u\|_{2}^{2}}{2},
F(u)\displaystyle F(u) F(w)+uw,F(w)Lwu222.\displaystyle\geq F(w)+\langle u-w,\nabla F(w)\rangle-\frac{L\|w-u\|_{2}^{2}}{2}.

This directly implies that

\displaystyle\big{|}F(w)-F(u)-\frac{1}{2}\langle w-u,\nabla F(w)+\nabla F(u)\rangle\big{|}\leq\frac{L\|w-u\|_{2}^{2}}{2}. (C.4)

Moreover,

|zF(u)22zF(w)22|\displaystyle\big{|}\|\nabla_{z}F(u)\|_{2}^{2}-\|\nabla_{z}F(w)\|_{2}^{2}\big{|} F(u)F(w)2F(u)+F(w)2\displaystyle\leq\|\nabla F(u)-\nabla F(w)\|_{2}\cdot\|\nabla F(u)+\nabla F(w)\|_{2}
2LDwu2.\displaystyle\leq 2LD\|w-u\|_{2}. (C.5)

Combining (C.4) and (C.5) with (C.3), together with w\in\mathcal{B}(u,r) and r=\sqrt{10\eta d/\beta}, we obtain

LHS of (C.3) Lβwu22ηβLDwu2\displaystyle\geq-\frac{L\beta\|w-u\|^{2}}{2}-\frac{\eta\beta LD\|w-u\|}{2}
\displaystyle\geq-5Ld\eta-5LDd^{1/2}\beta^{1/2}\eta^{3/2}.

Picking \delta/2=5Ld\eta+5LDd^{1/2}\beta^{1/2}\eta^{3/2} and using the fact that e^{-x}\geq 1-x for x\geq 0 completes the proof. ∎

Next, we lower bound the conductance \phi of \mathcal{T}^{\star}_{u}(\cdot) using ideas from Lee & Vempala (2018) and Zou et al. (2020), first restating the following lemma:

Lemma C.1 (Lemma 13 in Lee & Vempala (2018)).

Let \mathcal{T}^{\star}_{u}(\cdot) be a time-reversible Markov chain on \mathcal{D} with stationary distribution \pi. Suppose there exists a fixed \Delta>0 such that for any u,v\in\mathcal{D} with \|u-v\|_{2}\leq\Delta we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. Then the conductance of \mathcal{T}^{\star}_{u}(\cdot) satisfies \phi\geq C\rho\Delta for some constant C>0, where \rho is the Cheeger constant of \pi.

Proof of Lemma 4.3.

To apply Lemma C.1, we follow the same approach as Zou et al. (2020) and reuse some of their results without proof. To this end, we show that for some \Delta and any pair u,v\in\mathcal{D} with \|u-v\|_{2}\leq\Delta, we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. Recall that the distribution of the iterate z after one step of standard SGLD, without the accept/reject step in (4.1), is

P(z|u)=1(4πη/β)d/2exp(zu+ηg(u)224η/β)\displaystyle P(z|u)=\frac{1}{(4\pi\eta/\beta)^{d/2}}\exp\bigg{(}-\frac{\|z-u+\eta g(u)\|_{2}^{2}}{4\eta/\beta}\bigg{)}

Since Algorithm 1 accepts the candidate only if it falls in the region 𝒟(u,r)\mathcal{D}\cap\mathcal{B}(u,r), the acceptance probability is

p(u)=zP(|u)[z𝒟(u,r)].\displaystyle p(u)=\mathbb{P}_{z\sim P(\cdot|u)}\big{[}z\in\mathcal{D}\cap\mathcal{B}(u,r)\big{]}.

Therefore, the transition probability 𝒯u(z)\mathcal{T}^{\star}_{u}(z) for z𝒟(u,r)z\in\mathcal{D}\cap\mathcal{B}(u,r) is given by

𝒯u(z)=2p(u)+p(u)(1αu(z))2δu(z)+αu(z)2P(z|u)𝟏[z𝒟(u,r)].\displaystyle\mathcal{T}^{\star}_{u}(z)=\frac{2-p(u)+p(u)(1-\alpha_{u}(z))}{2}\delta_{u}(z)+\frac{\alpha_{u}(z)}{2}P(z|u)\cdot{\mathbf{1}}[z\in\mathcal{D}\cap\mathcal{B}(u,r)].

Take u,v\in\mathcal{D} and let \mathcal{S}_{u}=\mathcal{D}\cap\mathcal{B}(u,r) and \mathcal{S}_{v}=\mathcal{D}\cap\mathcal{B}(v,r). By the definition of the total variation distance, there exists a set \mathcal{A}\subseteq\mathcal{D} such that

𝒯u()𝒯v()TV\displaystyle\|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV} =|𝒯u(𝒜)𝒯v(𝒜)|\displaystyle=|\mathcal{T}^{\star}_{u}(\mathcal{A})-\mathcal{T}^{\star}_{v}(\mathcal{A})|
maxu,z[2p(u)+p(u)(1αu(z))2]I1\displaystyle\leq\underbrace{\max_{u,z}\bigg{[}\frac{2-p(u)+p(u)(1-\alpha_{u}(z))}{2}\bigg{]}}_{I_{1}}
+12|z𝒜αu(z)P(z|u)𝟏(z𝒮u)αv(z)P(z|v)𝟏(z𝒮v)dz|I2.\displaystyle+\frac{1}{2}\underbrace{\bigg{|}\int_{z\in\mathcal{A}}\alpha_{u}(z)P(z|u){\mathbf{1}}(z\in\mathcal{S}_{u})-\alpha_{v}(z)P(z|v){\mathbf{1}}(z\in\mathcal{S}_{v}){\mathrm{d}}z\bigg{|}}_{I_{2}}.

Since our mini-batch is the full set of measurements (i.e., we use the full gradient), we can reuse the bounds on I_{1} and I_{2} from Lemmas C.4 and C.5 of Zou et al. (2020). Consequently,

𝒯u()𝒯v()TVI1+I2/20.85+0.1δ+βuv22η.\displaystyle\|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq I_{1}+I_{2}/2\leq 0.85+0.1\delta+\frac{\sqrt{\beta}\|u-v\|_{2}}{\sqrt{2\eta}}.

By Lemma 4.1, we have δ=10Ldη+10LDd1/2β1/2η3/212Ldη\delta=10Ld\eta+10LDd^{1/2}\beta^{1/2}\eta^{3/2}\leq 12Ld\eta if ηd25βD2\eta\leq\frac{d}{25\beta D^{2}}. Thus if

\displaystyle\eta\leq\frac{1}{25\beta D^{2}}\wedge\frac{1}{30Ld}\quad\mbox{and}\quad\|u-v\|_{2}\leq\frac{\sqrt{2\eta}}{10\sqrt{\beta}}\leq 0.1r,

we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. As a consequence of Lemma C.1, applied with \Delta=\frac{\sqrt{2\eta}}{10\sqrt{\beta}}, we obtain the lower bound on the conductance \phi of \mathcal{T}^{\star}_{u}(\cdot)

ϕc0ρη/β,\displaystyle\phi\geq c_{0}\rho\sqrt{\eta/\beta},

which finishes the proof. ∎
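
For concreteness, the lazy, ball-restricted, Metropolis-adjusted Langevin transition kernel analyzed in this appendix can be sketched in a few lines. The snippet below is a minimal illustration, assuming the Gaussian proposal density from (4.1), rejection of candidates that leave \mathcal{D}\cap\mathcal{B}(u,r), and a lazy self-loop with probability 1/2; the function names, the quadratic toy loss, and the parameter values are ours and only stand in for the generator-based loss F(z).

```python
import numpy as np

def mala_restricted_step(u, F, grad_F, eta, beta, r, R, rng):
    """One lazy, ball-restricted Metropolis-adjusted Langevin step (a sketch).

    Proposal: w = u - eta * grad_F(u) + sqrt(2 * eta / beta) * N(0, I),
    rejected if w leaves D ∩ B(u, r) with D = B(0, R); otherwise accepted with the
    Metropolis-Hastings ratio for the Gibbs target proportional to exp(-beta * F)."""
    if rng.random() < 0.5:                        # lazy self-loop
        return u
    w = u - eta * grad_F(u) + np.sqrt(2 * eta / beta) * rng.normal(size=u.shape)
    if np.linalg.norm(w) > R or np.linalg.norm(w - u) > r:
        return u                                  # candidate outside D ∩ B(u, r)
    def log_q(a, b):                              # log proposal density q(b | a)
        return -beta * np.sum((b - a + eta * grad_F(a)) ** 2) / (4.0 * eta)
    log_alpha = -beta * (F(w) - F(u)) + log_q(w, u) - log_q(u, w)
    if np.log(rng.random()) < min(0.0, log_alpha):
        return w
    return u

# Toy usage on a quadratic loss (a placeholder for the generator-based loss);
# the parameter values are illustrative, not the tuned values from the paper.
rng = np.random.default_rng(3)
d, beta, eta = 5, 50.0, 1e-3
R, r = 10.0, np.sqrt(10 * eta * d / beta)
z_opt = np.ones(d)
F = lambda z: 0.5 * np.sum((z - z_opt) ** 2)
grad_F = lambda z: z - z_opt
z = np.zeros(d)
for _ in range(20000):
    z = mala_restricted_step(z, F, grad_F, eta, beta, r, R, rng)
print("distance to minimizer after 20000 steps:", np.linalg.norm(z - z_opt))
```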

Appendix D Property of the Gibbs algorithm

Proposition D.1.

For 𝒟=(0,R)\mathcal{D}=\mathcal{B}(0,R), we have

𝒟F(z)π(dz)\displaystyle\int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z) 𝒪(dβlogβLd).\displaystyle\leq\mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta L}{d}\right).
Proof.

Let p(z)=e^{-\beta F(z)}/\Lambda denote the density of \pi, where \Lambda\triangleq\int_{\mathcal{D}}e^{-\beta F(z)}{\mathrm{d}}z is the partition function. We start by writing

𝒟F(z)π(dz)=1β(h(p)logΛ),\displaystyle\int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z)=\frac{1}{\beta}\left(h(p)-\log\Lambda\right), (D.1)

where

h(p)=-\int_{\mathcal{D}}p(z)\log p(z){\mathrm{d}}z=-\int_{\mathcal{D}}\frac{e^{-\beta F(z)}}{\Lambda}\log\frac{e^{-\beta F(z)}}{\Lambda}{\mathrm{d}}z

is the differential entropy of p. To upper-bound h(p), we use the fact that the differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment. Moreover, since p is supported on the Euclidean ball with radius R, its second moment is bounded by R^{2}. Therefore, we have

\displaystyle h(p)\leq h\left(\mathcal{N}\left(0,\tfrac{R^{2}}{d}\mathbb{I}\right)\right)=\frac{d}{2}\log\frac{2\pi eR^{2}}{d}. (D.2)

Next, we give a lower bound on the second term, \log\Lambda. We use the L-smoothness of F(z) and the fact that z^{*} minimizes F with F(z^{*})=0 and \nabla_{z}F(z^{*})=0, which give F(z)\leq\frac{L}{2}\|z-z^{*}\|^{2} for all z\in\mathcal{D}. As such,

\displaystyle\log\Lambda=\log\int_{\mathcal{D}}e^{-\beta F(z)}{\mathrm{d}}z\geq\log\int_{\mathcal{D}}e^{-\beta L\|z-z^{*}\|^{2}/2}{\mathrm{d}}z\gtrsim\frac{d}{2}\log\frac{2\pi}{\beta L}. (D.3)

Using (D.2) and (D.3) in (D.1) and simplifying, we prove the result. ∎
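
As a quick numerical sanity check of Proposition D.1 (an illustration under toy, assumed values, not a proof), one can integrate \int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z) for a one-dimensional quadratic loss restricted to an interval and compare it against the \frac{d}{\beta}\log\frac{\beta L}{d} scaling.

```python
import numpy as np

# 1-D numerical check that the Gibbs average of F is on the order of
# (d / beta) * log(beta * L / d) for a quadratic loss on D = [-R, R].
# Toy values; only the scaling in beta is being eyeballed here.
L_smooth, R, z_star = 2.0, 5.0, 0.3
z = np.linspace(-R, R, 200001)
F = 0.5 * L_smooth * (z - z_star) ** 2

for beta in [10.0, 100.0, 1000.0]:
    w = np.exp(-beta * F)                     # unnormalized Gibbs density on [-R, R]
    gibbs_avg = np.sum(F * w) / np.sum(w)     # E_pi[F] by Riemann-sum quadrature
    surrogate = (1.0 / beta) * np.log(beta * L_smooth)  # d = 1 stand-in for the bound
    print(f"beta={beta:7.1f}  E_pi[F]={gibbs_avg:.5f}  (d/beta)log(beta*L)={surrogate:.5f}")
```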