
Provable Compressed Sensing with Generative Priors via Langevin Dynamics

Thanh V. Nguyen, Gauri Jagatap, Chinmay Hegde Email: thanhng@iastate.edu, gbj221@nyu.edu, chinmay.h@nyu.edu. This work was partially done while TN was with the Electrical and Computer Engineering Department at Iowa State University. GJ and CH are currently with the Tandon School of Engineering at New York University. This work was supported in part by NSF grants CCF-2005804 and CCF-1815101.
Abstract

Deep generative models have emerged as a powerful class of priors for signals in various inverse problems such as compressed sensing, phase retrieval and super-resolution. Here, we assume an unknown signal to lie in the range of some pre-trained generative model. A popular approach for signal recovery is via gradient descent in the low-dimensional latent space. While gradient descent has achieved good empirical performance, its theoretical behavior is not well understood. In this paper, we introduce the use of stochastic gradient Langevin dynamics (SGLD) for compressed sensing with a generative prior. Under mild assumptions on the generative model, we prove the convergence of SGLD to the true signal. We also demonstrate empirical performance competitive with standard gradient descent.

1 Introduction

We consider the familiar setting of inverse problems where the goal is to recover an $n$-dimensional signal $x^*$ that is indirectly observed via a linear measurement operation $y = Ax^*$. The measurement vector can be noisy, and its dimension $m$ may be less than $n$. Several practical applications fit this setting, including super-resolution (Dong et al., 2016), in-painting, denoising Vincent et al. (2010), and compressed sensing Donoho (2006); Chang et al. (2017).

Since such an inverse problem is ill-posed in general, the recovery of $x^*$ from $y$ often requires assuming a low-dimensional structure or prior on $x^*$. Choices of good priors have been extensively explored in the past three decades, including sparsity Chen et al. (2001); Needell & Tropp (2009), structured sparsity Baraniuk et al. (2010), end-to-end training via convolutional neural networks Chang et al. (2017); Mousavi & Baraniuk (2017), pre-trained generative priors Bora et al. (2017a), as well as untrained deep image priors Ulyanov et al. (2018); Jagatap & Hegde (2019).

In this paper, we focus on a powerful class of priors based on deep generative models. The setup is the following: the unknown signal $x^*$ is assumed to lie in the range of some pre-trained generator network, obtained from (say) a generative adversarial network (GAN) or a variational autoencoder (VAE). That is, $x^* = G(z^*)$ for some $z^*$ in the latent space. The task is again to recover $x^*$ from (noisy) linear measurements.

Such generative priors have been shown to achieve high empirical success Chang et al. (2017); Bora et al. (2017a); Y. Wu (2019). However, progress on the theoretical side for inverse problems with generative priors has been much more modest. On the one hand, the seminal work of Bora et al. (2017b) established the first statistical upper bounds (in terms of measurement complexity) for compressed sensing for fairly general generative priors, which was later shown in Liu & Scarlett (2020) to be nearly optimal. On the other hand, provable algorithmic guarantees for recovery using generative priors are only available in very restrictive cases. The paper Hand & Voroninski (2018) proves the convergence of (a variant of) gradient descent for shallow generative priors whose weights obey a distributional assumption. The paper Shah & Hegde (2018) proves the convergence of projected gradient descent (PGD) under the assumption that the range of the (possibly deep) generative model $G$ admits a polynomial-time oracle projection. To our knowledge, the most general algorithmic result in this line of work is by Latorre et al. (2019). Here, the authors show that under rather mild and intuitive assumptions on $G$, a linearized alternating direction method of multipliers (ADMM) applied to a regularized mean-squared error loss converges to a (potentially large) neighborhood of $x^*$.

The main barrier to obtaining guarantees for recovery algorithms based on gradient descent is the non-convexity of the recovery problem induced by the generator network. Therefore, in this paper we sidestep traditional gradient descent-style optimization methods, and instead show that a very good estimate of $x^*$ can also be obtained by performing stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011); Raginsky et al. (2017); Zhang et al. (2017); Zou et al. (2020). We show that this dynamics amounts to sampling from a Gibbs distribution whose energy function is precisely the reconstruction loss. (While preparing this manuscript, we became aware of concurrent work by Jalal et al. (2020) which also pursues a similar Langevin-style approach for solving compressed sensing problems; however, they do not theoretically analyze its dynamics.)

As a stochastic version of gradient descent, SGLD is simple to implement. However, care must be taken in constructing the additive stochastic perturbation to each gradient update step. Nevertheless, the sampling viewpoint enables us to achieve finite-time convergence guarantees for compressed sensing recovery. To the best of our knowledge, this is the first such result for solving compressed sensing problems with generative neural network priors. Moreover, our analysis succeeds under (slightly) weaker assumptions on the generator network than those made in Latorre et al. (2019). Our specific contributions are as follows:

  1. We propose a provable compressed sensing recovery algorithm for generative priors based on stochastic gradient Langevin dynamics (SGLD).

  2. We prove polynomial-time convergence of our proposed recovery algorithm to the true underlying solution, under assumptions of smoothness and near-isometry of $G$. These are technically weaker than the mild assumptions made in Latorre et al. (2019). We emphasize that these conditions are valid for a wide range of generator networks. Section 3 describes them in greater detail.

  3. We provide several empirical results and demonstrate that our approach is competitive with existing (heuristic) methods based on gradient descent.

2 Prior work

We briefly review the literature on compressed sensing with deep generative models. For a thorough survey on deep learning for inverse problems, see Ongie et al. (2020).

In Bora et al. (2017a), the authors provide sufficient conditions under which the solution of the inverse problem is a minimizer of the (possibly non-convex) program:

$$\min_{x=G(z)}\|Ax-y\|_{2}^{2}. \tag{2.1}$$

Specifically, they show that if $A$ satisfies the so-called set-Restricted Eigenvalue Condition (REC), then the solution to (2.1) equals the unknown vector $x^*$. They also show that if the generator $G$ has a latent dimension $k$ and is $L$-Lipschitz, then a matrix $A\in\mathbb{R}^{m\times n}$ populated with i.i.d. Gaussian entries satisfies the REC, provided $m = O(k\log L)$. However, they propose gradient descent as a heuristic to solve (2.1), but do not analyze its convergence. In Shah & Hegde (2018), the authors show that projected gradient descent (PGD) for (2.1) converges at a linear rate under the REC, but only if there exists a tractable projection oracle that can compute $\operatorname{arg\,min}_{z}\|x-G(z)\|$ for any $x$. The recent work Lei et al. (2019) provides sufficient conditions under which such a projection can be approximately computed. In Latorre et al. (2019), a provable recovery scheme based on ADMM is established, but guarantees recovery only up to a neighborhood around $x^*$.
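For intuition, a minimal sketch of such a PGD scheme is given below. The exact projection oracle $\operatorname{arg\,min}_{z}\|x-G(z)\|$ is replaced by a few inner gradient steps on $z$; the step sizes, iteration counts, and this inner-loop approximation are illustrative assumptions, not the precise procedure of the cited works.

```python
import torch

def pgd_recover(G, A, y, d, outer_iters=100, inner_iters=50, step_x=0.5, step_z=0.01):
    """Projected gradient descent for min_x ||Ax - y||^2 subject to x in range(G).

    The projection onto the range of G is approximated by inner gradient
    descent on ||x - G(z)||^2 (a stand-in for the oracle assumed by
    Shah & Hegde (2018))."""
    z = torch.zeros(d, requires_grad=True)
    x = G(z).detach()
    for _ in range(outer_iters):
        # Gradient step on x for the data-fidelity term ||Ax - y||^2.
        x = x - step_x * A.T @ (A @ x - y)
        # Approximate projection of x onto the range of G.
        for _ in range(inner_iters):
            loss = ((G(z) - x) ** 2).sum()
            grad_z, = torch.autograd.grad(loss, z)
            z = (z - step_z * grad_z).detach().requires_grad_(True)
        x = G(z).detach()
    return x, z
```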

Note that all the above works assume mild conditions on the weights of the generator, use variations of gradient descent to update the estimate for $x$, and require the forward matrix $A$ to satisfy the REC over the range of $G$. Hand & Voroninski (2018, 2019) showed global convergence for gradient descent, but under the (strong) assumption that the weights of the trained generator are Gaussian distributed.

Generator networks trained with GANs are most commonly studied. However, Whang et al. (2020); Asim et al. (2019) have recently advocated using invertible generative models, which use real-valued non-volume preserving (NVP) transformations Dinh et al. (2017). An alternate strategy for sampling images consistent with linear forward models was proposed in Lindgren et al. (2020), where the authors assume an invertible generative mapping and sample the latent vector $z$ from a second generative invertible prior.

Our proposed approach also traces its roots to Bayesian compressed sensing Ji & Carin (2007), where instead of modeling the problem as estimating a (deterministic) sparse vector, one models the signal $x$ to be sampled from a sparsity-promoting distribution, such as a Laplace prior. One can then derive the maximum a posteriori (MAP) estimate of $x$ under the constraint that the measurements $y = Ax$ are consistent. Our motivation is similar, except that we model the distribution of $x$ as being supported on the range of a generative prior.

3 Recovery via Langevin dynamics

In the rest of the paper, $x \wedge y$ denotes $\min\{x,y\}$ and $x \vee y$ denotes $\max\{x,y\}$. Given a distribution $\mu$ and a set $\mathcal{A}$, we denote by $\mu(\mathcal{A})$ the probability measure of $\mathcal{A}$ with respect to $\mu$. $\|\mu-\nu\|_{TV}$ is the total variation distance between two distributions $\mu$ and $\nu$. Finally, we use standard big-O notation in our analysis.

3.1 Preliminaries

We focus on the problem of recovering a signal $x^*\in\mathbb{R}^{n}$ from a set of linear measurements $y\in\mathbb{R}^{m}$, where

$$y = Ax^* + \varepsilon.$$

To keep our analysis and results simple, we consider zero measurement noise, i.e., $\varepsilon = 0$. (We note in passing that our analysis techniques succeed for any vector $\varepsilon$ with bounded $\ell_2$ norm.) Here, $A\in\mathbb{R}^{m\times n}$ is a matrix populated with i.i.d. Gaussian entries with mean 0 and variance $1/m$. We assume that $x^*$ belongs to the range of a known generative model $G\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$; that is,

$$x^* = G(z^*)\quad\text{for some}\quad z^*\in\mathcal{D}.$$

Following Bora et al. (2017b), we restrict $z$ to belong to a $d$-dimensional Euclidean ball, i.e., $\mathcal{D} = \mathcal{B}(0,R)$. Then, given the measurements $y$, our goal is to recover $x^*$. Again following Bora et al. (2017b), we do so by solving the usual optimization problem:

$$\min_{z\in\mathcal{D}} F(z) \triangleq \|y - AG(z)\|^{2}. \tag{3.1}$$

Hereon, unless otherwise stated, $\|\cdot\|$ denotes the $\ell_2$-norm. The most popular approach to solving (3.1) is to use gradient descent Bora et al. (2017b). For generative models $G(z)$ defined by deep neural networks, the function $F(z)$ is highly non-convex, and as such, one cannot in general guarantee global signal recovery using regular (projected) gradient descent.
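As a concrete reference point for the notation above, the following sketch sets up the measurement model and the loss $F(z)$ of (3.1). The generator, its latent dimension, and the signal dimensions are placeholders (a small random network stands in for a pre-trained $G$); this is purely illustrative.

```python
import torch

torch.manual_seed(0)
n, m, d = 784, 200, 20  # ambient, measurement, and latent dimensions (illustrative)

# A stand-in for a pre-trained generator G : R^d -> R^n.
G = torch.nn.Sequential(
    torch.nn.Linear(d, 256), torch.nn.ELU(), torch.nn.Linear(256, n), torch.nn.Tanh()
)

A = torch.randn(m, n) / m ** 0.5   # i.i.d. N(0, 1/m) measurement matrix
z_star = torch.randn(d)            # unknown latent code
x_star = G(z_star).detach()        # x* = G(z*)
y = A @ x_star                     # noiseless measurements

def F(z):
    """Reconstruction loss F(z) = ||y - A G(z)||^2 from (3.1)."""
    return ((y - A @ G(z)) ** 2).sum()
```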

We adopt a slightly more refined approach. Starting from an initial point $z_0 \sim \mu_0$, our algorithm computes stochastic gradient updates of the form:

$$z_{k+1} = z_k - \eta\nabla_{z}F(z_k) + \sqrt{2\eta\beta^{-1}}\,\xi_k,\qquad k = 0,1,2,\dots \tag{3.2}$$

where $\xi_k$ is a unit Gaussian random vector in $\mathbb{R}^{d}$, $\eta$ is the step size and $\beta$ is an inverse temperature parameter. This update rule is known as stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011) and has been widely studied both in theory and practice Raginsky et al. (2017); Zhang et al. (2017). Intuitively, (3.2) is an Euler discretization of the continuous-time diffusion equation:

$$\mathrm{d}Z(t) = -\nabla_{z}F(Z(t))\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B(t),\qquad t\geq 0, \tag{3.3}$$

where $Z(0)\sim\mu_0$. Under standard regularity conditions on $F(z)$, one can show that the above diffusion has a unique invariant Gibbs measure.

We refine the standard SGLD to account for the boundedness of $z$. Specifically, we require an additional Metropolis-like accept/reject step to ensure that $z_{k+1}$ always belongs to the support $\mathcal{D}$, and also is not too far from $z_k$ of the previous iteration. We study this variant for theoretical analysis; in practice we have found that this is not necessary. Algorithm 1 (CS-SGLD) shows the detailed algorithm. Note that we can use a stochastic (mini-batch) gradient instead of the full gradient $\nabla_{z}F(z)$.
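For concreteness, here is a minimal sketch of one possible implementation of the CS-SGLD iteration (the update (3.2) together with the accept/reject step of Algorithm 1 below), reusing the loss `F` from the earlier setup snippet; the particular values of $\eta$, $\beta$, $r$ and $R$ are illustrative.

```python
import torch

def cs_sgld(F, d, R=10.0, eta=1e-3, beta=100.0, iters=2000):
    """SGLD on the latent variable z, with the rejection step of Algorithm 1:
    a proposal is kept only if it stays within distance r of the current
    iterate and inside the ball D = B(0, R)."""
    r = (10 * eta * d / beta) ** 0.5      # radius r = sqrt(10*eta*d/beta), as in Lemma 4.1
    z = torch.zeros(d, requires_grad=True)
    for _ in range(iters):
        loss = F(z)
        grad, = torch.autograd.grad(loss, z)
        xi = torch.randn(d)
        proposal = z.detach() - eta * grad + (2 * eta / beta) ** 0.5 * xi
        # Accept only if the proposal stays local and inside the domain D.
        if (proposal - z.detach()).norm() <= r and proposal.norm() <= R:
            z = proposal.requires_grad_(True)
    return z.detach()
```

With the earlier setup, `z_hat = cs_sgld(F, d)` returns the latent estimate and `G(z_hat)` the reconstructed signal.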

We wish to derive sufficient conditions on the convergence (in distribution) of the random process in Algorithm 1 to the target distribution $\pi$, denoted by:

$$\pi(\mathrm{d}z) \propto \exp(-\beta F(z))\,\mathbf{1}(z\in\mathcal{D}), \tag{3.4}$$

and study its consequence in recovering the true signal $x^*$. This leads to the first guarantees of a stochastic gradient-like method for compressed sensing with generative priors. In order to do so, we make the following three assumptions on the generator network $G(z)$.

Algorithm 1 CS-SGLD
  Input: step size $\eta$; inverse temperature parameter $\beta$; radius $r$ and Lipschitz constant $L$ of $F(z)$.
  Draw $z_0$ from $\mu_0 = \mathcal{N}(0,\frac{1}{2L\beta}\mathbb{I})$ truncated on $\mathcal{D}$.
  for $k = 0,1,\ldots,$ do
     Randomly sample $\xi_k \sim \mathcal{N}(0,\mathbb{I})$.
     $z_{k+1} = z_k - \eta\nabla_z F(z_k) + \sqrt{2\eta/\beta}\,\xi_k$
     if $z_{k+1} \not\in \mathcal{B}(z_k,r)\cap\mathcal{D}$ then
        $z_{k+1} = z_k$
     end if
  end for
  Output: $\widehat{z} = z_k$.
  1. (A.1) Boundedness. For all $z\in\mathcal{D}$, we have that $\|G(z)\| \leq B$ for some $B > 0$.

  2. (A.2) Near-isometry. $G(z)$ is a near-isometric mapping if there exist $0 < \iota_G \leq \kappa_G$ such that the following holds for any $z, z'\in\mathcal{D}$:

     $$\iota_G\|z - z'\| \leq \|G(z) - G(z')\| \leq \kappa_G\|z - z'\|.$$

  3. (A.3) Lipschitz gradients. The Jacobian of $G(z)$ is $M$-Lipschitz, i.e., for any $z, z'\in\mathcal{D}$, we have

     $$\|\nabla_z G(z) - \nabla_z G(z')\| \leq M\|z - z'\|,$$

     where $\nabla_z G(z) = \frac{\partial G(z)}{\partial z}$ is the Jacobian of the mapping $G(\cdot)$ with respect to $z$.

All three assumptions are justifiable. Assumption (A.1) is reasonable due to the bounded domain $\mathcal{D}$ and for well-trained generative models $G(z)$ whose target data distribution is normalized. Assumption (A.2) is reminiscent of the ubiquitous restricted isometry property (RIP) used in compressed sensing analysis (Candes & Tao, 2005) and was recently adopted in Latorre et al. (2019). Finally, Assumption (A.3) is needed so that the loss function $F(z)$ is smooth, following typical analyses of Markov processes.

Next, we introduce a new concept of smoothness for generative networks. This concept is a weaker version of a condition on $G(\cdot)$ introduced in Latorre et al. (2019).

Definition 3.1 (Strong smoothness).

The generator network $G(z)$ is $(\alpha,\gamma)$-strongly smooth if there exist $\alpha > 0$ and $\gamma \geq 0$ such that for any $z, z'\in\mathcal{D}$, we have

$$\langle G(z) - G(z'),\, \nabla_z G(z)(z - z')\rangle \geq \alpha\|z - z'\|^2 - \gamma. \tag{3.5}$$

Following Latorre et al. (2019) (Assumption 2), we call this property “strong smoothness”. However, our definition of strong smoothness requires two parameters instead of one, and is weaker since we allow for an additive slack parameter $\gamma \geq 0$.

Definition 3.1 can be closely linked to the following property of the loss function $F(z)$ that turns out to be crucial in establishing convergence results for CS-SGLD.

Definition 3.2 (Dissipativity (Hale, 1990)).

A differentiable function $F(z)$ on $\mathcal{D}$ is $(\alpha,\gamma)$-dissipative around $z^*$ if for constants $\alpha > 0$ and $\gamma \geq 0$, we have

$$\langle z - z^*,\, \nabla_z F(z)\rangle \geq \alpha\|z - z^*\|^2 - \gamma. \tag{3.6}$$

It is straightforward to see that (3.6) essentially recovers the strong smoothness condition (3.5) if the measurement matrix $A$ is taken to be the identity. In compressed sensing, $A$ is typically a (sub)Gaussian matrix; given a sufficient number of measurements, together with Assumptions (A.1), (A.2) and (A.3), the dissipativity of $F(z)$ for such an $A$ can still be established.
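To see the connection concretely, recall from Appendix B that (up to an immaterial constant factor) $\nabla_z F(z) = (\nabla_z G(z))^\top A^\top A\,(G(z) - G(z^*))$, so that

$$\langle z - z^*,\, \nabla_z F(z)\rangle = \big\langle A\big(G(z) - G(z^*)\big),\; A\,\nabla_z G(z)(z - z^*)\big\rangle.$$

When $A = \mathbb{I}$, the right hand side is precisely the left hand side of (3.5) with $z' = z^*$, so $(\alpha,\gamma)$-strong smoothness of $G$ immediately yields $(\alpha,\gamma)$-dissipativity of $F$.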

Once $F$ is shown to be dissipative, the machinery of Raginsky et al. (2017); Zhang et al. (2017); Zou et al. (2020) can be adapted to show the convergence of CS-SGLD. The majority of the remainder of the paper is devoted to proving this series of technical claims.

3.2 Main results

We first show that a very broad class of generator networks satisfies the assumptions made above. The following proposition is an extension of a result in Latorre et al. (2019).

Proposition 3.1.

Suppose $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $\mathcal{D}$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G(z)$ satisfies Assumptions (A.2) & (A.3) with constants $\iota_G, \kappa_G, M$, and if $2\iota_G^2 > M\kappa_G$, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.

This proposition merits a thorough discussion. First, architectures with increasing layer sizes are common; many generative models (such as GANs) assume architectures of this sort. Observe that the non-decreasing layer size condition is much milder than the expansivity ratios of successive layers assumed in related work Hand & Voroninski (2018); Asim et al. (2019).

Second, the compactness assumption on the domain of $G$ is mild, and traces its provenance to earlier related works Bora et al. (2017b); Latorre et al. (2019). Moreover, common empirical techniques for training generative models (such as GANs) indeed assume that the latent vectors $z$ lie on the surface of a sphere White (2016).

Third, common activation functions such as the sigmoid, or the Exponential Linear Unit (ELU) are continuously differentiable and monotonic. Note that the standard Rectified Linear Unit (ReLU) activation does not satisfy these conditions, and establishing similar results for ReLU networks is deferred to future work.

The key to our theoretical analysis, as discussed above, is Definition 3.1, and establishing it requires Proposition 3.1. Interestingly, however, in Section 5 below we provide empirical evidence that strong smoothness holds for generative adversarial networks with ReLU activation trained on the MNIST and CIFAR-10 image datasets.

We now obtain a measurement complexity result by deriving a bound on the number of measurements required for $F$ to be dissipative.

Lemma 3.1.

Let $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ be a feed-forward neural network that satisfies the conditions in Proposition 3.1. Let $\kappa_G$ be its Lipschitz constant. Suppose the number of measurements $m$ satisfies:

$$m = \Omega\left(\frac{d}{\delta^2}\log(\kappa_G/\gamma)\right),$$

for some small constant $\delta > 0$. If the elements of $A$ are drawn according to $\mathcal{N}(0,\frac{1}{m})$, then the loss function $F(z)$ is $(1-\delta,\gamma)$-dissipative with probability at least $1 - \exp(-\Omega(m\delta^2))$.

The above result can be derived using covering number arguments, similar to the treatment in Bora et al. (2017b). Observe that the number of measurements scales linearly with the dimension of the latent vector $z$ instead of the ambient dimension, in keeping with the flavor of results in standard compressed sensing. Recent lower bounds reported in Liu & Scarlett (2020) have also shown that the scaling of $m$ with respect to $d$ and $\log L$ might be tight for compressed sensing recovery in several natural parameter regimes.

We need two more quantities before we can state our convergence guarantee. Both definitions are widely used in the convergence analysis of MCMC methods. The first quantity measures the goodness of an initial distribution $\mu_0$ with respect to the target distribution $\pi$.

Definition 3.3 ($\lambda$-warm start, Zou et al. (2020)).

Let $\nu$ be a distribution on $\mathcal{D}$. An initial distribution $\mu_0$ is a $\lambda$-warm start with respect to $\nu$ if

$$\sup_{\mathcal{A}\colon\mathcal{A}\subseteq\mathcal{D}}\frac{\mu_0(\mathcal{A})}{\nu(\mathcal{A})} \leq \lambda.$$

The next quantity is the Cheeger constant, which connects the geometry of the objective function to the hitting time of SGLD to a particular set in the domain Zhang et al. (2017).

Definition 3.4 (Cheeger constant).

Let $\mu$ be a probability measure on $\mathcal{D}$. We say $\mu$ satisfies the isoperimetric inequality with Cheeger constant $\rho$ if for any $\mathcal{A}\subset\mathcal{D}$,

$$\liminf_{h\rightarrow 0^{+}}\frac{\mu(\mathcal{A}_h) - \mu(\mathcal{A})}{h} \geq \rho\min\big\{\mu(\mathcal{A}),\, 1 - \mu(\mathcal{A})\big\},$$

where $\mathcal{A}_h = \{u\in\mathcal{D}\colon\exists v\in\mathcal{A},\ \|u - v\|_2\leq h\}$.

Putting all the above ingredients together, our main theoretical result describing the convergence of Algorithm 1 (CS-SGLD) for compressed sensing recovery is given as follows.

Theorem 1 (Convergence of CS-SGLD).

Assume that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. Consider a signal $x^* = G(z^*)$, and assume that it is measured with $m$ (sub)Gaussian measurements such that $m = \Omega(d\log\kappa_G/\gamma)$. Choose an inverse temperature $\beta$ and precision parameter $\epsilon > 0$. Then, after $k$ iterations of SGLD in Algorithm 1, we obtain a latent vector $z_k$ such that

$$\mathbb{E}\left[F(z_k)\right] \leq \epsilon + O\left(\frac{d}{\beta}\log\left(\frac{\beta}{d}\right)\right), \tag{3.7}$$

provided the step size $\eta$ and the number of iterations $k$ are chosen such that:

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^3\beta^2}{\rho^4\epsilon^2}\right).$$

In words, if we choose a high enough inverse temperature and appropriate step size, CS-SGLD converges (in expectation) to a signal estimate with very low loss within a polynomial number of iterations.

Let us parse the above result further. First, observe that the right hand side of (3.7) consists of two terms. The first term can be made arbitrarily small, at the price of greater computational cost, since $\eta$ decreases (and $k$ grows) as $\epsilon$ shrinks. The second term represents the irreducible expected error of the exact sampling algorithm on the Gibbs measure $\pi(\mathrm{d}z)$, which is worse than the optimal loss obtained at $z = z^*$.

Second, suppose the right hand side of (3.7) is upper bounded by $\epsilon'$. Once SGLD finds an $\epsilon'$-approximate minimizer of the loss, in the regime of sufficient compressed sensing measurements (as specified by Lemma 3.1), we can invoke Theorem 1.1 in Bora et al. (2017b) along with Jensen's inequality to immediately obtain a recovery guarantee, i.e.,

$$\mathbb{E}\left[\left\lVert x^* - G(z_k)\right\rVert\right] \leq \sqrt{\epsilon'}.$$
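For intuition, one way to see this chain of inequalities (under the S-REC of Definition 4.2 with slack $o = 0$, and ignoring constant factors) is

$$\mathbb{E}\big[\|x^* - G(z_k)\|\big] \;\leq\; \frac{1}{\tau}\,\mathbb{E}\big[\|A(x^* - G(z_k))\|\big] \;=\; \frac{1}{\tau}\,\mathbb{E}\big[\sqrt{F(z_k)}\big] \;\leq\; \frac{1}{\tau}\sqrt{\mathbb{E}[F(z_k)]} \;\leq\; \frac{\sqrt{\epsilon'}}{\tau},$$

where the middle equality uses the noiseless model $y = AG(z^*)$ and the second inequality is Jensen's inequality applied to the concave square root.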

Third, the convergence rate of CS-SGLD can be slow. In particular, SGLD may require a polynomial number of iterations to recover the true signal, while linearized ADMM Latorre et al. (2019) converges within a logarithmic number of iterations up to a neighborhood of the true signal. Obtaining an improved characterization of CS-SGLD convergence (or perhaps devising a new linearly convergent algorithm) is an important direction for future work.

Fourth, the above result is for noiseless measurements. A rather similar result can be derived for noisy measurements with bounded noise (say, $\|\varepsilon\| \leq \sigma$). This quantity (times a constant depending on $A$) will affect (3.7) up to an additive term that scales with $\sigma$. This is precisely in line with most compressed sensing recovery results, and for simplicity we omit such a derivation.

4 Proof outline

In this section, we provide a brief proof sketch of Theorem 1, while relegating details to the appendix. At a high level, our analysis is an adaptation of the framework of Zhang et al. (2017); Zou et al. (2020) specialized to the problem of compressed sensing recovery using generative priors. The basic ingredient in the proof is the use of conductance analysis to show the convergence of CS-SGLD to the target distribution in total variation distance.

Let $\mu_k$ denote the probability measure of $Z_k$ generated by Algorithm 1 and $\pi$ denote the target distribution in (3.4). The proof of Theorem 1 consists of three main steps:

  1. First, we construct an auxiliary Metropolis-Hastings Markov process to show that $\mu_k$ converges to $\pi$ in total variation for a sufficiently large $k$ and a “good” initial distribution $\mu_0$.

  2. Next, we construct an initial distribution $\mu_0$ that serves as a $\lambda$-warm start with respect to $\pi$.

  3. Finally, we show that a random draw from $\pi$ is a near-minimizer of $F(z)$, proving that CS-SGLD recovers the signal to high fidelity.

We proceed with a characterization of the evolution of the distribution of $z_k$ in Algorithm 1, which basically follows Zou et al. (2020).

4.1 Construction of Metropolis-Hastings SGLD

Let $g(z) = \nabla_z F(z)$, and let $u$ and $w$ respectively be the points before and after one iteration of Algorithm 1; the Markov chain is written as $u\rightarrow v\rightarrow w$, where $v\sim\mathcal{N}(u - \eta g(u), \frac{2\eta}{\beta}I)$ with the following density:

$$P(v\,|\,u) = \frac{1}{(4\pi\eta/\beta)^{d/2}}\exp\bigg(-\frac{\|v - u + \eta g(u)\|_2^2}{4\eta/\beta}\bigg). \tag{4.1}$$

Without the correction step, $P(v|u)$ is exactly the transition probability of the standard Langevin dynamics. Note also that one can construct a similar density with a stochastic (mini-batch) gradient. The process $v\rightarrow w$ is

$$w = \begin{cases} v & v\in\mathcal{B}(u,r)\cap\mathcal{D};\\ u & \text{otherwise}.\end{cases} \tag{4.2}$$

Let $p(u) = \mathbb{P}_{v\sim P(\cdot|u)}[v\in\mathcal{B}(u,r)\cap\mathcal{D}]$ be the probability of accepting $v$. The conditional density $Q(w|u)$ is

$$Q(w|u) = (1 - p(u))\,\delta_u(w) + P(w|u)\cdot\mathbf{1}\big[w\in\mathcal{B}(u,r)\cap\mathcal{D}\big],$$

where $\delta_u(\cdot)$ is the Dirac delta function at $u$. Similar to Zou et al. (2020); Zhang et al. (2017), we consider the $1/2$-lazy version of the above Markov process, with the transition distribution

$$\mathcal{T}_u(w) = \frac{1}{2}\delta_u(w) + \frac{1}{2}Q(w|u), \tag{4.3}$$

and construct an auxiliary Markov process by adding an extra Metropolis accept/reject step. While proving the ergodicity of the Markov process with transition distribution $\mathcal{T}_u(w)$ is difficult, the auxiliary chain does indeed converge to a unique stationary distribution $\pi\propto e^{-\beta F(z)}\cdot\mathbf{1}(z\in\mathcal{D})$ due to the Metropolis-Hastings correction step.

The auxiliary Markov chain is given as follows: starting from $u$, let $w$ be the state generated from $\mathcal{T}_u(\cdot)$. The Metropolis-Hastings SGLD accepts $w$ with probability

$$\alpha_u(w) = \min\bigg\{1,\ \frac{\mathcal{T}_w(u)}{\mathcal{T}_u(w)}\cdot\exp\big[-\beta\big(F(w) - F(u)\big)\big]\bigg\}.$$

Let $\mathcal{T}^{\star}_u(\cdot)$ denote the transition distribution of the auxiliary Markov process, such that

$$\mathcal{T}^{\star}_u(w) = (1 - \alpha_u(w))\,\delta_u(w) + \alpha_u(w)\,\mathcal{T}_u(w).$$
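Purely for intuition, the sketch below implements one step of a Metropolis-adjusted variant of this kind. It is a simplification: the acceptance ratio uses the Gaussian proposal density $P(\cdot|\cdot)$ of (4.1) directly, whereas the analysis works with the lazy kernel $\mathcal{T}_u$, which additionally accounts for the probability mass of rejected proposals; the step size, temperature and domain radius are placeholders.

```python
import torch

def mh_sgld_step(F, z, eta=1e-3, beta=100.0, R=10.0):
    """One Metropolis-adjusted Langevin step on F over D = B(0, R)."""
    def grad(v):
        v = v.detach().requires_grad_(True)
        g, = torch.autograd.grad(F(v), v)
        return g

    def log_P(v, u):
        # log of the Gaussian proposal density N(u - eta*grad(u), (2*eta/beta) I) at v, up to a constant
        return -((v - u + eta * grad(u)) ** 2).sum() / (4 * eta / beta)

    w = z - eta * grad(z) + (2 * eta / beta) ** 0.5 * torch.randn_like(z)
    if w.norm() > R:                      # reject proposals leaving the domain D
        return z
    log_ratio = log_P(z, w) - log_P(w, z) - beta * (F(w) - F(z))
    if torch.rand(1).item() < torch.exp(log_ratio).clamp(max=1.0).item():
        return w
    return z
```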

Below, we establish the connection between $\mathcal{T}_u(\cdot)$ and $\mathcal{T}^{\star}_u(\cdot)$, as well as the convergence of the original chain in Algorithm 1, through a conductance analysis on $\mathcal{T}^{\star}_u(\cdot)$.

Lemma 4.1.

Suppose that Assumptions (A.1)–(A.3) hold, so that $F(z)$ is $L$-smooth and satisfies $\|\nabla_z F(z)\| \leq D$ for $z\in\mathcal{D}$. For $r = \sqrt{10\eta d/\beta}$, the transition distribution of the chain in Algorithm 1 is $\delta$-close to the auxiliary chain, i.e., for any set $\mathcal{A}\subseteq\mathcal{D}$,

$$(1 - \delta)\,\mathcal{T}_u^{\star}(\mathcal{A}) \leq \mathcal{T}_u(\mathcal{A}) \leq (1 + \delta)\,\mathcal{T}_u^{\star}(\mathcal{A}),$$

where $\delta = 10Ld\eta + 10LDd^{1/2}\beta^{1/2}\eta^{3/2}$.

In Appendix B, we show that $F(z)$ is $L$-smooth with $L = (MB + \kappa_G^2)\|A^\top A\|$, and that its gradient is bounded on $\mathcal{D}$, with $\|\nabla_z F(z)\| \leq \kappa_G^2\|A^\top A\|\|z - z^*\|$.

One can verify that $\mathcal{T}^{\star}_u(\cdot)$ is time-reversible (Zhang et al., 2017). Moreover, following Lovász et al. (1993); Lovász & Vempala (2007), the convergence of a time-reversible Markov chain to its stationary distribution depends on its conductance, which is defined as follows:

Definition 4.1 (Restricted conductance).

The conductance of a time-reversible Markov chain with transition distribution $\mathcal{T}^{\star}_u(\cdot)$ and stationary distribution $\pi$ is defined by

$$\phi \triangleq \inf_{\mathcal{A}\colon\mathcal{A}\subseteq\mathcal{D},\,\pi(\mathcal{A})\in(0,1)}\frac{\int_{\mathcal{A}}\mathcal{T}^{\star}_u(\mathcal{D}\backslash\mathcal{A})\,\pi(\mathrm{d}u)}{\min\{\pi(\mathcal{A}),\,\pi(\mathcal{D}\backslash\mathcal{A})\}}.$$

Using the conductance parameter $\phi$ and the closeness $\delta$ between $\mathcal{T}_u(\cdot)$ and $\mathcal{T}^{\star}_u(\cdot)$, we can derive the convergence of $\mathcal{T}_u(\cdot)$ in total variation distance.

Lemma 4.2 (Zou et al. (2020)).

Assume the conditions of Lemma 4.1 hold. If $\mathcal{T}_u(\cdot)$ is $\delta$-close to $\mathcal{T}^{\star}_u(\cdot)$ with $\delta\leq\min\{1 - \sqrt{2}/2,\ \phi/16\}$, and the initial distribution $\mu_0$ serves as a $\lambda$-warm start with respect to $\pi$, then

$$\|\mu_k - \pi\|_{TV} \leq \lambda\big(1 - \phi^2/8\big)^k + 16\delta/\phi.$$

We will further give a lower bound on $\phi$ in order to establish an explicit convergence rate.

Lemma 4.3 (Zou et al. (2020)).

Under the same conditions as Lemma 4.1 and with step size $\eta\leq\frac{1}{30Ld}\wedge\frac{d}{25\beta D^2}$, there exists a constant $c_0$ such that

$$\phi \geq c_0\rho\sqrt{\eta/\beta}.$$

4.2 Convergence of $\mu_k$ to the target distribution $\pi$

Armed with these tools, we formally establish the first step of the proof.

Theorem 2.

Suppose that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. Set $\eta = O\big(d^{-1}\wedge\rho^2\beta^{-1}d^{-2}\big)$ and $r = \sqrt{10\eta d/\beta}$. Then for any $\lambda$-warm start with respect to $\pi$, the output of Algorithm 1 satisfies

$$\|\mu_k - \pi\|_{TV} \leq \lambda(1 - C_0\eta)^k + C_1\eta^{1/2},$$

where $\rho$ is the Cheeger constant of $\pi$, $C_0 = \widetilde{O}\big(\rho^2\beta^{-1}\big)$, and $C_1 = \widetilde{O}\big(d\beta^{1/2}\rho^{-1}\big)$. In particular, if the step size and the number of iterations satisfy

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^2\beta^2\log(\lambda)}{\rho^4\epsilon^2}\right),$$

then $\|\mu_k - \pi\|_{TV}\leq\epsilon$ for any $\epsilon > 0$.

The convergence rate is polynomial in the Cheeger constant $\rho$, whose lower bound is difficult to obtain in general. A rough bound $\rho = e^{-\widetilde{O}(d)}$ can be derived using the Poincaré constant of the distribution $\pi$, under the smoothness assumption. See Bakry et al. (2008) for details.

Proof outline of Theorem 2.

To prove the result, we find a sufficient condition on $\eta$ under which the requirements of Lemmas 4.1, 4.2 and 4.3 hold. For $\eta\leq\frac{d}{25\beta D^2}$, we have

$$\delta = 10Ld\eta + 10LDd^{1/2}\beta^{1/2}\eta^{3/2} \leq 12Ld\eta.$$

Moreover, Lemma 4.2 requires $\delta\leq\min\{1 - \sqrt{2}/2,\ \phi/16\}$, while $\phi\geq c_0\rho\sqrt{\eta/\beta}$ by Lemma 4.3, so we can set

$$\eta = \min\biggl\{\frac{1}{30Ld},\ \frac{d}{25\beta D^2},\ \frac{c_0^2\rho^2}{(156Ld)^2\beta}\biggr\}$$

for these conditions to hold. Putting it all together, we obtain

$$\|\mu_k - \pi\|_{TV} \leq \lambda\big(1 - \phi^2/8\big)^k + \frac{16\delta}{\phi} \leq \lambda(1 - C_0\eta)^k + C_1\eta^{1/2},$$

where $C_0 = c_0^2\rho^2/8\beta$ and $C_1 = 156Ld\beta^{1/2}\rho^{-1}/c_0$. Therefore, we have proved the first part.

For the second part, to achieve $\epsilon$-sampling error, it suffices to choose $\eta$ and $k$ such that

$$\lambda(1 - C_0\eta)^k \leq \frac{\epsilon}{2},\quad\text{and}\quad C_1\eta^{1/2} \leq \frac{\epsilon}{2}.$$

Plugging in $C_0, C_1$ above, we can choose

$$\eta = O\bigg(\frac{\rho^2\epsilon^2}{d^2\beta}\bigg)\quad\text{and}\quad k = O\bigg(\frac{\log(\lambda/\epsilon)}{C_0\eta}\bigg) = \widetilde{O}\bigg(\frac{d^2\beta^2\log(\lambda)}{\rho^4\epsilon^2}\bigg)$$

such that $\|\mu_k - \pi\|_{TV}\leq\epsilon$, which completes the proof. ∎

4.3 Existence of a $\lambda$-warm start initial distribution $\mu_0$

Apart from the step size and the number of iterations, the convergence depends on $\lambda$, the goodness of the initial distribution $\mu_0$. In this part, we specify a particular choice of $\mu_0$ and establish this property.

Definition 4.2 (Set-Restricted Eigenvalue Condition, (Bora et al., 2017b)).

For some parameters $\tau > 0$ and $o\geq 0$, $A\in\mathbb{R}^{m\times n}$ is called $\text{S-REC}(\tau, o)$ if for all $z, z'\in\mathcal{D}$,

$$\|A(G(z) - G(z'))\| \geq \tau\|G(z) - G(z')\| - o.$$
Lemma 4.4.

Suppose that $G(z)$ satisfies the near-isometry property in Assumption (A.2), and $F(z)$ is $L$-smooth. If $A$ is $\text{S-REC}(\tau, 0)$, then the Gaussian distribution $\mathcal{N}(0,\frac{1}{2\beta L}\mathbb{I})$ supported on $\mathcal{D}$ is a $\lambda$-warm start with respect to $\pi$ with $\lambda = e^{O(d)}$.

Proof.

Let $\mu_0$ denote the truncated Gaussian distribution $\mathcal{N}(0,\frac{1}{2\beta L}\mathbb{I})$ on $\mathcal{D}$, whose measure is

$$\mu_0(\mathrm{d}z) = e^{-\beta L\|z\|_2^2}\,\mathbf{1}(z\in\mathcal{D})\,\mathrm{d}z\,/\,\Gamma,$$

where $\Gamma = \int_{\mathcal{D}}e^{-\beta L\|z\|_2^2}\,\mathrm{d}z$ is the normalization constant.

Comparing with the target measure $\pi$, we can easily verify that

$$\frac{\mu_0(\mathrm{d}z)}{\pi(\mathrm{d}z)} \leq \frac{\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z}{\Gamma}\cdot e^{-\beta L\|z\|_2^2 + \beta F(z)}.$$

Our goal is to bound the right hand side. Using the smoothness of $F$ and the simple fact $F(z^*) = 0$, we have

$$F(z) \leq \frac{L}{2}\|z - z^*\|_2^2 \leq L\|z^*\|_2^2 + L\|z\|_2^2,$$

which implies that $e^{-\beta L\|z\|_2^2 + \beta F(z)} \leq e^{\beta L\|z^*\|_2^2}$.

To bound $\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z$, we use the S-REC property of $A$ as well as the near-isometry of $G(z)$. Recall the objective function:

$$F(z) = \|y - AG(z)\|^2 = \|A(G(z) - G(z^*))\|^2 \geq \tau^2\|G(z) - G(z^*)\|^2 - o \geq \tau^2\iota_G^2\|z - z^*\|^2,$$

where we have dropped $o$ for simplicity. Therefore,

$$\int_{\mathcal{D}}e^{-\beta F(z)}\,\mathrm{d}z \leq \int_{\mathcal{D}}e^{-\beta\tau^2\iota_G^2\|z - z^*\|^2}\,\mathrm{d}z \leq \left(\frac{\pi}{\beta\tau^2\iota_G^2}\right)^{d/2}.$$

Putting the above results together, we can get

$$\lambda \leq \max_{z\in\mathcal{D}}\frac{\mu_0(\mathrm{d}z)}{\pi(\mathrm{d}z)} \leq \bigg(\frac{\pi}{\beta\tau^2\iota_G^2}\bigg)^{d/2}\frac{e^{\beta L\|z^*\|_2^2}}{\Gamma} = e^{O(d)},$$

and conclude the proof. ∎

4.4 Completing the proof

Proof of Theorem 1.

Consider a random draw $\widehat{Z}$ from $\mu_k$ and another $\widehat{Z}^*$ from $\pi$. We have

$$\mathbb{E}[F(\widehat{Z})] = \left(\mathbb{E}[F(\widehat{Z})] - \mathbb{E}[F(\widehat{Z}^*)]\right) + \mathbb{E}[F(\widehat{Z}^*)].$$

We will first give a crude bound for the second term $\mathbb{E}[F(\widehat{Z}^*)]$, following the idea from Raginsky et al. (2017):

$$\mathbb{E}[F(\widehat{Z}^*)] = \int_{\mathcal{D}}F(z)\,\pi(\mathrm{d}z) \leq \mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta}{d}\right).$$

The detailed proof is given in Appendix D. The first term is related to the convergence of $\mu_k$ to $\pi$ in total variation shown in Theorem 2. Notice that $F(z)\leq 2R\|A\|\kappa_G$ for all $z\in\mathcal{D}$, due to the Lipschitz property of the generative network $G$. Moreover, by Theorem 2, we have $\|\mu_k - \pi\|_{TV}\leq\epsilon'$ for any $\epsilon' > 0$ and a sufficiently large $k$. Hence, the first term is upper bounded by

$$\left|\int_{\mathcal{D}}F(z)\,\mu_k(\mathrm{d}z) - \int_{\mathcal{D}}F(z)\,\pi(\mathrm{d}z)\right| \leq 2R\|A\|\kappa_G\,\|\mu_k - \pi\|_{TV} \leq 2R\|A\|\kappa_G\,\epsilon'.$$

Given the target error $\epsilon$, choose $\epsilon' = \epsilon/(2R\|A\|\kappa_G)$. By Lemma 4.4, we have $\lambda = e^{O(d)}$. Then, for

$$\eta = \widetilde{O}\left(\frac{\rho^2\epsilon^2}{d^2\beta}\right),\quad\text{and}\quad k = \widetilde{O}\left(\frac{d^3\beta^2}{\rho^4\epsilon^2}\right),$$

we have

$$\mathbb{E}[F(\widehat{Z})] \leq \epsilon + \mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta}{d}\right).$$

Therefore, we complete the proof of our main result. ∎

5 Experimental results

While we emphasize that the primary focus of our paper is theoretical, we corroborate our theory with representative experimental results on MNIST and CIFAR-10. Even though our analysis requires the latent vector to lie in a bounded domain (a $d$-dimensional Euclidean ball), empirical results demonstrate that our approach works without this restriction.

Figure 1: [MNIST] selected base digit $G(z^*)$, evaluating (a) dissipativity, (b) condition (5.1); [CIFAR] selected base image $G(z^*)$, evaluating (c) dissipativity, (d) condition (5.1).

5.1 Validation of strong smoothness

As mentioned above, our theory relies on the assumption that the following condition holds for some constants $\alpha > 0$, $\gamma\geq 0$ and all $z, z'\in\mathcal{D}$ for a domain $\mathcal{D}$. Here, we take $\mathcal{D} = \mathbb{R}^d$.

$$\langle G(z) - G(z'),\, \nabla_z G(z)(z - z')\rangle \geq \alpha\|z - z'\|^2 - \gamma.$$

To estimate these constants, we generate samples $z$ and $z'$ from $\mathcal{N}(0,\mathbb{I})$. To establish $\alpha$ and $\gamma$, we perform experiments on two different datasets: (i) MNIST (Net1) and (ii) CIFAR10 (Net2). For both datasets, we compute the terms $u(z,z') = \langle\nabla_z G(z)^\top(G(z) - G(z')),\, z - z'\rangle$ and $v(z,z') = \|z - z'\|^2$ for 500 different instantiations of $z$ and $z'$. We then plot these pairs of $(\alpha v - \gamma,\, u)$ samples for different $z$'s and $z'$'s and compute the values of $\alpha$ and $\gamma$ by a simple linear program. We do this experiment for two DCGAN generators trained on MNIST (Figure 1 (a)) as well as on CIFAR10 (Figure 1 (c)).
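A sketch of this check for a PyTorch generator `G` (assumed here to map a flat latent vector to a flat output; images would be flattened first) is shown below. The vector-Jacobian product computes $\nabla_z G(z)^\top(G(z) - G(z'))$, and, as a crude stand-in for the linear program used to fit $(\alpha, \gamma)$, the helper simply reports the lower-envelope slope $\min_i u_i/v_i$ with $\gamma = 0$.

```python
import torch

def dissipativity_samples(G, d, num_pairs=500):
    """Collect (u, v) = (<J_G(z)^T (G(z) - G(z')), z - z'>, ||z - z'||^2) pairs."""
    us, vs = [], []
    for _ in range(num_pairs):
        z = torch.randn(d, requires_grad=True)
        z_prime = torch.randn(d)
        diff_out = (G(z) - G(z_prime)).detach()
        # Vector-Jacobian product: J_G(z)^T (G(z) - G(z')).
        vjp, = torch.autograd.grad(G(z), z, grad_outputs=diff_out)
        us.append(torch.dot(vjp, (z - z_prime).detach()).item())
        vs.append(((z - z_prime) ** 2).sum().item())
    return torch.tensor(us), torch.tensor(vs)

def estimate_alpha(G, d):
    """Crude estimate: gamma = 0 and alpha = min_i u_i / v_i
    (the paper instead fits (alpha, gamma) via a linear program)."""
    u, v = dissipativity_samples(G, d)
    return (u / v).min().item()
```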

Similarly, for the compressed sensing case, we also derive values $\alpha_A$ and $\gamma_A$, where a compressive matrix $A$ acts on the output of the generator $G$. Here, we have picked the number of measurements $m = 0.1n$, where $n$ is the signal dimension. This is encapsulated in the following equation:

$$\langle\nabla_z(AG(z))^\top(AG(z) - AG(z')),\, z - z'\rangle \geq \alpha_A\|z - z'\|^2 - \gamma_A \tag{5.1}$$

for all the sampled Gaussian matrices $A$ and different instantiations of $z$ and $z'$. Here, we capture the left side of the inequality in $u(z,z') = \langle\nabla_z(AG(z))^\top(AG(z) - AG(z')),\, z - z'\rangle$. We similarly plot points $(\alpha_A v(z,z') - \gamma_A,\, u(z,z'))$ for all pairs $(z, z')$. The scatter plot is generated for 500 different instantiations of $z$ and $z'$ and 5 different instantiations of $A$. We do this experiment for two DCGAN generators, one trained on MNIST (Figure 1 (b)) and the other trained on CIFAR10 (Figure 1 (d)). These experiments indicate that the dissipativity constant $\alpha$ is positive in all cases.

5.2 Comparison of SGLD against GD

Figure 2: [MNIST] Comparing the recovery performance of SGLD and GD at $m = 0.2n$ measurements. Panels: (a) Ground truth, (b) Initial, (c) GD (MSE = 0.0447), (d) SGLD (MSE = 0.0275).

We test SGLD reconstruction using the update rule in (3.2), and compare it against optimizing the updates of $z$ using standard gradient descent as in Bora et al. (2017b). For all experiments, we use a pre-trained DCGAN generator with the following network configuration: the generator consists of four layers, each comprising a transposed convolution, batch normalization and ReLU activation, followed by a final layer with a transposed convolution and $\tanh$ activation Radford et al. (2015).
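For reference, a generator of this type can be written in PyTorch roughly as follows; the channel widths, kernel sizes and the 64×64 output resolution are illustrative defaults in the spirit of Radford et al. (2015), not the exact configuration used in our experiments. The latent vector is reshaped to a $(d, 1, 1)$ tensor before being fed to the first transposed convolution.

```python
import torch.nn as nn

def dcgan_generator(latent_dim=100, base_channels=64, out_channels=3):
    """DCGAN-style generator: four (ConvTranspose2d, BatchNorm2d, ReLU) blocks
    followed by a ConvTranspose2d + Tanh output layer (64x64 images)."""
    c = base_channels
    return nn.Sequential(
        nn.ConvTranspose2d(latent_dim, c * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(c * 8), nn.ReLU(True),
        nn.ConvTranspose2d(c * 8, c * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(c * 4), nn.ReLU(True),
        nn.ConvTranspose2d(c * 4, c * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(c * 2), nn.ReLU(True),
        nn.ConvTranspose2d(c * 2, c, 4, 2, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(True),
        nn.ConvTranspose2d(c, out_channels, 4, 2, 1, bias=False), nn.Tanh(),
    )
```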

We display the reconstructions on MNIST in Figure 2. Note that the implementation in Bora et al. (2017b) requires 10 random restarts for CS reconstruction, and the authors report the results corresponding to the best reconstruction. This suggests that the standard implementation is likely to get stuck in bad local minima or saddle points. For the sake of fair comparison, we fix the same random initialization of the latent vector $z$ for both GD and SGLD, with no restarts. We select $m = 0.2n$. In Figure 2 we show reconstructions for 16 different examples, which were all reconstructed at once using the same $k = 2000$ steps, learning rate $\eta = 0.02$ and inverse temperature $\beta = 1$ for both approaches. The only difference is the additional noise term in SGLD (Figure 2, part (d)). Notice that this additional noise component helps achieve better reconstruction performance overall, as compared to simple gradient descent.

Phase transition plots scanning a range of compression ratios $m/n$, as well as example reconstructions on CIFAR-10 images, can be found below (Section 5.3). More thorough empirical comparisons with PGD-based approaches Shah & Hegde (2018); Raj et al. (2019) are deferred to future work.

5.3 Reconstructions for CIFAR10

We display the reconstructions on CIFAR10 in Figure 3. As with the implementation for MNIST, for the sake of fair comparison, we fix the same random initialization of the latent vector $z$ for both GD and SGLD, with no restarts. We select $m = 0.3n$. In Figure 3 we show reconstructions for 16 different examples from CIFAR10, which were all reconstructed at once using the same $k = 2000$ steps, learning rate $\eta = 0.05$ and inverse temperature $\beta = 1$ for both approaches. The only difference is the additional noise term in SGLD (Figure 3, part (d)). Similar to our experiments on MNIST, we notice that this additional noise component helps achieve better reconstruction performance overall, as compared to simple gradient descent.

Figure 3: [CIFAR10] Comparing the recovery performance of SGLD and GD at $m = 0.3n$ measurements. Panels: (a) Ground truth, (b) Initial, (c) GD (MSE = 0.0248), (d) SGLD (MSE = 0.0246).

Next, we plot phase transition diagrams by scanning the compression ratio $f = m/n = [0.2, 0.4, 0.6, 0.8, 1.0]$ for the MNIST dataset in Figure 4. For this experiment, we have chosen 5 different instantiations of the sampling matrix $A$ for each compression ratio $f$. In Figure 4 we report the average mean squared error (MSE) of the reconstruction, $\|\hat{x} - x\|^2$, over the 5 different instances of $A$.

Figure 4: Phase transition plots representing the average MSE (log scale) of the reconstructed image versus the compression ratio $f$, using gradient descent and stochastic gradient Langevin dynamics.

We conclude that SGLD gives improved reconstruction quality as compared to GD.

References

  • Asim et al. (2019) Asim, M., Ahmed, A., and Hand, P. Invertible generative models for inverse problems: mitigating representation error and dataset bias. arXiv preprint arXiv:1905.11672, 2019.
  • Bakry et al. (2008) Bakry, D., Barthe, F., Cattiaux, P., Guillin, A., et al. A simple proof of the poincaré inequality for a large class of probability measures. Electronic Communications in Probability, 13:60–66, 2008.
  • Baraniuk et al. (2010) Baraniuk, R., Cevher, V., Duarte, M., and Hegde, C. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982–2001, 2010.
  • Bora et al. (2017a) Bora, A., Jalal, A., Price, E., and Dimakis, A. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  537–546. JMLR. org, 2017a.
  • Bora et al. (2017b) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  537–546. JMLR. org, 2017b.
  • Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
  • Chang et al. (2017) Chang, J., Li, C., Póczos, B., and Kumar, B. One network to solve them all—solving linear inverse problems using deep projection models. In 2017 IEEE International Conference on Computer Vision (ICCV), pp.  5889–5898. IEEE, 2017.
  • Chen et al. (2001) Chen, S., Donoho, D., and Saunders, M. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, 2017.
  • Dong et al. (2016) Dong, C., Loy, C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
  • Donoho (2006) Donoho, D. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • Hale (1990) Hale, J. Asymptotic behavior of dissipative systems. Bull. Am. Math. Soc, 22:175–183, 1990.
  • Hand & Voroninski (2018) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. In Conference On Learning Theory, pp.  970–978, 2018.
  • Hand & Voroninski (2019) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory, 66(1):401–418, 2019.
  • Jagatap & Hegde (2019) Jagatap, G. and Hegde, C. Algorithmic guarantees for inverse imaging with untrained network priors. In Advances in Neural Information Processing Systems, 2019.
  • Jalal et al. (2020) Jalal, A., Karmalkar, S., Dimakis, A. G., and Price, E. Compressed sensing with approximate priors via conditional resampling. Preprint, 2020.
  • Ji & Carin (2007) Ji, S. and Carin, L. Bayesian compressive sensing and projection optimization. In Proceedings of the 24th international conference on Machine learning, pp.  377–384, 2007.
  • Latorre et al. (2019) Latorre, F., Eftekhari, A., and Cevher, V. Fast and provable admm for learning with generative priors. In Advances in Neural Information Processing Systems, pp. 12004–12016, 2019.
  • Lee & Vempala (2018) Lee, Y. T. and Vempala, S. S. Convergence rate of riemannian hamiltonian monte carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp.  1115–1121, 2018.
  • Lei et al. (2019) Lei, Q., Jalal, A., Dhillon, I. S., and Dimakis, A. G. Inverting deep generative models, one layer at a time. In Advances in Neural Information Processing Systems, pp. 13910–13919, 2019.
  • Lindgren et al. (2020) Lindgren, E. M., Whang, J., and Dimakis, A. G. Conditional sampling from invertible generative models with applications to inverse problems. arXiv preprint arXiv:2002.11743, 2020.
  • Liu & Scarlett (2020) Liu, Z. and Scarlett, J. Information-theoretic lower bounds for compressive sensing with generative models. IEEE Journal on Selected Areas in Information Theory, 2020.
  • Lovász & Vempala (2007) Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
  • Lovász et al. (1993) Lovász, L. et al. Random walks on graphs: A survey. Combinatorics, Paul erdos is eighty, 2(1):1–46, 1993.
  • Mousavi & Baraniuk (2017) Mousavi, A. and Baraniuk, R. Learning to invert: Signal recovery via deep convolutional networks. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.  2272–2276. IEEE, 2017.
  • Needell & Tropp (2009) Needell, D. and Tropp, J. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3):301–321, 2009.
  • Ongie et al. (2020) Ongie, G., Jalal, A., Metzler, C., Baraniuk, R., Dimakis, A., and Willett, R. Deep learning techniques for inverse problems in imaging. arXiv preprint arXiv:2005.06001, 2020.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
  • Raj et al. (2019) Raj, A., Li, Y., and Bresler, Y. Gan-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp.  5602–5611, 2019.
  • Shah & Hegde (2018) Shah, V. and Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  4609–4613. IEEE, 2018.
  • Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  9446–9454, 2018.
  • Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp.  681–688, 2011.
  • Whang et al. (2020) Whang, J., Lei, Q., and Dimakis, A. Compressed sensing with invertible generative models and dependent noise. arXiv preprint arXiv:2003.08089, 2020.
  • White (2016) White, T. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.
  • Y. Wu (2019) Wu, Y., Rosca, M., and Lillicrap, T. Deep compressed sensing. arXiv preprint arXiv:1905.06723, 2019.
  • Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pp.  1980–2022. PMLR, 2017.
  • Zou et al. (2020) Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient langevin dynamics for non-log-concave sampling. arXiv preprint arXiv:2010.09597, 2020.

Appendix A Conditions on the generator network

Proposition A.1.

Suppose $G(z)\colon\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n}$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $\mathcal{D}$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G(z)$ satisfies Assumptions (A.2) & (A.3) with constants $\iota_G, \kappa_G, M$, and if $2\iota_G^2 > M\kappa_G$, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.

Proof.

The proof proceeds similarly to Latorre et al. (2019), Appendix B. Since $G(z)$ is a composition of linear maps followed by $C^1$ activation functions, $G(z)$ is continuously differentiable. As a result, the Jacobian $\nabla_z G$ is a continuous matrix-valued function, and its restriction to the compact domain $\mathcal{D}\subseteq\mathbb{R}^{d}$ is Lipschitz-continuous. Therefore, there exists $M\geq 0$ such that

$$\|\nabla_z G(z) - \nabla_z G(z')\| \leq M\|z - z'\|,\qquad\forall z, z'\in\mathcal{D}. \tag{A.1}$$

Thus, Assumption (A.3) holds. Assumption (A.2) is also satisfied according to Latorre et al. (2019), Lemma 5. To show strong smoothness, we use the fundamental theorem of calculus together with the Lipschitzness of $G(z)$ obtained from Assumption (A.2). For every $z, z'\in\mathcal{D}$, and $u(t) = tz + (1 - t)z'$:

$$\begin{aligned}
\langle G(z) - G(z'),\, &\nabla_z G(z)(z - z')\rangle\\
&= \|G(z) - G(z')\|^2 - \langle G(z) - G(z'),\, G(z) - G(z') - \nabla_z G(z)(z - z')\rangle\\
&= \|G(z) - G(z')\|^2 - \int_0^1\langle G(z) - G(z'),\, \big(\nabla_z G(u(t)) - \nabla_z G(z)\big)(z - z')\rangle\,\mathrm{d}t\\
&\geq \iota_G^2\|z - z'\|^2 - \kappa_G M\|z - z'\|^2\int_0^1(1 - t)\,\mathrm{d}t\\
&= \Big(\iota_G^2 - \frac{\kappa_G M}{2}\Big)\|z - z'\|^2,
\end{aligned}$$

where in the last step we use the near-isometry and the Lipschitzness of $\nabla_z G(z)$ obtained above. Consequently, $G(z)$ is $(\iota_G^2 - \frac{\kappa_G M}{2},\, 0)$-strongly smooth if $\iota_G^2 > \frac{\kappa_G M}{2}$. ∎

Lemma A.1 (Measurement complexity).

Let G(z):𝒟dnG(z)\mathrel{\mathop{\mathchar 58\relax}}\mathcal{D}\subset\mathbb{R}^{d}\rightarrow\mathbb{R}^{n} be a feed-forward neural network that satisfies the conditions in Proposition 3.1. Let LL be its Lipschitz constant. If the number of measurements mm satisfies:

m=Ω(dδ2log(κG/γ)),m=\Omega\left(\frac{d}{\delta^{2}}\log(\kappa_{G}/\gamma)\right)\,,

for some small constant \delta>0, and that the entries of A are drawn i.i.d. according to \mathcal{N}(0,\frac{1}{m}). Then the loss function F(z) is (\alpha-\delta\kappa_{G}^{2},\gamma)-dissipative with probability at least 1-\exp(-\Omega(m\delta^{2})).

Proof.

By Proposition A.1, there exist \alpha>0 and \gamma\geq 0 such that G(z) is (\alpha,\gamma)-strongly smooth. Now, note that the left-hand side of (3.6) simplifies to

zz,zF(z)\displaystyle\langle z-z^{*},\nabla_{z}F(z)\rangle =A(G(z)G(z)),AzG(z)(zz),\displaystyle=\left\langle A(G(z)-G(z^{*})),A\nabla_{z}G(z)(z-z^{*})\right\rangle, (A.2)

Denote u=G(z)G(z)u=G(z)-G(z^{*}) and v=zG(z)(zz)v=\nabla_{z}G(z)(z-z^{*}), then

zz,zF(z)=Au,Av=u,v(𝕀AA)u,v.\langle z-z^{*},\nabla_{z}F(z)\rangle=\langle Au,Av\rangle=\langle u,v\rangle-\langle(\mathbb{I}-A^{\top}A)u,v\rangle.

Using a standard result in random matrix theory, we have P(\|\mathbb{I}-A^{\top}A\|\geq\delta)\leq\exp(-m\delta^{2}). Also, \|u\|,\|v\|\leq\kappa_{G}\|z-z^{*}\|. Therefore,

\langle z-z^{*},\nabla_{z}F(z)\rangle\geq\langle u,v\rangle-\delta\kappa_{G}^{2}\|z-z^{*}\|^{2}.

For m=\Omega\left(\frac{d}{\delta^{2}}\log(\kappa_{G}/\gamma)\right), we then have

\langle z-z^{*},\nabla_{z}F(z)\rangle\geq(\alpha-\delta\kappa_{G}^{2})\|z-z^{*}\|^{2}-\gamma,

with probability at least 1-\exp(-\Omega(m\delta^{2})). That is, the loss function F(z) is (\alpha-\delta\kappa_{G}^{2},\gamma)-dissipative with the claimed probability. ∎
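
The dissipativity inequality can likewise be probed empirically for a random Gaussian measurement matrix. The following sketch is an illustration only, with assumed toy dimensions and a hypothetical tanh generator rather than the networks used in our experiments; it evaluates \langle z-z^{*},\nabla_{z}F(z)\rangle/\|z-z^{*}\|^{2} for F(z)=\frac{1}{2}\|y-AG(z)\|^{2} over random latent points.

```python
import numpy as np

# Empirical look at the dissipativity quantity <z - z*, grad F(z)> for
# F(z) = 0.5 * ||y - A G(z)||^2 with Gaussian A. Illustration only.
rng = np.random.default_rng(1)
d, h, n, m = 5, 20, 40, 100
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(n, h)) / np.sqrt(h)
A = rng.normal(size=(m, n)) / np.sqrt(m)   # entries drawn as N(0, 1/m)

def G(z):
    return W2 @ np.tanh(W1 @ z)

def jac_G(z):
    s = 1.0 - np.tanh(W1 @ z) ** 2
    return W2 @ (s[:, None] * W1)

z_star = rng.uniform(-1, 1, size=d)
y = A @ G(z_star)                          # noiseless measurements

def grad_F(z):
    # grad F(z) = -J_G(z)^T A^T (y - A G(z))
    return -jac_G(z).T @ A.T @ (y - A @ G(z))

vals = []
for _ in range(2000):
    z = rng.uniform(-1, 1, size=d)
    vals.append(np.dot(z - z_star, grad_F(z)) / np.linalg.norm(z - z_star) ** 2)

# Dissipativity predicts this ratio stays above a constant, up to a
# gamma / ||z - z*||^2 offset, with high probability over A.
print("min of <z - z*, grad F(z)> / ||z - z*||^2:", min(vals))
```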

Appendix B Properties of F(z)F(z)

In this part, we establish some key properties of the loss function F(z). We use Assumptions (A.1)–(A.3) on boundedness, near-isometry, and Lipschitz gradients to obtain an upper bound on \|\nabla_{z}F(z)\| and the smoothness of F(z).

Lemma B.1 (Lipschitzness of F(z)F(z)).

We have zF(z)κG2AAzz\|\nabla_{z}F(z)\|\leq\kappa_{G}^{2}\|A^{\top}A\|\|z-z^{*}\| for any z𝒟dz\in\mathcal{D}\subset\mathbb{R}^{d}.

Proof.

Recall the gradient of F(z)F(z):

zF(z)\displaystyle\nabla_{z}F(z) =(zG(z))A(yAG(z))=(zG(z))AA(G(z)G(z)).\displaystyle=-(\nabla_{z}G(z))^{\top}A^{\top}(y-AG(z))=-(\nabla_{z}G(z))^{\top}A^{\top}A(G(z^{*})-G(z)).

It follows from the Lipschitz assumption (A.2) that \|G(z^{*})-G(z)\|\leq\kappa_{G}\|z-z^{*}\| and \|\nabla_{z}G(z)\|\leq\kappa_{G}. Therefore,

\|\nabla_{z}F(z)\|\leq\kappa_{G}^{2}\|A^{\top}A\|\|z-z^{*}\|. ∎
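
Since the remaining lemmas rely on this closed-form gradient, a finite-difference comparison is a cheap way to validate an implementation. The sketch below is our own illustration with a hypothetical toy generator and placeholder names; it checks \nabla_{z}F(z)=-(\nabla_{z}G(z))^{\top}A^{\top}(y-AG(z)) against central differences.

```python
import numpy as np

# Finite-difference check of grad F(z) = -J_G(z)^T A^T (y - A G(z))
# for F(z) = 0.5 * ||y - A G(z)||^2. Toy generator, illustration only.
rng = np.random.default_rng(2)
d, h, n, m = 4, 10, 20, 30
W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(n, h))
A = rng.normal(size=(m, n)) / np.sqrt(m)

def G(z):
    return W2 @ np.tanh(W1 @ z)

def jac_G(z):
    s = 1.0 - np.tanh(W1 @ z) ** 2
    return W2 @ (s[:, None] * W1)

z_star = rng.normal(size=d)
y = A @ G(z_star)

def F(z):
    r = y - A @ G(z)
    return 0.5 * r @ r

def grad_F(z):
    return -jac_G(z).T @ A.T @ (y - A @ G(z))

z = rng.normal(size=d)
eps = 1e-6
num = np.array([(F(z + eps * e) - F(z - eps * e)) / (2 * eps) for e in np.eye(d)])
print("max abs error vs central differences:", np.max(np.abs(num - grad_F(z))))
```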

Lemma B.2 (Smoothness of F(z)F(z)).

For any z,z𝒟dz,z^{\prime}\in\mathcal{D}\subset\mathbb{R}^{d}, we have

zF(z)zF(z)(MB+κG2)AAzz.\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\|\leq(MB+\kappa_{G}^{2})\|A^{\top}A\|\|z-z^{\prime}\|.
Proof.

We use the boundedness assumption (A.1), which gives \|G(z^{*})\|\leq B, together with the Lipschitzness and smoothness assumptions on G(z). We decompose the difference as

zF(z)zF(z)\displaystyle\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\| (zG(z)zG(z))AAG(z)\displaystyle\leq\|(\nabla_{z}G(z^{\prime})-\nabla_{z}G(z))^{\top}A^{\top}AG(z^{*})\|
+(zG(z))AA(G(z)G(z))\displaystyle\qquad+\|(\nabla_{z}G(z))^{\top}A^{\top}A(G(z)-G(z^{\prime}))\|
+(zG(z)zG(z))AAG(z)\displaystyle\qquad+\|(\nabla_{z}G(z)-\nabla_{z}G(z^{\prime}))^{\top}A^{\top}AG(z^{\prime})\|

Then, using the boundedness, Lipschitzness and smoothness, we arrive at:

\|\nabla_{z}F(z)-\nabla_{z}F(z^{\prime})\|\leq(MB+\kappa_{G}^{2})\|A^{\top}A\|\|z-z^{\prime}\|.

Therefore, F(z)F(z) is LL-smooth, with L=(MB+κG2)AAL=(MB+\kappa_{G}^{2})\|A^{\top}A\|. ∎

Appendix C Conductance Analysis

In this section, we provide the proofs of Lemmas 4.1 and 4.3 based on the conductance analysis laid out in Zhang et al. (2017) and, similarly, in Zou et al. (2020). The proof of Lemma 4.2 follows directly from Lemma 6.3 of Zou et al. (2020).

Proof of Lemma 4.1.

We use the same idea as in Lemma 3 of Zhang et al. (2017) and Lemma 6.1 of Zou et al. (2020). The main difference is that Algorithm 1 uses the full gradient \nabla_{z}F(z) instead of a stochastic mini-batch gradient, which slightly simplifies the proof of this lemma.

We consider two cases: u\not\in\mathcal{A} and u\in\mathcal{A}. Once the first case is proved, the second follows easily by splitting \mathcal{A} into \{u\} and \mathcal{A}\backslash\{u\} and applying the result of the first case. For a detailed treatment of the latter case, we refer the reader to the proof of Lemma 6.1 in Zou et al. (2020).

Suppose now that u\notin\mathcal{A}. Then we have

𝒯u(𝒜)=𝒜(u,r)𝒯u(w)dw=𝒜(u,r)αu(w)𝒯u(w)dw.\displaystyle\mathcal{T}^{\star}_{u}(\mathcal{A})=\int_{\mathcal{A}\cap\mathcal{B}(u,r)}\mathcal{T}^{\star}_{u}(w){\mathrm{d}}w=\int_{\mathcal{A}\cap\mathcal{B}(u,r)}\alpha_{u}(w)\mathcal{T}_{u}(w){\mathrm{d}}w. (C.1)

where \alpha_{u}(w) is the Metropolis–Hastings acceptance ratio. It suffices to show that \alpha_{u}(w)\geq 1-\delta/2 for all w\in\mathcal{D}\cap\mathcal{B}(u,r), which implies

(1δ/2)𝒯u(𝒜)𝒯u(𝒜)𝒯u(𝒜).\displaystyle(1-\delta/2)\mathcal{T}_{u}(\mathcal{A})\leq\mathcal{T}^{\star}_{u}(\mathcal{A})\leq\mathcal{T}_{u}(\mathcal{A}).

The right-hand inequality is immediate from the definition of \alpha_{u}(w) (an acceptance ratio is at most one), and we can ensure \delta\leq 1/2 by choosing \eta sufficiently small. It remains to show that

𝒯w(u)𝒯u(w)exp(β(F(w)F(u)))1δ/2.\displaystyle\frac{\mathcal{T}_{w}(u)}{\mathcal{T}_{u}(w)}\cdot\exp(-\beta(F(w)-F(u)))\geq 1-\delta/2. (C.2)

Using the definition of \mathcal{T}_{u}(w), this inequality becomes

exp(wu+ηg(u)224η/βuw+ηg(w)224η/β)exp(β(F(w)F(u)))1δ/2.\displaystyle\exp\bigg{(}\frac{\|w-u+\eta g(u)\|_{2}^{2}}{4\eta/\beta}-\frac{\|u-w+\eta g(w)\|_{2}^{2}}{4\eta/\beta}\bigg{)}\exp(-\beta(F(w)-F(u)))\geq 1-\delta/2.

Note that g(z)=\nabla_{z}F(z). Expanding the squares in the first exponent and combining with the second exponent, the logarithm of the left-hand side becomes

β(F(w)F(u)12wu,zF(w)+zF(u))+ηβ4(zF(u)2zF(w)2).\displaystyle-\beta\left(F(w)-F(u)-\frac{1}{2}\langle w-u,\nabla_{z}F(w)+\nabla_{z}F(u)\rangle\right)+\frac{\eta\beta}{4}(\|\nabla_{z}F(u)\|^{2}-\|\nabla_{z}F(w)\|^{2}). (C.3)

To lower bound this quantity, we appeal to the smoothness of F(z). Specifically, by Lemmas B.1 and B.2, F(z) is L-smooth and \|\nabla_{z}F(z)\|\leq D on the bounded domain \mathcal{D}, with L=(MB+\kappa_{G}^{2})\|A^{\top}A\| and D=\kappa_{G}^{2}\|A^{\top}A\| (up to the diameter of \mathcal{D}). Then,

F(w)\displaystyle F(w) F(u)+wu,F(u)+Lwu222,\displaystyle\leq F(u)+\langle w-u,\nabla F(u)\rangle+\frac{L\|w-u\|_{2}^{2}}{2},
F(u)\displaystyle F(u) F(w)+uw,F(w)Lwu222.\displaystyle\geq F(w)+\langle u-w,\nabla F(w)\rangle-\frac{L\|w-u\|_{2}^{2}}{2}.

This directly implies that

\displaystyle\big{|}F(w)-F(u)-\frac{1}{2}\langle w-u,\nabla F(w)+\nabla F(u)\rangle\big{|}\leq\frac{L\|w-u\|_{2}^{2}}{2}. (C.4)

Moreover,

|zF(u)22zF(w)22|\displaystyle\big{|}\|\nabla_{z}F(u)\|_{2}^{2}-\|\nabla_{z}F(w)\|_{2}^{2}\big{|} F(u)F(w)2F(u)+F(w)2\displaystyle\leq\|\nabla F(u)-\nabla F(w)\|_{2}\cdot\|\nabla F(u)+\nabla F(w)\|_{2}
2LDwu2.\displaystyle\leq 2LD\|w-u\|_{2}. (C.5)

Combining (C.4) and (C.5) with (C.3), together with w\in\mathcal{B}(u,r) and r=\sqrt{10\eta d/\beta}, we obtain

LHS of (C.3) Lβwu22ηβLDwu2\displaystyle\geq-\frac{L\beta\|w-u\|^{2}}{2}-\frac{\eta\beta LD\|w-u\|}{2}
\displaystyle\geq-5Ld\eta-5LDd^{1/2}\beta^{1/2}\eta^{3/2}.

Picking \delta/2=5Ld\eta+5LDd^{1/2}\beta^{1/2}\eta^{3/2} and using the fact that e^{-x}\geq 1-x for x\geq 0 completes the proof. ∎

Next, we lower bound the conductance \phi of \mathcal{T}^{\star}_{u}(\cdot) using ideas from Lee & Vempala (2018) and Zou et al. (2020), first restating the following lemma:

Lemma C.1 (Lemma 13 in Lee & Vempala (2018)).

Let \mathcal{T}^{\star}_{u}(\cdot) be a time-reversible Markov chain on \mathcal{D} with stationary distribution \pi. Suppose there exists a fixed \Delta>0 such that for any u,v\in\mathcal{D} with \|u-v\|_{2}\leq\Delta we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. Then the conductance of \mathcal{T}^{\star}_{u}(\cdot) satisfies \phi\geq C\rho\Delta for some constant C>0, where \rho is the Cheeger constant of \pi.

Proof of Lemma 4.3.

To apply Lemma C.1, we follow the same approach as Zou et al. (2020) and reuse some of their results without proof. To this end, we show that for some \Delta and any pair u,v\in\mathcal{D} with \|u-v\|_{2}\leq\Delta, we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. Recall that the distribution of the iterate z after one step of standard SGLD, without the accept/reject step in (4.1), is

P(z|u)=1(4πη/β)d/2exp(zu+ηg(u)224η/β)\displaystyle P(z|u)=\frac{1}{(4\pi\eta/\beta)^{d/2}}\exp\bigg{(}-\frac{\|z-u+\eta g(u)\|_{2}^{2}}{4\eta/\beta}\bigg{)}

Since Algorithm 1 accepts the candidate only if it falls in the region 𝒟(u,r)\mathcal{D}\cap\mathcal{B}(u,r), the acceptance probability is

p(u)=zP(|u)[z𝒟(u,r)].\displaystyle p(u)=\mathbb{P}_{z\sim P(\cdot|u)}\big{[}z\in\mathcal{D}\cap\mathcal{B}(u,r)\big{]}.

Therefore, the transition probability 𝒯u(z)\mathcal{T}^{\star}_{u}(z) for z𝒟(u,r)z\in\mathcal{D}\cap\mathcal{B}(u,r) is given by

𝒯u(z)=2p(u)+p(u)(1αu(z))2δu(z)+αu(z)2P(z|u)𝟏[z𝒟(u,r)].\displaystyle\mathcal{T}^{\star}_{u}(z)=\frac{2-p(u)+p(u)(1-\alpha_{u}(z))}{2}\delta_{u}(z)+\frac{\alpha_{u}(z)}{2}P(z|u)\cdot{\mathbf{1}}[z\in\mathcal{D}\cap\mathcal{B}(u,r)].

Take u,v\in\mathcal{D} and let \mathcal{S}_{u}=\mathcal{D}\cap\mathcal{B}(u,r) and \mathcal{S}_{v}=\mathcal{D}\cap\mathcal{B}(v,r). By the definition of the total variation distance, there exists a set \mathcal{A}\subseteq\mathcal{D} such that

𝒯u()𝒯v()TV\displaystyle\|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV} =|𝒯u(𝒜)𝒯v(𝒜)|\displaystyle=|\mathcal{T}^{\star}_{u}(\mathcal{A})-\mathcal{T}^{\star}_{v}(\mathcal{A})|
maxu,z[2p(u)+p(u)(1αu(z))2]I1\displaystyle\leq\underbrace{\max_{u,z}\bigg{[}\frac{2-p(u)+p(u)(1-\alpha_{u}(z))}{2}\bigg{]}}_{I_{1}}
+12|z𝒜αu(z)P(z|u)𝟏(z𝒮u)αv(z)P(z|v)𝟏(z𝒮v)dz|I2.\displaystyle+\frac{1}{2}\underbrace{\bigg{|}\int_{z\in\mathcal{A}}\alpha_{u}(z)P(z|u){\mathbf{1}}(z\in\mathcal{S}_{u})-\alpha_{v}(z)P(z|v){\mathbf{1}}(z\in\mathcal{S}_{v}){\mathrm{d}}z\bigg{|}}_{I_{2}}.

Since our mini-batch is the full set of measurements (i.e., we use the full gradient), we can reuse the bounds on I_{1} and I_{2} from Lemmas C.4 and C.5 of Zou et al. (2020). Consequently,

𝒯u()𝒯v()TVI1+I2/20.85+0.1δ+βuv22η.\displaystyle\|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq I_{1}+I_{2}/2\leq 0.85+0.1\delta+\frac{\sqrt{\beta}\|u-v\|_{2}}{\sqrt{2\eta}}.

By Lemma 4.1, we have δ=10Ldη+10LDd1/2β1/2η3/212Ldη\delta=10Ld\eta+10LDd^{1/2}\beta^{1/2}\eta^{3/2}\leq 12Ld\eta if ηd25βD2\eta\leq\frac{d}{25\beta D^{2}}. Thus if

\displaystyle\eta\leq\frac{1}{25\beta D^{2}}\wedge\frac{1}{30Ld}\quad\mbox{and}\quad\|u-v\|_{2}\leq\frac{\sqrt{2\eta}}{10\sqrt{\beta}}\leq 0.1r,

we have \|\mathcal{T}^{\star}_{u}(\cdot)-\mathcal{T}^{\star}_{v}(\cdot)\|_{TV}\leq 0.99. As a consequence of Lemma C.1, applied with \Delta=\frac{\sqrt{2\eta}}{10\sqrt{\beta}}, we obtain the lower bound on the conductance \phi of \mathcal{T}^{\star}_{u}(\cdot)

ϕc0ρη/β,\displaystyle\phi\geq c_{0}\rho\sqrt{\eta/\beta},

which finishes the proof. ∎
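
For concreteness, the lazy, ball-restricted, Metropolis-adjusted Langevin transition kernel analyzed in this appendix can be sketched in a few lines. The snippet below is a minimal illustration, assuming the Gaussian proposal density from (4.1), rejection of candidates that leave \mathcal{D}\cap\mathcal{B}(u,r), and a lazy self-loop with probability 1/2; the function names, the quadratic toy loss, and the parameter values are ours and only stand in for the generator-based loss F(z).

```python
import numpy as np

def mala_restricted_step(u, F, grad_F, eta, beta, r, R, rng):
    """One lazy, ball-restricted Metropolis-adjusted Langevin step (a sketch).

    Proposal: w = u - eta * grad_F(u) + sqrt(2 * eta / beta) * N(0, I),
    rejected if w leaves D ∩ B(u, r) with D = B(0, R); otherwise accepted with the
    Metropolis-Hastings ratio for the Gibbs target proportional to exp(-beta * F)."""
    if rng.random() < 0.5:                        # lazy self-loop
        return u
    w = u - eta * grad_F(u) + np.sqrt(2 * eta / beta) * rng.normal(size=u.shape)
    if np.linalg.norm(w) > R or np.linalg.norm(w - u) > r:
        return u                                  # candidate outside D ∩ B(u, r)
    def log_q(a, b):                              # log proposal density q(b | a)
        return -beta * np.sum((b - a + eta * grad_F(a)) ** 2) / (4.0 * eta)
    log_alpha = -beta * (F(w) - F(u)) + log_q(w, u) - log_q(u, w)
    if np.log(rng.random()) < min(0.0, log_alpha):
        return w
    return u

# Toy usage on a quadratic loss (a placeholder for the generator-based loss);
# the parameter values are illustrative, not the tuned values from the paper.
rng = np.random.default_rng(3)
d, beta, eta = 5, 50.0, 1e-3
R, r = 10.0, np.sqrt(10 * eta * d / beta)
z_opt = np.ones(d)
F = lambda z: 0.5 * np.sum((z - z_opt) ** 2)
grad_F = lambda z: z - z_opt
z = np.zeros(d)
for _ in range(20000):
    z = mala_restricted_step(z, F, grad_F, eta, beta, r, R, rng)
print("distance to minimizer after 20000 steps:", np.linalg.norm(z - z_opt))
```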

Appendix D Property of the Gibbs algorithm

Proposition D.1.

For 𝒟=(0,R)\mathcal{D}=\mathcal{B}(0,R), we have

𝒟F(z)π(dz)\displaystyle\int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z) 𝒪(dβlogβLd).\displaystyle\leq\mathcal{O}\left(\frac{d}{\beta}\log\frac{\beta L}{d}\right).
Proof.

Let p(z)=e^{-\beta F(z)}/\Lambda denote the density of \pi, where \Lambda\triangleq\int_{\mathcal{D}}e^{-\beta F(z)}{\mathrm{d}}z is the partition function. We start by writing

𝒟F(z)π(dz)=1β(h(p)logΛ),\displaystyle\int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z)=\frac{1}{\beta}\left(h(p)-\log\Lambda\right), (D.1)

where

h(p)=-\int_{\mathcal{D}}p(z)\log p(z){\mathrm{d}}z=-\int_{\mathcal{D}}\frac{e^{-\beta F(z)}}{\Lambda}\log\frac{e^{-\beta F(z)}}{\Lambda}{\mathrm{d}}z

is the differential entropy of p. To upper-bound h(p), we use the fact that the differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment. Moreover, since p is supported on the Euclidean ball with radius R, its second moment is bounded by R^{2}. Therefore, we have

\displaystyle h(p)\leq h\left(\mathcal{N}\left(0,\tfrac{R^{2}}{d}\mathbb{I}\right)\right)=\frac{d}{2}\log\frac{2\pi eR^{2}}{d}. (D.2)

Next, we give a lower bound on the second term, \log\Lambda. We use the L-smoothness of F(z) and the fact that z^{*} minimizes F with F(z^{*})=0 and \nabla_{z}F(z^{*})=0, which give F(z)\leq\frac{L}{2}\|z-z^{*}\|^{2} for all z\in\mathcal{D}. As such,

\displaystyle\log\Lambda=\log\int_{\mathcal{D}}e^{-\beta F(z)}{\mathrm{d}}z\geq\log\int_{\mathcal{D}}e^{-\beta L\|z-z^{*}\|^{2}/2}{\mathrm{d}}z\gtrsim\frac{d}{2}\log\frac{2\pi}{\beta L}. (D.3)

Using (D.2) and (D.3) in (D.1) and simplifying, we prove the result. ∎
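
As a quick numerical sanity check of Proposition D.1 (an illustration under toy, assumed values, not a proof), one can integrate \int_{\mathcal{D}}F(z)\pi({\mathrm{d}}z) for a one-dimensional quadratic loss restricted to an interval and compare it against the \frac{d}{\beta}\log\frac{\beta L}{d} scaling.

```python
import numpy as np

# 1-D numerical check that the Gibbs average of F is on the order of
# (d / beta) * log(beta * L / d) for a quadratic loss on D = [-R, R].
# Toy values; only the scaling in beta is being eyeballed here.
L_smooth, R, z_star = 2.0, 5.0, 0.3
z = np.linspace(-R, R, 200001)
F = 0.5 * L_smooth * (z - z_star) ** 2

for beta in [10.0, 100.0, 1000.0]:
    w = np.exp(-beta * F)                     # unnormalized Gibbs density on [-R, R]
    gibbs_avg = np.sum(F * w) / np.sum(w)     # E_pi[F] by Riemann-sum quadrature
    surrogate = (1.0 / beta) * np.log(beta * L_smooth)  # d = 1 stand-in for the bound
    print(f"beta={beta:7.1f}  E_pi[F]={gibbs_avg:.5f}  (d/beta)log(beta*L)={surrogate:.5f}")
```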