Provable Compressed Sensing with Generative Priors via Langevin Dynamics
Abstract
Deep generative models have emerged as a powerful class of priors for signals in various inverse problems such as compressed sensing, phase retrieval and super-resolution. Here, we assume an unknown signal to lie in the range of some pre-trained generative model. A popular approach for signal recovery is via gradient descent in the low-dimensional latent space. While gradient descent has achieved good empirical performance, its theoretical behavior is not well understood. In this paper, we introduce the use of stochastic gradient Langevin dynamics (SGLD) for compressed sensing with a generative prior. Under mild assumptions on the generative model, we prove the convergence of SGLD to the true signal. We also demonstrate competitive empirical performance to standard gradient descent.
1 Introduction
We consider the familiar setting of inverse problems where the goal is to recover an $n$-dimensional signal $x^* \in \mathbb{R}^n$ that is indirectly observed via a linear measurement operation $y = Ax^*$. The measurement vector $y \in \mathbb{R}^m$ can be noisy, and its dimension $m$ may be less than $n$. Several practical applications fit this setting, including super-resolution (Dong et al., 2016), in-painting, denoising Vincent et al. (2010), and compressed sensing Donoho (2006); Chang et al. (2017).
Since such an inverse problem is ill-posed in general, the recovery of from often requires assuming a low-dimensional structure or prior on . Choices of good priors have been extensively explored in the past three decades, including sparsity Chen et al. (2001); Needell & Tropp (2009), structured sparsity Baraniuk et al. (2010), end-to-end training via convolutional neural networks Chang et al. (2017); Mousavi & Baraniuk (2017), pre-trained generative priors Bora et al. (2017a), as well as untrained deep image priors Ulyanov et al. (2018); Jagatap & Hegde (2019).
In this paper, we focus on a powerful class of priors based on deep generative models. The setup is the following: the unknown signal $x^*$ is assumed to lie in the range of some pre-trained generator network $G$, obtained from (say) a generative adversarial network (GAN) or a variational autoencoder (VAE). That is, $x^* = G(z^*)$ for some $z^*$ in the latent space. The task is again to recover $x^*$ from (noisy) linear measurements.
Such generative priors have been shown to achieve high empirical success Chang et al. (2017); Bora et al. (2017a); Wu et al. (2019). However, progress on the theoretical side for inverse problems with generative priors has been much more modest. On the one hand, the seminal work of Bora et al. (2017b) established the first statistical upper bounds (in terms of measurement complexity) for compressed sensing with fairly general generative priors, which were later shown in Liu & Scarlett (2020) to be nearly optimal. On the other hand, provable algorithmic guarantees for recovery using generative priors are only available in very restrictive cases. The paper Hand & Voroninski (2018) proves the convergence of (a variant of) gradient descent for shallow generative priors whose weights obey a distributional assumption. The paper Shah & Hegde (2018) proves the convergence of projected gradient descent (PGD) under the assumption that the range of the (possibly deep) generative model admits a polynomial-time oracle projection. To our knowledge, the most general algorithmic result in this line of work is by Latorre et al. (2019). There, the authors show that under rather mild and intuitive assumptions on $G$, a linearized alternating direction method of multipliers (ADMM) applied to a regularized mean-squared error loss converges to a (potentially large) neighborhood of $x^*$.
The main barrier to obtaining guarantees for recovery algorithms based on gradient descent is the non-convexity of the recovery problem induced by the generator network. Therefore, in this paper we sidestep traditional gradient descent-style optimization methods, and instead show that a very good estimate of $x^*$ can also be obtained by performing stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011); Raginsky et al. (2017); Zhang et al. (2017); Zou et al. (2020). We show that this dynamics amounts to sampling from a Gibbs distribution whose energy function is precisely the reconstruction loss. (While preparing this manuscript, we became aware of concurrent work by Jalal et al. (2020), which also pursues a similar Langevin-style approach for solving compressed sensing problems; however, they do not theoretically analyze its dynamics.)
As a stochastic version of gradient descent, SGLD is simple to implement. However, care must be taken in constructing the additive stochastic perturbation to each gradient update step. Nevertheless, the sampling viewpoint enables us to achieve finite-time convergence guarantees for compressed sensing recovery. To the best of our knowledge, this is the first such result for solving compressed sensing problems with generative neural network priors. Moreover, our analysis succeeds under (slightly) weaker assumptions on the generator network than those made in Latorre et al. (2019). Our specific contributions are as follows:
-
1.
We propose a provable compressed sensing recovery algorithm for generative priors based on stochastic gradient Langevin dynamics (SGLD).
-
2.
We prove polynomial-time convergence of our proposed recovery algorithm to the true underlying solution, under assumptions of smoothness and near-isometry of $G$. These are technically weaker than the mild assumptions made in Latorre et al. (2019). We emphasize that these conditions are valid for a wide range of generator networks. Section 3 describes them in greater detail.
-
3.
We provide several empirical results and demonstrate that our approach is competitive with existing (heuristic) methods based on gradient descent.
2 Prior work
We briefly review the literature on compressed sensing with deep generative models. For a thorough survey on deep learning for inverse problems, see Ongie et al. (2020).
In Bora et al. (2017a), the authors provide sufficient conditions under which the solution of the inverse problem is a minimizer of the (possibly non-convex) program:
$\min_{z \in \mathbb{R}^k} \; \|y - A\,G(z)\|_2^2.$   (2.1)
Specifically, they show that if $A$ satisfies the so-called Set-Restricted Eigenvalue Condition (S-REC), then the solution to (2.1) equals the unknown vector $x^*$. They also show that if the generator $G$ has latent dimension $k$ and is $L$-Lipschitz, then a matrix $A$ populated with i.i.d. Gaussian entries satisfies the S-REC, provided $m = O(k \log L)$ (up to logarithmic factors). However, they propose gradient descent as a heuristic to solve (2.1), and do not analyze its convergence. In Shah & Hegde (2018), the authors show that projected gradient descent (PGD) for (2.1) converges at a linear rate under the REC, but only if there exists a tractable oracle that can compute the projection of any point onto the range of $G$. The recent work Lei et al. (2019) provides sufficient conditions under which such a projection can be approximately computed. In Latorre et al. (2019), a provable recovery scheme based on ADMM is established, but it guarantees recovery only up to a neighborhood around $x^*$.
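For concreteness, the following is a minimal sketch of the PGD scheme just described, assuming access to a projection oracle onto the range of $G$. The oracle `proj_range_G`, the step size, and the iteration count are illustrative assumptions, not specifics from the papers cited above.

```python
# Hedged sketch of PGD for (2.1), in the spirit of Shah & Hegde (2018). It
# assumes access to a projection oracle proj_range_G(x) mapping x onto the
# range of G; the step size and iteration count are illustrative choices.
import numpy as np

def pgd_recover(y, A, proj_range_G, n, eta=0.5, n_iters=100):
    x = np.zeros(n)
    for _ in range(n_iters):
        x = x - eta * A.T @ (A @ x - y)   # gradient step on (1/2)||y - Ax||^2
        x = proj_range_G(x)               # project back onto the range of G
    return x
```

The guarantee of Shah & Hegde (2018) applies only when such a projection oracle is tractable, which motivates the oracle-free approach studied in this paper.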
Note that all the above works assume mild conditions on the weights of the generator, use variations of gradient descent to update the estimate of the latent vector, and require the forward matrix $A$ to satisfy the (S-)REC over the range of $G$. Hand & Voroninski (2018, 2019) showed global convergence for gradient descent, but under the (strong) assumption that the weights of the trained generator are Gaussian distributed.
Generator networks trained with GANs are most commonly studied. However, Whang et al. (2020); Asim et al. (2019) have recently advocated using invertible generative models, which use real-valued non-volume preserving (NVP) transformations Dinh et al. (2017). An alternate strategy for sampling images consistent with linear forward models was proposed in Lindgren et al. (2020) where the authors assume an invertible generative mapping and sample the latent vector from a second generative invertible prior.
Our proposed approach also traces its roots to Bayesian compressed sensing Ji & Carin (2007), where instead of modeling the problem as estimating a (deterministic) sparse vector, one models the signal as being sampled from a sparsity-promoting distribution, such as a Laplace prior. One can then derive the maximum a posteriori (MAP) estimate of the signal under the constraint that the measurements are consistent. Our motivation is similar, except that we model the distribution of $x^*$ as being supported on the range of a generative prior.
3 Recovery via Langevin dynamics
We first fix notation. Given a distribution $\mu$ and a set $S$, we write $\mu(S)$ for the probability measure of $S$ under $\mu$; $\|\mu - \nu\|_{TV}$ denotes the total variation distance between two distributions $\mu$ and $\nu$. Finally, we use standard big-O notation in our analysis.
3.1 Preliminaries
We focus on the problem of recovering a signal $x^* \in \mathbb{R}^n$ from a set of $m$ linear measurements $y \in \mathbb{R}^m$, where $y = A\,x^*$.
To keep our analysis and results simple, we consider zero measurement noise. (We note in passing that our analysis techniques succeed for any additive noise vector with bounded norm.) Here, $A \in \mathbb{R}^{m \times n}$ is a matrix populated with i.i.d. Gaussian entries with mean 0 and variance $1/m$. We assume that $x^*$ belongs to the range of a known generative model $G : \mathbb{R}^k \to \mathbb{R}^n$; that is, $x^* = G(z^*)$ for some latent vector $z^*$.
Following Bora et al. (2017b), we restrict $z$ to belong to a $k$-dimensional Euclidean ball of radius $R$, i.e., $z \in K := \{z \in \mathbb{R}^k : \|z\|_2 \le R\}$. Then, given the measurements $y$, our goal is to recover $x^* = G(z^*)$. Again following Bora et al. (2017b), we do so by solving the usual optimization problem:
$\min_{z \in K} \; f(z) := \|y - A\,G(z)\|_2^2.$   (3.1)
Hereon and unless otherwise stated, $\|\cdot\|$ denotes the $\ell_2$-norm. The most popular approach to solving (3.1) is gradient descent Bora et al. (2017b). For generative models defined by deep neural networks, the function $f$ is highly non-convex, and as such, one cannot in general guarantee global signal recovery using regular (projected) gradient descent.
We adopt a slightly more refined approach. Starting from an initial point $z_0 \in K$, our algorithm computes stochastic gradient updates of the form:
$z_{t+1} = z_t - \eta\,\nabla f(z_t) + \sqrt{2\eta/\beta}\;\xi_t,$   (3.2)
where $\xi_t$ is a unit Gaussian random vector in $\mathbb{R}^k$, $\eta$ is the step size, and $\beta$ is an inverse temperature parameter. This update rule is known as stochastic gradient Langevin dynamics (SGLD) Welling & Teh (2011) and has been widely studied both in theory and in practice Raginsky et al. (2017); Zhang et al. (2017). Intuitively, (3.2) is an Euler discretization of the continuous-time diffusion equation:
$dZ_t = -\nabla f(Z_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t,$   (3.3)
where $B_t$ is a standard Brownian motion in $\mathbb{R}^k$. Under standard regularity conditions on $f$, one can show that the above diffusion has a unique invariant Gibbs measure.
We refine the standard SGLD to account for the boundedness of the domain $K$. Specifically, we require an additional Metropolis-like accept/reject step to ensure that the iterate always belongs to the support $K$, and also does not move too far from the iterate of the previous iteration. We study this variant for theoretical analysis; in practice we have found that this step is not necessary. Algorithm 1 (CS-SGLD) shows the detailed algorithm; a code sketch is given below. Note that we can use a stochastic (mini-batch) gradient instead of the full gradient $\nabla f(z_t)$.
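To make the iteration concrete, here is a minimal sketch of the CS-SGLD update (3.2) together with the accept/reject safeguard. The generator interface `G`, the radius `R`, the step cap, and all hyperparameter values are illustrative assumptions rather than the exact settings of Algorithm 1.

```python
# A minimal sketch of CS-SGLD: the update (3.2) plus the accept/reject safeguard.
# G is any differentiable generator mapping R^k -> R^n (e.g., a PyTorch module);
# R, step_cap, and the hyperparameters below are illustrative assumptions.
import torch

def recon_loss(z, y, A, G):
    """f(z) = ||y - A G(z)||_2^2 for the noiseless measurement model."""
    return ((y - A @ G(z).reshape(-1)) ** 2).sum()

def cs_sgld(y, A, G, k, R=3.0, eta=1e-3, beta=100.0, n_iters=2000, step_cap=1.0):
    z = torch.zeros(k, requires_grad=True)
    for _ in range(n_iters):
        grad, = torch.autograd.grad(recon_loss(z, y, A, G), z)
        noise = torch.randn(k)
        proposal = z.detach() - eta * grad + (2.0 * eta / beta) ** 0.5 * noise
        # Keep the proposal only if it stays in the ball of radius R and does
        # not move too far from the previous iterate; otherwise stay put.
        if proposal.norm() <= R and (proposal - z.detach()).norm() <= step_cap:
            z = proposal.requires_grad_(True)
    return z.detach()
```

In the experiments of Section 5 we drop the accept/reject check and simply run the noisy gradient updates.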
We wish to derive sufficient conditions for the convergence (in distribution) of the random process in Algorithm 1 to the target distribution $\pi$, given by:
$\pi(z) \;\propto\; \exp\!\big(-\beta f(z)\big), \quad z \in K,$   (3.4)
and study its consequences for recovering the true signal $x^*$. This leads to the first guarantees of a stochastic gradient-like method for compressed sensing with generative priors. In order to do so, we make the following three assumptions on the generator network $G$.
-
(A.1)
Boundedness. For all $z \in K$, we have $\|G(z)\|_2 \le B$ for some constant $B > 0$.
-
(A.2)
Near-isometry. $G$ is a near-isometric mapping: there exist constants $0 < \ell \le L$ such that the following holds for any $z, z' \in K$: $\ell\,\|z - z'\|_2 \le \|G(z) - G(z')\|_2 \le L\,\|z - z'\|_2$.
-
(A.3)
Lipschitz gradients. The Jacobian of $G$ is $M$-Lipschitz, i.e., for any $z, z' \in K$, we have
$\|J_G(z) - J_G(z')\| \le M\,\|z - z'\|_2,$
where $J_G(z)$ is the Jacobian of the mapping $z \mapsto G(z)$ with respect to $z$.
All three assumptions are justifiable. Assumption (A.1) is reasonable due to the bounded domain $K$ and holds for well-trained generative models whose target data distribution is normalized. Assumption (A.2) is reminiscent of the ubiquitous restricted isometry property (RIP) used in compressed sensing analysis (Candes & Tao, 2005) and was recently adopted in (Latorre et al., 2019). Finally, Assumption (A.3) is needed so that the loss function $f$ is smooth, following typical analyses of Markov processes.
Next, we introduce a new concept of smoothness for generative networks. This concept is a weaker version of a condition on introduced in Latorre et al. (2019).
Definition 3.1 (Strong smoothness).
The generator network $G$ is $(\alpha, \gamma)$-strongly smooth if there exist $\alpha > 0$ and $\gamma \ge 0$ such that for any $z, z' \in K$, we have
$\big\langle J_G(z)^\top \big(G(z) - G(z')\big),\; z - z' \big\rangle \;\ge\; \alpha\,\|z - z'\|_2^2 \;-\; \gamma.$   (3.5)
Following Latorre et al. (2019) (Assumption 2), we call this property “strong smoothness”. However, our definition requires two parameters instead of one, and is weaker since we allow for the additive slack parameter $\gamma$.
Definition 3.1 can be closely linked to the following property of the loss function that turns out to be crucial in establishing convergence results for CS-SGLD.
Definition 3.2 (Dissipativity (Hale, 1990)).
A differentiable function $f$ on $K$ is $(a, b)$-dissipative around $z^*$ if for constants $a > 0$ and $b \ge 0$, we have
$\langle \nabla f(z),\; z - z^* \rangle \;\ge\; a\,\|z - z^*\|_2^2 \;-\; b \quad \text{for all } z \in K.$   (3.6)
It is straightforward to see that (3.6) essentially recovers the strong smoothness condition (3.5) if the measurement matrix $A$ is the identity. In compressed sensing, $A$ is typically a (sub)Gaussian matrix; given a sufficient number of measurements, together with Assumptions (A.1), (A.2) and (A.3), the dissipativity of $f$ for such an $A$ can still be established.
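As a quick sanity check of Definition 3.2 (a simple example, assuming the standard form of (3.6) stated above): the quadratic $f(z) = \tfrac{\lambda}{2}\,\|z - z^*\|_2^2$ with $\lambda > 0$ has gradient $\nabla f(z) = \lambda\,(z - z^*)$, so $\langle \nabla f(z),\, z - z^* \rangle = \lambda\,\|z - z^*\|_2^2$, i.e., $f$ is $(\lambda, 0)$-dissipative around $z^*$. The compressed sensing loss $f(z) = \|y - AG(z)\|_2^2$ is not convex, but Lemma 3.1 below shows that it nonetheless satisfies (3.6) with high probability once the number of measurements is large enough.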
3.2 Main results
We first show that a very broad class of generator networks satisfies the assumptions made above. The following proposition is an extension of a result in Latorre et al. (2019).
Proposition 3.1.
Suppose $G$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $K$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G$ satisfies Assumptions (A.2) & (A.3), and, under a mild condition on the weights, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.
This proposition merits a thorough discussion. First, architectures with increasing layer sizes are common; many generative models (such as GANs) assume architectures of this sort. Observe that the non-decreasing layer size condition is much milder than the expansivity ratios of successive layers assumed in related work Hand & Voroninski (2018); Asim et al. (2019).
Second, the compactness assumption of the domain of is mild, and traces its provenance to earlier related works Bora et al. (2017b); Latorre et al. (2019). Moreover, common empirical techniques for training generative models (such as GANs) indeed assume that the latent vectors lie on the surface of a sphere White (2016).
Third, common activation functions such as the sigmoid, or the Exponential Linear Unit (ELU) are continuously differentiable and monotonic. Note that the standard Rectified Linear Unit (ReLU) activation does not satisfy these conditions, and establishing similar results for ReLU networks is deferred to future work.
The key to our theoretical analysis, as discussed above, is Definition 3.1, and establishing it relies on Proposition 3.1. Interestingly, in Section 5 below we provide empirical evidence that strong smoothness also holds for generative adversarial networks with ReLU activations trained on the MNIST and CIFAR-10 image datasets.
We now obtain a measurement complexity result by deriving a bound on the number of measurements required for $f$ to be dissipative.
Lemma 3.1.
Let $G$ be a feed-forward neural network that satisfies the conditions in Proposition 3.1, and let $L$ be its Lipschitz constant. Suppose the number of measurements satisfies
$m = \Omega\!\big(k \log(L R / \delta)\big)$
for some small constant $\delta > 0$. If the elements of $A$ are drawn according to $\mathcal{N}(0, 1/m)$, then the loss function $f$ is dissipative with probability at least $1 - e^{-\Omega(m)}$.
The above result can be derived using covering-number arguments, similar to the treatment in Bora et al. (2017b). Observe that the number of measurements scales linearly with the dimension $k$ of the latent vector instead of the ambient dimension $n$, in keeping with the flavor of results in standard compressed sensing. Recent lower bounds reported in Liu & Scarlett (2020) have also shown that this scaling of $m$ with respect to $k$ and $L$ may be tight for compressed sensing recovery in several natural parameter regimes.
We need two more quantities to readily state our convergence guarantee. Both definitions are widely used in the convergence analysis of MCMC methods. The first quantity defines the goodness of an initial distribution with respect to the target distribution .
Definition 3.3 ($\lambda$-warm start, Zou et al. (2020)).
Let $\pi$ be a distribution on $K$. An initial distribution $\mu_0$ is a $\lambda$-warm start with respect to $\pi$ if
$\sup_{S \subseteq K} \; \frac{\mu_0(S)}{\pi(S)} \;\le\; \lambda.$
The next quantity is the Cheeger constant, which connects the geometry of the objective function and the hitting time of SGLD to a particular set in the domain Zhang et al. (2017).
Definition 3.4 (Cheeger constant).
Let $\mu$ be a probability measure on $K$. We say $\mu$ satisfies the isoperimetric inequality with Cheeger constant $\rho$ if for any $S \subseteq K$,
$\liminf_{h \to 0^+} \frac{\mu(S_h) - \mu(S)}{h} \;\ge\; \rho\,\min\{\mu(S),\, 1 - \mu(S)\},$
where $S_h := \{z \in K : \exists\, u \in S,\ \|z - u\|_2 \le h\}$.
Putting all the above ingredients together, our main theoretical result describing the convergence of Algorithm 1 (CS-SGLD) for compressed sensing recovery is given as follows.
Theorem 1 (Convergence of CS-SGLD).
Assume that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. Consider a signal $x^* = G(z^*)$, and assume that it is measured with $m$ (sub)Gaussian measurements, with $m$ large enough that the dissipativity guarantee of Lemma 3.1 holds. Choose an inverse temperature $\beta$ and a precision parameter $\epsilon > 0$. Then, after $T$ iterations of SGLD in Algorithm 1, we obtain a latent vector $z_T$ such that
$\mathbb{E}\big[f(z_T)\big] \;\le\; \epsilon \;+\; \widetilde{O}\!\left(\frac{k}{\beta}\right),$   (3.7)
provided the step size $\eta$ and the number of iterations $T$ are chosen appropriately; both are polynomial in the problem parameters, and the precise choices are given in Theorem 2.
In words, if we choose a high enough inverse temperature and appropriate step size, CS-SGLD converges (in expectation) to a signal estimate with very low loss within a polynomial number of iterations.
Let us parse the above result further. First, observe that the right-hand side of (3.7) consists of two terms. The first term can be made arbitrarily small, at the cost of greater computation, since the step size must shrink and the number of iterations must grow as $\epsilon$ decreases. The second term represents the irreducible expected error of the exact sampling algorithm on the Gibbs measure $\pi$, which is worse than the optimal loss obtained at $z^*$.
Second, suppose the right-hand side of (3.7) is upper bounded by $\epsilon'$. Once SGLD finds an $\epsilon'$-approximate minimizer of the loss, in the regime of sufficiently many compressed sensing measurements (as specified by Lemma 3.1), we can invoke Theorem 1.1 in Bora et al. (2017b) along with Jensen’s inequality to immediately obtain a recovery guarantee bounding $\mathbb{E}\,\|G(z_T) - x^*\|_2$ in terms of $\epsilon'$.
Third, the convergence rate of CS-SGLD can be slow. In particular, SGLD may require a polynomial number of iterations to recover the true signal, while linearized ADMM Latorre et al. (2019) converges within a logarithmic number of iterations up to a neighborhood of the true signal. Obtaining an improved characterization of CS-SGLD convergence (or perhaps devising a new linearly convergent algorithm) is an important direction for future work.
Fourth, the above result is for noiseless measurements. A rather similar result can be derived with noisy measurements of bounded noise (say, with $\ell_2$-norm at most $\sigma$). This quantity (times a constant depending on $A$) affects (3.7) only up to an additive term that scales with $\sigma$. This is precisely in line with most compressed sensing recovery results, and for simplicity we omit such a derivation.
4 Proof outline
In this section, we provide a brief proof sketch of Theorem 1, while relegating details to the appendix. At a high level, our analysis is an adaptation of the framework of Zhang et al. (2017); Zou et al. (2020) specialized to the problem of compressed sensing recovery using generative priors. The basic ingredient in the proof is the use of conductance analysis to show the convergence of CS-SGLD to the target distribution in total variation distance.
Let $\mu_t$ denote the probability measure of the iterate $z_t$ generated by Algorithm 1, and let $\pi$ denote the target distribution in (3.4). The proof of Theorem 1 consists of three main steps:
-
1.
First, we construct an auxiliary Metropolis–Hastings Markov process to show that $\mu_t$ converges to $\pi$ in total variation for a sufficiently large number of iterations and a “good” initial distribution $\mu_0$.
-
2.
Next, we construct an initial distribution $\mu_0$ that serves as a $\lambda$-warm start with respect to $\pi$.
-
3.
Finally, we show that a random draw from $\pi$ is a near-minimizer of $f$, proving that CS-SGLD recovers the signal to high fidelity.
We proceed with a characterization of the evolution of the distribution of the iterates in Algorithm 1, which largely follows (Zou et al., 2020).
4.1 Construction of Metropolis–Hastings SGLD
Let $z$ and $w$ respectively be the points before and after one iteration of Algorithm 1; the Markov chain is written as $z \to w$, where the proposal $w$ is drawn from $Q(\cdot \mid z)$ with the following density:
$Q(w \mid z) \;\propto\; \exp\!\Big(-\tfrac{\beta}{4\eta}\,\big\|w - z + \eta\,\nabla f(z)\big\|_2^2\Big).$   (4.1)
Without the correction step, $Q$ is exactly the transition probability of standard Langevin dynamics. Note also that one can construct a similar density with a stochastic (mini-batch) gradient. Accounting for the correction step of Algorithm 1, the process for one iteration is
$z_{t+1} = \begin{cases} w, & \text{if } w \text{ is accepted},\\ z_t, & \text{otherwise}. \end{cases}$   (4.2)
Let $p(z)$ be the probability of accepting the proposal $w$. The conditional density of $z_{t+1}$ given $z_t = z$ is then a mixture of $Q(\cdot \mid z)$ restricted to the accepted region and a point mass $(1 - p(z))\,\delta_z$,
where $\delta_z$ is the Dirac delta function at $z$. Similar to Zou et al. (2020); Zhang et al. (2017), we consider the $\tfrac12$-lazy version of the above Markov process, with the transition distribution
$\mathcal{T}(\cdot \mid z) \;=\; \tfrac12\,\delta_z(\cdot) \;+\; \tfrac12\,P(\cdot \mid z),$   (4.3)
where $P(\cdot \mid z)$ denotes the transition distribution of one (corrected) iteration,
and construct an auxiliary Markov process by adding an extra Metropolis accept/reject step. While proving the ergodicity of the Markov process with transition distribution $\mathcal{T}$ is difficult, the auxiliary chain does converge to a unique stationary distribution, thanks to the Metropolis–Hastings correction step.
The auxiliary Markov chain is given as follows: starting from $z$, let $w$ be the state generated from $\mathcal{T}(\cdot \mid z)$. The Metropolis–Hastings SGLD accepts $w$ with probability
$\alpha(z, w) \;=\; \min\!\left\{1,\; \frac{\pi(w)\,\mathcal{T}(z \mid w)}{\pi(z)\,\mathcal{T}(w \mid z)}\right\}.$
Let $\widetilde{\mathcal{T}}$ denote the transition distribution of the auxiliary Markov process, so that $\widetilde{\mathcal{T}}(w \mid z) = \alpha(z, w)\,\mathcal{T}(w \mid z)$ for $w \ne z$, with the remaining probability mass placed on staying at $z$.
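For concreteness, here is a hedged sketch of one step of this auxiliary chain (used only in the analysis). The function names, the exact ordering of the laziness, support, and Metropolis checks, and the constants are illustrative; the precise construction is fixed in Appendix C.

```python
# Sketch of one step of the auxiliary Metropolis-Hastings SGLD chain. q(a | b)
# is the Gaussian SGLD proposal density from (4.1); f and grad_f evaluate the
# loss and its gradient. Names and the exact ordering of checks are illustrative.
import numpy as np

def mh_sgld_step(z, f, grad_f, eta, beta, R, rng):
    if rng.random() < 0.5:                      # 1/2-lazy chain: stay put
        return z
    mean = z - eta * grad_f(z)
    w = mean + np.sqrt(2.0 * eta / beta) * rng.standard_normal(z.shape[0])
    if np.linalg.norm(w) > R:                   # reject proposals outside K
        return z
    def log_q(a, b):                            # log of the Gaussian proposal density
        mu = b - eta * grad_f(b)
        return -beta * np.sum((a - mu) ** 2) / (4.0 * eta)
    # Metropolis-Hastings correction targeting pi(z) proportional to exp(-beta f(z)).
    log_accept = -beta * (f(w) - f(z)) + log_q(z, w) - log_q(w, z)
    return w if np.log(rng.random()) < min(0.0, log_accept) else z
```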
Below, we establish the connection between $\mathcal{T}$ and $\widetilde{\mathcal{T}}$, as well as the convergence of the original chain in Algorithm 1, through a conductance analysis of $\widetilde{\mathcal{T}}$.
Lemma 4.1.
Under Assumptions (A.1)–(A.3), the loss $f$ is smooth and has bounded gradients on $K$. For a sufficiently small step size $\eta$, the transition distribution $\mathcal{T}$ of the chain in Algorithm 1 is $\delta$-close to the auxiliary chain $\widetilde{\mathcal{T}}$, i.e., for any $z \in K$ and any set $S \subseteq K$,
$\big|\mathcal{T}(S \mid z) - \widetilde{\mathcal{T}}(S \mid z)\big| \;\le\; \delta\,\widetilde{\mathcal{T}}(S \mid z),$
where $\delta$ depends on $\eta$ and the smoothness constants of $f$ (see Appendix C).
In Appendix B, we show that $f$ is indeed smooth on $K$ and that its gradient is bounded, with constants depending on $A$, the Lipschitz constant of $G$, and the bounds in Assumptions (A.1)–(A.3).
One can verify that $\widetilde{\mathcal{T}}$ is time-reversible (Zhang et al., 2017). Moreover, following Lovász et al. (1993); Lovász & Vempala (2007), the convergence of a time-reversible Markov chain to its stationary distribution depends on its conductance, defined as follows:
Definition 4.1 (Restricted conductance).
The conductance of a time-reversible Markov chain with transition distribution $\widetilde{\mathcal{T}}$ and stationary distribution $\pi$ is defined by
$\phi \;:=\; \inf_{S:\, 0 < \pi(S) \le 1/2} \; \frac{\int_S \widetilde{\mathcal{T}}(K \setminus S \mid z)\,\pi(z)\,dz}{\pi(S)}.$
Using the conductance parameter $\phi$ and the closeness between $\mathcal{T}$ and $\widetilde{\mathcal{T}}$, we can derive the convergence of $\mu_t$ to $\pi$ in total variation distance.
Lemma 4.2 (Zou et al. (2020)).
Assume the conditions of Lemma 4.1 hold. If $\mathcal{T}$ is $\delta$-close to $\widetilde{\mathcal{T}}$ with $\delta$ sufficiently small relative to the conductance $\phi$, and the initial distribution $\mu_0$ serves as a $\lambda$-warm start with respect to $\pi$, then $\mu_t$ converges to $\pi$ geometrically in total variation, at a rate governed by $\phi$ (the precise bound is given in Appendix C).
We will further give a lower bound on the conductance $\phi$ in order to establish an explicit convergence rate.
4.2 Convergence of $\mu_t$ to the target distribution
Armed with these tools, we formally establish the first step of the proof.
Theorem 2.
Suppose that the generative network $G$ satisfies Assumptions (A.1)–(A.3) as well as the strong smoothness condition. With the inverse temperature $\beta$ and the step size $\eta$ set appropriately, for any $\lambda$-warm start $\mu_0$ with respect to $\pi$, the output distribution $\mu_T$ of Algorithm 1 converges to $\pi$ in total variation at a geometric rate governed by the Cheeger constant $\rho$ of $\pi$ and the smoothness constants of $f$. In particular, if the step size $\eta$ is sufficiently small and the number of iterations $T$ is sufficiently large, both polynomial in the problem parameters, then $\|\mu_T - \pi\|_{TV} \le \epsilon$.
The convergence rate is polynomial in the Cheeger constant, whose lower bound is difficult to obtain in general. A rough bound can be derived using the Poincaré constant of the distribution $\pi$, under the smoothness assumption; see (Bakry et al., 2008) for details.
Proof outline of Theorem 2.
To prove the result, we find a sufficient condition on the step size $\eta$ such that the requirements of Lemmas 4.1, 4.2, and 4.3 all hold. Lemma 4.2 imposes an upper bound on the closeness parameter $\delta$, while Lemma 4.3 provides a lower bound on the conductance; we can therefore choose $\eta$ small enough for both conditions to hold. Putting everything together yields the stated total variation bound, which proves the first part.
For the second part, to achieve an $\epsilon$-sampling error, it suffices to choose $\eta$ and $T$ such that the bound from the first part is at most $\epsilon$. Plugging in the quantities above, we can choose $\eta$ sufficiently small and $T$ sufficiently large, both polynomial in the problem parameters, so that $\|\mu_T - \pi\|_{TV} \le \epsilon$, which completes the proof. ∎
4.3 Existence of a $\lambda$-warm start initial distribution
Apart from the step size and the number of iterations, the convergence depends on $\lambda$, the goodness of the initial distribution $\mu_0$. In this part, we specify a particular choice of $\mu_0$ to establish this.
Definition 4.2 (Set-Restricted Eigenvalue Condition, (Bora et al., 2017b)).
For parameters $\gamma > 0$ and $\delta \ge 0$, the matrix $A$ is called S-REC$(\gamma, \delta)$ if for all $z_1, z_2 \in K$,
$\|A\,(G(z_1) - G(z_2))\|_2 \;\ge\; \gamma\,\|G(z_1) - G(z_2)\|_2 \;-\; \delta.$
Lemma 4.4.
Suppose that $G$ satisfies the near-isometry property in Assumption (A.2), and $f$ is smooth. If $A$ satisfies the S-REC, then the truncated Gaussian distribution supported on $K$ is a $\lambda$-warm start with respect to $\pi$, for a finite $\lambda$ depending on $\beta$, $R$, and the problem parameters.
Proof.
Let $\mu_0$ denote the truncated Gaussian distribution on $K$, whose measure is proportional to the standard Gaussian density restricted to $K$,
where the proportionality constant is the normalization constant.
Along with the target measure $\pi$, we can easily verify that
Our goal is to bound the right-hand side. Using the smoothness of $f$ and a simple norm bound, we have
which implies a bound on the first term.
To bound the remaining term, we use the S-REC property of $A$ as well as the near-isometry of $G$. Recall the objective function
$f(z) = \|A\,(G(z) - G(z^*))\|_2^2,$
where we have used $y = A\,G(z^*)$ (noiseless measurements). Therefore,
Putting the above results together, we can get
and conclude the proof. ∎
4.4 Completing the proof
Proof of Theorem 1.
Consider a random draw from $\mu_T$ and another from $\pi$. We have the decomposition
We first give a crude bound for the second term, following the idea from (Raginsky et al., 2017):
The detailed proof is given in Appendix D. The first term is related to the convergence of $\mu_T$ to $\pi$ in total variation, shown in Theorem 2. Notice that $f$ is bounded on $K$ due to the Lipschitz property of the generative network $G$. Moreover, by Theorem 2, we have $\|\mu_T - \pi\|_{TV} \le \epsilon$ for any $\epsilon > 0$ and a sufficiently large $T$. Hence, the first term is upper bounded by
Given the target error $\epsilon$, we choose $\beta$ and $T$ accordingly; by Lemma 4.4, the truncated Gaussian initialization provides the required warm start. Combining the bounds above,
Therefore, we complete the proof of our main result. ∎
5 Experimental results
While we emphasize that the primary focus of our paper is theoretical, we corroborate our theory with representative experimental results on MNIST and CIFAR-10. Even though our analysis requires the latent vectors to lie in a bounded $k$-dimensional Euclidean ball, the empirical results demonstrate that our approach works without this restriction.
[Figure 1: scatter plots used to validate the strong smoothness and dissipativity constants. (a) DCGAN trained on MNIST; (b) MNIST with compressive measurements; (c) DCGAN trained on CIFAR-10; (d) CIFAR-10 with compressive measurements.]
5.1 Validation of strong smoothness
As mentioned above, our theory relies on the strong smoothness condition (3.5) holding for some constants $\alpha$ and $\gamma$ over a domain of latent vectors; here, we do not explicitly restrict the latent vectors to a ball.
To estimate these constants, we generate latent samples $z$ and $z'$ from the standard Gaussian latent distribution. We perform experiments on two different datasets: (i) MNIST (Net1) and (ii) CIFAR-10 (Net2). For both networks, we compute the two sides of (3.5) for 500 different instantiations of $z$ and $z'$. We then plot these pairs of samples and compute the values of $\alpha$ and $\gamma$ via a simple linear program. We carry out this experiment for two DCGAN generators, one trained on MNIST (Figure 1 (a)) and the other trained on CIFAR-10 (Figure 1 (c)).
Similarly, for the compressed sensing case, we estimate the analogous constants $\alpha_A$ and $\gamma_A$ when a compressive matrix $A$ acts on the output of the generator $G$; here, the number of measurements $m$ is a fixed fraction of the signal dimension $n$. This is encapsulated in the following inequality:
$\big\langle J_G(z)^\top A^\top A\,\big(G(z) - G(z')\big),\; z - z' \big\rangle \;\ge\; \alpha_A\,\|z - z'\|_2^2 \;-\; \gamma_A,$   (5.1)
which we evaluate for Gaussian matrices $A$ and different instantiations of $z$ and $z'$. We plot the resulting pairs of values; the scatter plot is generated from 500 different instantiations of $z$ and $z'$ and several instantiations of $A$. We do this experiment for two DCGAN generators, one trained on MNIST (Figure 1 (b)) and the other trained on CIFAR-10 (Figure 1 (d)). These experiments indicate that the dissipativity constant is positive in all cases.
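A hedged sketch of this estimation procedure is shown below on synthetic stand-in data. The paper fits the constants with a linear program; this sketch instead uses a simpler least-squares slope plus a worst-case offset, and the arrays `lhs`/`rhs` stand in for the two sides of the inequality computed from a trained generator and a Gaussian $A$.

```python
# Hedged sketch of the constant-estimation experiment: for sampled latent pairs,
# collect the two sides of the dissipativity-style inequality and fit constants
# (a, b) such that lhs_i >= a * rhs_i - b holds for every sample. The toy data
# below stands in for values computed from a trained generator and a Gaussian A.
import numpy as np

def fit_constants(lhs, rhs):
    a = float(lhs @ rhs / (rhs @ rhs))            # least-squares slope
    b = float(max(0.0, np.max(a * rhs - lhs)))    # smallest feasible offset
    return a, b

rng = np.random.default_rng(0)
rhs = rng.uniform(0.5, 4.0, size=500)                      # e.g., ||z - z'||^2 values
lhs = 0.8 * rhs - 0.1 + 0.05 * rng.standard_normal(500)    # toy inner-product values
a, b = fit_constants(lhs, rhs)
print(f"estimated slope a = {a:.3f}, offset b = {b:.3f}")
```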
5.2 Comparison of SGLD against GD
[Figure 2: Reconstructions on MNIST. (a) Ground truth; (b) initialization; (c) GD, MSE = 0.0447; (d) SGLD, MSE = 0.0275.]
We test SGLD reconstruction using the update rule in (3.2) and compare it against optimizing the latent vector with standard gradient descent, as in Bora et al. (2017b). For all experiments, we use a pre-trained DCGAN generator with the following configuration: the generator consists of four layers, each comprising a transposed convolution, batch normalization, and a ReLU activation, followed by a final layer with a transposed convolution and an output activation Radford et al. (2015).
We display the reconstructions on MNIST in Figure 2. Note that the implementation in Bora et al. (2017b) requires 10 random restarts for CS reconstruction, and the results corresponding to the best reconstruction are reported. This suggests that the standard implementation is prone to getting stuck in bad local minima or saddle points. For a fair comparison, we fix the same random initialization of the latent vector for both GD and SGLD, with no restarts, and use the same number of measurements for both methods. In Figure 2 we show reconstructions for 16 different examples, all reconstructed at once using the same number of steps, the same learning rate, and the same inverse temperature $\beta$ for both approaches. The only difference is the additional noise term in SGLD (Figure 2, part (d)); a sketch of the comparison appears below. Notice that this additional noise component helps achieve better reconstruction performance overall compared to plain gradient descent.
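The comparison can be summarized by the following hedged sketch: both methods share the initialization and step schedule, and SGLD differs only by the injected noise. The hyperparameter values and the generator interface are illustrative; `G`, `A`, `y`, and the latent dimension `k` are assumed to be defined elsewhere.

```python
# Hedged sketch of the GD vs. SGLD comparison: identical initialization and
# schedule; setting langevin=True adds only the Langevin noise term of (3.2).
import torch

def recover(y, A, G, k, eta=1e-2, beta=1e4, n_iters=2000, langevin=True, seed=0):
    torch.manual_seed(seed)                       # same initialization for both runs
    z = torch.randn(k, requires_grad=True)
    for _ in range(n_iters):
        loss = ((y - A @ G(z).reshape(-1)) ** 2).sum()
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= eta * grad
            if langevin:                          # the only difference vs. GD
                z += (2.0 * eta / beta) ** 0.5 * torch.randn(k)
    return G(z).detach()

# x_gd   = recover(y, A, G, k, langevin=False)
# x_sgld = recover(y, A, G, k, langevin=True)
```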
5.3 Reconstructions for CIFAR10
We display the reconstructions on CIFAR-10 in Figure 3. As with the MNIST experiments, for the sake of a fair comparison, we fix the same random initialization of the latent vector for both GD and SGLD, with no restarts. In Figure 3 we show reconstructions for 16 different examples from CIFAR-10, all reconstructed at once using the same number of steps, the same learning rate, and the same inverse temperature for both approaches. The only difference is the additional noise term in SGLD (Figure 3, part (d)). Similar to our experiments on MNIST, we notice that this additional noise component helps achieve better reconstruction performance overall compared to plain gradient descent.
[Figure 3: Reconstructions on CIFAR-10. (a) Ground truth; (b) initialization; (c) GD, MSE = 0.0248; (d) SGLD, MSE = 0.0246.]
Next, we plot phase transition diagrams by scanning the compression ratio $m/n$ for the MNIST dataset in Figure 4. For this experiment, we choose 5 different instantiations of the sampling matrix $A$ for each compression ratio, and in Figure 4 we report the average mean squared error (MSE) of reconstruction over these 5 instances of $A$.
We conclude that SGLD gives improved reconstruction quality as compared to GD.
References
- Asim et al. (2019) Asim, M., Ahmed, A., and Hand, P. Invertible generative models for inverse problems: mitigating representation error and dataset bias. arXiv preprint arXiv:1905.11672, 2019.
- Bakry et al. (2008) Bakry, D., Barthe, F., Cattiaux, P., Guillin, A., et al. A simple proof of the poincaré inequality for a large class of probability measures. Electronic Communications in Probability, 13:60–66, 2008.
- Baraniuk et al. (2010) Baraniuk, R., Cevher, V., Duarte, M., and Hegde, C. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982–2001, 2010.
- Bora et al. (2017a) Bora, A., Jalal, A., Price, E., and Dimakis, A. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 537–546. JMLR. org, 2017a.
- Bora et al. (2017b) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 537–546. JMLR. org, 2017b.
- Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203–4215, 2005.
- Chang et al. (2017) Chang, J., Li, C., Póczos, B., and Kumar, B. One network to solve them all—solving linear inverse problems using deep projection models. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5889–5898. IEEE, 2017.
- Chen et al. (2001) Chen, S., Donoho, D., and Saunders, M. Atomic decomposition by basis pursuit. SIAM review, 43(1):129–159, 2001.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, 2017.
- Dong et al. (2016) Dong, C., Loy, C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
- Donoho (2006) Donoho, D. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
- Hale (1990) Hale, J. Asymptotic behavior of dissipative systems. Bull. Am. Math. Soc, 22:175–183, 1990.
- Hand & Voroninski (2018) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. In Conference On Learning Theory, pp. 970–978, 2018.
- Hand & Voroninski (2019) Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory, 66(1):401–418, 2019.
- Jagatap & Hegde (2019) Jagatap, G. and Hegde, C. Algorithmic guarantees for inverse imaging with untrained network priors. In Advances in Neural Information Processing Systems, 2019.
- Jalal et al. (2020) Jalal, A., Karmalkar, S., Dimakis, A. G., and Price, E. Compressed sensing with approximate priors via conditional resampling. Preprint, 2020.
- Ji & Carin (2007) Ji, S. and Carin, L. Bayesian compressive sensing and projection optimization. In Proceedings of the 24th international conference on Machine learning, pp. 377–384, 2007.
- Latorre et al. (2019) Latorre, F., Eftekhari, A., and Cevher, V. Fast and provable admm for learning with generative priors. In Advances in Neural Information Processing Systems, pp. 12004–12016, 2019.
- Lee & Vempala (2018) Lee, Y. T. and Vempala, S. S. Convergence rate of riemannian hamiltonian monte carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1115–1121, 2018.
- Lei et al. (2019) Lei, Q., Jalal, A., Dhillon, I. S., and Dimakis, A. G. Inverting deep generative models, one layer at a time. In Advances in Neural Information Processing Systems, pp. 13910–13919, 2019.
- Lindgren et al. (2020) Lindgren, E. M., Whang, J., and Dimakis, A. G. Conditional sampling from invertible generative models with applications to inverse problems. arXiv preprint arXiv:2002.11743, 2020.
- Liu & Scarlett (2020) Liu, Z. and Scarlett, J. Information-theoretic lower bounds for compressive sensing with generative models. IEEE Journal on Selected Areas in Information Theory, 2020.
- Lovász & Vempala (2007) Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
- Lovász et al. (1993) Lovász, L. et al. Random walks on graphs: A survey. Combinatorics, Paul erdos is eighty, 2(1):1–46, 1993.
- Mousavi & Baraniuk (2017) Mousavi, A. and Baraniuk, R. Learning to invert: Signal recovery via deep convolutional networks. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 2272–2276. IEEE, 2017.
- Needell & Tropp (2009) Needell, D. and Tropp, J. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and computational harmonic analysis, 26(3):301–321, 2009.
- Ongie et al. (2020) Ongie, G., Jalal, A., Metzler, C., Baraniuk, R., Dimakis, A., and Willett, R. Deep learning techniques for inverse problems in imaging. arXiv preprint arXiv:2005.06001, 2020.
- Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. arXiv preprint arXiv:1702.03849, 2017.
- Raj et al. (2019) Raj, A., Li, Y., and Bresler, Y. Gan-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5602–5611, 2019.
- Shah & Hegde (2018) Shah, V. and Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4609–4613. IEEE, 2018.
- Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.
- Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
- Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688, 2011.
- Whang et al. (2020) Whang, J., Lei, Q., and Dimakis, A. Compressed sensing with invertible generative models and dependent noise. arXiv preprint arXiv:2003.08089, 2020.
- White (2016) White, T. Sampling generative networks. arXiv preprint arXiv:1609.04468, 2016.
- Wu et al. (2019) Wu, Y., Rosca, M., and Lillicrap, T. Deep compressed sensing. arXiv preprint arXiv:1905.06723, 2019.
- Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory, pp. 1980–2022. PMLR, 2017.
- Zou et al. (2020) Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient langevin dynamics for non-log-concave sampling. arXiv preprint arXiv:2010.09597, 2020.
Appendix A Conditions on the generator network
Proposition A.1.
Suppose $G$ is a feed-forward neural network with layers of non-decreasing sizes and compact input domain $K$. Assume that the non-linear activation is a continuously differentiable, strictly increasing function. Then, $G$ satisfies Assumptions (A.2) & (A.3), and, under a mild condition on the weights, the strong smoothness in Definition 3.1 also holds almost surely with respect to the Lebesgue measure.
Proof.
The proof proceeds similarly to Latorre et al. (2019), Appendix B. Since $G$ is a composition of linear maps followed by activation functions, $G$ is continuously differentiable. As a result, the Jacobian $J_G$ is a continuous matrix-valued function and its restriction to the compact domain $K$ is Lipschitz-continuous. Therefore, there exists $M > 0$ such that
(A.1)
Thus, Assumption (A.3) holds. Assumption (A.2) is also satisfied according to Latorre et al. (2019), Lemma 5. To show the strong smoothness, we use the fundamental theorem of calculus together with the Lipschitzness of $G$ obtained from Assumption (A.2). For every $z, z' \in K$:
where in the last step we use the near-isometry and the Lipschitzness of $G$ established above. Consequently, $G$ is $(\alpha, \gamma)$-strongly smooth under the stated conditions. ∎
Lemma A.1 (Measurement complexity).
Let $G$ be a feed-forward neural network that satisfies the conditions in Proposition 3.1, and let $L$ be its Lipschitz constant. Suppose the number of measurements satisfies
$m = \Omega\!\big(k \log(L R / \delta)\big)$
for some small constant $\delta > 0$. If the elements of $A$ are drawn according to $\mathcal{N}(0, 1/m)$, then the loss function $f$ is dissipative with probability at least $1 - e^{-\Omega(m)}$.
Proof.
Using Proposition A.1, it follows that there exist $\alpha > 0$ and $\gamma \ge 0$ such that $G$ is $(\alpha, \gamma)$-strongly smooth. Now, note that the left-hand side of (3.6) simplifies as
(A.2)
Denote the two quantities appearing above by $u$ and $v$; then
Using standard results in random matrix theory, we can bound the relevant spectral quantities of $A$ with high probability. Therefore,
for the stated number of measurements $m$,
the dissipativity inequality holds with probability at least $1 - e^{-\Omega(m)}$. Therefore, the loss function $f$ is dissipative with probability at least $1 - e^{-\Omega(m)}$. ∎
Appendix B Properties of the loss function $f$
In this part, we establish some key properties of the loss function $f$. We use Assumptions (A.1)–(A.3) on boundedness, Lipschitz gradients, and near-isometry to obtain an upper bound on the gradient of $f$ and to establish the smoothness of $f$.
Lemma B.1 (Lipschitzness of $f$).
There is a constant $G_f$ such that $\|\nabla f(z)\|_2 \le G_f$ for any $z \in K$.
Proof.
Recall the gradient of $f$:
$\nabla f(z) = -2\,J_G(z)^\top A^\top \big(y - A\,G(z)\big).$
It follows from the Lipschitz assumption (A.2) that $\|J_G(z)\| \le L$, and hence the gradient norm is bounded. Therefore,
∎
Lemma B.2 (Smoothness of $f$).
For any $z, z' \in K$, we have $\|\nabla f(z) - \nabla f(z')\|_2 \le L_f\,\|z - z'\|_2$ for a constant $L_f$ depending on the problem parameters.
Proof.
We use the assumptions on $G$ to bound the difference of gradients.
Then, using the boundedness, Lipschitzness, and smoothness of $G$, we arrive at the stated bound.
Therefore, $f$ is $L_f$-smooth. ∎
Appendix C Conductance Analysis
In this section, we provide the proofs of Lemmas 4.1 and 4.3, based on the conductance analysis laid out in (Zhang et al., 2017) and, similarly, in (Zou et al., 2020). The proof of Lemma 4.2 follows directly from Lemma 6.3 of (Zou et al., 2020).
Proof of Lemma 4.1.
We use the same idea as Lemma 3 in (Zhang et al., 2017) and, similarly, Lemma 6.1 in (Zou et al., 2020). The main difference in our proof is that we use the full gradient in Algorithm 1, instead of a stochastic mini-batch gradient, which slightly simplifies the argument.
We consider two cases. As long as we can prove the first case, the second case easily follows by splitting the relevant set into two parts and using the result of the first case. For a detailed treatment of the latter case, we refer the reader to the proof of Lemma 6.1 in (Zou et al., 2020).
In the first case, we have
(C.1)
where $\alpha(z, w)$ is the acceptance ratio of the Metropolis–Hastings step. It suffices to show a suitable bound on this ratio for all candidates, which implies
The right-hand side is obvious by definition, while the left-hand side can be ensured with a sufficiently small step size $\eta$. What remains is to show that
(C.2)
The left-hand side is simplified, by the definition of $Q$, as
Simplifying the first exponent and combining it with the second gives the following form:
(C.3)
To lower bound the left-hand side, we appeal to the smoothness of $f$. Specifically, by Lemmas B.1 and B.2, $f$ is $L_f$-smooth and has gradients bounded by $G_f$. Then,
This directly implies that
(C.4)
Moreover,
(C.5)
Combining (C.4) and (C.5) in (C.3), together with the bound on the gradient,
we can lower bound the left-hand side of (C.3) as required.
Picking a sufficiently small step size $\eta$, and using a standard exponential inequality, we have proved the result. ∎
Next, we lower bound the conductance of $\widetilde{\mathcal{T}}$ using the idea in (Lee & Vempala, 2018; Zou et al., 2020), by first restating the following lemma:
Lemma C.1 (Lemma 13 in Lee & Vempala (2018)).
Let $\widetilde{\mathcal{T}}$ be a time-reversible Markov chain on $K$ with stationary distribution $\pi$. Suppose that for a fixed $\Delta > 0$ and any $z, z' \in K$ with $\|z - z'\|_2 \le \Delta$, we have $\|\widetilde{\mathcal{T}}(\cdot \mid z) - \widetilde{\mathcal{T}}(\cdot \mid z')\|_{TV} \le 1 - \epsilon$. Then the conductance of $\widetilde{\mathcal{T}}$ satisfies $\phi \ge c\,\epsilon\,\Delta\,\rho$ for some absolute constant $c$, where $\rho$ is the Cheeger constant of $\pi$.
Proof of Lemma 4.3.
To apply Lemma C.1, we follow the same idea as Zou et al. (2020) and reuse some of their results without proof. To this end, we prove that for some $\Delta > 0$ and any pair $z, z'$ with $\|z - z'\|_2 \le \Delta$, the corresponding transition distributions are at total variation distance at most $1 - \epsilon$. Recall that the distribution of the iterate after one step of standard SGLD without the accept/reject step, given in (4.1), is
Since Algorithm 1 accepts the candidate only if it falls in the admissible region, the acceptance probability is
Therefore, the transition probability is given by
By the definition of the total variation distance, there exists a set attaining the supremum, such that
Since we use the full gradient (i.e., a mini-batch size equal to the number of samples), we can reuse the bounds in Lemmas C.4 and C.5 of (Zou et al., 2020). Consequently,
By Lemma 4.1, the acceptance ratio is close to one if the step size is sufficiently small. Thus, if
the conditions of Lemma C.1 are met. As a result of Lemma C.1, we obtain a lower bound on the conductance of $\widetilde{\mathcal{T}}$,
and finish the proof.
∎
Appendix D Property of the Gibbs algorithm
Proposition D.1.
For the Gibbs distribution $\pi \propto \exp(-\beta f)$ restricted to $K$, we have
Proof.
Let $p_\beta(z)$ denote the density of $\pi$, and let $Z_\beta$ be the partition function. We start by writing
(D.1)
where
is the differential entropy of $p_\beta$. To upper-bound it, we use the fact that the differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment. Moreover, since $\pi$ has support in the Euclidean ball of radius $R$, its second moment is bounded by $R^2$. Therefore, we have
(D.2)
Next, we give a lower bound on the second term. We use the smoothness of $f$ and the fact that $z^*$ is a minimizer of $f$, with $f(z^*) = 0$ in the noiseless setting. As such,
(D.3)
Using (D.2) and (D.3) in (D.1) and simplifying, we prove the result. ∎