Discovering Structure From Corruption for Unsupervised Image Reconstruction

Oscar Leong, Angela F. Gao, He Sun, and Katherine L. Bouman. The authors are with the Computing and Mathematical Sciences Department at the California Institute of Technology (Caltech), Pasadena, California. Katherine L. Bouman is also with the Electrical Engineering, Astronomy, and Mechanical Engineering Departments at Caltech. He Sun is also with the College of Future Technology and National Biomedical Imaging Center at Peking University, Beijing, China. Most of this work was done while He Sun was at Caltech. This research was carried out at the Jet Propulsion Laboratory and Caltech under a contract with the National Aeronautics and Space Administration and funded through the PDRDF. In addition, this work was sponsored by NSF Awards 2048237, 1935980, and an Amazon AI4Science Partnership Discovery Grant. We would like to thank Ben Prather, Abhishek Joshi, Vedant Dhruv, Chi-kwan Chan, and Charles Gammie for providing black hole simulations used in this work. We would also like to thank Aviad Levis, Yu Sun, and Jorio Cocola for their feedback and guidance.
Abstract

We consider solving ill-posed imaging inverse problems without access to an image prior or ground-truth examples. An overarching challenge in these inverse problems is that an infinite number of images, including many that are implausible, are consistent with the observed measurements. Thus, image priors are required to reduce the space of possible solutions to more desirable reconstructions. However, in many applications it is difficult or potentially impossible to obtain example images to construct an image prior. Hence inaccurate priors are often used, which inevitably result in biased solutions. Rather than solving an inverse problem using priors that encode the spatial structure of any one image, we propose to solve a set of inverse problems jointly by incorporating prior constraints on the collective structure of the underlying images. The key assumption of our work is that the underlying images we aim to reconstruct share common, low-dimensional structure. We show that such a set of inverse problems can be solved simultaneously without the use of a spatial image prior by instead inferring a shared image generator with a low-dimensional latent space. The parameters of the generator and latent embeddings are found by maximizing a proxy for the Evidence Lower Bound (ELBO). Once identified, the generator and latent embeddings can be combined to provide reconstructed images for each inverse problem. The framework we propose can handle general forward model corruptions, and we show that measurements derived from only a small number of ground-truth images (\leqslant 150) are sufficient for image reconstruction. We demonstrate our approach on a variety of convex and non-convex inverse problems, including denoising, phase retrieval, and black hole video reconstruction.

Index Terms:
Inverse problems, computational imaging, prior models, generative networks, Bayesian inference
**These authors contributed equally. Corresponding emails: oleong@caltech.edu, afgao@caltech.edu.
Figure 1: Method overview. In this paper we tackle ill-posed image reconstruction problems when a traditional image prior is not available or cannot be derived from example images. The key idea of our work is that image reconstruction is possible when one has access to only corrupted measurement examples and the underlying images share common, low-dimensional structure. We explore such ideas in this work in the following two ways: Left: Given a single measurement example (with the corresponding forward model) and a set of candidate image generation models (IGMs), we use our proposed criterion to select the best IGM from the measurements alone and then recover an image posterior under that IGM. The criterion we use for model selection is an approximation of the evidence lower bound (i.e., ELBOProxy in Eq. (3)) (see Sec. III-A). Right: Building off of this, we consider the setting where we must infer the IGM from noisy measurements alone (see Sec. III-B). In this case, we solve N inverse problems simultaneously. The inputs of this method are N measurement examples with their known forward models, and the outputs are a single inferred IGM and N latent embeddings that, when combined, lead to N image reconstruction posteriors.

I Introduction

In imaging inverse problems, the goal is to recover an underlying image from corrupted measurements, where the measurements and image are related via an understood forward model: y=f(x)+\eta. Here, y are measurements, x is the underlying image, f is a known forward model, and \eta is noise. Such problems are ubiquitous and include denoising [6, 20], super-resolution [7], compressed sensing [8, 19], phase retrieval [21], and deconvolution [35]. Due to corruption by the forward model and noise, these problems are often ill-posed: there are many images that are consistent with the observed measurements, including ones that are implausible.
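
As a concrete illustration (not taken from the paper), the sketch below instantiates a few of the forward models f listed above in NumPy; the image, matrix scaling, and noise level are arbitrary placeholders.

```python
# Minimal sketch of measurement models of the form y = f(x) + eta; all values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.random(n)                           # stand-in for a vectorized image
sigma = 0.1                                 # assumed noise level

def denoising(x):                           # identity forward model
    return x

A = rng.standard_normal((n // 10, n)) / np.sqrt(n)
def compressed_sensing(x):                  # underdetermined linear measurements
    return A @ x

def phase_retrieval(x):                     # magnitudes of Fourier coefficients
    return np.abs(np.fft.fft(x))

for f in (denoising, compressed_sensing, phase_retrieval):
    clean = f(x)
    y = clean + sigma * rng.standard_normal(clean.shape)   # corrupted measurements
```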

To combat the ill-posedness in imaging problems, solving for an image traditionally requires imposing additional structural assumptions to reduce the space of possible solutions. We encode these assumptions in an image generation model (IGM), whose goal is to capture the desired properties of an image’s spatial structure. IGMs are general; they encompass probabilistic spatial-domain priors (e.g., that encourage smoothness or sparsity), but also include deep image generators that are not necessarily probabilistic but are trained to primarily sample a certain class of images.

In order to define an IGM, it is necessary to have knowledge of the underlying image’s structure. If images similar to the underlying image are available, then an IGM can be learned directly [44, 56, 4]. However, an abundance of clean images is not available for many scientific imaging modalities (e.g., geophysical imaging and astronomical imaging). Collecting images in these domains can be extremely invasive, time-consuming, expensive, or even impossible. For instance, how should we define an IGM for black hole imaging without having ever seen a direct image of a black hole or knowing what one should look like? Moreover, classical approaches that utilize hand-crafted IGMs, such as total variation [26] or sparsity in a wavelet basis [37], are prone to human bias [34].

In this work, we show how one can solve a set of ill-posed image reconstruction tasks in an unsupervised fashion, i.e., without prior information about an image’s spatial structure or access to clean, example images. The key insight of our work is that knowledge of common structure across multiple diverse images can be sufficient regularization alone. In particular, suppose we have access to a collection of noisy measurements \{y^{(i)}\}_{i=1}^{N} that are observed through (potentially different) forward models y^{(i)}:=f^{(i)}(x^{(i)})+\eta^{(i)}. The core assumption we make is that the different underlying images \{x^{(i)}\}_{i=1}^{N} are drawn from the same distribution (unknown a priori) and share common, low-dimensional structure. Thus, our “prior” is not at the spatial-level, but rather exploits the collective structure of the underlying images. This assumption is satisfied in a number of applications where there is no access to an abundance of clean images. For instance, although we might not know what a black hole looks like, we might expect it to be similar in appearance over time. We show that under this assumption, the image reconstruction posteriors p(x|y^{(i)}) can be learned jointly from a small number of examples \{y^{(i)}\}_{i=1}^{N} due to the common, low-dimensional structure of the collection \{x^{(i)}\}_{i=1}^{N}. Specifically, our main result is that one can capitalize on this common structure by jointly solving for 1) a shared image generator G_{\theta} and 2) N low-dimensional latent distributions q_{\phi^{(i)}}, such that the distribution induced by the push-forward of q_{\phi^{(i)}} through G_{\theta} approximately captures the image reconstruction posterior p(x|y^{(i)}) for each measurement example i\in[N].

I-A Our Contributions

We outline the main contributions of our work, which extends our prior work presented in [22]:

  1.

    We solve a collection of ill-posed inverse problems without prior knowledge of an image’s spatial structure by exploiting the common, low-dimensional structure shared across images. This common structure is exploited when inferring a shared IGM with a low-dimensional latent space.

  2.

    To infer this IGM, we define a loss inspired by the evidence lower bound (ELBO). We motivate this loss by showing how it aids in unsupervised image reconstruction by helping select one IGM from a collection of candidate IGMs using a single measurement example.

  3.

    We apply our approach to convex and non-convex inverse problems, such as denoising, black hole compressed sensing, and phase retrieval. We establish that we can solve inverse problems without spatial-level priors and demonstrate good performance with only a small number of independent measurement examples (e.g., \leqslant 150).

  4.

    We theoretically analyze the inferred IGM in linear inverse problems under a linear image model to show that in this setting the inferred IGM performs dimensionality reduction akin to PCA on the collection of measurements.

II Background and Related Work

We now discuss related literature in model selection and learning-based IGMs. In order to highlight our key contributions, we emphasize the following assumptions in our framework:

  1.

    We do not have access to a set of images from the same distribution as the underlying images.

  2.

    We only have access to a collection of measurement examples, where each example comes from a different underlying image. The number of examples N is small, e.g., N\leqslant 150.

  3.

    For each underlying image x^{(i)} we wish to reconstruct, we only have access to a single measurement example y^{(i)}=f^{(i)}(x^{(i)})+\eta^{(i)}. That is, we do not have multiple observations of the same underlying image. Note each f^{(i)} can be potentially different.

II-A Model selection

Model selection techniques seek to choose a model that best explains data by balancing performance and model complexity. In supervised learning problems with sufficiently large amounts of data, this can be achieved simply by evaluating the performance of different candidate models using reserved test data [50]. However, in image reconstruction or other inverse problems with limited data, one cannot afford to hold out data. In these cases, model selection is commonly conducted using probabilistic metrics. The simplest probabilistic metric used for linear model selection is adjusted R^2 [40]. It re-weights the goodness-of-fit by the number of linear model parameters, helping reject high-dimensional models whose extra parameters do not improve the data-fitting accuracy. Similar metrics in nonlinear model selection are the Bayesian Information Criterion (BIC) [46] and the Akaike Information Criterion (AIC) [1]. AIC and BIC compute different weighted summations of a model’s log-likelihood and complexity, offering different trade-offs between bias and variance to identify the best model for a given dataset.
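
For reference, the hedged sketch below evaluates the standard AIC and BIC formulas for two hypothetical fitted models; the log-likelihood values are made up, and the snippet is illustrative background only, not part of the method in this paper.

```python
# Standard information criteria for model selection (textbook formulas; lower is better).
import numpy as np

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k log n - 2 log L."""
    return k * np.log(n) - 2 * log_likelihood

# A 3-parameter model vs. a 10-parameter model fit to n = 100 data points:
# the extra parameters must "pay" for the improved fit.
print(aic(-120.0, 3), aic(-115.0, 10))            # ~246.0 vs ~250.0
print(bic(-120.0, 3, 100), bic(-115.0, 10, 100))  # ~253.8 vs ~276.1
```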

In our work, we consider the use of the ELBO as a model selection criterion. In [10, 11], the use of the ELBO as a model selection criterion is theoretically analyzed and rates of convergence for variational posterior estimation are shown. Additionally, [52] proposes a generalized class of evidence lower bounds leveraging an extension of the evidence score. In [51], the ELBO is used for model selection to select a few, discrete parameters modeling a physical system (e.g., parameters that govern the orbit of an exoplanet). A significant difference in our context, however, is that we use the ELBO as a model selection criterion in a high-dimensional imaging context, and we optimize the ELBO over a continuous space of possible parameters.

II-B Learning IGMs

With access to a large corpus of example images, it is possible to directly learn an IGM to help solve inverse problems. Seminal work along these lines utilizing generative networks showcased that a pre-trained Generative Adversarial Network (GAN) can be used as an IGM in the problem of compressed sensing [4]. To solve the inverse problem, the GAN was used to constrain the search space for inversion. This approach was shown to outperform sparsity-based techniques with 5-10x fewer measurements. Since then, this idea has been expanded to other inverse problems, including denoising [25], super-resolution [39], magnetic resonance imaging (MRI) [2, 48], and phase retrieval [23, 47]. However, the biggest downside to this approach is the requirement of a large dataset of example images similar to the underlying image, which is often difficult or impossible to obtain in practice. Hence, we consider approaches that are able to directly solve inverse problems without example images.

Methods that aim to learn an IGM from only noisy measurements have been proposed. The main four distinctions between our work and these methods are that these works either: 1) require multiple independent observations of the same underlying image, 2) can only be applied to certain inverse problems, 3) require significantly more observations (either through more observations of each underlying image or by observing more underlying images), or 4) require significant hyperparameter tuning based on knowledge of example images.

Noise2Noise (N2N) [33] learns to denoise images by training on a collection of noisy, independent observations of the same image. To do so, N2N learns a neural network \Phi_{\theta} whose goal is to map between noisy images y and denoised images x. Since it has no denoised image examples to supervise training, it instead employs a loss that maps between noisy examples of the same underlying image. This objective is as follows:

\operatorname*{arg\,min}_{\theta}\sum_{i=1}^{N}\mathbb{E}_{y^{(i)}_{1}\sim\mathcal{Y}_{1}}\mathbb{E}_{y^{(i)}_{2}\sim\mathcal{Y}_{2}}[L(\Phi_{\theta}(y^{(i)}_{1}),y^{(i)}_{2})], (1)

where y^{(i)}_{j} corresponds to a noisy observation of the i-th underlying image x^{(i)}, and \mathcal{Y}_{j} is a distribution of noisy images where \mathbb{E}_{y\sim\mathcal{Y}_{j}}[y]=x. This N2N objective requires at least two observations of the same image and is limited by the assumption that the expected value of multiple observations of a single image is the underlying image. Thus, N2N is only applicable to denoising problems where the forward model is the identity matrix with independent noise on each pixel. Additionally, in practice N2N requires thousands of underlying images (i.e., N=O(1000)) to perform well. Thus, N2N’s main distinctions with our work are distinctions 1), 2), and 3).

Regularization by Artifact Removal (RARE) [36] generalizes N2N to perform image reconstruction from measurements under linear forward models. That is, the objective in Eq. (1) is modified to include a pseudo-inverse. Nonetheless, multiple observations of the same underlying image are required, such that \mathbb{E}_{y\sim\mathcal{Y}}[A^{\dagger}y]=x for the pseudo-inverse matrix A^{\dagger}. Thus, RARE suffers from the same limiting distinctions as N2N (i.e., 1), 2), and 3)).
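
The hedged sketch below illustrates the objective in Eq. (1) and a RARE-style variant that first back-projects the measurements with a pseudo-inverse; `denoiser`, `network`, and the batch layout are placeholders, not the original implementations.

```python
# Sketch of the N2N loss of Eq. (1) and a RARE-style pseudo-inverse variant (illustrative only).
import torch
import torch.nn.functional as F

def n2n_loss(denoiser, y1, y2):
    # y1, y2: two independent noisy observations of the same underlying images (batch, n)
    return F.mse_loss(denoiser(y1), y2)

def rare_style_loss(network, y1, y2, A_pinv):
    # Back-project measurements with the pseudo-inverse A^dagger before artifact removal.
    x1 = y1 @ A_pinv.T      # A^dagger y for each row of y1
    x2 = y2 @ A_pinv.T
    return F.mse_loss(network(x1), x2)
```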

Noise2Void [30] and Noise2Self [3] assume that the image can be partitioned such that the measurement noise in one subset of the partition is independent conditioned on the measurements in the other subset. This is true for denoising, but not applicable to general forward models. For example, in black hole and MRI compressed sensing, it is not true that the measurement noise can be independently partitioned since each measurement is a linear combination of all pixels. While this makes Noise2Void and Noise2Self more restrictive in the corruptions they can handle compared to RARE, they also don’t require multiple observations of the same underlying image. Hence the main differences between these works and our own are distinctions 2) and 3).

AmbientGAN [5] and other similar approaches based on GANs [28] and Variational Autoencoders (VAEs) [41, 38] have been proposed to learn an IGM directly from noisy measurements. For instance, AmbientGAN aims to learn a generator whose images lead to simulated measurements that are indistinguishable from the observed measurements; this generator can subsequently be used as a prior to solve inverse problems. However, AmbientGAN requires many measurement examples (on the order of 10,000) to produce a high quality generator. We corroborate this with experiments in Section IV to show that they require many independent observations and/or fine tuning of learning parameters to achieve good performance. Thus, the main distinctions between AmbientGAN and our work are 3) and 4).

Deep Image Prior (DIP) [55] uses a convolutional neural network as an implicit “prior”. DIP has shown strong image reconstruction performance across a variety of inverse problems without explicit probabilistic priors. However, it is prone to overfitting and requires selecting a specific stopping criterion. While this works well when example images exist, selecting this stopping condition from noisy measurements alone introduces significant human bias that can negatively impact results. Thus, the main distinction between DIP and our work is 4).

We would also like to highlight additional work done to improve certain aspects of the DIP. While the original DIP method used a U-Net architecture [45], other works such as the Deep Decoder [24] and ConvDecoder [17] used a decoder-like architecture that progressively grows a low-dimensional random tensor to a high-dimensional image. When underparameterized, such architectures have been shown to avoid overfitting and the need for early stopping. Other works mitigating early stopping include [9], which takes a Bayesian perspective on the DIP, using Langevin dynamics to perform posterior inference over the network weights; this improves performance and avoids overfitting.

III Approach

Figure 2: We propose to solve a collection of N ill-posed inverse problems by exploiting the common, low-dimensional structure of the underlying images. Given a set of N measurement examples \{y^{(i)}\}_{i=1}^{N} from N different underlying images, we propose to model each image posterior as the output of a shared IGM with a low-dimensional latent space. In particular, each posterior is approximated by G_{\theta}\sharp q_{\phi^{(i)}}, the push-forward of q_{\phi^{(i)}} through G_{\theta}, where G_{\theta} is the generator shared across all N examples and q_{\phi^{(i)}} is a low-dimensional, variational distribution particular to the i-th example. The parameters, \theta and \{\phi^{(i)}\}_{i=1}^{N}, are colored in blue and are jointly inferred. That is, such parameters are inferred specifically for the measurements \{y^{(i)}\}_{i=1}^{N}. The loss we use is the negative ELBOProxy, which is denoted by \mathcal{L} and given by Eq. (5). Note that there is no notion of a training set and test set as we aim to solve the inverse problems jointly from all available measurements.

In this work, we propose to solve a set of inverse problems without prior access to an IGM by assuming that the set of underlying images have common, low-dimensional structure. We motivate the use of optimizing the ELBO to infer an IGM by showing that it is a good criterion for generative model selection in Section III-A. Then, by optimizing the ELBO, we show in Section III-B that one can directly infer an IGM from corrupted measurements alone by parameterizing the image model as a deep generative network with a low-dimensional latent distribution. The IGM network weights are shared across all images, capitalizing on the common structure present in the data, while the parameters of each latent distribution are learned jointly with the generator to model the image posteriors for each measurement example.

III-A Motivation for ELBO as a model selection criterion

In order to accurately infer an IGM, we motivate the use of the ELBO as a loss by showing that it provides a principled criterion for selecting an IGM to use as a prior model. Suppose we are given noisy measurements from a single image: y=f(x)+\eta. In order to reconstruct the image x, we traditionally first require an IGM G that captures the distribution x was sampled from. A natural approach would be to find or select the model G that maximizes the model posterior distribution p(G|y)\propto p(y|G)p(G). That is, conditioned on the noisy measurements, find the IGM of highest likelihood. Unfortunately, computing p(y|G) is intractable, as it requires marginalizing and integrating over all x encompassed by the IGM G. However, we show that this quantity can be well approximated using the ELBO.

To motivate our discussion, we first consider estimating the image posterior p(x|y,G) by learning the parameters \phi of a variational distribution h_{\phi}(x). Observe that the definition of the KL-divergence followed by an application of Bayes’ theorem gives

D_{\operatorname*{KL}}(h_{\phi}(x)\;\|\;p(x|y,G)) := \mathbb{E}_{x\sim h_{\phi}(x)}\left[\log\frac{h_{\phi}(x)}{p(x|y,G)}\right]
= \mathbb{E}_{x\sim h_{\phi}(x)}\left[\log\frac{h_{\phi}(x)\,p(y|G)}{p(y|x,G)\,p(x|G)}\right]
= -\mathbb{E}_{x\sim h_{\phi}(x)}[\log p(y|x,G)+\log p(x|G)-\log h_{\phi}(x)] + \log p(y|G).

The ELBO of an IGM G given measurements y under variational distribution h_{\phi} is defined by

\operatorname*{ELBO}(G,h_{\phi};y) := \mathbb{E}_{x\sim h_{\phi}(x)}[\log p(y|x,G) + \log p(x|G) - \log h_{\phi}(x)]. (2)

Rearranging the previous equation and using the non-negativity of the KL-divergence, we see that

\log p(y|G) = D_{\operatorname*{KL}}(h_{\phi}(x)\;\|\;p(x|y,G)) + \operatorname*{ELBO}(G,h_{\phi};y) \geqslant \operatorname*{ELBO}(G,h_{\phi};y).

Thus, we can lower bound the model posterior as

\log p(G|y) \geqslant \operatorname*{ELBO}(G,h_{\phi};y) + \log p(G) - \log p(y).

Note that \log p(y) does not depend on the model G or the variational parameters of interest, \phi. If the variational distribution h_{\phi}(x) is a good approximation to the posterior p(x|y,G), D_{\operatorname*{KL}}\approx 0. Thus, maximizing \log p(G|y) with respect to G is approximately equivalent to maximizing \operatorname*{ELBO}(G,h_{\phi};y)+\log p(G).

Each term in the ELBO objective encourages certain properties of the IGM G. In particular, the first term in the ELBO, \mathbb{E}_{x\sim h_{\phi}(x)}[\log p(y|x,G)], requires that G should lead to an image estimate that is consistent with our measurements y. The second term, \mathbb{E}_{x\sim h_{\phi}(x)}[\log p(x|G)], encourages images sampled from h_{\phi}(x) to have high likelihood under our model G. The final term is the entropy term, \mathbb{E}_{x\sim h_{\phi}(x)}[-\log h_{\phi}(x)], which encourages a G that leads to “fatter” minima that are less sensitive to small changes in likely images x under G.
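
As a minimal illustration of these three terms, the sketch below forms a Monte Carlo estimate of the ELBO in Eq. (2) for a toy explicit IGM (an isotropic Gaussian image prior) and a diagonal Gaussian h_\phi in a denoising problem; the dimensions, noise level, and variational family are illustrative assumptions, not the experiments in this paper.

```python
# Monte Carlo estimate of the three ELBO terms for a toy explicit IGM (illustrative only).
import torch

n, sigma_noise, lam = 64, 0.5, 1.0
y = torch.randn(n)                                   # stand-in noisy measurements, y = x + eta

mu = torch.zeros(n, requires_grad=True)              # variational parameters of h_phi
log_std = torch.zeros(n, requires_grad=True)

h_phi = torch.distributions.Normal(mu, log_std.exp())
x_samples = h_phi.rsample((128,))                    # 128 reparameterized samples x ~ h_phi

log_lik = torch.distributions.Normal(x_samples, sigma_noise).log_prob(y).sum(-1)     # log p(y|x,G)
log_prior = torch.distributions.Normal(0.0, lam ** 0.5).log_prob(x_samples).sum(-1)  # log p(x|G)
entropy = -h_phi.log_prob(x_samples).sum(-1)         # -log h_phi(x)

elbo = (log_lik + log_prior + entropy).mean()        # maximize w.r.t. (mu, log_std)
```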

III-A1 ELBOProxy

Some IGMs are explicit, which allows for direct computation of \log p(x|G). For example, if our IGM models x as isotropic Gaussian with variance \lambda, then -\log p(x|G)\propto\lambda^{-1}\|x\|^{2}_{2}. In this case, we can optimize the ELBO defined in Equation (2) directly and then perform model selection. However, an important class of IGMs that we are interested in are those given by deep generative networks. Such IGMs are not probabilistic in the usual Bayesian interpretation of a prior, but instead implicitly enforce structure in the data. A key characteristic of many generative network architectures (e.g., VAEs and GANs) that we leverage is that they generate high-dimensional images from low-dimensional latent representations. Bottlenecking helps the network learn global characteristics of the underlying image distribution while also respecting the low intrinsic dimensionality of natural images. However, this means that we can only compute \log p(x|G) directly if we have an injective map [29]. This architectural requirement limits the expressivity of the network.

We instead consider a proxy of the ELBO that is especially helpful for deep generative networks. That is, suppose our IGM is of the form x=G(z). Introducing a variational family for our latent representations z\sim q_{\phi}(z) and choosing a latent prior distribution p_{Z}(z|G), we arrive at the following proxy of the ELBO:

\operatorname*{ELBOProxy}(G,q_{\phi};y) := \mathbb{E}_{z\sim q_{\phi}(z)}[\log p(y|G(z)) + \log p_{Z}(z|G) - \log q_{\phi}(z)]. (3)

In our experiments, we chose p_{Z}(z|G) to be an isotropic Gaussian prior. This is a common choice in many generative modeling frameworks and has been shown to be a good choice of prior in the latent space.
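
A minimal Monte Carlo estimator of Eq. (3) might look like the following sketch, assuming the isotropic Gaussian latent prior mentioned above; `G`, `q_phi`, and `log_likelihood` are placeholders for the generator, the latent variational distribution, and log p(y|G(z)).

```python
# Sketch of a Monte Carlo ELBOProxy estimate (Eq. (3)); names and sample count are illustrative.
import torch

def elbo_proxy(G, q_phi, y, log_likelihood, num_samples=64):
    z = q_phi.rsample((num_samples,))                 # z ~ q_phi, reparameterized for gradients
    x = G(z)                                          # candidate images G(z)
    data_fit = log_likelihood(y, x)                   # log p(y | G(z)), shape (num_samples,)
    latent_prior = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)   # isotropic Gaussian p_Z
    entropy = -q_phi.log_prob(z)                      # -log q_phi(z)
    return (data_fit + latent_prior + entropy).mean()
```

In our framework, this quantity is maximized with respect to the variational parameters of q_\phi (and, in Section III-B, jointly with the generator weights).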

To motivate this proxy, it is instructive to consider the case where our variational distribution h_{\phi}:=G\sharp q_{\phi} is the push-forward of a latent distribution q_{\phi} through an injective or invertible function G. To be precise, recall the following definition of the push-forward measure.

Definition 1.

Let G:\mathbb{R}^{k}\rightarrow\mathbb{R}^{n} be a measurable function and suppose p is a distribution (or, more generally, a measure) on \mathbb{R}^{k}. Then the push-forward measure \mu:=G\sharp p is the measure on \mathbb{R}^{n} that satisfies the following: for all Borel sets A of \mathbb{R}^{n}, \mu(A)=p(G^{-1}(A)), where G^{-1}(A) denotes the preimage of A with respect to G.

The push-forward measure essentially characterizes how a distribution p changes when passed through a function G. It follows from the definition of the push-forward that x\sim G\sharp q_{\phi} if and only if x=G(z) where z\sim q_{\phi}. In the case h_{\phi}=G\sharp q_{\phi} for an injective function G, the ELBO and ELBOProxy are equivalent, as shown in the following proposition:

Proposition 1.

Suppose G:\mathbb{R}^{k}\rightarrow\mathbb{R}^{n} is continuously differentiable and injective. For two probability distributions p_{Z} and q_{\phi} on \mathbb{R}^{k}, define the measures p(\cdot|G)=G\sharp p_{Z} and h_{\phi}=G\sharp q_{\phi}. Then

\operatorname*{ELBO}(G,h_{\phi};y)=\operatorname*{ELBOProxy}(G,q_{\phi};y)\quad\forall y\in\mathbb{R}^{m}.
Proof.

It suffices to show

\mathbb{E}_{x\sim h_{\phi}(x)}[\log p(x|G)-\log h_{\phi}(x)] = \mathbb{E}_{z\sim q_{\phi}(z)}[\log p_{Z}(z|G)-\log q_{\phi}(z)].

Let J_{G}(z)\in\mathbb{R}^{n\times k} denote the Jacobian of G at an input z\in\mathbb{R}^{k}. Since G is injective and continuously differentiable with p(\cdot|G)=G\sharp p_{Z}, we can compute the likelihood of any point x\in\mathrm{range}(G) [29] via

\log p(x|G) = \log p_{Z}(G^{\dagger}(x)|G) - \frac{1}{2}\log|\det[J_{G}(G^{\dagger}(x))^{T}J_{G}(G^{\dagger}(x))]|

where G^{\dagger} is the inverse of G on its range. This is essentially the classical change-of-variables formula specialized to the case when G is injective and we wish to access likelihoods on the range of G. Note that this equation is only valid for points in the range of the injective function G. Likewise, since h_{\phi}=G\sharp q_{\phi}, we can compute the log-density of h_{\phi} at any point x\in\mathrm{range}(G) as

\log h_{\phi}(x) = \log q_{\phi}(G^{\dagger}(x)) - \frac{1}{2}\log|\det[J_{G}(G^{\dagger}(x))^{T}J_{G}(G^{\dagger}(x))]|.

Now observe that for x\sim h_{\phi}, x\in\mathrm{range}(G) as h_{\phi} is the push-forward of q_{\phi} through G. Thus, for x\sim h_{\phi}, we have that

\log p(x|G) - \log h_{\phi}(x) = \log p_{Z}(G^{\dagger}(x)|G) - \log q_{\phi}(G^{\dagger}(x)).

By the definition of the push-forward measure, we have that x\sim h_{\phi} implies x=G(z) for some z\sim q_{\phi}. Using our previous formulas, we can compute the expectation over the difference \log p(x|G)-\log h_{\phi}(x) with respect to h_{\phi} as

\mathbb{E}_{x\sim h_{\phi}(x)}[\log p(x|G)-\log h_{\phi}(x)]
= \mathbb{E}_{x\sim h_{\phi}(x)}[\log p_{Z}(G^{\dagger}(x)|G)-\log q_{\phi}(G^{\dagger}(x))]
= \mathbb{E}_{z\sim q_{\phi}(z)}[\log p_{Z}(G^{\dagger}(G(z))|G)-\log q_{\phi}(G^{\dagger}(G(z)))]
= \mathbb{E}_{z\sim q_{\phi}(z)}[\log p_{Z}(z|G)-\log q_{\phi}(z)].

An important consequence of this result is that for injective generators G, the inverse of G (on its range) is not required for computing the ELBO. In this case, the ELBOProxy is in fact equivalent to the ELBO. While not all generators G will be injective, quality generators are largely injective over high likelihood image samples. In Section III-A2 and Fig. 3, we experimentally show that this proxy can aid in selecting potentially non-injective generative networks from corrupted measurements.
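
The sketch below numerically illustrates Proposition 1 for a random injective linear G: evaluating \log p(x|G) and \log h_{\phi}(x) through the change-of-variables formula (with its log-determinant term) reproduces the latent-space quantity \mathbb{E}_{z}[\log p_{Z}(z)-\log q_{\phi}(z)]. The specific Gaussians and dimensions are arbitrary choices made only for this check.

```python
# Numerical sanity check of Proposition 1 for an injective linear generator (illustrative).
import torch

torch.manual_seed(0)
k, n, S = 3, 8, 5000
G = torch.randn(n, k)                                  # full column rank w.p. 1, hence injective
G_pinv = torch.linalg.pinv(G)                          # G^dagger, the inverse of G on its range
logdet = 0.5 * torch.logdet(G.T @ G)                   # (1/2) log |det(J_G^T J_G)|

p_Z = torch.distributions.MultivariateNormal(torch.zeros(k), torch.eye(k))
q_phi = torch.distributions.MultivariateNormal(torch.ones(k), 0.5 * torch.eye(k))

z = q_phi.sample((S,))
x = z @ G.T                                            # x = G(z), so x ~ h_phi = G # q_phi

log_p_x = p_Z.log_prob(x @ G_pinv.T) - logdet          # log p(x|G) via change of variables
log_h_x = q_phi.log_prob(x @ G_pinv.T) - logdet        # log h_phi(x) via change of variables

print((log_p_x - log_h_x).mean())                      # matches the latent-space quantity below
print((p_Z.log_prob(z) - q_phi.log_prob(z)).mean())    # Monte Carlo estimate of -KL(q_phi || p_Z)
```

Note that the log-determinant terms cancel in the difference, which is exactly why the ELBOProxy needs neither the Jacobian nor the inverse of G.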

III-A2 Toy example

To illustrate the use of the ELBOProxy as a model selection criterion, we conduct the following experiment that asks whether the ELBOProxy can identify the best model from a given set of image generation models. For this experiment, we use the MNIST dataset [32] and consider two inverse problems: denoising and phase retrieval. We train a generative model G_{c} on each class c\in\{0,1,2,\dots,9\} using the clean MNIST images directly. Hence, G_{c} generates images from class c via G_{c}(z) where z\sim\mathcal{N}(0,I). Then, given noisy measurements y_{c} from a single image from class c, we ask whether the generative model G_{c} from the appropriate class would achieve the best ELBOProxy. Each G_{c} is the decoder of a VAE with a low-dimensional latent space, with no architectural constraints to ensure injectivity. For denoising, our measurements are y_{c}=x_{c}+\eta_{c} where \eta_{c}\sim\mathcal{N}(0,\sigma^{2}I) and \sigma=\sqrt{0.5}. For phase retrieval, y_{c}=|\mathcal{F}(x_{c})|+\eta_{c} where \mathcal{F} is the Fourier transform and \eta_{c}\sim\mathcal{N}(0,\sigma^{2}I) with \sigma=\sqrt{0.05}.

We construct 10\times 10 arrays for each problem, where in the i-th row and j-th column, we compute the negative ELBOProxy obtained by using model G_{i-1} to reconstruct images from class j-1. We calculate \operatorname*{ELBOProxy}(G_{c},q_{\phi_{c}};y_{c}) by parameterizing q_{\phi_{c}} with a Normalizing Flow [18] and optimizing network weights \phi_{c} to maximize (3). The expectation in the ELBOProxy is approximated via Monte Carlo sampling. Results from the first 5 classes are shown in Fig. 3 and the full arrays are shown in the supplemental materials. We note that all of the correct models are chosen in both denoising and phase retrieval. We also note some interesting cases where the ELBOProxy values are similar, such as when recovering the 3 or 4 image. For example, when denoising the 4 image, both G_{4} and G_{9} achieve comparable ELBOProxy values. By carefully inspecting the noisy image of the 4, one can see that both models are reasonable given the structure of the noise.
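
In code, the selection procedure can be sketched as follows: for each candidate generator G_c, fit variational parameters by maximizing a Monte Carlo ELBOProxy estimate and keep the model with the highest score. A diagonal Gaussian q_\phi stands in for the normalizing flow used here, and `G_c`, `y`, and `log_likelihood` are placeholders.

```python
# Sketch of ELBOProxy-based model selection over candidate generators (illustrative settings).
import torch

def fit_and_score(G_c, y, log_likelihood, latent_dim, steps=2000, lr=1e-2, S=64):
    # Fit a diagonal Gaussian q_phi (stand-in for a normalizing flow) by maximizing the ELBOProxy.
    mu = torch.zeros(latent_dim, requires_grad=True)
    log_std = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, log_std], lr=lr)
    p_Z = torch.distributions.Normal(0.0, 1.0)
    for _ in range(steps):
        q_phi = torch.distributions.Independent(
            torch.distributions.Normal(mu, log_std.exp()), 1)
        z = q_phi.rsample((S,))
        elbo = (log_likelihood(y, G_c(z))
                + p_Z.log_prob(z).sum(-1) - q_phi.log_prob(z)).mean()
        opt.zero_grad(); (-elbo).backward(); opt.step()
    return elbo.item()                                   # final ELBOProxy estimate

# scores = {c: fit_and_score(G[c], y_c, log_lik, latent_dim=20) for c in range(10)}
# chosen = max(scores, key=scores.get)                   # model with the highest ELBOProxy
```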

Figure 3: We consider two inverse problems: denoising and phase retrieval. Top: the two topmost rows correspond to the ground-truth image x_{c} and the noisy measurements y_{c}. Center: in each column, we show the means of the distribution induced by the push-forward of G_{j} and each latent distribution z\sim q_{\phi_{j}} for j\in\{0,\dots,9\}. Bottom: each column of the array corresponds to the negative ELBOProxy achieved by each model in reconstructing the images. Here, lower is better. Boxes highlighted in green correspond to the best negative ELBOProxy values in each column. In all these examples, the correct model was chosen.
Figure 4: Improvement with an increasing number of noisy observations. We demonstrate our method of inferring an IGM to perform denoising for an increasing number of noisy MNIST images (5, 75, and 150 images from left to right). We showcase results on three randomly selected examples that appear in each collection of inverse problems. In each panel, we include the ground-truth, noisy measurements, mean of the posterior, and standard deviation of the posterior. We also include the residual error divided by the empirical standard deviation for N=150. On the far right, we visualize reconstructions using an IGM trained on the full clean MNIST 8's class (6000 images). We observe that the mean reconstructions and standard deviations from our low-data IGMs become more similar to the full-data IGM with increasing data. Our residual errors are largely within 3 standard deviations.

III-B Simultaneously solving many inverse problems

Figure 5: Denoising 95 images of celebrity A. We demonstrate our method described in III-B using 95 noisy images of a celebrity. Here we show the underlying image (row 1), noisy measurements (row 2), mean reconstruction (row 3), and residual error (row 4) for a subset of the 95 different noisy images. Our reconstructions are much less noisy than the measurements and recover sharp features that are hard to discern in the noisy images. We visualize the residual error normalized by the empirical standard deviation, which indicates errors are largely within 3 standard deviations (refer to the colorbar in Fig. 4). Note that no explicit spatial-domain prior/regularizer was used in denoising.

As the previous section illustrates, the ELBOProxy provides a good criterion for choosing an appropriate IGM from noisy measurements. Here, we consider the task of directly inferring the IGM from a collection of measurement examples y^{(i)}=f^{(i)}(x^{(i)})+\eta^{(i)} for i\in[N], where the parameters are found by optimizing the ELBOProxy. The key assumption we make is that common, low-dimensional structure is shared across the underlying images \{x^{(i)}\}_{i=1}^{N}. We propose to find a shared generator G_{\theta} with weights \theta along with latent distributions q_{\phi^{(i)}} that can be used to reconstruct the full posterior of each image x^{(i)} from its corresponding measurement example y^{(i)}. This approach is illustrated in Fig. 2. Having the generator be shared across all images helps capture their common collective structure. Each forward model corruption, however, likely induces its own complicated image posterior. Hence, we assign each measurement example y^{(i)} its own latent distribution to capture the differences in their posteriors. Note that because we optimize a proxy of the ELBO, the inferred distribution may not necessarily be the true image posterior, but it still captures a distribution of images that fit the observed measurements.

Inference approach

More explicitly, given a collection of measurement examples \{y^{(i)}\}_{i=1}^{N}, we jointly infer a generator G_{\theta} and a set of variational distributions \{q_{\phi^{(i)}}\}_{i=1}^{N} by optimizing a Monte Carlo estimate of the ELBOProxy from Eq. (3), described by:

\{\hat{\theta},\hat{\phi}^{(1)},\dots,\hat{\phi}^{(N)}\} \in \operatorname*{arg\,max}_{\theta,\{\phi^{(i)}\}_{i=1}^{N}}\mathcal{L} (4)

where

\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\operatorname*{ELBOProxy}(G_{\theta},q_{\phi^{(i)}};y^{(i)})+\log p(G_{\theta}). (5)

In terms of choices for \log p(G_{\theta}), we can add additional regularization to promote particular properties of the IGM G_{\theta}, such as having a small Lipschitz constant. Here, we consider having sparse neural network weights as a form of regularization and use dropout [49] during training to represent \log p(G_{\theta}).

Once a generator G_{\hat{\theta}} and variational parameters \hat{\phi}^{(i)} have been inferred, we solve the i-th inverse problem by simply sampling \hat{x}^{(i)}=G_{\hat{\theta}}(\hat{z}^{(i)}) where \hat{z}^{(i)}\sim q_{\hat{\phi}^{(i)}}(z^{(i)}) or computing an average \overline{x}^{(i)}=\frac{1}{T}\sum_{t=1}^{T}G_{\hat{\theta}}(\hat{z}_{t}^{(i)}). Producing samples for each inverse problem can help visualize the range of uncertainty under the learned IGM G_{\hat{\theta}}, while the expected value of the distribution empirically provides clearer estimates with better metrics in terms of PSNR or MSE. We report PSNR outputs in our subsequent experiments and also visualize the standard deviation of our reconstructions.
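
A hedged sketch of this joint inference is given below: one shared generator and one Gaussian latent distribution per measurement example are updated together on the averaged negative ELBOProxy of Eqs. (4)-(5). The generator module, likelihood functions, and all hyperparameters are placeholders rather than the exact configuration used in our experiments.

```python
# Sketch of the joint inference in Eqs. (4)-(5); architecture and settings are placeholders.
import torch

def infer_igm(generator, ys, log_likelihoods, latent_dim, steps=5000, lr=1e-3, S=16):
    N = len(ys)
    mus = [torch.zeros(latent_dim, requires_grad=True) for _ in range(N)]
    log_stds = [torch.zeros(latent_dim, requires_grad=True) for _ in range(N)]
    opt = torch.optim.Adam(list(generator.parameters()) + mus + log_stds, lr=lr)
    p_Z = torch.distributions.Normal(0.0, 1.0)                 # isotropic Gaussian latent prior
    for _ in range(steps):
        loss = 0.0
        for i in range(N):                                     # one latent distribution per example
            q_i = torch.distributions.Independent(
                torch.distributions.Normal(mus[i], log_stds[i].exp()), 1)
            z = q_i.rsample((S,))
            x = generator(z)                                   # shared G_theta across all examples
            elbo_i = (log_likelihoods[i](ys[i], x)
                      + p_Z.log_prob(z).sum(-1) - q_i.log_prob(z)).mean()
            loss = loss - elbo_i / N                           # negative averaged ELBOProxy
        opt.zero_grad(); loss.backward(); opt.step()
    return mus, log_stds

# Reconstructions: sample generator(z) with z ~ q_i, or average such samples for each example i.
```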

IV Experimental Results

We now consider solving a set of inverse problems via the framework described in III-B. For each of these experiments, we use a multivariate Gaussian distribution to parameterize each of the posterior distributions q_{\phi^{(i)}} and a Deep Decoder [24] with 6 layers, 150 channels in each layer, a latent size of 40, and a dropout of 10^{-4} as the IGM. The multivariate Gaussian distributions are parameterized by means and covariance matrices \{\mu^{(i)},\Lambda^{(i)}=U_{i}U_{i}^{T}+\varepsilon I\}_{i=1}^{N}, where \varepsilon I with \varepsilon=10^{-3} is added to the covariance matrix to help with stability of the optimization. We choose to parameterize the latent distributions using Gaussians for memory considerations. Note that the same hyperparameters are used for all experiments demonstrating our proposed method.
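
The latent parameterization described above can be sampled with a standard reparameterization so that gradients reach \mu^{(i)} and U_i, as in the following sketch; apart from the latent size of 40 and \varepsilon=10^{-3} quoted in the text, the values are illustrative.

```python
# Sketch of the full-covariance latent distribution q_phi = N(mu, U U^T + eps I) (illustrative).
import torch

latent_dim, eps, S = 40, 1e-3, 16
mu = torch.zeros(latent_dim, requires_grad=True)
U = (0.01 * torch.randn(latent_dim, latent_dim)).requires_grad_()

cov = U @ U.T + eps * torch.eye(latent_dim)            # Lambda = U U^T + eps I (positive definite)
q_phi = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)
z = q_phi.rsample((S,))                                # differentiable (reparameterized) samples
log_q = q_phi.log_prob(z)                              # entropy term used in the ELBOProxy
```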

In our experiments, we also compare to the following baseline methods: AmbientGAN [5], Deep Image Prior (DIP) [54], and regularized maximum likelihood using total variation (TV-RML). AmbientGAN is most similar to our setup, as it constructs an IGM directly from measurement examples; however, it doesn’t aim to estimate image reconstruction posteriors, but instead aims to learn an IGM that samples from the full underlying prior. TV-RML uses explicit total variation regularization, while DIP uses an implicit convolutional neural network “prior”. As we will show, all these baseline methods require fine-tuning hyperparameters to each set of measurements in order to produce their best results. All methods can also be applied to a variety of inverse problems, making them appropriate choices as baselines.

Figure 6: Denoising baseline comparisons. We compare to various baselines (AmbientGAN, Deep Image Prior (DIP), and regularized maximum likelihood using TV (TV-RML) with weight \lambda), and we report the average PSNR across all 95 reconstructions. We show both early stopping and full training results using DIP. Our method exhibits higher PSNR than all other baselines. We also include results for baselines that require fine-tuning to demonstrate sensitivity to subjective stopping conditions.
Figure 7: Multi-noise denoising. We demonstrate our method described in Section III-B to perform denoising on measurement collections experiencing different noise levels. For each experiment, we use 75 measurement examples, which are defined by y^{(i)}=x^{(i)}+\eta where \eta\in\{\eta_{1},\eta_{2},\eta_{3}\} and \eta_{i}\sim\mathcal{N}(0,\sigma_{i}^{2}I). In Experiment 1, we use noisy measurement examples that have additive noise with standard deviations of 0.01, 0.1, and 0.5. In Experiment 2, we use noisy measurement examples that have additive noise with standard deviations of 0.2, 0.3, and 0.5. We visualize the true underlying images, the measurement used for each experiment, and the mean of the image reconstruction posterior. Most of the reconstructions recover the primary features of the true image. However, in Experiment 1, the reconstructions of the low SNR measurements exhibit bias and do not match the true images well. This is likely because in Experiment 1, the high SNR measurements influence the inferred IGM more strongly than low SNR measurements, leading to the biased reconstructions highlighted in the red box.

IV-A Denoising

We show results on denoising a collection of noisy images of 8's from the MNIST dataset in Fig. 4 and denoising a collection of noisy images of a single face from the PubFig [31] dataset in Fig. 5. The measurements for both datasets are defined by y=x+\eta where \eta\sim\mathcal{N}(0,\sigma^{2}I), with an SNR of \sim-3 dB for the MNIST digits and an SNR of \sim15 dB for the faces. Our method is able to remove much of the added noise and recovers small-scale features, even with only 10's of observations. As shown in Fig. 4, the reconstructions achieved under the learned IGM improve as the number of independent observations increases. Our reconstructions also substantially outperform the baseline methods, as shown in Fig. 6. Unlike DIP, our method does not overfit and does not require early stopping. Our method does not exhibit noisy artifacts like those seen in all baseline methods, despite such methods being fine-tuned. We show quantitative comparisons in Table I.

In Fig. 7 we show additional multi-noise denoising experiments where we have 75 noisy images with 3 different noise levels. More formally, y^{(i)}=x^{(i)}+\eta where \eta\in\{\eta_{1},\eta_{2},\eta_{3}\} and \eta_{i}\sim\mathcal{N}(0,\sigma_{i}^{2}I). In Experiment 1, the noise levels have a wide range, and we use standard deviations of \{\sigma_{1},\sigma_{2},\sigma_{3}\}=\{0.01,0.1,0.5\}. In Experiment 2, the noise levels are much closer together, and we use standard deviations of \{\sigma_{1},\sigma_{2},\sigma_{3}\}=\{0.2,0.3,0.5\}. When the SNRs are similar (as in Experiment 2), the reconstructions match the true underlying images well. However, when the measurements have a wide range of SNRs (i.e., Experiment 1), the reconstructions from low SNR measurements exhibit bias and poorly reconstruct the true underlying image, as shown in Fig. 7. This is likely because the high SNR measurements influence the inferred IGM more strongly than the low SNR measurements. The full set of results are available in the supplemental materials.

Figure 8: Phase retrieval from MNIST 8’s. We demonstrate our method described in III-B to perform phase retrieval on 150 images. For the Fourier phase retrieval setting, we show examples from the two observed modes of the posterior. For the Gaussian phase retrieval setting, we show the mean and standard deviation of our reconstructions.
Figure 9: Visualization of the intrinsic resolution of the EHT compressed sensing measurements. The EHT measures sparse spatial frequencies of the image (i.e., components of the image’s Fourier transform). In order to generate the underlying image (c), all frequencies in the entire domain of (a) must be used. Restricting spatial frequencies to the ones in (a) and (b)’s green circle generates the target (d), where (b) is a zoom in of (a). The EHT samples a subset of the interior of the green circle, indicated by the sparse black samples in (b). Naively recovering an image using only these frequencies results in the dirty image (e), which is computed by A^{H}y. The 2D spatial Fourier frequency coverage represented with (u,v) positions is referred to as the UV coverage.

IV-B Phase retrieval

Here we consider solving non-convex inverse problems, and demonstrate our approach on phase retrieval. Our measurements are described by y=|\mathcal{F}(x)|+\eta where \mathcal{F}(\cdot) is a linear operator and \eta\sim\mathcal{N}(0,\sigma^{2}I). We consider two types of measurements: one where \mathcal{F}(\cdot) is the Fourier transform and the other where \mathcal{F}(\cdot) is an m\times n complex Gaussian matrix with m=\lceil 0.1n\rceil. Since each measurement is the magnitude of complex linear measurements, there is an inherent phase ambiguity in the problem. Additionally, flips and spatial shifts are possible reconstructions when performing Fourier phase retrieval. Due to the severe ill-posedness of the problem, representing this complicated posterior that includes all spatial shifts is challenging. Thus, we incorporate an envelope (i.e., a centered rectangular mask) as the final layer of G_{\theta} to encourage the reconstruction to be centered. Nonetheless, flips and shifts are still possible within this enveloped region.
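
For concreteness, the sketch below simulates the two measurement operators considered here (Fourier magnitude and complex-Gaussian magnitude with m=\lceil 0.1n\rceil); the image size, matrix scaling, and noise level are assumptions made for illustration.

```python
# Sketch of the Fourier and Gaussian phase retrieval measurement models (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((28, 28))                                 # stand-in for an MNIST-like image
n = img.size
m = int(np.ceil(0.1 * n))
sigma = np.sqrt(0.05)

# Fourier phase retrieval: y = |F(x)| + eta
y_fourier = np.abs(np.fft.fft2(img)) + sigma * rng.standard_normal(img.shape)

# Gaussian phase retrieval: y = |A x| + eta with A an m x n complex Gaussian matrix
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2 * n)
y_gauss = np.abs(A @ img.ravel()) + sigma * rng.standard_normal(m)
```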

Figure 10: Recovering a 60-day video of the M87 black hole with the EHT telescope array. Left: We demonstrate our method described in III-B using simulated measurements from 60 frames of an evolving black hole target with the forward model described in Fig. 9. Here we show the underlying images, dirty images (A^{H}y, see Fig. 9), and mean reconstructions, respectively. Additionally, we show the unwrapped space × time image, which is taken along the overlaid white ring illustrated in the day 1 underlying image. The bright spot's temporal trajectory in our reconstruction matches that of the truth. Right: We compare our method to various baseline methods. Our results are much sharper and exhibit fewer artifacts than AmbientGAN and TV-RML with weight \lambda. Note that we include results using AmbientGAN with both default parameters and fine-tuned parameters.

Dataset         N    Forward model     Ours   AmbGAN       DIP            DIP           TV             TV
                     y = f(x) + \eta          fine-tuned   fewer epochs   many epochs   lower \lambda  higher \lambda
celeb. A        95   y = x + \eta      27.0   22.9         25.6           25.4          26.2           26.9
celeb. B        95   y = x + \eta      26.2   19.7         24.4           25.2          25.9           26.4
MNIST 8's       150  y = x + \eta      21.1   18.0         18.8           13.3          16.4           18.2
M87* (target)   60   y = Ax + \eta     29.3   25.7         29.0           28.6          24.3           25.9

TABLE I: Quantitative comparisons to baselines. We show the mean PSNR between the reconstructions of different methods (ours, AmbientGAN [5], DIP [54], and TV-RML) and either the true underlying image or the target in the case of black hole compressed sensing. The highest PSNR for each set of measurements is in bold. We fix the hyperparameters of the baselines across each forward model, empirically selecting them for good performance on each type of problem. Note that the range in performance depends on the choice of hyperparameters. For denoising, the DIP results correspond to 100 and 300 epochs for the fewer and many epochs baselines, respectively, and the lower and higher \lambda's are 1e3 and 1e4, respectively. For black hole compressed sensing, the DIP results correspond to 1000 and 3000 epochs for the fewer and many epochs baselines, respectively, and the lower and higher \lambda's are 1e2 and 1e3, respectively. Visual examples of celeb. B are shown in the supplemental materials.

We show results from a set of N=150 noisy phase retrieval measurements from the MNIST 8's class with an SNR of \sim52 dB. We consider three settings: 1) all measurements arise from a Gaussian measurement matrix, 2) all measurements arise from Fourier measurements, and 3) half of the measurements are Gaussian and the other half are Fourier. We show qualitative results for cases 1 and 2 in Fig. 8. In the Gaussian case, we note that our mean reconstructions are nearly identical to the true digits and the standard deviations exhibit uncertainty in regions we would expect (e.g., around edges). In the Fourier case, our reconstructions have features similar to the digit 8, but contain artifacts. These artifacts are only present in the Fourier case due to additional ambiguities, which lead to a more complex posterior [27]. We also show the average PSNR of our reconstructions for each measurement model in Table II. For more details on this experiment, please see Section XII-C of the supplemental materials.

                          Measurement operator(s) given
Reconstructed examples    Gaussian    Fourier    Both
Gaussian                  30.8        -          30.2
Fourier                   -           13.6       19.4

TABLE II: Phase retrieval PSNRs for different measurement operators. Each column corresponds to the type of measurements that the method was given (a total of N=150). In the case of “Both”, 75 Gaussian measurements and 75 Fourier measurements were given. We then show the average PSNR for our reconstructions given the specific measurement operator (either Fourier or Gaussian). Note that when given both Gaussian and Fourier measurement examples, our method exhibits a higher PSNR on the images recovered from Fourier examples than when given only Fourier measurements. Additionally, there is only a slight decrease in performance on the Gaussian measurement reconstructions as compared to when given entirely Gaussian examples.

IV-C Black hole imaging

We consider a real-world inverse problem for which ground-truth data would be impossible to obtain. In particular, we consider a compressed sensing problem inspired by astronomical imaging of black holes with the Event Horizon Telescope (EHT): suppose we are given access to N measurement examples of the form y^{(i)}=A^{(i)}x^{(i)}+\eta^{(i)} where A^{(i)}\in\mathbb{C}^{m\times n} is a low-rank compressed sensing matrix arising from interferometric telescope measurements and \eta^{(i)} denotes noise with known properties (e.g., distributed as a zero-mean Gaussian with known covariance). The collection of images \{x^{(i)}\}_{i=1}^{N} are snapshots of an evolving black hole target. This problem is ill-posed and requires the use of priors or regularizers to recover a reasonable image [13]. Moreover, it is impossible to directly acquire example images of black holes, so any pixel-level prior defined a priori will exhibit human bias. Recovering an image, or movie, of a black hole with as little human bias as possible is essential for both studying the astrophysics of black holes as well as testing fundamental physics [12, 15]. We show how our proposed method can be used to tackle this important problem. In particular, we leverage knowledge that, although the black hole evolves, it will not change drastically from minute-to-minute or day-to-day. We study two black hole targets: the black hole at the center of the Messier 87 galaxy (M87) and the black hole at the center of the Milky Way galaxy – Sagittarius A* (Sgr A*).

Imaging M87 using the current EHT array

We first consider reconstructing the black hole at the center of the Messier 87 galaxy, which does not evolve noticeably within the timescale of a single day. The underlying images are from a simulated 60 frame video with a single frame for each day. We show results on N=60 frames from an evolving black hole target with a diameter of \sim40 microarcseconds, as was identified as the diameter of M87 according to [42, 14], in Fig. 10. In particular, the measurements are given by \{y^{(i)}=Ax^{(i)}+\eta^{(i)}\}_{i=1}^{N}, where x^{(i)} is the underlying image on day i, A is the forward model that represents the telescope array, which is static across different days, and the noise \eta^{(i)}\sim\mathcal{N}(0,\Sigma) has a covariance \Sigma, which is a diagonal matrix with realistic variances based on the telescope properties***We leave the more challenging atmospheric noise that appears in measurements for future work. Measurements are simulated from black hole images with a realistic flux of 1 Jansky [57]. We also visualize a reference “target” image, which is the underlying image filtered with a low-pass filter that represents the maximum resolution achievable with the telescope array used to collect measurements – in this case the EHT array consisting of 11 telescopes (see Fig. 9 and Section IX in the supplemental materials).
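
The sketch below mimics this measurement model and the dirty image A^{H}y shown in Fig. 9, with a random complex matrix standing in for the true EHT visibility-sampling operator; all sizes and noise levels are illustrative.

```python
# Sketch of the interferometric measurement model y = A x + eta and the dirty image A^H y.
# The random complex A is only a stand-in for the true EHT sampling operator (illustrative).
import numpy as np

rng = np.random.default_rng(0)
npix, m = 32 * 32, 200                                    # image pixels and number of visibilities
x = rng.random(npix)                                      # stand-in snapshot of the source
A = (rng.standard_normal((m, npix)) + 1j * rng.standard_normal((m, npix))) / np.sqrt(npix)
noise_std = 0.01 * np.ones(m)                             # per-visibility std (diagonal Sigma)

eta = noise_std * (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
y = A @ x + eta                                           # complex visibility measurements
dirty_image = (A.conj().T @ y).real.reshape(32, 32)       # A^H y, the naive inversion of Fig. 9
```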

As seen in Fig. 10, our method is not only able to reconstruct the large scale features of the underlying image without any aliasing artifacts, but also achieves a level of super-resolution (see Table III in the supplemental materials). Our reconstructions also achieve higher super-resolution as compared to our baselines (i.e., AmbientGAN, TV-RML, and DIP) in Fig. 10 and do not exhibit artifacts evident in the reconstructions from these baselines. The two AmbientGAN settings were qualitatively chosen to show that the final result is sensitive to the choice of hyperparameters. The default AmbientGAN parameters produce poor results, and even with fine-tuning to best fit the underlying images (i.e., cheating with knowledge of the ground-truth), the results still exhibit substantial artifacts. We outperform the baselines in terms of PSNR when compared to the target image (see Table I). Our results demonstrate that we are able to capture the varying temporal structure of the black hole, rather than just recovering a static image. It is important to note that there is no explicit temporal regularization introduced; the temporal smoothness is implicitly inferred by the constructed IGM.

Figure 11: Visualization of the EHT forward model of Sgr A* over time. The spatial frequency coordinates of Sgr A* measured by an ngEHT [43] array over the course of one night are shown on the top. Snapshots of the coordinates that are measured at different times are shown on the bottom. Note that the location and sparsity of the measurement coordinates change over time.
Figure 12: Recovering a video of the Sgr A* black hole with a futuristic telescope array over the course of a night. We demonstrate our method described in III-B to perform video reconstruction on 60 frames of a video, where the measurements for each frame are generated by different forward models given by imaging Sgr A*. Here we show the underlying image, the dirty image (A^{H}y), and the mean of the reconstructed posterior. Additionally, we show the unwrapped space × time image, which is taken along the white ring illustrated in the 4:00 UTC underlying image. The bright spot's temporal trajectory in our reconstruction matches that of the truth. The measurement noise is consistent with a black hole having a flux of 2 Janskys.
Imaging Sgr A* from multiple forward models

The framework we introduce can also be applied to situations in which the measurements themselves are induced by different forward models. In particular, the measurements \{y^{(i)}=f^{(i)}(x^{(i)})+\eta^{(i)}\}_{i=1}^{N} are given by an underlying image x^{(i)}, a forward model f^{(i)} that is specific to that observation, and noise \eta^{(i)} with known properties.

As an illustrative example, we consider the problem of reconstructing a video of the black hole at the center of the Milky Way – Sagittarius A (Sgr A). Unlike M87, Sgr A evolves on the order of minutes. Therefore, we can only consider that the black hole is static for only a short time when only a subset of telescopes are able to observe the black hole. This results in a different measurement forward model for each frame of the black hole “movie” [16]. In particular, the measurements are given by {y(i)=A(i)x(i)+η(i)}i=1N\{y^{(i)}=A^{(i)}x^{(i)}+\eta^{(i)}\}_{i=1}^{N}, where x(i)x^{(i)} is the underlying image at time ii, A(i)A^{(i)} is the forward model that incorporates the telescope configuration at that time, and η(i)𝒩(0,Σ(i))\eta^{(i)}\sim\mathcal{N}(0,\Sigma^{(i)}) is noise where Σ(i)\Sigma^{(i)} is a diagonal matrix with realistic standard deviations derived from the telescopes’ physical properties. The measurement noise is consistent with a black hole with a flux of 2 Janskys. The measurement operator is illustrated in Fig. 11.

We show examples of reconstructing 60 frames of Sgr A with a diameter of \sim50 microarcseconds using measurements simulated from a proposed future next-generation EHT (ngEHT) [43] array, which consists of 23 telescopes. These results are shown in Fig. 12. Our reconstructions remove much of the aliasing artifacts evident in the dirty images and reconstructs the primary features of the underlying image without any form of temporal regularization. These results have high fidelity especially considering that the measurements are very sparsely sampled. Although these results come from simulated measurements that do not account for all forms of noise we expect to encounter, the high-quality movie reconstructions obtained without the use of a spatio-temporal prior show great promise towards scientific discovery that could come from a future telescope array paired with our proposed reconstruction approach.

V Theory for Linear IGMs

We now introduce theoretical results on the inferred IGM for linear inverse problems. Specifically, we consider the case when the IGM GθG_{\theta} is linear and the latent variational distributions are Gaussian. The goal of this section is to develop intuition for the inferred IGM in a simpler setting. While our results may not generalize to non-linear generators parameterized by deep neural networks, our results aim to provide an initial understanding on the inferred generator.

More concretely, suppose we are given NN measurement examples of noisy linear measurements of the form

y(i):=Ax(i)+η(i),η(i)𝒩(0,σ2I)\displaystyle y^{(i)}:=Ax^{(i)}+\eta^{(i)},\ \eta^{(i)}\sim\mathcal{N}(0,\sigma^{2}I)

where Am×nA\in\mathbb{R}^{m\times n} with mnm\leqslant n and x(i)nx^{(i)}\in\mathbb{R}^{n}. We aim to infer Gθn×kG_{\theta}\in\mathbb{R}^{n\times k} with k<nk<n and latent distributions qϕ(i)=𝒩(μi,UiUiT)q_{\phi^{(i)}}=\mathcal{N}(\mu_{i},U_{i}U_{i}^{T}) where ϕ(i)={μi,Ui}\phi^{(i)}=\{\mu_{i},U_{i}\}, μik\mu_{i}\in\mathbb{R}^{k}, and Uik×kU_{i}\in\mathbb{R}^{k\times k} by minimizing the negative ELBOProxy:

(Gθ,{ϕ(i)})\displaystyle\mathcal{L}(G_{\theta},\{\phi^{(i)}\}) :=1Ni=1N𝔼zqϕ(i)(z)[logp(y(i)|Gθ(z))\displaystyle:=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{z\sim q_{\phi^{(i)}(z)}}[-\log p(y^{(i)}|G_{\theta}(z))
logpZ(z|Gθ)+logqϕ(i)(z)].\displaystyle-\log p_{Z}(z|G_{\theta})+\log q_{\phi^{(i)}}(z)]. (6)

We characterize the generator GG_{*} and latent parameters ϕ(i)={μi,Ui}\phi_{*}^{(i)}=\{\mu_{i}^{*},U_{i}^{*}\} that are stationary points of (6). The result is proven in Section X:

Theorem 2.

Fix σ>0\sigma>0 and let y(i)=Ax(i)+η(i)ny^{(i)}=Ax^{(i)}+\eta^{(i)}\in\mathbb{R}^{n} for i[N]i\in[N] where η(i)𝒩(0,σ2I)\eta^{(i)}\sim\mathcal{N}(0,\sigma^{2}I) and Am×nA\in\mathbb{R}^{m\times n} has krank(A)=mnk\leqslant\mathrm{rank}(A)=m\leqslant n. Define the sample covariance of the measurements YN:=1Ni=1Ny(i)(y(i))TY_{N}:=\frac{1}{N}\sum_{i=1}^{N}y^{(i)}(y^{(i)})^{T}. Then, with probability 11, GG_{*}, μi\mu_{i}^{*}, and UiU_{i}^{*} that satisfy

AG\displaystyle AG_{*} :=Ek(Lkσ2I)1/2R,\displaystyle:=E_{k}(L_{k}-\sigma^{2}I)^{1/2}R,
μi\displaystyle\mu_{i}^{*} :=(GTATAG+σ2I)1GTATy(i),and\displaystyle:=(G_{*}^{T}A^{T}AG_{*}+\sigma^{2}I)^{-1}G_{*}^{T}A^{T}y^{(i)},\ \text{and}
Ui\displaystyle U_{i}^{*} :=σ(GTATAG+σ2I)1/2R~,\displaystyle:=\sigma(G_{*}^{T}A^{T}AG_{*}+\sigma^{2}I)^{-1/2}\tilde{R},

for i[N]i\in[N] are stationary points of the objective (6) over n×k×(k×GL(k,))N\mathbb{R}^{n\times k}\times(\mathbb{R}^{k}\times\mathrm{GL}(k,\mathbb{R}))^{N} where GL(k,)\mathrm{GL}(k,\mathbb{R}) denotes the set of real invertible k×kk\times k matrices. Here, Ekm×kE_{k}\in\mathbb{R}^{m\times k} denotes the matrix whose columns correspond to the top-min{rank(YN),k}\min\{\mathrm{rank}(Y_{N}),k\} eigenvectors of YNY_{N}, Lkk×kL_{k}\in\mathbb{R}^{k\times k} contains the corresponding eigenvalues of YNY_{N}, and R,R~k×kR,\tilde{R}\in\mathbb{R}^{k\times k} are arbitrary orthogonal matrices. If rank(YN)<k\mathrm{rank}(Y_{N})<k, then for rank(YN)<ik\mathrm{rank}(Y_{N})<i\leqslant k, the ii-th column of EkE_{k} can be arbitrary and Lk,ii=σ2L_{k,ii}=\sigma^{2}.

The Theorem establishes the precise form of stationary points of the objective (6). In particular, it shows that this inferred IGM performs dimensionality reduction akin to PCA [53] on the collection of measurement examples. To gain further intuition about the Theorem, in Section X-B of the supplemental materials, we analyze this result in a context where the underlying images we wish to reconstruct explicitly lie in a low-dimensional subspace. We show that, in that setting, our estimator returns an approximation of the solution found via MAP estimation, which can only be computed with complete prior knowledge of the underlying image structure. While the Theorem focused on linear IGMs, it would be interesting to theoretically analyze non-linear IGMs parameterized by deep networks. We leave this for future work.

VI Conclusion

In this work we showcased how one can solve a set of inverse problems without a pre-defined IGM (e.g., a traditional spatial image prior) by leveraging common structure present across a collection of diverse underlying images. We demonstrated that even with a small set of corrupted measurements, one can jointly solve these inverse problems by directly inferring an IGM that maximizes a proxy of the ELBO. We demonstrate our method on a number of convex and non-convex imaging problems, including the challenging problem of black hole video reconstruction from interferometric measurements. Overall, our work showcases the possibility of solving inverse problems in a completely unsupervised fashion, free from significant human bias typical of ill-posed image reconstruction. We believe our approach can aid in automatic discovery of novel structure from scientific measurements, potentially paving the way to new avenues of exploration.

References

  • [1] Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
  • [2] Marcus Alley, Joseph Y. Cheng, William Dally, Enhao Gong, Song Hann, Morteza Mardani, John M. Pauly, Neil Thakur, Shreyas Vasanawala, Lei Xing, and Greg Zaharchuk. Deep generative adversarial networks for compressed sensing (gancs) automates mri. arXiv preprint arXiv:1706.00051, 2017.
  • [3] Joshua Batson and Loic Royer. Noise2self: Blind denoising by self-supervision. In International Conference on Machine Learning, pages 524–533. PMLR, 2019.
  • [4] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros Dimakis. Compressed sensing using generative models. International Conference on Machine Learning (ICML), 2017.
  • [5] Ashish Bora, Eric Price, and Alexandros G Dimakis. Ambientgan: Generative models from lossy measurements. In International conference on learning representations, 2018.
  • [6] Harold C. Burger, Christian J. Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with bm3d? IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [7] Emmanuel J. Candès and Carlos Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2013.
  • [8] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
  • [9] Zezhou Cheng, Matheus Gadelha, Subhransu Maji, and Daniel Sheldon. A bayesian perspective on the deep image prior. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5438––5446, 2019.
  • [10] Badr-Eddine Chérief-Abdellatif. Consistency of elbo maximization for model selection. Proceedings of The 1st Symposium on Advances in Approximate Bayesian Inference, 96:1–21, 2018.
  • [11] Badr-Eddine Chérief-Abdellatif and Pierre Alquier. Consistency of variational bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995–3035, 2018.
  • [12] The Event Horizon Telescope Collaboration. First m87 event horizon telescope results. i. the shadow of the supermassive black hole. The Astrophysical Journal Letters, 875(1):L1, Apr 2019.
  • [13] The Event Horizon Telescope Collaboration. First m87 event horizon telescope results. iv. imaging the central supermassive black hole. The Astrophysical Journal Letters, 875(1):L4, 2019.
  • [14] The Event Horizon Telescope Collaboration. First m87 event horizon telescope results. v. physical origin of the asymmetric ring. The Astrophysical Journal Letters, 875(1):L5, 2019.
  • [15] The Event Horizon Telescope Collaboration. First sagittarius a* event horizon telescope results. i. the shadow of the supermassive black hole in the center of the milky way. The Astrophysical Journal Letters, 930(2):L12, May 2022.
  • [16] The Event Horizon Telescope Collaboration. First sagittarius a* event horizon telescope results. iii. imaging of the galactic center supermassive black hole. The Astrophysical Journal Letters, 930(2):L14, 2022.
  • [17] Mohammad Zalbagi Darestani and Reinhard Heckel. Accelerated mri with un-trained neural networks. IEEE Transactions of Computational Imaging, 7:724–733, 2021.
  • [18] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. 2016.
  • [19] David Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 2006.
  • [20] Laura Dyer, Andrew Parker, Keanu Paphiti, and Jeremy Sanderson. Lightsheet microscopy. Current Protocols, 2(7):e448, 2022.
  • [21] Albert Fannjiang and Thomas Strohmer. The numerics of phase retrieval. Acta Numerica, 29:125 – 228, 2020.
  • [22] Angela F. Gao, Oscar Leong, He Sun, and Katie L. Bouman. Image reconstruction without explicit priors. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
  • [23] Paul Hand, Oscar Leong, and Vladislav Voroninski. Phase retrieval under a generative prior. Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [24] Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. International Conference on Learning Representations (ICLR), 2019.
  • [25] Reinhard Heckel, Wen Huang, Paul Hand, and Vladislav Voroninski. Rate-optimal denoising with deep neural networks. Information and Inference: A Journal of the IMA, 2020.
  • [26] Leonid I.Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259–268, 1992.
  • [27] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. Sparse phase retrieval: Convex algorithms and limitations. 2013 IEEE International Symposium on Information Theory Proceedings (ISIT), pages 1022–1026, 2013.
  • [28] Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [29] Konik Kothari, AmirEhsan Khorashadizadeh, Maarten de Hoop, and Ivan Dokmanić. Trumpets: Injective flows for inference and inverse problems. In Uncertainty in Artificial Intelligence, pages 1269–1278. PMLR, 2021.
  • [30] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void - learning denoising from single noisy images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2019.
  • [31] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classifiers for face verification. In 2009 IEEE 12th international conference on computer vision, pages 365–372. IEEE, 2009.
  • [32] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278––2324, 1998.
  • [33] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. Proceedings of the 35th International Conference on Machine Learning, 80:2965–2974, 2018.
  • [34] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding and evaluating blind deconvolution algorithms. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1964–1971, 2009.
  • [35] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv preprint, arXiv:1606.04933, 2016.
  • [36] Jiaming Liu, Yu Sun, Cihat Eldeniz, Weijie Gan, Hongyu An, and Ulugbek S Kamilov. Rare: Image reconstruction using deep priors learned without groundtruth. IEEE Journal of Selected Topics in Signal Processing, 14(6):1088–1099, 2020.
  • [37] Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
  • [38] Ray Mendoza, Minh Nguyen, Judith Weng Zhu, Vincent Dumont, Talita Perciano, Juliane Mueller, and Vidya Ganapati. A self-supervised approach to reconstruction in sparse x-ray computed tomography. In NeurIPS Machine Learning and the Physical Sciences Workshop, 2022.
  • [39] Sachit Menon, Alex Damian, McCourt Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [40] Jeremy Miles. R squared, adjusted r squared. Encyclopedia of Statistics in Behavioral Science, 2005.
  • [41] Andrew Olsen, Yolanda Hu, and Vidya Ganapati. Data-driven computational imaging for scientific discovery. In NeurIPS AI for Science Workshop, 2022.
  • [42] Oliver Porth, Koushik Chatterjee, Ramesh Narayan, Charles F Gammie, Yosuke Mizuno, Peter Anninos, John G Baker, Matteo Bugli, Chi-kwan Chan, Jordy Davelaar, et al. The event horizon general relativistic magnetohydrodynamic code comparison project. The Astrophysical Journal Supplement Series, 243(2):26, 2019.
  • [43] Alexander W Raymond, Daniel Palumbo, Scott N Paine, Lindy Blackburn, Rodrigo Córdova Rosado, Sheperd S Doeleman, Joseph R Farah, Michael D Johnson, Freek Roelofs, Remo PJ Tilanus, et al. Evaluation of new submillimeter vlbi sites for the event horizon telescope. The Astrophysical Journal Supplement Series, 253(1):5, 2021.
  • [44] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • [45] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 9351:234–241, 2015.
  • [46] Gideon E. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
  • [47] Fahad Shamshad and Ali Ahmed. Compressed sensing-based robust phase retrieval via deep generative priors. IEEE Sensors Journal, 21(2):2286 – 2298, 2017.
  • [48] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. International Conference on Learning Representations (ICLR), 2022.
  • [49] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [50] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statistical society: Series B (Methodological), 36(2):111–133, 1974.
  • [51] He Sun, Katherine L. Bouman, Paul Tiede, Jason J. Wang, Sarah Blunt, and Dimitri Mawet. alpha-deep probabilistic inference (alpha-dpi): efficient uncertainty quantification from exoplanet astrometry to black hole feature extraction. The Astrophysical Journal, 932(2):99, 2022.
  • [52] Chenyang Tao, Liqun Chen, Ruiyi Zhang, Ricardo Henao, and Lawrence Carin. Variational inference and model selection with generalized evidence bounds. International Conference on Machine Learning (ICML), 80:893–902, 2018.
  • [53] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analys. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611––622, 1999.
  • [54] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
  • [55] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. International Journal of Computer Vision, 128:1867–1888, 2020.
  • [56] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pages 945–948. IEEE, 2013.
  • [57] George N. Wong, Ben S. Prather, Vedant Dhruv, Benjamin R. Ryan, Monika Mościbrodzka, Chi-kwan Chan, Abhishek V. Joshi, Ricardo Yarza, Angelo Ricarte, Hotaka Shiokawa, Joshua C. Dolence, Scott C. Noble, Jonathan C. McKinney, and Charles F. Gammie. PATOKA: Simulating Electromagnetic Observables of Black Hole Accretion. The Astrophysical Journal Supplement Series, 259(2):64, April 2022.

VII Experimental details

Model selection

We parameterize the latent variational distributions qϕ(z)q_{\phi}(z) with Normalizing Flows lϕl_{\phi}. In particular, suppose our latent variables z=lϕ(s)z=l_{\phi}(s) where s𝒩(0,I)s\sim\mathcal{N}(0,I). Then,

logqϕ(z)=logπ(s)log|detdlϕ(s)ds|\displaystyle\log q_{\phi}(z)=\log\pi(s)-\log\left|\det\frac{dl_{\phi}(s)}{ds}\right|

where π()\pi(\cdot) denotes the likelihood of the base distribution of ss, which in this case is a Gaussian. For our prior on zz given GθG_{\theta}, we use an isotropic Gaussian prior. Then the ELBOProxy\operatorname*{ELBOProxy} equation becomes

ELBOProxy(Gθ,qϕ;y)\displaystyle\operatorname*{ELBOProxy}(G_{\theta},q_{\phi};y) :=𝔼zlϕ(z)[logp(y|G(z))\displaystyle:=\mathbb{E}_{z\sim l_{\phi}(z)}[\log p(y|G(z))
+logp(z|G)logqϕ(z)]\displaystyle+\log p(z|G)-\log q_{\phi}(z)]
=𝔼s𝒩(0,I)[logp(y|Gθ(lϕ(s)))\displaystyle=\mathbb{E}_{s\sim\mathcal{N}(0,I)}\Big{[}\log p(y|G_{\theta}(l_{\phi}(s)))
+12lϕ(s)22logπ(s)\displaystyle+\frac{1}{2}\|l_{\phi}(s)\|^{2}_{2}-\log\pi(s)
+log|detdlϕ(s)ds|].\displaystyle+\log\left|\det\frac{dl_{\phi}(s)}{ds}\right|\Big{]}.

Note that the expectation over logπ(s)\log\pi(s) is constant with respect to the parameters ϕ\phi, so we only optimize the remaining terms.

VIII Inference Details

Optimization details for model selection

Given a generator GcG_{c} for a class cc, we optimized the ELBOProxy\operatorname*{ELBOProxy} with respect to the parameters ϕ\phi of the Normalizing Flow. For denoising, we optimized for 50005000 epochs with Adam \citePhyskingma2014adamSupp and a learning rate of 10410^{-4} and set T=200T=200. For phase retrieval, we optimized for 75007500 epochs with Adam and a learning rate of 10410^{-4} and set T=1000T=1000. The entire experiment took approximately 10/1510/15 hours for denoising/phase retrieval, respectively, on a single NVIDIA V100 GPU. When optimizing the flow hϕh_{\phi} for a particular generator GcG_{c}, we record the parameters with the best ELBOProxy\operatorname*{ELBOProxy} value. This is the final value recorded in the figure. To generate the mean and standard deviation, we used 100100 samples.

Network details

The generator GcG_{c} corresponds to the decoder part of a VAE \citePhysKingmaWelling14Supp trained on the training set of class cc of MNIST. We used a convolutional architecture, where the encoder and decoder architecture contained 44 convolutional layers. There are also linear, fully-connected layers in between to learn the latent representation, which had dimension 16.16. The Normalizing Flows used the RealNVP \citePhysdinh2016densitySupp architecture. In both problems, the network had 1616 affine coupling layers with linear layers, Leaky ReLU activation functions, and batch normalization. After each affine-coupling block, the dimensions were randomly permuted.

VIII-A Inferring the IGM

Variational distribution

We parameterize the posterior using Gaussian distributions due to memory constraints; the memory of the whole setup with Normalizing Flows using 75 images is around 5 times larger than the setup using a Gaussian variational family. However, the Gaussian variational family is likely not as good of a variational distribution, so there will be a wider margin between p(y|G)p(y|G) and the ELBO. Additionally, the Gaussian variational family will not be as good at capturing multimodal distributions. Note that we do not notice overfitting so there is no need for early stopping.

IGM network details

We use a Deep Decoder \citePhysHH2018Supp generator architecture as the IGM with a dropout of 10410^{-4}. The Deep Decoder has 6 layers with 150 channels each, and has a latent size of 40. There is a sigmoid activation as the final layer of the generator.

Optimization details

We train for 99,000 epochs using the Adam \citePhyskingma2014adamSupp optimizer with a learning rate of 10410^{-4}, which takes around 2 hours for 5 images, 17 hours for 75 images, and 33 hours for 150 images on a single NVIDIA RTX A6000 GPU. The memory usage is around 11000MiB for 5 images, 24000MiB for 75 images, and 39000MiB for 150 images. We use a batch size of 20 samples per measurement ii, and we perform batch gradient descent based on all measurements i{1,,N}i\in\{1,\dots,N\}.

IX Event Horizon Telescope Array

IX-A Array used for the M87 results

NAME X Y Z
PDB 4523998 468045 4460310
PV 5088968 -301682 3825016
SMT -1828796 -5054407 3427865
SMA -5464523 -2493147 2150612
LMT -768714 -5988542 2063276
ALMA 2225061 -5440057 -2481681
APEX 2225040 -5441198 -2479303
JCMT -5464585 -2493001 2150654
CARMA -2397431 -4482019 3843524
KP -1995679 -5037318 3357328
GLT 1500692 -1191735 6066409

IX-B Array used for the Sgr A results

NAME X Y Z
ALMA 2225061 -5440057 -2481681
APEX 2225040 -5441198 -2479303
GLT 1500692 -1191735 6066409
JCMT -5464585 -2493001 2150654
KP -1995679 -5037318 3357328
LMT -768714 -5988542 2063276
SMA -5464523 -2493147 2150612
SMT -1828796 -5054407 3427865
BAJA -2352576 -4940331 3271508
BAR -2363000 -4445000 3907000
CAT 1569000 -4559000 -4163000
CNI 5311000 -1725000 3075000
GAM 5648000 1619000 -2477000
GARS 1538000 -2462000 -5659000
HAY 1521000 -4417000 4327000
NZ -4540000 719000 -4409000
OVRO -2409598 -4478348 3838607
SGO 1832000 -5034000 -3455000
CAS 1373000 -3399000 -5201000
LLA 2325338 -5341460 -2599661
PIKE -1285000 -4797000 3994000
PV 5088968 -301682 3825016

X Supplemental Theoretical Results

We present proofs of some of the theoretical results that appeared in the main body of the paper, along with new results that were not discussed. Throughout the proofs, we let 2\|\cdot\|_{2}, \|\cdot\|, and F\|\cdot\|_{F} denote the Euclidean norm for vectors, the spectral norm for matrices, and the Frobenius norm for matrices, respectively. For a matrix Mm×nM\in\mathbb{R}^{m\times n}, let σi(M)\sigma_{i}(M) and λi(M)\lambda_{i}(M) denote the ii-th singular value and eigenvalue, respectively (ordered in decreasing order). Let InI_{n} denote the n×nn\times n identity matrix. We may use II for simplicity when the size of the matrix is clear from context. To avoid cumbersome notation, we sometimes use the shorthand notation 𝔼z[]\mathbb{E}_{z}[\cdot] for 𝔼z𝒟[]\mathbb{E}_{z\sim\mathcal{D}}[\cdot].

X-A Proof of Theorem 2

We first focus on proving Theorem 2. To show this, we first establish that the ELBOProxy\operatorname*{ELBOProxy} loss can be evaluated in closed-form:

Proposition 2.

Fix σ>0\sigma>0 and suppose y(i):=Ax(i)+η(i),η(i)𝒩(0,σ2I)y^{(i)}:=Ax^{(i)}+\eta^{(i)},\ \eta^{(i)}\sim\mathcal{N}(0,\sigma^{2}I) for i[N]i\in[N] where Am×nA\in\mathbb{R}^{m\times n}. Then the loss :n×k×(k×k×k)N\mathcal{L}:\mathbb{R}^{n\times k}\times(\mathbb{R}^{k}\times\mathbb{R}^{k\times k})^{N}\rightarrow\mathbb{R} in (6) can be written as

1N\displaystyle\frac{1}{N} i=1N[12σ2(y(i)22+AGθUiF2+AGθμi22)\displaystyle\sum_{i=1}^{N}\Big{[}\frac{1}{2\sigma^{2}}\left(\|y^{(i)}\|_{2}^{2}+\|AG_{\theta}U_{i}\|_{F}^{2}+\|AG_{\theta}\mu_{i}\|_{2}^{2}\right)
1σ2y(i),AGθμi+12(UiF2+μi22)\displaystyle-\frac{1}{\sigma^{2}}\langle y^{(i)},AG_{\theta}\mu_{i}\rangle+\frac{1}{2}(\|U_{i}\|_{F}^{2}+\|\mu_{i}\|_{2}^{2})
logdet(Ui)]k2k2log(2π).\displaystyle-\log\det(U_{i})\Big{]}-\frac{k}{2}-\frac{k}{2}\log(2\pi).
Proof of Proposition 2.

Since the loss is separable in NN and expectation is linear, we can consider computing the loss for a single i[N]i\in[N]. Dropping the indices for notational simplicity, consider a fixed y=Ax+ηmy=Ax+\eta\in\mathbb{R}^{m} where η𝒩(0,σ2Im)\eta\sim\mathcal{N}(0,\sigma^{2}I_{m}). We aim to calculate

𝔼v[logp(y|Gθ(v))logp(v|Gθ)+logqϕ(v)]\displaystyle\mathbb{E}_{v}\left[-\log p(y|G_{\theta}(v))-\log p(v|G_{\theta})+\log q_{\phi}(v)\right]

where p(y|Gθ(v))p(y|G_{\theta}(v)) is Gaussian with variance σ2\sigma^{2}, p(v|Gθ)p(v|G_{\theta}) is isotropic Gaussian, and vqϕ=𝒩(μ,UUT)v\sim q_{\phi}=\mathcal{N}(\mu,UU^{T}). Note that v𝒩(μ,UUT)v\sim\mathcal{N}(\mu,UU^{T}) is equivalent to v=Uz+μv=Uz+\mu for z𝒩(0,Ik)z\sim\mathcal{N}(0,I_{k}). We can calculate the individual terms in the loss as follows. We first collect some useful results, whose proofs are deferred until after this proof is complete:

Lemma 1.

For any matrix Ud×kU\in\mathbb{R}^{d\times k}, we have 𝔼zUz22=UF2\mathbb{E}_{z}\|Uz\|_{2}^{2}=\|U\|_{F}^{2} where z𝒩(0,Ik)z\sim\mathcal{N}(0,I_{k}). Moreover, if UUTUU^{T} is invertible, then 𝔼z[zTUT(UUT)1Uz]=d\mathbb{E}_{z}[z^{T}U^{T}(UU^{T})^{-1}Uz]=d.

Now, for the first term, since the noise is Gaussian with variance σ2\sigma^{2}, it is equal to

𝔼z12σ2y\displaystyle\mathbb{E}_{z}\frac{1}{2\sigma^{2}}\|y AGθ(Uz+μ)22=12σ2y22\displaystyle-AG_{\theta}(Uz+\mu)\|_{2}^{2}=\frac{1}{2\sigma^{2}}\|y\|_{2}^{2}
+𝔼z12σ2AGθ(Uz+μ)22\displaystyle+\mathbb{E}_{z}\frac{1}{2\sigma^{2}}\|AG_{\theta}(Uz+\mu)\|_{2}^{2}
1σ2𝔼zy,AGθ(Uz+μ)\displaystyle-\frac{1}{\sigma^{2}}\mathbb{E}_{z}\langle y,AG_{\theta}(Uz+\mu)\rangle
=12σ2(y22+AGθUF2+AGθμ22)\displaystyle=\frac{1}{2\sigma^{2}}\left(\|y\|_{2}^{2}+\|AG_{\theta}U\|_{F}^{2}+\|AG_{\theta}\mu\|_{2}^{2}\right)
1σ2y,AGθμ\displaystyle-\frac{1}{\sigma^{2}}\langle y,AG_{\theta}\mu\rangle

where we used the first part of Lemma 1 in the third equality. For the prior term, we can calculate using Lemma 1 again

𝔼v[logp(v|Gθ)]\displaystyle\mathbb{E}_{v}\left[-\log p(v|G_{\theta})\right] =𝔼z12Uz+μ22\displaystyle=\mathbb{E}_{z}\frac{1}{2}\|Uz+\mu\|_{2}^{2}
=12(UF2+μ22).\displaystyle=\frac{1}{2}(\|U\|_{F}^{2}+\|\mu\|_{2}^{2}).

Finally, we can calculate the entropy term as

𝔼v\displaystyle\mathbb{E}_{v} [logqϕ(v)]=12𝔼v[(vμ)T(UUT)1(vμ)]\displaystyle\left[\log q_{\phi}(v)\right]=-\frac{1}{2}\mathbb{E}_{v}\left[(v-\mu)^{T}(UU^{T})^{-1}(v-\mu)\right]
k2log(2π)12log|det(UUT)|\displaystyle-\frac{k}{2}\log(2\pi)-\frac{1}{2}\log|\det(UU^{T})|
=12𝔼z[(Uz+μμ)T(UUT)1(Uz+μμ)]\displaystyle=-\frac{1}{2}\mathbb{E}_{z}\left[(Uz+\mu-\mu)^{T}(UU^{T})^{-1}(Uz+\mu-\mu)\right]
k2log(2π)12log|det(UUT)|\displaystyle-\frac{k}{2}\log(2\pi)-\frac{1}{2}\log|\det(UU^{T})|
=12𝔼z[zTUT(UUT)1Uz]\displaystyle=-\frac{1}{2}\mathbb{E}_{z}\left[z^{T}U^{T}(UU^{T})^{-1}Uz\right]
k2log(2π)12log|det(UUT)|\displaystyle-\frac{k}{2}\log(2\pi)-\frac{1}{2}\log|\det(UU^{T})|
=k2k2log(2π)logdet(U)\displaystyle=-\frac{k}{2}-\frac{k}{2}\log(2\pi)-\log\det(U)

where we used the second half of Lemma 1 in the last equality and log|det(UUT)|=2logdet(U)\log|\det(UU^{T})|=2\log\det(U) in the last equality. Adding the terms achieves the desired result. ∎

Proof of Lemma 1.

For the first claim, let U=ESVTU=ESV^{T} denote the singular value decomposition of UU with EE and VV square and unitary. Then by the rotational invariance of the Gaussian distribution and the 2\ell_{2}-norm, we have

𝔼zUz22=𝔼zESVTz22\displaystyle\mathbb{E}_{z}\|Uz\|_{2}^{2}=\mathbb{E}_{z}\|ESV^{T}z\|_{2}^{2} =𝔼zSz22\displaystyle=\mathbb{E}_{z}\|Sz\|_{2}^{2}
=i=1rank(U)σi(U)2𝔼zi[zi2]\displaystyle=\sum_{i=1}^{\mathrm{rank}(U)}\sigma_{i}(U)^{2}\mathbb{E}_{z_{i}}[z_{i}^{2}]
=i=1rank(U)σi(U)2=UF2.\displaystyle=\sum_{i=1}^{\mathrm{rank}(U)}\sigma_{i}(U)^{2}=\|U\|_{F}^{2}.

For the second claim, recall that zTUT(UUT)1Uz=UT(UUT)1U,zzTz^{T}U^{T}(UU^{T})^{-1}Uz=\langle U^{T}(UU^{T})^{-1}U,zz^{T}\rangle where A,B=Tr(ATB)\langle A,B\rangle=\mathrm{Tr}(A^{T}B) is the Hilbert-Schmidt inner product for matrices. Using linearity of the expectation and the cyclic property of the trace operator, we get

𝔼z[zTUT(UUT)1Uz]\displaystyle\mathbb{E}_{z}[z^{T}U^{T}(UU^{T})^{-1}Uz] =𝔼zUT(UUT)1U,zzT\displaystyle=\mathbb{E}_{z}\langle U^{T}(UU^{T})^{-1}U,zz^{T}\rangle
=UT(UUT)1U,𝔼z[zzT]\displaystyle=\langle U^{T}(UU^{T})^{-1}U,\mathbb{E}_{z}[zz^{T}]\rangle
=UT(UUT)1U,In\displaystyle=\langle U^{T}(UU^{T})^{-1}U,I_{n}\rangle
=Tr(UT(UUT)1U)\displaystyle=\mathrm{Tr}(U^{T}(UU^{T})^{-1}U)
=Tr(UUT(UUT)1)\displaystyle=\mathrm{Tr}(UU^{T}(UU^{T})^{-1})
=Tr(Id)=d.\displaystyle=\mathrm{Tr}(I_{d})=d.

We now prove Theorem 2. Our proof essentially computes the stationary conditions of the loss \mathcal{L} and computes the first-order critical points directly.

Proof of Theorem 2.

Using standard matrix calculus, the gradient of the objective \mathcal{L} with respect to each parameter is given by

Gθ\displaystyle\nabla_{G_{\theta}}\mathcal{L} =1σ2[1Ni=1NATAGθ(UiUiT+μiμiT)\displaystyle=\frac{1}{\sigma^{2}}\Big{[}\frac{1}{N}\sum_{i=1}^{N}A^{T}AG_{\theta}(U_{i}U_{i}^{T}+\mu_{i}\mu_{i}^{T})
1Ni=1NATy(i)μiT],\displaystyle-\frac{1}{N}\sum_{i=1}^{N}A^{T}y^{(i)}\mu_{i}^{T}\Big{]},
Ui\displaystyle\nabla_{U_{i}}\mathcal{L} =1σ2GθTATAGθUi+Ui(UiT)1,and\displaystyle=\frac{1}{\sigma^{2}}G_{\theta}^{T}A^{T}AG_{\theta}U_{i}+U_{i}-(U_{i}^{T})^{-1},\ \text{and}
μi\displaystyle\nabla_{\mu_{i}}\mathcal{L} =1σ2(GθTATAGθμiGθTATy(i))+μi.\displaystyle=\frac{1}{\sigma^{2}}(G_{\theta}^{T}A^{T}AG_{\theta}\mu_{i}-G_{\theta}^{T}A^{T}y^{(i)})+\mu_{i}.

For a fixed GθG_{\theta}, we can directly compute which μi\mu_{i} and UiU_{i} satisfy the stationary equations μi=0\nabla_{\mu_{i}}\mathcal{L}=0 and Ui=0\nabla_{U_{i}}\mathcal{L}=0:

μi\displaystyle\mu_{i} =(GθTATAGθ+σ2I)1GθTATy(i)and\displaystyle=(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}y^{(i)}\text{and}
Ui\displaystyle U_{i} =σ(GθTATAGθ+σ2I)1/2R~\displaystyle=\sigma(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1/2}\tilde{R}

where R~\tilde{R} is an arbitrary orthogonal matrix.

We now calculate, given such μi\mu_{i} and UiU_{i}, which GθG_{\theta} satisfies the condition Gθ=0\nabla_{G_{\theta}}\mathcal{L}=0. For notational simplicity, let Σσ:=GθTATAGθ+σ2I\Sigma_{\sigma}:=G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I. Plugging in μi\mu_{i} and UiU_{i} into the stationary condition Gθ=0\nabla_{G_{\theta}}\mathcal{L}=0, this equation reads

ATAGθ(σ2Ik+\displaystyle A^{T}AG_{\theta}(\sigma^{2}I_{k}+ Σσ1GθTATYNAGθ)Σσ1\displaystyle\Sigma_{\sigma}^{-1}G_{\theta}^{T}A^{T}Y_{N}AG_{\theta})\Sigma_{\sigma}^{-1}
=ATYNAGθΣσ1.\displaystyle=A^{T}Y_{N}AG_{\theta}\Sigma_{\sigma}^{-1}.

Since AA has rank mm, AATAA^{T} is invertible, so we can multiply both sides of the equation on the left by (AAT)1A(AA^{T})^{-1}A and the right by Σσ\Sigma_{\sigma} to obtain

(ImAGθΣσ1GθTAT)YNAGθ=σ2AGθ.\displaystyle(I_{m}-AG_{\theta}\Sigma_{\sigma}^{-1}G_{\theta}^{T}A^{T})Y_{N}AG_{\theta}=\sigma^{2}AG_{\theta}. (7)

Now, consider the SVD of AGθ=ESVTAG_{\theta}=ESV^{T} where Em×mE\in\mathbb{R}^{m\times m} and Vk×kV\in\mathbb{R}^{k\times k} are orthogonal and Sm×kS\in\mathbb{R}^{m\times k} is rectangular with non-zero diagonal submatrix containing the singular values of AGθAG_{\theta} denoted by S~k×k\tilde{S}\in\mathbb{R}^{k\times k}. First, observe that

AGθΣσ1GθTAT\displaystyle AG_{\theta}\Sigma_{\sigma}^{-1}G_{\theta}^{T}A^{T} =ESVT(VSTSVT+σ2Ik)1VSTET\displaystyle=ESV^{T}(VS^{T}SV^{T}+\sigma^{2}I_{k})^{-1}VS^{T}E^{T}
=ESVT(VS~2VT+σ2Ik)1VSTET\displaystyle=ESV^{T}(V\tilde{S}^{2}V^{T}+\sigma^{2}I_{k})^{-1}VS^{T}E^{T}
=ESVTV(S~2+σ2Ik)1VTVSTET\displaystyle=ESV^{T}V(\tilde{S}^{2}+\sigma^{2}I_{k})^{-1}V^{T}VS^{T}E^{T}
=ES(S~2+σ2Ik)1STET\displaystyle=ES(\tilde{S}^{2}+\sigma^{2}I_{k})^{-1}S^{T}E^{T}
=E[S~2(S~2+σ2Ik)1000]ET.\displaystyle=E\left[\begin{array}[]{cc}\tilde{S}^{2}(\tilde{S}^{2}+\sigma^{2}I_{k})^{-1}&0\\ 0&0\end{array}\right]E^{T}.

This implies that ImAGθΣσ1GθTAT=ES^σETI_{m}-AG_{\theta}\Sigma_{\sigma}^{-1}G_{\theta}^{T}A^{T}=E\hat{S}_{\sigma}E^{T} where

S^σ:=[IkS~2(S~2+σ2Ik)100Imk].\displaystyle\hat{S}_{\sigma}:=\left[\begin{array}[]{cc}I_{k}-\tilde{S}^{2}(\tilde{S}^{2}+\sigma^{2}I_{k})^{-1}&0\\ 0&I_{m-k}\end{array}\right].

Note that for σ>0\sigma>0, S^σ\hat{S}_{\sigma} is invertible. Hence, our stationary equation (7) reads

ES^σETYNESVT=σ2ESVT.\displaystyle E\hat{S}_{\sigma}E^{T}Y_{N}ESV^{T}=\sigma^{2}ESV^{T}.

Using the orthogonality of EE and VV and invertibility of S^σ\hat{S}_{\sigma}, we can further simplify the above equation to

YNES=σ2ES^σ1S.\displaystyle Y_{N}ES=\sigma^{2}E\hat{S}_{\sigma}^{-1}S.

Now, suppose Sii=σi(AGθ)=σi0S_{ii}=\sigma_{i}(AG_{\theta})=\sigma_{i}\neq 0. Then we have that the ii-th column of EE, denoted eie_{i}, satisfies

YNei=σ2(1σi2σi2+σ2)1ei.\displaystyle Y_{N}e_{i}=\sigma^{2}\left(1-\frac{\sigma_{i}^{2}}{\sigma_{i}^{2}+\sigma^{2}}\right)^{-1}e_{i}.

That is, eie_{i} is an eigenvector of YNY_{N} with eigenvalue λi=σ2(1σi2/(σi2+σ2))1.\lambda_{i}=\sigma^{2}(1-\sigma_{i}^{2}/(\sigma_{i}^{2}+\sigma^{2}))^{-1}. Observe that this equation is satisfied if and only if σi=±λiσ2\sigma_{i}=\pm\sqrt{\lambda_{i}-\sigma^{2}}. Since singular values are non-negative, we must have σi=λiσ2.\sigma_{i}=\sqrt{\lambda_{i}-\sigma^{2}}. Note that if Sii=0S_{ii}=0, then eie_{i} can be arbitrary. Thus, we obtain that in order for GθG_{\theta} to satisfy the stationary condition, it must satisfy

AGθ=Ek(Lkσ2I)1/2R\displaystyle AG_{\theta}=E_{k}(L_{k}-\sigma^{2}I)^{1/2}R

where EkE_{k} contains the top-min{rank(YN),k}\min\{\mathrm{rank}(Y_{N}),k\} eigenvectors of YNY_{N}, LkL_{k} is a diagonal matrix whose ii-th entry is λi\lambda_{i} with corresponding eigenvector eie_{i} and σ2\sigma^{2} otherwise, and RR is an arbitrary orthogonal matrix.

X-B Application: Denoising Data From a Subspace

In order to gain further intuition about Theorem 2, it will be useful to consider an example where the data explicitly lies in a low-dimensional subspace. We will then compare our inferred estimator with an estimator that has access to complete prior knowledge about the data.

Consider the denoising problem A=IA=I and suppose that each ground-truth image is drawn at random from a kk-dimensional subspace as follows: x(i)=Gz(i)x^{(i)}=Gz^{(i)} where Gn×kG\in\mathbb{R}^{n\times k} and z(i)𝒩(0,I)z^{(i)}\sim\mathcal{N}(0,I). If we had access to the ground-truth generating matrix GG and knew the underlying distribution in the latent space 𝒩(0,I)\mathcal{N}(0,I), then one could denoise the ii-th image y(i)y^{(i)} by returning the following MAP estimate x^(i)\hat{x}^{(i)} in the range of GG:

x^(i)\displaystyle\hat{x}^{(i)} :=Gz^(i),\displaystyle:=G\hat{z}^{(i)},
wherez^(i)\displaystyle\text{where}\ \hat{z}^{(i)} :=argminzk12σ2y(i)Gz22+12z22\displaystyle:=\operatorname*{arg\,min}_{z\in\mathbb{R}^{k}}\frac{1}{2\sigma^{2}}\|y^{(i)}-Gz\|_{2}^{2}+\frac{1}{2}\|z\|_{2}^{2}
=(GTG+σ2I)1GTy(i).\displaystyle=(G^{T}G+\sigma^{2}I)^{-1}G^{T}y^{(i)}.

Note that this is the mean of the posterior p(x|y(i),G)p(x|y^{(i)},G). Now, to compare solutions, observe that our (mean) estimator given by Theorem 2 is

Gμi=G(GTG+σ2I)1GTy(i).\displaystyle G_{*}\mu_{i}^{*}=G_{*}(G_{*}^{T}G_{*}+\sigma^{2}I)^{-1}G_{*}^{T}y^{(i)}.

Recall that GG_{*} contains the top eigenvectors and eigenvalues of YNY_{N}, the sample covariance of the measurements. In the case that our data x(i)=Gz(i)x^{(i)}=Gz^{(i)}, note that for sufficiently many examples NN, the sample covariance YNY_{N} is close to its mean 𝔼z,ηYN\mathbb{E}_{z,\eta}Y_{N}. That is, for large NN,

YN𝔼z,ηYN=GGT+σ2I.Y_{N}\approx\mathbb{E}_{z,\eta}Y_{N}=GG^{T}+\sigma^{2}I.

To see this more concretely, let G=ELVTG=ELV^{T} denote the SVD of GG. Then GGT+σ2I=EL^σETGG^{T}+\sigma^{2}I=E\hat{L}_{\sigma}E^{T} where L^σ\hat{L}_{\sigma} has σi(G)2+σ2\sigma_{i}(G)^{2}+\sigma^{2} in the first kk diagonal entries and σ2\sigma^{2} in the remaining nkn-k entries. So in the limit of increasing data, GG_{*} estimates the top-kk eigenvectors of 𝔼z,ηYN\mathbb{E}_{z,\eta}Y_{N}, which are precisely the top-kk principal components of GG. Moreover, the singular values of GG_{*} are λiσ2\sqrt{\lambda_{i}-\sigma^{2}} where λi=λi(YN)\lambda_{i}=\lambda_{i}(Y_{N}). As data increases, λi(YN)λi(𝔼z,ηYN)=σi(G)2+σ2\lambda_{i}(Y_{N})\approx\lambda_{i}(\mathbb{E}_{z,\eta}Y_{N})=\sigma_{i}(G)^{2}+\sigma^{2} so that σi(G)=λiσ2σi(G)\sigma_{i}(G_{*})=\sqrt{\lambda_{i}-\sigma^{2}}\approx\sigma_{i}(G). Thus the singular values of GG_{*} also approach those of GG.

A natural question to consider is how large the number of examples NN would need to be in order for such an approximation to hold. Ideally, this approximation would hold as long as the number of examples NN scales like the intrinsic dimensionality of the underlying images. For example, the following result shows that in the noiseless case, this approximation holds as long as the number of examples NN scales like kk, the dimension of the subspace. The proof follows from standard matrix concentration bounds, e.g., Theorem 4.6.1 in \citePhysVershyninHDPSupp.

Proposition 3.

Fix Gn×kG\in\mathbb{R}^{n\times k} and let y(i)=Gz(i)y^{(i)}=Gz^{(i)} where z(i)𝒩(0,I)z^{(i)}\sim\mathcal{N}(0,I) for i[N]i\in[N]. Then, for any ε>0\varepsilon>0, there exists positive absolute constants C1C_{1} and C2C_{2} such that the following holds: if NC1ε2kN\geqslant C_{1}\varepsilon^{-2}k, then with probability 12ek1-2e^{-k} the sample covariance YN:=1Ni=1Ny(i)(y(i))TY_{N}:=\frac{1}{N}\sum_{i=1}^{N}y^{(i)}(y^{(i)})^{T} obeys

YNGGTC2G2ε.\displaystyle\left\|Y_{N}-GG^{T}\right\|\leqslant C_{2}\|G\|^{2}\varepsilon.

Hence, in the case of data lying explicitly in a low-dimensional subspace, our estimator returns an approximation of the solution found via MAP estimation, which can only be computed with complete prior knowledge of the underlying image structure. In contrast, our estimator infers this subspace from measurement information only and its performance improves as the number of examples increases. The approximation quality of the estimator is intimately related to the number of examples and the intrinsic dimensionality of the images, which we show can be quantified in simple settings. While our main result focused on IGMs inferred with linear models, it is an interesting future direction to theoretically understand properties of non-linear IGMs given by deep networks and what can be said for more complicated forward models. We leave this for future work.

X-C Additional Theoretical Results

We can further analyze properties of the model found by minimizing \mathcal{L} from Theorem 2. In particular, we can characterize specific properties of the inferred mean μi\mu_{i}^{*} as a function of the amount of noise in our measurements and a fixed IGM Gθn×kG_{\theta}\in\mathbb{R}^{n\times k}. Specifically, given measurements y(i)y^{(i)} and an IGM GθG_{\theta}, recall that the inferred mean computed in Theorem 2 is given by

μi(σ):=σ2(σ2GθTATAGθ+I)1GθTATy(i)\displaystyle\mu_{i}^{*}(\sigma):=\sigma^{-2}(\sigma^{-2}\cdot G_{\theta}^{T}A^{T}AG_{\theta}+I)^{-1}G_{\theta}^{T}A^{T}y^{(i)}

We will give an explicit form for μi(σ)\mu_{i}^{*}(\sigma) as the amount of noise goes to 0 and show that, in the case our IGM is correct (Gθ=GG_{\theta}=G for some ground-truth GG), the mean recovers the ideal latent code with vanishing noise.

Proposition 4.

Suppose Am×nA\in\mathbb{R}^{m\times n} and Gn×kG\in\mathbb{R}^{n\times k} with kmnk\leqslant m\leqslant n. Let y(i):=AGzi+ηiy^{(i)}:=AGz_{i}+\eta_{i} where zikz_{i}\in\mathbb{R}^{k} and ηi𝒩(0,σ2I)\eta_{i}\sim\mathcal{N}(0,\sigma^{2}I) for i=1,,Ni=1,\dots,N. Then, with probability 11, for any Gθn×kG_{\theta}\in\mathbb{R}^{n\times k}, the estimator μi(σ)\mu_{i}^{*}(\sigma) for ziz_{i} defined above satisfies

limσ0+μi(σ)=(AGθ)AGzii[N].\displaystyle\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)=(AG_{\theta})^{\dagger}AGz_{i}\ \forall\ i\in[N].

In the special case that Gθ=GG_{\theta}=G and rank(AG)=k\mathrm{rank}(AG)=k, then

limσ0+μi(σ)=zii[N].\displaystyle\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)=z_{i}\ \forall\ i\in[N].

Moreover, if rank(AGθ)=rank(AG)=k\mathrm{rank}(AG_{\theta})=\mathrm{rank}(AG)=k, then we get that the following error bound holds for all i[N]i\in[N] with probability at least 1Neγm1-Ne^{-\gamma m} for some positive absolute constant γ\gamma:

μi(σ)zi2\displaystyle\|\mu_{i}^{*}(\sigma)-z_{i}\|_{2} AGzi2σk(AGθ)3σ2\displaystyle\leqslant\|AGz_{i}\|_{2}\sigma_{k}(AG_{\theta})^{-3}\cdot\sigma^{2}
+2σk(AGθ)1mσ\displaystyle+2\sigma_{k}(AG_{\theta})^{-1}\sqrt{m}\cdot\sigma
+2(AGθ)(AG)A(GθG).\displaystyle+\sqrt{2}\|(AG_{\theta})^{\dagger}\|\|(AG)^{\dagger}\|\cdot\|A(G_{\theta}-G)\|.
Proof.

For notational simplicity, set Σθ,σ:=σ2GθTATAGθ+I\Sigma_{\theta,\sigma}:=\sigma^{-2}\cdot G_{\theta}^{T}A^{T}AG_{\theta}+I. Then

μi(σ)\displaystyle\mu_{i}^{*}(\sigma) =σ2Σθ,σ1GθTATy(i)\displaystyle=\sigma^{-2}\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}y^{(i)}
=σ2Σθ,σ1GθTAT(AGzi+ηi)\displaystyle=\sigma^{-2}\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}(AGz_{i}+\eta_{i})
=σ2Σθ,σ1GθTATAGzi+σ2Σθ,σ1GθTATηi\displaystyle=\sigma^{-2}\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}AGz_{i}+\sigma^{-2}\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}\eta_{i}
=σ2Σθ,σ1GθTATAGzi+σ1Σθ,σ1GθTATη~i\displaystyle=\sigma^{-2}\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}AGz_{i}+\sigma^{-1}\cdot\Sigma_{\theta,\sigma}^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}

where we note that any ηi𝒩(0,σ2I)\eta_{i}\sim\mathcal{N}(0,\sigma^{2}I) can be written as ηi=ση~i\eta_{i}=\sigma\cdot\tilde{\eta}_{i} where η~i𝒩(0,I)\tilde{\eta}_{i}\sim\mathcal{N}(0,I). Observe that

Σθ,σ1\displaystyle\Sigma_{\theta,\sigma}^{-1} =(σ2GθTATAGθ+σ2σ2I)1\displaystyle=(\sigma^{-2}G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{-2}\cdot\sigma^{2}I)^{-1}
=σ2(GθTATAGθ+σ2I)1\displaystyle=\sigma^{2}(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}

since (cA)1=c1A1(cA)^{-1}=c^{-1}A^{-1} for c0c\neq 0. This simplifies μi(σ)\mu_{i}^{*}(\sigma) to

μi(σ)\displaystyle\mu_{i}^{*}(\sigma) =(GθTATAGθ+σ2I)1GθTATAGzi\displaystyle=(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}AGz_{i}
+σ(GθTATAGθ+σ2I)1GθTATη~i.\displaystyle+\sigma\cdot(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}.

To compute the limit of the expression above, we recall that the pseudoinverse of a matrix MM is a limit of the form

M\displaystyle M^{\dagger} :=limδ0+(MTM+δI)1MT\displaystyle:=\lim_{\delta\rightarrow 0^{+}}(M^{T}M+\delta I)^{-1}M^{T}
=limδ0+MT(MMT+δI)1.\displaystyle=\lim_{\delta\rightarrow 0^{+}}M^{T}(MM^{T}+\delta I)^{-1}.

Note that this quantity is defined even if the inverses of MMTMM^{T} or MTMM^{T}M exist. Applying this to M=AGθM=AG_{\theta}, we observe that

limσ0+(GθTATAGθ+σ2I)1GθTATAGzi\displaystyle\lim_{\sigma\rightarrow 0^{+}}(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}AGz_{i} =(AGθ)AGzi\displaystyle=(AG_{\theta})^{\dagger}AGz_{i}

and

limσ0+σ(GθTATAGθ+σ2I)1GθTATη~i\displaystyle\lim_{\sigma\rightarrow 0^{+}}\sigma(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}
=(limσ0+σ)(limσ0+(GθTATAGθ+σ2I)1GθTATη~i)\displaystyle=\left(\lim_{\sigma\rightarrow 0^{+}}\sigma\right)\left(\lim_{\sigma\rightarrow 0^{+}}(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}\right)
=0(AGθ)=0.\displaystyle=0\cdot(AG_{\theta})^{\dagger}=0.

Combining the above two limits gives limσ0+μi(σ)=(AGθ)AGzi\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)=(AG_{\theta})^{\dagger}AGz_{i} as claimed.

For the special case, since AGm×kAG\in\mathbb{R}^{m\times k} has rank(AG)=k\mathrm{rank}(AG)=k, the pseudoinverse of AGAG is given by (AG)=(GTATAG)1GTAT(AG)^{\dagger}=(G^{T}A^{T}AG)^{-1}G^{T}A^{T}. Applying this to M=AGM=AG, we observe that

limσ0+\displaystyle\lim_{\sigma\rightarrow 0^{+}} (GTATAG+σ2I)1GTATAGzi\displaystyle(G^{T}A^{T}AG+\sigma^{2}I)^{-1}G^{T}A^{T}AGz_{i}
=(AG)AGzi=zi.\displaystyle=(AG)^{\dagger}AGz_{i}=z_{i}.

Now, for the error bound, we note that for σ0\sigma\geqslant 0, we have by the triangle inequality,

μi(σ)zi2\displaystyle\|\mu_{i}^{*}(\sigma)-z_{i}\|_{2} μi(σ)limσ0+μi(σ)2\displaystyle\leqslant\|\mu_{i}^{*}(\sigma)-\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)\|_{2}
+limσ0+μi(σ)zi2\displaystyle+\|\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)-z_{i}\|_{2}

so we bound each term separately. Recall that we have shown limσ0+μi(σ)=(AGθ)AGzi\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)=(AG_{\theta})^{\dagger}AGz_{i}. For the first term, observe that

μi(σ)(AGθ)AGzi2\displaystyle\|\mu_{i}^{*}(\sigma)-(AG_{\theta})^{\dagger}AGz_{i}\|_{2}
(GθTATAGθ+σ2I)1GθTATAGzi(AGθ)AGzi\displaystyle\leqslant\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}AGz_{i}-(AG_{\theta})^{\dagger}AGz_{i}\|
+σ(GθTATAGθ+σ2)1GθTATη~i2\displaystyle+\sigma\cdot\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2})^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}\|_{2}
(GθTATAGθ+σ2I)1GθTAT(AGθ)AGzi2\displaystyle\leqslant\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}-(AG_{\theta})^{\dagger}\|\|AGz_{i}\|_{2}
+σ(GθTATAGθ+σ2)1GθTATη~i2\displaystyle+\sigma\cdot\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2})^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}\|_{2}

Let AGθ=USVTAG_{\theta}=USV^{T} denote the singular value decomposition of AGθAG_{\theta}. Then GθTATAGθ=VSUTUSVT=VS2VTG_{\theta}^{T}A^{T}AG_{\theta}=VSU^{T}USV^{T}=VS^{2}V^{T}. This implies that

(GθTATAGθ\displaystyle(G_{\theta}^{T}A^{T}AG_{\theta} +σ2I)1GθAT\displaystyle+\sigma^{2}I)^{-1}G_{\theta}A^{T}
=(VS2VT+σ2I)1VSUT.\displaystyle=(VS^{2}V^{T}+\sigma^{2}I)^{-1}VSU^{T}.

Moreover, since VV is orthogonal, we have that

(VS2VT+σ2I)1\displaystyle(VS^{2}V^{T}+\sigma^{2}I)^{-1} =(VS2VT+σ2VVT)1\displaystyle=(VS^{2}V^{T}+\sigma^{2}VV^{T})^{-1}
=V(S2+σ2I)1VT\displaystyle=V(S^{2}+\sigma^{2}I)^{-1}V^{T}

Thus, for σ,α0\sigma,\alpha\geqslant 0, we have that

[(GθTATAGθ+σ2I)1(GθTATAGθ+α2I)1]GθAT\displaystyle[(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}-(G_{\theta}^{T}A^{T}AG_{\theta}+\alpha^{2}I)^{-1}]G_{\theta}A^{T}
=[(VS2VT+σ2I)1(VS2VT+α2I)1]VSUT\displaystyle=\left[(VS^{2}V^{T}+\sigma^{2}I)^{-1}-(VS^{2}V^{T}+\alpha^{2}I)^{-1}\right]VSU^{T}
=V[(S2+σ2I)1(S2+α2I)1]VTVSUT\displaystyle=V\left[(S^{2}+\sigma^{2}I)^{-1}-(S^{2}+\alpha^{2}I)^{-1}\right]V^{T}VSU^{T}
=V[(S2+σ2I)1(S2+α2I)1]SUT.\displaystyle=V\left[(S^{2}+\sigma^{2}I)^{-1}-(S^{2}+\alpha^{2}I)^{-1}\right]SU^{T}.

Thus, we conclude by rotational invariance of the spectral norm, we obtain

(GθTATAGθ+σ2I)1GθAT\displaystyle\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}A^{T}
(GθTATAGθ+α2I)1GθAT\displaystyle-(G_{\theta}^{T}A^{T}AG_{\theta}+\alpha^{2}I)^{-1}G_{\theta}A^{T}\|
=[(S2+σ2I)1(S2+α2I)1]S.\displaystyle=\left\|\left[(S^{2}+\sigma^{2}I)^{-1}-(S^{2}+\alpha^{2}I)^{-1}\right]S\right\|.

Recall that SS is a diagonal matrix containing the singular values of AGθAG_{\theta}. Let σi\sigma_{i} denote the ii-th singular value of AGθAG_{\theta} for i=1,,ki=1,\dots,k. Then by direct calculation, we have that

[(S2+σ2I)1(S2+α2I)1]S\displaystyle\left\|\left[(S^{2}+\sigma^{2}I)^{-1}-(S^{2}+\alpha^{2}I)^{-1}\right]S\right\|
=maxi[k]|σi(1σi2+σ21σi2+α2)|\displaystyle=\max_{i\in[k]}\left|\sigma_{i}\cdot\left(\frac{1}{\sigma_{i}^{2}+\sigma^{2}}-\frac{1}{\sigma_{i}^{2}+\alpha^{2}}\right)\right|
=maxi[k]|σi(σ2α2)(σi2+σ2)(σi2+α2)|.\displaystyle=\max_{i\in[k]}\left|\frac{\sigma_{i}(\sigma^{2}-\alpha^{2})}{(\sigma_{i}^{2}+\sigma^{2})(\sigma_{i}^{2}+\alpha^{2})}\right|.

Taking the limit as α0+\alpha\rightarrow 0^{+}, we obtain

(GθTATAGθ+σ2I)1GθTAT(AGθ)\displaystyle\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2}I)^{-1}G_{\theta}^{T}A^{T}-(AG_{\theta})^{\dagger}\|
=maxi[k]σ2σi(σi2+σ2)\displaystyle=\max_{i\in[k]}\frac{\sigma^{2}}{\sigma_{i}(\sigma_{i}^{2}+\sigma^{2})}
σk3σ2.\displaystyle\leqslant\sigma_{k}^{-3}\cdot\sigma^{2}.

This additionally shows that for any σ0\sigma\geqslant 0,

(GθTATAGθ\displaystyle\|(G_{\theta}^{T}A^{T}AG_{\theta} +σ2I)1GθAT\displaystyle+\sigma^{2}I)^{-1}G_{\theta}A^{T}\|
=(S2+σ2I)1S\displaystyle=\left\|(S^{2}+\sigma^{2}I)^{-1}S\right\|
σk1.\displaystyle\leqslant\sigma_{k}^{-1}.

Let ENE_{N} be the event that maxi[N]η~i22m\max_{i\in[N]}\|\tilde{\eta}_{i}\|_{2}\leqslant 2\sqrt{m}. By a single-tailed variant of Corollary 5.17 in \citePhysVershyninHDPSupp, we have (EN)1Neγm\mathbb{P}(E_{N})\geqslant 1-Ne^{-\gamma m} for some absolute constant γ>0\gamma>0. Thus, with the same probability, we have that the following bound holds:

σ(GθTATAGθ+σ2)1GθTATη~i22σσk1m.\displaystyle\sigma\cdot\|(G_{\theta}^{T}A^{T}AG_{\theta}+\sigma^{2})^{-1}G_{\theta}^{T}A^{T}\tilde{\eta}_{i}\|_{2}\leqslant 2\sigma\sigma_{k}^{-1}\sqrt{m}.

For the term limσ0+μi(σ)zi2\|\lim_{\sigma\rightarrow 0^{+}}\mu_{i}^{*}(\sigma)-z_{i}\|_{2}, note that we have

(AGθ)AGzizi2\displaystyle\|(AG_{\theta})^{\dagger}AGz_{i}-z_{i}\|_{2} =(AGθ)AGzi(AG)AGzi2\displaystyle=\|(AG_{\theta})^{\dagger}AGz_{i}-(AG)^{\dagger}AGz_{i}\|_{2}
(AGθ)(AG)AGzi2\displaystyle\leqslant\|(AG_{\theta})^{\dagger}-(AG)^{\dagger}\|\cdot\|AGz_{i}\|_{2}

where \|\cdot\| is the spectral norm. Assuming rank(AGθ)=rank(AG)=k\mathrm{rank}(AG_{\theta})=\mathrm{rank}(AG)=k, we can apply Theorem 4.1 in \citePhysWedin1973 to bound

(AGθ)\displaystyle\|(AG_{\theta})^{\dagger} (AG)\displaystyle-(AG)^{\dagger}\|
2(AGθ)(AG)A(GθG).\displaystyle\leqslant\sqrt{2}\|(AG_{\theta})^{\dagger}\|\|(AG)^{\dagger}\|\|A(G_{\theta}-G)\|.

Combining this bound with the previous equation yields the desired inequality. ∎

X-D Further Discussion of ELBOProxy\operatorname*{ELBOProxy}

In Section III-A, we showed that the ELBO\operatorname*{ELBO} is equal to the ELBOProxy\operatorname*{ELBOProxy} for injective or invertible generators GG with approximate posteriors hϕh_{\phi} induced by GG. While this equality may not hold for non-injective networks GG, we expect this proxy to be a reasonable estimate for the ELBO via the following heuristic argument. Consider the distribution Gp(z|G)G\sharp p(z|G) convolved with isotropic Gaussian noise: pσ(x|G)=pσ(x|z,G)p(z|G)𝑑zp_{\sigma}(x|G)=\int p_{\sigma}(x|z,G)p(z|G)dz where pσ(x|z,G)=𝒩(G(z),σ2I)p_{\sigma}(x|z,G)=\mathcal{N}(G(z),\sigma^{2}I) for σ1\sigma\ll 1. If xhϕ=Gqϕx\sim h_{\phi}=G\sharp q_{\phi}, then x=G(z0)x=G(z_{0}) for some z0qϕz_{0}\sim q_{\phi}. For any small δ>0\delta>0, observe that the likelihood for such an xx can decomposed into two terms:

pσ(x|G)\displaystyle p_{\sigma}(x|G) =pσ(G(z0)|z,G)p(z|G)𝑑z\displaystyle=\int p_{\sigma}(G(z_{0})|z,G)p(z|G)dz
=zz02δpσ(G(z0)|z,G)p(z|G)𝑑z\displaystyle=\int_{\|z-z_{0}\|_{2}\leqslant\delta}p_{\sigma}(G(z_{0})|z,G)p(z|G)dz
+zz02>δpσ(G(z0)|z,G)p(z|G)𝑑z.\displaystyle+\int_{\|z-z_{0}\|_{2}>\delta}p_{\sigma}(G(z_{0})|z,G)p(z|G)dz.

For smooth generators GG and δ\delta sufficiently small, we would expect G(z)G(z0)G(z)\approx G(z_{0}) for zz0z\approx z_{0} so that the density pσ(G(z0)|z,G))p(z|G)p_{\sigma}(G(z_{0})|z,G))p(z|G) is constant in a neighborhood of z0z_{0}, making the first term Cσp(z0|G)\approx C_{\sigma}p(z_{0}|G) for some constant CσC_{\sigma}. Moreover, if G(z0)G(z_{0}) is of sufficiently high likelihood, we would expect the second term to be negligible for σ1\sigma\ll 1 so that 𝔼xhϕ[logpσ(x|G)]𝔼zqϕ[logp(z|G)]\mathbb{E}_{x\sim h_{\phi}}[\log p_{\sigma}(x|G)]\approx\mathbb{E}_{z\sim q_{\phi}}[\log p(z|G)] (up to a constant).

XI Full Model Selection Results

In this section, we expand upon the model selection results using the negative ELBOProxy\operatorname*{ELBOProxy} in Section III-A in the main text. The experimental setup is the same, where we aim to select a generative model for a class of MNIST in two different inverse problems: denoising and phase retrieval. The full array of negative ELBOProxy\operatorname*{ELBOProxy} values for denoising is given in Fig. 13 and for phase retrieval in Fig. 14.

Refer to caption
Figure 13: Full model selection results for denoising. We showcase further results on model selection with generative networks for denoising. Top: the two topmost rows correspond to the ground-truth image xcx_{c} and the noisy measurements ycy_{c}. Middle: we show the means of the variational distributions given by the IGM trained on a particular class. Bottom: each column of the array corresponds to the negative ELBOProxy\operatorname*{ELBOProxy} achieved by each model in reconstructing the images. Here, lower is better. We highlight entries with green boxes with the best negative ELBOProxy\operatorname*{ELBOProxy} values in each row.
Refer to caption
Figure 14: Full model selection results for phase retrieval. We showcase further results on model selection with generative networks for phase retrieval. Top: the two topmost rows correspond to the ground-truth image xcx_{c} and the log of the noisy measurements ycy_{c}. Middle: we show the means of the variational distributions given by the IGM trained on a particular class. Bottom: each column of the array corresponds to the negative ELBOProxy\operatorname*{ELBOProxy} achieved by each model in reconstructing the images. Here, lower is better. We highlight entries with green boxes with the best negative ELBOProxy\operatorname*{ELBOProxy} values in each row.

XII Additional results

In this section, we expand upon the results for inferring the IGM using the methods described in Section III-B of the main text.

Refer to caption
Figure 15: Dropout ablation with denoising 95 images of celebrity A. We demonstrate our method described in Section III-B of the main body using 95 noisy images of a celebrity. Here we show the ground-truth (row 1), noisy measurements (row 2), mean reconstruction with dropout (row 3), and mean reconstruction without dropout (row 4) for a subset of the 95 different noisy images. The reconstructions with and without dropout have similar image quality. The mean PSNR for with dropout is 27.0 and for without dropout is 27.2.

XII-A Ablations

In Fig. 15, we show an ablation test to validate the usage of dropout. Although the PSNR is higher when we do not have dropout, there are noticeable artifacts from the generative architecture in the reconstructions.

Refer to caption
Figure 16: Denoising 95 images of celebrity B. We demonstrate our method described in Section III-B of the main body using 95 noisy images of a celebrity. These results are of celebrity B, but correspond to Fig. 5 in the main body. Here we show the ground-truth (top), noisy measurements (middle), and mean reconstruction (bottom) for a subset of the 95 different noisy images. Our reconstructions are much less noisy, recovering sharper features that are hard to discern in the noisy images. Note that no predefined prior/regularizer was used in denoising.
Refer to caption
Figure 17: Posterior samples from denoising 95 images of a celebrity. We demonstrate our method described in Section III-B of the main body to denoise 95 images. Here we show the ground-truth image, the noisy measurements, mean of the posterior, standard deviation of the posterior, and samples from the posterior.
Refer to caption
Figure 18: Posterior samples from denoising 95 images of celebrity A. We show additional results for denoising 95 images of celebrity A, which corresponds to the same results shown in Fig. 5 of the main body. Here we show the ground-truth image, the noisy measurements, mean of the posterior, standard deviation of the posterior, and samples from the posterior.

XII-B Denoising

In Fig. 5 of the main body, we showed results for denoising 95 images of celebrity A. We include additional visualizations of the posterior by showing samples from multiple posteriors in Fig. 18, which are generated from the IGM trained as shown in Fig. 5 in the main body. We show additional results on denoising 95 images of celebrity face B in Fig. 16 along with samples from the posterior in Fig. 17. In both cases, our reconstructions are much less noisy than the measurements and sharpen the primary facial features.

XII-C Phase Retrieval

We show additional results from phase retrieval with 150 images of MNIST 8’s in Fig. 19 and Fig. 20. The examples shown in Fig. 19 are from the Fourier phase retrieval measurements whereas the ones in Fig. 20 are from Gaussian phase retrieval measurements. These are in addition to those shown in Fig. 8 in the main text. Note that the simple, unimodal Gaussian variational distribution is not expressive enough to capture the multimodal structure of the true posterior in the Fourier phase retrieval problem.
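For concreteness, a minimal NumPy sketch of the two measurement operators compared here is given below; the oversampling factor, number of Gaussian measurements, and noise level are illustrative placeholders rather than the exact settings used in our experiments.

import numpy as np

def fourier_phase_retrieval(x, oversample=2, sigma=0.01, rng=None):
    """Magnitude of an oversampled 2D Fourier transform, plus Gaussian noise."""
    rng = rng or np.random.default_rng()
    n = x.shape[0]
    mag = np.abs(np.fft.fft2(x, s=(oversample * n, oversample * n)))
    return mag + sigma * rng.standard_normal(mag.shape)

def gaussian_phase_retrieval(x, m, sigma=0.01, rng=None):
    """Magnitude of m random Gaussian projections of the flattened image."""
    rng = rng or np.random.default_rng()
    A = rng.standard_normal((m, x.size)) / np.sqrt(x.size)
    return np.abs(A @ x.ravel()) + sigma * rng.standard_normal(m)

x = np.random.rand(32, 32)              # stand-in for an MNIST "8"
y_fourier = fourier_phase_retrieval(x)
y_gauss = gaussian_phase_retrieval(x, m=4 * x.size)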

For Table II in the main body, we calculated the PSNRs for each measurement model in the following way: for each underlying image, we generated 1000 samples from the generator and latent variational distribution and used cross-correlation in the frequency domain to find the most plausible shift from each reconstruction to the underlying image. After shifting each sample, we calculated its PSNR with respect to the underlying image. We then took the best PSNR among all samples and averaged these values across all $N$ image examples.
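A minimal sketch of this evaluation is shown below; the helper names are ours, and the posterior samples are assumed to be images with values in [0, 1] and the same size as the ground truth.

import numpy as np

def psnr(a, b, data_range=1.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def best_aligned_psnr(samples, x_true):
    """Align each posterior sample to x_true via FFT cross-correlation,
    then return the best PSNR over all samples."""
    best = -np.inf
    for s in samples:
        # Cross-correlation in the frequency domain gives the most plausible shift.
        corr = np.fft.ifft2(np.fft.fft2(s) * np.conj(np.fft.fft2(x_true))).real
        dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
        aligned = np.roll(s, shift=(-dy, -dx), axis=(0, 1))
        best = max(best, psnr(aligned, x_true))
    return best

# Average over all N image examples, using e.g. 1000 samples per image:
# scores = [best_aligned_psnr(draw_samples(G, q_i, 1000), x_i)
#           for q_i, x_i in zip(posteriors, images)]
# mean_psnr = np.mean(scores)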

Figure 19: Random examples from Fourier phase retrieval on MNIST 8’s. We demonstrate our method described in Section III-B of the main body to perform phase retrieval on 75 images. Here we show the ground-truth image, the log magnitude of the Fourier transform, mean of the posterior, standard deviation of the posterior, and samples from the posterior. Note that the simple Gaussian latent posterior distribution is unable to capture the full multimodal structure of the true posterior.
Figure 20: Random examples from Gaussian phase retrieval on MNIST 8’s. We demonstrate our method described in Section III-B of the main body to perform phase retrieval on 150 images. Here we show the ground-truth image, mean of the posterior, standard deviation of the posterior, and samples from the posterior.

Blur size          0 μas   5 μas   10 μas          15 μas   20 μas   25 μas   30 μas   35 μas
PSNR ($\uparrow$)  26.8    30.6    \textbf{33.0}   32.4     30.7     29.3     28.1     27.3
NCC ($\uparrow$)   0.922   0.971   0.984           0.978    0.961    0.941    0.920    0.901

TABLE III: Illustrating the intrinsic resolution of our result. We show quantitative comparisons between our reconstructions and the clean underlying image blurred by varying degrees, where 0 μas is no blur and 35 μas is the highest blur. 25 μas represents the intrinsic resolution of the telescope. We find we are able to super-resolve the image by over 2x. The highest PSNR is highlighted in bold.
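The sweep behind Table III can be reproduced along the following lines. The helper names are ours, and the conversion from a blur full width at half maximum in μas to a Gaussian sigma in pixels depends on the field of view and pixel grid, so uas_per_pixel below is a placeholder that must be set for a given reconstruction.

import numpy as np
from scipy.ndimage import gaussian_filter

def psnr(a, b, data_range=1.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def ncc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def blur_sweep(recon, truth, blur_uas, uas_per_pixel):
    """Compare a reconstruction to the clean truth blurred at several resolutions."""
    rows = []
    for b in blur_uas:
        # Convert a blur FWHM in microarcseconds to a Gaussian sigma in pixels.
        sigma_px = (b / uas_per_pixel) / 2.355
        blurred = gaussian_filter(truth, sigma=sigma_px) if b > 0 else truth
        rows.append((b, psnr(recon, blurred), ncc(recon, blurred)))
    return rows

# Example: blur_sweep(recon, truth, blur_uas=[0, 5, 10, 15, 20, 25, 30, 35],
#                     uas_per_pixel=2.0)   # placeholder pixel scale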

XII-D Multiple Forward Models

Figure 21: Denoising many noise levels. We demonstrate our method to perform denoising on measurements with multiple noise levels. For each experiment, we use 75 measurement examples, which are defined by y(i)=x(i)+ηy^{(i)}=x^{(i)}+\eta where η{η1,η2,η3}\eta\in\{\eta_{1},\eta_{2},\eta_{3}\} and ηi𝒩(0,σi2I)\eta_{i}\sim\mathcal{N}(0,\sigma_{i}^{2}I). In Experiment 1, we use noisy measurement examples that have additive noise with standard deviations of 0.01, 0.1, and 0.5. In Experiment 2, we use noisy measurement examples that have additive noise with standard deviations of 0.2, 0.3, and 0.5. In Experiment 3, we use noisy measurement examples that have additive noise with standard deviations of 0.3, 0.5, and 0.7. We visualize the true underlying images, the measurement used for each experiment, and the mean of the image reconstruction posterior. Most of the reconstructions recover the primary features of the underlying image. However, in Experiment 1, the reconstructions of the low SNR measurements exhibit bias and do not match the true underlying images. Due to the wide range of SNRs with nearly noiseless measurements, our reconstructions overfit to the nearly clean images. The biased results are highlighted by the red box.
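The measurement model in these experiments is simple to simulate; a sketch is given below. How noise levels are assigned to images (cycling through the list) is an illustrative choice, and in all cases each measurement's noise level is treated as part of its known forward model.

import numpy as np

def make_multilevel_measurements(images, sigmas, rng=None):
    """y_i = x_i + eta_i with eta_i ~ N(0, sigma_i^2 I); the noise level is
    assigned by cycling through the given standard deviations."""
    rng = rng or np.random.default_rng()
    measurements, noise_levels = [], []
    for i, x in enumerate(images):
        sigma = sigmas[i % len(sigmas)]
        measurements.append(x + sigma * rng.standard_normal(x.shape))
        noise_levels.append(sigma)   # the noise level (forward model) is known
    return measurements, noise_levels

# Experiment 1 from Fig. 21 uses standard deviations 0.01, 0.1, and 0.5:
# ys, sigmas_used = make_multilevel_measurements(xs, sigmas=[0.01, 0.1, 0.5])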

XIII IGM as a Generator

While the goal of our method is to improve performance in solving the underlying inverse problems, we can inspect what the inferred IGM has learned by generating samples from it. We show samples from different inferred IGMs in Fig. 22 for a variety of datasets and inverse problems. From left to right: the IGM from denoising MNIST 8's (Fig. 4 from the main body), the IGM inferred from phase retrieval measurements of MNIST 8's, the IGM from denoising celebrity face A (Fig. 5 from the main body), the IGM from denoising celebrity face B (Fig. 16), and the IGM from black hole video reconstruction in the compressed sensing problem (Fig. 10 from the main body).

Figure 22: Samples from the IGM. Here we show samples from the IGM defined by $w = G_{\theta}(z)$, where $z \sim \mathcal{N}(0, \sigma^2 I)$. We set $\sigma = 1$ for all cases except phase retrieval, which uses $\sigma = 5$. We show samples from a variety of IGMs inferred from different datasets and inverse problems. From left to right: denoising MNIST 8's, phase retrieval measurements of MNIST 8's, denoising celebrity face A, denoising celebrity face B, and black hole video reconstruction from the compressed sensing problem. The pixel values of all images range from 0 to 1.
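A minimal sketch of how such samples are drawn is shown below; G_theta stands in for the inferred generator network, and the clamp to [0, 1] is only to match the display range noted above.

import torch

@torch.no_grad()
def sample_igm(G_theta, num_samples, latent_dim, sigma=1.0):
    """Draw z ~ N(0, sigma^2 I) and push it through the inferred generator.
    We use sigma = 1 for all experiments except phase retrieval (sigma = 5)."""
    z = sigma * torch.randn(num_samples, latent_dim)
    x = G_theta(z)
    return x.clamp(0.0, 1.0)   # displayed images lie in [0, 1]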
\bibliographystylePhys{plain}
\bibliographyPhys{supp}