
Linear Inverse Problems Using a Generative Compound Gaussian Prior

Carter Lyons, Raghu G. Raj, and Margaret Cheney

Carter Lyons is with the U.S. Naval Research Laboratory, Washington, DC 20375 USA and also with Colorado State University, Fort Collins, CO 80523 USA (e-mail: carter.lyons@colostate.edu). Raghu G. Raj is with the U.S. Naval Research Laboratory, Washington, DC 20375 USA (e-mail: raghu.g.raj@ieee.org). Margaret Cheney is with Colorado State University, Fort Collins, CO 80523 USA (e-mail: margaret.cheney@colostate.edu). This material is based upon research supported in part by the U.S. Office of Naval Research via the NRL base program and under award numbers N000142412241 and N0001423WX01875, and sponsored in part by the Air Force Office of Scientific Research under award number FA9550-21-1-0169.
Abstract

Since most inverse problems arising in scientific and engineering applications are ill-posed, prior information about the solution space is incorporated, typically through regularization, to establish a well-posed problem with a unique solution. Often, this prior information is an assumed statistical distribution of the desired inverse problem solution. Recently, due to the unprecedented success of generative adversarial networks (GANs), the generative network from a GAN has been implemented as the prior information in imaging inverse problems. In this paper, we devise a novel iterative algorithm to solve inverse problems in imaging where a dual-structured prior is imposed by combining a GAN prior with the compound Gaussian (CG) class of distributions. A rigorous computational theory for the convergence of the proposed iterative algorithm, which is based upon the alternating direction method of multipliers, is established. Furthermore, elaborate empirical results for the proposed iterative algorithm are presented. By jointly exploiting the powerful CG and GAN classes of image priors, we find, in compressive sensing and tomographic imaging problems, our proposed algorithm outperforms and provides improved generalizability over competitive prior art approaches while avoiding performance saturation issues in previous GAN prior-based methods.

Index Terms:
Inverse problems, generative adversarial networks, nonlinear programming, alternating direction method of multipliers, compound Gaussian, dual-structured prior

I Introduction

I-A Motivation

Inverse problems (IPs), the problem of recovering input model parameters – e.g., an image – representing a physical quantity from observed output data – e.g., an undersampling or corruption – produced by a forward measurement mapping, arise throughout many applications in scientific and engineering fields. In imaging, for example, compressive sensing (CS), X-ray computed tomography (CT), magnetic resonance imaging (MRI), super-resolution, radar imaging, and sonar imaging are just a handful of practical applications of linear IPs where the forward mapping, from input model parameters to output data, is effectively captured by a linear operator. Typically, forward mappings arising from scientific and engineering applications are non-injective, creating (linear) IPs that are ill-posed since the forward mapping sends multiple distinct input model parameterizations to the same observed output data. To alleviate ill-posedness, outside prior information about the input model parameters – e.g., sparsity or known statistics – is incorporated into the solution of the (linear) IPs. That is, given observed output data, the solution to the (linear) IP is input model parameters that, in addition to mapping to the observed output data via the forward measurement mapping, satisfies statistical or structural conditions imposed by the prior information.

Model-based methods – in particular, within a Bayesian maximum a posteriori (MAP) framework – are a primary approach of solving (linear) IPs and often involve the minimization of a cost function via an iterative algorithm. Prior information is encapsulated by the cost function through a regularization term, which renders unique IP solutions. Examples of model-based methods, which have been prevalent over the last two decades, include the Iterative Shrinkage and Thresholding Algorithm (ISTA) [1, 2], Bayesian Compressive Sensing [3], and Compressive Sampling Matching Pursuit [4, 5]. These previous approaches can be viewed, from a Bayesian MAP perspective, as implementing a generalized Gaussian prior [2, 1, 5]. Based upon the fact that the generalized Gaussian prior is subsumed by the compound Gaussian (CG) prior, which better captures statistical properties of images [6, 7, 8, 9, 10, 11], the authors have previously developed an improved set of model-based methods for linear IPs utilizing a CG prior, which have shown state-of-the-art performance in tomographic imaging and CS [8, 12, 9, 13, 14, 15].

With advancements in artificial intelligence and deep learning (DL), a new approach to improve IP solutions is to utilize a learned prior rather than a handcrafted prior (i.e., generalized Gaussian or CG). One such approach is to create a deep neural network (DNN) by applying algorithm unrolling [16] to a model-based method in which a parameterization of the regularization term is learned [8, 14, 17, 18, 19, 20]. Another approach, as originally proposed by Bora et al. [21], is to employ a generative adversarial network (GAN) [22], trained to sample from the distribution of the IP input model parameters, as the learned prior. In solving the (linear) IP, these GAN-prior approaches constrain the output from a model-based method onto the range of the GAN [21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. While these previous GAN-prior approaches have produced excellent empirical results and outperformed state-of-the-art model-based methods in cases where the forward measurement mapping severely undersamples the input model parameters, they suffer from a number of limitations including:

  • The GAN must capture a superb representation of the IP input model parameters. Typically, this is achieved by training the GAN with a substantial amount of data.

  • Generalization to alternative datasets of IP input model parameters inherently results in significant performance loss or requires the training of another GAN.

  • Performance, in the quality of the IP solutions, saturates with increasing sampling by the forward measurement mapping.

  • Theoretical guarantees are often elusive, due to the non-convex GAN as a prior.

In this paper, we introduce a novel method incorporating a dual-structured prior, based on both the CG and GAN priors, to address the aforementioned four limitations.

Lastly, we remark that some previous works with GAN priors incorporate training of the GAN while solving the (linear) IP, which can boost performance and reduce saturation [34, 35, 36, 37]. While we could incorporate such techniques into our proposed method, we focus on the case of using pretrained GAN priors and leave the case of adaptive GAN optimization (i.e., training the GAN while solving the IP) as an important topic of future work. Additionally, these adaptive GAN optimization approaches require substantial computational time and resources as a new GAN must be trained for each forward measurement mapping considered.

I-B Contributions

This paper constructs an iterative estimation algorithm with a CG-GAN-prior to solve linear IPs. Specifically, we:

  1. Develop a novel CG-GAN-based iterative algorithm for IPs using the Alternating Direction Method of Multipliers (ADMM) [38, 39, 40], which we denote by Alternating direction method of multipliers with Compound Gaussian and Generative Adversarial Network priors (A-CG-GAN).

  2. Derive a theoretical convergence analysis for the proposed A-CG-GAN to provide insight into the performance of the algorithm.

  3. Present ample empirical results for A-CG-GAN on CS and CT linear IPs. We demonstrate the effectiveness of A-CG-GAN even with modestly trained GANs and show that the performance of A-CG-GAN does not saturate with higher sampling from the forward measurement mapping. Furthermore, we present a clear improvement over state-of-the-art GAN-prior approaches in reconstructed image quality and generalizability.

Our proposed CG-GAN iterative algorithm can be viewed as an expansion on [25] by incorporating the powerful CG prior into the IP framework as in [13, 41, 12, 8, 9, 15]. By combining these two techniques we gain the advantage of using a GAN-prior while generalizing to a broader signal prior within the CG-prior context. We remark that our proposed algorithm can reduce to [25] as a special case, which has shown excellent empirical performance when the image to be reconstructed lies in the range of the generator. As such, our proposed algorithm directly inherits such performance in the case of a GAN perfectly capturing the desired true images.

I-C Preliminaries

In this section, we briefly introduce the CG prior and GANs as a prior as these are the key components we combine to furnish the novel method of this paper. First, the forward measurement mapping we consider throughout this paper is

$\bm{y}=A\bm{c}+\bm{\nu}$ (1)

where $\bm{y}\in\mathbb{R}^{m}$ are the output observations (or measurements), $A\in\mathbb{R}^{m\times n}$ is a sensing matrix, $\bm{c}\in\mathbb{R}^{n}$ are the unknown input model parameters, and $\bm{\nu}\in\mathbb{R}^{m}$ is additive white noise. In many applications of interest, $m\ll n$ and $A$ is decomposed as $A=\Psi\Phi$ for $\Psi\in\mathbb{R}^{m\times n}$ a measurement matrix and $\Phi\in\mathbb{R}^{n\times n}$ a change-of-basis dictionary. This formulation results from representing some original input model parameters, $\bm{s}$, with respect to (w.r.t.) $\Phi$ as $\bm{s}=\Phi\bm{c}$. The problem of signal reconstruction, or estimation, is the linear IP of recovering $\bm{c}$, or $\bm{s}$, given $\bm{y}$, $\Psi$, and $\Phi$.
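To fix ideas, the following is a minimal NumPy sketch (our own illustration, not the authors' code) of the measurement model (1) for a CS-style problem; the dimensions, the QR-based orthonormal dictionary $\Phi$, the sparsity level, and the SNR are arbitrary choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 256, 64                       # signal and measurement dimensions (m << n)

# Measurement matrix Psi with i.i.d. N(0, 1/m) entries, as used later for CS
Psi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
# Change-of-basis dictionary Phi (orthonormal, here built from a QR factorization)
Phi, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Psi @ Phi                        # forward operator in (1)

c_true = np.zeros(n)                 # a sparse coefficient vector for illustration
c_true[rng.choice(n, 8, replace=False)] = rng.normal(size=8)

snr_db = 40.0
signal = A @ c_true
noise_std = np.linalg.norm(signal) / np.sqrt(m) * 10 ** (-snr_db / 20)
y = signal + noise_std * rng.normal(size=m)   # noisy measurements y = A c + nu
```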

I-C1 Compound Gaussian Prior

A fruitful way to formulate IPs is by Bayesian estimation. In particular, the maximum a posteriori (MAP) estimate of $\bm{c}$ from (1) is

$\bm{c}^{*}=\arg\min_{\bm{c}}\,\,\frac{1}{2}\lVert\bm{y}-A\bm{c}\rVert_{2}^{2}-\log(p_{c}(\bm{c}))$ (2)

where $p_{c}$ is the assumed prior density of $\bm{c}$. Clearly, the prior density of $\bm{c}$, which incorporates domain-level knowledge into the IP, is crucial to a successful MAP estimate. Many previous works employ a generalized Gaussian prior [6, 7], such as a Gaussian prior for Tikhonov regression [42] or a Laplacian prior as is predominant in the CS framework [4, 1, 2, 5].

Through the study of the statistics of image sparsity coefficients, it has been shown that coefficients of natural images exhibit self-similarity, heavy-tailed marginal distributions, and self-reinforcement among local coefficients [7]. Such properties are not encompassed by the generalized Gaussian prior. Instead, CG densities [13], also known as Gaussian scale mixtures [6, 7], better capture these statistical properties of natural images and images from other modalities such as radar [10, 11]. A useful formulation of the CG prior lies in modeling $\bm{c}$ as the Hadamard product

$\bm{c}=\bm{z}\odot\bm{u}\coloneqq[z_{1}u_{1},z_{2}u_{2},\ldots,z_{n}u_{n}]^{T}$ (3)

such that $\bm{u}\sim\mathcal{N}(\bm{\mu}_{u},\Sigma_{u}^{-1})$, $\bm{z}\sim p_{z}$ is a sparse or heavy-tailed positive random vector, and $\bm{u}$ and $\bm{z}$ are independent [7, 13]. We call $\bm{u}$ and $\bm{z}$ the Gaussian variable and scale variable, respectively. By suitably defining the distribution of $\bm{z}$, the CG prior subsumes many well-known distributions including the generalized Gaussian, Student's $t$, $\alpha$-stable, and symmetrized Gamma distributions [7, 8].
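As a concrete illustration, the snippet below samples from a CG prior of the form (3), assuming a log-normal scale variable (one choice discussed later in this paper) and an i.i.d. Gaussian variable; the particular parameter values are illustrative assumptions.

```python
import numpy as np

def sample_compound_gaussian(n, num_samples, mu_u=0.0, sigma_u=1.0, sigma_z=1.0, rng=None):
    """Draw samples c = z * u with z log-normal (positive, heavy-tailed)
    and u Gaussian, independent of z, as in (3)."""
    rng = rng or np.random.default_rng()
    u = rng.normal(mu_u, sigma_u, size=(num_samples, n))          # Gaussian variable
    z = np.exp(rng.normal(0.0, sigma_z, size=(num_samples, n)))   # scale variable, z > 0
    return z * u                                                  # Hadamard product

samples = sample_compound_gaussian(n=64, num_samples=1000)
print(samples.shape, np.abs(samples).max())   # heavy-tailed marginals
```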

We note that, decomposing $\bm{c}$ as in (3), the joint MAP estimate of $\bm{z}$ and $\bm{u}$ from (1) is

$\bm{z}^{*},\bm{u}^{*}=\arg\min_{[\bm{z},\bm{u}]}\,\,\frac{1}{2}\lVert\bm{y}-A(\bm{z}\odot\bm{u})\rVert_{2}^{2}-\log(p_{z}(\bm{z}))+\frac{1}{2}(\bm{u}-\bm{\mu}_{u})^{T}\Sigma_{u}(\bm{u}-\bm{\mu}_{u})$ (4)

where the estimated input model parameters, $\bm{c}^{*}$, are then given by $\bm{c}^{*}=\bm{z}^{*}\odot\bm{u}^{*}$. That is, from (2) we set $\bm{c}=\bm{z}\odot\bm{u}$, which, due to the independence of $\bm{z}$ and $\bm{u}$, splits the prior term $\log(p_{c}(\bm{c}))$ into the sum of a prior term for $\bm{z}$ and a prior term for $\bm{u}$.

Previously, the CG prior has been used, with a log-normal distributed scale variable, to estimate images through an iterative MAP estimation [13, 41, 8, 12] and an algorithm unrolled DNN [8, 12]. Additionally, a learned scale variable distribution in the CG prior was used to estimate images through an alternative algorithm unrolled DNN [9, 14]. Finally, the CG prior, using a real-valued random scale variable, has been successfully used for image denoising [43] and hyperspectral image CS [44].

I-C2 GANs as a Prior

To illustrate the usage of GANs as a prior in IPs, we overview the seminal work of Bora et al. [21]. Let $G:\mathbb{R}^{d}\to\mathbb{R}^{n}$ be the generative neural network (NN) from a GAN, which we assume has been adversarially trained against a discriminator NN $D:\mathbb{R}^{n}\to\mathbb{R}$, to sample from some distribution (e.g., $G$ produces “natural” images). Often $d\ll n$, implying that the generative DNN maps from a low-dimensional latent space to a high-dimensional space of interest. Given a low-dimensional latent distribution, $p_{x}$, over $\mathbb{R}^{d}$, most commonly $p_{x}\sim\mathcal{N}(0,\sigma^{2}I)$, we can generate a distribution from $G$ as $p_{G}\sim G(\bm{x})$ for $\bm{x}\sim p_{x}$.

A generative NN is incorporated as a prior by assuming that $\bm{c}$ from (1) satisfies $\bm{c}\sim p_{G}$. That is, there exists a latent vector $\bm{x}\sim p_{x}$ such that $\bm{c}=G(\bm{x})$. Therefore, rather than optimize for the solution to the IP directly as in (2), we optimize over the latent space to find the best $\bm{x}^{*}$ such that $G(\bm{x}^{*})\approx\bm{c}$. On this front, consider the MAP estimate of $\bm{x}$ from (1), when $\bm{c}=G(\bm{x})$, given by

$\bm{x}^{*}=\arg\min_{\bm{x}}\,\,\frac{1}{2}\lVert\bm{y}-AG(\bm{x})\rVert_{2}^{2}-\log(p_{x}(\bm{x})).$ (5)

Intuitively, if $\bm{c}$ is a “natural” image and $G$ samples from the distribution of “natural” images, then we can expect that $\bm{c}=G(\bm{x})$ for some $\bm{x}\sim p_{x}$.
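A minimal sketch of this latent-space optimization is given below, using a toy differentiable map in place of a trained generative NN and plain gradient descent with a flat latent prior (i.e., dropping the $-\log(p_{x}(\bm{x}))$ term); the toy generator, step-size rule, and iteration count are assumptions of this sketch, not details from [21].

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 20, 128, 40
W, b = rng.normal(size=(n, d)) / np.sqrt(d), 0.1 * rng.normal(size=n)
G = lambda x: np.tanh(W @ x + b)          # toy differentiable "generator"

A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
c_true = G(rng.normal(size=d))            # assume c lies in the range of G
y = A @ c_true                            # noiseless measurements for simplicity

x = rng.normal(size=d)                    # latent initialization x ~ N(0, I)
lr = 1.0 / (np.linalg.norm(A, 2) ** 2 * np.linalg.norm(W, 2) ** 2)  # conservative step
for _ in range(2000):                     # gradient descent on ||y - A G(x)||^2 / 2
    g = G(x)
    r = y - A @ g
    grad = -W.T @ ((1.0 - g ** 2) * (A.T @ r))   # chain rule through tanh
    x -= lr * grad
print("relative error:", np.linalg.norm(G(x) - c_true) / np.linalg.norm(c_true))
```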

II ADMM with CG-GAN Prior (A-CG-GAN)

To solve the linear IP (1) for the unknown vector $\bm{c}$, we incorporate both the CG prior, as in (4), and the GAN prior, as in (5), by setting the scale variable to lie within the range of a generative NN. Mathematically, we consider the following constrained optimization problem

$(\bm{x}^{*},\bm{z}^{*},\bm{u}^{*},\bm{c}^{*})=\underset{(\bm{x},\bm{z},\bm{u},\bm{c})\in\mathcal{D}_{x}\times\mathcal{D}_{z}\times\mathcal{D}_{u}\times\mathcal{D}_{c}}{\arg\min}\,\,F(\bm{x},\bm{z},\bm{u},\bm{c})\quad\text{s.t.}\quad\bm{z}=G(\bm{x}),\;\;\bm{c}=\bm{z}\odot\bm{u}$ (6)

where

$F(\bm{x},\bm{z},\bm{u},\bm{c})\coloneqq\frac{1}{2}\lVert\bm{y}-A\bm{c}\rVert_{2}^{2}+\mu\lVert\bm{z}\rVert_{1}+R_{x}(\bm{x})+\frac{\lambda}{2}(\bm{u}-\bm{\mu}_{u})^{T}\Sigma_{u}(\bm{u}-\bm{\mu}_{u})$ (7)

$G:\mathbb{R}^{d}\to\mathbb{R}^{n}$ is a generative NN, $R_{x}:\mathbb{R}^{d}\to\mathbb{R}$ is a convex regularization function, $\mu$ and $\lambda$ are positive scaling parameters, and $\mathcal{D}_{x},\mathcal{D}_{z},\mathcal{D}_{u},$ and $\mathcal{D}_{c}$ are the convex domains of $\bm{x},\bm{z},\bm{u},$ and $\bm{c}$, respectively. We remark that the cost function, $F$, consists of a data fidelity term, a regularization term enforcing sparsity of $\bm{z}$, an implicit regularization term on $\bm{x}$, and a regularization term enforcing Gaussianity of $\bm{u}$.
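For reference, a direct NumPy evaluation of the cost (7) might look as follows; this is a sketch under the assumption that all quantities are dense NumPy arrays, and the function name is ours.

```python
import numpy as np

def acggan_cost(x, z, u, c, y, A, mu, lam, mu_u, Sigma_u, R_x=lambda x: 0.0):
    """Evaluate the A-CG-GAN cost F in (7): data fidelity + sparsity on z
    + latent regularization R_x(x) + Gaussianity penalty on u."""
    fidelity = 0.5 * np.sum((y - A @ c) ** 2)
    sparsity = mu * np.sum(np.abs(z))
    du = u - mu_u
    gaussianity = 0.5 * lam * du @ (Sigma_u @ du)
    return fidelity + sparsity + R_x(x) + gaussianity
```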

We utilize ADMM [38, 39, 40] to solve the constrained minimization problem of (6). For this, define the feasibility gap

$\bm{\xi}\equiv\bm{\xi}(\bm{x},\bm{z},\bm{u},\bm{c})\coloneqq\begin{bmatrix}(\bm{z}-G(\bm{x}))^{T}&(\bm{c}-\bm{z}\odot\bm{u})^{T}\end{bmatrix}^{T}$ (8)

and let $\bm{\xi}_{k}\equiv\bm{\xi}(\bm{x}_{k},\bm{z}_{k},\bm{u}_{k},\bm{c}_{k})$. Additionally, define the augmented Lagrangian

$\mathcal{L}_{\rho}(\bm{x},\bm{z},\bm{u},\bm{c},\boldsymbol{\phi})\coloneqq F(\bm{x},\bm{z},\bm{u},\bm{c})+\left\langle\boldsymbol{\phi},\bm{\xi}\right\rangle+\frac{\rho}{2}\lVert\bm{\xi}\rVert_{2}^{2}$ (9)

for a positive scalar $\rho$ and a dual variable $\boldsymbol{\phi}\in\mathbb{R}^{2n}$. Therefore, the optimization problem of (6) is equivalent to

$\underset{(\bm{x},\bm{z},\bm{u},\bm{c})\in\mathcal{D}_{x}\times\mathcal{D}_{z}\times\mathcal{D}_{u}\times\mathcal{D}_{c}}{\arg\min}\,\,\underset{\boldsymbol{\phi}}{\arg\max}\,\,\mathcal{L}_{\rho}(\bm{x},\bm{z},\bm{u},\bm{c},\boldsymbol{\phi}),$ (10)

and our A-CG-GAN algorithm iteratively solves (10) through an ADMM block coordinate descent [45] as given in Algorithm 1.

To detail Algorithm 1, let $\boldsymbol{\phi}=\begin{bmatrix}\bm{\varphi}_{1}^{T}&\bm{\varphi}_{2}^{T}\end{bmatrix}^{T}$ and define

$\mathcal{T}_{c}\equiv\mathcal{T}_{c}(\bm{z},\bm{u},\boldsymbol{\phi})\coloneqq\underset{\bm{c}}{\arg\min}\,\,\mathcal{L}_{\rho}(\bm{x},\bm{z},\bm{u},\bm{c},\boldsymbol{\phi})=(A^{T}A+\rho I)^{-1}(A^{T}\bm{y}+\rho\bm{z}\odot\bm{u}-\bm{\varphi}_{2})$ (11)

$\mathcal{T}_{u}\equiv\mathcal{T}_{u}(\bm{z},\bm{c},\boldsymbol{\phi})\coloneqq\underset{\bm{u}}{\arg\min}\,\,\mathcal{L}_{\rho}(\bm{x},\bm{z},\bm{u},\bm{c},\boldsymbol{\phi})=(\lambda\Sigma_{u}+\rho\,\text{Diag}(\bm{z})^{2})^{-1}(\lambda\Sigma_{u}\bm{\mu}_{u}+\bm{z}\odot(\bm{\varphi}_{2}+\rho\bm{c})).$ (12)

As the feasibility terms $\left\langle\boldsymbol{\phi},\bm{\xi}\right\rangle+\frac{\rho}{2}\lVert\bm{\xi}\rVert_{2}^{2}$ in (9) are convex w.r.t. $\bm{z}$, we minimize $\mathcal{L}_{\rho}$ in (9) w.r.t. $\bm{z}$ using a standard fast ISTA (FISTA) [2] technique. We will write

$\mathcal{T}_{z}^{(J)}\equiv\mathcal{T}_{z}^{(J)}(\bm{x},\bm{z},\bm{u},\bm{c},\boldsymbol{\phi})=J\text{ FISTA steps on }\mathcal{L}_{\rho}\text{ w.r.t. }\bm{z}.$
Algorithm 1 ADMM with CG-GAN Prior (A-CG-GAN)
Input: $\bm{y},A,G,\mu,\lambda,\bm{\mu}_{u},R_{x},K,J,J_{x}$ and tolerance parameter $\tau$
1:  $\bm{x}_{0}\sim\mathcal{N}(\bm{0},I)$, $\bm{z}_{0}=G(\bm{x}_{0})$, $\bm{u}_{0}=\mathcal{T}(\bm{z}_{0})$, $\bm{c}_{0}=\bm{z}_{0}\odot\bm{u}_{0}$, and $\boldsymbol{\phi}_{0}=\bm{0}$
2:  for $k\in\{0,1,\ldots,K\}$ do
3:     $\bm{x}_{k+1}=\mathcal{T}_{x}^{(J_{x})}(\bm{x}_{k},\bm{z}_{k},\boldsymbol{\phi}_{k})$
4:     $\bm{z}_{k+1}=\mathcal{T}_{z}^{(J)}(\bm{x}_{k+1},\bm{z}_{k},\bm{u}_{k},\bm{c}_{k},\boldsymbol{\phi}_{k})$
5:     $\bm{u}_{k+1}=\mathcal{T}_{u}(\bm{z}_{k+1},\bm{c}_{k},\boldsymbol{\phi}_{k})$
6:     $\bm{c}_{k+1}=\mathcal{T}_{c}(\bm{z}_{k+1},\bm{u}_{k+1},\boldsymbol{\phi}_{k})$
7:     $\boldsymbol{\phi}_{k+1}=\boldsymbol{\phi}_{k}+\sigma_{k+1}\bm{\xi}_{k+1}$
8:     if $\frac{1}{d}\delta_{x,k}^{2}+\frac{1}{n}\left(\delta_{z,k}^{2}+\delta_{u,k}^{2}+\delta_{c,k}^{2}+\lVert\bm{\xi}_{k+1}\rVert_{2}^{2}\right)<\tau$ then
9:        return $(\bm{x}_{k+1},\bm{z}_{k+1},\bm{u}_{k+1},\bm{c}_{k+1})$
10:     end if
11:  end for
Output: $(\bm{x}_{K},\bm{z}_{K},\bm{u}_{K},\bm{c}_{K})$

Next, for the update of $\bm{x}$, which requires optimizing over the non-convex generative NN $G$, we, similar to [21, 24, 23], use adaptive moment estimation (Adam) [46], a gradient descent (GD)-based optimizer, with a step size of $10^{-2}$. The gradient, $\nabla_{\bm{x}}\mathcal{L}_{\rho}$, for each GD step is calculated using backpropagation and automatic differentiation [47] in TensorFlow [48]. We will write

$\mathcal{T}_{x}^{(J_{x})}\equiv\mathcal{T}_{x}^{(J_{x})}(\bm{x},\bm{z},\boldsymbol{\phi})=J_{x}\text{ GD steps on }\mathcal{L}_{\rho}\text{ w.r.t. }\bm{x}.$

As this update requires $R_{x}$ to be differentiable, we instead use $J_{x}$ ISTA or proximal gradient descent (PGD) steps to minimize $\mathcal{L}_{\rho}$ w.r.t. $\bm{x}$ when $R_{x}$ is non-smooth, and also denote these $J_{x}$ steps by $\mathcal{T}_{x}^{(J_{x})}$ for simplicity.
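Both the $\bm{z}$ and $\bm{x}$ subproblems therefore reduce to proximal-gradient-type iterations. Below is a minimal NumPy sketch of a single ISTA-style step on $\mathcal{L}_{\rho}$ w.r.t. $\bm{z}$, with the gradient of the smooth part read off from (9) and the $\mu\lVert\bm{z}\rVert_{1}$ term handled by soft-thresholding; FISTA adds a momentum extrapolation on top of this step, and the step size `alpha` is an assumed input rather than a value from the paper.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista_step_z(z, x, u, c, phi1, phi2, G, rho, mu, alpha):
    """One proximal-gradient (ISTA) step on L_rho w.r.t. z.

    Smooth part: <phi1, z - G(x)> + <phi2, c - z*u>
                 + (rho/2)(||z - G(x)||^2 + ||c - z*u||^2).
    Non-smooth part: mu * ||z||_1, handled by soft-thresholding.
    """
    grad = (phi1 + rho * (z - G(x))              # from the z = G(x) constraint
            - u * phi2 + rho * u * (z * u - c))  # from the c = z*u constraint
    return soft_threshold(z - alpha * grad, alpha * mu)
```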

The dual variable, $\boldsymbol{\phi}$, is updated via a single gradient ascent step on $\mathcal{L}_{\rho}$, as is typical in ADMM [38, 39, 40]. That is,

$\boldsymbol{\phi}\leftarrow\boldsymbol{\phi}+\sigma\nabla_{\boldsymbol{\phi}}\mathcal{L}_{\rho}=\boldsymbol{\phi}+\sigma\bm{\xi}$

where $\sigma$ is a real-valued step size parameter.

As specified in Algorithm 1, the initial estimates are given by $\bm{x}_{0}\sim\mathcal{N}(\bm{0},I)$, $\bm{z}_{0}=G(\bm{x}_{0})$, $\bm{u}_{0}=\mathcal{T}(\bm{z}_{0})$, and $\bm{c}_{0}=\bm{z}_{0}\odot\bm{u}_{0}$ where, for $A_{\bm{z}}=A\,\text{Diag}(\bm{z})$,

$\mathcal{T}(\bm{z})\coloneqq\underset{\bm{u}}{\arg\min}\,\,\widehat{F}(\bm{z},\bm{u})=(A_{\bm{z}}^{T}A_{\bm{z}}+\lambda\Sigma_{u})^{-1}(A_{\bm{z}}^{T}\bm{y}+\lambda\Sigma_{u}\bm{\mu}_{u})$

for $\widehat{F}(\bm{z},\bm{u})=\frac{1}{2}\lVert\bm{y}-A(\bm{z}\odot\bm{u})\rVert_{2}^{2}+\frac{\lambda}{2}(\bm{u}-\bm{\mu}_{u})^{T}\Sigma_{u}(\bm{u}-\bm{\mu}_{u})$. This particular choice of $\bm{u}_{0}$ is inspired by previous work in regularizing IPs with a CG prior [8, 9].

Lastly, define $\delta_{x,k}=\lVert\bm{x}_{k+1}-\bm{x}_{k}\rVert_{2}$ and similarly define $\delta_{z,k},\delta_{u,k},$ and $\delta_{c,k}$. Convergence of Algorithm 1 is determined by $\frac{1}{d}\delta_{x,k}^{2}+\frac{1}{n}\left(\delta_{z,k}^{2}+\delta_{u,k}^{2}+\delta_{c,k}^{2}+\lVert\bm{\xi}_{k+1}\rVert_{2}^{2}\right)<\tau$ for a tolerance parameter $\tau$. This stopping condition, informed by previous ADMM work [38, 25], ensures that the mean squared error in both the feasibility gap and the change in each of the four primal variables is sufficiently small.
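A sketch of this stopping test, assuming NumPy arrays and with the tolerance passed in as an argument, is given below.

```python
import numpy as np

def acggan_converged(x_new, x_old, z_new, z_old, u_new, u_old,
                     c_new, c_old, xi_new, tol=1e-6):
    """Stopping test of Algorithm 1: average squared change in the primal
    variables plus the average squared feasibility gap must fall below tol."""
    d, n = x_new.size, z_new.size
    primal_x = np.sum((x_new - x_old) ** 2) / d
    primal_rest = (np.sum((z_new - z_old) ** 2) + np.sum((u_new - u_old) ** 2)
                   + np.sum((c_new - c_old) ** 2) + np.sum(xi_new ** 2)) / n
    return primal_x + primal_rest < tol
```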

We make a few remarks about Algorithm 1. First, (11) and (12) can be quickly calculated using the SVD. Specifically, if $A=USV^{T}$ is the SVD of $A$, then (11) can be rewritten as

$\mathcal{T}_{c}=V(S^{T}S+\rho I)^{-1}V^{T}(A^{T}\bm{y}+\rho\bm{z}\odot\bm{u}-\bm{\varphi}_{2}),$

which is easily calculated since $S^{T}S+\rho I$ is a diagonal matrix. Similarly, for (12), since $\Sigma_{u}$ is symmetric and positive definite, the SVD of $\Sigma_{u}$ has the form $\Sigma_{u}=VSV^{T}$, which reduces the matrix inverse in (12) to inverting a diagonal matrix. Therefore, by performing the SVD of $A$ and $\Sigma_{u}$ upfront, only diagonal matrices need to be inverted during the iterations of Algorithm 1, which provides a considerable time reduction.
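A sketch of this precomputation for the $\bm{c}$-update (11) is shown below, using a thin SVD so that directions outside the row space of $A$ are handled explicitly by the $1/\rho$ term; the function names and array conventions are our own assumptions.

```python
import numpy as np

def precompute_svd(A, rho):
    """Precompute the SVD-based factors needed for the c-update in (11)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    diag_inv = 1.0 / (s ** 2 + rho)       # inverse of (S^T S + rho I) on the span of V
    return Vt, diag_inv

def update_c(z, u, phi2, y, A, rho, Vt, diag_inv):
    """c-update T_c of (11) using the precomputed thin SVD of A.

    The complement of the span of V is scaled by 1/rho, since A^T A vanishes there."""
    rhs = A.T @ y + rho * z * u - phi2
    proj = Vt @ rhs
    return Vt.T @ (diag_inv * proj) + (rhs - Vt.T @ proj) / rho
```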

Finally, as A-CG-GAN expands on the methodology introduced in [25], we briefly explain the core difference. In [25], only a single constraint from the GAN prior, specifically $\bm{c}=G(\bm{x})$, is imposed. Accordingly, each iteration of the ADMM algorithm in [25] consists of two primal variable updates, for $\bm{c}$ and $\bm{x}$, similar to lines 3 and 4 in Algorithm 1, and one dual variable update $\boldsymbol{\phi}\leftarrow\boldsymbol{\phi}+\sigma(\bm{c}_{k+1}-G(\bm{x}_{k+1}))$, similar to line 7 in Algorithm 1. In our proposed A-CG-GAN, we have a similar GAN prior constraint, now involving the generator, $G$, and the scale variable, $\bm{z}$, as we constrain $\bm{z}$ to be in the range of $G$. Uniquely, we introduce a second constraint to additionally enforce the CG prior by constraining $\bm{c}$ to be the Hadamard product of a scale variable, $\bm{z}$, and a Gaussian variable, $\bm{u}$.

III Convergence Results

III-A Preliminaries

For our theoretical analysis in this section and the Appendix, we will use the following notation and nomenclature:

  1. $\bm{\zeta}_{1}\coloneqq\bm{x}$, $\bm{\zeta}_{2}\coloneqq\bm{z}$, $\bm{\zeta}_{3}\coloneqq\bm{u}$, and $\bm{\zeta}_{4}\coloneqq\bm{c}$

  2. $\bm{\zeta}_{1}^{(k)}\coloneqq\bm{x}_{k}$, $\bm{\zeta}_{2}^{(k)}\coloneqq\bm{z}_{k}$, $\bm{\zeta}_{3}^{(k)}\coloneqq\bm{u}_{k}$, and $\bm{\zeta}_{4}^{(k)}\coloneqq\bm{c}_{k}$

  3. $\bm{\zeta}_{1}^{*}\coloneqq\bm{x}^{*}$, $\bm{\zeta}_{2}^{*}\coloneqq\bm{z}^{*}$, $\bm{\zeta}_{3}^{*}\coloneqq\bm{u}^{*}$, and $\bm{\zeta}_{4}^{*}\coloneqq\bm{c}^{*}$

  4. $\bm{v}\coloneqq[\bm{\zeta}_{i}]_{i=1}^{4}$, $\bm{v}_{k}\coloneqq[\bm{\zeta}_{i}^{(k)}]_{i=1}^{4}$, and $\bm{v}^{*}\coloneqq[\bm{\zeta}_{i}^{*}]_{i=1}^{4}$

  5. $\bm{v}_{i_{1}:i_{2}}\coloneqq[\bm{\zeta}_{i}]_{i=i_{1}}^{i_{2}}$ and $\bm{v}_{i_{1}:i_{2}}^{(k)}\coloneqq[\bm{\zeta}_{i}^{(k)}]_{i=i_{1}}^{i_{2}}$

  6. $\Delta_{i,k}\coloneqq\max\left\{\lVert\bm{\zeta}_{i}^{*}-\bm{\zeta}_{i}^{(k)}\rVert_{2},\,\lVert\bm{\zeta}_{i}^{*}-\bm{\zeta}_{i}^{(k+1)}\rVert_{2}\right\}$

  7. $\Delta_{(2,3),k}\coloneqq\lVert\bm{z}^{*}\odot\bm{u}^{*}-\bm{z}_{k}\odot\bm{u}_{k}\rVert_{2}$

  8. $\delta_{i,k}\coloneqq\lVert\bm{\zeta}_{i}^{(k+1)}-\bm{\zeta}_{i}^{(k)}\rVert_{2}$

  9. $\boldsymbol{\phi}_{k}\coloneqq\begin{bmatrix}\bm{\varphi}_{1,k}^{T}&\bm{\varphi}_{2,k}^{T}\end{bmatrix}^{T}$ and $\boldsymbol{\phi}^{*}\coloneqq\begin{bmatrix}(\bm{\varphi}_{1}^{*})^{T}&(\bm{\varphi}_{2}^{*})^{T}\end{bmatrix}^{T}$

  10. $\mathcal{L}_{k}\coloneqq\mathcal{L}_{\rho}(\bm{v}^{(k)}_{1:4},\boldsymbol{\phi}_{k})$ and $\mathcal{L}_{*}\coloneqq\mathcal{L}_{\rho}(\bm{v}^{*},\boldsymbol{\phi}^{*})$

  11. $\overline{\mathcal{L}}_{k}\coloneqq\overline{\mathcal{L}}_{\rho}(\bm{v}^{(k)}_{1:4},\boldsymbol{\phi}_{k})$ and $\overline{\mathcal{L}}_{*}\coloneqq\overline{\mathcal{L}}_{\rho}(\bm{v}^{*},\boldsymbol{\phi}^{*})$ for $\overline{\mathcal{L}}_{\rho}$ defined in (14)

  12. $\bm{\xi}_{1,k}\equiv\bm{\xi}_{1,k}(\bm{x}_{k},\bm{z}_{k})\coloneqq\bm{z}_{k}-G(\bm{x}_{k})$

  13. $\bm{\xi}_{2,k}\equiv\bm{\xi}_{2,k}(\bm{z}_{k},\bm{u}_{k},\bm{c}_{k})\coloneqq\bm{c}_{k}-\bm{z}_{k}\odot\bm{u}_{k}$

  14. $\bm{\xi}_{k}\equiv\bm{\xi}_{k}(\bm{x}_{k},\bm{z}_{k},\bm{u}_{k},\bm{c}_{k})\coloneqq\begin{bmatrix}\bm{\xi}_{1,k}^{T}&\bm{\xi}_{2,k}^{T}\end{bmatrix}^{T}$

  15. $\mathcal{B}^{n}_{t}\coloneqq\{\bm{t}\in\mathbb{R}^{n}:\lVert\bm{t}\rVert_{2}\leq t\}$

Additionally, we will generalize the cost function in (7) to

$F(\bm{v})\coloneqq f_{\bm{y}}(\bm{c})+R_{1}(\bm{x})+R_{2}(\bm{z})+R_{3}(\bm{u})+R_{4}(\bm{c})$ (13)

where $f_{\bm{y}}:\mathbb{R}^{n}\to\mathbb{R}$ is a data fidelity term and $R_{1},R_{2},R_{3},$ and $R_{4}$ are regularization functions on $\bm{x},\bm{z},\bm{u},$ and $\bm{c}$, respectively. We make the following assumptions about the terms of (13).

Assumption 1.

Each function $f_{\bm{y}}$ and $R_{i}$ is continuous, convex, and has Lipschitz continuous gradient with constant $L_{f}$ and $L_{i}$, respectively.

We define $\mathcal{L}_{\rho}$ as in (9) with $F$ now given in (13) and define an adjusted augmented Lagrangian as

$\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})\coloneqq f_{\bm{y}}(\bm{c})+\langle\boldsymbol{\phi},\bm{\xi}\rangle+\frac{\rho}{2}\lVert\bm{\xi}\rVert_{2}^{2}.$ (14)

That is, $\overline{\mathcal{L}}_{\rho}$ is the augmented Lagrangian without the regularization terms. Despite the generalization of the cost function, we assume that, as holds in Section II, the minimizer of $\mathcal{L}_{\rho}$ w.r.t. each $\bm{\zeta}_{i}$ individually is well approximated by GD/ISTA/FISTA steps (or, for $\bm{u}$ and $\bm{c}$, is analytically known).

Note that we set $\mathcal{D}_{x}=\mathcal{B}_{x_{\infty}}^{d}$, $\mathcal{D}_{z}=\mathcal{B}_{z_{\infty}}^{n}$, $\mathcal{D}_{u}=\mathcal{B}_{u_{\infty}}^{n}$, and $\mathcal{D}_{c}=\mathcal{B}_{c_{\infty}}^{n}$. Furthermore, to prove convergence, we will take an adaptive dual-variable step size of

$\sigma_{k+1}=\min\left\{\sigma_{0},\,\frac{\sigma_{0}}{(k+1)\ln^{2}(k+2)\,\lVert\bm{\xi}_{k+1}\rVert_{2}}\right\}$ (15)

for some initial step size $\sigma_{0}$, which is utilized in [25] and motivated by [39] as a guarantee on maintaining a bounded dual variable. While such a step size is chosen to provide theoretical guarantees, we find, in practice, that a constant step size $\sigma_{k+1}=\rho\leq 1$ is sufficient for bounded dual variables.

Lastly, as our results build off of the convergence properties discussed in [25], which again considers an ADMM algorithm with a single GAN prior constraint, we import necessary assumptions on $G$, the generative NN, from [25].

Assumption 2.

For any $\widehat{\bm{x}},\widetilde{\bm{x}}\in\mathbb{R}^{d}$ the following bounds hold

$\nu_{G}\lVert\widehat{\bm{x}}-\widetilde{\bm{x}}\rVert_{2}\leq\lVert G(\widehat{\bm{x}})-G(\widetilde{\bm{x}})\rVert_{2}\leq\tau_{G}\lVert\widehat{\bm{x}}-\widetilde{\bm{x}}\rVert_{2}$

$\lVert G(\widehat{\bm{x}})-G(\widetilde{\bm{x}})-(\nabla_{\bm{x}}G(\widetilde{\bm{x}}))^{T}(\widehat{\bm{x}}-\widetilde{\bm{x}})\rVert_{2}\leq\frac{L_{G}}{2}\lVert\widehat{\bm{x}}-\widetilde{\bm{x}}\rVert_{2}^{2}$

for positive constants $\nu_{G},\tau_{G},$ and $L_{G}$.

We remark that the first inequality in Assumption 2 implies that $G$ is a near-isometry, while the second inequality is a strong smoothness requirement on $G$. See [25] for more details about generative NNs that satisfy Assumption 2.
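As a simple sanity check (our own illustrative example, not taken from [25]), a linear generator $G(\bm{x})=W\bm{x}$ with a full-column-rank weight matrix $W$ satisfies Assumption 2:

```latex
% Illustrative example: a linear generator G(x) = Wx with full column rank.
\[
G(\widehat{\bm{x}})-G(\widetilde{\bm{x}}) = W(\widehat{\bm{x}}-\widetilde{\bm{x}})
\;\Longrightarrow\;
\sigma_{\min}(W)\lVert\widehat{\bm{x}}-\widetilde{\bm{x}}\rVert_2
\le \lVert G(\widehat{\bm{x}})-G(\widetilde{\bm{x}})\rVert_2
\le \sigma_{\max}(W)\lVert\widehat{\bm{x}}-\widetilde{\bm{x}}\rVert_2,
\]
\[
G(\widehat{\bm{x}})-G(\widetilde{\bm{x}})-(\nabla_{\bm{x}}G(\widetilde{\bm{x}}))^{T}(\widehat{\bm{x}}-\widetilde{\bm{x}}) = \bm{0},
\]
```

so the assumption holds with $\nu_{G}=\sigma_{\min}(W)$, $\tau_{G}=\sigma_{\max}(W)$, and any $L_{G}>0$; trained NN generators satisfy such bounds only approximately, as discussed in [25].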

III-B Core Theoretical Guarantees

In this section, we provide our main technical results for the convergence of our generalized A-CG-GAN algorithm, which are inspired by the previous work in [25]. For this, we let 𝒗\bm{v}^{*} be some solution of (6), with FF given in (13), and ϕ\boldsymbol{\phi}^{*} be the corresponding optimal dual variable from (10).

First, we present four propositions bounding the augmented Lagrangian change over an update of each primal variable.

Proposition 1.

Let α4=2αc1+2αc\alpha_{4}=\frac{2\alpha_{c}}{1+2\alpha_{c}} for any αc(Lf+ρ)1\alpha_{c}\leq(L_{f}+\rho)^{-1}. Then for any θ[0,1]\theta\in[0,1] and kk\in\mathbb{N} the following bound holds

¯ρ(𝒗1:4(k+1),ϕk)+R4(𝒄k+1)\displaystyle\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:4}^{(k+1)},\boldsymbol{\phi}_{k})+R_{4}(\bm{c}_{k+1})
¯ρ(𝒗1:3(k+1),𝒄k,ϕk)+θ𝒄𝒄k,𝒄¯k+θ2α4Δ4,k2\displaystyle\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})+\theta\langle\bm{c}^{*}-\bm{c}_{k},\nabla_{\bm{c}}\overline{\mathcal{L}}_{k}\rangle+\frac{\theta^{2}}{\alpha_{4}}\Delta_{4,k}^{2}
+(ρu)22δ2,k2+(ρz)22δ3,k2+θR4(𝒄)+(1θ)R4(𝒄k).\displaystyle\hskip 7.11317pt+\frac{(\rho u_{\infty})^{2}}{2}\delta_{2,k}^{2}+\frac{(\rho z_{\infty})^{2}}{2}\delta_{3,k}^{2}+\theta R_{4}(\bm{c}^{*})+(1-\theta)R_{4}(\bm{c}_{k}).
Proposition 2.

Let α3=2αu1+αu\alpha_{3}=\frac{2\alpha_{u}}{1+\alpha_{u}} for any αu(ρz2)1.\alpha_{u}\leq(\rho z_{\infty}^{2})^{-1}. Then for any θ[0,1]\theta\in[0,1] and kk\in\mathbb{N} there exists a γ0\gamma\geq 0 such that

¯ρ(𝒗1:3(k+1),𝒄k,ϕk)+R3(𝒖k+1)\displaystyle\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})+R_{3}(\bm{u}_{k+1})
¯ρ(𝒗1:2(k+1),𝒗3:4(k),ϕk)+θ𝒖𝒖k,𝒖¯k+θ2α3Δ3,k2\displaystyle\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+\theta\langle\bm{u}^{*}-\bm{u}_{k},\nabla_{\bm{u}}\overline{\mathcal{L}}_{k}\rangle+\frac{\theta^{2}}{\alpha_{3}}\Delta_{3,k}^{2}
+γ2δ2,k2+θR3(𝒖)+(1θ)R3(𝒖k).\displaystyle+\frac{\gamma}{2}\delta_{2,k}^{2}+\theta R_{3}(\bm{u}^{*})+(1-\theta)R_{3}(\bm{u}_{k}).
Proposition 3.

Let α2=2αz1+αz\alpha_{2}=\frac{2\alpha_{z}}{1+\alpha_{z}} for any αz(ρ(1+u2))1.\alpha_{z}\leq(\rho(1+u_{\infty}^{2}))^{-1}. Then for any θ[0,1]\theta\in[0,1] and kk\in\mathbb{N} the following bound holds

¯ρ(𝒗1:2(k+1),𝒗3:4(k),ϕk)+R2(𝒛k+1)\displaystyle\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+R_{2}(\bm{z}_{k+1})
¯ρ(𝒙k+1,𝒗2:4(k),ϕk)+θ𝒛𝒛k,𝒛¯k+θ2α2Δ2,k2\displaystyle\leq\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+\theta\langle\bm{z}^{*}-\bm{z}_{k},\nabla_{\bm{z}}\overline{\mathcal{L}}_{k}\rangle+\frac{\theta^{2}}{\alpha_{2}}\Delta_{2,k}^{2}
+(ρτG)22δ1,k2+θR2(𝒛)+(1θ)R2(𝒛k).\displaystyle\hskip 7.11317pt+\frac{(\rho\tau_{G})^{2}}{2}\delta_{1,k}^{2}+\theta R_{2}(\bm{z}^{*})+(1-\theta)R_{2}(\bm{z}_{k}).
Proposition 4.

For kk\in\mathbb{N}, let α12(LG(4σ0+ρ𝛏1,k2)+ρτG2)1\alpha_{1}\leq 2(L_{G}(4\sigma_{0}+\rho\lVert\bm{\xi}_{1,k}\rVert_{2})+\rho\tau_{G}^{2})^{-1}. Then for any θ[0,1]\theta\in[0,1] the following bound holds

¯ρ(𝒙k+1,𝒗2:4(k),ϕk)+R1(𝒙k+1)¯k+θ𝒙𝒙k,𝒙¯k+θ2α1Δ1,k2+θR1(𝒙)+(1θ)R1(𝒙k).\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+R_{1}(\bm{x}_{k+1})\\ \leq\overline{\mathcal{L}}_{k}+\theta\langle\bm{x}^{*}-\bm{x}_{k},\nabla_{\bm{x}}\overline{\mathcal{L}}_{k}\rangle+\frac{\theta^{2}}{\alpha_{1}}\Delta_{1,k}^{2}\\ +\theta R_{1}(\bm{x}^{*})+(1-\theta)R_{1}(\bm{x}_{k}).

Proofs of Propositions 1, 2, 3, and 4 are provided in Appendices VI-C, VI-D, VI-E, and VI-F, respectively.

Next, we combine the previous four propositions to bound the augmented Lagrangian change over one iteration of A-CG-GAN.

Proposition 5.

For any θ[0,1]\theta\in[0,1] and every kk\in\mathbb{N} there exists positive constants {(αj,βj)}j=14\{(\alpha_{j},\beta_{j})\}_{j=1}^{4} such that

k+1k+j=14βjΔj,k2+θ2j=14Δj,k2αj+θ(k).\displaystyle\mathcal{L}_{k+1}\leq\mathcal{L}_{k}+\sum_{j=1}^{4}\beta_{j}\Delta_{j,k}^{2}+\theta^{2}\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}+\theta(\mathcal{L}_{*}-\mathcal{L}_{k}).

Proposition 5, which provides a bound on the change in the augmented Lagrangian, is proved in Appendix VI-G.

Finally, we present our main convergence theorem.

Theorem 6.

Let the assumptions of Lemma 12 hold and set {κj}j=14\{\kappa_{j}\}_{j=1}^{4} and μ\mu as given in Lemma 12. Choose {(αj,βj)}j=14\{(\alpha_{j},\beta_{j})\}_{j=1}^{4}, from Proposition 5, to satisfy β^β~434\widehat{\beta}\leq\frac{\widetilde{\beta}}{4}\leq\frac{3}{4} where β^max1i4αiβi\widehat{\beta}\coloneqq\sqrt{\max_{1\leq i\leq 4}\,\,\alpha_{i}\beta_{i}} and β~min1i4αiκi4\widetilde{\beta}\coloneqq\min_{1\leq i\leq 4}\,\,\frac{\alpha_{i}\kappa_{i}}{4}. Define η^4μ(ρβ~)1\widehat{\eta}\coloneqq 4\mu(\rho\widetilde{\beta})^{-1} and let j=14Δj,k2αjη^.\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\geq\widehat{\eta}. Then for δβ~8(0,1)\delta\coloneqq\frac{\widetilde{\beta}}{8}\in(0,1)

j=14Δj,k2αj(1δ)k016β^+η^16.\displaystyle\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\leq(1-\delta)^{k}\frac{\mathcal{L}_{0}-\mathcal{L}_{*}}{16\widehat{\beta}}+\frac{\widehat{\eta}}{16}.

A proof of Theorem 6 is provided in Appendix VI-H. We remark that Theorem 6 states that, outside of a neighborhood about some solution $\bm{v}^{*}$, our A-CG-GAN algorithm will converge, at worst, linearly with rate $(1-\delta)^{k}$ to this neighborhood. As the size of the neighborhood is proportional to $\widehat{\eta}$, which in turn is inversely proportional to $\rho$, we can decrease the size of the neighborhood by increasing the augmented Lagrangian penalty parameter $\rho$.

IV Empirical Results

Figure 1: Average SSIM, with 99% confidence intervals, for five reconstruction methods, using a generative adversarial network prior, which reconstructed one hundred $32\times 32$ test images (either CIFAR10 or CalTech101 downsampled to $32\times 32$ size). The average SSIM is presented as we either vary the compressive sensing ratio $\left(\frac{m}{n}\right)$ or the number of angles in the Radon transform. Our A-CG-GAN method outperforms, or performs comparably to, the prior art methods in all scenarios. Panels: (a) CIFAR10 estimates from CS with 60dB SNR; (b) CIFAR10 estimates from CT with 60dB SNR; (c) CIFAR10 estimates from CS with 40dB SNR; (d) CIFAR10 estimates from CT with 40dB SNR; (e) CalTech101 estimates from CS with 60dB SNR; (f) CalTech101 estimates from CT with 60dB SNR; (g) CalTech101 estimates from CS with 40dB SNR; (h) CalTech101 estimates from CT with 40dB SNR.

We examine our proposed Algorithm 1 for two linear IPs in imaging, namely CS (where $\Psi\in\mathbb{R}^{m\times n}$ has entries sampled from $\mathcal{N}(0,1/m)$) and X-ray computed tomography (where $\Psi$ corresponds to a Radon transform at a number of uniformly spaced angles). Comparisons are made against four prior-art methods that similarly consider a GAN prior in solving IPs, which we denote by Bora [21], Dhar [24], Shah [23], and Latorre [25]. Note that Bora, Shah, and Latorre all find a solution to the IP within the range of a generative NN, while Dhar considers a dual sparsity-GAN prior by finding a solution to the IP within sparse deviations from the range of a generative NN.

For data, we use $32\times 32$ CIFAR10 [49] images and CalTech101 [50] images downsampled to size $64\times 64$ (and $32\times 32$). With the CIFAR10 and $64\times 64$ CalTech101 datasets, we set aside 100 images to serve as the testing data and we train a DCGAN [51] on the remaining images, which are augmented using rotation and reflection. We use $\overline{G}:\mathbb{R}^{d}\to\mathbb{R}^{n}$, with $d=100$, to denote the DCGAN generator.

For training $\overline{G}$, which has a $\tanh$ output activation function, we scale and shift the training images onto the range $[-1,1]$ by replacing training image $\bm{I}$ by $\frac{2}{I_{\max}}\bm{I}-\bm{1}$, where $I_{\max}$ is the max pixel value of $\bm{I}$. For testing – i.e., solving the IP (1) – we scale the testing images onto the range $[0,1]$ by replacing test image $\bm{I}$ by $\bm{I}/I_{\max}$. A measurement matrix $\Psi$ is applied to each test image, after which white noise is added, producing noisy measurements, $\bm{y}$, at a specified signal-to-noise ratio (SNR). Additionally, for testing, we scale and shift the outputs from $\overline{G}$ onto the range $[0,1]$ by considering the generator $\widetilde{G}(\bm{x})=\frac{\overline{G}(\bm{x})+\bm{1}}{2}$. Note that scaling to the range $[0,1]$ for testing is conducted so that we can use the structural similarity index measure (SSIM), which requires input images with positive pixel values, to assess the reconstruction performance.

For our A-CG-GAN algorithm, we take the generative NN $G(\bm{x})=\widetilde{\Phi}\widetilde{G}(\bm{x})$, where $\widetilde{\Phi}\in\mathbb{R}^{n\times n}$ is a sparsity change-of-basis matrix. We apply the change-of-basis matrix since we desire for the generative NN to produce sparse scale variables rather than the image directly. Furthermore, we set $\bm{\mu}_{u}=\bm{0}$, $\Sigma_{u}=I$, $R_{x}\equiv 0$, $\tau=10^{-6}$, $K=1000$, and $J_{x}=J=10$ for all testing of A-CG-GAN. For each comparison method [21, 24, 23, 25], we instead use $G(\bm{x})=\widetilde{G}(\bm{x})$ as the generative NN prior, and a grid search is performed to find the best hyperparameters for each of the four comparison methods. Finally, for each comparison method, random restarts were implemented as specified in the respective work [21, 23, 24, 25]. We remark that using the same underlying image-generating NN provides fairness in the test IP comparisons since we guarantee the generative NN used in our method and each comparison method has the same training and quality.
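To illustrate this design choice, the sketch below wraps a toy stand-in for the trained generator $\widetilde{G}$ with a one-dimensional orthonormal DCT acting on the flattened output; the toy generator and the 1D DCT are simplifications assumed only for this example, not the exact setup of our experiments.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(2)
d, side = 100, 32
n = side * side

# Toy stand-in for the trained DCGAN generator G_tilde with outputs in [0, 1];
# in the paper this is a trained network, here just a fixed affine map + sigmoid.
W, b = rng.normal(size=(n, d)) / np.sqrt(d), rng.normal(size=n)
G_tilde = lambda x: 1.0 / (1.0 + np.exp(-(W @ x + b)))

def G(x):
    """Scale-variable generator G(x) = Phi_tilde @ G_tilde(x), with Phi_tilde
    realized here as a 1D orthonormal DCT of the flattened generator output."""
    return dct(G_tilde(x), norm='ortho')

z0 = G(rng.normal(size=d))   # initial scale variable, as in line 1 of Algorithm 1
print(z0.shape)
```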

Shown in Fig. 1 is the average SSIM – larger values denote a reconstruction more closely matching the original image – over the reconstructions of one hundred $32\times 32$ images from A-CG-GAN and each of the four comparison methods. Each plot in Fig. 1 gives both the average SSIM and the 99% confidence interval – visualized by the background shading – as the dimension of $\bm{y}$ is varied in both CS and CT problems. The underlying DCGAN generator, $\overline{G}$, used for each of these IPs has been trained on 50000 CIFAR10 images. Through a grid search, we set the hyperparameters of A-CG-GAN to be $(\lambda,\mu,\rho)=(0.1,10^{-3},1)$ and $(\lambda,\mu,\rho)=(1,10^{-2},1)$ for CT reconstructions from 60dB and 40dB SNR measurements, respectively. Furthermore, we set $(\lambda,\mu,\rho)=(10^{-6},10^{-6},10^{-4})$ as the hyperparameters of A-CG-GAN for CS reconstructions. Additionally, we use $\widetilde{\Phi}=\Phi=$ the discrete cosine transformation and $\widetilde{\Phi}=\Phi=I$ for CS and CT problems, respectively.

In Figs. 1(a), 1(b) and Figs. 1(c), 1(d), CIFAR10 images are reconstructed from CS and CT measurements with an SNR of 60dB and 40dB, respectively. In contrast, in Figs. 1(e), 1(f), 1(g), and 1(h), we show the generalizability of our method by reconstructing CalTech101 images, downsampled to size $32\times 32$, while using the same CIFAR10-trained $\overline{G}$.

Figure 2: Average SSIM, with 99% confidence intervals, for five iterative reconstruction methods, using a generative adversarial network prior, which reconstructed one hundred $64\times 64$ test CalTech101 images. The average SSIM is presented as we either vary the compressive sensing ratio $\left(\frac{m}{n}\right)$ or the number of angles in the Radon transform. Our A-CG-GAN method outperforms the compared prior art methods in all scenarios. Panels: (a) CalTech101 estimates from CS with 60dB SNR; (b) CalTech101 estimates from CT with 60dB SNR; (c) CalTech101 estimates from CS with 40dB SNR; (d) CalTech101 estimates from CT with 40dB SNR.

We observe from Fig. 1 that A-CG-GAN outperforms, or performs comparably to, each of the four prior art methods in all scenarios. In particular, we highlight that the performance of A-CG-GAN does not saturate as the dimension of $\bm{y}$ increases – i.e., as the sampling from the forward measurement mapping increases – while the performance of the comparison methods in general does. Additionally, A-CG-GAN is able to generalize well to image distributions outside of the GAN training distribution, as shown in Figs. 1(e), 1(f), 1(g), and 1(h).

Next, shown in Fig. 2 is the average SSIM over the reconstructions of one hundred $64\times 64$ images from A-CG-GAN and each of the four comparison methods. Each plot in Fig. 2 gives both the average SSIM and the 99% confidence interval – visualized by the background shading – as the amount of sampling from the forward measurement mapping – i.e., the dimension of $\bm{y}$ – is varied in both CS and CT problems. The underlying DCGAN generator, $\overline{G}$, used for each of these IPs has been trained on a minimal dataset of roughly 9000 CalTech101 images of size $64\times 64$. Empirically, we choose $(\lambda,\mu,\rho)=0.25\times(10^{-6},10^{-6},10^{-4})$ in A-CG-GAN for CS reconstructions, $(\lambda,\mu,\rho)=(10,0.1,10)$ for 40dB CT reconstructions, and otherwise use the same setup as in the $32\times 32$ image reconstruction cases.

We observe from Fig. 2 that A-CG-GAN outperforms each of the four prior art methods in all scenarios. We highlight, in particular, that while a sub-optimal generator, due to a small training dataset, severely diminished the performance of each of the four comparison methods, A-CG-GAN still provides relatively high-quality reconstructed images as measured by SSIM. This is perhaps anticipated given the dual prior basis our method implements, which alleviates a complete dependence on the generative NN prior as in Bora, Shah, and Latorre [21, 23, 25]. Additionally, we again observe from Fig. 2 that A-CG-GAN eliminates the performance saturation with increasing sampling from the forward measurement mapping.

Lastly, we remark that Dhar [24] is, in general, the best comparison method to ours as seen in Fig. 1 and Fig. 2. This provides further credence to the benefit of using dual prior information in addition to the GAN prior as Dhar, similar to our dual CG-GAN prior, incorporates a dual sparsity-GAN prior.

V Conclusion and Future Work

Utilizing a dual-structured prior that consists of the powerful CG class of densities and a generative NN, we developed a novel iterative estimation algorithm for linear IPs. Specifically, we proposed an algorithm, called A-CG-GAN, that finds an IP solution satisfying the CG structure of (3) where the scale variable portion of the CG prior is constrained onto the range of a generative NN. Hence, within the informative CG class of distributions, the generative NN captures a rich scale variable statistical representation, which provides a wide, but informative, IP regularization.

We conducted a theoretical analysis on the convergence of A-CG-GAN and showed that, under mild assumptions on the generative NN, a linear rate of convergence is guaranteed up to a neighborhood about a constrained minimizer solution. Subsequently, thorough numerical validation of A-CG-GAN in tomographic imaging and CS IPs was presented. Across multiple datasets and a wide array of forward measurement mappings, we empirically demonstrated that A-CG-GAN outperforms competitive state-of-the-art techniques to IPs that use a GAN prior. Specifically, A-CG-GAN displays three key properties missing from previous techniques using GAN priors for IPs:

  1. Performance does not saturate as sampling by the forward measurement mapping increases.

  2. Generalization to dataset distributions other than the GAN training distribution is possible.

  3. A sub-optimal generative NN, possibly due to a lack of training data or inadequate training, still provides relatively high-quality reconstructions.

Building upon the new foundations established in this paper, we see numerous opportunities for future exploration, including analyzing the performance of A-CG-GAN on larger image reconstructions and on data corrupted by different noise models. Studying each of these objectives can provide greater insight into the practical, real-world applicability of our method.

Another intriguing line of work would be to use a generative NN in A-CG-GAN that has been trained to produce scale variables directly rather than sparsifying an image-generating NN as is done in this paper. We could anticipate a possible improvement in reconstruction performance over the numerical results presented in this paper as the generative NN would directly capture the statistics of the scale variables rather than implicitly capturing these statistics from the overall images.

A last point of future work is to empirically study A-CG-GAN, and each of the four comparison methods, when the generative NN has been trained on a single category of images. In particular, we would analyze image reconstruction performance when the test images are drawn from the training category versus from a distinct category. The datasets employed in this paper contain images from multiple categories – e.g., CIFAR10 has 10 categories and CalTech101 has 101 categories – creating significant statistical variability in the datasets, which can result in lower quality images from a generative NN. In contrast, when a generative NN is trained on a single category, it is often able to produce higher quality images from that category. Accordingly, we conjecture that A-CG-GAN would handle the reconstructions of test images distinct from the training category significantly better than the four comparison methods, in much the same manner that we observed in the $64\times 64$ CalTech101 reconstructions. Furthermore, we conjecture that A-CG-GAN (with $\bm{\mu}_{u}=\bm{1}$) may provide minimal improvement over, or be matched by, the four comparison methods when test images are sampled from the training category.

VI Appendix

In this appendix, we provide mathematical proof details for the main theoretical results of A-CG-GAN, presented in Section III-B, by extending the proof ideas in [25] from ADMM with a single GAN-prior constraint to our case of ADMM with dual-structured CG-GAN-prior constraints.

VI-A Preliminary Tools

Lemma 1 ([25]).

Let g:g:\mathbb{R}^{\ell}\to\mathbb{R} be a differentiable function and r:r:\mathbb{R}^{\ell}\to\mathbb{R} a convex possibly non-smooth function, then

proxαr(𝒘α𝒘g(𝒘))\displaystyle\textnormal{prox}_{\alpha r}(\bm{w}-\alpha\nabla_{\bm{w}}g(\bm{w}))
=argmin𝒕𝒕𝒘,𝒘g(𝒘)+(2α)1𝒕𝒘22+r(𝒕).\displaystyle\hskip 7.11317pt=\scalebox{1.0}{$\arg\min_{\bm{t}}$}\,\,\langle\bm{t}-\bm{w},\nabla_{\bm{w}}g(\bm{w})\rangle+(2\alpha)^{-1}\lVert\bm{t}-\bm{w}\rVert_{2}^{2}+r(\bm{t}).
Lemma 2 ([40]).

Let g:g:\mathbb{R}^{\ell}\to\mathbb{R} be differentiable and have LgL_{g}-Lipschitz continuous gradient. For any 𝐰,𝐰~\bm{w},\widetilde{\bm{w}}\in\mathbb{R}^{\ell}

g(𝒘)g(𝒘~)+𝒘𝒘~,𝒘g(𝒘~)+Lg2𝒘𝒘~22.\displaystyle g(\bm{w})\leq g(\widetilde{\bm{w}})+\langle\bm{w}-\widetilde{\bm{w}},\nabla_{\bm{w}}g(\widetilde{\bm{w}})\rangle+\frac{L_{g}}{2}\lVert\bm{w}-\widetilde{\bm{w}}\rVert_{2}^{2}. (16)
Lemma 3.

Let g:g:\mathbb{R}^{\ell}\to\mathbb{R} be a differentiable function satisfying (16) for some positive constant LgL_{g}, some 𝐰~\widetilde{\bm{w}}\in\mathbb{R}^{\ell}, and any 𝐰\bm{w}\in\mathbb{R}^{\ell}. Let r:r:\mathbb{R}^{\ell}\to\mathbb{R} be a convex and possibly non-smooth function. Let αLg1\alpha\leq L_{g}^{-1}. If 𝐰^=proxαr(𝐰~α𝐰g(𝐰~))\widehat{\bm{w}}=\textnormal{prox}_{\alpha r}(\widetilde{\bm{w}}-\alpha\nabla_{\bm{w}}g(\widetilde{\bm{w}})), then for any θ[0,1]\theta\in[0,1] and 𝐰\bm{w}^{*}\in\mathbb{R}^{\ell}

g(𝒘^)+r(𝒘^)\displaystyle g(\widehat{\bm{w}})+r(\widehat{\bm{w}}) g(𝒘~)+θ𝒘𝒘~,𝒘g(𝒘~)\displaystyle\leq g(\widetilde{\bm{w}})+\theta\langle\bm{w}^{*}-\widetilde{\bm{w}},\nabla_{\bm{w}}g(\widetilde{\bm{w}})\rangle
+θ22α𝒘𝒘~22+θr(𝒘)+(1θ)r(𝒘~).\displaystyle\hskip 7.11317pt+\frac{\theta^{2}}{2\alpha}\lVert\bm{w}^{*}-\widetilde{\bm{w}}\rVert_{2}^{2}+\theta r(\bm{w}^{*})+(1-\theta)r(\widetilde{\bm{w}}).
Proof.

By (16), αLg1\alpha\leq L_{g}^{-1}, and Lemma 1

g(𝒘^)+r(𝒘^)\displaystyle g(\widehat{\bm{w}})+r(\widehat{\bm{w}})
g(𝒘~)+𝒘^𝒘~,𝒘g(𝒘~)+12α𝒘^𝒘~22+r(𝒘^)\displaystyle\leq g(\widetilde{\bm{w}})+\langle\widehat{\bm{w}}-\widetilde{\bm{w}},\nabla_{\bm{w}}g(\widetilde{\bm{w}})\rangle+\frac{1}{2\alpha}\lVert\widehat{\bm{w}}-\widetilde{\bm{w}}\rVert_{2}^{2}+r(\widehat{\bm{w}})

Bounding the final line by setting 𝒕=θ𝒘+(1𝜽)𝒘~\bm{t}=\theta\bm{w}^{*}+(1-\bm{\theta})\widetilde{\bm{w}} and then using convexity of rr produces the desired result. ∎

Lemma 4.

The function $\overline{\mathcal{L}}_{\rho}$ has Lipschitz continuous gradient with respect to $\bm{z}$, $\bm{u}$, and $\bm{c}$ separately.

Proof.

Observe

[𝒛¯ρ(𝒗,ϕ)T𝒖¯ρ(𝒗,ϕ)T𝒄¯ρ(𝒗,ϕ)T]T=\displaystyle\begin{bmatrix}\nabla_{\bm{z}}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})^{T}&\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})^{T}&\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})^{T}\end{bmatrix}^{T}=
[𝝋1+ρ(𝒛G(𝒙))𝒖𝝋2+ρ𝒖(𝒛𝒖𝒄)𝒛𝝋2+ρ𝒛(𝒛𝒖𝒄)𝒄f𝒚(𝒄)+𝝋2+ρ(𝒄𝒛𝒖)],\displaystyle\hskip 7.11317pt\hskip 7.11317pt\hskip 7.11317pt\begin{bmatrix}\bm{\varphi}_{1}+\rho(\bm{z}-G(\bm{x}))-\bm{u}\odot\bm{\varphi}_{2}+\rho\bm{u}\odot(\bm{z}\odot\bm{u}-\bm{c})\\ -\bm{z}\odot\bm{\varphi}_{2}+\rho\bm{z}\odot(\bm{z}\odot\bm{u}-\bm{c})\\ \nabla_{\bm{c}}f_{\bm{y}}(\bm{c})+\bm{\varphi}_{2}+\rho(\bm{c}-\bm{z}\odot\bm{u})\end{bmatrix}, (17)

which implies $\nabla_{\bm{z}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})=\rho(D\{\bm{u}\}^{2}+I)$, $\nabla_{\bm{u}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})=\rho D\{\bm{z}\}^{2}$, and $\nabla_{\bm{c}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})=\nabla_{\bm{c}}^{2}f_{\bm{y}}(\bm{c})+\rho I$. Therefore, $\lVert\nabla_{\bm{z}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})\rVert_{2}\leq\rho(1+u_{\infty}^{2})$, $\lVert\nabla_{\bm{u}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})\rVert_{2}\leq\rho z_{\infty}^{2}$, and $\lVert\nabla_{\bm{c}}^{2}\overline{\mathcal{L}}_{\rho}(\bm{v},\boldsymbol{\phi})\rVert_{2}\leq L_{f}+\rho$ since $f_{\bm{y}}$ has $L_{f}$-Lipschitz continuous gradient. ∎

Lemma 5.

For any j{1,2,3,4}j\in\{1,2,3,4\} it holds that δj,k24Δj,k2.\delta_{j,k}^{2}\leq 4\Delta_{j,k}^{2}.

Proof.

Using the triangle inequality observe

$\delta_{j,k}\leq\lVert\bm{\zeta}_{j}^{*}-\bm{\zeta}_{j}^{(k+1)}\rVert_{2}+\lVert\bm{\zeta}_{j}^{*}-\bm{\zeta}_{j}^{(k)}\rVert_{2}\leq 2\Delta_{j,k}.\qquad\qed$
Lemma 6.

For any j{1,2}j\in\{1,2\} it holds that 𝛗j,k24σ0\lVert\bm{\varphi}_{j,k}\rVert_{2}\leq 4\sigma_{0}.

Proof.

Since 𝝋j,0=𝟎\bm{\varphi}_{j,0}=\bm{0} and 𝝋j,k=𝝋j,k1+σk𝝃j,k\bm{\varphi}_{j,k}=\bm{\varphi}_{j,k-1}+\sigma_{k}\bm{\xi}_{j,k}, then by the triangle inequality and (15)

𝝋j,k2=i=1kσi𝝃j,i2i=1kσi𝝃j,i2i=1kσ0iln2(i+1).\displaystyle\lVert\bm{\varphi}_{j,k}\rVert_{2}=\left\lVert\sum_{i=1}^{k}\sigma_{i}\bm{\xi}_{j,i}\right\rVert_{2}\leq\sum_{i=1}^{k}\sigma_{i}\lVert\bm{\xi}_{j,i}\rVert_{2}\leq\sum_{i=1}^{k}\frac{\sigma_{0}}{i\ln^{2}(i+1)}.

Finally, note that i=1k1iln2(i+1)i=11iln2(i+1)4\sum_{i=1}^{k}\frac{1}{i\ln^{2}(i+1)}\leq\sum_{i=1}^{\infty}\frac{1}{i\ln^{2}(i+1)}\leq 4. ∎

VI-B Dual-Variable Update Bound

Lemma 7.

For any $k\in\mathbb{N}$ there exist positive constants $\{\gamma_{i}\}_{i=1}^{4}$ such that $\sigma_{k+1}\lVert\bm{\xi}_{k+1}\rVert_{2}^{2}\leq 2\sigma_{0}\sum_{i=1}^{4}\gamma_{i}\Delta_{i,k}^{2}.$

Proof.

Using the feasibility condition 𝝃=𝟎\bm{\xi}^{*}=\bm{0}, the triangle inequality, Assumption 2, and (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) observe

𝝃1,k+122\displaystyle\lVert\bm{\xi}_{1,k+1}\rVert_{2}^{2} =𝒛k+1𝒛+G(𝒙)G(𝒙k+1)22\displaystyle=\lVert\bm{z}_{k+1}-\bm{z}^{*}+G(\bm{x}^{*})-G(\bm{x}_{k+1})\rVert_{2}^{2}
2Δ2,k2+2τG2Δ1,k2\displaystyle\leq 2\Delta_{2,k}^{2}+2\tau_{G}^{2}\Delta_{1,k}^{2}

and

𝝃2,k+122\displaystyle\lVert\bm{\xi}_{2,k+1}\rVert_{2}^{2} =𝒄k+1𝒄+𝒛𝒖𝒛k+1𝒖k+122\displaystyle=\lVert\bm{c}_{k+1}-\bm{c}^{*}+\bm{z}^{*}\odot\bm{u}^{*}-\bm{z}_{k+1}\odot\bm{u}_{k+1}\rVert_{2}^{2}
2Δ4,k2+(2z)2Δ3,k2+(2u)2Δ2,k2.\displaystyle\leq 2\Delta_{4,k}^{2}+(2z_{\infty})^{2}\Delta_{3,k}^{2}+(2u_{\infty})^{2}\Delta_{2,k}^{2}.

Combining the above two inequalities with σk+1σ0\sigma_{k+1}\leq\sigma_{0} and 𝝃k+122=𝝃1,k+122+𝝃2,k+122\lVert\bm{\xi}_{k+1}\rVert_{2}^{2}=\lVert\bm{\xi}_{1,k+1}\rVert_{2}^{2}+\lVert\bm{\xi}_{2,k+1}\rVert_{2}^{2} produces the desired bound for γ1=τG2\gamma_{1}=\tau_{G}^{2}, γ2=1+2u2\gamma_{2}=1+2u_{\infty}^{2}, γ3=2z2\gamma_{3}=2z_{\infty}^{2}, and γ4=1\gamma_{4}=1. ∎

Corollary 7.

For every kk\in\mathbb{N}

$\overline{\mathcal{L}}_{k+1}\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:4}^{(k+1)},\boldsymbol{\phi}_{k})+2\sigma_{0}\sum_{i=1}^{4}\gamma_{i}\Delta_{i,k}^{2}.$
Proof.

Using that ϕk+1=ϕk+σk+1𝝃k+1\boldsymbol{\phi}_{k+1}=\boldsymbol{\phi}_{k}+\sigma_{k+1}\bm{\xi}_{k+1} with (14) and then applying Lemma 7 produces the desired bound. ∎

VI-C Proof of Proposition 1

Define $\widehat{\bm{c}}_{k}=\text{prox}_{\alpha_{c}R_{4}}(\bm{c}_{k}-\alpha_{c}\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k}))$ to be the ISTA/PGD update of $\overline{\mathcal{L}}_{\rho}$ w.r.t. $\bm{c}$ at $\bm{c}_{k}$. Since $\bm{c}_{k+1}=\arg\min_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c},\boldsymbol{\phi}_{k})+R_{4}(\bm{c})$, then

\[
\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:4}^{(k+1)},\boldsymbol{\phi}_{k})+R_{4}(\bm{c}_{k+1})\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\widehat{\bm{c}}_{k},\boldsymbol{\phi}_{k})+R_{4}(\widehat{\bm{c}}_{k}).
\]

Combining this inequality with Lemmas 4, 2, and 3 gives

\begin{align}
&\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:4}^{(k+1)},\boldsymbol{\phi}_{k})+R_{4}(\bm{c}_{k+1})\nonumber\\
&\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})+\theta R_{4}(\bm{c}^{*})+(1-\theta)R_{4}(\bm{c}_{k})\nonumber\\
&\quad+\frac{\theta^{2}}{2\alpha_{c}}\Delta_{4,k}^{2}+\theta\langle\bm{c}^{*}-\bm{c}_{k},\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})\rangle\tag{18}
\end{align}

where, in Lemma 3, we take $g(\bm{c})=\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c},\boldsymbol{\phi}_{k})$ (which has $(L_{f}+\rho)$-Lipschitz continuous gradient by Lemma 4 and thus satisfies (16) by Lemma 2), $r=R_{4}$, $L_{g}=L_{f}+\rho$, $\widetilde{\bm{w}}=\bm{c}_{k}$, and $\bm{w}^{*}=\bm{c}^{*}$.

Next, by (17), for any $\bm{x},\widehat{\bm{x}},\widetilde{\bm{x}}\in\mathbb{R}^{d}$ and any $\bm{z},\widehat{\bm{z}},\widetilde{\bm{z}},\bm{u},\widehat{\bm{u}},\widetilde{\bm{u}},\bm{c}\in\mathbb{R}^{n}$ it holds that

\[
\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\widehat{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})-\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\widetilde{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})=\bm{0},
\]

that is, the gradient w.r.t. $\bm{c}$ is unchanged by the $\bm{x}$ update. Thus

\begin{align}
&\theta\langle\bm{c}^{*}-\bm{c}_{k},\nabla_{\bm{c}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})\rangle-\theta\langle\bm{c}^{*}-\bm{c}_{k},\nabla_{\bm{c}}\overline{\mathcal{L}}_{k}\rangle\nonumber\\
&=\theta\langle\bm{c}^{*}-\bm{c}_{k},-\rho(\bm{z}_{k+1}\odot\bm{u}_{k+1}-\bm{z}_{k}\odot\bm{u}_{k})\rangle\nonumber\\
&=\theta\langle\bm{c}^{*}-\bm{c}_{k},-\rho(\bm{z}_{k+1}-\bm{z}_{k})\odot\bm{u}_{k+1}\rangle\nonumber\\
&\quad+\theta\langle\bm{c}^{*}-\bm{c}_{k},-\rho\bm{z}_{k}\odot(\bm{u}_{k+1}-\bm{u}_{k})\rangle\nonumber\\
&\leq\theta^{2}\Delta_{4,k}^{2}+\frac{(\rho u_{\infty})^{2}}{2}\delta_{2,k}^{2}+\frac{(\rho z_{\infty})^{2}}{2}\delta_{3,k}^{2}\tag{19}
\end{align}

where we use $2\langle\bm{a},\bm{b}\rangle\leq\lVert\bm{a}\rVert_{2}^{2}+\lVert\bm{b}\rVert_{2}^{2}$ in the final inequality. Combining (18) and (19) produces the desired result. ∎
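
Each of Propositions 1-4 compares the corresponding block update of Algorithm 1 against a single ISTA/PGD step $\text{prox}_{\alpha R}(\bm{w}-\alpha\nabla g(\bm{w}))$. The following is a minimal sketch of such a step, assuming, purely for illustration, an $\ell_{1}$ regularizer (whose proximal operator is soft-thresholding) and a least-squares smooth term; neither choice is prescribed by the paper.

import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||.||_1 (an illustrative choice of regularizer R).
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_step(w, grad_g, alpha, reg_weight):
    # One ISTA/PGD step: gradient step on the smooth term g, then the prox of alpha*R.
    return soft_threshold(w - alpha * grad_g(w), alpha * reg_weight)

# Illustrative smooth term g(w) = 0.5 * ||B w - y||_2^2 with Lipschitz-continuous gradient.
rng = np.random.default_rng(1)
B, y = rng.standard_normal((30, 50)), rng.standard_normal(30)
grad_g = lambda w: B.T @ (B @ w - y)
alpha = 1.0 / np.linalg.norm(B, 2) ** 2   # step size 1/L_g

w = np.zeros(50)
for _ in range(100):
    w = ista_step(w, grad_g, alpha, reg_weight=0.1)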

VI-D Proof of Proposition 2

Let $\widehat{\bm{u}}_{k}=\text{prox}_{\alpha_{u}R_{3}}\left(\bm{u}_{k}-\alpha_{u}\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{u}_{k},\bm{c}_{k},\boldsymbol{\phi}_{k})\right)$ be the ISTA/PGD update of $\overline{\mathcal{L}}_{\rho}$ w.r.t. $\bm{u}$ at $\bm{u}_{k}$. Since $\bm{u}_{k+1}=\arg\min_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{u},\bm{c}_{k},\boldsymbol{\phi}_{k})+R_{3}(\bm{u})$, then

\begin{align}
\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})+R_{3}(\bm{u}_{k+1})\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\widehat{\bm{u}}_{k},\bm{c}_{k},\boldsymbol{\phi}_{k})+R_{3}(\widehat{\bm{u}}_{k}).\tag{20}
\end{align}

Define $g(\bm{u})=\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{u},\bm{c}_{k},\boldsymbol{\phi}_{k})$, which has $(\rho z_{\infty}^{2})$-Lipschitz continuous gradient by Lemma 4. Hence, $g(\bm{u})$ satisfies (16) by Lemma 2. Combining (20) with Lemma 3, where $g=g(\bm{u})$, $r=R_{3}$, $L_{g}=\rho z_{\infty}^{2}$, $\widetilde{\bm{w}}=\bm{u}_{k}$, and $\bm{w}^{*}=\bm{u}^{*}$, gives

\begin{align}
&\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:3}^{(k+1)},\bm{c}_{k},\boldsymbol{\phi}_{k})+R_{3}(\bm{u}_{k+1})\nonumber\\
&\leq\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+\theta R_{3}(\bm{u}^{*})+(1-\theta)R_{3}(\bm{u}_{k})\nonumber\\
&\quad+\frac{\theta^{2}}{2\alpha_{u}}\Delta_{3,k}^{2}+\theta\langle\bm{u}^{*}-\bm{u}_{k},\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})\rangle.\tag{21}
\end{align}

Next, by (17), for any $\bm{x},\widehat{\bm{x}},\widetilde{\bm{x}}\in\mathbb{R}^{d}$ and any $\bm{z},\widehat{\bm{z}},\widetilde{\bm{z}},\bm{u},\bm{c},\widehat{\bm{c}},\widetilde{\bm{c}}\in\mathbb{R}^{n}$ it holds that

\[
\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\widehat{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})-\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\widetilde{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})=\bm{0}
\]

and

\[
\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{x},\widehat{\bm{z}},\bm{v}_{3:4},\boldsymbol{\phi})-\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{x},\widetilde{\bm{z}},\bm{v}_{3:4},\boldsymbol{\phi})=(\rho\bm{u}\odot(\widehat{\bm{z}}+\widetilde{\bm{z}})-\bm{\varphi}_{2}-\rho\bm{c})\odot(\widehat{\bm{z}}-\widetilde{\bm{z}}).
\]

Thus

\begin{align}
&\theta\langle\bm{u}^{*}-\bm{u}_{k},\nabla_{\bm{u}}\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})\rangle-\theta\langle\bm{u}^{*}-\bm{u}_{k},\nabla_{\bm{u}}\overline{\mathcal{L}}_{k}\rangle\nonumber\\
&\leq\frac{\theta^{2}}{2}\Delta_{3,k}^{2}+\frac{(2\rho u_{\infty}z_{\infty}+4\sigma_{0}+\rho c_{\infty})^{2}}{2}\delta_{2,k}^{2}\tag{22}
\end{align}

where, in the final inequality, we use $2\langle\bm{a},\bm{b}\rangle\leq\lVert\bm{a}\rVert_{2}^{2}+\lVert\bm{b}\rVert_{2}^{2}$ and Lemma 6. Combining (21) and (22) produces the desired result with $\gamma=(2\rho u_{\infty}z_{\infty}+4\sigma_{0}+\rho c_{\infty})^{2}$. ∎

VI-E Proof of Proposition 3

Define $g(\bm{z})\coloneqq\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{z},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})$ and let $\widehat{\bm{z}}_{k}=\text{prox}_{\alpha_{z}R_{2}}\left(\bm{z}_{k}-\alpha_{z}\nabla_{\bm{z}}g(\bm{z}_{k})\right)$ be a single ISTA step on $g$ w.r.t. $\bm{z}$ from $\bm{z}_{k}$. Since $\bm{z}_{k+1}$ is the output of multiple FISTA/ISTA steps on $g+R_{2}$ w.r.t. $\bm{z}$ from $\bm{z}_{k}$, which decrease the cost function at least as much as a single ISTA step, then

\begin{align}
\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+R_{2}(\bm{z}_{k+1})\leq\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\widehat{\bm{z}}_{k},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+R_{2}(\widehat{\bm{z}}_{k}).\tag{23}
\end{align}

Note that $g$ has $(\rho(1+u_{\infty}^{2}))$-Lipschitz continuous gradient by Lemma 4. Hence, $g(\bm{z})$ satisfies (16) by Lemma 2. Combining (23) with Lemma 3, where $g=g(\bm{z})$, $r=R_{2}$, $L_{g}=\rho(1+u_{\infty}^{2})$, $\widetilde{\bm{w}}=\bm{z}_{k}$, and $\bm{w}^{*}=\bm{z}^{*}$, gives

\begin{align}
&\overline{\mathcal{L}}_{\rho}(\bm{v}_{1:2}^{(k+1)},\bm{v}_{3:4}^{(k)},\boldsymbol{\phi}_{k})+R_{2}(\bm{z}_{k+1})\nonumber\\
&\leq\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+\theta\langle\bm{z}^{*}-\bm{z}_{k},\nabla_{\bm{z}}\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})\rangle\nonumber\\
&\quad+\frac{\theta^{2}}{2\alpha_{z}}\Delta_{2,k}^{2}+\theta R_{2}(\bm{z}^{*})+(1-\theta)R_{2}(\bm{z}_{k}).\tag{24}
\end{align}

Next, by (17), for any $\bm{x},\widehat{\bm{x}},\widetilde{\bm{x}}\in\mathbb{R}^{d}$ and any $\bm{z},\bm{u},\bm{c}\in\mathbb{R}^{n}$ it holds that

\[
\nabla_{\bm{z}}\overline{\mathcal{L}}_{\rho}(\widehat{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})-\nabla_{\bm{z}}\overline{\mathcal{L}}_{\rho}(\widetilde{\bm{x}},\bm{v}_{2:4},\boldsymbol{\phi})=-\rho(G(\widehat{\bm{x}})-G(\widetilde{\bm{x}})).
\]

Thus, using $2\langle\bm{a},\bm{b}\rangle\leq\lVert\bm{a}\rVert_{2}^{2}+\lVert\bm{b}\rVert_{2}^{2}$ and Assumption 2,

\begin{align*}
&\theta\langle\bm{z}^{*}-\bm{z}_{k},\nabla_{\bm{z}}\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})\rangle-\theta\langle\bm{z}^{*}-\bm{z}_{k},\nabla_{\bm{z}}\overline{\mathcal{L}}_{k}\rangle\\
&=\theta\langle\bm{z}^{*}-\bm{z}_{k},-\rho(G(\bm{x}_{k+1})-G(\bm{x}_{k}))\rangle\\
&\leq\frac{\theta^{2}}{2}\Delta_{2,k}^{2}+\frac{\rho^{2}\tau_{G}^{2}}{2}\lVert\bm{x}_{k+1}-\bm{x}_{k}\rVert_{2}^{2}.
\end{align*}

Combining this inequality with (24) produces the desired result. ∎

VI-F Bound on x Update (Proof of Proposition 4)

First, we provide two lemmas to simplify the proof of Proposition 4. For these lemmas, define $\varrho_{k}:\mathbb{R}^{d}\to\mathbb{R}$ as

\[
\varrho_{k}(\bm{x})\coloneqq\lVert\bm{z}_{k}-G(\bm{x})\rVert_{2}^{2}-\lVert\bm{\xi}_{1,k}\rVert_{2}^{2}+2\langle(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}-\bm{x}_{k}),\bm{\xi}_{1,k}\rangle.
\]
Lemma 8.

For every $k\in\mathbb{N}$ and any $\bm{x}\in\mathbb{R}^{d}$

\[
\varrho_{k}(\bm{x})\leq(\tau_{G}^{2}+L_{G}\lVert\bm{\xi}_{1,k}\rVert_{2})\lVert\bm{x}-\bm{x}_{k}\rVert_{2}^{2}.
\]
Proof.

First note that for any $\bm{t},\widehat{\bm{t}},\widetilde{\bm{t}}\in\mathbb{R}^{n}$ it holds that

\[
\lVert\bm{t}-\widehat{\bm{t}}\rVert_{2}^{2}-\lVert\bm{t}-\widetilde{\bm{t}}\rVert_{2}^{2}+2\langle\widehat{\bm{t}}-\widetilde{\bm{t}},\bm{t}-\widetilde{\bm{t}}\rangle=\lVert\widehat{\bm{t}}-\widetilde{\bm{t}}\rVert_{2}^{2}.
\]
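
For completeness, this identity follows by expanding both squared norms and the inner product:

\begin{align*}
\lVert\bm{t}-\widehat{\bm{t}}\rVert_{2}^{2}-\lVert\bm{t}-\widetilde{\bm{t}}\rVert_{2}^{2}&=\lVert\widehat{\bm{t}}\rVert_{2}^{2}-\lVert\widetilde{\bm{t}}\rVert_{2}^{2}-2\langle\bm{t},\widehat{\bm{t}}-\widetilde{\bm{t}}\rangle,\\
2\langle\widehat{\bm{t}}-\widetilde{\bm{t}},\bm{t}-\widetilde{\bm{t}}\rangle&=2\langle\bm{t},\widehat{\bm{t}}-\widetilde{\bm{t}}\rangle-2\langle\widetilde{\bm{t}},\widehat{\bm{t}}-\widetilde{\bm{t}}\rangle,
\end{align*}

and summing the two right-hand sides gives $\lVert\widehat{\bm{t}}\rVert_{2}^{2}+\lVert\widetilde{\bm{t}}\rVert_{2}^{2}-2\langle\widehat{\bm{t}},\widetilde{\bm{t}}\rangle=\lVert\widehat{\bm{t}}-\widetilde{\bm{t}}\rVert_{2}^{2}$.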

Therefore

\begin{align*}
\varrho_{k}(\bm{x})&=\lVert G(\bm{x})-G(\bm{x}_{k})\rVert_{2}^{2}\\
&\quad+2\langle G(\bm{x}_{k})-G(\bm{x})+(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}-\bm{x}_{k}),\bm{\xi}_{1,k}\rangle\\
&\leq\tau_{G}^{2}\lVert\bm{x}-\bm{x}_{k}\rVert_{2}^{2}+L_{G}\lVert\bm{\xi}_{1,k}\rVert_{2}\lVert\bm{x}-\bm{x}_{k}\rVert_{2}^{2}
\end{align*}

where we use the Cauchy-Schwarz inequality and Assumption 2 to furnish the final inequality. ∎

Lemma 9.

For every $k\in\mathbb{N}$ and any $\bm{w}\in\mathbb{R}^{d}$, the inequality (16) holds for $g=g(\bm{x})=\overline{\mathcal{L}}_{\rho}(\bm{x},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})$, $\widetilde{\bm{w}}=\bm{x}_{k}$, and $L_{g}=L_{G}(4\sigma_{0}+\rho\lVert\bm{\xi}_{1,k}\rVert_{2})+\rho\tau_{G}^{2}$.

Proof.

First, note that

\[
\nabla_{\bm{x}}g(\bm{x}_{k})=-\nabla_{\bm{x}}G(\bm{x}_{k})\left(\bm{\varphi}_{1,k}+\rho\bm{\xi}_{1,k}\right).
\]

Now, observe

\begin{align*}
&g(\bm{x})-g(\bm{x}_{k})-\langle\bm{x}-\bm{x}_{k},\nabla_{\bm{x}}g(\bm{x}_{k})\rangle\\
&\leq\frac{4\sigma_{0}L_{G}}{2}\lVert\bm{x}-\bm{x}_{k}\rVert_{2}^{2}+\frac{\rho(\tau_{G}^{2}+L_{G}\lVert\bm{\xi}_{1,k}\rVert_{2})}{2}\lVert\bm{x}-\bm{x}_{k}\rVert_{2}^{2}
\end{align*}

where we use the Cauchy-Schwarz inequality, Assumption 2, Lemma 6, and Lemma 8 to produce the final inequality. ∎

We now prove the desired bound on the adjusted augmented Lagrangian over an update of $\bm{x}$.

Proof of Proposition 4.

Define $g(\bm{x})=\overline{\mathcal{L}}_{\rho}(\bm{x},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})$ and let $\widehat{\bm{x}}_{k}=\text{prox}_{\alpha_{1}R_{1}}\left(\bm{x}_{k}-\alpha_{1}\nabla_{\bm{x}}g(\bm{x}_{k})\right)$. Since $\bm{x}_{k+1}$ is the output of multiple GD/ISTA steps on $g+R_{1}$ w.r.t. $\bm{x}$ from $\bm{x}_{k}$, which decrease the cost function at least as much as a single ISTA step, then

\begin{align}
\overline{\mathcal{L}}_{\rho}(\bm{x}_{k+1},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+R_{1}(\bm{x}_{k+1})\leq\overline{\mathcal{L}}_{\rho}(\widehat{\bm{x}}_{k},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+R_{1}(\widehat{\bm{x}}_{k}).\tag{25}
\end{align}

By Lemma 9 and Lemma 3, with $g=g(\bm{x})$, $r=R_{1}$, $\widetilde{\bm{w}}=\bm{x}_{k}$, $L_{g}=L_{G}(4\sigma_{0}+\rho\lVert\bm{\xi}_{1,k}\rVert_{2})+\rho\tau_{G}^{2}$, and $\bm{w}^{*}=\bm{x}^{*}$, we have

\begin{align*}
&\overline{\mathcal{L}}_{\rho}(\widehat{\bm{x}}_{k},\bm{v}_{2:4}^{(k)},\boldsymbol{\phi}_{k})+R_{1}(\widehat{\bm{x}}_{k})\\
&\leq\overline{\mathcal{L}}_{k}+\theta\langle\bm{x}^{*}-\bm{x}_{k},\nabla_{\bm{x}}\overline{\mathcal{L}}_{k}\rangle+\frac{\theta^{2}}{\alpha_{1}}\Delta_{1,k}^{2}+\theta R_{1}(\bm{x}^{*})+(1-\theta)R_{1}(\bm{x}_{k}),
\end{align*}

which combined with (25) produces the desired result. ∎

VI-G Complete Iteration Bound (Proof of Proposition 5)

In order to prove Proposition 5 we first derive two simplifying lemmas.

Lemma 10.

Let $\bm{q}_{i}\in\mathbb{R}^{n}$ for $i\in\{1,2\}$ and $\bm{q}=\begin{bmatrix}\bm{q}_{1}^{T}&\bm{q}_{2}^{T}\end{bmatrix}^{T}$. For every $k\in\mathbb{N}$

\[
\left|\left\langle\bm{\xi}_{k}+(\nabla_{\bm{v}}\bm{\xi}_{k})^{T}(\bm{v}^{*}-\bm{v}_{k}),\bm{q}\right\rangle\right|\leq\frac{L_{G}\lVert\bm{q}_{1}\rVert_{2}}{2}\Delta_{1,k}^{2}+\frac{\lVert\bm{q}_{2}\rVert_{2}}{2}\left(\Delta_{2,k}^{2}+\Delta_{3,k}^{2}\right).
\]
Proof.

First, note that

\[
(\nabla_{\bm{v}}\bm{\xi}_{k})^{T}(\bm{v}^{*}-\bm{v}_{k})=\begin{bmatrix}(\bm{z}^{*}-\bm{z}_{k})-(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}^{*}-\bm{x}_{k})\\(\bm{c}^{*}-\bm{c}_{k})-\bm{u}_{k}\odot(\bm{z}^{*}-\bm{z}_{k})-\bm{z}_{k}\odot(\bm{u}^{*}-\bm{u}_{k})\end{bmatrix},
\]

which combined with the feasibility condition $\bm{\xi}^{*}=\bm{0}$ gives

\begin{align*}
\bm{\xi}_{k}+(\nabla_{\bm{v}}\bm{\xi}_{k})^{T}(\bm{v}^{*}-\bm{v}_{k})&=\begin{bmatrix}\bm{z}^{*}-G(\bm{x}_{k})-(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}^{*}-\bm{x}_{k})\\ \bm{c}^{*}-\bm{z}^{*}\odot\bm{u}_{k}-\bm{z}_{k}\odot\bm{u}^{*}+\bm{z}_{k}\odot\bm{u}_{k}\end{bmatrix}\\
&=\begin{bmatrix}G(\bm{x}^{*})-G(\bm{x}_{k})-(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}^{*}-\bm{x}_{k})\\(\bm{z}^{*}-\bm{z}_{k})\odot(\bm{u}^{*}-\bm{u}_{k})\end{bmatrix}.
\end{align*}

Since $2ab\leq a^{2}+b^{2}$, note that

\begin{align}
2\lVert(\bm{z}^{*}-\bm{z}_{k})\odot(\bm{u}^{*}-\bm{u}_{k})\rVert_{2}\leq\Delta_{2,k}^{2}+\Delta_{3,k}^{2}.\tag{26}
\end{align}

Therefore, by the Cauchy-Schwarz and triangle inequalities

\begin{align*}
&\left|\left\langle\bm{\xi}_{k}+(\nabla_{\bm{v}}\bm{\xi}_{k})^{T}(\bm{v}^{*}-\bm{v}_{k}),\bm{q}\right\rangle\right|\\
&\leq\lVert G(\bm{x}^{*})-G(\bm{x}_{k})-(\nabla_{\bm{x}}G(\bm{x}_{k}))^{T}(\bm{x}^{*}-\bm{x}_{k})\rVert_{2}\lVert\bm{q}_{1}\rVert_{2}\\
&\quad+\lVert(\bm{z}^{*}-\bm{z}_{k})\odot(\bm{u}^{*}-\bm{u}_{k})\rVert_{2}\lVert\bm{q}_{2}\rVert_{2}.
\end{align*}

Finally, applying Assumption 2 and (26) produces the desired result. ∎
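
The Hadamard-product step in (26) combines $\lVert\bm{a}\odot\bm{b}\rVert_{2}\leq\lVert\bm{a}\rVert_{2}\lVert\bm{b}\rVert_{2}$ with $2ab\leq a^{2}+b^{2}$; a quick numerical check on random vectors, offered only as an illustration (with $\Delta_{2,k}$ and $\Delta_{3,k}$ read here as the plain distances), is sketched below.

import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal(128)   # stand-in for z* - z_k
b = rng.standard_normal(128)   # stand-in for u* - u_k
lhs = 2 * np.linalg.norm(a * b)                         # 2 * ||a (Hadamard) b||_2
rhs = np.linalg.norm(a) ** 2 + np.linalg.norm(b) ** 2   # Delta_{2,k}^2 + Delta_{3,k}^2
assert lhs <= rhs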

Lemma 11.

Let $\psi_{j,k}=4\sigma_{0}+\rho\lVert\bm{\xi}_{j,k}\rVert_{2}$ for $j\in\{1,2\}$ and $k\in\mathbb{N}$. For every $k\in\mathbb{N}$

\[
\overline{\mathcal{L}}_{k}-\overline{\mathcal{L}}_{*}+\langle\bm{v}^{*}-\bm{v}_{k},\nabla_{\bm{v}}\overline{\mathcal{L}}_{k}\rangle\leq\frac{L_{G}\psi_{1,k}}{2}\Delta_{1,k}^{2}+\frac{\psi_{2,k}}{2}\left(\Delta_{2,k}^{2}+\Delta_{3,k}^{2}\right).
\]
Proof.

First, note that

\begin{align}
\nabla_{\bm{v}}\overline{\mathcal{L}}_{k}=\begin{bmatrix}\bm{0}^{T}&\bm{0}^{T}&\bm{0}^{T}&(\nabla_{\bm{c}}f_{\bm{y}}(\bm{c}_{k}))^{T}\end{bmatrix}^{T}+(\nabla_{\bm{v}}\bm{\xi}_{k})\left(\boldsymbol{\phi}_{k}+\rho\bm{\xi}_{k}\right).\tag{27}
\end{align}

Hence

\begin{align*}
&\overline{\mathcal{L}}_{k}-\overline{\mathcal{L}}_{*}+\langle\bm{v}^{*}-\bm{v}_{k},\nabla_{\bm{v}}\overline{\mathcal{L}}_{k}\rangle\\
&=f_{\bm{y}}(\bm{c}_{k})-f_{\bm{y}}(\bm{c}^{*})+\langle\bm{c}^{*}-\bm{c}_{k},\nabla_{\bm{c}}f_{\bm{y}}(\bm{c}_{k})\rangle\\
&\quad+\langle\boldsymbol{\phi}_{k},\bm{\xi}_{k}\rangle+\frac{\rho}{2}\lVert\bm{\xi}_{k}\rVert_{2}^{2}+\left\langle\bm{v}^{*}-\bm{v}_{k},(\nabla_{\bm{v}}\bm{\xi}_{k})\left(\boldsymbol{\phi}_{k}+\rho\bm{\xi}_{k}\right)\right\rangle\\
&\leq\left\langle\bm{\xi}_{k}+(\nabla_{\bm{v}}\bm{\xi}_{k})^{T}(\bm{v}^{*}-\bm{v}_{k}),\boldsymbol{\phi}_{k}+\rho\bm{\xi}_{k}\right\rangle
\end{align*}

where we use the convexity of $f_{\bm{y}}$ to produce the final inequality. Applying Lemma 10 produces the desired result. ∎

We now prove the desired bound on the augmented Lagrangian over one iteration of Algorithm 1.

Proof of Proposition 5.

Applying Corollary 7 followed by Propositions 1, 2, 3, and 4 and finally Lemma 5 gives

\begin{align*}
\mathcal{L}_{k+1}&=\overline{\mathcal{L}}_{k+1}+\sum_{j=1}^{4}R_{j}(\bm{\zeta}_{j}^{(k+1)})\\
&\leq\mathcal{L}_{k}+\theta\left\langle\bm{v}^{*}-\bm{v}_{k},\nabla_{\bm{v}}\overline{\mathcal{L}}_{k}\right\rangle+\sum_{j=1}^{4}\left(\frac{\theta^{2}}{\alpha_{j}}\Delta_{j,k}^{2}+\widehat{\gamma}_{j}\Delta_{j,k}^{2}\right)\\
&\quad+\theta\sum_{j=1}^{4}\left(R_{j}(\bm{\zeta}_{j}^{*})-R_{j}(\bm{\zeta}_{j}^{(k)})\right)
\end{align*}

for $\widehat{\gamma}_{1}=2\tau_{G}^{2}(\rho^{2}+\sigma_{0})$, $\widehat{\gamma}_{2}=2[(\rho u_{\infty})^{2}+(2\rho u_{\infty}z_{\infty}+4\sigma_{0}+\rho c_{\infty})^{2}+\sigma_{0}(1+2u_{\infty}^{2})]$, $\widehat{\gamma}_{3}=2z_{\infty}^{2}(\rho^{2}+2\sigma_{0})$, and $\widehat{\gamma}_{4}=2\sigma_{0}$.

Applying Lemma 11 produces the desired result for $\beta_{1}=\frac{L_{G}(4\sigma_{0}+\rho\xi_{1,\max})}{2}+\widehat{\gamma}_{1}$, $\beta_{2}=\frac{4\sigma_{0}+\rho\xi_{2,\max}}{2}+\widehat{\gamma}_{2}$, $\beta_{3}=\frac{4\sigma_{0}+\rho\xi_{2,\max}}{2}+\widehat{\gamma}_{3}$, and $\beta_{4}=\widehat{\gamma}_{4}$, where $\xi_{j,\max}=\max_{k}\lVert\bm{\xi}_{j,k}\rVert_{2}$. ∎

VI-H Convergence (Proof of Theorem 6)

We first establish a handful of preliminary results through the following lemmas in order to prove Theorem 6.

Lemma 12.

For $k\in\mathbb{N}$ assume that $\lVert\bm{\varphi}_{j,k}-\bm{\varphi}_{j}^{*}\rVert_{2}^{2}\geq\varphi_{j,\min}>0$ and $\Delta_{3,k}^{2}\geq p_{3}>0$ where $j\in\{1,2\}$. Let $\omega_{1}>\frac{L_{G}\lVert\bm{\varphi}_{1}^{*}\rVert_{2}}{\rho\nu_{G}^{2}}$, $\omega_{2}>\frac{\lVert\bm{\varphi}_{2}^{*}\rVert_{2}}{\rho}$, $\omega_{3}>\frac{\lVert\bm{\varphi}_{2}^{*}\rVert_{2}}{\rho p_{3}}$, and $\omega_{4}>0$. Set $\omega_{5,1}\geq\frac{\omega_{1}\tau_{G}\Delta_{1,0}^{2}+\omega_{2}\Delta_{2,0}^{2}}{\varphi_{1,\min}}$ and $\omega_{5,2}\geq\frac{\omega_{3}\Delta_{(2,3),0}^{2}+\omega_{4}\Delta_{4,0}^{2}}{\varphi_{2,\min}}$. Then

\[
\mathcal{L}_{k}-\mathcal{L}_{*}\geq\frac{1}{2}\sum_{i=1}^{4}\kappa_{i}\Delta_{i,k}^{2}-\frac{\mu}{\rho}
\]

for positive constants $\kappa_{1}=\rho\omega_{1}\nu_{G}^{2}-L_{G}\lVert\bm{\varphi}_{1}^{*}\rVert_{2}$, $\kappa_{2}=\rho\omega_{2}-\lVert\bm{\varphi}_{2}^{*}\rVert_{2}$, $\kappa_{3}=\rho\omega_{3}p_{3}-\lVert\bm{\varphi}_{2}^{*}\rVert_{2}$, $\kappa_{4}=\rho\omega_{4}$, and $\mu=\left(1+\max\{\omega_{5,1},\omega_{5,2}\}\right)\sum_{i=1}^{2}\left(\varphi_{i,\max}^{2}+\lVert\bm{\varphi}_{i}^{*}\rVert_{2}^{2}\right)$.

Proof.

Note that the first-order optimality conditions of (6), with $F$ given in (13), are $-\nabla_{\bm{v}}\overline{\mathcal{L}}_{*}\in\partial R_{1}(\bm{x}^{*})\times\partial R_{2}(\bm{z}^{*})\times\partial R_{3}(\bm{u}^{*})\times\partial R_{4}(\bm{c}^{*})$ and $\bm{\xi}^{*}=\bm{0}$, where $\partial R_{i}(\bm{\zeta}_{i})$ is the subdifferential of $R_{i}$ at $\bm{\zeta}_{i}$. By the definition of a subdifferential

\[
\sum_{i=1}^{4}\left[R_{i}(\bm{\zeta}_{i}^{(k)})-R_{i}(\bm{\zeta}_{i}^{*})\right]\geq\langle\bm{v}_{k}-\bm{v}^{*},-\nabla_{\bm{v}}\overline{\mathcal{L}}_{*}\rangle.
\]

Thus, by (27), $\bm{\xi}^{*}=\bm{0}$, and convexity of $f_{\bm{y}}$, observe

\begin{align}
\mathcal{L}_{k}-\mathcal{L}_{*}&=\overline{\mathcal{L}}_{k}-\overline{\mathcal{L}}_{*}+\sum_{i=1}^{4}\left[R_{i}(\bm{\zeta}_{i}^{(k)})-R_{i}(\bm{\zeta}_{i}^{*})\right]\nonumber\\
&\geq\langle\boldsymbol{\phi}_{k},\bm{\xi}_{k}\rangle+\frac{\rho}{2}\lVert\bm{\xi}_{k}\rVert_{2}^{2}+\langle(\nabla_{\bm{v}}\bm{\xi}^{*})^{T}(\bm{v}^{*}-\bm{v}_{k}),\boldsymbol{\phi}^{*}\rangle\nonumber\\
&\geq\langle\boldsymbol{\phi}_{k}-\boldsymbol{\phi}^{*},\bm{\xi}_{k}\rangle+\frac{\rho}{2}\lVert\bm{\xi}_{k}\rVert_{2}^{2}\nonumber\\
&\quad-\frac{L_{G}\lVert\bm{\varphi}_{1}^{*}\rVert_{2}}{2}\Delta_{1,k}^{2}-\frac{\lVert\bm{\varphi}_{2}^{*}\rVert_{2}}{2}\left(\Delta_{2,k}^{2}+\Delta_{3,k}^{2}\right)\tag{28}
\end{align}

where we applied Lemma 10 to furnish the final inequality.

Next, by Assumption 2 and $\bm{\xi}^{*}=\bm{0}$, observe

\begin{align}
&\left\lVert\bm{z}_{k}-G(\bm{x}_{k})-\frac{\bm{\varphi}_{1}^{*}-\bm{\varphi}_{1,k}}{\rho}\right\rVert_{2}^{2}\nonumber\\
&\quad=\left\lVert(\bm{z}_{k}-\bm{z}^{*})-\left(G(\bm{x}_{k})-G(\bm{x}^{*})\right)-\frac{\bm{\varphi}_{1}^{*}-\bm{\varphi}_{1,k}}{\rho}\right\rVert_{2}^{2}\nonumber\\
&\quad\geq\omega_{2}\Delta_{2,k}^{2}+\omega_{1}\nu_{G}^{2}\Delta_{1,k}^{2}-\frac{\omega_{5,1}}{\rho^{2}}\lVert\bm{\varphi}_{1}^{*}-\bm{\varphi}_{1,k}\rVert_{2}^{2}\tag{29}
\end{align}

and

\begin{align}
&\left\lVert\bm{c}_{k}-\bm{z}_{k}\odot\bm{u}_{k}-\frac{\bm{\varphi}_{2}^{*}-\bm{\varphi}_{2,k}}{\rho}\right\rVert_{2}^{2}\nonumber\\
&\quad=\left\lVert(\bm{c}_{k}-\bm{c}^{*})-(\bm{z}_{k}\odot\bm{u}_{k}-\bm{z}^{*}\odot\bm{u}^{*})-\frac{\bm{\varphi}_{2}^{*}-\bm{\varphi}_{2,k}}{\rho}\right\rVert_{2}^{2}\nonumber\\
&\quad\geq\omega_{4}\Delta_{4,k}^{2}+\omega_{3}p_{3}\Delta_{3,k}^{2}-\frac{\omega_{5,2}}{\rho^{2}}\lVert\bm{\varphi}_{2}^{*}-\bm{\varphi}_{2,k}\rVert_{2}^{2}.\tag{30}
\end{align}

Additionally, observe that

\begin{align}
\lVert\boldsymbol{\phi}_{k}-\boldsymbol{\phi}^{*}\rVert_{2}^{2}\leq 2\sum_{i=1}^{2}\left(\varphi_{i,\max}^{2}+\lVert\bm{\varphi}_{i}^{*}\rVert_{2}^{2}\right).\tag{31}
\end{align}

Finally, combining the completion-of-squares identity

\[
\langle\boldsymbol{\phi}_{k}-\boldsymbol{\phi}^{*},\bm{\xi}_{k}\rangle+\frac{\rho}{2}\lVert\bm{\xi}_{k}\rVert_{2}^{2}=\frac{\rho}{2}\left\lVert\bm{\xi}_{k}+\frac{\boldsymbol{\phi}_{k}-\boldsymbol{\phi}^{*}}{\rho}\right\rVert_{2}^{2}-\frac{1}{2\rho}\lVert\boldsymbol{\phi}_{k}-\boldsymbol{\phi}^{*}\rVert_{2}^{2}
\]

with (28), (29), (30), and (31) produces the desired bound. ∎

Lemma 13.

Let the assumptions of Lemma 12 hold and set $\{\kappa_{j}\}_{j=1}^{4}$ and $\mu$ as given in Lemma 12. Choose $\{(\alpha_{j},\beta_{j})\}_{j=1}^{4}$, from Proposition 5, to satisfy $\widehat{\beta}\leq\widetilde{\beta}$ where $\widehat{\beta}\coloneqq\sqrt{\max_{1\leq i\leq 4}\,\alpha_{i}\beta_{i}}$ and $\widetilde{\beta}\coloneqq\min_{1\leq i\leq 4}\,\frac{\alpha_{i}\kappa_{i}}{4}$. If $\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\geq\eta\geq\frac{\mu}{2\rho}(\widetilde{\beta}-\widehat{\beta})^{-1}$, then $\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}\geq\widehat{\beta}$.

Proof.

By Lemma 12,

\begin{align}
\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}&\geq\left(\frac{1}{2}\sum_{j=1}^{4}\kappa_{j}\Delta_{j,k}^{2}-\frac{\mu}{\rho}\right)\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}\nonumber\\
&\geq\widetilde{\beta}-\frac{\mu}{2\rho\eta}.\tag{32}
\end{align}

Applying $\eta\geq\frac{\mu}{2\rho}(\widetilde{\beta}-\widehat{\beta})^{-1}$ produces the desired bound. ∎
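
To make the parameter conditions of Lemma 13 concrete, the sketch below plugs in hypothetical values (not derived from the paper) for the constants, computes $\widehat{\beta}$ and $\widetilde{\beta}$, and evaluates the resulting lower bound on $\eta$.

import numpy as np

# Hypothetical constants chosen only to illustrate the conditions; they are not
# values computed from the paper's assumptions.
alpha = np.array([0.05, 0.05, 0.05, 0.05])
beta  = np.array([0.002, 0.003, 0.0025, 0.001])
kappa = np.array([8.0, 10.0, 9.0, 6.0])
mu, rho = 1.0, 5.0

beta_hat   = np.sqrt(np.max(alpha * beta))   # sqrt(max_i alpha_i * beta_i)
beta_tilde = np.min(alpha * kappa) / 4.0     # min_i alpha_i * kappa_i / 4
assert beta_hat <= beta_tilde                # condition of Lemma 13
eta_lower = mu / (2 * rho * (beta_tilde - beta_hat))
print(beta_hat, beta_tilde, eta_lower)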

We now prove the main convergence result for our proposed A-CG-GAN algorithm.

Proof of Theorem 6.

Define

\[
\theta_{k}=\min\left\{1,\sqrt{\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)^{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-2}-\widehat{\beta}^{2}}\right\},
\]

which is well-defined by Lemma 13 (where we take $\eta=\widehat{\eta}$) and satisfies $\theta_{k}\in[0,1]$ for every $k\in\mathbb{N}$.

Case 1: $\theta_{k}=\sqrt{\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)^{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-2}-\widehat{\beta}^{2}}$.

Using that $\sqrt{a-b}\geq\sqrt{a}-\sqrt{b}$ for all $a\geq b\geq 0$, observe

\begin{align}
&\theta_{k}^{2}\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}+\theta_{k}(\mathcal{L}_{*}-\mathcal{L}_{k})\nonumber\\
&\leq\frac{\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)^{2}}{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}-\widehat{\beta}^{2}\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\nonumber\\
&\quad-\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)^{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}+\widehat{\beta}(\mathcal{L}_{k}-\mathcal{L}_{*}).\tag{33}
\end{align}

Additionally, note that

\begin{align}
\sum_{j=1}^{4}\beta_{j}\Delta_{j,k}^{2}=\sum_{j=1}^{4}\alpha_{j}\beta_{j}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\leq\widehat{\beta}^{2}\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}.\tag{34}
\end{align}

Now, by Proposition 5, (33), (34), and (32), observe

\begin{align}
&\mathcal{L}_{k+1}-\mathcal{L}_{*}\nonumber\\
&\leq\mathcal{L}_{k}-\mathcal{L}_{*}+\sum_{j=1}^{4}\beta_{j}\Delta_{j,k}^{2}+\theta_{k}^{2}\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}+\theta_{k}(\mathcal{L}_{*}-\mathcal{L}_{k})\nonumber\\
&\leq(\mathcal{L}_{k}-\mathcal{L}_{*})\left(1-\frac{\mathcal{L}_{k}-\mathcal{L}_{*}}{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-1}+\widehat{\beta}\right)\nonumber\\
&\leq\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)\left(1-\frac{\widetilde{\beta}}{2}+\frac{\mu}{4\rho\widehat{\eta}}+\widehat{\beta}\right)\nonumber\\
&\leq\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)\left(1-\widetilde{\beta}/8\right)\tag{35}
\end{align}

where we use that $\widehat{\beta}\leq\frac{\widetilde{\beta}}{4}$ and $\widehat{\eta}\geq 2\mu(\rho\widetilde{\beta})^{-1}$ in the final inequality.

Case 2: $\theta_{k}=1$.

By Proposition 5 and (34) observe

\begin{align}
\mathcal{L}_{k+1}-\mathcal{L}_{*}&\leq\mathcal{L}_{k}-\mathcal{L}_{*}+\sum_{j=1}^{4}\beta_{j}\Delta_{j,k}^{2}+\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}+(\mathcal{L}_{*}-\mathcal{L}_{k})\nonumber\\
&\leq(1+\widehat{\beta}^{2})\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\nonumber\\
&\leq\frac{\sqrt{1+\widehat{\beta}^{2}}}{2}(\mathcal{L}_{k}-\mathcal{L}_{*})\tag{36}
\end{align}

where we use $1\leq\sqrt{\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)^{2}\left(2\sum_{j=1}^{4}\frac{\Delta_{j,k}^{2}}{\alpha_{j}}\right)^{-2}-\widehat{\beta}^{2}}$ in the final inequality. Note that $\sqrt{1+x^{2}}\leq 2-x$ for all $x\leq\frac{3}{4}$. Since $\widehat{\beta}\leq\widetilde{\beta}/4\leq\frac{3}{4}$, then

\[
\sqrt{1+\widehat{\beta}^{2}}\leq\sqrt{1+\left(\widetilde{\beta}/4\right)^{2}}\leq 2-\widetilde{\beta}/4,
\]

which combined with (36) gives (35).

Hence, in both cases, $\mathcal{L}_{k+1}-\mathcal{L}_{*}\leq\left(1-\delta\right)\left(\mathcal{L}_{k}-\mathcal{L}_{*}\right)$. Applying this inequality recursively gives

\begin{align}
\mathcal{L}_{k+1}-\mathcal{L}_{*}\leq\left(1-\delta\right)^{k+1}\left(\mathcal{L}_{0}-\mathcal{L}_{*}\right).\tag{37}
\end{align}

Therefore, the desired result follows by combining Lemma 12 with (37) and, finally, observing that $\widetilde{\beta}^{-1}\leq(4\widehat{\beta})^{-1}$ and $\frac{\mu}{4\rho\widetilde{\beta}}=\frac{\widehat{\eta}}{16}$. ∎
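
As an illustration of the geometric rate in (37) (with placeholder numbers, not values from the paper), the contraction factor $(1-\delta)$ directly yields the number of iterations needed to drive the Lagrangian gap below a tolerance:

import math

# With L_k - L_* <= (1 - delta)^k * (L_0 - L_*), the gap falls below eps once
# k >= log((L_0 - L_*) / eps) / (-log(1 - delta)).
delta, initial_gap, eps = 0.05, 10.0, 1e-4   # placeholder values
k_needed = math.ceil(math.log(initial_gap / eps) / -math.log(1.0 - delta))
print(k_needed)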
