

Model-adapted Fourier sampling for generative compressed sensing

Aaron Berk, Simone Brugiapaglia, Yaniv Plan, Matthew Scott, Xia Sheng, Özgür Yılmaz

Aaron Berk: Dept. Mathematics and Statistics, McGill University, Montréal, Canada; now affiliated with Deep Render, London, UK. Simone Brugiapaglia: Dept. Mathematics and Statistics, Concordia University, Montréal, Canada. Yaniv Plan, Matthew Scott, Xia Sheng and Özgür Yılmaz: Dept. Mathematics, University of British Columbia, Vancouver, Canada. Corresponding author: matthewscott@math.ubc.ca. Authors are listed in alphabetical order. MS was primarily responsible for developing the theory and writing the main body; XS was primarily responsible for the numerical experiments and writing the numerical section.
Abstract

We study generative compressed sensing when the measurement matrix is randomly subsampled from a unitary matrix (with the DFT as an important special case). It was recently shown that $\mathcal{O}(kdn\|\boldsymbol{\alpha}\|_{\infty}^{2})$ uniformly random Fourier measurements are sufficient to recover signals in the range of a neural network $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$ of depth $d$, where each component of the so-called local coherence vector $\boldsymbol{\alpha}$ quantifies the alignment of a corresponding Fourier vector with the range of $G$. We construct a model-adapted sampling strategy with an improved sample complexity of $\mathcal{O}(kd\|\boldsymbol{\alpha}\|_{2}^{2})$ measurements. This is enabled by: (1) new theoretical recovery guarantees that we develop for nonuniformly random sampling distributions and then (2) optimizing the sampling distribution to minimize the number of measurements needed for these guarantees. This development offers a sample complexity applicable to natural signal classes, which are often almost maximally coherent with low Fourier frequencies. Finally, we consider a surrogate sampling scheme, and validate its performance in recovery experiments using the CelebA dataset.

1 Introduction

Compressed sensing considers signals $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ with high ambient dimension $n$ that belong to (or can be well approximated by elements of) a prior set $\mathcal{V}\subseteq\mathbb{R}^{n}$ with lower “complexity” than the ambient space. The aim is to recover (an approximation to) such signals with provable accuracy guarantees from the linear, typically noisy, measurements $\boldsymbol{b}=A\boldsymbol{x}_{0}+\boldsymbol{\eta}$, where $\boldsymbol{\eta}\in\mathbb{C}^{m}$ denotes noise and $A\in\mathbb{C}^{m\times n}$ with $m\ll n$ is an appropriately chosen (possibly random) measurement matrix. The signal is to be recovered by means of a computationally feasible method that utilizes the structure of $\mathcal{V}$ and has access only to $A$ and $\boldsymbol{b}$. In classical compressed sensing, $\mathcal{V}$ is the set of sparse vectors. In generative compressed sensing, the prior set $\mathcal{V}$ is chosen to be the range of a generative neural network $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$, an idea that was first explored in [5]. Our results hold for ReLU-activated neural networks as defined in [4, Definition I.2]. Here we denote the ReLU (Rectified Linear Unit) activation as $\sigma(z):=\max\{z,0\}$, applied component-wise to a real vector $\boldsymbol{z}$.

Definition 1.1 ((k,d,n)(k,d,n)-Generative Network).

With $k,d,n\in\mathbb{N}$, fix the integers $2\leq k:=k_{0}\leq k_{1},\ldots,k_{d-1}\leq k_{d}=n$, and for $i\in[d]$, let $W^{(i)}\in\mathbb{R}^{k_{i}\times k_{i-1}}$. A $(k,d,n)$-generative network is a function $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$ of the form $G(\boldsymbol{z}):=W^{(d)}\sigma\left(\cdots W^{(2)}\sigma\left(W^{(1)}\boldsymbol{z}\right)\right)$.

In the same work [5], the authors provided a theoretical framework for generative compressed sensing with $A\in\mathbb{R}^{m\times n}$ having independent identically distributed (i.i.d.) Gaussian entries. However, the assumption of Gaussian measurements is unrealistic for applications like MRI, where measurements are spatial Fourier transforms. Limitations of the hardware in such an application restrict the set of possible measurements to the rows of a fixed unitary matrix (or approximately so, up to discretization of the measurements). Hence we consider the more realistic subsampled unitary measurement matrices, i.e., matrices with rows randomly subsampled from the rows of a unitary matrix $F\in\mathbb{C}^{n\times n}$.

A first result in the generative setting with subsampled unitary measurements [4] shows that for well-behaved networks, $m=\mathcal{O}(k^{2}d^{2})$ (up to log factors) measurements suffice for recovery with high probability. However, sampling uniformly is not efficient; sampling more informative measurements at a higher rate is known to improve performance in compressed sensing (e.g., low Fourier frequencies are strongly correlated with natural images and therefore tend to be more informative) [3]. This idea is mirrored in the radial sampling strategy used in MRI scans, which takes a disproportionate number of low-frequency measurements [15]. We address this limitation by generalizing the theory to any measurement matrix of the form $A=SF$, where $F\in\mathbb{C}^{n\times n}$ is a unitary matrix and $S\in\mathbb{R}^{m\times n}$ is a sampling matrix, which we now define. We adopt the convention that $\boldsymbol{e}_{j}$ denotes a canonical basis vector and that $\Delta^{n-1}$ is the simplex in $\mathbb{R}^{n}$. See the Notation paragraph at the end of this section for the definition of symbols used throughout this paper.

Definition 1.2 (Sampling Matrix).

We define a sampling matrix to be any matrix $S\in\mathbb{R}^{m\times n}$ composed of i.i.d. row vectors $\boldsymbol{s}^{*}_{1},\dots,\boldsymbol{s}^{*}_{m}$ such that $\mathbb{P}(\boldsymbol{s}_{i}=\boldsymbol{e}_{j})=p_{j}$ for all $i\in[m],j\in[n]$, for some fixed probability vector $\boldsymbol{p}\in\Delta^{n-1}$.
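For concreteness, here is a minimal NumPy sketch of Definition 1.2: it draws a sampling matrix $S$ from a probability vector $\boldsymbol{p}$ and forms the subsampled measurements $\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}$ with the unitary DFT. The function name, sizes, and the uniform choice of $\boldsymbol{p}$ are our own illustration, not code from the paper.

```python
import numpy as np

def draw_sampling_matrix(p, m, rng=None):
    """Draw an m x n sampling matrix whose rows are i.i.d., equal to e_j with probability p_j."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(p)
    rows = rng.choice(n, size=m, replace=True, p=p)  # i.i.d. sampled row indices
    S = np.zeros((m, n))
    S[np.arange(m), rows] = 1.0                      # each row of S is a canonical basis vector
    return S, rows

# Example: subsample m rows of the unitary DFT matrix F (so that F^* F = I).
n, m = 64, 16
F = np.fft.fft(np.eye(n), norm="ortho")
p = np.full(n, 1.0 / n)                              # uniform sampling as a baseline
S, rows = draw_sampling_matrix(p, m)
x0 = np.random.randn(n)
b = (S @ F @ x0) / np.sqrt(m)                        # noiseless measurements b = (1/sqrt(m)) S F x0
```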

We will further show, similarly to [18], that picking the sampling probabilities in a manner informed by the geometry of the problem yields improved recovery guarantees relative to the uniform case. Specifically, we provide a bound on the measurement complexity of $m=\mathcal{O}(k^{2}d^{2})$ even for models which are highly aligned with a small subset of the rows of $F$. To find good sampling probabilities, observe that measurements (i.e., the rows of the measurement matrix) are effective for a prior set $\mathcal{V}\subseteq\mathbb{R}^{n}$ if they help differentiate between signals in $\mathcal{V}$. Therefore, we consider the alignment of the rows of $F$ with the set $\mathcal{V}-\mathcal{V}$ (where the difference is in the sense of a Minkowski sum, see the Notation paragraph), and we sample rows of $F$ that have a high degree of alignment with $\mathcal{V}-\mathcal{V}$ at a higher rate. For technical reasons, we consider alignment with a slightly larger set, given by the following set operator previously introduced in [4, Definition 2.1].

Definition 1.3 (Piecewise Linear Expansion).

Let $\mathcal{C}\subseteq\mathbb{R}^{n}$ be the union of $N$ convex cones: $\mathcal{C}=\bigcup_{i=1}^{N}\mathcal{C}_{i}$. Define the piecewise linear expansion $\Delta(\mathcal{C}):=\bigcup_{i=1}^{N}\operatorname{\mathrm{span}}(\mathcal{C}_{i})=\bigcup_{i=1}^{N}(\mathcal{C}_{i}-\mathcal{C}_{i})$.

The piecewise linear expansion is a well-defined set operator as shown in [4, Remark A.2]. Specifically, it is independent of the choice of convex cones 𝒞i\mathcal{C}_{i}.

We require the following quantity, which measures the alignment of individual measurement vectors with the prior set.

Definition 1.4 (Local Coherence).

The local coherence of a vector $\boldsymbol{\phi}\in\mathbb{C}^{n}$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$ is defined as

\alpha_{\mathcal{T}}(\boldsymbol{\phi}):=\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\lvert\langle\boldsymbol{\phi},\boldsymbol{x}\rangle\rvert.

The local coherences of a unitary matrix $F\in\mathbb{C}^{n\times n}$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$ are collected in the vector $\boldsymbol{\alpha}$ with entries $\alpha_{j}:=\alpha_{\mathcal{T}}(\boldsymbol{f}_{j})$, where $\boldsymbol{f}^{*}_{j}$ is the $j$th row of $F$.
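When $\Delta(\mathcal{T})$ is a single $k$-dimensional subspace of $\mathbb{R}^{n}$, the supremum in Definition 1.4 can be computed exactly: for $\boldsymbol{x}=U\boldsymbol{c}$ with real unit $\boldsymbol{c}$, one has $|\langle\boldsymbol{f}_{j},\boldsymbol{x}\rangle|^{2}=(\mathrm{Re}(\boldsymbol{f}_{j}^{*}U)\boldsymbol{c})^{2}+(\mathrm{Im}(\boldsymbol{f}_{j}^{*}U)\boldsymbol{c})^{2}$, so the supremum is the largest singular value of the $2\times k$ matrix stacking the real and imaginary parts of $\boldsymbol{f}_{j}^{*}U$. The following sketch (our own illustration; the helper name and the random subspace are not from the paper) does this for each row of the unitary DFT.

```python
import numpy as np

def local_coherence_subspace(F, U):
    """Local coherences of the rows f_j^* of a unitary F w.r.t. a real k-dim subspace.

    U: real (n, k) matrix whose columns form an orthonormal basis of the subspace.
    """
    A = F @ U                                     # row j equals f_j^* U, shape (n, k), complex
    M = np.stack([A.real, A.imag], axis=1)        # shape (n, 2, k)
    return np.linalg.norm(M, ord=2, axis=(1, 2))  # largest singular value of each 2 x k block

# Example: coherences of the DFT rows w.r.t. a random 4-dimensional subspace of R^64.
n, k = 64, 4
F = np.fft.fft(np.eye(n), norm="ortho")
U, _ = np.linalg.qr(np.random.randn(n, k))
alpha = local_coherence_subspace(F, U)
print(alpha.max(), np.linalg.norm(alpha) ** 2)    # each alpha_j <= 1, and ||alpha||_2^2 <= k
```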

By using the local coherences of $F$ to inform the sampling probabilities, we will show that recovery occurs from only $m=\mathcal{O}(kd\lVert\boldsymbol{\alpha}\rVert_{2}^{2})$ (up to log factors) measurements with high probability.

Figure 1: Samples recovered by different sampling schemes with different sampling rates.

Prior Work

Sampling with non-uniform probabilities has been the subject of a line of research in classical compressed sensing. Seminal works involving the notion of local coherence in classical compressed sensing are [10, 11, 18, 19]. Quantities analogous to the local coherence have appeared in the literature, such as Christoffel functions [16] and leverage scores [7, 14], which have recently been introduced into machine learning in the context of, e.g., kernel-based methods [8] and deep learning [1].

While writing this manuscript, we became aware of the recent paper [2]. This work presents a framework for optimizing sampling in general scenarios based on the so-called generalized Christoffel function, a quantity that admits the (squared) local coherence considered here as a particular case. Furthermore, the method we use to numerically approximate the coherence in Section 3 can be seen as a special case of the one proposed in [2]. However, the results in [2] assume that $\mathcal{V}-\mathcal{V}$ is a union of low-dimensional subspaces. Hence, they are not directly applicable to the case of generative compressed sensing with ReLU networks, for which $\mathcal{V}-\mathcal{V}$ is in general only contained in a union of low-dimensional subspaces. Our theory, on the other hand, explicitly covers the case of generative compressed sensing and illustrates how sufficient conditions on $m$ leading to successful signal recovery depend on the generative network’s parameters $(k,d,n)$. In addition, we provide recovery guarantees that hold with high probability, as opposed to in expectation.

The present work directly improves on results from [4] by improving the sample complexity from $n\lVert\boldsymbol{\alpha}\rVert_{\infty}^{2}$ to $\lVert\boldsymbol{\alpha}\rVert_{2}^{2}$ when the sampling probabilities are adapted to the generative model used. This is a sizable improvement in performance guarantees for a significant class of realistic generative models. It can be understood as extending the theory of generative compressed sensing with Fourier measurements to many realistic settings, in which we provide nearly order-optimal bounds on the sample complexity. Indeed, in the context of the main result from [4], “favourable coherence” corresponds to $\lVert\boldsymbol{\alpha}\rVert_{\infty}\leq C\sqrt{kd/n}$ where $C$ is an absolute constant. Although the prior generated by a neural network with Gaussian weights has such a coherence [4], we observe empirically in Figure 2d) that for a trained generative model, the local coherences of a small number of Fourier frequencies are close to one. In such cases, the main result from [4] becomes vacuous while Theorem 2.1 remains meaningful.
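To make the gap concrete, the following sketch compares the factor $n\lVert\boldsymbol{\alpha}\rVert_{\infty}^{2}$ governing uniform sampling in [4] with the factor $\lVert\boldsymbol{\alpha}\rVert_{2}^{2}$ governing model-adapted sampling in Theorem 2.1, using a synthetic coherence profile of our own choosing (not data from the paper) in which a handful of low frequencies are nearly maximally coherent.

```python
import numpy as np

# Synthetic coherence profile (illustration only): a few low frequencies have
# alpha_j close to 1, and the remaining coherences decay like 1/j.
n = 256 ** 2
alpha = np.minimum(1.0, 5.0 / (1.0 + np.arange(n)))

uniform_factor = n * np.max(alpha) ** 2   # governs the uniform-sampling bound of [4]
adapted_factor = np.sum(alpha ** 2)       # governs the adapted-sampling bound of Theorem 2.1
print(uniform_factor, adapted_factor)     # ~6.6e4 versus ~10; the adapted factor does not scale with n
```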

Notation

For any map $f$, we denote by $\operatorname{\mathcal{R}}(f)$ the range of $f$. We define the simplex $\Delta^{n-1}:=\{\boldsymbol{p}\in\mathbb{R}^{n}:p_{i}\geq 0,\sum_{i=1}^{n}p_{i}=1\}$ and the sphere $\mathbb{S}^{n-1}:=\{\boldsymbol{x}\in\mathbb{R}^{n}:\|\boldsymbol{x}\|_{2}=1\}$. For any matrix $A\in\mathbb{C}^{m\times n}$, we denote its pseudo-inverse by $A^{+}\in\mathbb{C}^{n\times m}$. For a set $\mathcal{V}\subseteq\mathbb{R}^{n}$ we denote its self-difference $\mathcal{V}-\mathcal{V}:=\{\boldsymbol{v}_{1}-\boldsymbol{v}_{2}\,|\,\boldsymbol{v}_{1},\boldsymbol{v}_{2}\in\mathcal{V}\}$. We define $\operatorname{\Pi}_{\mathcal{T}}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to be the orthogonal projection onto a set $\mathcal{T}\subseteq\mathbb{R}^{n}$ in the sense that $\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}$ is a single element from the set $\operatorname*{\arg\min}_{\boldsymbol{t}\in\mathcal{T}}\|\boldsymbol{x}-\boldsymbol{t}\|_{2}$. We let $[l]=\{1,\ldots,l\}$. We denote by $\langle\cdot,\cdot\rangle$ the canonical inner product in $\mathbb{R}^{n}$ or $\mathbb{C}^{n}$, depending on the context.

2 Main Result

We now state the main result of this paper, which gives an upper bound on the sample complexity required for signal recovery. The accuracy of the recovery depends on the measurement noise, the modelling error (how far the signal is from the prior), and imperfect optimization. These are denoted, respectively, by $\boldsymbol{\eta}$, $\boldsymbol{x}^{\perp}$ and $\hat{\varepsilon}$ below.

Theorem 2.1.

Fix a $(k,d,n)$-generative network $G$, the cone $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)\subseteq\mathbb{R}^{n}$, and a unitary matrix $F\in\mathbb{C}^{n\times n}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Let $\boldsymbol{p}:=(\alpha_{j}^{2}/\lVert\boldsymbol{\alpha}\rVert_{2}^{2})_{j\in[n]}\in\Delta^{n-1}$, with $S\in\mathbb{R}^{m\times n}$ the corresponding random sampling matrix. Let $D\in\mathbb{R}^{n\times n}$ be the diagonal matrix with entries $D_{i,i}=1/\sqrt{p_{i}}$. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$. Let $\varepsilon>0$. If

m\geq C\|\boldsymbol{\alpha}\|_{2}^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\left(\frac{2}{\varepsilon}\right)\right)

for CC an absolute constant, then with probability at least 1ε1-\varepsilon over the realization of SS the following statement holds.

Statement 2.1.

For any choice of $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ and $\boldsymbol{\eta}\in\mathbb{C}^{m}$, let $\boldsymbol{b}:=\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}+\boldsymbol{\eta}$ and $\boldsymbol{x}^{\perp}:=\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}$. Then, for any $\hat{\boldsymbol{x}}\in\mathcal{T}$ and $\hat{\varepsilon}>0$ satisfying $\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{D}\boldsymbol{b}\|_{2}\leq\min_{\boldsymbol{x}\in\mathcal{T}}\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}-\tilde{D}\boldsymbol{b}\|_{2}+\hat{\varepsilon}$, we have that

\displaystyle\|\hat{\boldsymbol{x}}-\boldsymbol{x}_{0}\|_{2}\leq\|\boldsymbol{x}^{\perp}\|_{2}+\frac{3}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+3\|\tilde{D}\boldsymbol{\eta}\|_{2}+\frac{3}{2}\hat{\varepsilon}.
Remark 2.1.

The matrix $\tilde{D}$ has a vector of diagonal entries $\text{Diag}(\tilde{D})$ satisfying $\text{Diag}(\tilde{D})=S\,\text{Diag}(D)$. ∎
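As a concrete illustration (ours, not the paper's code): when the sampled indices are distinct, $S$ has orthonormal rows, so $S^{+}=S^{*}$ and $\tilde{D}=SDS^{*}$ is the diagonal matrix that rescales the $i$th measurement by $1/\sqrt{p_{j_{i}}}$, where $j_{i}$ is the row index sampled in row $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 32, 8
p = rng.random(n); p /= p.sum()                       # sampling probabilities
rows = rng.choice(n, size=m, replace=False, p=p)      # distinct sampled indices j_1, ..., j_m
S = np.zeros((m, n)); S[np.arange(m), rows] = 1.0     # sampling matrix
D = np.diag(1.0 / np.sqrt(p))                         # preconditioner on R^n
D_tilde = S @ D @ np.linalg.pinv(S)                   # m x m preconditioner D~ = S D S^+
assert np.allclose(np.diag(D_tilde), S @ np.diag(D))  # Diag(D~) = S Diag(D)
assert np.allclose(D_tilde, np.diag(1.0 / np.sqrt(p[rows])))
```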

Remark 2.2.

We give a generalization of this result to arbitrary sampling probability vectors in Theorem A3.1. ∎

3 Numerics

In this section, we provide empirical evidence of the connection between coherence-based non-uniform sampling and recovery error. By presenting visual (Figure 1) and quantitative (Figure 2) evidence, we validate that model-adapted sampling using a coherence-informed probability vector can outperform a uniform sampling scheme — requiring fewer measurements for successful recovery.

Figure 2: In (a) we show the reconstruction error as a function of the number of measurements for out-of-range signals. In (b) we present the same quantity for in-range signals. In (c) we treat signal recoveries with relative error less than $3\times 10^{-3}$ as “successful”, and display the proportion of successful recoveries out of 64 attempts on in-range signals. In (d) we plot the local coherences for one channel of one image. Observe that the local coherences take on values spanning many orders of magnitude and peak sharply at a small number of frequencies.

Coherence heuristic

Ideally, we would compute the local coherence $\boldsymbol{\alpha}$ using Definition 1.4, but to our knowledge computing the local coherence is intractable for generative models relevant to practical settings [4]. Thus, we approximate this quantity by sampling points from the range of the generative model and computing the local coherence from the sampled points instead. Specifically, we sample codes from the latent space of the generative model to generate a batch of images with shape $(B,C,H,W)$, where $B,C,H,W$ stand for batch size, number of channels, image height and image width, respectively. Then, we compute the set self-difference of the image batch and normalize each difference vector. This gives a tensor of shape $(B^{2},C,H,W)$. We perform a channel-wise two-dimensional Discrete Fourier Transform (DFT) on this tensor, take the element-wise modulus, and then maximize over the batch dimension. This results in a coherence tensor $a_{0}$ with shape $(C,H,W)$. To obtain the coherence-informed probability vector for each channel, we first square element-wise and then normalize channel-wise. To estimate the local coherences of our generative model we use a batch size of 5000 and employ the DFT from PyTorch [17].
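A minimal PyTorch sketch of this heuristic is given below. The function and variable names are our own, the generator `G` and its latent dimension are placeholders, and we use a small default batch size; the batch of 5000 used in the paper would require chunking the self-difference computation.

```python
import torch

def coherence_heuristic(G, latent_dim, batch_size=64, device="cpu"):
    """Approximate local coherences of the 2D DFT w.r.t. range(G) - range(G).

    Returns an estimated coherence tensor of shape (C, H, W) and the channel-wise
    sampling probabilities obtained by squaring and normalizing per channel.
    """
    with torch.no_grad():
        z = torch.randn(batch_size, latent_dim, device=device)
        imgs = G(z)                                          # (B, C, H, W)
        diff = imgs.unsqueeze(0) - imgs.unsqueeze(1)         # set self-difference, (B, B, C, H, W)
        diff = diff.reshape(-1, *imgs.shape[1:])             # (B^2, C, H, W)
        norms = diff.flatten(1).norm(dim=1).clamp_min(1e-12)
        diff = diff / norms.view(-1, 1, 1, 1)                # normalize each difference vector
        spectra = torch.fft.fft2(diff, norm="ortho").abs()   # channel-wise 2D DFT, modulus
        alpha = spectra.amax(dim=0)                          # max over the batch dimension -> (C, H, W)
    probs = alpha ** 2                                       # square element-wise
    probs = probs / probs.flatten(1).sum(dim=1).view(-1, 1, 1)  # normalize channel-wise
    return alpha, probs
```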

In-Range vs Out-of-Range Signals

We run signal recovery experiments for two kinds of images: signals conditioned to be in the range of the generative model (in-range signals), and signals picked directly from the validation set (out-of-range signals). The in-range signals are randomly generated (with Gaussian codes) by the same generative network that we use for recovery. This ensures that these signals lie within the prior set. In the out-of-range setting, a residual recovery error can be observed even with large numbers of measurements (Figure 2). This error occurs because of the so-called model mismatch: there is some distance between the prior set and the signals.

Procedure for Signal Recovery

We perform signal recovery as follows. For a given image $\boldsymbol{x}_{0}\in\mathbb{R}^{C\times H\times W}$, we create a mask $M\in\{0,1\}^{C\times H\times W}$ by randomly sampling with replacement $m$ times for each channel according to the probability vector. Let $F$ be the channel-wise DFT operator and $G:\mathbb{R}^{k}\to\mathbb{R}^{C\times H\times W}$ be the generative neural network (where we omit the batch dimension for simplicity). We denote by $\odot$ the element-wise tensor multiplication and by $\|\cdot\|_{F}$ the Frobenius norm. We approximately solve the optimization program $\hat{\boldsymbol{z}}\in\operatorname*{\arg\min}_{\boldsymbol{z}\in\mathbb{R}^{k}}\|M\odot F\boldsymbol{x}_{0}-M\odot FG(\boldsymbol{z})\|_{F}^{2}$ by running AdamW [13] with $lr=0.003$, $\beta_{1}=0.9$ and $\beta_{2}=0.999$ for 20000 iterations on four different random initializations, and pick the code that achieves the lowest loss. The recovered signal is then $\hat{\boldsymbol{x}}:=G(\hat{\boldsymbol{z}})$. We measure the quality of the signal recovery by the relative recovery error (rre), $\operatorname{rre}(\boldsymbol{x}_{0},\hat{\boldsymbol{x}})=\frac{\|\boldsymbol{x}_{0}-\hat{\boldsymbol{x}}\|_{2}}{\|\boldsymbol{x}_{0}\|_{2}}$.
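A condensed PyTorch sketch of this recovery loop follows, using the hyperparameters stated above. The names are our own simplification; we show a single random initialization (the experiments use four) and, as in the experiments, omit the preconditioner $\tilde{D}$.

```python
import torch

def recover(G, x0, probs, m, latent_dim, iters=20000, lr=3e-3, device="cpu"):
    """Recover an image from m masked Fourier measurements per channel.

    probs: (C, H, W) channel-wise sampling probabilities (e.g., from the coherence heuristic).
    Returns the recovered image G(z_hat) and the relative recovery error.
    """
    x0 = x0.to(device)
    probs = probs.to(device)
    C, H, W = x0.shape
    # Build the 0/1 mask M by sampling m frequencies per channel with replacement.
    mask = torch.zeros(C, H * W, device=device)
    for c in range(C):
        idx = torch.multinomial(probs[c].flatten(), m, replacement=True)
        mask[c, idx] = 1.0
    mask = mask.view(C, H, W)

    Fx0 = torch.fft.fft2(x0, norm="ortho")                   # channel-wise DFT of the target
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.AdamW([z], lr=lr, betas=(0.9, 0.999))
    for _ in range(iters):
        opt.zero_grad()
        Fx = torch.fft.fft2(G(z).squeeze(0), norm="ortho")
        loss = (mask * (Fx0 - Fx)).abs().pow(2).sum()        # masked squared Frobenius norm
        loss.backward()
        opt.step()

    x_hat = G(z).squeeze(0).detach()
    rre = ((x0 - x_hat).norm() / x0.norm()).item()
    return x_hat, rre
```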

Figure 2b) demonstrates the efficiency of model-adapted sampling: signal recovery with adapted sampling occurs with 16 times fewer measurements than with uniform sampling. Similar performance gains can be observed visually in Figure 1 and Figure 3. Comparing the number of measurements to the ambient dimension, we see from Figure 2 that signal recovery occurs with $\frac{m}{n}\approx 2^{7}/256^{2}\approx 0.2\%$.

There are a few ways in which these numerical experiments do not directly match our theory. The sampling is done channel-wise, which is technically block sampling [3]. Also, the signal recovery is performed without the preconditioning factor $\tilde{D}$ that appears in Theorem 2.1.

4 Conclusion

In this paper we bring together the ideas used to quantify the compatibility of generative models with subsampled unitary measurements, which were first explored in [4], with ideas of non-uniform sampling from classical compressed sensing. We present the first theoretical result applying coherence-based sampling (similar to leverage score sampling, or Christoffel function sampling) to the setting where the prior is a ReLU generative neural network. We find that adapting the sampling scheme to the geometry of the problem yields substantially improved sampling complexities for many realistic generative networks, and that this improvement is significant in empirical experiments.

Possible avenues for future research include extending the theory presented in [2] to ReLU nets by using methods introduced in the present work. This would yield the benefit of extending the theory from this paper to a number of realistic sampling schemes. A second research direction consists of investigating the optimality of the sample complexity bound that we present in this paper. The sample complexity that we guarantee includes a factor of $k^{2}d^{2}$ when the generative model is well-behaved. Whether this dependence can be reduced to $kd$, as is the case when the measurement matrix is Gaussian, is an interesting problem that remains open. Finally, the class of neural networks considered in this work could be expanded to include more realistic ones.


References

  • [1] Ben Adcock, Juan M Cardenas and Nick Dexter “CAS4DL: Christoffel adaptive sampling for function approximation via deep learning” In Sampling Theory, Signal Processing, and Data Analysis 20.2 Springer, 2022, pp. 21
  • [2] Ben Adcock, Juan M. Cardenas and Nick Dexter “CS4ML: A General Framework for Active Learning with Arbitrary Data Based on Christoffel Functions” arXiv, 2023 arXiv:2306.00945 [cs, math]
  • [3] Ben Adcock and Anders C. Hansen “Compressive Imaging: Structure, Sampling, Learning” Cambridge: Cambridge University Press, 2021 DOI: 10.1017/9781108377447
  • [4] Aaron Berk et al. “A Coherence Parameter Characterizing Generative Compressed Sensing with Fourier Measurements” In IEEE Journal on Selected Areas in Information Theory, 2022, pp. 1–1 DOI: 10.1109/JSAIT.2022.3220196
  • [5] Ashish Bora, Ajil Jalal, Eric Price and Alexandros G. Dimakis “Compressed Sensing Using Generative Models” In Proceedings of the 34th International Conference on Machine Learning PMLR, 2017, pp. 537–546
  • [6] E.J. Candes, J. Romberg and T. Tao “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information” In IEEE Transactions on Information Theory 52.2, 2006, pp. 489–509 DOI: 10.1109/TIT.2005.862083
  • [7] Samprit Chatterjee and Ali S Hadi “Influential observations, high leverage points, and outliers in linear regression” In Statistical science JSTOR, 1986, pp. 379–393
  • [8] Tamás Erdélyi, Cameron Musco and Christopher Musco “Fourier sparse leverage scores and approximate kernel learning” In Advances in Neural Information Processing Systems 33, 2020, pp. 109–122
  • [9] Simon Foucart and Holger Rauhut “A Mathematical Introduction to Compressive Sensing” Springer New York, 2013
  • [10] Felix Krahmer, Holger Rauhut and Rachel Ward “Local Coherence Sampling in Compressed Sensing” In Proceedings of the 10th International Conference on Sampling Theory and Applications
  • [11] Felix Krahmer and Rachel Ward “Stable and Robust Sampling Strategies for Compressive Imaging” In IEEE Transactions on Image Processing 23.2, 2014, pp. 612–622 DOI: 10.1109/TIP.2013.2288004
  • [12] Ziwei Liu, Ping Luo, Xiaogang Wang and Xiaoou Tang “Deep Learning Face Attributes in the Wild” In Proceedings of International Conference on Computer Vision (ICCV), 2015
  • [13] Ilya Loshchilov and Frank Hutter “Decoupled Weight Decay Regularization”, 2019 arXiv:1711.05101 [cs.LG]
  • [14] Ping Ma, Michael Mahoney and Bin Yu “A statistical perspective on algorithmic leveraging” In International conference on machine learning, 2014, pp. 91–99 PMLR
  • [15] Maria Murad et al. “Radial Undersampling-Based Interpolation Scheme for Multislice CSMRI Reconstruction Techniques” In BioMed Research International 2021, 2021, pp. 6638588 DOI: 10.1155/2021/6638588
  • [16] Paul Nevai “Géza Freud, orthogonal polynomials and Christoffel functions. A case study” In Journal of Approximation Theory 48.1, 1986, pp. 3–167 DOI: https://doi.org/10.1016/0021-9045(86)90016-X
  • [17] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019, pp. 8024–8035 URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [18] Gilles Puy, Pierre Vandergheynst and Yves Wiaux “On Variable Density Compressive Sampling” In IEEE Signal Processing Letters 18.10, 2011, pp. 595–598 DOI: 10.1109/LSP.2011.2163712
  • [19] Holger Rauhut and Rachel Ward “Sparse Legendre Expansions via $\ell_{1}$-Minimization” In Journal of Approximation Theory 164.5, 2012, pp. 517–533 DOI: 10.1016/j.jat.2012.01.008
  • [20] Roman Vershynin “High-Dimensional Probability: An Introduction with Applications in Data Science” Cambridge University Press, 2018
  • [21] Yuanbo Xiangli et al. “Real or Not Real, that is the Question” In International Conference on Learning Representations, 2020

This supplementary material contains acknowledgements, a generalization of Theorem 2.1, the proof of said generalization, properties of the piecewise linear expansion, and additional image recoveries both in-range and out-of-range.

Appendix A1 Acknowledgements

The authors acknowledge Ben Adcock for providing feedback on a preliminary version of this paper. S. Brugiapaglia acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2020-06766 and the Fonds de Recherche du Québec Nature et Technologies (FRQNT) through grant 313276. Y. Plan is partially supported by an NSERC Discovery Grant (GR009284), an NSERC Discovery Accelerator Supplement (GR007657), and a Tier II Canada Research Chair in Data Science (GR009243). O. Yilmaz was supported by an NSERC Discovery Grant (22R82411). O. Yilmaz also acknowledges support by the Pacific Institute for the Mathematical Sciences (PIMS).

Appendix A2 Additional Notation

In this work we make use of absolute constants, which we always label $C$. Any constant labelled $C$ is implicitly understood to be absolute and may have a value that differs from one appearance to the next. We write $a\lesssim b$ to mean $a\leq Cb$. We denote by $\|\cdot\|$ the operator norm and by $\|\cdot\|_{2}$ the Euclidean norm. We use capital letters for matrices and boldface lowercase letters for vectors. For a matrix $A\in\mathbb{C}^{m\times n}$, we denote by $\boldsymbol{a}_{i}^{*}\in\mathbb{C}^{1\times n}$ (lowercase of the letter symbolizing the matrix) the $i$th row vector of $A$, meaning that $A=\sum_{i=1}^{m}\boldsymbol{e}_{i}\boldsymbol{a}_{i}^{*}$. For a vector $\boldsymbol{u}\in\mathbb{C}^{n}$, we denote by $u_{i}$ its $i$th entry. We let $\{\boldsymbol{e}_{i}\}_{i\in[n]}$ be the canonical basis of $\mathbb{R}^{n}$.

Appendix A3 Generalized Main Result

We present a generalization of Theorem 2.1 which provides recovery guarantees for arbitrary sampling probabilities. To quantify the quality of the interaction between the generative model GG, the unitary matrix FF, and the sampling probability vector 𝒑\boldsymbol{p}, we introduce the following quantity.

Definition A3.1.

Let $\mathcal{T}\subseteq\mathbb{R}^{n}$ be a cone. Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix and $\boldsymbol{p}\in\Delta^{n-1}$. Let $\boldsymbol{\alpha}\in\mathbb{R}^{n}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Then define the quantity

\mu_{\mathcal{T}}(F,\boldsymbol{p}):=\max_{j\in[n]}\frac{\alpha_{j}}{\sqrt{p_{j}}}.

We can now state the following.

Theorem A3.1 (Generalized Main Result).

Fix the $(k,d,n)$-generative network $G$, the cone $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)\subseteq\mathbb{R}^{n}$, the unitary matrix $F\in\mathbb{C}^{n\times n}$, the probability vector $\boldsymbol{p}\in\Delta^{n-1}$, and the corresponding random sampling matrix $S\in\mathbb{R}^{m\times n}$. Let $D\in\mathbb{R}^{n\times n}$ be the diagonal matrix with entries $D_{i,i}=\frac{1}{\sqrt{p_{i}}}$. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Let $\varepsilon>0$.

Suppose that

m\geq C\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p})\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right).

Furthermore, if we pick the sampling probability vector

\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\lVert\boldsymbol{\alpha}\rVert_{2}^{2}}\right)_{j\in[n]},

we only require that

m\geq C\|\boldsymbol{\alpha}\|_{2}^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right).

Then with probability at least 1ε1-\varepsilon over the realization of SS, Statement 2.1 holds.

This theorem is strictly more general than Theorem 2.1. It is therefore sufficient to prove the generalized version, which we do in the next section.

Remark A3.1.

The result [4, Theorem 2.1] is a corollary of Theorem A3.1; it follows from taking $\boldsymbol{p}$ to be the uniform probability vector. ∎

Appendix A4 Proof of Theorem A3.1

Let us first introduce the so-called Restricted Isometry Property (RIP) [6, 9, 20].

Definition A4.1 (Restricted Isometry Property).

Let $\mathcal{T}\subseteq\mathbb{R}^{n}$ be a cone and $A\in\mathbb{C}^{m\times n}$ a matrix. We say that $A$ satisfies the Restricted Isometry Property (RIP) on $\mathcal{T}$ when

\sup_{\boldsymbol{u}\in\mathcal{T}\cap\mathbb{S}^{n-1}}\lvert\lVert A\boldsymbol{u}\rVert_{2}-1\rvert\leq\frac{1}{3}.

Note that the constant 1/31/3 is a specific choice made in order to simplify the presentation of this proof. It could be replaced by any generic absolute constant in (0,1)(0,1).

The following lemma says that if, conditional on $S$, the matrix $\frac{1}{\sqrt{m}}SDF$ has the RIP on $\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$, then we have signal recovery.

Lemma \thelemma (RIP of a Subsampled and Preconditioned Matrix Yields Recovery).

Let $\mathcal{V}\subseteq\mathbb{R}^{n}$ be a cone, $F\in\mathbb{C}^{n\times n}$ a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a matrix with all rows in the canonical basis of $\mathbb{R}^{n}$. Let $D\in\mathbb{R}^{n\times n}$ be a diagonal matrix. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$.

If 1mSDF\frac{1}{\sqrt{m}}SDF has the RIP on the cone 𝒯:=𝒱𝒱\mathcal{T}:=\mathcal{V}-\mathcal{V}, then Statement 2.1 holds.

See the proof in Appendix A5.

We now proceed to prove a slightly stronger statement than what is required by the lemma above (RIP of a Subsampled and Preconditioned Matrix Yields Recovery): that the RIP holds on the piecewise linear expansion $\Delta(\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G))\supseteq\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$.

To control the complexity of GG, we count the number of affine pieces that it comprises. We do this with the result [4, Lemma A.6], which we re-write below for convenience.

Lemma \thelemma (Containing the Range of a ReLU Network in a Union of Subspaces).

Let GG be a (k,d,n)(k,d,n)-generative network with layer widths k=k0k1,,kd1kdk=k_{0}\leq k_{1},\ldots,k_{d-1}\leq k_{d} where kd=nk_{d}=n and k¯:=(=1d1k)1/(d1)\bar{k}:=\left(\prod_{\ell=1}^{d-1}k_{\ell}\right)^{1/(d-1)}. Then (G)\operatorname{\mathcal{R}}(G) is a union of no more than NN at-most kk-dimensional polyhedral cones where

logN\displaystyle\log{N} k(d1)log(2ek¯k)kdlog(nk).\displaystyle\leq k(d-1)\log{\left(\frac{2e\bar{k}}{k}\right)}\lesssim kd\log{\left(\frac{n}{k}\right)}.

From this result, we see that $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$ is contained in a union of no more than $N^{2}$ convex cones, each of dimension no more than $2k$. Then from Remark A6.1 (for the proof, see [4, Remark A.2]), the cone $\Delta(\mathcal{T})$ is a union of no more than $N^{2}$ subspaces, each of dimension at most $2k$ (the factor of two will be absorbed into the absolute constant of the statement). Fix $\mathcal{U}\subseteq\Delta(\mathcal{T})$ to be any one of these subspaces. Then the following lemma implies that the matrix $\frac{1}{\sqrt{m}}SDF$ has the RIP on $\mathcal{U}$ with high probability.

Lemma \thelemma (Deviation of Subsampled Preconditioned Unitary Matrix on a Subspace).

Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a random sampling matrix associated with the probability vector $\boldsymbol{p}\in\Delta^{n-1}$. Let $D\in\mathbb{R}^{n\times n}$ be a diagonal pre-conditioning matrix with entries $D_{i,i}=\frac{1}{\sqrt{p_{i}}}$. Let $t>0$. Let $\mathcal{U}\subseteq\mathbb{R}^{n}$ be a subspace of dimension $k$. Then

sup𝒙𝒰𝕊n1|1mSDF𝒙21|μ𝒰(F,𝒑)mlogk+μ𝒰(F,𝒑)mt\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu_{\mathcal{U}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{\log k}+\frac{\mu_{\mathcal{U}}(F,\boldsymbol{p})}{\sqrt{m}}t (1)

with probability at least 12exp(t2)1-2\exp(-t^{2}).

See the proof in Appendix A5.

Since $\mathcal{U}\subseteq\Delta(\mathcal{T})$ we have that $\mu_{\mathcal{U}}(F,\boldsymbol{p})\leq\mu_{\mathcal{T}}(F,\boldsymbol{p})$. Using this fact to upper-bound the r.h.s. of Equation 1, we find an identical concentration inequality that applies to each of the subspaces constituting $\Delta(\mathcal{T})$. By using the lemma above (Containing the Range of a ReLU Network in a Union of Subspaces) to bound the number of subspaces, we control the deviation of $\frac{1}{\sqrt{m}}SDF$ uniformly over all the subspaces constituting $\Delta(\mathcal{T})$ with a union bound. We find that, with probability at least $1-2\exp(-t^{2})$,

sup𝒙Δ(𝒯)𝕊n1|1mSDF𝒙21|\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\left\|SDF\boldsymbol{x}\right\|_{2}-1\right|
μ𝒯(F,𝒑)mlogk+μ𝒯(F,𝒑)mkdlog(nk)+μ𝒯(F,𝒑)mt.\lesssim\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{\log k}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{kd\log\left(\frac{n}{k}\right)}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}t.

For the method by which we applied the union bound, see [4, Lemma A.2]. In the r.h.s. of the equation above, the second term dominates the first, so the expression simplifies to

sup𝒙Δ(𝒯)𝕊n1|1mSDF𝒙21|μ𝒯(F,𝒑)mkdlog(nk)+μ𝒯(F,𝒑)mt.\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{kd\log\left(\frac{n}{k}\right)}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}t. (2)

By fixing t=log(2ε)t=\sqrt{\log\left(\frac{2}{\varepsilon}\right)} we find that the RIP holds with probability at least 1ε1-\varepsilon on 𝒯\mathcal{T} when

mCμ𝒯(F,𝒑)2(kdlog(nk)+log2ε).m\geq C\mu_{\mathcal{T}}(F,\boldsymbol{p})^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right). (3)

Then we find that the first part of Theorem A3.1 follows from the recovery lemma above (RIP of a Subsampled and Preconditioned Matrix Yields Recovery).

The second sufficient condition on mm follows from picking 𝒑\boldsymbol{p} so as to minimize the factor μ𝒯(F,𝒑)2\mu_{\mathcal{T}}(F,\boldsymbol{p})^{2} in Equation 3.

Lemma \thelemma (Adapting the Sampling Scheme to the Model).

Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a random sampling matrix associated with the probability vector $\boldsymbol{p}\in\Delta^{n-1}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$. Then

\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\lVert\boldsymbol{\alpha}\rVert_{2}^{2}}\right)_{j\in[n]}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}(\mu_{\mathcal{T}}(F,\boldsymbol{p})),

and it achieves the value

\mu_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\lVert\boldsymbol{\alpha}\rVert_{2}.

See the proof in Appendix A5.

Applying this lemma to Equation 3 concludes the proof of Theorem A3.1.

Appendix A5 Proof of the Lemmas

Proof of the lemma “RIP of a Subsampled and Preconditioned Matrix Yields Recovery” (Appendix A4).

Let Fn×nF\in\mathbb{C}^{n\times n} be a unitary matrix, and Sm×nS\in\mathbb{R}^{m\times n} a random sampling matrix associated with the probability vector 𝒑Δn1\boldsymbol{p}\in\Delta^{n-1}. Let Dn×nD\in\mathbb{R}^{n\times n} be a diagonal matrix with entries Di,i=1piD_{i,i}=\frac{1}{\sqrt{p_{i}}}. Let D~=SDS+\tilde{D}=SDS^{+}.

We let 𝒃~:=D~𝒃\tilde{\boldsymbol{b}}:=\tilde{D}\boldsymbol{b} and 𝜼~:=D~𝜼\tilde{\boldsymbol{\eta}}:=\tilde{D}\boldsymbol{\eta}. By left-multiplying the equation 𝒃=1mSF𝒙0+𝜼\boldsymbol{b}=\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}+\boldsymbol{\eta} by D~\tilde{D} we get 𝒃~=1mSDF𝒙0+𝜼~\tilde{\boldsymbol{b}}=\frac{1}{\sqrt{m}}SDF\boldsymbol{x}_{0}+\tilde{\boldsymbol{\eta}}. Notice that the linear operator 1mSDF\frac{1}{\sqrt{m}}SDF has the RIP by assumption.

By the near-optimality of $\hat{\boldsymbol{x}}$, the observation that $\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\in\mathcal{T}$, and the triangle inequality,

1mSDF𝒙^𝒃~2\displaystyle\left\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{\boldsymbol{b}}\right\|_{2} min𝒙𝒯1mSDF𝒙𝒃~2+ε^\displaystyle\leq\min_{\boldsymbol{x}\in\mathcal{T}}\left\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}-\tilde{\boldsymbol{b}}\right\|_{2}+\hat{\varepsilon}
1mSDFΠ𝒯𝒙0𝒃~2+ε^\displaystyle\leq\left\|\frac{1}{\sqrt{m}}SDF\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}-\tilde{\boldsymbol{b}}\right\|_{2}+\hat{\varepsilon}
=1mSDF𝒙+𝜼~2+ε^\displaystyle=\left\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}^{\perp}+\tilde{\boldsymbol{\eta}}\right\|_{2}+\hat{\varepsilon}
1mSDF𝒙2+𝜼~2+ε^.\displaystyle\leq\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+\|\tilde{\boldsymbol{\eta}}\|_{2}+\hat{\varepsilon}.

Since 𝒙^,Π𝒯𝒙0𝒯\hat{\boldsymbol{x}},\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\in\mathcal{T}, with the RIP property we find that

1mSDF𝒙^𝒃~2\displaystyle\left\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{\boldsymbol{b}}\right\|_{2} =1mSDF(𝒙^Π𝒯𝒙0)1mSDF(𝒙0Π𝒯𝒙0)𝜼~2\displaystyle=\left\|\frac{1}{\sqrt{m}}SDF\left(\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)-\frac{1}{\sqrt{m}}SDF\left(\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)-\tilde{\boldsymbol{\eta}}\right\|_{2}
1mSDF(𝒙^Π𝒯𝒙0)21mSDF𝒙2𝜼~2\displaystyle\geq\frac{1}{\sqrt{m}}\|SDF\left(\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)\|_{2}-\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}-\|\tilde{\boldsymbol{\eta}}\|_{2}
(113)𝒙^Π𝒯𝒙021mSDF𝒙2𝜼~2.\displaystyle\geq\left(1-\frac{1}{3}\right)\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}-\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}-\|\tilde{\boldsymbol{\eta}}\|_{2}.

Assembling the two inequalities gives

𝒙^Π𝒯𝒙021113[21mSDF𝒙2+2𝜼~2+ε^].\displaystyle\left\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right\|_{2}\leq\frac{1}{1-\frac{1}{3}}\left[2\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+2\|\tilde{\boldsymbol{\eta}}\|_{2}+\hat{\varepsilon}\right].

Finally, we apply triangle inequality to get

𝒙^𝒙02\displaystyle\|\hat{\boldsymbol{x}}-\boldsymbol{x}_{0}\|_{2} 𝒙0Π𝒯𝒙02+𝒙^Π𝒯𝒙02\displaystyle\leq\|\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}+\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}
𝒙2+3mSDF𝒙2+3𝜼~2+32ε^.\displaystyle\leq\|\boldsymbol{x}^{\perp}\|_{2}+\frac{3}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+3\|\tilde{\boldsymbol{\eta}}\|_{2}+\frac{3}{2}\hat{\varepsilon}.

Proof of the lemma “Deviation of Subsampled Preconditioned Unitary Matrix on a Subspace” (Appendix A4).

In what follows, we will use that for all $\boldsymbol{x}\in\mathcal{U}$, $SDF\boldsymbol{x}=SDFP_{\mathcal{U}}^{*}P_{\mathcal{U}}\boldsymbol{x}$, where $P_{\mathcal{U}}\in\mathbb{R}^{k\times n}$ is the matrix with rows chosen to be any fixed orthonormal basis of $\mathcal{U}$. Indeed, notice that $P_{\mathcal{U}}^{*}P_{\mathcal{U}}=\Pi_{\mathcal{U}}\in\mathbb{R}^{n\times n}$, the orthogonal projection onto $\mathcal{U}$. Now consider

(\star):=\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|=\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDFP_{\mathcal{U}}^{*}P_{\mathcal{U}}\boldsymbol{x}\right\|_{2}^{2}-1\right|
=\sup_{\boldsymbol{u}\in\mathbb{S}^{k-1}}\left|\frac{1}{m}\left\|SDFP_{\mathcal{U}}^{*}\boldsymbol{u}\right\|_{2}^{2}-1\right|
=\frac{1}{m}\sup_{\boldsymbol{u}\in\mathbb{S}^{k-1}}\left|\boldsymbol{u}^{*}\left[(SDFP_{\mathcal{U}}^{*})^{*}(SDFP_{\mathcal{U}}^{*})-mI\right]\boldsymbol{u}\right|.

The second equality above follows from the change of variables $P_{\mathcal{U}}\boldsymbol{x}\to\boldsymbol{u}\in\mathbb{R}^{k}$. Since the matrix within the square brackets is Hermitian, the last expression above corresponds to an operator norm.

()\displaystyle(\star) =1mP𝒰FDSSDFP𝒰mI\displaystyle=\frac{1}{m}\left\|P_{\mathcal{U}}F^{*}DS^{*}SDFP_{\mathcal{U}}^{*}-mI\right\| (4)
=1mi=1m[P𝒰FD𝒔i𝒔iDFP𝒰I]\displaystyle=\frac{1}{m}\left\|\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]\right\| (5)

This is a sum of independent random matrices because the sampling matrix $S$ is random and has independent rows. We now state the central ingredient of this proof: the Matrix Bernstein concentration bound [20, Theorem 5.4.1]. We will use it to bound Equation 5.

Lemma \thelemma (Matrix Bernstein).

Let $X_{1},\ldots,X_{N}$ be independent, mean zero, $n\times n$ symmetric random matrices, such that $\|X_{i}\|\leq K$ almost surely for all $i$. Then, for every $t\geq 0$, we have

{i=1NXit}2nexp(t2/2σ2+Kt/3),\mathbb{P}\left\{\left\lVert\sum_{i=1}^{N}X_{i}\right\rVert\geq t\right\}\leq 2n\exp\left(-\frac{t^{2}/2}{\sigma^{2}+Kt/3}\right),

where σ2=i=1N𝔼Xi2\sigma^{2}=\left\|\sum_{i=1}^{N}\mathbb{E}X_{i}^{2}\right\|.

To compute $\sigma^{2}$ and $K$, we notice that we can write

\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]=\sum_{i\in[m]}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I]

for the random vectors 𝒗i:=P𝒰FD𝒔i\boldsymbol{v}_{i}:=P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}. These vectors have two key properties. First, they are isotropic; this is the property that

𝔼[𝒗i𝒗i]\displaystyle\mathbb{E}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}] =𝔼[P𝒰FD𝒔i𝒔iDFP𝒰]\displaystyle=\mathbb{E}[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}]
=P𝒰FD𝔼[𝒔i𝒔i]DFP𝒰=I.\displaystyle=P_{\mathcal{U}}F^{*}D\mathbb{E}[\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}]DFP_{\mathcal{U}}^{*}=I.

The isotropic property gives us immediately that, as required, the matrices {𝒗i𝒗iI}i[m]\{\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I\}_{i\in[m]} are mean-zero.

The second property of the vectors {𝒗i}i[m]\{\boldsymbol{v}_{i}\}_{i\in[m]} is that they have bounded magnitude almost surely.

𝒗i2\displaystyle\|\boldsymbol{v}_{i}\|_{2} =P𝒰(FD𝒔i)2\displaystyle=\|P_{\mathcal{U}}(F^{*}D\boldsymbol{s}_{i})\|_{2}
=1piP𝒰𝒇i2\displaystyle=\frac{1}{\sqrt{p_{i}}}\|P_{\mathcal{U}}\boldsymbol{f}_{i}\|_{2}
=1pisup𝒙𝒰𝕊n1𝒙,𝒇i\displaystyle=\frac{1}{\sqrt{p_{i}}}\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left\langle\boldsymbol{x},\boldsymbol{f}_{i}\right\rangle
μ𝒰(F,𝒑).\displaystyle\leq\mu_{\mathcal{U}}(F,\boldsymbol{p}).

Let $\mu:=\mu_{\mathcal{U}}(F,\boldsymbol{p})$ for conciseness. We proceed to compute a value for $K$. By the triangle inequality and the fact that the operator norm of the rank-one matrix $\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}$ equals $\|\boldsymbol{v}_{i}\|_{2}^{2}$, we see that

𝒗i𝒗iI𝒗i22+12μ2.\|\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I\|\leq\|\boldsymbol{v}_{i}\|_{2}^{2}+1\leq 2\mu^{2}.

The last inequality holds because of the lower bound $\mu^{2}\geq 1$, which we now justify. Fix any one-dimensional subspace $\mathcal{U}_{0}\subseteq\mathcal{U}$ spanned by a unit vector $\hat{\boldsymbol{u}}$. The lemma “Adapting the Sampling Scheme to the Model” gives $\mu_{\mathcal{U}_{0}}(F,\boldsymbol{p})\geq\|\boldsymbol{\alpha}\|_{2}$, where $\boldsymbol{\alpha}$ denotes the local coherences with respect to $\mathcal{U}_{0}$, and for this one-dimensional subspace $\|\boldsymbol{\alpha}\|_{2}=\|F\hat{\boldsymbol{u}}\|_{2}=1$. The desired lower bound then follows by monotonicity of $\mu$ over set containment, since $\mu=\mu_{\mathcal{U}}(F,\boldsymbol{p})\geq\mu_{\mathcal{U}_{0}}(F,\boldsymbol{p})$.

We now compute $\sigma^{2}$, following [3, Lemma 12.21]:

\sigma^{2}=\left\|\sum_{i=1}^{m}\mathbb{E}\left[(\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I)^{2}\right]\right\|=\left\|\sum_{i=1}^{m}\left(\mathbb{E}\left[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\,\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}\right]-I\right)\right\|\leq\sum_{i=1}^{m}\left\|\mathbb{E}\left[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\,\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}\right]-I\right\|\leq\mu^{2}m.

The second equality uses $(\boldsymbol{v}\boldsymbol{v}^{*}-I)^{2}=\lVert\boldsymbol{v}\rVert_{2}^{2}\boldsymbol{v}\boldsymbol{v}^{*}-2\boldsymbol{v}\boldsymbol{v}^{*}+I$ together with the isotropy $\mathbb{E}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}]=I$; the first inequality is the triangle inequality; and the last inequality holds because $0\preceq\mathbb{E}[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}]\preceq\mu^{2}I$, so each summand has operator norm at most $\max(1,\mu^{2}-1)\leq\mu^{2}$.

Then applying the Matrix Bernstein inequality to the $k\times k$ matrices above (so the dimension prefactor is $k$) yields

{i=1m[P𝒰FD𝒔i𝒔iDFP𝒰I]t}2kexp(t2/2μ2m+2μ2t3).\mathbb{P}\left\{\left\|\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]\right\|\geq t\right\}\leq 2k\exp\left(-\frac{t^{2}/2}{\mu^{2}m+2\mu^{2}\frac{t}{3}}\right).

Substituting with Equation 5, we get

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|tm}2kexp(t2/2μ2m+2μ2t3).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|\geq\frac{t}{m}\right\}\leq 2k\exp\left(-\frac{t^{2}/2}{\mu^{2}m+2\mu^{2}\frac{t}{3}}\right).

We would like to state our result in terms of the $\ell_{2}$ norm without the square. For this purpose we make use of the “square-root trick” that can be found in [20, Theorem 3.1.1]. We re-write the above as

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|tm}2kexp(Cmin(t2μ2m,tμ2)).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|\geq\frac{t}{m}\right\}\leq 2k\exp\left(-C\min\left(\frac{t^{2}}{\mu^{2}m},\frac{t}{\mu^{2}}\right)\right).

We make the substitution tmmax(δ,δ2)t\to m\max(\delta,\delta^{2}), which yields

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|max(δ,δ2)}2kexp(Cmδ2μ2).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\|SDF\boldsymbol{x}\|_{2}^{2}-1\right|\geq\max(\delta,\delta^{2})\right\}\leq 2k\exp\left(-C\frac{m\delta^{2}}{\mu^{2}}\right).

Since $|a-1|\geq\delta$ implies $|a^{2}-1|\geq\max(\delta,\delta^{2})$ for all $a,\delta>0$, we infer that

{sup𝒙𝒰𝕊n1|1mSDF𝒙21|δ}2kexp(Cmδ2μ2).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\geq\delta\right\}\leq 2k\exp\left(-C\frac{m\delta^{2}}{\mu^{2}}\right).

Finally, with another substitution $\left(\frac{Cm\delta^{2}}{\mu^{2}}-\log k\right)\to t^{2}$, we conclude that

sup𝒙𝒰𝕊n1|1mSDF𝒙21|μmlogk+μmt\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu}{\sqrt{m}}\sqrt{\log k}+\frac{\mu}{\sqrt{m}}t

with probability at least 12exp(t2)1-2\exp(-t^{2}). ∎

Proof of the lemma “Adapting the Sampling Scheme to the Model” (Appendix A4).

It suffices to show that 𝒑:=(αj2𝜶22)j[n]\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\|\boldsymbol{\alpha}\|_{2}^{2}}\right)_{j\in[n]} satisfies

𝒑argmin𝒑Δn1μ𝒯2(F,𝒑)=argmin𝒑Δn1maxj[n]αj2pj.\boldsymbol{p}^{*}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p})=\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}.

The vector 𝒑\boldsymbol{p}^{*} achieves a value of

μ𝒯2(F,𝒑)=𝜶22\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\|\boldsymbol{\alpha}\|_{2}^{2}

which is the minimum. Indeed, for any fixed vector 𝒑Δn1\boldsymbol{p}\in\Delta^{n-1},

maxj[n]αj2pjj=1najαj2pj𝒂Δn1.\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}\geq\sum_{j=1}^{n}a_{j}\frac{\alpha_{j}^{2}}{p_{j}}\quad\forall\boldsymbol{a}\in\Delta^{n-1}.

The inequality above holds because the r.h.s. is a convex combination of the terms {αj2pj}j[n]\left\{\frac{\alpha_{j}^{2}}{p_{j}}\right\}_{j\in[n]}, and is therefore upper-bounded by the maximum element of the combination. By letting 𝒂=𝒑\boldsymbol{a}=\boldsymbol{p}, we get

maxj[n]αj2pj𝜶22.\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}\geq\|\boldsymbol{\alpha}\|_{2}^{2}.

Therefore, 𝒑argmin𝒑Δn1μ𝒯2(F,𝒑)\boldsymbol{p}^{*}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p}). ∎
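As a quick numerical sanity check of this lemma (our own illustration, not part of the paper): for a generic coherence vector, the adapted probabilities $\boldsymbol{p}^{*}$ attain $\mu_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\lVert\boldsymbol{\alpha}\rVert_{2}$, and randomly chosen probability vectors never do better.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = rng.random(100)                              # a generic coherence vector (illustration only)
p_star = alpha**2 / np.sum(alpha**2)                 # adapted sampling probabilities p*
mu_star = np.max(alpha / np.sqrt(p_star))            # mu attained by p*
assert np.isclose(mu_star, np.linalg.norm(alpha))    # equals ||alpha||_2

for _ in range(1000):                                # random competitors are never better
    p = rng.random(100); p /= p.sum()
    assert np.max(alpha / np.sqrt(p)) >= mu_star - 1e-9
```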

Appendix A6 Properties of the Piecewise Linear Expansion

The following is a subset of the properties listed in [4, Remark A.2], to which we refer the reader for the proof.

Remark A6.1 (Properties of the Piecewise Linear Expansion).

Below we list several properties of $\Delta$. Let $\mathcal{C}=\bigcup_{i=1}^{N}\mathcal{C}_{i}$ be the union of $N\in\mathbb{N}$ convex cones $\mathcal{C}_{i}$.

  1. The set $\Delta(\mathcal{C})$ is uniquely defined. In particular, it is independent of the (finite) decomposition of $\mathcal{C}$ into convex cones.

  2. If $\max_{i\in[N]}\dim\mathcal{C}_{i}\leq k$, then $\Delta(\mathcal{C})$ is a union of no more than $N$ at-most $k$-dimensional linear subspaces.

  3. The set $\Delta(\mathcal{C})$ satisfies $\mathcal{C}\subseteq\Delta(\mathcal{C})\subseteq\mathcal{C}-\mathcal{C}$.

  4. There are choices of $\mathcal{C}$ for which $\mathcal{C}\subsetneq\Delta(\mathcal{C})$ (for instance, if $\mathcal{C}$ is the single ray $\{t\boldsymbol{e}_{1}:t\geq 0\}$, then $\Delta(\mathcal{C})=\operatorname{\mathrm{span}}(\boldsymbol{e}_{1})\supsetneq\mathcal{C}$).

Appendix A7 Experimental Specifications

CelebA with RealnessGAN

CelebFaces Attributes Dataset (CelebA) is a dataset with over 200,000 celebrity face images [12]. We train a model on most images of the CelebA dataset, leaving out 2000 images to form a validation set. We crop the colour images to 256 by 256, giving 256 × 256 × 3 = 196608 pixels per image. On this dataset, we train a RealnessGAN with the same training setup as described in [21], substituting the last Tanh layer with HardTanh, a piecewise-linear version of Tanh, to fit our theoretical framework. See [21] for more training and architecture details.

Appendix A8 Additional Image Recoveries

Figure 3: In-range signal recovery of images using uniform sampling and model-adapted sampling. The sampling rate on the bottom is computed as the ratio between the number of measurements and the number of pixels.
Figure 4: Out-of-range signal recovery of images using uniform sampling and model-adapted sampling.