

Model-adapted Fourier sampling for generative compressed sensing

Aaron Berk, Simone Brugiapaglia, Yaniv Plan, Matthew Scott, Xia Sheng, Özgür Yılmaz

Aaron Berk: Dept. Mathematics and Statistics, McGill University, Montréal, Canada; now affiliated with Deep Render, London, UK. Simone Brugiapaglia: Dept. Mathematics and Statistics, Concordia University, Montréal, Canada. Yaniv Plan, Matthew Scott, Xia Sheng and Özgür Yılmaz: Dept. Mathematics, University of British Columbia, Vancouver, Canada. Corresponding author: matthewscott@math.ubc.ca. Authors are listed in alphabetical order. MS was primarily responsible for developing the theory and writing the main body; XS was primarily responsible for the numerical experiments and writing the numerical section.
Abstract

We study generative compressed sensing when the measurement matrix is randomly subsampled from a unitary matrix (with the DFT as an important special case). It was recently shown that $\mathcal{O}(kdn\|\boldsymbol{\alpha}\|_{\infty}^{2})$ uniformly random Fourier measurements are sufficient to recover signals in the range of a neural network $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$ of depth $d$, where each component of the so-called local coherence vector $\boldsymbol{\alpha}$ quantifies the alignment of a corresponding Fourier vector with the range of $G$. We construct a model-adapted sampling strategy with an improved sample complexity of $\mathcal{O}(kd\|\boldsymbol{\alpha}\|_{2}^{2})$ measurements. This is enabled by: (1) new theoretical recovery guarantees that we develop for nonuniformly random sampling distributions and then (2) optimizing the sampling distribution to minimize the number of measurements needed for these guarantees. This development offers a sample complexity applicable to natural signal classes, which are often almost maximally coherent with low Fourier frequencies. Finally, we consider a surrogate sampling scheme, and validate its performance in recovery experiments using the CelebA dataset.

1 Introduction

Compressed sensing considers signals $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ with high ambient dimension $n$ that belong to (or can be well approximated by elements of) a prior set $\mathcal{V}\subseteq\mathbb{R}^{n}$ with lower “complexity” than the ambient space. The aim is to recover (an approximation to) such signals with provable accuracy guarantees from the linear, typically noisy, measurements $\boldsymbol{b}=A\boldsymbol{x}_{0}+\boldsymbol{\eta}$, where $\boldsymbol{\eta}\in\mathbb{C}^{m}$ denotes noise and $A\in\mathbb{C}^{m\times n}$ with $m\ll n$ is an appropriately chosen (possibly random) measurement matrix. The signal is to be recovered by means of a computationally feasible method that utilizes the structure of $\mathcal{V}$ and has access only to $A$ and $\boldsymbol{b}$. In classical compressed sensing, $\mathcal{V}$ is the set of sparse vectors. In generative compressed sensing, the prior set $\mathcal{V}$ is chosen to be the range of a generative neural network $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$, an idea that was first explored in [5]. Our results hold for ReLU-activated neural networks as defined in [4, Definition I.2]. Here we denote the ReLU (Rectified Linear Unit) activation as $\sigma(z):=\max\{z,0\}$, applied component-wise to a real vector $\boldsymbol{z}$.

Definition 1.1 ((k,d,n)(k,d,n)-Generative Network).

With $k,d,n\in\mathbb{N}$, fix the integers $2\leq k:=k_{0}\leq k_{1},\ldots,k_{d-1}\leq k_{d}=n$, and for $i\in[d]$, let $W^{(i)}\in\mathbb{R}^{k_{i}\times k_{i-1}}$. A $(k,d,n)$-generative network is a function $G:\mathbb{R}^{k}\to\mathbb{R}^{n}$ of the form $G(\boldsymbol{z}):=W^{(d)}\sigma\left(\cdots W^{(2)}\sigma\left(W^{(1)}\boldsymbol{z}\right)\right)$.

In the same work [5], the authors provided a theoretical framework for generative compressed sensing with $A\in\mathbb{R}^{m\times n}$ having independent identically distributed (i.i.d.) Gaussian entries. However, the assumption of Gaussian measurements is unrealistic for applications like MRI, where measurements are spatial Fourier transforms. Limitations of the hardware in such an application restrict the set of possible measurements to the rows of a fixed unitary matrix (or approximately so, up to discretization of the measurements). Hence we consider the more realistic subsampled unitary measurement matrices, i.e., matrices with rows randomly subsampled from the rows of a unitary matrix $F\in\mathbb{C}^{n\times n}$.

A first result in the generative setting with subsampled unitary measurements [4] shows that for well-behaved networks, $m=\mathcal{O}(k^{2}d^{2})$ (up to log factors) measurements suffice for recovery with high probability. However, sampling uniformly is not efficient; sampling more informative measurements at a higher rate is known to improve performance in compressed sensing (e.g., low Fourier frequencies are strongly correlated with natural images and therefore tend to be more informative) [3]. This idea is mirrored in the radial sampling strategy used in MRI scans, which takes a disproportionate number of low-frequency measurements [15]. We address this limitation by generalizing the theory to any measurement matrix of the form $A=SF$, where $F\in\mathbb{C}^{n\times n}$ is a unitary matrix and $S\in\mathbb{R}^{m\times n}$ is a sampling matrix, which we now define. We adopt the convention that $\boldsymbol{e}_{j}$ denotes a canonical basis vector and that $\Delta^{n-1}$ is the simplex in $\mathbb{R}^{n}$. See the Notation paragraph at the end of this section for the definition of symbols used throughout this paper.

Definition 1.2 (Sampling Matrix).

We define a sampling matrix to be any matrix $S\in\mathbb{R}^{m\times n}$ composed of i.i.d. row vectors $\boldsymbol{s}^{*}_{1},\dots,\boldsymbol{s}^{*}_{m}$ such that $\mathbb{P}(\boldsymbol{s}_{i}=\boldsymbol{e}_{j})=p_{j}$ for all $i\in[m],j\in[n]$, for some fixed probability vector $\boldsymbol{p}\in\Delta^{n-1}$.
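For concreteness, here is a minimal NumPy sketch of Definition 1.2: it draws a sampling matrix $S$ from a probability vector $\boldsymbol{p}$ and forms the subsampled measurements $\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}$ with the unitary DFT. The function name, sizes, and the uniform choice of $\boldsymbol{p}$ are our own illustration, not code from the paper.

```python
import numpy as np

def draw_sampling_matrix(p, m, rng=None):
    """Draw an m x n sampling matrix whose rows are i.i.d., equal to e_j with probability p_j."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(p)
    rows = rng.choice(n, size=m, replace=True, p=p)  # i.i.d. sampled row indices
    S = np.zeros((m, n))
    S[np.arange(m), rows] = 1.0                      # each row of S is a canonical basis vector
    return S, rows

# Example: subsample m rows of the unitary DFT matrix F (so that F^* F = I).
n, m = 64, 16
F = np.fft.fft(np.eye(n), norm="ortho")
p = np.full(n, 1.0 / n)                              # uniform sampling as a baseline
S, rows = draw_sampling_matrix(p, m)
x0 = np.random.randn(n)
b = (S @ F @ x0) / np.sqrt(m)                        # noiseless measurements b = (1/sqrt(m)) S F x0
```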

We will further show, similarly to [18], that picking the sampling probabilities in a manner informed by the geometry of the problem yields improved recovery guarantees relative to the uniform case. Specifically, we provide a bound on the measurement complexity of $m=\mathcal{O}(k^{2}d^{2})$ even for models which are highly aligned with a small subset of the rows of $F$. To find good sampling probabilities, observe that measurements (i.e., the rows of the measurement matrix) are effective for a prior set $\mathcal{V}\subseteq\mathbb{R}^{n}$ if they help differentiate between signals in $\mathcal{V}$. Therefore, we consider the alignment of the rows of $F$ with the set $\mathcal{V}-\mathcal{V}$ (where the difference is in the sense of a Minkowski sum, see the Notation paragraph), and we sample rows of $F$ that have a high degree of alignment with $\mathcal{V}-\mathcal{V}$ at a higher rate. For technical reasons, we consider alignment with a slightly larger set, given by the following set operator previously introduced in [4, Definition 2.1].

Definition 1.3 (Piecewise Linear Expansion).

Let $\mathcal{C}\subseteq\mathbb{R}^{n}$ be the union of $N$ convex cones: $\mathcal{C}=\bigcup_{i=1}^{N}\mathcal{C}_{i}$. Define the piecewise linear expansion $\Delta(\mathcal{C}):=\bigcup_{i=1}^{N}\operatorname{\mathrm{span}}(\mathcal{C}_{i})=\bigcup_{i=1}^{N}(\mathcal{C}_{i}-\mathcal{C}_{i})$.

The piecewise linear expansion is a well-defined set operator as shown in [4, Remark A.2]. Specifically, it is independent of the choice of convex cones 𝒞i\mathcal{C}_{i}.

We require the following quantity, which measures the alignment of individual measurement vectors with the prior set.

Definition 1.4 (Local Coherence).

The local coherence of a vector $\boldsymbol{\phi}\in\mathbb{C}^{n}$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$ is defined as

\alpha_{\mathcal{T}}(\boldsymbol{\phi}):=\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\lvert\langle\boldsymbol{\phi},\boldsymbol{x}\rangle\rvert.

The local coherences of a unitary matrix $F\in\mathbb{C}^{n\times n}$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$ are collected in the vector $\boldsymbol{\alpha}$ with entries $\alpha_{j}:=\alpha_{\mathcal{T}}(\boldsymbol{f}_{j})$, where $\boldsymbol{f}^{*}_{j}$ is the $j$th row of $F$.
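When $\Delta(\mathcal{T})$ is a single $k$-dimensional subspace of $\mathbb{R}^{n}$, the supremum in Definition 1.4 can be computed exactly: for $\boldsymbol{x}=U\boldsymbol{c}$ with real unit $\boldsymbol{c}$, one has $|\langle\boldsymbol{f}_{j},\boldsymbol{x}\rangle|^{2}=(\mathrm{Re}(\boldsymbol{f}_{j}^{*}U)\boldsymbol{c})^{2}+(\mathrm{Im}(\boldsymbol{f}_{j}^{*}U)\boldsymbol{c})^{2}$, so the supremum is the largest singular value of the $2\times k$ matrix stacking the real and imaginary parts of $\boldsymbol{f}_{j}^{*}U$. The following sketch (our own illustration; the helper name and the random subspace are not from the paper) does this for each row of the unitary DFT.

```python
import numpy as np

def local_coherence_subspace(F, U):
    """Local coherences of the rows f_j^* of a unitary F w.r.t. a real k-dim subspace.

    U: real (n, k) matrix whose columns form an orthonormal basis of the subspace.
    """
    A = F @ U                                     # row j equals f_j^* U, shape (n, k), complex
    M = np.stack([A.real, A.imag], axis=1)        # shape (n, 2, k)
    return np.linalg.norm(M, ord=2, axis=(1, 2))  # largest singular value of each 2 x k block

# Example: coherences of the DFT rows w.r.t. a random 4-dimensional subspace of R^64.
n, k = 64, 4
F = np.fft.fft(np.eye(n), norm="ortho")
U, _ = np.linalg.qr(np.random.randn(n, k))
alpha = local_coherence_subspace(F, U)
print(alpha.max(), np.linalg.norm(alpha) ** 2)    # each alpha_j <= 1, and ||alpha||_2^2 <= k
```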

By using the local coherences of $F$ to inform the sampling probabilities, we will show that recovery occurs from only $m=\mathcal{O}(kd\lVert\boldsymbol{\alpha}\rVert_{2}^{2})$ (up to log factors) measurements with high probability.

Figure 1: Samples recovered by different sampling schemes with different sampling rates.

Prior Work

Sampling with non-uniform probabilities has been the subject of a line of research in classical compressed sensing. Seminal works involving the notion of local coherence in classical compressed sensing are [10, 11, 18, 19]. Quantities analogous to the local coherence have appeared in the literature, such as Christoffel functions [16] and leverage scores [7, 14], which have recently been introduced into machine learning in the context of, e.g., kernel-based methods [8] and deep learning [1].

While writing this manuscript, we became aware of the recent paper [2]. This work presents a framework for optimizing sampling in general scenarios based on the so-called generalized Christoffel function, a quantity that admits the (squared) local coherence considered here as a particular case. Furthermore, the method we use to numerically approximate the coherence in Section 3 can be seen as a special case of the one proposed in [2]. However, the results in [2] assume that $\mathcal{V}-\mathcal{V}$ is a union of low-dimensional subspaces. Hence, they are not directly applicable to the case of generative compressed sensing with ReLU networks, for which $\mathcal{V}-\mathcal{V}$ is in general only contained in a union of low-dimensional subspaces. Our theory, on the other hand, explicitly covers the case of generative compressed sensing and illustrates how sufficient conditions on $m$ leading to successful signal recovery depend on the generative network’s parameters $(k,d,n)$. In addition, we provide recovery guarantees that hold with high probability, as opposed to in expectation.

The present work directly improves on results from [4] by improving the sample complexity from $n\lVert\boldsymbol{\alpha}\rVert_{\infty}^{2}$ to $\lVert\boldsymbol{\alpha}\rVert_{2}^{2}$ when the sampling probabilities are adapted to the generative model used. This is a sizable improvement in performance guarantees for a significant class of realistic generative models. It can be understood as extending the theory of generative compressed sensing with Fourier measurements to many realistic settings, in which we provide nearly order-optimal bounds on the sample complexity. Indeed, in the context of the main result from [4], “favourable coherence” corresponds to $\lVert\boldsymbol{\alpha}\rVert_{\infty}\leq C\sqrt{kd/n}$ where $C$ is an absolute constant. Although the prior generated by a neural network with Gaussian weights has such a coherence [4], we observe empirically in Figure 2d) that for a trained generative model, the local coherences of a small number of Fourier frequencies are close to one. In such cases, the main result from [4] becomes vacuous while Theorem 2.1 remains meaningful.
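To make the gap concrete, the following sketch compares the factor $n\lVert\boldsymbol{\alpha}\rVert_{\infty}^{2}$ governing uniform sampling in [4] with the factor $\lVert\boldsymbol{\alpha}\rVert_{2}^{2}$ governing model-adapted sampling in Theorem 2.1, using a synthetic coherence profile of our own choosing (not data from the paper) in which a handful of low frequencies are nearly maximally coherent.

```python
import numpy as np

# Synthetic coherence profile (illustration only): a few low frequencies have
# alpha_j close to 1, and the remaining coherences decay like 1/j.
n = 256 ** 2
alpha = np.minimum(1.0, 5.0 / (1.0 + np.arange(n)))

uniform_factor = n * np.max(alpha) ** 2   # governs the uniform-sampling bound of [4]
adapted_factor = np.sum(alpha ** 2)       # governs the adapted-sampling bound of Theorem 2.1
print(uniform_factor, adapted_factor)     # ~6.6e4 versus ~10; the adapted factor does not scale with n
```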

Notation

For any map $f$, we denote by $\operatorname{\mathcal{R}}(f)$ the range of $f$. We define the simplex $\Delta^{n-1}:=\{\boldsymbol{p}\in\mathbb{R}^{n}:p_{i}\geq 0,\sum_{i=1}^{n}p_{i}=1\}$ and the sphere $\mathbb{S}^{n-1}:=\{\boldsymbol{x}\in\mathbb{R}^{n}:\|\boldsymbol{x}\|_{2}=1\}$. For any matrix $A\in\mathbb{C}^{m\times n}$, we denote its pseudo-inverse by $A^{+}\in\mathbb{C}^{n\times m}$. For a set $\mathcal{V}\subseteq\mathbb{R}^{n}$ we denote its self-difference $\mathcal{V}-\mathcal{V}:=\{\boldsymbol{v}_{1}-\boldsymbol{v}_{2}\,|\,\boldsymbol{v}_{1},\boldsymbol{v}_{2}\in\mathcal{V}\}$. We define $\operatorname{\Pi}_{\mathcal{T}}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to be the orthogonal projection onto a set $\mathcal{T}\subseteq\mathbb{R}^{n}$ in the sense that $\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}$ is a single element from the set $\operatorname*{\arg\min}_{\boldsymbol{t}\in\mathcal{T}}\|\boldsymbol{x}-\boldsymbol{t}\|_{2}$. We let $[l]=\{1,\ldots,l\}$. We denote by $\langle\cdot,\cdot\rangle$ the canonical inner product in $\mathbb{R}^{n}$ or $\mathbb{C}^{n}$, depending on the context.

2 Main Result

We now state the main result of this paper, which gives an upper bound on the sample complexity required for signal recovery. The accuracy of the recovery depends on the measurement noise, the modelling error (how far the signal is from the prior), and imperfect optimization. These are denoted, respectively, by $\boldsymbol{\eta}$, $\boldsymbol{x}^{\perp}$ and $\hat{\varepsilon}$ below.

Theorem 2.1.

Fix a $(k,d,n)$-generative network $G$, the cone $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)\subseteq\mathbb{R}^{n}$, and a unitary matrix $F\in\mathbb{C}^{n\times n}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Let $\boldsymbol{p}:=(\alpha_{j}^{2}/\lVert\boldsymbol{\alpha}\rVert_{2}^{2})_{j\in[n]}\in\Delta^{n-1}$, with $S\in\mathbb{R}^{m\times n}$ the corresponding random sampling matrix. Let $D\in\mathbb{R}^{n\times n}$ be the diagonal matrix with entries $D_{i,i}=1/\sqrt{p_{i}}$. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$. Let $\varepsilon>0$. If

m\geq C\|\boldsymbol{\alpha}\|_{2}^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\left(\frac{2}{\varepsilon}\right)\right)

for CC an absolute constant, then with probability at least 1ε1-\varepsilon over the realization of SS the following statement holds.

Statement 2.1.

For any choice of $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ and $\boldsymbol{\eta}\in\mathbb{C}^{m}$, let $\boldsymbol{b}:=\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}+\boldsymbol{\eta}$ and $\boldsymbol{x}^{\perp}:=\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}$. Then, for any $\hat{\boldsymbol{x}}\in\mathcal{T}$ and $\hat{\varepsilon}>0$ satisfying $\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{D}\boldsymbol{b}\|_{2}\leq\min_{\boldsymbol{x}\in\mathcal{T}}\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}-\tilde{D}\boldsymbol{b}\|_{2}+\hat{\varepsilon}$, we have that

\displaystyle\|\hat{\boldsymbol{x}}-\boldsymbol{x}_{0}\|_{2}\leq\|\boldsymbol{x}^{\perp}\|_{2}+\frac{3}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+3\|\tilde{D}\boldsymbol{\eta}\|_{2}+\frac{3}{2}\hat{\varepsilon}.
Remark 2.1.

The matrix $\tilde{D}$ has a vector of diagonal entries $\text{Diag}(\tilde{D})$ satisfying $\text{Diag}(\tilde{D})=S\,\text{Diag}(D)$. ∎
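As a concrete illustration (ours, not the paper's code): when the sampled indices are distinct, $S$ has orthonormal rows, so $S^{+}=S^{*}$ and $\tilde{D}=SDS^{*}$ is the diagonal matrix that rescales the $i$th measurement by $1/\sqrt{p_{j_{i}}}$, where $j_{i}$ is the row index sampled in row $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 32, 8
p = rng.random(n); p /= p.sum()                       # sampling probabilities
rows = rng.choice(n, size=m, replace=False, p=p)      # distinct sampled indices j_1, ..., j_m
S = np.zeros((m, n)); S[np.arange(m), rows] = 1.0     # sampling matrix
D = np.diag(1.0 / np.sqrt(p))                         # preconditioner on R^n
D_tilde = S @ D @ np.linalg.pinv(S)                   # m x m preconditioner D~ = S D S^+
assert np.allclose(np.diag(D_tilde), S @ np.diag(D))  # Diag(D~) = S Diag(D)
assert np.allclose(D_tilde, np.diag(1.0 / np.sqrt(p[rows])))
```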

Remark 2.2.

We give a generalization of this result to arbitrary sampling probability vectors in Theorem A3.1. ∎

3 Numerics

In this section, we provide empirical evidence of the connection between coherence-based non-uniform sampling and recovery error. By presenting visual (Figure 1) and quantitative (Figure 2) evidence, we validate that model-adapted sampling using a coherence-informed probability vector can outperform a uniform sampling scheme — requiring fewer measurements for successful recovery.

Figure 2: In (a) we show the reconstruction error as a function of the number of measurements for out-of-range signals. In (b) we present the same quantity for in-range signals. In (c) we treat signal recoveries with relative error less than $3\times 10^{-3}$ as “successful”, and display the proportion of successful recoveries out of 64 attempts on in-range signals. In (d) we plot the local coherences for one channel of one image. Observe that the local coherences take on values spanning many orders of magnitude and peak sharply at a small number of frequencies.

Coherence heuristic

Ideally, we would compute the local coherence $\boldsymbol{\alpha}$ using Definition 1.4, but to our knowledge computing the local coherence is intractable for generative models relevant to practical settings [4]. Thus, we approximate this quantity by sampling points from the range of the generative model and computing the local coherence from the sampled points instead. Specifically, we sample codes from the latent space of the generative model to generate a batch of images with shape $(B,C,H,W)$, where $B,C,H,W$ stand for batch size, number of channels, image height and image width, respectively. Then, we compute the set self-difference of the image batch and normalize each difference vector. This gives a tensor of shape $(B^{2},C,H,W)$. We perform a channel-wise two-dimensional Discrete Fourier Transform (DFT) on this tensor, take the element-wise modulus, and then maximize over the batch dimension. This results in a coherence tensor $a_{0}$ with shape $(C,H,W)$. To obtain the coherence-informed probability vector for each channel, we first square element-wise and then normalize channel-wise. To estimate the local coherences of our generative model we use a batch size of 5000 and employ the DFT from PyTorch [17].
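A minimal PyTorch sketch of this heuristic is given below. The function and variable names are our own, the generator `G` and its latent dimension are placeholders, and we use a small default batch size; the batch of 5000 used in the paper would require chunking the self-difference computation.

```python
import torch

def coherence_heuristic(G, latent_dim, batch_size=64, device="cpu"):
    """Approximate local coherences of the 2D DFT w.r.t. range(G) - range(G).

    Returns an estimated coherence tensor of shape (C, H, W) and the channel-wise
    sampling probabilities obtained by squaring and normalizing per channel.
    """
    with torch.no_grad():
        z = torch.randn(batch_size, latent_dim, device=device)
        imgs = G(z)                                          # (B, C, H, W)
        diff = imgs.unsqueeze(0) - imgs.unsqueeze(1)         # set self-difference, (B, B, C, H, W)
        diff = diff.reshape(-1, *imgs.shape[1:])             # (B^2, C, H, W)
        norms = diff.flatten(1).norm(dim=1).clamp_min(1e-12)
        diff = diff / norms.view(-1, 1, 1, 1)                # normalize each difference vector
        spectra = torch.fft.fft2(diff, norm="ortho").abs()   # channel-wise 2D DFT, modulus
        alpha = spectra.amax(dim=0)                          # max over the batch dimension -> (C, H, W)
    probs = alpha ** 2                                       # square element-wise
    probs = probs / probs.flatten(1).sum(dim=1).view(-1, 1, 1)  # normalize channel-wise
    return alpha, probs
```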

In-Range vs Out-of-Range Signals

We run signal recovery experiments for two kinds of images: signals conditioned to be in the range of the generative model (in-range signals), and signals picked directly from the validation set (out-of-range signals). The in-range signals are randomly generated (with Gaussian codes) by the same generative network that we use for recovery. This ensures that these signals lie within the prior set. In the out-of-range setting, a residual recovery error can be observed even with large numbers of measurements (Figure 2). This error occurs because of the so-called model mismatch: there is some distance between the prior set and the signals.

Procedure for Signal Recovery

We perform signal recovery as follows. For a given image $\boldsymbol{x}_{0}\in\mathbb{R}^{C\times H\times W}$, we create a mask $M\in\{0,1\}^{C\times H\times W}$ by randomly sampling with replacement $m$ times for each channel according to the probability vector. Let $F$ be the channel-wise DFT operator and $G:\mathbb{R}^{k}\to\mathbb{R}^{C\times H\times W}$ be the generative neural network (where we omit the batch dimension for simplicity). We denote by $\odot$ the element-wise tensor multiplication and by $\|\cdot\|_{F}$ the Frobenius norm. We approximately solve the optimization program $\hat{\boldsymbol{z}}\in\operatorname*{\arg\min}_{\boldsymbol{z}\in\mathbb{R}^{k}}\|M\odot F\boldsymbol{x}_{0}-M\odot FG(\boldsymbol{z})\|_{F}^{2}$ by running AdamW [13] with $lr=0.003$, $\beta_{1}=0.9$ and $\beta_{2}=0.999$ for 20000 iterations on four different random initializations, and pick the code that achieves the lowest loss. The recovered signal is then $\hat{\boldsymbol{x}}:=G(\hat{\boldsymbol{z}})$. We measure the quality of the signal recovery by the relative recovery error (rre), $\operatorname{rre}(\boldsymbol{x}_{0},\hat{\boldsymbol{x}})=\frac{\|\boldsymbol{x}_{0}-\hat{\boldsymbol{x}}\|_{2}}{\|\boldsymbol{x}_{0}\|_{2}}$.
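A condensed PyTorch sketch of this recovery loop follows, using the hyperparameters stated above. The names are our own simplification; we show a single random initialization (the experiments use four) and, as in the experiments, omit the preconditioner $\tilde{D}$.

```python
import torch

def recover(G, x0, probs, m, latent_dim, iters=20000, lr=3e-3, device="cpu"):
    """Recover an image from m masked Fourier measurements per channel.

    probs: (C, H, W) channel-wise sampling probabilities (e.g., from the coherence heuristic).
    Returns the recovered image G(z_hat) and the relative recovery error.
    """
    x0 = x0.to(device)
    probs = probs.to(device)
    C, H, W = x0.shape
    # Build the 0/1 mask M by sampling m frequencies per channel with replacement.
    mask = torch.zeros(C, H * W, device=device)
    for c in range(C):
        idx = torch.multinomial(probs[c].flatten(), m, replacement=True)
        mask[c, idx] = 1.0
    mask = mask.view(C, H, W)

    Fx0 = torch.fft.fft2(x0, norm="ortho")                   # channel-wise DFT of the target
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.AdamW([z], lr=lr, betas=(0.9, 0.999))
    for _ in range(iters):
        opt.zero_grad()
        Fx = torch.fft.fft2(G(z).squeeze(0), norm="ortho")
        loss = (mask * (Fx0 - Fx)).abs().pow(2).sum()        # masked squared Frobenius norm
        loss.backward()
        opt.step()

    x_hat = G(z).squeeze(0).detach()
    rre = ((x0 - x_hat).norm() / x0.norm()).item()
    return x_hat, rre
```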

Figure 2b) demonstrates the efficiency of model-adapted sampling: signal recovery with adapted sampling occurs with 16 times fewer measurements than with uniform sampling. Similar performance gains can be observed visually in Figure 1 and Figure 3. Comparing the number of measurements to the ambient dimension, we see from Figure 2 that signal recovery occurs with $\frac{m}{n}\approx 2^{7}/256^{2}\approx 0.2\%$.

There are a few ways in which these numerical experiments do not directly match our theory. The sampling is done channel-wise, which is technically block sampling [3]. Also, the signal recovery is performed without the preconditioning factor $\tilde{D}$ that appears in Theorem 2.1.

4 Conclusion

In this paper we bring together the ideas used to quantify the compatibility of generative models with subsampled unitary measurements, which were first explored in [4], with ideas of non-uniform sampling from classical compressed sensing. We present the first theoretical result applying coherence-based sampling (similar to leverage score sampling, or Christoffel function sampling) to the setting where the prior is a ReLU generative neural network. We find that adapting the sampling scheme to the geometry of the problem yields substantially improved sampling complexities for many realistic generative networks, and that this improvement is significant in empirical experiments.

Possible avenues for future research include extending the theory presented in [2] to ReLU nets by using methods introduced in the present work. This would yield the benefit of extending the theory from this paper to a number of realistic sampling schemes. A second research direction consists of investigating the optimality of the sample complexity bound that we present in this paper. The sample complexity that we guarantee includes a factor of $k^{2}d^{2}$ when the generative model is well-behaved. Whether this dependence can be reduced to $kd$, as is the case when the measurement matrix is Gaussian, is an interesting problem that remains open. Finally, the class of neural networks considered in this work could be expanded to include more realistic ones.


References

  • [1] Ben Adcock, Juan M Cardenas and Nick Dexter “CAS4DL: Christoffel adaptive sampling for function approximation via deep learning” In Sampling Theory, Signal Processing, and Data Analysis 20.2 Springer, 2022, pp. 21
  • [2] Ben Adcock, Juan M. Cardenas and Nick Dexter “CS4ML: A General Framework for Active Learning with Arbitrary Data Based on Christoffel Functions” arXiv, 2023 arXiv:2306.00945 [cs, math]
  • [3] Ben Adcock and Anders C. Hansen “Compressive Imaging: Structure, Sampling, Learning” Cambridge: Cambridge University Press, 2021 DOI: 10.1017/9781108377447
  • [4] Aaron Berk et al. “A Coherence Parameter Characterizing Generative Compressed Sensing with Fourier Measurements” In IEEE Journal on Selected Areas in Information Theory, 2022, pp. 1–1 DOI: 10.1109/JSAIT.2022.3220196
  • [5] Ashish Bora, Ajil Jalal, Eric Price and Alexandros G. Dimakis “Compressed Sensing Using Generative Models” In Proceedings of the 34th International Conference on Machine Learning PMLR, 2017, pp. 537–546
  • [6] E.J. Candes, J. Romberg and T. Tao “Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information” In IEEE Transactions on Information Theory 52.2, 2006, pp. 489–509 DOI: 10.1109/TIT.2005.862083
  • [7] Samprit Chatterjee and Ali S Hadi “Influential observations, high leverage points, and outliers in linear regression” In Statistical science JSTOR, 1986, pp. 379–393
  • [8] Tamás Erdélyi, Cameron Musco and Christopher Musco “Fourier sparse leverage scores and approximate kernel learning” In Advances in Neural Information Processing Systems 33, 2020, pp. 109–122
  • [9] Simon Foucart and Holger Rauhut “A Mathematical Introduction to Compressive Sensing” Springer New York, 2013
  • [10] Felix Krahmer, Holger Rauhut and Rachel Ward “Local Coherence Sampling in Compressed Sensing” In Proceedings of the 10th International Conference on Sampling Theory and Applications
  • [11] Felix Krahmer and Rachel Ward “Stable and Robust Sampling Strategies for Compressive Imaging” In IEEE Transactions on Image Processing 23.2, 2014, pp. 612–622 DOI: 10.1109/TIP.2013.2288004
  • [12] Ziwei Liu, Ping Luo, Xiaogang Wang and Xiaoou Tang “Deep Learning Face Attributes in the Wild” In Proceedings of International Conference on Computer Vision (ICCV), 2015
  • [13] Ilya Loshchilov and Frank Hutter “Decoupled Weight Decay Regularization”, 2019 arXiv:1711.05101 [cs.LG]
  • [14] Ping Ma, Michael Mahoney and Bin Yu “A statistical perspective on algorithmic leveraging” In International conference on machine learning, 2014, pp. 91–99 PMLR
  • [15] Maria Murad et al. “Radial Undersampling-Based Interpolation Scheme for Multislice CSMRI Reconstruction Techniques” In BioMed Research International 2021, 2021, pp. 6638588 DOI: 10.1155/2021/6638588
  • [16] Paul Nevai “Géza Freud, orthogonal polynomials and Christoffel functions. A case study” In Journal of Approximation Theory 48.1, 1986, pp. 3–167 DOI: https://doi.org/10.1016/0021-9045(86)90016-X
  • [17] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019, pp. 8024–8035 URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [18] Gilles Puy, Pierre Vandergheynst and Yves Wiaux “On Variable Density Compressive Sampling” In IEEE Signal Processing Letters 18.10, 2011, pp. 595–598 DOI: 10.1109/LSP.2011.2163712
  • [19] Holger Rauhut and Rachel Ward “Sparse Legendre Expansions via $\ell_{1}$-Minimization” In Journal of Approximation Theory 164.5, 2012, pp. 517–533 DOI: 10.1016/j.jat.2012.01.008
  • [20] Roman Vershynin “High-Dimensional Probability: An Introduction with Applications in Data Science” Cambridge University Press, 2018
  • [21] Yuanbo Xiangli et al. “Real or Not Real, that is the Question” In International Conference on Learning Representations, 2020

This supplementary material contains acknowledgements, a generalization of Theorem 2.1, the proof of said generalization, properties of the piecewise linear expansion, and additional image recoveries both in-range and out-of-range.

Appendix A1 Acknowledgements

The authors acknowledge Ben Adcock for providing feedback on a preliminary version of this paper. S. Brugiapaglia acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2020-06766 and the Fonds de Recherche du Québec Nature et Technologies (FRQNT) through grant 313276. Y. Plan is partially supported by an NSERC Discovery Grant (GR009284), an NSERC Discovery Accelerator Supplement (GR007657), and a Tier II Canada Research Chair in Data Science (GR009243). O. Yilmaz was supported by an NSERC Discovery Grant (22R82411). O. Yilmaz also acknowledges support by the Pacific Institute for the Mathematical Sciences (PIMS).

Appendix A2 Additional Notation

In this work we make use of absolute constants, which we always label $C$. Any constant labelled $C$ is implicitly understood to be absolute and may have a value that differs from one appearance to the next. We write $a\lesssim b$ to mean $a\leq Cb$. We denote by $\|\cdot\|$ the operator norm and by $\|\cdot\|_{2}$ the Euclidean norm. We use capital letters for matrices and boldface lowercase letters for vectors. For a matrix $A\in\mathbb{C}^{m\times n}$, we denote by $\boldsymbol{a}_{i}^{*}\in\mathbb{C}^{1\times n}$ (lowercase of the letter symbolizing the matrix) the $i$th row vector of $A$, meaning that $A=\sum_{i=1}^{m}\boldsymbol{e}_{i}\boldsymbol{a}_{i}^{*}$. For a vector $\boldsymbol{u}\in\mathbb{C}^{n}$, we denote by $u_{i}$ its $i$th entry. We let $\{\boldsymbol{e}_{i}\}_{i\in[n]}$ be the canonical basis of $\mathbb{R}^{n}$.

Appendix A3 Generalized Main Result

We present a generalization of Theorem 2.1 which provides recovery guarantees for arbitrary sampling probabilities. To quantify the quality of the interaction between the generative model GG, the unitary matrix FF, and the sampling probability vector 𝒑\boldsymbol{p}, we introduce the following quantity.

Definition A3.1.

Let $\mathcal{T}\subseteq\mathbb{R}^{n}$ be a cone. Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix and $\boldsymbol{p}\in\Delta^{n-1}$. Let $\boldsymbol{\alpha}\in\mathbb{R}^{n}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Then define the quantity

\mu_{\mathcal{T}}(F,\boldsymbol{p}):=\max_{j\in[n]}\frac{\alpha_{j}}{\sqrt{p_{j}}}.

We can now state the following.

Theorem A3.1 (Generalized Main Result).

Fix the $(k,d,n)$-generative network $G$, the cone $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)\subseteq\mathbb{R}^{n}$, the unitary matrix $F\in\mathbb{C}^{n\times n}$, the probability vector $\boldsymbol{p}\in\Delta^{n-1}$, and the corresponding random sampling matrix $S\in\mathbb{R}^{m\times n}$. Let $D\in\mathbb{R}^{n\times n}$ be the diagonal matrix with entries $D_{i,i}=\frac{1}{\sqrt{p_{i}}}$. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to $\mathcal{T}$. Let $\varepsilon>0$.

Suppose that

m\geq C\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p})\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right).

Furthermore, if we pick the sampling probability vector

\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\lVert\boldsymbol{\alpha}\rVert_{2}^{2}}\right)_{j\in[n]},

we only require that

m\geq C\|\boldsymbol{\alpha}\|_{2}^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right).

Then with probability at least 1ε1-\varepsilon over the realization of SS, Statement 2.1 holds.

This theorem is strictly more general than Theorem 2.1. It is therefore sufficient to prove the generalized version, which we do in the next section.

Remark A3.1.

The result [4, Theorem 2.1] is a corollary of Theorem A3.1; it follows from taking $\boldsymbol{p}$ to be the uniform probability vector. ∎

Appendix A4 Proof of Theorem A3.1

Let us first introduce the so-called Restricted Isometry Property (RIP) [6, 9, 20].

Definition A4.1 (Restricted Isometry Property).

Let $\mathcal{T}\subseteq\mathbb{R}^{n}$ be a cone and $A\in\mathbb{C}^{m\times n}$ a matrix. We say that $A$ satisfies the Restricted Isometry Property (RIP) on $\mathcal{T}$ when

\sup_{\boldsymbol{u}\in\mathcal{T}\cap\mathbb{S}^{n-1}}\lvert\lVert A\boldsymbol{u}\rVert_{2}-1\rvert\leq\frac{1}{3}.

Note that the constant 1/31/3 is a specific choice made in order to simplify the presentation of this proof. It could be replaced by any generic absolute constant in (0,1)(0,1).

The following lemma says that if, conditional on $S$, the matrix $\frac{1}{\sqrt{m}}SDF$ has the RIP on $\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$, then we have signal recovery.

Lemma \thelemma (RIP of a Subsampled and Preconditioned Matrix Yields Recovery).

Let $\mathcal{V}\subseteq\mathbb{R}^{n}$ be a cone, $F\in\mathbb{C}^{n\times n}$ a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a matrix with all rows in the canonical basis of $\mathbb{R}^{n}$. Let $D\in\mathbb{R}^{n\times n}$ be a diagonal matrix. Let $\tilde{D}:=SDS^{+}\in\mathbb{R}^{m\times m}$.

If 1mSDF\frac{1}{\sqrt{m}}SDF has the RIP on the cone 𝒯:=𝒱𝒱\mathcal{T}:=\mathcal{V}-\mathcal{V}, then Statement 2.1 holds.

See the proof in Appendix A5.

We now proceed to prove a slightly stronger statement than what is required by the lemma above (RIP of a Subsampled and Preconditioned Matrix Yields Recovery): that the RIP holds on the piecewise linear expansion $\Delta(\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G))\supseteq\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$.

To control the complexity of GG, we count the number of affine pieces that it comprises. We do this with the result [4, Lemma A.6], which we re-write below for convenience.

Lemma \thelemma (Containing the Range of a ReLU Network in a Union of Subspaces).

Let GG be a (k,d,n)(k,d,n)-generative network with layer widths k=k0k1,,kd1kdk=k_{0}\leq k_{1},\ldots,k_{d-1}\leq k_{d} where kd=nk_{d}=n and k¯:=(=1d1k)1/(d1)\bar{k}:=\left(\prod_{\ell=1}^{d-1}k_{\ell}\right)^{1/(d-1)}. Then (G)\operatorname{\mathcal{R}}(G) is a union of no more than NN at-most kk-dimensional polyhedral cones where

logN\displaystyle\log{N} k(d1)log(2ek¯k)kdlog(nk).\displaystyle\leq k(d-1)\log{\left(\frac{2e\bar{k}}{k}\right)}\lesssim kd\log{\left(\frac{n}{k}\right)}.

From this result, we see that $\mathcal{T}:=\operatorname{\mathcal{R}}(G)-\operatorname{\mathcal{R}}(G)$ is contained in a union of no more than $N^{2}$ convex cones, each of dimension no more than $2k$. Then from Remark A6.1 (for the proof, see [4, Remark A.2]), the cone $\Delta(\mathcal{T})$ is a union of no more than $N^{2}$ subspaces, each of dimension at most $2k$ (the factor of two will be absorbed into the absolute constant of the statement). Fix $\mathcal{U}\subseteq\Delta(\mathcal{T})$ to be any one of these subspaces. Then the following lemma implies that the matrix $\frac{1}{\sqrt{m}}SDF$ has the RIP on $\mathcal{U}$ with high probability.

Lemma \thelemma (Deviation of Subsampled Preconditioned Unitary Matrix on a Subspace).

Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a random sampling matrix associated with the probability vector $\boldsymbol{p}\in\Delta^{n-1}$. Let $D\in\mathbb{R}^{n\times n}$ be a diagonal pre-conditioning matrix with entries $D_{i,i}=\frac{1}{\sqrt{p_{i}}}$. Let $t>0$. Let $\mathcal{U}\subseteq\mathbb{R}^{n}$ be a subspace of dimension $k$. Then

sup𝒙𝒰𝕊n1|1mSDF𝒙21|μ𝒰(F,𝒑)mlogk+μ𝒰(F,𝒑)mt\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu_{\mathcal{U}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{\log k}+\frac{\mu_{\mathcal{U}}(F,\boldsymbol{p})}{\sqrt{m}}t (1)

with probability at least 12exp(t2)1-2\exp(-t^{2}).

See the proof in Appendix A5.

Since $\mathcal{U}\subseteq\Delta(\mathcal{T})$ we have that $\mu_{\mathcal{U}}(F,\boldsymbol{p})\leq\mu_{\mathcal{T}}(F,\boldsymbol{p})$. Using this fact to upper-bound the r.h.s. of Equation 1, we find an identical concentration inequality that applies to each of the subspaces constituting $\Delta(\mathcal{T})$. By using the lemma above (Containing the Range of a ReLU Network in a Union of Subspaces) to bound the number of subspaces, we control the deviation of $\frac{1}{\sqrt{m}}SDF$ uniformly over all the subspaces constituting $\Delta(\mathcal{T})$ with a union bound. We find that, with probability at least $1-2\exp(-t^{2})$,

sup𝒙Δ(𝒯)𝕊n1|1mSDF𝒙21|\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\left\|SDF\boldsymbol{x}\right\|_{2}-1\right|
μ𝒯(F,𝒑)mlogk+μ𝒯(F,𝒑)mkdlog(nk)+μ𝒯(F,𝒑)mt.\lesssim\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{\log k}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{kd\log\left(\frac{n}{k}\right)}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}t.

For the method by which we applied the union bound, see [4, Lemma A.2]. In the r.h.s. of the equation above, the second term dominates the first, so the expression simplifies to

sup𝒙Δ(𝒯)𝕊n1|1mSDF𝒙21|μ𝒯(F,𝒑)mkdlog(nk)+μ𝒯(F,𝒑)mt.\sup_{\boldsymbol{x}\in\Delta(\mathcal{T})\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}\sqrt{kd\log\left(\frac{n}{k}\right)}+\frac{\mu_{\mathcal{T}}(F,\boldsymbol{p})}{\sqrt{m}}t. (2)

By fixing t=log(2ε)t=\sqrt{\log\left(\frac{2}{\varepsilon}\right)} we find that the RIP holds with probability at least 1ε1-\varepsilon on 𝒯\mathcal{T} when

mCμ𝒯(F,𝒑)2(kdlog(nk)+log2ε).m\geq C\mu_{\mathcal{T}}(F,\boldsymbol{p})^{2}\left(kd\log\left(\frac{n}{k}\right)+\log\frac{2}{\varepsilon}\right). (3)

Then we find that the first part of Theorem A3.1 follows from the recovery lemma above (RIP of a Subsampled and Preconditioned Matrix Yields Recovery).

The second sufficient condition on mm follows from picking 𝒑\boldsymbol{p} so as to minimize the factor μ𝒯(F,𝒑)2\mu_{\mathcal{T}}(F,\boldsymbol{p})^{2} in Equation 3.

Lemma \thelemma (Adapting the Sampling Scheme to the Model).

Let $F\in\mathbb{C}^{n\times n}$ be a unitary matrix, and $S\in\mathbb{R}^{m\times n}$ a random sampling matrix associated with the probability vector $\boldsymbol{p}\in\Delta^{n-1}$. Let $\boldsymbol{\alpha}$ be the vector of local coherences of $F$ with respect to a cone $\mathcal{T}\subseteq\mathbb{R}^{n}$. Then

\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\lVert\boldsymbol{\alpha}\rVert_{2}^{2}}\right)_{j\in[n]}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}(\mu_{\mathcal{T}}(F,\boldsymbol{p})),

and it achieves the value

\mu_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\lVert\boldsymbol{\alpha}\rVert_{2}.

See the proof in Appendix A5.

Applying this lemma to Equation 3 concludes the proof of Theorem A3.1.

Appendix A5 Proof of the Lemmas

Proof of the lemma “RIP of a Subsampled and Preconditioned Matrix Yields Recovery” (Appendix A4).

Let Fn×nF\in\mathbb{C}^{n\times n} be a unitary matrix, and Sm×nS\in\mathbb{R}^{m\times n} a random sampling matrix associated with the probability vector 𝒑Δn1\boldsymbol{p}\in\Delta^{n-1}. Let Dn×nD\in\mathbb{R}^{n\times n} be a diagonal matrix with entries Di,i=1piD_{i,i}=\frac{1}{\sqrt{p_{i}}}. Let D~=SDS+\tilde{D}=SDS^{+}.

We let 𝒃~:=D~𝒃\tilde{\boldsymbol{b}}:=\tilde{D}\boldsymbol{b} and 𝜼~:=D~𝜼\tilde{\boldsymbol{\eta}}:=\tilde{D}\boldsymbol{\eta}. By left-multiplying the equation 𝒃=1mSF𝒙0+𝜼\boldsymbol{b}=\frac{1}{\sqrt{m}}SF\boldsymbol{x}_{0}+\boldsymbol{\eta} by D~\tilde{D} we get 𝒃~=1mSDF𝒙0+𝜼~\tilde{\boldsymbol{b}}=\frac{1}{\sqrt{m}}SDF\boldsymbol{x}_{0}+\tilde{\boldsymbol{\eta}}. Notice that the linear operator 1mSDF\frac{1}{\sqrt{m}}SDF has the RIP by assumption.

By the near-optimality of $\hat{\boldsymbol{x}}$, the observation that $\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\in\mathcal{T}$, and the triangle inequality,

1mSDF𝒙^𝒃~2\displaystyle\left\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{\boldsymbol{b}}\right\|_{2} min𝒙𝒯1mSDF𝒙𝒃~2+ε^\displaystyle\leq\min_{\boldsymbol{x}\in\mathcal{T}}\left\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}-\tilde{\boldsymbol{b}}\right\|_{2}+\hat{\varepsilon}
1mSDFΠ𝒯𝒙0𝒃~2+ε^\displaystyle\leq\left\|\frac{1}{\sqrt{m}}SDF\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}-\tilde{\boldsymbol{b}}\right\|_{2}+\hat{\varepsilon}
=1mSDF𝒙+𝜼~2+ε^\displaystyle=\left\|\frac{1}{\sqrt{m}}SDF\boldsymbol{x}^{\perp}+\tilde{\boldsymbol{\eta}}\right\|_{2}+\hat{\varepsilon}
1mSDF𝒙2+𝜼~2+ε^.\displaystyle\leq\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+\|\tilde{\boldsymbol{\eta}}\|_{2}+\hat{\varepsilon}.

Since 𝒙^,Π𝒯𝒙0𝒯\hat{\boldsymbol{x}},\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\in\mathcal{T}, with the RIP property we find that

1mSDF𝒙^𝒃~2\displaystyle\left\|\frac{1}{\sqrt{m}}SDF\hat{\boldsymbol{x}}-\tilde{\boldsymbol{b}}\right\|_{2} =1mSDF(𝒙^Π𝒯𝒙0)1mSDF(𝒙0Π𝒯𝒙0)𝜼~2\displaystyle=\left\|\frac{1}{\sqrt{m}}SDF\left(\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)-\frac{1}{\sqrt{m}}SDF\left(\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)-\tilde{\boldsymbol{\eta}}\right\|_{2}
1mSDF(𝒙^Π𝒯𝒙0)21mSDF𝒙2𝜼~2\displaystyle\geq\frac{1}{\sqrt{m}}\|SDF\left(\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right)\|_{2}-\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}-\|\tilde{\boldsymbol{\eta}}\|_{2}
(113)𝒙^Π𝒯𝒙021mSDF𝒙2𝜼~2.\displaystyle\geq\left(1-\frac{1}{3}\right)\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}-\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}-\|\tilde{\boldsymbol{\eta}}\|_{2}.

Assembling the two inequalities gives

𝒙^Π𝒯𝒙021113[21mSDF𝒙2+2𝜼~2+ε^].\displaystyle\left\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\right\|_{2}\leq\frac{1}{1-\frac{1}{3}}\left[2\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+2\|\tilde{\boldsymbol{\eta}}\|_{2}+\hat{\varepsilon}\right].

Finally, we apply triangle inequality to get

𝒙^𝒙02\displaystyle\|\hat{\boldsymbol{x}}-\boldsymbol{x}_{0}\|_{2} 𝒙0Π𝒯𝒙02+𝒙^Π𝒯𝒙02\displaystyle\leq\|\boldsymbol{x}_{0}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}+\|\hat{\boldsymbol{x}}-\operatorname{\Pi}_{\mathcal{T}}\boldsymbol{x}_{0}\|_{2}
𝒙2+3mSDF𝒙2+3𝜼~2+32ε^.\displaystyle\leq\|\boldsymbol{x}^{\perp}\|_{2}+\frac{3}{\sqrt{m}}\|SDF\boldsymbol{x}^{\perp}\|_{2}+3\|\tilde{\boldsymbol{\eta}}\|_{2}+\frac{3}{2}\hat{\varepsilon}.

Proof of the lemma “Deviation of Subsampled Preconditioned Unitary Matrix on a Subspace” (Appendix A4).

In what follows, we will use that for all $\boldsymbol{x}\in\mathcal{U}$, $SDF\boldsymbol{x}=SDFP_{\mathcal{U}}^{*}P_{\mathcal{U}}\boldsymbol{x}$, where $P_{\mathcal{U}}\in\mathbb{R}^{k\times n}$ is the matrix with rows chosen to be any fixed orthonormal basis of $\mathcal{U}$. Indeed, notice that $P_{\mathcal{U}}^{*}P_{\mathcal{U}}=\Pi_{\mathcal{U}}\in\mathbb{R}^{n\times n}$, the orthogonal projection onto $\mathcal{U}$. Now consider

(\star):=\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|=\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDFP_{\mathcal{U}}^{*}P_{\mathcal{U}}\boldsymbol{x}\right\|_{2}^{2}-1\right|
=\sup_{\boldsymbol{u}\in\mathbb{S}^{k-1}}\left|\frac{1}{m}\left\|SDFP_{\mathcal{U}}^{*}\boldsymbol{u}\right\|_{2}^{2}-1\right|
=\frac{1}{m}\sup_{\boldsymbol{u}\in\mathbb{S}^{k-1}}\left|\boldsymbol{u}^{*}\left[(SDFP_{\mathcal{U}}^{*})^{*}(SDFP_{\mathcal{U}}^{*})-mI\right]\boldsymbol{u}\right|.

The second equality above follows from the change of variables $P_{\mathcal{U}}\boldsymbol{x}\to\boldsymbol{u}\in\mathbb{R}^{k}$. Since the matrix within the square brackets is Hermitian, the last expression above corresponds to an operator norm.

()\displaystyle(\star) =1mP𝒰FDSSDFP𝒰mI\displaystyle=\frac{1}{m}\left\|P_{\mathcal{U}}F^{*}DS^{*}SDFP_{\mathcal{U}}^{*}-mI\right\| (4)
=1mi=1m[P𝒰FD𝒔i𝒔iDFP𝒰I]\displaystyle=\frac{1}{m}\left\|\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]\right\| (5)

This is a sum of independent random matrices because the sampling matrix $S$ is random and has independent rows. We now state the central ingredient of this proof: the Matrix Bernstein concentration bound [20, Theorem 5.4.1]. We will use it to bound Equation 5.

Lemma \thelemma (Matrix Bernstein).

Let $X_{1},\ldots,X_{N}$ be independent, mean zero, $n\times n$ symmetric random matrices, such that $\|X_{i}\|\leq K$ almost surely for all $i$. Then, for every $t\geq 0$, we have

{i=1NXit}2nexp(t2/2σ2+Kt/3),\mathbb{P}\left\{\left\lVert\sum_{i=1}^{N}X_{i}\right\rVert\geq t\right\}\leq 2n\exp\left(-\frac{t^{2}/2}{\sigma^{2}+Kt/3}\right),

where σ2=i=1N𝔼Xi2\sigma^{2}=\left\|\sum_{i=1}^{N}\mathbb{E}X_{i}^{2}\right\|.

To compute $\sigma^{2}$ and $K$, we notice that we can write

\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]=\sum_{i\in[m]}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I]

for the random vectors 𝒗i:=P𝒰FD𝒔i\boldsymbol{v}_{i}:=P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}. These vectors have two key properties. First, they are isotropic; this is the property that

𝔼[𝒗i𝒗i]\displaystyle\mathbb{E}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}] =𝔼[P𝒰FD𝒔i𝒔iDFP𝒰]\displaystyle=\mathbb{E}[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}]
=P𝒰FD𝔼[𝒔i𝒔i]DFP𝒰=I.\displaystyle=P_{\mathcal{U}}F^{*}D\mathbb{E}[\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}]DFP_{\mathcal{U}}^{*}=I.

The isotropic property gives us immediately that, as required, the matrices {𝒗i𝒗iI}i[m]\{\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I\}_{i\in[m]} are mean-zero.

The second property of the vectors {𝒗i}i[m]\{\boldsymbol{v}_{i}\}_{i\in[m]} is that they have bounded magnitude almost surely.

𝒗i2\displaystyle\|\boldsymbol{v}_{i}\|_{2} =P𝒰(FD𝒔i)2\displaystyle=\|P_{\mathcal{U}}(F^{*}D\boldsymbol{s}_{i})\|_{2}
=1piP𝒰𝒇i2\displaystyle=\frac{1}{\sqrt{p_{i}}}\|P_{\mathcal{U}}\boldsymbol{f}_{i}\|_{2}
=1pisup𝒙𝒰𝕊n1𝒙,𝒇i\displaystyle=\frac{1}{\sqrt{p_{i}}}\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left\langle\boldsymbol{x},\boldsymbol{f}_{i}\right\rangle
μ𝒰(F,𝒑).\displaystyle\leq\mu_{\mathcal{U}}(F,\boldsymbol{p}).

Let $\mu:=\mu_{\mathcal{U}}(F,\boldsymbol{p})$ for conciseness. We proceed to compute a value for $K$. By the triangle inequality and the fact that the operator norm of the rank-one matrix $\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}$ equals $\|\boldsymbol{v}_{i}\|_{2}^{2}$, we see that

𝒗i𝒗iI𝒗i22+12μ2.\|\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I\|\leq\|\boldsymbol{v}_{i}\|_{2}^{2}+1\leq 2\mu^{2}.

The last inequality holds because of the lower bound $\mu^{2}\geq 1$, which we now justify. Fix any one-dimensional subspace $\mathcal{U}_{0}\subseteq\mathcal{U}$ spanned by a unit vector $\hat{\boldsymbol{u}}$. The lemma “Adapting the Sampling Scheme to the Model” gives $\mu_{\mathcal{U}_{0}}(F,\boldsymbol{p})\geq\|\boldsymbol{\alpha}\|_{2}$, where $\boldsymbol{\alpha}$ denotes the local coherences with respect to $\mathcal{U}_{0}$, and for this one-dimensional subspace $\|\boldsymbol{\alpha}\|_{2}=\|F\hat{\boldsymbol{u}}\|_{2}=1$. The desired lower bound then follows by monotonicity of $\mu$ over set containment, since $\mu=\mu_{\mathcal{U}}(F,\boldsymbol{p})\geq\mu_{\mathcal{U}_{0}}(F,\boldsymbol{p})$.

We now compute $\sigma^{2}$, following [3, Lemma 12.21]:

\sigma^{2}=\left\|\sum_{i=1}^{m}\mathbb{E}\left[(\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}-I)^{2}\right]\right\|=\left\|\sum_{i=1}^{m}\left(\mathbb{E}\left[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\,\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}\right]-I\right)\right\|\leq\sum_{i=1}^{m}\left\|\mathbb{E}\left[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\,\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}\right]-I\right\|\leq\mu^{2}m.

The second equality uses $(\boldsymbol{v}\boldsymbol{v}^{*}-I)^{2}=\lVert\boldsymbol{v}\rVert_{2}^{2}\boldsymbol{v}\boldsymbol{v}^{*}-2\boldsymbol{v}\boldsymbol{v}^{*}+I$ together with the isotropy $\mathbb{E}[\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}]=I$; the first inequality is the triangle inequality; and the last inequality holds because $0\preceq\mathbb{E}[\lVert\boldsymbol{v}_{i}\rVert_{2}^{2}\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}]\preceq\mu^{2}I$, so each summand has operator norm at most $\max(1,\mu^{2}-1)\leq\mu^{2}$.

Then applying the Matrix Bernstein inequality to the $k\times k$ matrices above (so the dimension prefactor is $k$) yields

{i=1m[P𝒰FD𝒔i𝒔iDFP𝒰I]t}2kexp(t2/2μ2m+2μ2t3).\mathbb{P}\left\{\left\|\sum_{i=1}^{m}\left[P_{\mathcal{U}}F^{*}D\boldsymbol{s}_{i}\boldsymbol{s}_{i}^{*}DFP_{\mathcal{U}}^{*}-I\right]\right\|\geq t\right\}\leq 2k\exp\left(-\frac{t^{2}/2}{\mu^{2}m+2\mu^{2}\frac{t}{3}}\right).

Substituting with Equation 5, we get

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|tm}2kexp(t2/2μ2m+2μ2t3).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|\geq\frac{t}{m}\right\}\leq 2k\exp\left(-\frac{t^{2}/2}{\mu^{2}m+2\mu^{2}\frac{t}{3}}\right).

We would like to state our result in terms of the $\ell_{2}$ norm without the square. For this purpose we make use of the “square-root trick” that can be found in [20, Theorem 3.1.1]. We re-write the above as

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|tm}2kexp(Cmin(t2μ2m,tμ2)).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\left\|SDF\boldsymbol{x}\right\|_{2}^{2}-1\right|\geq\frac{t}{m}\right\}\leq 2k\exp\left(-C\min\left(\frac{t^{2}}{\mu^{2}m},\frac{t}{\mu^{2}}\right)\right).

We make the substitution tmmax(δ,δ2)t\to m\max(\delta,\delta^{2}), which yields

{sup𝒙𝒰𝕊n1|1mSDF𝒙221|max(δ,δ2)}2kexp(Cmδ2μ2).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{m}\|SDF\boldsymbol{x}\|_{2}^{2}-1\right|\geq\max(\delta,\delta^{2})\right\}\leq 2k\exp\left(-C\frac{m\delta^{2}}{\mu^{2}}\right).

Since $|a-1|\geq\delta$ implies $|a^{2}-1|\geq\max(\delta,\delta^{2})$ for all $a,\delta>0$, we infer that

{sup𝒙𝒰𝕊n1|1mSDF𝒙21|δ}2kexp(Cmδ2μ2).\mathbb{P}\left\{\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\geq\delta\right\}\leq 2k\exp\left(-C\frac{m\delta^{2}}{\mu^{2}}\right).

Finally, with another substitution $\left(\frac{Cm\delta^{2}}{\mu^{2}}-\log k\right)\to t^{2}$, we conclude that

sup𝒙𝒰𝕊n1|1mSDF𝒙21|μmlogk+μmt\sup_{\boldsymbol{x}\in\mathcal{U}\cap\mathbb{S}^{n-1}}\left|\frac{1}{\sqrt{m}}\|SDF\boldsymbol{x}\|_{2}-1\right|\lesssim\frac{\mu}{\sqrt{m}}\sqrt{\log k}+\frac{\mu}{\sqrt{m}}t

with probability at least 12exp(t2)1-2\exp(-t^{2}). ∎

Proof of the lemma “Adapting the Sampling Scheme to the Model” (Appendix A4).

It suffices to show that 𝒑:=(αj2𝜶22)j[n]\boldsymbol{p}^{*}:=\left(\frac{\alpha_{j}^{2}}{\|\boldsymbol{\alpha}\|_{2}^{2}}\right)_{j\in[n]} satisfies

𝒑argmin𝒑Δn1μ𝒯2(F,𝒑)=argmin𝒑Δn1maxj[n]αj2pj.\boldsymbol{p}^{*}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p})=\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}.

The vector 𝒑\boldsymbol{p}^{*} achieves a value of

μ𝒯2(F,𝒑)=𝜶22\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\|\boldsymbol{\alpha}\|_{2}^{2}

which is the minimum. Indeed, for any fixed vector 𝒑Δn1\boldsymbol{p}\in\Delta^{n-1},

maxj[n]αj2pjj=1najαj2pj𝒂Δn1.\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}\geq\sum_{j=1}^{n}a_{j}\frac{\alpha_{j}^{2}}{p_{j}}\quad\forall\boldsymbol{a}\in\Delta^{n-1}.

The inequality above holds because the r.h.s. is a convex combination of the terms {αj2pj}j[n]\left\{\frac{\alpha_{j}^{2}}{p_{j}}\right\}_{j\in[n]}, and is therefore upper-bounded by the maximum element of the combination. By letting 𝒂=𝒑\boldsymbol{a}=\boldsymbol{p}, we get

maxj[n]αj2pj𝜶22.\max_{j\in[n]}\frac{\alpha_{j}^{2}}{p_{j}}\geq\|\boldsymbol{\alpha}\|_{2}^{2}.

Therefore, 𝒑argmin𝒑Δn1μ𝒯2(F,𝒑)\boldsymbol{p}^{*}\in\operatorname*{\arg\min}_{\boldsymbol{p}\in\Delta^{n-1}}\mu^{2}_{\mathcal{T}}(F,\boldsymbol{p}). ∎
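As a quick numerical sanity check of this lemma (our own illustration, not part of the paper): for a generic coherence vector, the adapted probabilities $\boldsymbol{p}^{*}$ attain $\mu_{\mathcal{T}}(F,\boldsymbol{p}^{*})=\lVert\boldsymbol{\alpha}\rVert_{2}$, and randomly chosen probability vectors never do better.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = rng.random(100)                              # a generic coherence vector (illustration only)
p_star = alpha**2 / np.sum(alpha**2)                 # adapted sampling probabilities p*
mu_star = np.max(alpha / np.sqrt(p_star))            # mu attained by p*
assert np.isclose(mu_star, np.linalg.norm(alpha))    # equals ||alpha||_2

for _ in range(1000):                                # random competitors are never better
    p = rng.random(100); p /= p.sum()
    assert np.max(alpha / np.sqrt(p)) >= mu_star - 1e-9
```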

Appendix A6 Properties of the Piecewise Linear Expansion

The following is a subset of the properties listed in [4, Remark A.2], to which we refer the reader for the proof.

Remark A6.1 (Properties of the Piecewise Linear Expansion).

Below we list several properties of $\Delta$. Let $\mathcal{C}=\bigcup_{i=1}^{N}\mathcal{C}_{i}$ be the union of $N\in\mathbb{N}$ convex cones $\mathcal{C}_{i}$.

  1. The set $\Delta(\mathcal{C})$ is uniquely defined. In particular, it is independent of the (finite) decomposition of $\mathcal{C}$ into convex cones.

  2. If $\max_{i\in[N]}\dim\mathcal{C}_{i}\leq k$, then $\Delta(\mathcal{C})$ is a union of no more than $N$ at-most $k$-dimensional linear subspaces.

  3. The set $\Delta(\mathcal{C})$ satisfies $\mathcal{C}\subseteq\Delta(\mathcal{C})\subseteq\mathcal{C}-\mathcal{C}$.

  4. There are choices of $\mathcal{C}$ for which $\mathcal{C}\subsetneq\Delta(\mathcal{C})$ (for instance, if $\mathcal{C}$ is the single ray $\{t\boldsymbol{e}_{1}:t\geq 0\}$, then $\Delta(\mathcal{C})=\operatorname{\mathrm{span}}(\boldsymbol{e}_{1})\supsetneq\mathcal{C}$).

Appendix A7 Experimental Specifications

CelebA with RealnessGAN

CelebFaces Attributes Dataset (CelebA) is a dataset with over 200,000 celebrity face images [12]. We train a model on most images of the CelebA dataset, leaving out 2000 images to form a validation set. We crop the colour images to 256 by 256, giving 256 × 256 × 3 = 196608 pixels per image. On this dataset, we train a RealnessGAN with the same training setup as described in [21], substituting the last Tanh layer with HardTanh, a piecewise-linear version of Tanh, to fit our theoretical framework. See [21] for more training and architecture details.

Appendix A8 Additional Image Recoveries

Figure 3: In-range signal recovery of images using uniform sampling and model-adapted sampling. The sampling rate on the bottom is computed as the ratio between the number of measurements and the number of pixels.
Figure 4: Out-of-range signal recovery of images using uniform sampling and model-adapted sampling.