
Global Universality of Singular Values in Products
of Many Large Random Matrices

Boris Hanin (Princeton University, bhanin@princeton.edu), Tianze Jiang (Princeton University, tzjiang@princeton.edu)
Abstract

We study the singular values (and Lyapunov exponents) for products of $N$ independent $n\times n$ random matrices with i.i.d. entries. Such matrix products have been extensively analyzed using free probability, which applies when $n\rightarrow\infty$ at fixed $N$, and the multiplicative ergodic theorem, which holds when $N\rightarrow\infty$ while $n$ remains fixed. The regime when $N,n\rightarrow\infty$ simultaneously is considerably less well understood, and our work is the first to prove universality for the global distribution of singular values in this setting. Our main result gives non-asymptotic upper bounds on the Kolmogorov-Smirnov distance between the empirical measure of (normalized) squared singular values and the uniform measure on $[0,1]$ that go to zero when $n,N\rightarrow\infty$ at any relative rate. We assume only that the distribution of matrix entries has zero mean, unit variance, bounded fourth moment, and a bounded density. Our proofs rely on two key ingredients. The first is a novel small-ball estimate on singular vectors of random matrices, from which we deduce a non-asymptotic variant of the multiplicative ergodic theorem that holds for growing matrix size $n$. The second is a martingale concentration argument, which shows that while Lyapunov exponents at large $N$ are not universal at fixed matrix size, their empirical distribution becomes universal as soon as the matrix size grows with $N$.

1 Introduction

This article concerns the distribution of singular values for products of independent random matrices

X_{N,n}\triangleq W_{N}\cdots W_{1},\qquad W_{i}\in\mathbb{R}^{n\times n}, (1)

with entries of $\sqrt{n}W_{i}$ drawn i.i.d. from a fixed distribution $\mu$. We assume $\mu$ satisfies the following

Condition 1.

The probability measure $\mu$ has zero mean, unit variance, a finite fourth moment $M_{4}<\infty$, and a density with respect to the Lebesgue measure bounded above by $K_{\infty}<\infty$.

Our main result, Theorem 1, is a quantitative universality result for

\rho_{N,n}\triangleq\frac{1}{n}\sum_{i=1}^{n}\delta_{s_{i}(X_{N,n})^{2/N}},

the empirical distribution of the rescaled singular values $s_{1}(X_{N,n})\geq\cdots\geq s_{n}(X_{N,n})$ of $X_{N,n}$.

Theorem 1.

Under Condition 1, there exist constants $c_{1},c_{2},c_{3},c_{4}>0$ depending on $K_{\infty},M_{4}$ with the following property. For all $\varepsilon\in(0,1/2)$, if $N>c_{1}\varepsilon^{-2}$ and $n>c_{2}\varepsilon^{-2}\cdot\log(1/\varepsilon)$, then

\mathbb{P}\left(d_{KS}(\rho_{N,n},\,\mathrm{U}_{[0,1]})>\varepsilon\right)\leq c_{3}\exp\{-c_{4}nN\varepsilon^{2}/\log n\}, (2)

where $d_{KS}$ is the Kolmogorov-Smirnov distance and $\mathrm{U}_{[0,1]}$ is the uniform distribution on $[0,1]$.
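As a purely numerical illustration of Theorem 1 (not part of the proof), the following Python sketch estimates $d_{KS}(\rho_{N,n},\mathrm{U}_{[0,1]})$ for a single sampled product. The sizes and the entry law (uniform, which satisfies Condition 1) are illustrative choices, and the singular values of the product are computed with the standard QR reorthogonalization trick to avoid numerical under/overflow.

import numpy as np
rng = np.random.default_rng(0)
n, N = 100, 200
# Entries of sqrt(n)*W_i are uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1,
# bounded fourth moment and bounded density, as required by Condition 1.
Q, log_r = np.eye(n), np.zeros(n)
for _ in range(N):
    W = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, n)) / np.sqrt(n)
    Q, R = np.linalg.qr(W @ Q)
    log_r += np.log(np.abs(np.diag(R)))  # partial sums track log of the partial products of singular values
lam = np.sort(log_r / N)[::-1]           # approximate Lyapunov exponents
samples = np.sort(np.exp(2 * lam))       # approximate s_i(X_{N,n})^{2/N}
i = np.arange(1, n + 1)
d_ks = max((i / n - samples).max(), (samples - (i - 1) / n).max())
print(d_ks)                              # small once n and N are both large, consistent with (2)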

Theorem 1 is the first universality result for $\rho_{N,n}$ which holds for a general class of distributions $\mu$ regardless of the relative size of $n,N$. It guarantees that as soon as $\mu$ matches the first two moments of a standard Gaussian (and has bounded fourth moment and $L_{\infty}$ norm), the empirical singular value distribution is close to the case when $\mu$ is a (real) standard Gaussian (see Theorem 1.2 in [19]). Our result also extends to complex distributions with independent real and imaginary parts (see Section 4). The nature of the universality underlying Theorem 1, however, is unusual in the following two senses:

  • Requirement of bounded density. Universality is a hallmark of random matrix theory in the regime where the matrix size $n$ tends to infinity. It is uncommon in such universality results to include the hypothesis that $\mu$ has a bounded density. But such an assumption is essential in our setting because we are interested in the setting of growing $N$. For example, the matrix product $X_{N,n}$ has rank one with positive probability if $\mu$ contains an atom and $N$ is exponential in $n^{2}$. In the ergodic limit of fixed $n$ and diverging $N$, moreover, the limiting empirical measure $\rho_{n,\infty}$ of singular values is known to be non-universal and depends on $\mu$, see e.g. [22, 2, 4]. This is in contrast to the free probability limit, where it is typical to consider polynomials of fixed degree (e.g. fixed $N$) evaluated at a collection of $n\times n$ random matrices with $n\rightarrow\infty$ (see e.g. [16, 31, 12]).

  • One global shape, many local shapes. For many classical random matrix ensembles, universality holds not just for the global distribution of eigenvalues or singular values but persists also at the microscopic scale, where consecutive eigenvalues or singular values remain order $1$ apart as $n\rightarrow\infty$. In our setting, however, even in the simplest case when $\mu$ is a standard complex Gaussian, the local distribution depends on the limiting value of $n/N$ [2, 25]. The relative size of $n,N$ therefore determines the local statistics but does not impact the global properties of $\rho_{N,n}$. It remains open both to determine at what scale the local distribution of singular values begins to depend on $N/n$ and whether the local limits, derived using methods from integrable systems, are universal (see [18] for some partial progress).

While the effect of simultaneously large $n,N$ on the statistics of $\rho_{N,n}$ is still far from understood, prior work showed that the global distribution of singular values $\rho_{N,n}$ converges to $\mathrm{U}_{[0,1]}$ if one either first takes $n\rightarrow\infty$ and then $N\rightarrow\infty$ or vice versa [20, 21]. These articles use very different tools: the work [21], which first takes $n\rightarrow\infty$, relies on free probability, while [20] uses the multiplicative ergodic theorem to analyze what happens when one first takes $N\rightarrow\infty$.

Neither free probability nor ergodic techniques are simple to make effective when both $n$ and $N$ are large but finite. To make progress in this direction, the article [19] used small ball probabilities to quantify, at finite $N$, the rate of convergence in the multiplicative ergodic theorem and obtain a sharper version of Theorem 1 in the special case when $\mu$ is the standard (real) Gaussian (see Section 4 for a discussion of the optimality of our bounds). We take a similar approach. The core difference is that the distribution of the individual $W_{i}$ matrices is no longer isotropic (invariant under left or right rotations). As we explain in Section 3, this means we must obtain new small ball probabilities for the inner product between a fixed $k$-frame in $\mathbb{R}^{n}$ and the projection onto the span of the top $k$ singular vectors of $X_{N,n}$.

Outline of Remainder of the Article.

The rest of this article is organized as follows. First, in Section 2, we give a more thorough review of the relation between our results and prior work. Then, in Section 3, we state the main results needed to prove Theorem 1. We make some further remarks on the results as well as future directions in Section 4. The remaining proofs are provided in Section 6, after a brief review of the auxiliary technical results needed in Section 5.

2 Related works

Products of random matrices are a vast subject. We provide here some representative references, focusing mainly on work in which the number of matrices, $N$, is large or growing.

The setting where the matrix size $n$ is fixed while the number of terms $N$ in the matrix product grows has attracted much interest, starting from the seminal work of Furstenberg [11] and later of Oseledec [32] on the multiplicative ergodic theorem. Particularly relevant to the present article are the works of Newman [29] and Isopi-Newman [20]. Since then, the study of Lyapunov exponents of random matrix products has found applications to the study of random Schrödinger operators [8], number theory and dynamics [23, 28], and beyond [42, 6].

Matrix products when $n\rightarrow\infty$ while $N$ is potentially large but fixed have also been extensively studied. For instance, classical results in free probability concern the spectrum of products of a fixed number of (freely) independent matrices [27, 40, 30]. In this vein, the articles [38, 21] both use tools from free probability to obtain the analog of Theorem 1 in the setting where first $n\rightarrow\infty$ and then $N\rightarrow\infty$. Prior work has also taken up a non-asymptotic analysis of eigenvalues [16, 12] for such matrix products as well as the local distribution of their singular values [26].

The setting when $n,N$ simultaneously grow is less well understood but has nonetheless attracted significant interest in recent years. For example, we point the reader to a beautiful set of articles that use techniques from integrable systems and integrable probability to study singular values for products of i.i.d. complex Ginibre matrices and related integrable matrix ensembles. These include the works [2, 1, 3, 7, 10, 9] which, at a physics level of rigor, were the first to analyze the asymptotic distribution of singular values for products of i.i.d. complex Ginibre matrices. Some of the results in the preceding articles were proved rigorously in [25]. We also point the interested reader to [5, 14] for another perspective on how to use techniques from integrable probability to study such matrix products. The study of the singular values of $X_{N,n}$ when $n,N$ are both large has also received attention due to its connection with the spectrum of input-output Jacobians in randomly initialized neural networks [33, 34, 35, 18, 17].

3 Main ideas and proof outline

As we will explain in this section, there are three key steps in the proof of Theorem 1. To present them, let us agree on some notation. We write $\mathsf{Fr}_{n,k}=\{X\in\mathbb{R}^{n\times k}:X^{T}X=\mathbb{I}_{k}\}$ for the space of $k$-frames in $\mathbb{R}^{n}$ (i.e. orthonormal systems of $k$ vectors). For any matrix $A\in\mathbb{R}^{a\times b}$ with $a,b\geq k$, we write $\|A\|_{(k)}\triangleq\prod_{i=1}^{k}s_{i}(A)$ for the product of its top $k$ singular values. For any $n\times k$ matrix $X$ we thus have

\left\|X\right\|_{(k)}=\prod_{i=1}^{k}s_{i}(X)=\sqrt{\mathsf{det}(X^{T}X)} (3)

by the Gram identity. Unless specified otherwise, all constants are finite, positive, and may depend on $M_{4},K_{\infty}$, the constants from Condition 1.

Step 1: From many singular values to the top singular value.

To study the singular values of $X_{N,n}$, it will be convenient to study their partial products:

\left\|X_{N,n}\right\|_{(k)}=\prod_{i=1}^{k}s_{i}(X_{N,n})=\sup_{U\in\mathsf{Fr}_{n,k}}\left\|X_{N,n}U\right\|_{(k)},\qquad k=1,\ldots,n, (4)

where the first equality is a definition and the second equality follows from standard linear algebra. The representation on the right of (4) recasts the product of the top $k$ singular values of $X_{N,n}$ as the top singular value for the action of $X_{N,n}$ on the space of $k$-frames in $\mathbb{R}^{n}$. This is useful since analyzing the top singular value, or equivalently the top Lyapunov exponent, is a natural and well-studied way to understand the long time behavior of a dynamical system. This is precisely the philosophy of most prior work in the regime where $N\rightarrow\infty$ (see e.g. [20, 11, 19, 24]).
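As a quick sanity check of (4) (a minimal sketch with illustrative sizes, not part of the argument), the supremum on the right is attained at the frame of top $k$ right singular vectors:

import numpy as np
rng = np.random.default_rng(1)
n, k = 8, 3
X = rng.normal(size=(n, n))
def vol_k(A):
    return np.prod(np.linalg.svd(A, compute_uv=False)[:k])  # ||A||_(k)
V = np.linalg.svd(X)[2].T                    # right singular vectors of X
print(vol_k(X), vol_k(X @ V[:, :k]))         # equal: the sup in (4) is attained at the top-k frame
U = np.linalg.qr(rng.normal(size=(n, k)))[0]
print(vol_k(X @ U) <= vol_k(X) + 1e-12)      # True for any frame U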

Step 2: Removing the supremum in (4).

One advantage of the representation on the right of (4) is that each term $\left\|X_{N,n}U\right\|_{(k)}$ inside the supremum can naturally be thought of as a simple sub-multiplicative functional of the state of a random dynamical system after $N$ steps. In this analogy, the frame $U$ determines the initial condition, and the evolution from time $t-1$ to time $t$ consists of multiplying by $W_{t}$.

The second and most important technical step in our proof of Theorem 1 is to show that once $N$ is large, we can approximately drop the supremum over frames in (4). That this is possible is a key conceptual insight that goes back to [11], which showed that for a wide range of entry distributions $\mu$, we have

\lim_{N\rightarrow\infty}\frac{1}{N}\log\left(\frac{\sup_{V\in\mathsf{Fr}_{n,k}}\left\|X_{N,n}V\right\|_{(k)}}{\left\|X_{N,n}U_{0}\right\|_{(k)}}\right)=0

for any fixed $U_{0}\in\mathsf{Fr}_{n,k}$. The previous displayed equation is a consequence of the multiplicative ergodic theorem, which guarantees that as $N\rightarrow\infty$ the supremum, the average, and the pointwise behavior of $\left\|X_{N,n}V\right\|_{(k)}$ are the same for almost every frame $V$. Since we seek to describe the distribution of singular values of $X_{N,n}$ when $N$ is finite, we will need a quantitative version of this result. This is the content of Proposition 1, which is more conveniently phrased in terms of the Lyapunov exponents of $X_{N,n}$:

\lambda_{i}=\lambda_{i}(X_{N,n})=\frac{1}{N}\log s_{i}(X_{N,n})=i\text{-th Lyapunov exponent},\qquad\sum_{i=1}^{k}\lambda_{i}=\frac{1}{N}\log\left\|X_{N,n}\right\|_{(k)}
Proposition 1 (Reduction from sup norm to pointwise norm).

Denote by $U_{0}=\mathbb{I}_{[k]}^{T}\in\mathsf{Fr}_{n,k}$ the $k$-frame whose columns are the first $k$ standard basis vectors of $\mathbb{R}^{n}$. Then, assuming Condition 1 holds, there exist constants $c_{1},c_{2},c_{3}>0$ depending only on $K_{\infty},M_{4}$, such that for any $n\geq k$:

\mathbb{P}\left(\frac{1}{n}\left|\sum_{i=1}^{k}\lambda_{i}-\frac{1}{N}\log\left\|X_{N,n}U_{0}\right\|_{(k)}\right|>s\right)\leq c_{1}\exp\{-c_{2}nNs\} (5)

for all $s>c_{3}\frac{k\log(en/k)}{nN}$.

We prove Proposition 1 in Section 6.1 and emphasize here only the main ideas. The key observation is that the estimate (5) follows from proving that the subspace spanned by the top $k$ singular vectors of $X_{N,n}$ is “well-spread” on the Grassmannian with high probability when $N\gg 1$. To explain this, let us consider the simple but illustrative case of $k=1$. Our goal is then to obtain a lower bound for

0\geq\frac{1}{Nn}\log\|X_{N,n}v\|_{(1)}-\frac{1}{n}\lambda_{1}=\frac{1}{Nn}\log\frac{\|X_{N,n}v\|}{\|X_{N,n}\|_{op}},

where $v=[1,0,0,\dots]^{T}$. If $s_{i}$ are the singular values of $X_{N,n}$ and $e_{i},f_{i}$ are the corresponding right and left singular vectors, then

X_{N,n}v=\sum_{i=1}^{n}s_{i}\langle v,e_{i}\rangle f_{i}

and we have

\frac{2}{Nn}\log\frac{\|X_{N,n}v\|}{\|X_{N,n}\|_{op}}=\frac{1}{Nn}\log\frac{\sum_{i=1}^{n}s_{i}^{2}|\langle v,e_{i}\rangle|^{2}}{s_{1}^{2}}\geq\frac{1}{Nn}\log|\langle v,e_{1}\rangle|^{2}. (6)

Obtaining lower bounds on $\langle v,e_{1}\rangle$ is the same as obtaining small ball probabilities for $e_{1}$ around the orthogonal complement to $v$. Repeating this argument for general $k$ shows that Proposition 1 will follow from the statement that “the distribution of the top singular vectors does not concentrate on a fixed co-dimension $k$ subspace”. See Lemma 11 for the key result to this end.
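The inequality (6) is elementary to check numerically; a short sketch at illustrative sizes (the common $\frac{1}{Nn}$ factor on both sides is dropped, since it does not affect the inequality):

import numpy as np
rng = np.random.default_rng(2)
n = 10
X = rng.normal(size=(n, n))
v = np.eye(n)[:, 0]
s = np.linalg.svd(X, compute_uv=False)
e1 = np.linalg.svd(X)[2][0]                 # top right singular vector of X
print(2 * np.log(np.linalg.norm(X @ v) / s[0])
      >= np.log(np.inner(v, e1) ** 2))      # True, as in (6)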

Prior work [10, 9, 2, 5, 15, 19] assumed that the matrices $W_{i}$ are rotationally invariant and hence that the law of the right singular vectors $(e_{i})_{i=1}^{k}$ is Haar for every $k$. In particular, when $k=1$ this implies

\mathbb{P}\left(|\langle v,e_{1}\rangle|\leq\epsilon/\sqrt{n}\right)\leq C\epsilon

for a universal constant $C$. The main technical difficulty in our present setting is that we do not know how to characterize the (joint) distribution of the top $k$ singular vectors of $X_{N,n}$, even when $k=1$. Nonetheless, we obtain small ball probabilities for $\log\left(\left\|X_{N,n}U_{0}\right\|_{(k)}/\left\|X_{N,n}\right\|_{(k)}\right)$ in Section 6.1 relying only on small ball probabilities for the entrywise measure $\mu$.

Finally, we mention that the frame $U_{0}$ from Proposition 1 does not have to be the canonical frame. A different proof shows that a slightly weaker version of (5) holds for any $U_{0}\in\mathsf{Fr}_{n,k}$ (see eq. (20)). This yields seemingly new information on the problem of studying singular vectors of $W_{1}$, which we believe is of independent interest. We defer this discussion to Section 6.1.1.

Step 3: Doob decomposition and concentration for $\log\left\|X_{N,n}U_{0}\right\|_{(k)}$ in (4).

In light of Proposition 1, estimating the partial products $\left\|X_{N,n}\right\|_{(k)}$ comes down to bounding the “point-wise” norms $\left\|X_{N,n}U_{0}\right\|_{(k)}$ for a fixed frame $U_{0}\in\mathsf{Fr}_{n,k}$, which is done in the following

Proposition 2.

Assuming Condition 1, there exist constants $c_{1},c_{2},c_{3}>0$ depending on $M_{4},K_{\infty}$ such that for any $n,k,N$, any $U_{0}\in\mathsf{Fr}_{n,k}$, and for all $s\geq c_{3}n^{-1}\log(ek)$:

\mathbb{P}\left(\frac{1}{n}\left|\frac{1}{N}\log\left\|X_{N,n}U_{0}\right\|_{(k)}-\frac{1}{2}\sum_{j=1}^{k}\log\frac{n-j+1}{n}\right|>s\right)\leq c_{1}\exp\{-c_{2}snN/\log(ek)\}. (7)

We prove Proposition 2 in Section 6.2. The main idea is to express $\log\left\|X_{N,n}U_{0}\right\|_{(k)}$ as an average. For this, let $U_{1}=U_{0}$ and define $U_{t+1}$ inductively through the singular value decomposition of $W_{t}U_{t}$:

W_{t}U_{t}=U_{t+1}\mathsf{diag}(\{s_{i}(W_{t}U_{t})\}_{i=1}^{k})O_{t},\qquad O_{t}\in\mathsf{Fr}_{k,k},\,U_{t+1}\in\mathsf{Fr}_{n,k}

Then, recalling (3) and noting that $U_{t}$ is a measurable function of $W_{1},\ldots,W_{t-1}$, a simple computation gives the following equality in distribution:

\frac{1}{N}\log\|X_{N,n}U\|_{(k)}^{2}=\frac{1}{N}\sum_{i=1}^{N}\log\|W_{i}U_{i}\|_{(k)}^{2}. (8)
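The telescoping identity behind (8) is easy to verify numerically at small sizes. In the sketch below (illustrative parameters; Gaussian entries for simplicity), the frame $U_{t+1}$ is obtained by a QR factorization, which spans the same subspace as the SVD frame and leaves each factor $\|W_{t}U_{t}\|_{(k)}$ unchanged:

import numpy as np
rng = np.random.default_rng(3)
n, k, N = 12, 3, 8
Ws = [rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(N)]
U0 = np.eye(n)[:, :k]                       # the canonical k-frame
def log_vol_sq(A):
    return np.linalg.slogdet(A.T @ A)[1]    # log ||A||_(k)^2, by the Gram identity (3)
U, total = U0, 0.0
for W in Ws:
    A = W @ U
    total += log_vol_sq(A)                  # accumulate log ||W_t U_t||_(k)^2
    U = np.linalg.qr(A)[0]                  # orthonormalize: the next frame U_{t+1}
X = np.eye(n)
for W in Ws:
    X = W @ X
print(total, log_vol_sq(X @ U0))            # agree up to floating point error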

A direct computation now shows that given any $U\in\mathsf{Fr}_{n,k}$ we have

\mathbb{E}_{\sqrt{n}W\sim\mu^{\otimes n\times n}}\left[\left\|WU\right\|_{(k)}^{2}\right]=\frac{n!}{n^{k}(n-k)!}=\prod_{j=1}^{k}\frac{n-j+1}{n}. (9)

These expectations determine the constants around which $\frac{1}{N}\log\left\|X_{N,n}U_{0}\right\|_{(k)}$ concentrates in Proposition 2. The result then follows from an Azuma-type concentration inequality for random variables with sub-exponential tails (see Lemma 9).
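The exact expectation (9) can be sanity-checked by simulation; a minimal sketch (illustrative sizes; uniform entries satisfying Condition 1):

import numpy as np
rng = np.random.default_rng(4)
n, k, trials = 10, 4, 50_000
U = np.linalg.qr(rng.normal(size=(n, k)))[0]       # any fixed frame in Fr_{n,k}
acc = 0.0
for _ in range(trials):
    W = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, n)) / np.sqrt(n)
    A = W @ U
    acc += np.linalg.det(A.T @ A)                  # ||WU||_(k)^2 via the Gram identity (3)
print(acc / trials)                                # Monte Carlo average
print(np.prod([(n - j + 1) / n for j in range(1, k + 1)]))  # exact value in (9)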

Combining everything together.

Combining Proposition 1 and Proposition 2, we get for $s\gtrsim\log(ek)/n$ that

\mathbb{P}\left(\frac{1}{n}\left|\sum_{i=1}^{k}\lambda_{i}(X_{N,n})-\frac{1}{2}\sum_{j=1}^{k}\log\frac{n-j+1}{n}\right|>s\right)\leq c_{1}\exp\left\{-c_{2}snN/\log(ek)\right\}. (10)

Together with some elementary algebra in Section 6.3, we are able to control the cumulative distribution function of the empirical measure $\rho_{N,n}$ of singular values with high probability.

4 Discussions

In this section, we discuss some extensions and limitations of the present work.

Dependence on $n,N$.

We begin by briefly remarking on the dependence of Theorem 1 and relation (10) on $n,N$. In particular, for fixed $N$, consider an i.i.d. sequence of $\{X_{N,n}\}$; taking $\varepsilon\asymp N^{-1/2}$, one has:

\mathbb{P}\left(\sup_{t\in\mathbb{R}}\left|\frac{1}{n}\#\left\{1\leq i\leq n\mid s_{i}^{2/N}\left(X_{N,n}\right)\leq t\right\}-\mathcal{U}(t)\right|\geq c_{1}N^{-1/2}\right)\leq 2\exp\{-c_{2}n/\log n\}

where $\mathcal{U}(t)\triangleq\mathbb{P}_{x\sim\mathrm{U}[0,1]}(x\leq t)$. By the Borel-Cantelli lemma, this implies that with probability one,

\sup_{t\in\mathbb{R}}\left|\lim_{n\rightarrow\infty}\frac{1}{n}\#\left\{1\leq i\leq n\mid s_{i}^{2/N}\left(X_{N,n}\right)\leq t\right\}-\mathcal{U}(t)\right|\leq\frac{C}{\sqrt{N}}.

The $O(1/\sqrt{N})$ rate can be seen as a Berry-Esseen type bound, see also [19, Section 1.2]. This suggests that our dependence on $N$ is at least comparable to standard CLT rates. However, since we require in (10) that $s\in\Omega(\log(ek)/n)$ (even when $N\to\infty$), it is unclear whether the dependence on $n$ is optimal. This dependence, unfortunately, cannot be improved significantly based on current techniques. To illustrate this, consider $k=1$. The dependence of the mean $\mathbb{E}[\log\|WU_{0}\|^{2}]=\mathbb{E}_{x_{i}\sim_{i.i.d.}\mu}[\log\sum_{i=1}^{n}x_{i}^{2}]$ on $\mu$ can be shown to be $\Omega(1/n)$ (i.e. there exist different $\mu$'s such that the expectations differ by $\Omega(1/n)$). The study of more fine-grained behavior of Lyapunov exponents in the ergodic regime (where universality does not hold for $n<\infty$) is left open to future work.

Extension to complex matrices.

While our proofs are formulated for real matrices, we remark that all results can be directly extended to complex random variables under the following assumptions on $\mu$:

Condition 2.

Suppose that the entries of the $\sqrt{n}W_{i}$'s are drawn i.i.d. from a distribution $X+Y\mathrm{i}$, where $X\sim\mu_{X}$ and $Y\sim\mu_{Y}$ are independent (real) random variables satisfying (1) zero mean and unit variance: $\mathbb{E}[X]=\mathbb{E}[Y]=0$, $\mathbb{E}[X^{2}+Y^{2}]=1$; (2) finite fourth moment: $\mathbb{E}[X^{4}+Y^{4}]\leq M_{4}<\infty$; and (3) densities bounded above: $\|\mu_{X}\|_{L_{\infty}},\|\mu_{Y}\|_{L_{\infty}}\leq K_{\infty}<\infty$.

Extending both Proposition 1 and Proposition 2 mainly requires replacing transposes with conjugate transposes and absolute values with moduli, which requires few changes at all. The only nontrivial technical difference lies in transferring existing small ball estimates for real random variables (which appear in Lemma 4) to their complex analogs. This can be done by noting that a complex frame $A+B\mathrm{i}\in\mathsf{Fr}_{n,k}$ can be converted into a real frame $\begin{bmatrix}A&-B\\ B&A\end{bmatrix}\in\mathsf{Fr}_{2n,2k}$ (and defining the density of a complex distribution on $\mathbb{C}^{d}$ to be the density of its canonical real decomposition in $\mathbb{R}^{2d}$). As a result, Theorem 1 and (10) hold under Condition 2 as well, with a different set of constants (versus Condition 1). In fact, our proof only needs that the $W_{i}$'s are independent and distributed according to $W_{i}=_{d}A_{i}P_{i}$, where $P_{i}$ is any fixed orthogonal matrix and $A_{i}\sim\mu_{i}^{\otimes n\times n}$ for some $\mu_{i}$ satisfying the prescribed conditions.
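For concreteness, here is a minimal sketch (illustrative, not the authors' code) of the real embedding used above, together with a check that it indeed produces a frame in $\mathsf{Fr}_{2n,2k}$:

import numpy as np
def real_embedding(F):
    # Map a complex n x k frame F = A + iB to the real (2n) x (2k) matrix [[A, -B], [B, A]].
    A, B = F.real, F.imag
    return np.block([[A, -B], [B, A]])
rng = np.random.default_rng(5)
n, k = 6, 2
Z = rng.normal(size=(n, k)) + 1j * rng.normal(size=(n, k))
F = np.linalg.qr(Z)[0]                          # a complex frame: F^H F = I_k
R = real_embedding(F)
print(np.allclose(R.T @ R, np.eye(2 * k)))      # True: R is a 2k-frame in R^{2n}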

Dependence on constants $M_{4},K_{\infty}$.

A careful analysis of the proof shows that the constants $c_{1},c_{2},c_{3},c_{4}$ appearing in the statement of Theorem 1 can be taken to depend on $M_{4},K_{\infty}$ as follows:

c_{1}\in O(K_{\infty}^{\delta}\cdot M_{4}),\quad c_{2}\in O(K_{\infty}^{\delta}\cdot M_{4}),\quad c_{3}=2,\quad c_{4}^{-1}\in O(K_{\infty}^{\delta}\cdot M_{4}), (11)

where $\delta>0$ is arbitrary but fixed and the implicit constants in the big-$O$ terms are universal.

5 Review of auxiliary technical results

Before we complete all the deferred proofs, we collect below several technical results used in our main proofs and establish some notation.

Notation.

We use $=_{d}$ to denote equivalence in distribution. Unless specified otherwise, we use $\land$ to denote the minimum and $\lor$ the maximum of two numbers. We denote $[r]\triangleq\{1,2,\dots,r\}$ for $r\in\mathbb{N}$, and $\binom{[r]}{t}\triangleq\{I\subseteq[r]:|I|=t\}$. When not specified otherwise, we write (for an $m\times n$ matrix $A$) $A_{i}\in\mathbb{R}^{n}$ for the $i$-th row of $A$, $i\in[m]$, and $A_{I}\in\mathbb{R}^{|I|\times n}$ for the $|I|\times n$ submatrix indexed by $I\subseteq[m]$. Furthermore, we let $s_{1}(A)\geq s_{2}(A)\geq\dots$ denote the ordered singular values of any matrix $A$.

An isotropic inequality for right products with random uniform frames.

We will examine the effect of applying a “uniformly random” frame to any matrix.

Lemma 1 (See also Section 9 in [19]).

There exists a constant $c$ with the following property. Suppose $G$ is sampled from the Haar measure on $\mathsf{Fr}_{n,k}$. For any invertible matrix $M\in\mathbb{R}^{n\times n}$,

\mathbb{P}\left(\left(\frac{\|MG\|_{(k)}}{\|M\|_{(k)}}\right)^{\frac{1}{k}}\leq\varepsilon\sqrt{\frac{k}{n}}\right)\leq(c\varepsilon)^{\frac{k}{2}}.
Proof.

Note that there exist $L,R\in\mathsf{Fr}_{n,k}$ such that:

L^{T}M=\mathsf{diag}(\{s_{i}(M)\}_{i=1}^{k})R^{T}

so

\|MG\|_{(k)}\geq|\mathsf{det}(L^{T}MG)|=\|M\|_{(k)}|\mathsf{det}(R^{T}G)|

and hence

\frac{\|MG\|_{(k)}}{\|M\|_{(k)}}\geq|\mathsf{det}(R^{T}G)|

where $R$ is a fixed frame. The rest follows from Corollary 9.4 and equation (9.1) in [19]. ∎
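Lemma 1 is also easy to probe by simulation: Haar frames can be sampled as the Q factor of a Gaussian matrix, and small values of the normalized ratio should be rare. A minimal sketch (illustrative sizes):

import numpy as np
rng = np.random.default_rng(6)
n, k, trials = 20, 4, 2000
M = rng.normal(size=(n, n))                          # a fixed invertible matrix
vol_M = np.prod(np.linalg.svd(M, compute_uv=False)[:k])
ratios = []
for _ in range(trials):
    G = np.linalg.qr(rng.normal(size=(n, k)))[0]     # Haar frame on Fr_{n,k}
    vol_MG = np.prod(np.linalg.svd(M @ G, compute_uv=False))
    ratios.append((vol_MG / vol_M) ** (1 / k) * (n / k) ** 0.5)
print(np.quantile(ratios, [0.01, 0.5]))              # small values are rare, as the lemma predicts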

Sub-multiplicativity for products of top singular values.

We will use the following result, which shows that the $\|\cdot\|_{(k)}$ norm is sub-multiplicative.

Lemma 2 (See also [13]).

For any two matrices $A,B\in\mathbb{R}^{n\times n}$, one has:

\|AB\|_{(k)}\leq\|A\|_{(k)}\|B\|_{(k)} (12)
Proof.

For any two matrices $A,B\in\mathbb{R}^{n\times n}$ and frames $L,R\in\mathsf{Fr}_{n,k}$ such that $\mathsf{det}(L^{T}ABR)=\|AB\|_{(k)}$, the standard SVD gives $L^{T}A=\Sigma^{(L)}_{k\times k}R_{1}^{T}$ and $BR=L_{1}\Sigma^{(R)}_{k\times k}$ where $L_{1},R_{1}\in\mathsf{Fr}_{n,k}$, so:

\|AB\|_{(k)}=\mathsf{det}(L^{T}ABR)=\prod_{i}(\Sigma^{(L)}_{k\times k})_{i}(\Sigma^{(R)}_{k\times k})_{i}\cdot\mathsf{det}(R_{1}^{T}L_{1})\leq\prod_{i}(\Sigma^{(L)}_{k\times k})_{i}(\Sigma^{(R)}_{k\times k})_{i}\leq\|A\|_{(k)}\|B\|_{(k)}

which concludes our claim. ∎

Useful results on small-ball probability.

To control the small-ball density of projections, we will use the following result. In fact, this is the only place in which we need the bounded Lebesgue density from Condition 1.

Lemma 3 (Theorem 1.1 of [36]).

Let $X=(X_{1},\ldots,X_{n})$ where the $X_{i}$ are real-valued independent random variables. Assume that the densities of the $X_{i}$ are bounded by $K$ almost everywhere. Let $P\in\mathsf{Fr}_{n,d}$ be an orthogonal projection from $\mathbb{R}^{n}$ onto $\mathbb{R}^{d}$. Then the density of the random vector $P^{T}X$ is bounded by $(CK)^{d}$ almost everywhere, where $C$ is a positive absolute constant. Furthermore, when $d=1$ and $P$ is a vector with norm 1, the maximal density of $\langle P,X\rangle$ is at most $\sqrt{2}K$.

As a corollary, we can show that

Lemma 4.

Let $X=(X_{1},\ldots,X_{n})$ where the $X_{i}$ are real-valued independent random variables. Assume that the densities of the $X_{i}$ are bounded by $K$ almost everywhere. Let $P\in\mathsf{Fr}_{n,d}$ be an orthogonal projection from $\mathbb{R}^{n}$ onto $\mathbb{R}^{d}$. Then

\mathbb{P}\left(\|P^{T}X\|_{2}^{2}\leq d\cdot s\right)\leq(C_{1}K\sqrt{s})^{d}

where $C_{1}$ is an absolute constant.

Proof.

Consider the Lebesgue measure of a $d$-dimensional ball with radius $\sqrt{ds}$. Its volume $V_{d}(\sqrt{ds})$ satisfies:

V_{d}(\sqrt{ds})\leq\left(Cs^{1/2}\right)^{d}

where $C$ is a universal constant independent of $d$. Hence, via Lemma 3 we get:

\mathbb{P}\left(\|P^{T}X\|_{2}^{2}\leq d\cdot s\right)\leq\left(Cs^{1/2}\right)^{d}\cdot(CK)^{d}\leq(\widetilde{C}K\sqrt{s})^{d}

for some universal $\widetilde{C}$. ∎

Useful results on sub-exponential random variables.

We collect here some simple results on sub-exponential random variables. We begin with an elementary result:

Lemma 5.

If a random variable $X$ satisfies, for some constants $c_{1}>1$ and $c_{2}>0$,

\mathbb{P}(|X|\geq t)\leq c_{1}\exp(-t/c_{2})

for all $t>0$, then $\mathbb{P}(|X|\geq t)\leq 2\exp[-t/(2c_{1}c_{2})]$ for all $t>0$.

Proof.

Since probabilities are always bounded above by 1, this can be easily verified by checking that:

c_{1}\exp(-t/c_{2})\land 1\leq 2\exp[-t/(2c_{1}c_{2})]\land 1

for all $c_{1}>1$, $c_{2}>0$, $t\geq 0$. ∎

We now recall the usual equivalent definitions of a sub-exponential random variable.

Lemma 6 (Sub-exponential properties, Proposition 2.7.1 in [41]).

Let $X$ be a random variable. Then the following properties are equivalent; the parameters $K_{i}>0$ appearing in these properties differ from each other by at most an absolute constant factor.

  1. The tails of $X$ satisfy

    \mathbb{P}\{|X|\geq t\}\leq 2\exp\left(-t/K_{1}\right)\quad\text{ for all }t\geq 0

  2. The moments of $X$ satisfy

    \|X\|_{L^{p}}=\left(\mathbb{E}|X|^{p}\right)^{1/p}\leq K_{2}p\quad\text{ for all }p\geq 1

  3. The MGF of $|X|$ satisfies

    \mathbb{E}\exp(\lambda|X|)\leq\exp\left(K_{3}\lambda\right)\quad\text{ for all }\lambda\text{ such that }0\leq\lambda\leq\frac{1}{K_{3}}

  4. The MGF of $|X|$ is bounded at some point, namely

    \mathbb{E}\exp\left(|X|/K_{4}\right)\leq 2

Moreover, if $\mathbb{E}[X]=0$ then the previous properties are also equivalent to the following one: the MGF of $X$ satisfies

\mathbb{E}\exp(\lambda X)\leq\exp\left(K_{5}^{2}\lambda^{2}\right)

for all $\lambda$ such that $|\lambda|\leq\frac{1}{K_{5}}$.

Lemma 7 (Sub-exp properties for almost centered random variables).

There exists a universal $c_{0}>0$ such that the following holds. If a random variable $Z$ satisfies, for some constant $c_{1}$,

\mathbb{P}(|Z|\geq t)\leq 2\exp(-t/c_{1}),

then

\mathbb{E}\left[e^{\lambda Z}\right]\leq\exp\left\{2\left(c_{0}c_{1}\right)^{2}\lambda^{2}\right\},\qquad|\lambda|\in\left[\frac{2}{c_{0}^{2}c_{1}},\frac{1}{c_{0}c_{1}}\right].
Proof.

First of all, denote $\mu=\mathbb{E}[Z]$; then

|\mu|\leq\mathbb{E}[|Z|]=\int_{0}^{\infty}\mathbb{P}(|Z|\geq t)\mathrm{d}t\leq\int_{0}^{\infty}2\exp(-t/c_{1})\mathrm{d}t=2c_{1}.

By Lemma 6, for all $p=1,2,\dots$,

\left(\mathbb{E}[|Z|^{p}]\right)^{1/p}\leq c_{0}c_{1}p

for some fixed constant $c_{0}$. Thus for $p=1,2,\dots$,

\left(\mathbb{E}[|Z-\mu|^{p}]\right)^{1/p}\leq\left(\mathbb{E}[(2|Z|\lor 2|\mu|)^{p}]\right)^{1/p}\leq 2(c_{0}c_{1}p\lor|\mu|)\leq(2c_{0}\lor 2)c_{1}p

and hence (again via Lemma 6) for all $|\lambda|<(\widetilde{c}_{0}c_{1})^{-1}$ one has:

\mathbb{E}[e^{\lambda(Z-\mu)}]\leq\exp\{(\widetilde{c}_{0}c_{1})^{2}\lambda^{2}\},

or

\mathbb{E}[e^{\lambda Z}]\leq\exp\{\lambda\mu+(\widetilde{c}_{0}c_{1})^{2}\lambda^{2}\}.

Thus, so long as $|\lambda\mu|\leq(\widetilde{c}_{0}c_{1})^{2}\lambda^{2}$, or equivalently $|\lambda|\geq|\mu|/(\widetilde{c}_{0}c_{1})^{2}$, the mean deviation is dominated and we get the desired result for some universal constant $\widetilde{c}_{0}$. ∎

Moment inequalities for almost-martingale stochastic processes
Lemma 8 (See also [37]).

Suppose there is a stochastic process $\{Z_{i}\}_{i=1}^{T}$ along with a filtration $\{\mathcal{F}_{i}\}_{i=1}^{T}$, and that for all $i$ and some functions $\{G_{i}(\cdot)\}$ one has:

\mathbb{E}[G_{i+1}(Z_{i+1})\,|\,\mathcal{F}_{i}]\leq a_{i}\qquad\text{almost surely}

for a sequence of positive numbers $\{a_{i}\}$. Then for any fixed $T$:

\mathbb{E}\left[\prod_{i=1}^{T}G_{i}(Z_{i})\right]\leq\prod_{i=1}^{T}a_{i}.
Proof.

We use the law of total expectation multiple times:

\mathbb{E}\left[\prod_{i=1}^{T}G_{i}(Z_{i})\right]=\mathbb{E}\left[\mathbb{E}\left[\prod_{i=1}^{T}G_{i}(Z_{i})\,\middle|\,\mathcal{F}_{T-1}\right]\right]=\mathbb{E}\left[\mathbb{E}\left[G_{T}(Z_{T})\,\middle|\,\mathcal{F}_{T-1}\right]\prod_{i=1}^{T-1}G_{i}(Z_{i})\right]\leq a_{T}\cdot\mathbb{E}\left[\prod_{i=1}^{T-1}G_{i}(Z_{i})\right]\leq\cdots\leq\prod_{i=1}^{T}a_{i}. ∎

Lemma 9.

There exists a universal $c_{0}$ such that the following holds. Consider a stochastic process $\{Z_{i}\}$ adapted to a filtration $\{\mathcal{F}_{i}\}$ such that for all $i$, almost surely,

\mathbb{P}\left(|Z_{i}|\geq t\mid\mathcal{F}_{i-1}\right)\leq c_{1}\exp(-t/c_{2})

for some $c_{1}>1$, $c_{2}>0$. Then, for any $\epsilon\geq 2(c_{0}c_{1}c_{2})$,

\mathbb{P}\left(\left|\frac{1}{T}\sum_{t=1}^{T}Z_{t}\right|>\epsilon\right)\leq 2\exp\{T\cdot(2-(c_{0}c_{1}c_{2})^{-1}\epsilon)\}. (13)
Proof.

We invoke Lemma 5 as well as Lemma 7 to get, for some universal $c_{0}>0$:

\mathbb{E}[e^{\lambda Z_{i}}|\mathcal{F}_{i-1}]\leq\exp\{2(c_{0}c_{1}c_{2})^{2}\lambda^{2}\}

for $|\lambda|=1/(c_{0}c_{1}c_{2})$. Hence, by Lemma 8, a Chernoff-type bound yields:

\mathbb{P}\left(\frac{1}{T}\sum_{t=1}^{T}Z_{t}>\epsilon\right)=\mathbb{P}\left(e^{\lambda\sum_{t=1}^{T}Z_{t}}>e^{\lambda T\epsilon}\right)\leq e^{-\lambda T\epsilon}\mathbb{E}\left[e^{\lambda\sum_{t}Z_{t}}\right]\leq\exp\{-\lambda T\epsilon+2T(c_{0}c_{1}c_{2})^{2}\lambda^{2}\}.

Hence, taking $\lambda=1/(c_{0}c_{1}c_{2})$ we get

\mathbb{P}\left(\frac{1}{T}\sum_{t=1}^{T}Z_{t}>\epsilon\right)\leq\exp\{-\lambda T\epsilon+2T(c_{0}c_{1}c_{2})^{2}\lambda^{2}\}=\exp\{T\cdot(2-(c_{0}c_{1}c_{2})^{-1}\epsilon)\}.

Repeating the argument for $\{-Z_{i}\}$, we get the desired concentration. ∎

An inequality concerning the partial sums of $\log\frac{n-i+1}{n}$'s.

We adapt two inequalities from the prior literature that will be useful. Their proofs involve converting sums into the corresponding integrals.

Lemma 10 (Adapted from the proof of Lemma 12.1 in [19]).

Fix positive integers $q,m$ satisfying $4\leq m\leq q$. If $q>m$, then

\sum_{j=1}^{m}\log\left(\frac{1}{1-j/q}\right)\geq\frac{m^{2}}{2q},

and (even for $q=m$)

\sum_{j=1}^{m}\log\left(\frac{1}{1+j/q}\right)\leq-\frac{2m^{2}}{3q}.

6 Remainder of the proofs

6.1 Remaining proofs in Step 2

The key estimate, (5), follows immediately once we show that (recall (3) and (4))

\mathbb{P}\left(-\log\frac{\|X_{N,n}U_{0}\|_{(k)}}{\|X_{N,n}\|_{(k)}}\geq ck\log(en/k)\right)\leq c_{1}(en/k)^{-c_{2}ck} (14)

for $c>c_{3}$. We prove this estimate by first obtaining an upper bound on the inverse moment of the determinant of $T=\sqrt{n}\mathds{1}_{[k]}WU$ (and thus also a small-ball estimate for this determinant).

Lemma 11.

For all $C_{1}\leq 1/10$, there exists a constant $C$ with the following property. For all $n\geq k$ and $U\in\mathsf{Fr}_{n,k}$, the matrix $T=MU\in\mathbb{R}^{k\times k}$ with $M\sim\mu^{\otimes k\times n}$ satisfies:

\mathbb{E}\left[\left(\left|\mathsf{det}\left(T\right)\right|^{2}/k!\right)^{-C_{1}}\right]\leq\prod_{t=1}^{k}e^{C_{2}/t}\leq e^{C_{2}(1+\log k)},\qquad C_{2}=CK_{\infty}^{4C_{1}}M_{4},

where $K_{\infty}$ is the upper bound for the density of $\mu$ in Condition 1.

Proof.

For a matrix $S$ and a vector $v$ we define $\mathsf{dist}(v,S)\triangleq\inf_{w\in\mathsf{span}(S)}\|v-w\|$. Note that

\log|\mathsf{det}(T)|^{2}/k!=\sum_{i=1}^{k}\left[\log\mathsf{dist}(T_{i},T_{<i})^{2}-\log(k-i+1)\right]\triangleq\sum_{i=1}^{k}\log(C^{(i)}/i),

where for each fixed $i$ we've set

C^{(i)}\triangleq\mathsf{dist}(T_{j},T_{<j})^{2},\qquad j=k-i+1.

Note that the rows $T_{1},\ldots,T_{k}$ of $T$ are projections of the rows $M_{1},\ldots,M_{k}$ of $M$ onto the column space of $U$. Hence, we may also write

C^{(i)}=\mathsf{dist}(M_{j},T_{<j})^{2} (15)

Note that $M_{j}$ is independent of $T_{<j}$. Hence, by Lemma 8, the conclusion of Lemma 11 will follow once we show that there exists $C_{2}>0$ such that for all $t\geq 1$ (and frame $\Theta_{t}$):

\mathbb{E}\left[(C^{(t)}/t)^{-C_{1}}\right]\leq 1+C_{2}/t\leq\exp\{C_{2}/t\}.

To obtain this estimate, we will show that

\mathbb{E}\left[(C^{(t)}/t)^{-C_{1}}{\bf 1}_{\{(C^{(t)}/t)^{-1}>(CK_{\infty})^{4}\}}\right]=O(1/t) (16)

\mathbb{E}\left[(C^{(t)}/t)^{-C_{1}}{\bf 1}_{\{(C^{(t)}/t)^{-1}<(CK_{\infty})^{4}\}}\right]=1+O(1/t), (17)

where $C$ is some universal constant such that $CK_{\infty}\geq 2$. To obtain (16) and (17), we return to (15) and denote by $\Theta_{i}\in\mathsf{Fr}_{n,i}$ a frame consisting of an orthonormal basis for the orthogonal complement to $T_{<j}$. We then have

C^{(i)}=_{d}\left\|\Theta_{i}^{T}u\right\|^{2},\qquad u=M_{i}\text{ is the }i\text{-th row of }M, (18)

where $\Theta_{i}$ is independent of $u$. Since $u\sim\mu^{\otimes n}$, we have

\mathbb{E}[C^{(i)}]=i.

Moreover, by Lemma 4,

\mathbb{P}\left(\left(C^{(t)}/t\right)^{-C_{1}}\geq s\right)\leq(CK_{\infty}\cdot s^{-1/(2C_{1})})^{t}

for some universal constant $C$, which we assume is sufficiently large that $CK_{\infty}\geq 2$. In particular, we have that

s\geq(CK_{\infty})^{4C_{1}}\quad\Longrightarrow\quad(CK_{\infty}\cdot s^{-1/(2C_{1})})^{t}\leq s^{-t/(4C_{1})}

and hence that

\mathbb{E}\left[(C^{(t)}/t)^{-C_{1}}{\bf 1}_{\{(C^{(t)}/t)^{-1}>(CK_{\infty})^{4}\}}\right]\leq\int_{(CK_{\infty})^{4C_{1}}}^{\infty}\mathbb{P}\left((C^{(t)}/t)^{-C_{1}}>s\right)ds\leq\int_{1}^{\infty}s^{-t/(4C_{1})}ds=\frac{4C_{1}}{t-4C_{1}}=O(1/t).

This confirms (16). Next, to verify (17), note that since $x^{-C_{1}}$ is a convex function, there exists a finite $D(K_{\infty})$ such that for all $x\geq(CK_{\infty})^{-4}$ it is bounded above by its second order Taylor expansion around $x=1$:

x^{-C_{1}}\leq 1+C_{1}(1-x)+D(1-x)^{2}.

The right hand side is always positive and hence

\mathbb{E}\left[\mathds{1}_{C^{(t)}/t\geq(CK_{\infty})^{-4}}(C^{(t)}/t)^{-C_{1}}\right]\leq\mathbb{E}\left[1+C_{1}(1-C^{(t)}/t)+D(1-C^{(t)}/t)^{2}\right]=1+\frac{D}{t^{2}}\mathbb{E}\left[(C^{(t)}-t)^{2}\right]. (19)

To deduce (17) we must therefore show that the expression on the right is $1+O(1/t)$. For this, recall from (18) that

C^{(t)}=\|\Theta_{t}^{T}u\|^{2}.

Write

\mathcal{A}(\mu):=\mathbb{E}\left[(C^{(t)})^{2}\right],

where we emphasize the dependence on the distribution $\mu$. Note that $(C^{(t)})^{2}$ is a degree four polynomial in the $u_{i}$'s and that

\mathcal{A}(\mu)=\sum_{i=1}^{n}\sum_{s=1}^{t}\Theta_{si}^{2}\mathbb{E}[u_{i}^{4}]+\text{terms that depend only on the first two moments of }\mu.

Since the first two moments of μ\mu are the same as those of a standard Gaussian we therefore find

\mathcal{A}(\mu)=(M_{4}-3)t+\mathcal{A}(\mathcal{N}(0,1)).

Moreover, writing $\chi_{t}^{2}$ for a chi-squared distribution with $t$ degrees of freedom, we find

\mathcal{A}(\mathcal{N}(0,1))=\mathbb{E}[\left(\chi_{t}^{2}\right)^{2}]=t(t+2).

Hence,

\mathcal{A}(\mu)=t^{2}+(M_{4}-1)t

and so

\mathbb{E}\left[(C^{(t)}-t)^{2}\right]=M_{4}t.

Therefore $\mathbb{E}\left[(C^{(t)}-t)^{2}\right]\leq(M_{4}+5)t$, which combined with (19) yields

\mathbb{E}\left[(C^{(t)}/t)^{-C_{1}}\right]\leq 1+\frac{1+DM_{4}}{t}\leq\exp\left\{\frac{1+DM_{4}}{t}\right\}

where $D=100(CK_{\infty})^{4C_{1}}$ suffices. This verifies (17) and completes the proof. ∎

Completion of Proof of Proposition 1.

Given the above tools, we are now in a position to prove Proposition 1, for which we only need to establish (14). To do this, note that via Lemma 2:

\|X_{N,n}\|_{(k)}\leq\|W_{1}\|_{(k)}\|W_{N}\cdots W_{3}W_{2}\|_{(k)}

and that, via the SVD of $W_{N}W_{N-1}\cdots W_{2}$, there exist $\Theta_{0},L\in\mathsf{Fr}_{n,k}$ such that

\Theta_{0}^{T}\cdot W_{N}\cdots W_{3}W_{2}=\mathsf{diag}(\{s_{i}(W_{N}\cdots W_{3}W_{2})\}_{i=1}^{k})\cdot L^{T}

and hence

\|X_{N,n}U_{0}\|_{(k)}\overset{(3)}{\geq}|\mathsf{det}(\Theta_{0}^{T}X_{N,n}U_{0})|=|\mathsf{det}\left(L^{T}W_{1}U_{0}\right)|\cdot\|W_{N}\cdots W_{2}\|_{(k)}\geq|\mathsf{det}(L^{T}W_{1}U_{0})|\frac{\|X_{N,n}\|_{(k)}}{\|W_{1}\|_{(k)}}

Thus,

0\leq-\log\frac{\|X_{N,n}U_{0}\|_{(k)}}{\|X_{N,n}\|_{(k)}}\leq\log\frac{\|W_{1}\|_{(k)}}{\left|\mathsf{det}\left(L^{T}W_{1}U_{0}\right)\right|}=\log\frac{\|M\|_{(k)}}{\left|\mathsf{det}\left(L^{T}MU_{0}\right)\right|}

where $M=\sqrt{n}\cdot W_{1}\sim\mu^{\otimes n\times n}$ is the un-normalized random matrix. To complete the derivation of (14), it now suffices to show that there exist constants $c_{1},c_{2},c_{3}>0$ such that for $G\sim\mathsf{unif}(\mathsf{Fr}_{n,k})$ (see Lemma 1 for the exact definition):

\mathbb{P}\left(\frac{\|M\|_{(k)}^{2}}{\|MG\|_{(k)}^{2}}\geq(en/k)^{ck}\right),\quad\mathbb{P}\left(\frac{\|MG\|_{(k)}^{2}}{k!}\geq(en/k)^{ck}\right),\quad\mathbb{P}\left(\frac{k!}{|\mathsf{det}(L^{T}MU_{0})|^{2}}\geq(en/k)^{ck}\right)

are all at most

c_{2}(en/k)^{-c_{1}ck}

whenever $c>c_{3}$. We bound these probabilities separately below:

  1. We consider, for any full-rank $M$, with randomness over $G$:

    \mathbb{P}\left(\frac{\|M\|_{(k)}^{2}}{\|MG\|_{(k)}^{2}}\geq(en/k)^{ck}\right)=\mathbb{P}\left(\frac{\|MG\|_{(k)}^{2}}{\|M\|_{(k)}^{2}}\leq\left(\frac{k}{en}\right)^{ck}\right)

    This quantity is bounded directly by Lemma 1, which states that

    \mathbb{P}\left(\left(\frac{\|MG\|_{(k)}}{\|M\|_{(k)}}\right)^{\frac{1}{k}}\leq\varepsilon\sqrt{\frac{k}{n}}\right)\leq(c\varepsilon)^{\frac{k}{2}}

    for any $\varepsilon$ and some universal constant $c$. This means that

    \mathbb{P}\left(\frac{\|M\|_{(k)}^{2}}{\|MG\|_{(k)}^{2}}\geq(en/k)^{ck}\right)\leq\left(\frac{en}{k}\right)^{c_{0}k-ck/4}

    for a universal $c_{0}$.

  2. We show that one has, for any $G$:

    \mathbb{P}\left(\frac{\|MG\|_{(k)}^{2}}{k!}\geq(en/k)^{ck}\right)\leq\left(\frac{en}{k}\right)^{k-ck}

    This follows from Markov's inequality, since by (9):

    \mathbb{E}\left[\frac{\|MG\|_{(k)}^{2}}{k!}\right]=\frac{n(n-1)\cdots(n-k+1)}{k!}\leq\frac{n^{k}}{k!}\leq(en/k)^{k}
  3. For $U_{0}$ being the truncated identity, by Lemma 11, there exist $C_{1},C_{2}$ such that:

    \mathbb{P}\left(\frac{k!}{|\mathsf{det}(L^{T}MU_{0})|^{2}}\geq(en/k)^{ck}\right)\leq(en/k)^{-C_{1}ck}(ek)^{C_{2}}\leq(en/k)^{C_{2}(k+1)-C_{1}ck}

    holds for any $L\in\mathsf{Fr}_{n,k}$ directly, since $(en/k)^{k+1}\geq e^{k+1}>ek$.

Combining these three points along with a union bound concludes our proof of (14) and Proposition 1. $\square$

6.1.1 A different proof without restricting $U_{0}$

Of perhaps separate interest, we show a result similar to Proposition 1 without restricting $U_{0}$ to be the truncated identity. Specifically, we show that:

Proposition 3.

Assuming Condition 1, there exist constants $c_{1},c_{2},c_{3}>0$ depending only on $K_{\infty},M_{4}$, such that for any $n\geq k$ and any $U_{0}\in\mathsf{Fr}_{n,k}$:

\mathbb{P}\left(\frac{1}{n}\left|\sum_{i=1}^{k}\lambda_{i}-\frac{1}{N}\log\left\|X_{N,n}U_{0}\right\|_{(k)}\right|>s\right)\leq c_{1}\exp\{-c_{2}nNs/k\} (20)

for all $s>c_{3}\frac{k\log(en)}{nN}$.

Following the exact same recipe as the proof of Proposition 1, the only distinction lies in the following lemma, which we find interesting in its own right.

Lemma 12.

For any fixed frames $U,V\in\mathsf{Fr}_{n,k}$ and $M\sim\mu^{\otimes n\times n}$ where $\mu$ satisfies Condition 1, one has, for all $t>0$:

\mathbb{P}\left(\left|\mathsf{det}\left(U^{T}MV\right)\right|^{1/k}\leq n^{-(t+1)}k^{-1/2}\right)\leq 2\sqrt{2}K_{\infty}n^{-t}k.
Proof.

First, we look at which linear transformations we can apply to $U,V$ while preserving $\mathsf{det}(U^{T}MV)$:

  • Adding a constant multiple of one column to another column. This, by definition, does not change the determinant.

  • Row exchanges. This changes the determinant only up to sign and preserves the law of $\mathsf{det}(U^{T}MV)$, as the law of $M$ is invariant with respect to row and column exchanges.

Note that for any frame, these two operations allow us to bring $U^{T},V^{T}$ into the form:

U^{T},V^{T}\to\widetilde{U}^{T},\widetilde{V}^{T}=\begin{bmatrix}a_{1}&0&0&\cdots&0&\cdots\\ 0&a_{2}&0&\cdots&0&\cdots\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&a_{k}&\cdots\end{bmatrix},\begin{bmatrix}b_{1}&0&0&\cdots&0&\cdots\\ 0&b_{2}&0&\cdots&0&\cdots\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&b_{k}&\cdots\end{bmatrix}

where $|a_{i}|,|b_{j}|\geq\frac{1}{\sqrt{n}}$, by following the algorithm below.

Algorithm: Diagonalizing a frame $U\in\mathsf{Fr}_{n,k}$ column by column.

  1. Initialize $U^{(0)}=U$. For $t=1,2,\dots,k$, do the following to get $U^{(t)}$ from $U^{(t-1)}$: (1) exchange the row containing the largest (in absolute value) entry of the $t$-th column of $U^{(t-1)}$ with the $t$-th row; (2) use column elimination (adding appropriate scalar multiples of the $t$-th column) to make the $t$-th row all zero except at the $t$-th column.

  2. Output $\widetilde{U}=U^{(k)}$.

To analyze this procedure, note that at step $t$, the norm of the $t$-th column is at least 1 because (ignoring row exchanges, which are irrelevant) only a linear combination of the first $(t-1)$ columns, each orthogonal to it, has been added to it. Hence the largest absolute value of its entries is at least $1/\sqrt{n}$ at step $t$. It follows that the output $U^{(k)}$ has its top $k\times k$ submatrix diagonal, with diagonal entries at least $1/\sqrt{n}$ in absolute value.
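A minimal implementation of this diagonalization procedure (an illustrative sketch, not the authors' code) makes the two operations concrete:

import numpy as np
def diagonalize_frame(U):
    # Bring an n x k frame to the displayed form using only row exchanges
    # and column eliminations (both preserve the law of det(U^T M V)).
    U = U.copy()
    n, k = U.shape
    for t in range(k):
        j = t + np.argmax(np.abs(U[t:, t]))   # (1) move the largest entry of column t...
        U[[t, j]] = U[[j, t]]                 # ...into row t
        for c in range(k):                    # (2) zero out row t outside column t
            if c != t:
                U[:, c] -= (U[t, c] / U[t, t]) * U[:, t]
    return U
rng = np.random.default_rng(7)
D = diagonalize_frame(np.linalg.qr(rng.normal(size=(8, 3)))[0])
top = D[:3]
print(np.allclose(top - np.diag(np.diag(top)), 0))             # top k x k block is diagonal
print(np.all(np.abs(np.diag(top)) >= 1 / np.sqrt(8) - 1e-12))  # diagonal entries >= 1/sqrt(n)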

Note that we can take any $k\times k$ matrices with bounded determinant and multiply our product by them on the left (right). Let the two $k\times k$ matrices be

L,R=\begin{bmatrix}a_{1}^{-1}&0&\cdots&0\\ 0&a_{2}^{-1}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&a_{k}^{-1}\end{bmatrix},\begin{bmatrix}b_{1}^{-1}&0&\cdots&0\\ 0&b_{2}^{-1}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&b_{k}^{-1}\end{bmatrix};

then both $A=\widetilde{U}L$ and $B=\widetilde{V}R$ share the form $\begin{bmatrix}1&0&\cdots&0&\cdots\\ 0&1&\cdots&0&\cdots\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&1&\cdots\end{bmatrix}$. Since

|\mathsf{det}(L)|,|\mathsf{det}(R)|\leq n^{k/2},

we only need to study the small ball probability for

|\mathsf{det}(U^{T}MV)|=_{d}|\mathsf{det}(\widetilde{U}^{T}M\widetilde{V})|=|\mathsf{det}(A^{T}MB)|\cdot\left(|\mathsf{det}(L)||\mathsf{det}(R)|\right)^{-1}\geq n^{-k}|\mathsf{det}(A^{T}MB)|. (21)

To analyze this determinant, we need the following result.

Lemma 13.

Under Condition 1, let $M\sim\mu^{\otimes k\times k}$. Fix any $X\in\mathbb{R}^{k\times k}$; one has:

\mathbb{P}\left(|\mathsf{det}(X+M)|<\left(n^{-t}\sqrt{k^{-1}}\right)^{k}\right)\leq 2\sqrt{2}K_{\infty}n^{-t}k
Proof.

Note this simple observation (Lemma 5.1, [39]): let $N$ be any invertible matrix; then the $i$-th row of $J\triangleq(N^{-1})^{T}$ satisfies:

\|J_{i}\|^{-1}=\|(N^{-1})_{i}\|^{-1}=\min_{c_{j}\in\mathbb{R}}\|N_{i}-\sum_{j\neq i}c_{j}N_{j}\|,

simply because $\langle J_{i},N_{i}-\sum_{j\neq i}c_{j}N_{j}\rangle=1$ for any $\{c_{j}\}$. Hence,

\sigma_{\min}(N)=\sigma^{-1}_{\max}(J)\geq\|J\|_{F}^{-1}\geq\sqrt{k^{-1}}\min_{i}\min_{c_{j}\in\mathbb{R},j\neq i}\left\|N_{i}-\sum_{j\neq i}c_{j}N_{j}\right\|.

To use a union bound over the $k$ rows of $N=X+M$, we only need to show that for any fixed $i$,

\mathbb{P}\left(\min_{c_{j}\in\mathbb{R},j\neq i}\left\|N_{i}-\sum_{j\neq i}c_{j}N_{j}\right\|\leq n^{-t}\right)\leq 2\sqrt{2}K_{\infty}n^{-t}.

To see this, fix $M_{-i}$ and consider only the randomness of $M_{i}\sim\mu^{\otimes k}$. Let $w_{i}$ be the unit vector spanning the null space $\{N_{-i}\}^{\perp}$ (which is independent of $M_{i}$); then:

\min_{c_{j}\in\mathbb{R},j\neq i}\left\|N_{i}-\sum_{j\neq i}c_{j}N_{j}\right\|=|\langle M_{i},w_{i}\rangle+\langle X_{i},w_{i}\rangle|

where $X_{i}$ and $w_{i}$ are $\sigma(M_{-i})$-measurable. That the probability of this being small is as claimed follows directly from Lemma 3. ∎

To complete the proof, let us write the product $A^{T}MB$ as the linear combination

A^{T}MB=\sum_{1\leq i,j\leq n}M_{ij}A_{i}^{T}B_{j}

where $A_{i},B_{j}\in\mathbb{R}^{1\times k}$ are the rows of $A,B$. Note that $A_{i}^{T}B_{j}=E^{(i,j)}$ for $1\leq i,j\leq k$, where $E^{(i,j)}$ denotes the rank-1 matrix whose $(i,j)$-th entry is 1 and whose other entries are 0. Hence, conditioned on the irrelevant entries $\sigma(\{M_{ij}:i\lor j>k\})$ (treating them as constants), we get

A^{T}MB\triangleq M_{[k],[k]}+X,\qquad X\perp\!\!\!\perp M_{[k],[k]};

applying Lemma 13 and combining with (21), we are done. ∎

6.2 Remaining proofs in Step 3: Derivation of Proposition 2

Our goal in this section is to derive Proposition 2. As mentioned in Section 3, the main idea is to express the norm $\log\|X_{N,n}U_{0}\|_{(k)}$ as an average. To do this, we repeatedly use the SVD to obtain an alternative representation of $\|X_{N,n}U\|_{(k)}$ as follows. First, let $U_{1}=U_{0}$ and let $U_{t+1}$ be defined (for $t=1,2,\dots,N-1$) via the singular value decomposition of $W_{t}U_{t}$ as:

W_{t}U_{t}=U_{t+1}\mathsf{diag}(\{s_{i}(W_{t}U_{t})\}_{i=1}^{k})O_{(t)}

for $O_{(t)}\in\mathsf{Fr}_{k,k}$ and $U_{t+1}\in\mathsf{Fr}_{n,k}$. Then $\|X_{N,n}U_{0}\|_{(k)}$ can also be written as (recall (3)):

\|X_{N,n}U_{0}\|_{(k)}=\sup_{V}\mathsf{det}(V^{T}X_{N,n}U_{0})=\sup_{V}\mathsf{det}(V^{T}W_{N}U_{N})\cdot\prod_{t=1}^{N-1}\mathsf{det}\left[\mathsf{diag}(\{s_{i}(W_{t}U_{t})\}_{i=1}^{k})\right]=\prod_{t=1}^{N}\|W_{t}U_{t}\|_{(k)}.

Since $U_{t}$ is $\sigma(W_{1},\dots,W_{t-1})$-measurable and hence independent of $W_{t}$ for all $t$, we only need to study the objective in (8):

\frac{1}{N}\log\|X_{N,n}U\|_{(k)}^{2}=\frac{1}{N}\sum_{i=1}^{N}\log\|W_{i}U_{i}\|_{(k)}^{2}.

Let us define the $n\times k$ product matrix $T^{(i)}=\sqrt{n}W_{i}U_{i}$, whose rows are independent (conditioned on $U_{i}$) with the law

T_{j}\sim_{\mathrm{i.i.d.}}U_{i}^{T}w,\qquad w\sim\mu^{\otimes n},

so that $\mathbb{E}[(T_{j})_{i}]=0$, $\mathbb{E}[(T_{j})_{i}^{2}]=1$, and $\mathbb{E}[(T_{j})_{i_{1}}(T_{j})_{i_{2}}]=0$ for all $i,j$ and $i_{1}\neq i_{2}$. Thus, for any $k\times k$ submatrix $T_{I}$ (indexed by a row subset $I\in\binom{[n]}{k}$ of size $k$), the expected determinant squared (by independence of rows) is exactly

\mathbb{E}[|\mathsf{det}(T_{I})|^{2}]=\sum_{\substack{i_{1},i_{2},\dots,i_{k}\\ j_{1},j_{2},\dots,j_{k}}}\prod_{s=1}^{k}\mathbb{E}\left[T_{I_{s}i_{s}}T_{I_{s}j_{s}}\right]=\sum_{\substack{i_{1},i_{2},\dots,i_{k}\\ j_{1},j_{2},\dots,j_{k}}}\prod_{s=1}^{k}\mathds{1}\left[i_{s}=j_{s}\right]=k!.

This gives

\mathbb{E}\left[\|W_{i}U_{i}\|_{(k)}^{2}\right]=\mathbb{E}\left[\mathsf{det}\left((W_{i}U_{i})^{T}W_{i}U_{i}\right)\right]=\frac{1}{n^{k}}\sum_{I\in\binom{[n]}{k}}\mathbb{E}|\mathsf{det}(T_{I})|^{2}=\prod_{j=1}^{k}\frac{n-j+1}{n}.

To complete the proof of Proposition 2, note that

\mathbb{P}\left(\frac{1}{n}\left|\log\|W_{i}U_{i}\|_{(k)}^{2}-\sum_{j=1}^{k}\log\frac{n-j+1}{n}\right|>s\right)\leq\exp\{-sn\}+\mathbb{P}\left(\frac{\|W_{i}U_{i}\|_{(k)}^{2}}{\frac{n!}{(n-k)!\cdot n^{k}}}<\exp\{-sn\}\right), (22)

where the upper tail bound $\mathbb{P}\left(\frac{\|W_{i}U_{i}\|_{(k)}^{2}}{\frac{n!}{(n-k)!\cdot n^{k}}}>\exp\{sn\}\right)\leq e^{-sn}$ follows from Markov's inequality. Observe that Lemma 11 gives the following inverse moment bound:

\mathbb{E}\left[\left(\left|\mathsf{det}\left(T_{I}\right)\right|^{2}/k!\right)^{-C_{1}}\right]\leq e^{C_{2}(1+\log k)}

for some constants $C_{1},C_{2}$ depending only on $M_{4},K_{\infty}$. Therefore, one can apply Jensen's inequality to $f(x)=x^{-C_{1}}$ to get:

\mathbb{E}\left[\left(\|WU\|_{(k)}^{2}\Big/\frac{n!}{(n-k)!\cdot n^{k}}\right)^{-C_{1}}\right]=\mathbb{E}\left[\left(\frac{1}{\binom{n}{k}}\sum_{|I|=k}|\mathsf{det}(T_{I})|^{2}/k!\right)^{-C_{1}}\right]\leq\mathbb{E}\left[\frac{\sum_{|I|=k}(|\mathsf{det}(T_{I})|^{2}/k!)^{-C_{1}}}{\binom{n}{k}}\right]\leq(ek)^{C_{2}},

where the middle inequality is Jensen's.

This gives the following bound on the lower tail:

\mathbb{P}\left(\|WU\|_{(k)}^{2}\Big/\frac{n!}{(n-k)!\cdot n^{k}}<e^{-sn}\right)\leq\exp\{-C_{1}sn+C_{2}\log(ek)\}. (23)

To conclude the proof of Proposition 2, we combine (8), (23) and (13) in Lemma 9 with

Z_{i}=\frac{1}{n}\left(\log\|W_{i}U_{i}\|_{(k)}^{2}-\sum_{j=1}^{k}\log\frac{n-j+1}{n}\right),

to see that the conditions for Lemma 9 hold with $c_{1}\in\Theta(1)$ and $c_{2}\in\Theta(\frac{\log(ek)}{n})$. We get as a result that for some constants $c_{0},c_{3}$ depending only on $K_{\infty},M_{4}$ and for all $s\geq c_{3}n^{-1}\log(ek)$:

\mathbb{P}\left(\frac{1}{nN}\left|\log\left\|X_{N,n}U\right\|_{(k)}^{2}-N\sum_{j=1}^{k}\log\frac{n-j+1}{n}\right|>s\right)\leq 2\exp\{-c_{0}nNs/\log(ek)\}.

This concludes the proof of (7). \square
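Proposition˜2 itself is easy to probe numerically. The sketch below (with arbitrary illustrative parameters) tracks $\log\|X_{N,n}U\|_{(k)}^{2}$ exactly through reduced QR factorizations, which avoids the overflow and underflow of forming the product directly: writing $X_{N,n}U=Q_{N}(R_{N}\cdots R_{1})$, with each factor coming from one reduced QR step, gives $\log\mathsf{det}\left((X_{N,n}U)^{T}X_{N,n}U\right)=2\sum_{t}\log|\mathsf{det}(R_{t})|$.

```python
import numpy as np

# Numerical check of Proposition 2 (illustrative parameters): accumulate
# log ||X_{N,n} U||_(k)^2 exactly via reduced QR steps.
rng = np.random.default_rng(2)
n, N, k = 100, 50, 10
s3 = np.sqrt(3.0)

Q = np.linalg.qr(rng.standard_normal((n, k)))[0]  # U with orthonormal columns
log_vol2 = 0.0
for _ in range(N):
    W = rng.uniform(-s3, s3, size=(n, n)) / np.sqrt(n)  # admissible entry law
    Q, R = np.linalg.qr(W @ Q)                          # reduced QR, R is k x k
    log_vol2 += 2.0 * np.sum(np.log(np.abs(np.diag(R))))

target = N * sum(np.log((n - j + 1) / n) for j in range(1, k + 1))
print(abs(log_vol2 - target) / (n * N))  # small, as (7) predicts
```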

6.3 Completion of the Proof of Theorem˜1

We are now in a position to complete the proof of Theorem˜1. First, note that (10) follows immediately by combining (5), (7), and a union bound. With (10) in hand, we verify Theorem˜1 as follows.

By adjusting the constants in (2) by universal factors, we may assume that $5/n<\varepsilon<0.01$. Under the hypotheses $N\geq c_{1}\varepsilon^{-2}$ and $n\geq c_{2}\varepsilon^{-2}\log\varepsilon^{-1}$, (10) gives

\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=m}^{k}\left(\lambda_{i}-\frac{1}{2}\log\frac{n-i+1}{n}\right)\right|\geq\varepsilon^{2}\right)\leq C_{2}\exp\left\{-C_{3}nN\varepsilon^{2}/\log(ek)\right\}\leq n^{-4}

provided the constants $c_{1},c_{2}$ are chosen large enough. A union bound over all pairs $m<k$ then gives

\mathbb{P}\left(\exists\,m<k,\quad\left|\frac{1}{n}\sum_{i=m}^{k}\left(\lambda_{i}-\frac{1}{2}\log\frac{n-i+1}{n}\right)\right|\geq\varepsilon^{2}\right)\leq C_{2}\exp\left\{-\widetilde{C}_{3}nN\varepsilon^{2}/\log n\right\}

for some $\widetilde{C}_{3}>0$. We will now show that so long as

\left|\frac{1}{n}\sum_{i=m}^{k}\left(\lambda_{i}-\frac{1}{2}\log\frac{n-i+1}{n}\right)\right|\leq\varepsilon^{2} (24)

holds for all $1\leq m<k\leq n$, one has for all $0\leq t\leq 1$:

|F(t)|\triangleq\left|\frac{1}{n}\#\left\{1\leq i\leq n\mid s_{i}^{2/N}\left(X_{N,n}\right)\leq t\right\}-t\right|\leq 5\varepsilon (25)

which, once established, concludes the proof of Theorem˜1 (again after adjusting constants). First let us record two basic inequalities, both following directly from Lemma˜10. For any $n\geq q\geq m\geq 5$ one has, using $\log\frac{1}{1-x}\geq x$,

m\log(q/n)-\sum_{j=n-q+1}^{n-q+m}\log\frac{n-j+1}{n}=\sum_{j=1}^{m-1}\log\frac{1}{1-j/q}\geq\frac{(m-1)^{2}}{2q}. (26)

Furthermore, for any $n\geq q\geq m\geq 5$ with $n-q-m\geq 0$, we also have

m\log(q/n)-\sum_{j=n-q-m+1}^{n-q}\log\frac{n-j+1}{n}=\sum_{j=1}^{m}\log\left(\frac{1}{1+j/q}\right)\leq-\frac{m^{2}}{3q}. (27)
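Both inequalities depend only on $q$ and $m$, so they can be confirmed by a brute-force scan (a check, not a proof) over a range of parameters:

```python
import numpy as np

# Brute-force check (not a proof) of (26) and (27) over 5 <= m <= q <= 500.
def sum26(q, m):
    j = np.arange(1, m)                   # j = 1, ..., m-1
    return np.sum(np.log(1.0 / (1.0 - j / q)))

def sum27(q, m):
    j = np.arange(1, m + 1)               # j = 1, ..., m
    return np.sum(np.log(1.0 + j / q))    # = -(middle expression in (27))

ok = True
for q in range(5, 501):
    for m in range(5, q + 1):
        ok &= sum26(q, m) >= (m - 1) ** 2 / (2 * q)
        ok &= sum27(q, m) >= m ** 2 / (3 * q)
print(ok)  # expect: True
```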

To show (25), it suffices to check it for $t=1$ and for $t=s/n$ with $1\leq s\leq n-1$ an integer, since $\frac{1}{n}\#\left\{1\leq i\leq n\mid s_{i}^{2/N}\left(X_{N,n}\right)\leq t\right\}$ is non-decreasing in $t$. Indeed, if $nt\in(s,s+1)$ with $0\leq s\leq n-1$, then

|F(t)|\leq\frac{1}{n}+|F(s/n)|\lor|F((s+1)/n)|\leq\varepsilon+|F(s/n)|\lor|F((s+1)/n)|,

while for $t\geq 1$ the relevant difference $\frac{1}{n}\#\{i\mid s_{i}^{2/N}\leq t\}-1$ lies between $F(1)$ and $0$, and the case $t\leq 0$ is trivial. Write $\alpha\leq n\varepsilon<\alpha+1$ with $\alpha\geq 5$ an integer (recall $\varepsilon>5/n$). We will show that $|F(t)|\leq 4\varepsilon$ for $t=s/n$, $s=1,2,\dots,n$.

The case of $t=1$

Suppose $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{r}\geq 0>\lambda_{r+1}$. Then $|F(1)|=r/n$, and since $\lambda_{i}\geq 0$ for $i\leq r$, (24) reads

n\varepsilon^{2}\geq\sum_{i=1}^{r}\left(\lambda_{i}-\frac{1}{2}\log\frac{n-i+1}{n}\right)\geq-\frac{1}{2}\sum_{i=1}^{r}\log\frac{n-i+1}{n}.

If $r\leq 5$ then $|F(1)|\leq 5/n<\varepsilon$ and (25) already holds at $t=1$. Otherwise, by (26) with $q=n,m=r$ one gets

n\varepsilon^{2}\geq-\frac{1}{2}\sum_{i=1}^{r}\log\frac{n-i+1}{n}\geq\frac{(r-1)^{2}}{4n},

which implies $r\leq 1+2n\varepsilon$ and hence $|F(1)|=r/n<3\varepsilon$.

The case of $t=s/n$, where $s\in[1,n-1]$ is an integer

Suppose

\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{r}>\frac{1}{2}\log(s/n)\geq\lambda_{r+1}\geq\dots\geq\lambda_{n}.

Then $|F(t)|=|r+s-n|/n$. Again, if $|r+s-n|\leq 5$, then $|F(t)|\leq 5/n<\varepsilon$ and we are done.

  • If $r+s-n\geq 5$, then our condition (24), combined with (26) with $q=s,m=r+s-n$, reads

    2n\varepsilon^{2}\geq\sum_{i=n-s+1}^{r}\left(2\lambda_{i}-\log\frac{n-i+1}{n}\right)\geq m\log(s/n)-\sum_{j=n-s+1}^{r}\log\frac{n-j+1}{n}\geq\frac{(m-1)^{2}}{2s},

    i.e. $(n|F(t)|-1)^{2}\leq 4ns\varepsilon^{2}\leq 4n^{2}\varepsilon^{2}$. This implies that $|F(t)|\leq 1/n+2\varepsilon<3\varepsilon$.

  • Suppose instead $n-s-r\geq 5$. If $n-r<\alpha$ then $|F(t)|=(n-r-s)/n<\alpha/n\leq\varepsilon$ and we are done. Assume then that $n-r\geq\alpha$. If $n-r\leq 2s$, then (27) with $q=s,m=n-r-s$ reads

    2n\varepsilon^{2}\geq\sum_{i=r+1}^{n-s}\left(-2\lambda_{i}+\log\frac{n-i+1}{n}\right)\geq-m\log(s/n)+\sum_{j=r+1}^{n-s}\log\frac{n-j+1}{n}\geq\frac{m^{2}}{3s},

    which implies $m^{2}\leq 6ns\varepsilon^{2}\leq 6n^{2}\varepsilon^{2}$ and hence $|F(t)|=m/n\leq\sqrt{6}\,\varepsilon<3\varepsilon$. If $n-r>2s$, then (27) with $q=m=s$ reads

    2n\varepsilon^{2}\geq\sum_{i=n-2s+1}^{n-s}\left(-2\lambda_{i}+\log\frac{n-i+1}{n}\right)\geq-s\log(s/n)+\sum_{j=n-2s+1}^{n-s}\log\frac{n-j+1}{n}\geq\frac{s}{3},

    which implies $s\leq 6n\varepsilon^{2}<\alpha$ and hence $t<\varepsilon$. In this case, note that

    |F(t)|=\frac{n-r}{n}-t\leq\left|F\left(\frac{\alpha+1}{n}\right)\right|+\frac{\alpha+1-s}{n}\leq 3\varepsilon+\frac{\alpha}{n}\leq 4\varepsilon,

    where $|F((\alpha+1)/n)|\leq 3\varepsilon$ by the previous cases (the present subcase cannot occur at $s=\alpha+1$, since it would force $\alpha+1\leq 6n\varepsilon^{2}<\alpha$). This concludes the case analysis.

This completes the proof of Theorem˜1. $\square$
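To illustrate Theorem˜1 end to end, the following sketch (with arbitrary illustrative parameters; uniform entries are one admissible choice of $\mu$ under Condition˜1) estimates the Kolmogorov-Smirnoff distance between $\rho_{N,n}$ and $\mathrm{U}_{[0,1]}$. Forming $X_{N,n}$ directly would overflow and underflow, so it uses the standard discrete-QR recursion; the accumulated log-diagonals of the triangular factors are a standard large-$N$ proxy for the log singular values of the product, not an exact computation of them.

```python
import numpy as np

# Illustration of Theorem 1 (a sketch, not the paper's code): the empirical
# law of s_i(X_{N,n})^{2/N} should be close to Uniform[0,1]. The discrete-QR
# recursion below writes X_{N,n} = Q_N (R_N ... R_1); the accumulated
# log-diagonals of the R_t are a standard proxy for the log singular values.
rng = np.random.default_rng(3)
n, N = 200, 200
s3 = np.sqrt(3.0)

Q = np.eye(n)
log_diag = np.zeros(n)
for _ in range(N):
    W = rng.uniform(-s3, s3, size=(n, n)) / np.sqrt(n)  # admissible entry law
    Q, R = np.linalg.qr(W @ Q)
    log_diag += np.log(np.abs(np.diag(R)))

vals = np.sort(np.exp(2.0 * log_diag / N))  # proxy for s_i^{2/N}, increasing
i = np.arange(1, n + 1)
ks = np.max(np.maximum(np.abs(i / n - vals), np.abs((i - 1) / n - vals)))
print("approximate KS distance:", ks)  # small, consistent with Theorem 1
```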

References

  • AB [12] Gernot Akemann and Zdzislaw Burda. Universal microscopic correlation functions for products of independent Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 45(46):465201, 2012.
  • ABK [14] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. Universal distribution of Lyapunov exponents for products of Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 47(39):395202, 2014.
  • ABK [19] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. From integrable to chaotic systems: Universal local statistics of Lyapunov exponents. EPL (Europhysics Letters), 126(4):40001, 2019.
  • AEV [23] Artur Avila, Alex Eskin, and Marcelo Viana. Continuity of the Lyapunov exponents of random matrix products. arXiv preprint arXiv:2305.06009, 2023.
  • Ahn [22] Andrew Ahn. Fluctuations of β-Jacobi product processes. Probability Theory and Related Fields, 183(1):57–123, 2022.
  • AJM+ [95] Ludwig Arnold, Christopher K. R. T. Jones, Konstantin Mischaikow, and Geneviève Raugel. Random dynamical systems. Springer, 1995.
  • AKMP [19] Gernot Akemann, Mario Kieburg, Adam Mielke, and Tomaž Prosen. Universal signature from integrability to chaos in dissipative open quantum systems. Physical review letters, 123(25):254101, 2019.
  • B+ [12] Philippe Bougerol et al. Products of random matrices with applications to Schrödinger operators, volume 8. Springer Science & Business Media, 2012.
  • BLS [13] Zdzislaw Burda, Giacomo Livan, and Artur Swiech. Commutative law for products of infinitely large isotropic random matrices. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 88(2):022107, 2013.
  • BNS [12] Zdzislaw Burda, Maciej A. Nowak, and Artur Swiech. Spectral relations between products and powers of isotropic random matrices. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 86(6):061137, 2012.
  • FK [60] Harry Furstenberg and Harry Kesten. Products of random matrices. The Annals of Mathematical Statistics, 31(2):457–469, 1960.
  • GJ [21] Friedrich Götze and Jonas Jalowy. Rate of convergence to the circular law via smoothing inequalities for log-potentials. Random Matrices: Theory and Applications, 10(03):2150026, 2021.
  • GN [50] Izrail Moiseevich Gel’fand and Mark Aronovich Naimark. The relation between the unitary representations of the complex unimodular group and its unitary subgroup. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 14(3):239–260, 1950.
  • GS [18] Vadim Gorin and Yi Sun. Gaussian fluctuations for products of random matrices. arXiv preprint arXiv:1812.06532, 2018.
  • GS [22] Vadim Gorin and Yi Sun. Gaussian fluctuations for products of random matrices. American Journal of Mathematics, 144(2):287–393, 2022.
  • GT [10] Friedrich Götze and Alexander Tikhomirov. On the asymptotic spectrum of products of independent random matrices. arXiv preprint arXiv:1012.2710, 2010.
  • Han [18] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems, 2018.
  • HN [20] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, 376(1):287–322, 2020.
  • HP [21] Boris Hanin and Grigoris Paouris. Non-asymptotic results for singular values of Gaussian matrix products. Geometric and Functional Analysis, 31(2):268–324, 2021.
  • IN [92] Marco Isopi and Charles M. Newman. The triangle law for Lyapunov exponents of large random matrices. Communications in Mathematical Physics, 143(3):591–598, 1992.
  • Kar [08] Vladislav Kargin. Lyapunov exponents of free operators. Journal of Functional Analysis, 255(8):1874–1888, 2008.
  • Kar [14] Vladislav Kargin. On the largest Lyapunov exponent for products of Gaussian matrices. Journal of Statistical Physics, 157(1):70–83, 2014.
  • KZ [97] Maxim Kontsevich and Anton Zorich. Lyapunov exponents and Hodge theory. arXiv preprint hep-th/9701164, 1997.
  • LP [06] Émile Le Page. Théorèmes limites pour les produits de matrices aléatoires. In Probability Measures on Groups: Proceedings of the Sixth Conference Held at Oberwolfach, Germany, June 28–July 4, 1981, pages 258–303. Springer, 2006.
  • LWW [18] Dang-Zheng Liu, Dong Wang, and Yanhui Wang. Lyapunov exponent, universality and phase transition for products of random matrices. arXiv preprint arXiv:1810.00433, 2018.
  • LWZ [19] Dang-Zheng Liu, Dong Wang, and Lun Zhang. Bulk and soft-edge universality for singular values of products of Ginibre random matrices. Annales Henri Poincaré, 55(1):98–126, 2019.
  • MS [17] James A Mingo and Roland Speicher. Free probability and random matrices, volume 35. Springer, 2017.
  • MT [02] Howard Masur and Serge Tabachnikov. Rational billiards and flat structures. In Handbook of dynamical systems, volume 1, pages 1015–1089. Elsevier, 2002.
  • New [86] Charles M. Newman. The distribution of Lyapunov exponents: exact results for random matrices. Communications in Mathematical Physics, 103(1):121–126, 1986.
  • NS [06] Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006.
  • OS [10] Sean O'Rourke and Alexander Soshnikov. Products of independent non-Hermitian random matrices. 2010.
  • Ose [68] Valery Iustinovich Oseledec. A multiplicative ergodic theorem, Lyapunov characteristic numbers for dynamical systems. Transactions of the Moscow Mathematical Society, 19:197–231, 1968.
  • PSG [17] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pages 4788–4798, 2017.
  • PSG [18] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
  • RV [15] Mark Rudelson and Roman Vershynin. Small ball probabilities for linear images of high-dimensional distributions. International Mathematics Research Notices, 2015(19):9594–9617, 2015.
  • Sha [11] Ohad Shamir. A variant of Azuma's inequality for martingales with subgaussian tails. arXiv preprint arXiv:1110.2392, 2011.
  • Tuc [10] Gabriel H. Tucci. Limit laws for geometric means of free random variables. Indiana University Mathematics Journal, pages 1–13, 2010.
  • TV [10] Terence Tao and Van Vu. Random matrices: The distribution of the smallest singular values. Geometric and Functional Analysis, 20:260–297, 2010.
  • VDN [92] Dan V Voiculescu, Ken J Dykema, and Alexandru Nica. Free random variables, volume 1. American Mathematical Soc., 1992.
  • Ver [18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • Wil [17] Amie Wilkinson. What are Lyapunov exponents, and why are they interesting? Bulletin of the American Mathematical Society, 54(1):79–105, 2017.