
Estimation of high-dimensional low-rank matrices

Angelika Rohde (angelika.rohde@math.uni-hamburg.de) and Alexandre B. Tsybakov (alexandre.tsybakov@upmc.fr)
Universität Hamburg and CREST

Fachbereich Mathematik, Universität Hamburg, Bundesstraße 55, D-20146 Hamburg, Germany

Laboratoire de Statistique, CREST, 3, avenue Pierre Larousse, F-92240 Malakoff Cedex, France

(Published 2011; received December 2009; revised July 2010)
Abstract

Suppose that we observe entries or, more generally, linear combinations of entries of an unknown m\times T matrix A corrupted by noise. We are particularly interested in the high-dimensional setting where the number mT of unknown entries can be much larger than the sample size N. Motivated by several applications, we consider estimation of matrix A under the assumption that it has small rank. This can be viewed as a dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-p quasi-norm penalty term, p\leq 1. We study these estimators under two possible assumptions—a modified version of the restricted isometry condition and a uniform bound on the ratio “empirical norm induced by the sampling operator/Frobenius norm.” The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-q risk of the estimators, where q\in[p,2]. The rates that we obtain for the prediction risk are of the form rm/N (for m=T), up to logarithmic factors, where r is the rank of A. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the kth entropy numbers of the quasi-convex Schatten class embeddings S_{p}^{M}\hookrightarrow S_{2}^{M}, p<1, which are of independent interest.

AMS subject classifications: 62G05, 62F10.
Keywords: High-dimensional low-rank matrices, empirical process, sparse recovery, Schatten norm, penalized least-squares estimator, quasi-convex Schatten class embeddings.
DOI: 10.1214/10-AOS860.
Volume 39, Issue 2.

Supported in part by the Grant ANR-06-BLAN-0194, ANR “Parcimonie” and by the PASCAL-2 Network of Excellence.


1 Introduction

Consider the observations (Xi,Yi)(X_{i},Y_{i}) satisfying the model

Yi=\operatornametr(XiA)+ξi,i=1,,N,Y_{i}=\operatorname{tr}(X_{i}^{\prime}A^{*})+\xi_{i},\qquad i=1,\ldots,N, (1)

where X_{i}\in\mathbb{R}^{m\times T} are given matrices (m rows, T columns), A^{*}\in\mathbb{R}^{m\times T} is an unknown matrix, \xi_{i} are i.i.d. random errors, \operatorname{tr}(B) denotes the trace of a square matrix B and X^{\prime} stands for the transpose of X. Our aim is to estimate the matrix A^{*} and to predict the future Y-values based on the sample (X_{i},Y_{i}), i=1,\ldots,N.
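To fix ideas, here is a minimal simulation sketch of the trace regression model (1) in Python; the dimensions, the rank and the noise level are arbitrary illustration values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T, N, r = 20, 30, 200, 2     # illustrative sizes: m x T matrix, N observations, rank r
sigma = 0.5                      # noise standard deviation

# Low-rank target matrix A* of rank r
A_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, T))

# Masks X_i (dense Gaussian matrices here, purely for illustration)
X = rng.standard_normal((N, m, T))

# Observations Y_i = tr(X_i' A*) + xi_i; note tr(X'A) is the sum of entrywise products
xi = sigma * rng.standard_normal(N)
Y = np.einsum('nij,ij->n', X, A_star) + xi
```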

We will call model (1) the trace regression model. Clearly, for T=1T=1 it reduces to the standard regression model. The “design” matrices XiX_{i} will be called masks. This name is motivated by the fact that we focus on the applications of trace regression where XiX_{i} are very sparse, that is, contain only a small percentage of nonzero entries. Therefore, multiplication of AA^{*} by XiX_{i} masks most of the entries of AA^{*}. The following two examples are of particular interest.

(i) Point masks. For some, typically small, integer dd the point masks XiX_{i} are defined as elements of the set

\mathcal{X}_{d}=\Biggl\{\sum_{i=1}^{d}e_{k_{i}}(m)e_{l_{i}}^{\prime}(T)\dvtx 1\leq k_{i}\leq m, 1\leq l_{i}\leq T, \mbox{ with }(k_{i},l_{i})\not=(k_{i^{\prime}},l_{i^{\prime}})\mbox{ for }i\not=i^{\prime}\Biggr\},

where e_{k}(m) are the canonical basis vectors of \mathbb{R}^{m}. In particular, for d=1 the point masks X_{i} are matrices that have only one nonzero entry, which equals 1. The problem of estimation of A^{*} in this case becomes the problem of matrix completion; the observations Y_{i} are just some selected entries of A^{*} corrupted by noise, and the aim is to reconstruct all the entries of A^{*}. The problem of matrix completion dates back at least to Srebro, Rennie and Jaakkola (2005), Srebro and Shraibman (2005) and is mainly motivated by applications in recommendation systems. We will analyze the following two special cases of matrix completion:

  • USR (Uniform Sampling at Random) matrix completion. The masks XiX_{i} are independent, uniformly distributed on

    𝒳1={ek(m)el(T)\dvtx1km,1lT},\mathcal{X}_{1}=\{e_{k}(m)e_{l}^{\prime}(T)\dvtx 1\leq k\leq m,1\leq l\leq T\},

    and independent of \xi_{1},\dots,\xi_{N}.

  • Collaborative sampling (CS) matrix completion. The masks X_{i} (random or deterministic) belong to \mathcal{X}_{1}, are all distinct and are independent of \xi_{1},\ldots,\xi_{N}.

The CS matrix completion model is natural for describing recommendation systems where every user rates every product at most once. The USR matrix completion can be used for transmission of a large-dimensional matrix through a noisy communication channel; only a chosen small number of entries is transmitted, and nevertheless the original matrix A^{*} can be reconstructed by the receiver. An important feature of real-world matrix completion problems is that the number of observed entries is much smaller than the size of the matrix: N\ll mT, whereas mT can be very large. For example, mT is of the order of hundreds of millions for the Netflix problem.
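As a small illustration of USR sampling (a sketch with arbitrary sizes; the variable names are mine), the point masks reduce the observations to noisy entries of A^{*}:

```python
import numpy as np

rng = np.random.default_rng(1)
m, T, N = 50, 60, 400            # N << mT: only a small fraction of entries is observed
sigma = 0.1

# Rank-3 target matrix A*
A_star = rng.standard_normal((m, 3)) @ rng.standard_normal((3, T))

# USR sampling: masks X_i = e_{k_i}(m) e_{l_i}'(T) with (k_i, l_i) uniform on the grid
rows = rng.integers(0, m, size=N)
cols = rng.integers(0, T, size=N)

# For a point mask, tr(X_i' A*) is simply the (k_i, l_i) entry of A*
Y = A_star[rows, cols] + sigma * rng.standard_normal(N)
```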

(ii) Column or row masks. If X_{i} has only a small number d of nonzero columns or rows, it is called a column or row mask, respectively. We suppose here that d is much smaller than m and T. The remarkable case d=1 covers the problem known in Statistics and Econometrics as longitudinal (or panel, or cross-section) data analysis and in Machine Learning as multi-task learning. In what follows, we will designate this problem as multi-task learning, to avoid ambiguity. In the simplest version of multi-task learning, we have N=nT where T is the number of tasks (for instance, in image detection each task t is associated with a particular type of visual object, e.g., face, car, chair, etc.), and n is the number of observations per task. The tasks are characterized by vectors of parameters a^{*}_{t}\in\mathbb{R}^{m}, t=1,\dots,T, which constitute the columns of matrix A^{*}:

A=(a1aT).A^{*}=(a^{*}_{1}\cdots a^{*}_{T}).

The XiX_{i} are column masks, each containing only one nonzero column 𝐱(t,s)m\mathbf{x}^{(t,s)}\in\mathbb{R}^{m} (with the convention that 𝐱(t,s)\mathbf{x}^{(t,s)} is the ttth column):

Xi{(00𝐱(t,s)t00),t=1,,T,s=1,,n}.X_{i}\in\bigl{\{}\bigl{(}0\cdots 0\underbrace{{\mathbf{x}}^{(t,s)}}_{t}0\cdots 0\bigr{)},t=1,\ldots,T,s=1,\ldots,n\bigr{\}}.

The column 𝐱(t,s)\mathbf{x}^{(t,s)} is interpreted as the vector of predictor variables corresponding to ssth observation for the ttth task. Thus, for each i=1,,Ni=1,\dots,N there exists a pair (t,s)(t,s) with t=1,,T,s=1,,nt=1,\dots,T,s=1,\dots,n, such that

\operatornametr(XiA)=(at)𝐱(t,s).\operatorname{tr}(X_{i}^{\prime}A^{*})=(a^{*}_{t})^{\prime}\mathbf{x}^{(t,s)}. (2)

If we denote by Y(t,s)Y^{(t,s)} and ξ(t,s)\xi^{(t,s)} the corresponding values YiY_{i} and ξi\xi_{i}, then the trace regression model (1) can be written as a collection of TT standard regression models:

Y(t,s)=(at)𝐱(t,s)+ξ(t,s),t=1,,T,s=1,,n.Y^{(t,s)}=(a^{*}_{t})^{\prime}\mathbf{x}^{(t,s)}+\xi^{(t,s)},\qquad t=1,\ldots,T,s=1,\ldots,n.

This is the usual formulation of the multi-task learning model in the literature.
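The identity (2) relating the column-mask formulation to the usual multi-task regression can be checked numerically; the sketch below uses arbitrary sizes and 0-based indices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 5, 4                            # predictors per task, number of tasks
A_star = rng.standard_normal((m, T))   # columns a*_t are the task parameter vectors

t = 1                                  # some task index (0-based here)
x_ts = rng.standard_normal(m)          # predictor vector x^{(t,s)} for one observation of task t

# Column mask: zero everywhere except the t-th column, which equals x^{(t,s)}
X_i = np.zeros((m, T))
X_i[:, t] = x_ts

# tr(X_i' A*) equals (a*_t)' x^{(t,s)}, i.e., a single-task regression value
assert np.isclose(np.trace(X_i.T @ A_star), A_star[:, t] @ x_ts)
```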

For both examples given above, the matrices X_{i} are sparse in the sense that they have only a small portion of nonzero entries. On the other hand, such a sparsity property is not necessarily granted for the target matrix A^{*}. Nevertheless, we can always characterize A^{*} by its rank r=\operatorname{rank}(A^{*}), and say that a matrix is sparse if it has small rank; cf. Recht, Fazel and Parrilo (2010). For example, the problem of estimation of a square matrix A^{*}\in\mathbb{R}^{m\times m} is a parametric problem which is formally of dimension m^{2} but has only (2m-r)r free parameters. If r is small as compared to m, then the intrinsic dimension of the problem is of the order rm. In other words, the rank sparsity assumption r\ll m is a dimension reduction assumption. This assumption will be crucial for the interpretation of our results. Another sparsity assumption that we will consider is that the Schatten-p norm of A^{*} (see the definition in Section 2 below) is small for some 0<p\leq 1. This is an analog of sparsity expressed in terms of the \ell_{p} norm, 0<p\leq 1, in vector estimation problems.

Estimation of high-dimensional matrices has recently been studied by several authors in settings different from ours [cf., e.g., Meinshausen and Bühlmann (2006), Bickel and Levina (2008), Ravikumar et al. (2008), Amini and Wainwright (2009), Cai, Zhang and Zhou (2010) and the references cited therein]. Most of the attention was devoted to estimation of a large covariance matrix or its inverse. In these papers, sparsity is characterized by the number of nonzero entries of a matrix.

Candès and Recht (2009), Candès and Tao (2009), Gross (2009), Recht (2009) considered the noiseless setting (\xi_{i}\equiv 0) of the matrix completion problem under the condition that the singular vectors of A^{*} are sufficiently spread out on the unit sphere or “incoherent.” They focused on exact recovery of A^{*}. Until now, the sharpest results are those of Gross (2009) and Recht (2009), who showed that under the “incoherence condition” exact recovery is possible with high probability if N>Cr(m+T)\log^{2}m with some constant C>0 when we observe N entries of a matrix A^{*}\in\mathbb{R}^{m\times T} with locations uniformly sampled at random. Candès and Plan (2010a), Keshavan, Montanari and Oh (2009) explored the same setting in the presence of noise, proposed estimators \hat{A} of A^{*} and evaluated the Frobenius norm \|\hat{A}-A^{*}\|_{F} of the error. The best bounds are those of Keshavan, Montanari and Oh (2009), who suggest \hat{A} such that for A^{*}\in\mathbb{R}^{m\times T} and T=\alpha m with \alpha>1 the squared error \|\hat{A}-A^{*}\|_{F}^{2} is of the order \alpha^{5/2}rm^{3}(\log N)/N with probability close to 1 when the noise is i.i.d. Gaussian.

In this paper, we consider the general noisy setting of the trace regression problem. We study a class of Schatten-pp estimators A^\hat{A}, that is, the penalized least squares estimators with a penalty proportional to Schatten-pp norm; cf. (7). The special case p=1p=1 corresponds to the “matrix Lasso.” We study the convergence properties of their prediction error

d^2,N(A^,A)2=N1i=1N\operatornametr2(Xi(A^A))\hat{d}_{2,N}({\hat{A}},A^{*})^{2}=N^{-1}\sum_{i=1}^{N}\operatorname{tr}^{2}\bigl{(}X_{i}^{\prime}({\hat{A}}-A^{*})\bigr{)}

and of their Schatten-qq error. The main contributions of this paper are the following.

  (a)

    For all 0<p\leq 1, under various assumptions on the masks X_{i} (no assumption, USR matrix completion, CS matrix completion) we obtain different bounds on the prediction error of Schatten-p estimators involving the Schatten-p norm of A^{*}.

  (b)

    For p sufficiently close to 0, under a mild assumption on X_{i}, we show that Schatten-p estimators achieve the prediction error rate of convergence r\max(m,T)/N, up to a logarithmic factor. This result is valid for matrices A^{*} whose singular values are not exponentially large in N. It covers the matrix completion and high-dimensional multi-task learning problems.

  (c)

    For all 0<p\leq 1, we obtain upper bounds for the prediction error under the matrix Restricted Isometry (RI) condition on the masks X_{i}, which is a rather strong condition, and under the assumption that \operatorname{rank}(A^{*})\leq r. We also derive bounds for the Schatten-q error of \hat{A}. The rate in the bounds for the prediction error is r\max(m,T)/N when the RI condition is satisfied with scaling factor 1 (i.e., in the case not related to matrix completion and high-dimensional multi-task learning).

  (d)

    We prove the lower bounds showing that the rate rmax(m,T)/Nr\max(m,T)/N is minimax optimal for the prediction error and Schatten-22 (i.e., Frobenius) norm estimation error under the RI condition on the class of matrices AA^{*} of rank smaller than rr. Our result is even more general because we prove our lower bound on the intersection of the Schatten-0 ball with the Schatten-pp ball for any 0<p10<p\leq 1, which allows us to show minimax optimality of the upper bounds of (a) as well. Furthermore, we prove minimax lower bounds for collaborative sampling and USR matrix completion problems.

The main point of this paper is to show that suitably tuned Schatten estimators attain the optimal rate of prediction error up to logarithmic factors. The striking fact is that we can achieve this not only under very restrictive assumptions, such as the RI condition, but also under very mild assumptions on the masks X_{i}.

Finally, it is useful to compare the results for matrix estimation when the sparsity is expressed by the rank with those for the high-dimensional vector estimation when the sparsity is expressed by the number of nonzero components of the vector. For the vector estimation, we have the linear model

Yi=Xiβ+ξi,i=1,,N,Y_{i}=X_{i}^{\prime}\beta+\xi_{i},\qquad i=1,\ldots,N,

where XipX_{i}\in\mathbb{R}^{p}, βp\beta\in\mathbb{R}^{p} and, for example, ξi\xi_{i} are i.i.d. 𝒩(0,1)\mathcal{N}(0,1) random variables. Consider the high-dimensional case pNp\gg N. (This is analogous to the assumption m2Nm^{2}\gg N in the matrix problem and means that the nominal dimension is much larger than the sample size.) The sparsity assumption for the vector case has the form sNs\ll N, where ss is the number of nonzero components, or the intrinsic dimension of β\beta. Let β^\hat{\beta} be an estimator of β\beta. Then the optimal rate of convergence of the prediction risk N1i=1N(Xi(β^β))2N^{-1}\sum_{i=1}^{N}(X_{i}^{\prime}(\hat{\beta}-\beta))^{2} on the class of vectors β\beta with given ss is of the order s/Ns/N, up to logarithmic factors. This rate is shown to be attained, up to logarithmic factors, for many estimators, such as the BIC, the Lasso, the Dantzig selector, Sparse Exponential Weighting, etc.; cf., for example, Bunea, Tsybakov and Wegkamp (2007), Koltchinskii (2008), Bickel, Ritov and Tsybakov (2009), Dalalyan and Tsybakov (2008). Note that this rate is of the form intrinsicdimensionsamplesize=sN\frac{\mathrm{intrinsic\ dimension}}{\mathrm{sample\ size}}=\frac{s}{N}, up to a logarithmic factor. The general interpretation is therefore completely analogous to that of the matrix case: Assume for simplicity that AA^{*} is a square m×mm\times m matrix with \operatornamerank(A)=r\operatorname{rank}(A^{*})=r. As mentioned above, the intrinsic dimension (the number of parameters to be estimated to recover AA^{*}) is then (2mr)r(2m-r)r, which is of the order rm\sim rm if rmr\ll m. An interesting difference is that the logarithmic risk inflation factor is inevitable in the vector case [cf. Donoho et al. (1992), Foster and George (1994)], but not in the matrix problem, as our results reveal.

This paper is organized as follows. In Section 2, we introduce notation, some definitions, basic facts about the Schatten quasi-norms and define the Schatten-pp estimators. Section 3 describes elementary steps in their convergence analysis and presents two general approaches to upper bounds on the estimation and prediction error (cf. Theorems 1 and 2) depending on the efficient noise level τ\tau. Our main results are stated in Sections 4, 5 (matrix completion), 6 (multi-task learning). They are obtained from Theorems 1 and 2 by specifying the effective noise level τ\tau under particular assumptions on the masks XiX_{i}. Concentration bounds for certain random matrices leading to the expressions for the effective noise level are collected in Section 8. Section 7 is devoted to minimax lower bounds. Sections 9 and 10 contain the main proofs. Finally, in Section 11 we establish bounds for the kkth entropy numbers of the quasi-convex Schatten class embeddings SpMS2MS_{p}^{M}\hookrightarrow S_{2}^{M}, p<1p<1, which are needed for our proofs and are of independent interest.

2 Preliminaries

2.1 Notation, definitions and basic facts

We will write ||2|\cdot|_{2} for the Euclidean norm in d\mathbb{R}^{d} for any integer dd. For any matrix Am×TA\in\mathbb{R}^{m\times T}, we denote by A(j,)A_{(j,\cdot)} for 1jm1\leq j\leq m its jjth row and write A(,k)A_{(\cdot,k)} for its kkth column, 1kT1\leq k\leq T. We denote by σ1(A)σ2(A)0\sigma_{1}(A)\geq\sigma_{2}(A)\geq\cdots\geq 0 the singular values of AA. The (quasi-)norm of some (quasi-) Banach space \mathcal{B} is canonically denoted by \|\cdot\|_{\mathcal{B}}. In particular, for any matrix Am×TA\in\mathbb{R}^{m\times T} and 0<p<0<p<\infty we consider the Schatten (quasi-)norms

ASp=(j=1min(m,T)σj(A)p)1/pandAS=σ1(A).\|A\|_{S_{p}}=\Biggl{(}\sum_{j=1}^{\min(m,T)}\sigma_{j}(A)^{p}\Biggr{)}^{1/p}\quad\mbox{and}\quad\|A\|_{S_{\infty}}=\sigma_{1}(A).

The Schatten spaces SpS_{p} are defined as spaces of all matrices Am×TA\in\mathbb{R}^{m\times T} equipped with quasi-norm ASp\|A\|_{S_{p}}. In particular, the Schatten-2 norm coincides with the Frobenius norm

AS2=\operatornametr(AA)=(i,jaij2)1/2,\|A\|_{S_{2}}=\sqrt{\operatorname{tr}(A^{\prime}A)}=\biggl{(}\sum_{i,j}a_{ij}^{2}\biggr{)}^{1/2},

where aija_{ij} denote the elements of matrix Am×TA\in\mathbb{R}^{m\times T}. Recall that for 0<p<10<p<1 the Schatten spaces SpS_{p} are not normed but only quasi-normed, and Spp\|\cdot\|_{S_{p}}^{p} satisfies the inequality

A+BSppASpp+BSpp\displaystyle\|A+B\|_{S_{p}}^{p}\leq\|A\|_{S_{p}}^{p}+\|B\|_{S_{p}}^{p} (3)

for any 0<p10<p\leq 1 and any two matrices A,Bm×TA,B\in\mathbb{R}^{m\times T}; cf. McCarthy (1967) and Rotfeld (1969). We will use the following well-known trace duality property:

|\operatornametr(AB)|AS1BSA,Bm×T.|\operatorname{tr}(A^{\prime}B)|\leq\|A\|_{S_{1}}\|B\|_{S_{\infty}}\qquad\forall A,B\in\mathbb{R}^{m\times T}.
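For concreteness, a small helper computing Schatten (quasi-)norms from the singular values (illustrative code; the function name is mine):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten-p (quasi-)norm of A; p = np.inf returns the spectral norm sigma_1(A)."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return float(s[0])
    return float((s ** p).sum() ** (1.0 / p))

A = np.diag([3.0, 2.0, 1.0])
# Schatten-1 (nuclear), Schatten-2 (Frobenius) and Schatten-infinity (spectral) norms
print(schatten_norm(A, 1), schatten_norm(A, 2), schatten_norm(A, np.inf))
```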

2.2 Characteristics of the sampling operator

Let \dvtxm×TN\mathcal{L}\dvtx\mathbb{R}^{m\times T}\rightarrow\mathbb{R}^{N} be the sampling operator, that is, the linear mapping defined by

A(\operatornametr(X1A),,\operatornametr(XNA))/N.A\mapsto(\operatorname{tr}(X_{1}^{\prime}A),\ldots,\operatorname{tr}(X_{N}^{\prime}A))/\sqrt{N}.

We have

|(A)|22=N1i=1N\operatornametr2(XiA).|\mathcal{L}(A)|_{2}^{2}=N^{-1}\sum_{i=1}^{N}\operatorname{tr}^{2}(X_{i}^{\prime}A).

Depending on the context, we also write \hat{d}_{2,N}(A,B) for |\mathcal{L}(A-B)|_{2}, where A and B are any matrices in \mathbb{R}^{m\times T}. Unless explicitly stated otherwise, we will tacitly assume that the matrices X_{i} are nonrandom.
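In code, the sampling operator \mathcal{L} and the empirical distance \hat{d}_{2,N} read as follows (a sketch matching the definitions above; the function names are mine):

```python
import numpy as np

def sampling_operator(X, A):
    """L(A) = (tr(X_1'A), ..., tr(X_N'A)) / sqrt(N), with X of shape (N, m, T)."""
    N = X.shape[0]
    return np.einsum('nij,ij->n', X, A) / np.sqrt(N)

def d2N(X, A, B):
    """Empirical prediction distance: d_{2,N}(A, B) = |L(A - B)|_2."""
    return float(np.linalg.norm(sampling_operator(X, A - B)))
```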

We will denote by ϕmax(1)\phi_{\max}(1) the maximal rank-1 restricted eigenvalue of \mathcal{L}:

ϕmax(1)=supAm×T\dvtx\operatornamerank(A)=1|(A)|2AS2.\phi_{\max}(1)=\sup_{A\in\mathbb{R}^{m\times T}\dvtx{\operatorname{rank}}(A)=1}\frac{|\mathcal{L}(A)|_{2}}{\|A\|_{S_{2}}}. (4)

We now introduce two basic assumptions on the sampling operator that will be used in the sequel. The sampling operator \mathcal{L} will be called uniformly bounded if there exists a constant c0<c_{0}<\infty such that

supAm×T{0}|(A)|22AS22c0uniformly in mT and N.\sup_{A\in\mathbb{R}^{m\times T}\setminus\{0\}}\frac{|\mathcal{L}(A)|_{2}^{2}}{\|A\|_{S_{2}}^{2}}\leq c_{0}\qquad\mbox{uniformly in $m$, $T$ and $N$.} (5)

Clearly, if \mathcal{L} is uniformly bounded, then ϕmax2(1)c0\phi_{\max}^{2}(1)\leq c_{0}. Condition (5) is trivially satisfied with c0=1c_{0}=1 for USR matrix completion and with c0=1/Nc_{0}=1/N for CS matrix completion.

The sampling operator \mathcal{L} is said to satisfy the Restricted Isometry condition RI (rr,ν\nu) for some integer 1rmin(m,T)1\leq r\leq\min(m,T) and some 0<ν<0<\nu<\infty if there exists a constant δr(0,1)\delta_{r}\in(0,1) such that

(1δr)AS2ν|(A)|2(1+δr)AS2(1-\delta_{r})\|A\|_{S_{2}}\leq\nu|\mathcal{L}(A)|_{2}\leq(1+\delta_{r})\|A\|_{S_{2}} (6)

for all matrices Am×TA\in\mathbb{R}^{m\times T} of rank at most rr.

The difference between this condition and the Restricted Isometry condition introduced by Candès and Tao (2005) in the vector case, or its analog for the matrix case suggested by Recht, Fazel and Parrilo (2010), is that we state it with a scaling factor \nu. This factor is introduced to account for the fact that the masks X_{i} are typically very sparse, so that they do not induce isometries with coefficient close to one. Indeed, \nu will be large in the examples that we consider below.

2.3 Least squares estimators with Schatten penalty

In this paper, we study the estimators A^\hat{A} defined as a solution of the minimization problem

minAm×T(1Ni=1N(Yi\operatornametr(XiA))2+λASpp)\min_{A\in\mathbb{R}^{m\times T}}\Biggl{(}\frac{1}{N}\sum_{i=1}^{N}\bigl{(}Y_{i}-\operatorname{tr}(X_{i}^{\prime}A)\bigr{)}^{2}+\lambda\|A\|_{S_{p}}^{p}\Biggr{)} (7)

with some fixed 0<p\leq 1 and \lambda>0. The case p=1 (matrix Lasso) is of outstanding interest since the minimization problem is then convex and thus can be efficiently solved in polynomial time. We call \hat{A} the Schatten-p estimator. Such estimators have recently been considered by many authors, motivated by applications to multi-task learning and recommendation systems. Probably the first study is due to Srebro, Rennie and Jaakkola (2005), who dealt with binary classification and considered the Schatten-1 estimator with the hinge loss rather than the squared loss. Argyriou et al. (2008), Argyriou, Evgeniou and Pontil (2008), Argyriou, Micchelli and Pontil (2010), Bach (2008), Abernethy et al. (2009) discussed connections of (7) to other related minimization problems, along with characterizations of the solutions and computational issues, mainly focusing on the convex case p=1. Also for the nonconvex case (0<p<1), Argyriou et al. (2008), Argyriou, Evgeniou and Pontil (2008) suggested an algorithm for approximate computation of the Schatten-p estimator or its analogs. However, for 0<p<1 these methods can find only a local minimum in (7), so that Schatten estimators with such p remain for the moment mainly of theoretical value. In particular, analyzing these estimators reveals which rates of convergence can, in principle, be attained.
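For p=1, the minimization (7) is convex, and one standard way to compute the matrix Lasso is proximal gradient descent, whose proximal step is soft-thresholding of the singular values. The sketch below is such an implementation under my own choices of step size and stopping rule; it is not the algorithm studied in the paper.

```python
import numpy as np

def svt(A, thresh):
    """Singular value soft-thresholding: proximal operator of thresh * ||.||_{S_1}."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def matrix_lasso(X, Y, lam, n_iter=500):
    """Schatten-1 estimator: argmin_A (1/N) sum_i (Y_i - tr(X_i'A))^2 + lam * ||A||_{S_1}."""
    N, m, T = X.shape
    # Step size 1/L, with L the Lipschitz constant of the gradient of the quadratic term
    L = 2.0 * np.linalg.norm(X.reshape(N, -1), ord=2) ** 2 / N
    step = 1.0 / L
    A = np.zeros((m, T))
    for _ in range(n_iter):
        resid = np.einsum('nij,ij->n', X, A) - Y             # tr(X_i'A) - Y_i
        grad = (2.0 / N) * np.einsum('n,nij->ij', resid, X)  # gradient of the data-fit term
        A = svt(A - step * grad, step * lam)                  # proximal (soft-thresholding) step
    return A
```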

The statistical properties of Schatten estimators are not yet well understood. To our knowledge, the only previous study is that of Bach (2008), showing that for p=1, under some condition on the X_{i} [analogous to the strong irrepresentability condition in the vector case; cf. Meinshausen and Bühlmann (2006), Zhao and Yu (2006)], \operatorname{rank}(A^{*}) is consistently recovered by \operatorname{rank}(\hat{A}) when m,T are fixed and N\to\infty. Our results are of a different kind. They are nonasymptotic and meaningful in the case mT\gg N>\max(m,T). Furthermore, we do not consider the recovery of the rank, but rather the estimation and prediction properties of Schatten-p estimators.

After this paper was submitted, we became aware of interesting contemporaneous and independent works by Candès and Plan (2010b), Negahban et al. (2009) and Negahban and Wainwright (2011). Those papers focus on bounds for the Schatten-2 (i.e., Frobenius) norm error of the matrix Lasso estimator under the matrix RI condition. This is related to the particular instance of our results in item (c) above with p=1 and q=2. Their analysis of this case is complementary to ours in several aspects. Negahban and Wainwright (2011) derive their bound under the assumption that X_{i} are matrices with i.i.d. standard Gaussian elements and A^{*} belongs to a Schatten-p^{\prime} ball with 0\leq p^{\prime}\leq 1, which leads to rates different from ours if p^{\prime}\neq 0. An assumption used in this context in Negahban and Wainwright (2011) is that N>mT (in our notation), which excludes the high-dimensional case mT\gg N that we are mainly interested in. Candès and Plan (2010b) consider approximately low-rank matrices, explore the closely related matrix Dantzig selector and provide lower bounds corresponding to a special case of item (d) above. The results of these papers do not cover the matrix completion and multi-task learning problems, which are the main focus of our study. We also mention a more recent work by Bunea, She and Wegkamp (2010) dealing with a special case of our model and analyzing matrix estimators penalized by the rank.

3 Two schemes of analyzing Schatten estimators

In this section, we discuss two schemes of proving upper bounds on the prediction error of A^\hat{A}. The first bound involves only the Schatten-pp norm of matrix AA^{*}. The second involves only the rank of AA^{*} but needs the RI condition on the sampling operator.

We start by sketching elementary steps in the convergence analysis of Schatten-pp estimators. By the definition of A^\hat{A},

1Ni=1N(Yi\operatornametr(XiA^))2+λA^Spp1Ni=1N(Yi\operatornametr(XiA))2+λASpp.\frac{1}{N}\sum_{i=1}^{N}\bigl{(}Y_{i}-\operatorname{tr}(X_{i}^{\prime}\hat{A})\bigr{)}^{2}+\lambda\|\hat{A}\|_{S_{p}}^{p}\leq\frac{1}{N}\sum_{i=1}^{N}\bigl{(}Y_{i}-\operatorname{tr}(X_{i}^{\prime}A^{*})\bigr{)}^{2}+\lambda\|A^{*}\|_{S_{p}}^{p}.

Recalling that Yi=\operatornametr(XiA)+ξiY_{i}=\operatorname{tr}(X_{i}^{\prime}A^{*})+\xi_{i}, we can transform this by a simple algebra to

d^2,N(A^,A)22Ni=1Nξi\operatornametr((A^A)Xi)+λ(ASppA^Spp).\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq\frac{2}{N}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}\bigl{(}(\hat{A}-A^{*})^{\prime}X_{i}\bigr{)}+\lambda(\|A^{*}\|_{S_{p}}^{p}-\|\hat{A}\|_{S_{p}}^{p}). (8)
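Indeed, substituting Y_{i}=\operatorname{tr}(X_{i}^{\prime}A^{*})+\xi_{i} into both sides of the previous inequality and expanding the squares gives

\frac{1}{N}\sum_{i=1}^{N}\bigl(\xi_{i}-\operatorname{tr}\bigl(X_{i}^{\prime}(\hat{A}-A^{*})\bigr)\bigr)^{2}+\lambda\|\hat{A}\|_{S_{p}}^{p}\leq\frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{2}+\lambda\|A^{*}\|_{S_{p}}^{p};

cancelling the common term N^{-1}\sum_{i=1}^{N}\xi_{i}^{2} and recognizing N^{-1}\sum_{i=1}^{N}\operatorname{tr}^{2}(X_{i}^{\prime}(\hat{A}-A^{*}))=\hat{d}_{2,N}(\hat{A},A^{*})^{2} yields (8).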

In the sequel, inequality (8) will be referred to as basic inequality and the random variable N1i=1Nξi\operatornametr((A^A)Xi)N^{-1}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}((\hat{A}-A^{*})^{\prime}X_{i}) will be called the stochastic term. The core in the analysis of Schatten-pp estimators consists in proving tight bounds for the right-hand side of the basic inequality (8). For this purpose, we first need a control of the stochastic term. Section 8 below demonstrates that such a control strongly depends on the properties of \mathcal{L}, that is, of the problem at hand. In summary, Section 8 establishes that, under suitable conditions, for any 0<p10<p\leq 1 the stochastic term can be bounded for all δ>0\delta>0 with probability close to 1 as follows:

\Biggl|\frac{1}{N}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}\bigl(X_{i}^{\prime}(\hat{A}-A^{*})\bigr)\Biggr|\leq\cases{\tau\|\hat{A}-A^{*}\|_{S_{1}},&\quad for $p=1$,\cr\displaystyle\frac{\delta}{2}\hat{d}_{2,N}(\hat{A},A^{*})^{2}+\tau\delta^{p-1}\|\hat{A}-A^{*}\|_{S_{p}}^{p},&\quad for $0<p<1$,} (9)

where 0<\tau<\infty depends on m, T and N. The quantity \tau plays a crucial role in this bound. We will call \tau the effective noise level. Exact expressions for \tau under various assumptions on the sampling operator \mathcal{L} and on the noise \xi_{i} are derived in Section 8. In Table 1, we present the values of \tau for three important examples under the assumption that \xi_{i} are i.i.d. Gaussian \mathcal{N}(0,\sigma^{2}) random variables. In the cases listed in Table 1, inequality (9) holds with probability 1-\varepsilon, where \varepsilon=(1/C)\exp(-C(m+T)) (first and third example) and \varepsilon=(1/C^{\prime})(\max(m,T)+1)^{-C^{\prime}} (second example) with constants C,C^{\prime}>0 independent of N,m,T.

Table 1: Effective noise level for uniformly bounded \mathcal{L}, USR and collaborative sampling matrix completion. Here M=\max(m,T), and the constants c>0, c(p)>0 depend only on \sigma

Assumptions on X_{i} | Assumptions on N,m,T,p | Value of \tau
Uniformly bounded \mathcal{L} | 0<p\leq 1 | c(p)(M/N)^{1-p/2}
USR matrix completion | p=1, (m+T)mT>N | c\min(M/N,(\log M)/\sqrt{N})
CS matrix completion | p=1 | cM^{1/2}/N

The following two points will be important to understand the subsequent results:

  • In this paper, we will always choose the regularization parameter λ\lambda in the form λ=4τ\lambda=4\tau.

  • With this choice of \lambda, the effective noise level \tau characterizes the rate of convergence of the Schatten estimator: the smaller \tau, the faster the rate.

In particular, the first line in Table 1 reveals that when M=max(m,T)<NM=\max(m,T)<N the largest τ\tau corresponds to p=1p=1 and it becomes smaller when pp decreases to 0. This suggests that choosing Schatten-pp estimators with p<1p<1 and especially pp close to 0 might be advantageous. Note that the assumption of uniform boundedness of \mathcal{L} is very mild. For example, it is trivially satisfied with c0=1c_{0}=1 for USR matrix completion and with c0=1/Nc_{0}=1/N for CS matrix completion. However, in these two cases a specific analysis leads to sharper bounds on the effective noise level (i.e., on the rate of convergence of the estimators); cf. the second and third lines of Table 1.

In this section, we provide two bounds on the prediction error of A^\hat{A} with a general effective noise level τ\tau. We then detail them in Sections 46 for particular values of τ\tau depending on the assumptions on the XiX_{i}. The first bound involves the Schatten-pp norm of matrix AA^{*}.

Theorem 1

Let A^{*}\in\mathbb{R}^{m\times T}, and let 0<p\leq 1. Assume that (9) holds with probability at least 1-\varepsilon for some \varepsilon>0 and 0<\tau<\infty. Let \hat{A} be the Schatten-p estimator defined as a minimizer of (7) with \lambda=4\tau. Then

d^2,N(A^,A)216τASpp\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 16\tau\|A^{*}\|_{S_{p}}^{p} (10)

holds with probability at least 1ε1-\varepsilon.

Proof.

From (8) and (9) with \delta=1/2 and \lambda=4\tau, we get

d^2,N(A^,A)28τ(A^ASpp+ASppA^Spp).\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 8\tau(\|\hat{A}-A^{*}\|_{S_{p}}^{p}+\|A^{*}\|_{S_{p}}^{p}-\|\hat{A}\|_{S_{p}}^{p}).

This and the p-norm inequality (3) yield (10).

The bound (10) depends on the magnitude of the elements of AA^{*} via ASp\|A^{*}\|_{S_{p}}. The next theorem shows that under the RI condition this dependence can be avoided, and only the rank of AA^{*} affects the rate of convergence.

Theorem 2

Let A^{*}\in\mathbb{R}^{m\times T} with \operatorname{rank}(A^{*})\leq r, and let 0<p\leq 1. Assume that (9) holds with probability at least 1-\varepsilon for some \varepsilon>0 and 0<\tau<\infty. Assume also that the Restricted Isometry condition RI((2+a)r,\nu) holds with some 0<\nu<\infty, with a sufficiently large a=a(p) depending only on p and with 0<\delta_{(2+a)r}\leq\delta_{0} for a sufficiently small \delta_{0}=\delta_{0}(p) depending only on p.

Let A^\hat{A} be the Schatten-pp estimator defined as a minimizer of (7) with λ=4τ\lambda=4\tau. Then with probability at least 1ε1-\varepsilon we have

d^2,N(A^,A)2\displaystyle\hat{d}_{2,N}(\hat{A},A^{*})^{2} \displaystyle\leq C1rτ2/(2p)ν2p/(2p),\displaystyle C_{1}r\tau^{2/(2-p)}\nu^{2p/(2-p)}, (11)
A^ASqq\displaystyle\|\hat{A}-A^{*}\|_{S_{q}}^{q} \displaystyle\leq C2rτq/(2p)ν2q/(2p)q[p,2],\displaystyle C_{2}r\tau^{q/(2-p)}\nu^{2q/(2-p)}\qquad\forall q\in[p,2], (12)

where C1C_{1} and C2C_{2} are constants, C1C_{1} depends only on pp and C2C_{2} depends on pp and qq.

Proof of Theorem 2 is given in Section 9. The values a=a(p)a=a(p) and δ0(p)\delta_{0}(p) can be deduced from the proof. In particular, for p=1p=1 it is sufficient to take a=19a=19.

Remark 1.

Note that if ν=1\nu=1 the rates in (11) and (12) do not depend on pp if we assume in addition the uniform boundedness of \mathcal{L}, which is a very mild condition. Indeed, taking the value of τ\tau from the first line of Table 1 we see that rτ2/(2p)ν2p/(2p)rM/Nr\tau^{2/(2-p)}\nu^{2p/(2-p)}\sim rM/N for all 0<p10<p\leq 1. Thus, under the RI condition, using Schatten-pp estimators with p<1p<1 does not improve the rate of convergence on the class of matrices AA^{*} of rank at most rr.

Discussion of the scaling factor \nu. Remark 1 deals with the case \nu=1, which is not always appropriate for trace regression models. To our knowledge, the only available examples of matrices X such that the sampling operator \mathcal{L} satisfies the RI condition with \nu=1 are complete matrices, that is, matrices with all entries nonzero, which are random and have specific distributions [typically, i.i.d. Rademacher or Gaussian entries; cf. Recht, Fazel and Parrilo (2010)]. Except for degenerate cases [such as N=mT, the X_{i} distinct and of the form \sqrt{N}e_{k}(m)e_{l}(T)^{\prime} for 1\leq k\leq m, 1\leq l\leq T], the sampling operator \mathcal{L} typically defines a restricted isometry with \nu=1 only if the matrices X_{i} contain a considerable number of (uniformly bounded) nonzero entries.

Let us now specify the form of the RI condition in the context of multi-task learning discussed in the Introduction. Using (2) for a matrix A=(a1aT)A=(a_{1}\cdots a_{T}), we obtain

|(A)|22\displaystyle|\mathcal{L}(A)|_{2}^{2} =\displaystyle= N1i=1N\operatornametr2(XiA)\displaystyle N^{-1}\sum_{i=1}^{N}\operatorname{tr}^{2}(X_{i}^{\prime}A)
=\displaystyle= N1t=1Ts=1nat𝐱(t,s)(𝐱(t,s))at=T1t=1TatΨtat,\displaystyle N^{-1}\sum_{t=1}^{T}\sum_{s=1}^{n}a_{t}^{\prime}\mathbf{x}^{(t,s)}\bigl{(}\mathbf{x}^{(t,s)}\bigr{)}^{\prime}a_{t}=T^{-1}\sum_{t=1}^{T}a_{t}^{\prime}\Psi_{t}a_{t},

where \Psi_{t}=n^{-1}\sum_{s=1}^{n}\mathbf{x}^{(t,s)}(\mathbf{x}^{(t,s)})^{\prime} is the Gram matrix of predictors for the tth task. These matrices correspond to T separate regression models. The standard assumption is that they are normalized so that all the diagonal elements of each \Psi_{t} are equal to 1. This suggests that the natural RI scaling factor \nu for such a model is of the order \nu\sim\sqrt{T}. For example, in the simplest case when all the matrices \Psi_{t} are equal to the m\times m identity matrix, we find |\mathcal{L}(A)|_{2}^{2}=T^{-1}\sum_{t=1}^{T}a_{t}^{\prime}\Psi_{t}a_{t}=T^{-1}\|A\|_{S_{2}}^{2}. Similarly, we get the RI condition with scaling factor \nu\sim\sqrt{T} when the spectra of all the Gram matrices \Psi_{t}, t=1,\dots,T, are included in a fixed interval [a,b] with 0<a<b<\infty. However, this excludes high-dimensional task regressions, in which the number of parameters m is larger than the sample size, m>n. In conclusion, application of the matrix RI techniques in multi-task learning is restricted to low-dimensional regressions and the scaling factor is \nu\sim\sqrt{T}.

The reason for the failure of the RI approach is that the masks X_{i} are sparse: the sparser the X_{i}, the larger \nu. The extreme situation corresponds to matrix completion problems. Indeed, if N<mT, then some entry (k,l) is never sampled, so the rank-one matrix e_{k}(m)e_{l}^{\prime}(T) lies in the null space of the sampling operator \mathcal{L} and hence the RI condition cannot be satisfied. For N\geq mT we can have the RI condition with scaling factor \nu\sim\sqrt{mT}, but N\geq mT means that essentially all the entries are observed, so that the very problem of completion does not arise.

4 Upper bounds under mild conditions on the sampling operator

The above discussion suggests that Theorem 2 and, more generally, arguments based on the restricted isometry or related conditions are not well adapted to several interesting settings. Motivated by this, we propose another approach, described in the next theorem, which requires only the comparatively mild uniform boundedness condition (5). For simplicity, we focus on Gaussian errors \xi_{i}. Set M=\max(m,T).

Theorem 3

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables. Assume that M>1M>1, N>eMN>{\rm{e}}M and that the uniform boundedness condition (5) is satisfied. Let Am×TA^{*}\in\mathbb{R}^{m\times T} with \operatornamerank(A)r\operatorname{rank}(A^{*})\leq r and the maximal singular value σ1(A)(N/M)C\sigma_{1}(A^{*})\leq(N/M)^{C^{*}} for some 0<C<0<C^{*}<\infty. Set p=(log(N/M))1p=(\log(N/M))^{-1}, cκ=(2κ1)(2κ)κ1/(2κ1)c_{\kappa}=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)} where κ=(2p)/(22p)\kappa=(2-p)/(2-2p) and

λ=4cκ(ϑ/p)1p/2(MN)1p/2\lambda=4c_{\kappa}(\vartheta/p)^{1-p/2}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2} (13)

for some \vartheta\geq C^{2}, where C is a universal positive constant independent of r, M and N. Then the Schatten-p estimator \hat{A} defined as a minimizer of (7) with \lambda as in (13) satisfies

d^2,N(A^,A)2C3ϑrMNlog(NM)\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C_{3}\vartheta\frac{rM}{N}\log\biggl{(}\frac{N}{M}\biggr{)} (14)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}) where the positive constant C3C_{3} is independent of rr, MM and NN.

Proof.

Inequality (9) holds with probability at least 1-C\exp(-\vartheta M/C^{2}) by Lemma 5. We then use (10) and note that, under our choice of p, \tau\leq c\vartheta M/(Np) for some constant c<\infty, which does not depend on M and N, and

ASppr[σ1(A)]pr(NM)Cp=exp(C)r.\|A^{*}\|_{S_{p}}^{p}\leq r[\sigma_{1}(A^{*})]^{p}\leq r\biggl{(}\frac{N}{M}\biggr{)}^{C^{*}p}=\exp(C^{*})r.
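As a small numerical illustration of the tuning (13) (the dimensions and the value of \vartheta below are arbitrary; the theorem only requires \vartheta\geq C^{2}):

```python
import numpy as np

N, m, T = 10_000, 50, 40
M = max(m, T)
vartheta = 1.0                      # illustrative value; Theorem 3 requires vartheta >= C^2

p = 1.0 / np.log(N / M)             # p = (log(N/M))^{-1}
kappa = (2 - p) / (2 - 2 * p)
c_kappa = (2 * kappa - 1) * (2 * kappa) * kappa ** (-1.0 / (2 * kappa - 1))
lam = 4 * c_kappa * (vartheta / p) ** (1 - p / 2) * (M / N) ** (1 - p / 2)
print(f"p = {p:.4f}, lambda = {lam:.4f}")
```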

Finally, we give the following theorem quantifying the rates of convergence of the prediction risk in terms of the Schatten norms of AA^{*}. Its proof is straightforward in view of Theorem 1 and Lemmas 2 and 5.

Theorem 4

Let \xi_{1},\dots,\xi_{N} be i.i.d. \mathcal{N}(0,\sigma^{2}) random variables and A^{*}\in\mathbb{R}^{m\times T}. Then the Schatten-p estimator \hat{A} has the following properties:

(i) Let p=1 and \lambda=32\sigma\phi_{\max}(1)\sqrt{(m+T)/N}. Then

d^2,N(A^,A)2Cσϕmax(1)AS1m+TN\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C\sigma\phi_{\max}(1)\|A^{*}\|_{S_{1}}\sqrt{\frac{m+T}{N}} (15)

with probability at least 12exp{(2log5)(m+T)}1-2\exp\{-(2-\log 5)(m+T)\} where C>0C>0 is an absolute constant.

(ii) Let 0<p<1 and let the uniform boundedness condition (5) hold. Set \lambda as in (13). Then

d^2,N(A^,A)2CASpp(MN)1p/2\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C\|A^{*}\|_{S_{p}}^{p}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2} (16)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}) where M=max(m,T)M=\max(m,T) and the constant C>0C>0 is independent of rr, MM and NN.

In Theorem 5 below we show that these rates are optimal in a minimax sense on the corresponding Schatten-pp balls for the sampling operators satisfying the RI condition.

5 Upper bounds for noisy matrix completion

As discussed in Section 3, for matrix completion problems the restricted isometry argument as in Theorem 2 is not applicable. We will therefore use Theorems 1 and 3. First, combining Theorem 1 with Lemma 3 of Section 8 we get the following corollary.

Corollary 1 ((USR matrix completion))

(i) Let the i.i.d. zero-mean random variables \xi_{i} satisfy the Bernstein condition (34). Assume that mT(m+T)>N and consider the USR matrix completion model. Let \tau_{2} be given by (42) with some D\geq 2. Then the Schatten-1 estimator \hat{A} defined with \lambda=4\tau_{2} satisfies

d^2,N(A^,A)216C¯AS1m+TN\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 16{\bar{C}}\|A^{*}\|_{S_{1}}\frac{m+T}{N} (17)

with probability at least 14exp{(2log5)(m+T)}1-4\exp\{-(2-\log 5)(m+T)\}, where C¯=4σ10D+8HD{\bar{C}}=4\sigma\sqrt{10D}+8HD.

(ii) Let the i.i.d. zero-mean random variables ξi\xi_{i} satisfy the light tail condition (35), and let τ3\tau_{3} be given by (43) for some B>0B>0. Then the Schatten-11 estimator A^\hat{A} defined with λ=4τ3\lambda=4\tau_{3} satisfies

d^2,N(A^,A)216AS1Bσlog(max(m+1,T+1))N\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 16\|A^{*}\|_{S_{1}}\sqrt{B}\frac{\sigma\log(\max(m+1,T+1))}{\sqrt{N}} (18)

with probability at least 1(1/C)max(m+1,T+1)CB1-(1/C){\max(m+1,T+1)}^{-CB} for some constant C>0C>0 which does not depend on m,Tm,T and NN.

Next, combining Theorem 3 with Lemma 5 of Section 8 we get the following corollary.

Corollary 2 ((USR matrix completion, nonconvex penalty))

Let \xi_{1},\dots,\xi_{N} be i.i.d. \mathcal{N}(0,\sigma^{2}) random variables. Assume that M=\max(m,T)>1, N>{\rm e}M and consider the USR matrix completion model. Let A^{*}\in\mathbb{R}^{m\times T} with \operatorname{rank}(A^{*})\leq r and the maximal singular value \sigma_{1}(A^{*})\leq(N/M)^{C^{*}} for some 0<C^{*}<\infty. Set p=(\log(N/M))^{-1}, c_{\kappa}=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)} where \kappa=(2-p)/(2-2p) and

λ=4cκ(ϑ/p)1p/2(MN)1p/2\lambda=4c_{\kappa}(\vartheta/p)^{1-p/2}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2}

for some ϑC2\vartheta\geq C^{2} with a universal constant C>0C>0, independent of rr, MM and NN. Then the Schatten-pp estimator A^\hat{A} defined as a minimizer of (7) satisfies

d^2,N(A^,A)2C3ϑrMNlog(NM)\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C_{3}\vartheta\frac{rM}{N}\log\biggl{(}\frac{N}{M}\biggr{)} (19)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}), where the positive constant C3C_{3} is also independent of rr, MM and NN.

Note that the bounds of Corollaries 1(i) and 2 achieve the rate r\max(m,T)/N, up to logarithmic factors, under different conditions on the maximal singular value of A^{*}. If \max(m,T)<N<mT, then the condition of Corollary 2 requires no more than polynomial growth of \sigma_{1}(A^{*}) in \max(m,T), which is a mild assumption. On the other hand, (17) requires uniform boundedness of \sigma_{1}(A^{*}) by some constant to achieve the same rate. However, the estimators of Corollary 2 correspond to a nonconvex penalty and are computationally hard.

We now turn to the collaborative sampling matrix completion. The next corollary follows from combination of Theorem 1 with Lemmas 3 and 4 of Section 8.

Corollary 3 ((Collaborative sampling))

Consider the problem of matrix completion with collaborative sampling.

(i) Let \xi_{1},\ldots,\xi_{N} be i.i.d. \mathcal{N}(0,\sigma^{2}) random variables. Let \tau_{4} be given by (44) with some D\geq 2. Then the Schatten-1 estimator \hat{A} defined with \lambda=4\tau_{4} satisfies

d^2,N(A^,A)216C¯AS1m+TN\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 16{\bar{C}}\|A^{*}\|_{S_{1}}\frac{\sqrt{m+T}}{N} (20)

with probability at least 12exp{(Dlog5)(m+T)}1-2\exp\{-(D-\log 5)(m+T)\}, where C¯=8σD{\bar{C}}=8\sigma\sqrt{D}.

(ii) Let \xi_{1},\ldots,\xi_{N} be i.i.d. zero-mean random variables satisfying the Bernstein condition (34). Let \tau_{5} be given by (45) with some D\geq 2. Then the Schatten-1 estimator \hat{A} defined with \lambda=4\tau_{5} satisfies

d^2,N(A^,A)264AS1σ2D(m+T)+2HD(m+T)N\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 64\|A^{*}\|_{S_{1}}\frac{\sigma\sqrt{2D(m+T)}+2HD(m+T)}{N} (21)

with probability at least 12exp{(Dlog5)(m+T)}1-2\exp\{-(D-\log 5)(m+T)\}.

Remark 2.

Using the inequality \|A\|_{S_{1}}\leq\sqrt{r}\|A\|_{S_{2}} for matrices A of rank at most r, we find that the bound (20) is minimax optimal on the class of matrices

{Am×T\dvtx\operatornamerank(A)r,AS22Cσ2rmax(m,T)}\{A\in\mathbb{R}^{m\times T}\dvtx\operatorname{rank}(A)\leq r,\|A\|_{S_{2}}^{2}\leq C\sigma^{2}r\max(m,T)\}

for some constant C>0, if the masks X_{1},\dots,X_{N} fulfill the dispersion condition of Theorem 7 below. It is further interesting to note that the construction in the proof of the lower bound in Theorem 7 fails if the restriction is \|A\|_{S_{2}}^{2}\leq\delta^{2} with \delta^{2} of smaller order than r\max(m,T).

6 Upper bounds for multi-task learning

For multi-task learning, we can apply both Theorems 2 and 3. Theorem 2 imposes a strong assumption on the masks XiX_{i}, namely the RI condition. Nevertheless, the advantage is that Theorem 2 covers the computationally easy case p=1p=1.

Corollary 4 ((Multi-task learning; RI condition))

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables. Consider the multi-task learning problem with \operatornamerank(A)r\operatorname{rank}(A^{*})\leq r. Assume that the spectra of the Gram matrices Ψt\Psi_{t} are uniformly in tt bounded from above by a constant c1<c_{1}<\infty. Assume also that the Restricted Isometry condition RI (21r21r, ν\nu) holds with some 0<ν<0<\nu<\infty and with 0<δ21rδ00<\delta_{21r}\leq\delta_{0} for a sufficiently small δ0\delta_{0}. Set

λ=32σc1(m+T)nT2.\lambda=32\sigma\sqrt{\frac{c_{1}(m+T)}{nT^{2}}}.

Let A^\hat{A} be the Schatten-11 estimator with this parameter λ\lambda. Then with probability at least 12exp{(2log5)(m+T)}1-2\exp\{-(2-\log 5)(m+T)\} we have

d^2,N(A^,A)2\displaystyle\hat{d}_{2,N}(\hat{A},A^{*})^{2} \displaystyle\leq C¯1c1σ2rν2(m+TnT2),\displaystyle{\bar{C}}_{1}c_{1}\sigma^{2}r\nu^{2}\biggl{(}\frac{m+T}{nT^{2}}\biggr{)},
A^ASqq\displaystyle\|\hat{A}-A^{*}\|_{S_{q}}^{q} \displaystyle\leq C¯2c1q/2σqrν2q(m+TnT2)q/2q[1,2],\displaystyle{\bar{C}}_{2}c_{1}^{q/2}\sigma^{q}r\nu^{2q}\biggl{(}\frac{m+T}{nT^{2}}\biggr{)}^{q/2}\qquad\forall q\in[1,2],

where C¯1{\bar{C}}_{1} is an absolute constant and C¯2{\bar{C}}_{2} depends only on qq.

The proof of Corollary 4 is straightforward in view of Theorem 2, Lemma 2 and the fact that, under the premise of Corollary 4, we have |(A)|22=T1t=1TatΨtat(c1/T)AS22|\mathcal{L}(A)|_{2}^{2}=T^{-1}\sum_{t=1}^{T}a_{t}^{\prime}\Psi_{t}a_{t}\leq(c_{1}/T)\|A\|_{S_{2}}^{2} for all matrices Am×TA\in\mathbb{R}^{m\times T}, so that the sampling operator is uniformly bounded [(5) holds with c0=c1/Tc_{0}=c_{1}/T], and thus ϕmax(1)c0c1/T\phi_{\max}(1)\leq\sqrt{c_{0}}\leq\sqrt{c_{1}/T}.

Taking in the bounds of Corollary 4 the natural scaling factor νT\nu\sim\sqrt{T}, we obtain the following inequalities:

d^2,N(A^,A)2\displaystyle\hat{d}_{2,N}(\hat{A},A^{*})^{2} \displaystyle\leq C~1r(m+T)nT,\displaystyle{\tilde{C}}_{1}\frac{r(m+T)}{nT}, (22)
1TA^AS22\displaystyle\frac{1}{T}\|\hat{A}-A^{*}\|_{S_{2}}^{2} \displaystyle\leq C~2r(m+T)nT,\displaystyle{\tilde{C}}_{2}\frac{r(m+T)}{nT}, (23)

where the constants C~1{\tilde{C}}_{1} and C~2{\tilde{C}}_{2} do not depend on m,Tm,T and nn.

A remarkable fact is that the rates in Corollary 4 are free of a logarithmic inflation factor. This is one of the differences between matrix estimation problems and vector estimation problems, where the logarithmic risk inflation is inevitable, as first noticed by Donoho et al. (1992), Foster and George (1994). For more details about optimal rates of sparse estimation in the vector case, see Rigollet and Tsybakov (2010).

Since the Group Lasso is a special case of nuclear norm penalized minimization on block-diagonal matrices [cf., e.g., Bach (2008)], Corollary 4 and the bounds (22), (23) imply the corresponding bounds for the Group Lasso under the low-rank assumption. To see the difference from previous results for the Group Lasso, we consider, for example, those obtained in the multi-task setting by Lounici et al. (2009, 2010). The main difference is that the sparsity index s appearing in Lounici et al. (2009, 2010) is now replaced by r. In Lounici et al. (2009, 2010), the columns a^{*}_{t} of A^{*} are supposed to be sparse, with the sets of nonzero elements of cardinality not more than s, whereas here the sparsity is characterized by the rank r of A^{*}.

Finally, we give the following result based on application of Theorem 3.

Corollary 5 ((Multi-task learning; uniformly bounded \mathcal{L}))

Let \xi_{1},\dots,\xi_{N} be i.i.d. \mathcal{N}(0,\sigma^{2}) random variables, and assume that n>{\rm e}. Consider the multi-task learning problem with A^{*}\in\mathbb{R}^{m\times T}, \operatorname{rank}(A^{*})\leq r, such that the maximal singular value \sigma_{1}(A^{*})\leq n^{C^{*}} for some 0<C^{*}<\infty. Assume that the spectra of the Gram matrices \Psi_{t} are bounded from above, uniformly in t, by c_{0}T where c_{0}<\infty is a constant. Set p=(\log n)^{-1}, c_{\kappa}=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)} where \kappa=(2-p)/(2-2p) and

λ=4cκ(ϑ/p)1p/2(1n)1p/2\lambda=4c_{\kappa}(\vartheta/p)^{1-p/2}\biggl{(}\frac{1}{n}\biggr{)}^{1-p/2}

for some ϑC2\vartheta\geq C^{2} and a universal constant C>0C>0, independent of rr, mm and nn. Then the Schatten-pp estimator A^\hat{A} with this parameter λ\lambda satisfies

d^2,N(A^,A)2C3ϑrMnTlogn\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C_{3}\vartheta\frac{rM}{nT}\log n (24)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}) where M=max(m,T)M=\max(m,T), and the positive constant C3C_{3} is independent of rr, mm and nn.

Corollary 5 follows from Theorem 3. Indeed, it suffices to remark that, under the premises of Corollary 5, we have |(A)|22=T1t=1TatΨtatc0AS22|\mathcal{L}(A)|_{2}^{2}=T^{-1}\sum_{t=1}^{T}a_{t}^{\prime}\Psi_{t}a_{t}\leq c_{0}\|A\|_{S_{2}}^{2} for all matrices Am×TA\in\mathbb{R}^{m\times T}, so that the sampling operator is uniformly bounded; cf. (5).

For m=Tm=T, we can write (24) in the form

d^2,N(A^,A)2C3rmnTlogn.\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq C_{3}^{\prime}\frac{rm}{nT}\log n. (25)

Clearly, this bound achieves the optimal rate “intrinsic dimension/sample size” rm/N\sim rm/N, up to logarithms (recall that N=nTN=nT in the multi-task learning). The bounds (22) and (23) achieve this rate in a more precise sense because they are free of extra logarithmic factors.

Another remark concerns the possible range of mm. It follows from the discussion in Section 3 that the “dimension larger than the sample size” framework is not covered by Corollary 4 since this corollary relies on the RI condition. In contrast, the bounds of Corollary 5 make sense when the dimension mm is larger than the sample size nn of each task; we only need to have mexp(n)m\ll\exp(n) for Corollary 5 to be meaningful. Corollary 5 holds when the RI assumption is violated and under a mild condition on the masks XiX_{i}. The price to pay is to assume that the singular values of AA^{*} do not grow exponentially fast. Also, the estimator of Corollary 5 corresponds to p<1p<1, so it is computationally hard.

7 Minimax lower bounds

In this section, we derive lower bounds for the prediction error which show that the upper bounds that we have proved are optimal in a minimax sense for two scenarios: (i) under the RI condition and (ii) for matrix completion with collaborative sampling. We also provide a lower bound for USR matrix completion. Under the RI condition with \nu=1, minimax lower bounds for the Frobenius norm \|\hat{A}-A^{*}\|_{S_{2}} on “Schatten-0” balls \{A^{*}\in\mathbb{R}^{m\times T}\dvtx\operatorname{rank}(A^{*})\leq r\} are derived in Candès and Plan (2010b) with a technique different from ours, which does not allow one to include further boundedness constraints on A^{*} in addition to the constraint that its rank is at most r. Specifically, they prove their lower bound by passage to the Bayes risk with a prior of unbounded support (Gaussian prior). Our lower bounds are more general in the sense that they are obtained on smaller sets, namely, the intersections of Schatten-0 and Schatten-p balls. This is similar in spirit to Rigollet and Tsybakov (2010), who establish minimax lower bounds on the intersection of \ell_{0} and \ell_{1} balls in the vector sparsity scenario. In what follows, we denote by \inf_{\hat{A}} the infimum over all estimators based on (X_{1},Y_{1}),\ldots,(X_{N},Y_{N}), and for any A\in\mathbb{R}^{m\times T} we denote by \mathbb{P}_{A} the probability distribution of (Y_{1},\dots,Y_{N}) satisfying (1) with A^{*}=A.

Theorem 5 ((Lower bound—Restricted Isometry))

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables for some σ2>0\sigma^{2}>0. Let M=max(m,T)8M=\max(m,T)\geq 8, r1r\geq 1, min(T,m)r\min(T,m)\geq r and for 0<α<1/80<\alpha<1/8 define

ψM,N,r,Δ\displaystyle\psi_{M,N,r,\Delta} =\displaystyle= min(rMN,Δp(MN)1p/2,Δ2),\displaystyle\min\biggl{(}\frac{rM}{N},\Delta^{p}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2},\Delta^{2}\biggr{)},
C(α)\displaystyle C(\alpha) =\displaystyle= α(1δr)222/p(1+δr)2log2128andC(α,ν)=α22/p(1+δr)2log2128ν2.\displaystyle\frac{\alpha(1-\delta_{r})^{2}}{2^{2/p}(1+\delta_{r})^{2}}\frac{\log 2}{128}\quad\mbox{and}\quad C(\alpha,\nu)=\frac{\alpha}{2^{2/p}(1+\delta_{r})^{2}}\frac{\log 2}{128}\nu^{2}.
(i) Assume that the sampling operator \mathcal{L} satisfies the right-hand side inequality in the RI(r,\nu)-condition (6) for some \delta_{r}\in(0,1). Then for any p\in(0,2], \Delta>0, 0<\alpha<1/8,

infA^supAm×T:\operatornamerank(A)r,ASpΔνσA(A^AS22>C(α,ν)σ2ψM,N,r,Δ)β,\qquad\inf_{\hat{A}}\mathop{\sup_{A^{*}\in\mathbb{R}^{m\times T}:}}_{\operatorname{rank}(A^{*})\leq r,\|A^{*}\|_{S_{p}}\leq\Delta\nu\sigma}\mathbb{P}_{A^{*}}\bigl{(}\|\hat{A}-A^{*}\|_{S_{2}}^{2}>C(\alpha,\nu)\sigma^{2}\psi_{M,N,r,\Delta}\bigr{)}\geq\beta, (26)

where β=β(M,α)>0\beta=\beta(M,\alpha)>0 is such that β(M,α)1\beta(M,\alpha)\rightarrow 1 as M,α0M\rightarrow\infty,\alpha\to 0.

(ii) Assume that the sampling operator \mathcal{L} satisfies the RI(r,\nu)-condition (6) for some \delta_{r}\in(0,1). Then for any p\in(0,2], \Delta>0, 0<\alpha<1/8, with \beta as in (26),

infA^supAm×T:\operatornamerank(A)r,ASpΔνσA(d^2,N(A^,A)2>C(α)σ2ψM,N,r,Δ)β.\qquad\inf_{\hat{A}}\mathop{\sup_{A^{*}\in\mathbb{R}^{m\times T}:}}_{\operatorname{rank}(A^{*})\leq r,\|A^{*}\|_{S_{p}}\leq\Delta\nu\sigma}\mathbb{P}_{A^{*}}\bigl{(}\hat{d}_{2,N}(\hat{A},A^{*})^{2}>C(\alpha)\sigma^{2}\psi_{M,N,r,\Delta}\bigr{)}\geq\beta. (27)
Remark 3.

It is worth noting that C(\alpha) and \beta(M,\alpha) do not depend on the constant \nu of the RI condition.

Proof of Theorem 5. Without loss of generality, we assume that M=m\geq T. For a constant \gamma>0 and an integer s\in\{1,2,\ldots,r\}, both to be specified later, define

𝒜s,γ={A=(aij)m×T\dvtxaij{0,γν/N} if 1js;aij=0 otherwise}.\mathcal{A}_{s,\gamma}=\bigl{\{}A=(a_{ij})\in\mathbb{R}^{m\times T}\dvtx a_{ij}\in\bigl{\{}0,\gamma\nu/\sqrt{N}\bigr{\}}\mbox{ if }1\leq j\leq s;a_{ij}=0\mbox{ otherwise}\bigr{\}}.

By construction, any element of 𝒜s,γ\mathcal{A}_{s,\gamma} as well as the difference of any two elements of 𝒜s,γ\mathcal{A}_{s,\gamma} has rank at most ss. Due to the Varshamov–Gilbert bound [cf. Lemma 2.9 in Tsybakov (2009)], there exists a subset 𝒜s,γ0𝒜s,γ\mathcal{A}^{0}_{s,\gamma}\subset\mathcal{A}_{s,\gamma} of cardinality \operatornameCard(𝒜s,γ0)2sm/8\operatorname{Card}(\mathcal{A}_{s,\gamma}^{0})\geq 2^{sm/8} containing A0=0A_{0}=0 such that for any two distinct elements A1A_{1} and A2A_{2} of 𝒜s,γ0\mathcal{A}_{s,\gamma}^{0},

d^2,N(A1,A2)2ν2(1δr)2A1A2S22(1δr)2γ28sMN,\hat{d}_{2,N}(A_{1},A_{2})^{2}\geq\nu^{-2}(1-\delta_{r})^{2}\|A_{1}-A_{2}\|_{S_{2}}^{2}\geq(1-\delta_{r})^{2}\frac{\gamma^{2}}{8}\frac{sM}{N}, (28)

where the first inequality follows from the left-hand side inequality in the RI condition (6) and is only used to prove (27). We will prove (27); the proof of (26) is analogous in view of the second inequality in (28).

Then, for any A1𝒜s,γ0A_{1}\in\mathcal{A}_{s,\gamma}^{0}, the Kullback–Leibler divergence K(A0,A1)K(\mathbb{P}_{A_{0}},\mathbb{P}_{A_{1}}) between A0\mathbb{P}_{A_{0}} and A1\mathbb{P}_{A_{1}} satisfies

K(A0,A1)=N2σ2d^2,N(A0,A1)2γ22σ2(1+δr)2sM,K(\mathbb{P}_{A_{0}},\mathbb{P}_{A_{1}})=\frac{N}{2\sigma^{2}}\hat{d}_{2,N}(A_{0},A_{1})^{2}\leq\frac{\gamma^{2}}{2\sigma^{2}}(1+\delta_{r})^{2}sM, (29)

where we used again the RI condition. We now apply Theorem 2.5 in Tsybakov (2009). Fix some α(0,1/8)\alpha\in(0,1/8). Note that the condition

1\operatornameCard(𝒜s,γ0)1A𝒜s,γ0K(A,A0)αlog(\operatornameCard(𝒜s,γ0)1)\frac{1}{\operatorname{Card}(\mathcal{A}_{s,\gamma}^{0})-1}\sum_{A\in\mathcal{A}_{s,\gamma}^{0}}K(\mathbb{P}_{A},\mathbb{P}_{A_{0}})\leq\alpha\log\bigl{(}\operatorname{Card}(\mathcal{A}_{s,\gamma}^{0})-1\bigr{)} (30)

is satisfied for γ2ασ2(log2)/(4(1+δr)2)\gamma^{2}\leq\alpha\sigma^{2}(\log 2)/(4(1+\delta_{r})^{2}). Define

rΔ=\operatornameargmin{l\dvtxΔpl(MN)p/2},r_{\Delta}=\operatorname{arg\,min}\biggl{\{}l\in\mathbb{N}\dvtx\Delta^{p}\leq l\biggl{(}\frac{M}{N}\biggr{)}^{p/2}\biggr{\}},

and consider separately the following three cases.

The case rΔ=1r_{\Delta}=1. In this case, ψM,N,r,Δ=Δ2\psi_{M,N,r,\Delta}=\Delta^{2} for any r1r\geq 1, and Δ2N/M1\Delta^{2}N/M\leq 1. Set

s1=1andγ1=(α(1+δr)2log24σ2Δ2NM)1/2.s_{1}=1\quad\mbox{and}\quad\gamma_{1}=\biggl{(}\frac{\alpha}{(1+\delta_{r})^{2}}\frac{\log 2}{4}\sigma^{2}\Delta^{2}\frac{N}{M}\biggr{)}^{1/2}.

Then ASpAS2M/Nνγ1Δνσ\|A\|_{S_{p}}\leq\|A\|_{S_{2}}\leq\sqrt{M/N}\nu\gamma_{1}\leq\Delta\nu\sigma for all A𝒜1,γ1A\in\mathcal{A}_{1,\gamma_{1}}, i.e., 𝒜1,γ1\mathcal{A}_{1,\gamma_{1}} is contained in the set

{Am×T\dvtx\operatornamerank(A)r,ASpΔνσ}.\{A\in\mathbb{R}^{m\times T}\dvtx\operatorname{rank}(A)\leq r,\|A\|_{S_{p}}\leq\Delta\nu\sigma\}.

Now, inequality (28) shows that d^2,N(A1,A2)24C(α)σ2Δ2\hat{d}_{2,N}(A_{1},A_{2})^{2}\geq 4C(\alpha)\sigma^{2}\Delta^{2} for any two distinct elements A1,A2𝒜1,γ10A_{1},A_{2}\in\mathcal{A}_{1,\gamma_{1}}^{0}, while Δ2N/M1\Delta^{2}N/M\leq 1 implies that γ12ασ2(log2)/(4(1+δr)2)\gamma_{1}^{2}\leq\break\alpha\sigma^{2}(\log 2)/(4(1+\delta_{r})^{2}). Hence, condition (30) is satisfied.

The case 2rΔr2\!\leq\!r_{\Delta}\!\leq\!r. In this case, the rate ψM,N,r,Δ\psi_{M,N,r,\Delta} is equal to Δp(M/N)1p/2\Delta^{p}(M/N)^{1-p/2}. We consider the set 𝒜rΔ,γ20\mathcal{A}_{r_{\Delta},\gamma_{2}}^{0} with some γ2\gamma_{2} to be specified below. For A𝒜rΔ,γ20A\in\mathcal{A}_{r_{\Delta},\gamma_{2}}^{0}, we have ASp2rΔ(2p)/pAS22rΔ2/pγ22ν2M/N\|A\|_{S_{p}}^{2}\leq r_{\Delta}^{(2-p)/p}\|A\|_{S_{2}}^{2}\leq r_{\Delta}^{2/p}\gamma_{2}^{2}\nu^{2}M/N. Since also rΔ2Δp(N/M)p/2r_{\Delta}\leq 2\Delta^{p}(N/M)^{p/2} when rΔ2r_{\Delta}\geq 2, it follows that ASpΔνσ\|A\|_{S_{p}}\leq\Delta\nu\sigma whenever

γ221/pσ.\gamma_{2}\leq 2^{-1/p}\sigma. (31)

Now define

s2=rΔandγ2=21/p(α(1+δr)2log24σ2)1/2.s_{2}=r_{\Delta}\quad\mbox{and}\quad\gamma_{2}=2^{-1/p}\biggl{(}\frac{\alpha}{(1+\delta_{r})^{2}}\frac{\log 2}{4}\sigma^{2}\biggr{)}^{1/2}.

Then (30) is satisfied and γ2\gamma_{2} also fulfills the constraint (31), since α<1/8\alpha<1/8 and (log2)/4<1(\log 2)/4<1. Thus, 𝒜rΔ,γ20\mathcal{A}_{r_{\Delta},\gamma_{2}}^{0} is a subset of matrices Am×TA\in\mathbb{R}^{m\times T} with \operatornamerank(A)r\operatorname{rank}(A)\leq r and ASpΔνσ\|A\|_{S_{p}}\leq\Delta\nu\sigma.

d^2,N(A1,A2)2\displaystyle\hat{d}_{2,N}(A_{1},A_{2})^{2} \displaystyle\geq (1δr)2γ228rΔMN(1δr)2γ228Δp(MN)1p/2\displaystyle(1-\delta_{r})^{2}\frac{\gamma_{2}^{2}}{8}\frac{r_{\Delta}M}{N}\geq(1-\delta_{r})^{2}\frac{\gamma_{2}^{2}}{8}\Delta^{p}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2}
=\displaystyle= 4C(α)σ2Δp(MN)1p/2\displaystyle 4C(\alpha)\sigma^{2}\Delta^{p}\biggl{(}\frac{M}{N}\biggr{)}^{1-p/2}

for any two distinct elements A1,A2A_{1},A_{2} of 𝒜rΔ,γ2\mathcal{A}_{r_{\Delta},\gamma_{2}}.

The case rΔ>rr_{\Delta}>r. In this case, ψM,N,r,Δ=rM/N\psi_{M,N,r,\Delta}=rM/N. The conditions required in Theorem 2.5 of Tsybakov (2009) follow immediately as above, this time with the set of matrices 𝒜r,γ30\mathcal{A}_{r,\gamma_{3}}^{0}, where γ32=ασ2(log2)/(4(1+δr)2)\gamma_{3}^{2}=\alpha\sigma^{2}(\log 2)/(4(1+\delta_{r})^{2}).

Remark 4.

Theorem 5 implies that the rates of convergence in Theorem 4 are optimal in a minimax sense on Schatten-pp balls {Am×T\dvtxASpΔ}\{A^{*}\in\mathbb{R}^{m\times T}\dvtx\break\|A^{*}\|_{S_{p}}\leq\Delta\} under the RI condition and natural assumptions on m,Tm,T and NN. Indeed, using Theorem 5 with no restriction on the rank [i.e., when r=min(m,T)r=\min(m,T)], and putting for simplicity Δ=1\Delta=1, we find that the rate in the lower bound is of the order min(min(m,T)M/N,(M/N)1p/2,1)\min(\min(m,T)M/N,(M/N)^{1-p/2},1). For m=T(=M)m=T(=M) and m3>N>mm^{3}>N>m this minimum equals (M/N)1p/2(M/N)^{1-p/2}, which coincides with the upper bound of Theorem 4.

The lower bound for the prediction error (27) in the above theorem does not apply to matrix completion with N<mTN<mT since then the Restricted Isometry condition cannot be satisfied, as discussed in Section 3. However, for the bound (26) we only need the right-hand side inequality in the RI condition. For example, the latter is trivially satisfied for CS matrix completion with ν=N\nu=\sqrt{N} and δr=0\delta_{r}=0. This yields the following corollary.

Corollary 6 ((Lower bound—CS matrix completion))

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables for some σ2>0\sigma^{2}>0. Let M=max(m,T)8M=\max(m,T)\geq 8, r1r\geq 1, min(T,m)r\min(T,m)\geq r, and consider the problem of CS matrix completion. Then for any p(0,2]p\in(0,2], Δ>0\Delta>0, 0<α<1/80<\alpha<1/8,

infA^supAm×T:\operatornamerank(A)r,ASpΔNσA(1NA^AS22>C(α)σ2ψM,N,r,Δ)β,\inf_{\hat{A}}\mathop{\sup_{A^{*}\in\mathbb{R}^{m\times T}:}}_{\operatorname{rank}(A^{*})\leq r,\|A^{*}\|_{S_{p}}\leq\Delta\sqrt{N}\sigma}\mathbb{P}_{A^{*}}\biggl{(}\frac{1}{N}\|\hat{A}-A^{*}\|_{S_{2}}^{2}>C^{\prime}(\alpha)\sigma^{2}\psi_{M,N,r,\Delta}\biggr{)}\geq\beta,

where C(α)=α(log2)22/p/128C^{\prime}(\alpha)=\alpha(\log 2)2^{-2/p}/128 and β=β(M,α)\beta=\beta(M,\alpha), ψM,N,r,Δ\psi_{M,N,r,\Delta} are as in Theorem 5.

The model of uniform sampling without replacement considered in Candès and Recht (2009) is a particular case of CS matrix completion. In the noisy case, Keshavan, Montanari and Oh (2009) obtain upper bounds under such a sampling scheme with the rate rM/NrM/N, up to logarithmic factors. The lower bound of Corollary 6 is of the same order when Δ=\Delta=\infty, that is, for the class of matrices of rank smaller than rr. However, Keshavan, Montanari and Oh (2009) obtained their bounds on some subclasses of this class characterized by additional strong restrictions.

It is useful to note that for bounds of the type (26) it is enough to have a condition on \mathcal{L} in expectation, as specified in the next theorem.

Theorem 6

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables for some σ2>0\sigma^{2}>0. Let M=max(m,T)8M=\max(m,T)\geq 8, r1r\geq 1, min(T,m)r\min(T,m)\geq r, and assume that X1,,XNX_{1},\dots,X_{N} are random matrices independent of ξ1,,ξN\xi_{1},\dots,\xi_{N}, and the sampling operator satisfies ν2𝔼|(A)|22AS22\nu^{2}\mathbb{E}|{\mathcal{L}}(A)|_{2}^{2}\leq\|A\|_{S_{2}}^{2} for some ν>0\nu>0 and all Am×TA\in\mathbb{R}^{m\times T} such that rank(A)r{\rm rank}(A)\leq r. Then for any p(0,2]p\in(0,2], Δ>0\Delta>0, 0<α<1/80<\alpha<1/8,

infA^supAm×T:\operatornamerank(A)r,ASpΔνσA(1ν2A^AS22>C(α)σ2ψM,N,r,Δ)β,\inf_{\hat{A}}\mathop{\sup_{A^{*}\in\mathbb{R}^{m\times T}:}}_{\operatorname{rank}(A^{*})\leq r,\|A^{*}\|_{S_{p}}\leq\Delta\nu\sigma}\mathbb{P}_{A^{*}}\biggl{(}\frac{1}{\nu^{2}}\|\hat{A}-A^{*}\|_{S_{2}}^{2}>C^{\prime}(\alpha)\sigma^{2}\psi_{M,N,r,\Delta}\biggr{)}\geq\beta,

where C(α)=α(log2)22/p/128C^{\prime}(\alpha)=\alpha(\log 2)2^{-2/p}/128 and β=β(M,α)\beta=\beta(M,\alpha), ψM,N,r,Δ\psi_{M,N,r,\Delta} are as in Theorem 5.

Proof.

We proceed as in the proof of Theorem 5, with the only difference in the bound on the Kullback–Leibler divergence. Indeed, under our assumptions, instead of (29) we have

K(A0,A1)=N2σ2𝔼(d^2,N(A0,A1)2)N2σ2ν2A0A1S22γ2sM2σ2.\qquad K(\mathbb{P}_{A_{0}},\mathbb{P}_{A_{1}})=\frac{N}{2\sigma^{2}}\mathbb{E}(\hat{d}_{2,N}(A_{0},A_{1})^{2})\leq\frac{N}{2\sigma^{2}\nu^{2}}\|A_{0}-A_{1}\|_{S_{2}}^{2}\leq\frac{\gamma^{2}sM}{2\sigma^{2}}. (32)

Theorem 6 applies to USR matrix completion with ν=mT\nu=\sqrt{mT}. Indeed, in that case mT𝔼|(A)|22=AS22mT\mathbb{E}|{\mathcal{L}}(A)|_{2}^{2}=\|A\|_{S_{2}}^{2}. In particular, Theorem 6 with Δ=\Delta=\infty shows that on the class of matrices of rank smaller than rr the lower bound of estimation in the squared Frobenius norm for USR matrix completion is of the order rM/NrM/N.
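The identity behind the choice ν=√(mT) is easy to verify numerically. The following Python snippet is an illustrative Monte Carlo sanity check (not part of the argument; the dimensions and seed are arbitrary) of mT·𝔼|ℒ(A)|₂² = ‖A‖²_{S₂}, using the normalization |ℒ(A)|₂² = N⁻¹Σᵢ tr(Xᵢ′A)² as in the text.

```python
import numpy as np

# Illustrative Monte Carlo check of mT * E|L(A)|_2^2 = ||A||_{S_2}^2 for USR
# matrix completion: each mask X_i = e_k(m) e_l'(T) is drawn uniformly at
# random, so tr(X_i'A) picks a uniformly distributed entry of A.
rng = np.random.default_rng(0)
m, T, N, n_rep = 5, 7, 200, 20000

A = rng.standard_normal((m, T))
vals = []
for _ in range(n_rep):
    rows = rng.integers(0, m, size=N)
    cols = rng.integers(0, T, size=N)
    LA = A[rows, cols]                 # (L(A))_i is the sampled entry A[k_i, l_i]
    vals.append(np.sum(LA ** 2) / N)   # |L(A)|_2^2 with the 1/N normalization
print(m * T * np.mean(vals), np.sum(A ** 2))   # the two values should be close
```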

The next theorem gives a lower bound for the prediction error under collaborative sampling without the RI condition. Instead, we only impose a rather natural condition that the observed noisy entries are sufficiently well dispersed, that is, there exist rr rows or rr columns with more than κMr\kappa Mr observations for some fixed κ(0,1]\kappa\in(0,1]. We state the result with an additional constraint on the Frobenius norm of AA^{*}, in order to fit the corresponding upper bound (cf. Remark 2 in Section 5).

Theorem 7 ((Lower bound—CS matrix completion))

Let ξ1,,ξN\xi_{1},\dots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables for some σ2>0\sigma^{2}>0 and assume that the masks X1=ei1(m)ej1(T),,XN=eiN(m)ejN(T)X_{1}=e_{i_{1}}(m)e_{j_{1}}^{\prime}(T),\dots,X_{N}=e_{i_{N}}(m)e_{j_{N}}^{\prime}(T) are pairwise different. Let min(T,m)r\min(T,\break m)\geq r and κMr8\kappa Mr\geq 8 for some fixed κ(0,1]\kappa\in(0,1], where M=max(m,T)M=\max(m,T). Assume furthermore that the following dispersion condition holds: there exist numbers 1k1<<krT1\leq k_{1}<\cdots<k_{r}\leq T or 1k1<<krm1\leq k_{1}^{\prime}<\cdots<k_{r}^{\prime}\leq m such that either the set {(i1,j1),,(iN,jN)}{(i,k1),,(i,kr)\dvtxi=1,,m}\{(i_{1},j_{1}),\dots,(i_{N},j_{N})\}\,\cap\,\{(i,k_{1}),\ldots,(i,k_{r})\dvtx i=1,\ldots,m\} or the set {(i1,j1),,(iN,jN)}{(k1,j),,(kr,j)\dvtxj=1,,T}\{(i_{1},j_{1}),\ldots,(i_{N},j_{N})\}\,\cap\,\{(k_{1}^{\prime},j),\ldots,(k_{r}^{\prime},j)\dvtx j=1,\ldots,T\} has cardinality at least κMr+1\kappa Mr+1. Define 𝒞δ,r={Am×T\dvtx\operatornamerank(A)r and AS2δ}.\mathcal{C}_{\delta,r}=\{A\in\mathbb{R}^{m\times T}\dvtx\operatorname{rank}(A)\leq r\mbox{ and }\|A\|_{S_{2}}\leq\delta\}. Then for any 0<α<1/80<\alpha<1/8 and δ2ασ2(log2)(κMr+1)/4\delta^{2}\geq\alpha\sigma^{2}(\log 2)(\kappa Mr+1)/4,

infA^supA𝒞δ,rA(d^2,N(A^,A)2>C(α)σ2κrMN)β(κM,α)>0,\inf_{\hat{A}}\sup_{A^{*}\in\mathcal{C}_{\delta,r}}\mathbb{P}_{A^{*}}\biggl{(}\hat{d}_{2,N}(\hat{A},A^{*})^{2}>C^{\prime}(\alpha)\frac{\sigma^{2}\kappa rM}{N}\biggr{)}\geq\beta(\kappa M,\alpha)>0,

with a function β1\beta\rightarrow 1 as κM,α0\kappa M\rightarrow\infty,\alpha\rightarrow 0 and C(α)=α(log2)/128C^{\prime}(\alpha)=\alpha(\log 2)/128.

Proof.

We proceed as in the case Δ=\Delta=\infty, p=2p=2, ν=N\nu=\sqrt{N} of Theorem 5, taking a different set 𝒜0{\mathcal{A}}^{0} instead of 𝒜s,γ0{\mathcal{A}}^{0}_{s,\gamma}. Let, for definiteness, the dispersion condition be satisfied with the set of indices 𝒦={(i1,j1),,(iN,jN)}{(i,k1),,(i,kr)\dvtxi=1,,m}{\mathcal{K}}=\{(i_{1},j_{1}),\dots,(i_{N},j_{N})\}\cap\{(i,k_{1}),\dots,(i,k_{r})\dvtx i=1,\ldots,m\}. Then there exists a subset 𝒦{\mathcal{K}}^{\prime} of 𝒦{\mathcal{K}} with cardinality \operatornameCard(𝒦)=κMr\operatorname{Card}({\mathcal{K}}^{\prime})=\lceil\kappa Mr\rceil. We define

𝒜={A=(aij)m×T\dvtxaij{0,γ} if (i,j)𝒦;aij=0 otherwise}.\mathcal{A}=\bigl{\{}A=(a_{ij})\in\mathbb{R}^{m\times T}\dvtx a_{ij}\in\{0,\gamma\}\mbox{ if }(i,j)\in{\mathcal{K}}^{\prime};a_{ij}=0\mbox{ otherwise}\bigr{\}}.

Any element of 𝒜\mathcal{A} as well as the difference of any two elements of 𝒜\mathcal{A} has rank at most rr, and AS22γ2κMr\|A\|_{S_{2}}^{2}\leq\gamma^{2}\lceil\kappa Mr\rceil, A𝒜\forall A\in\mathcal{A}. So, 𝒜𝒞δ,r\mathcal{A}\subset\mathcal{C}_{\delta,r} if γ2(κMr+1)δ2\gamma^{2}(\kappa Mr+1)\leq\delta^{2}. As in Theorem 5, the Varshamov–Gilbert bound implies that there exists a subset 𝒜0𝒜\mathcal{A}^{0}\subset\mathcal{A} of cardinality \operatornameCard(𝒜0)2κMr/8\operatorname{Card}(\mathcal{A}^{0})\geq 2^{\lceil\kappa Mr\rceil/8} containing A0=0A_{0}=0 such that for any two distinct elements A1A_{1} and A2A_{2} of 𝒜0\mathcal{A}^{0},

d^2,N(A1,A2)2=N1A1A2S22γ28κMrN.\hat{d}_{2,N}(A_{1},A_{2})^{2}=N^{-1}\|A_{1}-A_{2}\|_{S_{2}}^{2}\geq\frac{\gamma^{2}}{8}\frac{\lceil\kappa Mr\rceil}{N}.

Instead of the bound (29), we have now the inequality K(A0,A1)γ22σ2κMrK(\mathbb{P}_{A_{0}},\mathbb{P}_{A_{1}})\leq\frac{\gamma^{2}}{2\sigma^{2}}\lceil\kappa Mr\rceil for any A1𝒜0A_{1}\in\mathcal{A}^{0}. Finally, we choose γ2=ασ2(log2)/4\gamma^{2}=\alpha\sigma^{2}(\log 2)/4. With these modifications, the rest of the proof is the same as that of Theorem 5 in the case rΔ>rr_{\Delta}>r.

8 Control of the stochastic term

We consider two approaches for bounding the stochastic term N1i=1Nξi\operatornametr((A^A)Xi)N^{-1}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}((\hat{A}-A^{*})^{\prime}X_{i}) on the right-hand side of the basic inequality (8). The first one, used for p=1p=1, consists in applying the trace duality

|1Ni=1Nξi\operatornametr((A^A)Xi)|A^AS1𝐌S\Biggl{|}\frac{1}{N}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}\bigl{(}(\hat{A}-A^{*})^{\prime}X_{i}\bigr{)}\Biggr{|}\leq\|\hat{A}-A^{*}\|_{S_{1}}\|\mathbf{M}\|_{S_{\infty}} (33)

with 𝐌=N1i=1NξiXi\mathbf{M}=N^{-1}\sum_{i=1}^{N}\xi_{i}X_{i} and then applying suitable exponential bounds for the spectral norm of 𝐌\mathbf{M} under different conditions on XiX_{i}, i=1,,Ni=1,\ldots,N. The second approach, used to treat the case 0<p<10<p<1 (nonconvex penalties; cf. Section 8.2), is based on refined empirical process techniques. Proofs of the results of this section are deferred to Section 10.
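As a concrete illustration of the first approach, the following Python sketch (with arbitrary random masks, Gaussian noise and test matrix; purely illustrative) forms the matrix 𝐌 and verifies the trace duality bound (33) numerically.

```python
import numpy as np

# Illustrative check of the trace duality bound (33): with
# M = N^{-1} sum_i xi_i X_i one has |N^{-1} sum_i xi_i tr(B'X_i)| = |tr(B'M)|
# <= ||B||_{S_1} ||M||_{S_infty}.
rng = np.random.default_rng(1)
m, T, N = 6, 4, 30

X = rng.standard_normal((N, m, T))
xi = rng.standard_normal(N)
B = rng.standard_normal((m, T))

M = np.tensordot(xi, X, axes=1) / N            # M = N^{-1} sum_i xi_i X_i
lhs = abs(np.sum(xi * np.einsum('ij,nij->n', B, X)) / N)
nuclear_B = np.linalg.norm(B, 'nuc')           # Schatten-1 (nuclear) norm of B
spectral_M = np.linalg.norm(M, 2)              # Schatten-infinity (spectral) norm of M
print(lhs <= nuclear_B * spectral_M + 1e-12)   # True by trace duality
```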

8.1 Tail bounds for the spectral norm of random matrices

We say that the random variables ξi\xi_{i}, i=1,,Ni=1,\dots,N, satisfy the Bernstein condition if

max1iN𝔼|ξi|l12l!σ2Hl2,l=2,3,,\max_{1\leq i\leq N}\mathbb{E}|\xi_{i}|^{l}\leq\frac{1}{2}l!\sigma^{2}H^{l-2},\qquad l=2,3,\ldots, (34)

with some finite constants σ\sigma and HH, and we say that they satisfy the light tail condition if

max1iN𝔼(exp(ξi2/σ2))exp(1)\max_{1\leq i\leq N}\mathbb{E}(\exp(\xi_{i}^{2}/\sigma^{2}))\leq\exp(1) (35)

for some positive constant σ2\sigma^{2}.

Lemma 1

Let the i.i.d. zero-mean random variables ξi\xi_{i} satisfy the Bernstein condition (34). Let also either

max1jm1Ni=1N|Xi(j,)|22Srow2\max_{1\leq j\leq m}\frac{1}{N}\sum_{i=1}^{N}\bigl{|}X_{i(j,\cdot)}\bigr{|}_{2}^{2}\leq S_{\mathrm{row}}^{2} (36)

and

max1jm,1iN|Xi(j,)|2Hrow\max_{1\leq j\leq m,1\leq i\leq N}\bigl{|}X_{i(j,\cdot)}\bigr{|}_{2}\leq H_{\mathrm{row}} (37)

or the conditions

max1kT1Ni=1N|Xi(,k)|22Scol2\max_{1\leq k\leq T}\frac{1}{N}\sum_{i=1}^{N}\bigl{|}X_{i(\cdot,k)}\bigr{|}_{2}^{2}\leq S_{\mathrm{col}}^{2} (38)

and

max1kT,1iN|Xi(,k)|2Hcol\max_{1\leq k\leq T,1\leq i\leq N}\bigl{|}X_{i(\cdot,k)}\bigr{|}_{2}\leq H_{\mathrm{col}} (39)

hold true with some constants Srow,Hrow,Scol,HcolS_{\mathrm{row}},H_{\mathrm{row}},S_{\mathrm{col}},H_{\mathrm{col}}. Let D>1D>1. Then, respectively, with probability at least 12/mD11-2/m^{D-1} or at least 12/TD11-2/T^{D-1} we have

𝐌Sτ,\|\mathbf{M}\|_{S_{\infty}}\leq\tau, (40)

where τ=τrow=Crowm(logm)/N\tau=\tau_{\mathrm{row}}=C_{\mathrm{row}}\sqrt{m(\log m)/N} if (36) and (37) are satisfied or τ=τcol=CcolT(logT)/N\tau=\tau_{\mathrm{col}}=C_{\mathrm{col}}\sqrt{T(\log T)/N} if (38) and (39) hold. Here

Crow\displaystyle C_{\mathrm{row}} =\displaystyle= (2Dσ2Srow2+2DHrowHlogmN),\displaystyle\Biggl{(}\sqrt{2D\sigma^{2}S_{\mathrm{row}}^{2}}+2DH_{\mathrm{row}}H\sqrt{\frac{\log m}{N}}\Biggr{)},
Ccol\displaystyle C_{\mathrm{col}} =\displaystyle= (2Dσ2Scol2+2DHcolHlogTN).\displaystyle\Biggl{(}\sqrt{2D\sigma^{2}S_{\mathrm{col}}^{2}}+2DH_{\mathrm{col}}H\sqrt{\frac{\log T}{N}}\Biggr{)}.
Lemma 2

Let ξ1,,ξN\xi_{1},\ldots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables. Then, for any D2D\geq 2,

𝐌S42Dσϕmax(1)m+TN=:τ1\|\mathbf{M}\|_{S_{\infty}}\leq 4\sqrt{2D}\sigma\phi_{\max}(1)\sqrt{\frac{m+T}{N}}=:\tau_{1} (41)

with probability at least 12exp{(Dlog5)(m+T)}1-2\exp\{-(D-\log 5)(m+T)\}, where ϕmax(1)\phi_{\max}(1) is the maximal rank 1 eigenvalue of the sampling operator \mathcal{L}.

If mm and TT have the same order of magnitude, the bound of Lemma 2 is better, since it does not contain extra logarithmic factors. On the other hand, if mm and TT differ dramatically, for example, mTm\gg T, then Lemma 1 can provide a significant improvement. Indeed, the “column” version of Lemma 1 guarantees the rate τTlogT/N\tau\sim\sqrt{T\log T}/\sqrt{N} which in this case is much smaller than m/N\sqrt{m/N}. In all the cases, the concentration rate in Lemma 2 is exponential and thus faster than in Lemma 1.

The next lemma treats the stochastic term for USR matrix completion.

Lemma 3 ((USR matrix completion))

(i) Let the i.i.d. zero-mean random variables ξi\xi_{i} satisfy the Bernstein condition (34). Consider the USR matrix completion problem and assume that mT(m+T)>NmT(m+T)>N. Then, for any D2D\geq 2,

𝐌S(4σ10D+8HD)m+TN=:τ2\|\mathbf{M}\|_{S_{\infty}}\leq\bigl{(}4\sigma\sqrt{10D}+8HD\bigr{)}\frac{m+T}{N}=:\tau_{2} (42)

with probability at least 14exp{(2log5)(m+T)}1-4\exp\{-(2-\log 5)(m+T)\}.

(ii) Assume that the i.i.d. zero-mean random variables ξi\xi_{i} satisfy the light tail condition (35) for some σ2>0\sigma^{2}>0. Then for any B>0B>0,

𝐌SBσlog(max(m+1,T+1))N=:τ3\|\mathbf{M}\|_{S_{\infty}}\leq\sqrt{B}\frac{\sigma\log(\max(m+1,T+1))}{\sqrt{N}}=:\tau_{3} (43)

with probability at least 1(1/C)max(m+1,T+1)CB1-(1/C){\max(m+1,T+1)}^{-CB} for some constant C>0C>0 which does not depend on m,Tm,T and NN.

The proof of part (i) is based on a refinement of a technique in Vershynin (2007), whereas that of part (ii) follows immediately from the large deviations inequality of Nemirovski (2004). For example, if ξi𝒩(0,σ2)\xi_{i}\sim\mathcal{N}(0,\sigma^{2}), in which case both results apply, the bound (ii) is tighter than (i) for sample sizes N(m+T)2N\ll(m+T)^{2} which is the most interesting case for matrix completion.

Much tighter bounds are available when the XiX_{i} are constrained to be pairwise different. Besides, it is noteworthy that the rates in (44) and (45) below are different for Gaussian and Bernstein errors.

Lemma 4 ((Collaborative sampling))

Consider the problem of CS matrix completion.

(i) Let ξ1,,ξN\xi_{1},\ldots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables. Then, for any D2D\geq 2,

𝐌S8σDm+TN=:τ4\|\mathbf{M}\|_{S_{\infty}}\leq 8\sigma\sqrt{D}\frac{\sqrt{m+T}}{N}=:\tau_{4} (44)

with probability at least 12exp{(Dlog5)(m+T)}1-2\exp\{-(D-\log 5)(m+T)\}.

(ii) Let ξ1,,ξN\xi_{1},\ldots,\xi_{N} be i.i.d. zero-mean random variables satisfying the Bernstein condition (34). Then, for any D2D\geq 2,

𝐌S4σ2D(m+T)+8HD(m+T)N=:τ5\|\mathbf{M}\|_{S_{\infty}}\leq\frac{4\sigma\sqrt{2D(m+T)}+8HD(m+T)}{N}=:\tau_{5} (45)

with probability at least 12exp{(Dlog5)(m+T)}1-2\exp\{-(D-\log 5)(m+T)\}.

(iii) Let ξ1,,ξN\xi_{1},\ldots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables. Then for any A>1A>1,

𝐌Sσ2Alog(m+T)Nmax{i=1NXiXiS1/2,i=1NXiXiS1/2}=:τ6\|\mathbf{M}\|_{S_{\infty}}\leq\frac{\sigma\sqrt{2A\log(m+T)}}{N}\max\Biggl{\{}\Biggl{\|}\sum_{i=1}^{N}X_{i}^{\prime}X_{i}\Biggr{\|}_{S_{\infty}}^{1/2},\Biggl{\|}\sum_{i=1}^{N}X_{i}X_{i}^{\prime}\Biggr{\|}_{S_{\infty}}^{1/2}\Biggr{\}}=:\tau_{6}

with probability at least 12(m+T)1A1-2(m+T)^{1-A}.

Since the masks XiX_{i} are distinct, the maximum appearing in (iii) is bounded by max(m,T)\sqrt{\max(m,T)}; in case it is attained, the bound (44) is slightly stronger since it is free from the logarithmic factor. For NmTN\ll mT the tightness of the bound in (iii) depends strongly on the geometry of the XiX_{i}’s and the maximum can be significantly smaller than max(m,T)\sqrt{\max(m,T)}. Note also that the concentration in (44) is exponential, while it is only polynomial in (iii).

8.2 Concentration bounds for the stochastic term under nonconvex penalties

The last bound in this section applies in the case 0<p<10<p<1. It is given in the following lemma.

Lemma 5

Let ξ1,,ξN\xi_{1},\ldots,\xi_{N} be i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) random variables, 0<p<10<p<1 and M=max(m,T)M=\max(m,T). Assume that the sampling operator \mathcal{L} is uniformly bounded; cf. (5). Set cκ=(2κ1)(2κ)κ1/(2κ1)c_{\kappa}=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)} where κ=(2p)/(22p)\kappa=(2-p)/(2-2p). Then for any fixed δ>0\delta>0, ϑC2\vartheta\geq C^{2} and τ7=cκ(ϑ/p)1p/2(M/N)1p/2\tau_{7}=c_{\kappa}(\vartheta/p)^{1-p/2}(M/N)^{1-p/2} we have

|1Ni=1Nξi\operatornametr(Xi(A^A))|δ2d^2,N(A^,A)2+τ7δp1A^ASpp\quad\quad\Biggl{|}\frac{1}{N}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}\bigl{(}X_{i}^{\prime}(\hat{A}-A^{*})\bigr{)}\Biggr{|}\leq\frac{\delta}{2}\hat{d}_{2,N}(\hat{A},A^{*})^{2}+\tau_{7}\delta^{p-1}\|\hat{A}-A^{*}\|_{S_{p}}^{p}\hskip-10.0pt (46)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}) for some constant C=C(p,c0,σ2)>0C=C(p,c_{0},\break\sigma^{2})>0 which is independent of MM and NN and satisfies sup0<pqC(p,c0,σ)<\sup_{0<p\leq q}C(p,c_{0},\sigma)<\infty for all q<1q<1.

Note at this point that we cannot base the proof of Lemma 5 directly on the trace duality and norm interpolation (cf. Lemma 11), that is, on the inequalities

|1Ni=1Nξi\operatornametr(Xi(A^A))|\displaystyle\quad\quad\Biggl{|}\frac{1}{N}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}\bigl{(}X_{i}^{\prime}(\hat{A}-A^{*})\bigr{)}\Biggr{|} \displaystyle\leq A^AS1𝐌S\displaystyle\|\hat{A}-A^{*}\|_{S_{1}}\|\mathbf{M}\|_{S_{\infty}}
\displaystyle\leq A^AS21p/(2p)A^ASpp/(2p)𝐌S.\displaystyle\|\hat{A}-A^{*}\|_{S_{2}}^{1-p/(2-p)}\|\hat{A}-A^{*}\|_{S_{p}}^{p/(2-p)}\|\mathbf{M}\|_{S_{\infty}}.\hskip-10.0pt

Indeed, one may think that we could have bounded here the SS_{\infty}-norm of 𝐌\mathbf{M} in the same way as in Section 8.1, and then the proof would be complete after suitable decoupling if we were able to bound from above A^AS22\|\hat{A}-A^{*}\|_{S_{2}}^{2} by d^2,N(A^,A)2\hat{d}_{2,N}(\hat{A},A^{*})^{2} times a constant factor. However, this is not possible. Even the Restricted Isometry condition cannot help here because A^A\hat{A}-A^{*} is not necessarily of small rank. Nevertheless, we will show that by other techniques it is possible to derive an inequality similar to (8.2) with d^2,N(A^,A)\hat{d}_{2,N}(\hat{A},A^{*}) instead of A^AS2\|\hat{A}-A^{*}\|_{S_{2}}. Further details are given in Sections 10 and 11.

9 Proof of Theorem 2

Preliminaries

We first give two lemmas on matrix decomposition needed in our proof, which are essentially provided by Recht, Fazel and Parrilo (2010) [subsequently, RFP(10) for short].

Lemma 6

Let AA and BB be matrices of the same dimension. If AB=0AB^{\prime}=0, AB=0A^{\prime}B=0, then

A+BSpp=ASpp+BSppp>0.\|A+B\|_{S_{p}}^{p}=\|A\|_{S_{p}}^{p}+\|B\|_{S_{p}}^{p}\qquad\forall p>0.
Proof.

For p=1p=1 the result is Lemma 2.3 in RFP(10). The argument obviously extends to any p>0p>0 since RFP(10) show that the singular values of A+BA+B are equal to the union (with repetition) of the singular values of AA and BB.
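A quick numerical illustration of Lemma 6 (purely illustrative; the block construction below is merely one simple way to enforce AB′=0 and A′B=0):

```python
import numpy as np

# Illustrative check of Lemma 6: if AB' = 0 and A'B = 0, the singular values of
# A + B are the union of those of A and B, so the Schatten-p quasi-norms add.
rng = np.random.default_rng(6)
m, T, p = 6, 5, 0.7

# Build A and B with orthogonal row and column spaces via disjoint blocks.
A = np.zeros((m, T)); A[:3, :2] = rng.standard_normal((3, 2))
B = np.zeros((m, T)); B[3:, 2:] = rng.standard_normal((3, 3))
assert np.allclose(A @ B.T, 0) and np.allclose(A.T @ B, 0)

def schatten_p_pow(C, p):
    s = np.linalg.svd(C, compute_uv=False)
    return np.sum(s ** p)

print(np.isclose(schatten_p_pow(A + B, p),
                 schatten_p_pow(A, p) + schatten_p_pow(B, p)))   # True
```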

Lemma 7

Let Am×TA\in\mathbb{R}^{m\times T} with \operatornamerank(A)=r\operatorname{rank}(A)=r and singular value decomposition A=UΛVA=U\Lambda V^{\prime}. Let Bm×T{B}\in\mathbb{R}^{m\times T} be arbitrary. Then there exists a decomposition B=B1+B2B=B_{1}+B_{2} with the following properties:

(i) \operatornamerank(B1)2\operatornamerank(A)=2r\operatorname{rank}(B_{1})\leq 2\operatorname{rank}(A)=2r,

(ii) AB2=0AB_{2}^{\prime}=0, AB2=0A^{\prime}B_{2}=0,

(iii) \operatornametr(B1B2)=0\operatorname{tr}(B_{1}^{\prime}B_{2})=0,

(iv) B1B_{1} and B2B_{2} are of the form

B1=U(B~11B~12B~210)VandB2=U(000B~22)V\displaystyle B_{1}=U\pmatrix{\tilde{B}_{11}&\tilde{B}_{12}\cr\tilde{B}_{21}&0}V^{\prime}\quad\mbox{and}\quad B_{2}=U\pmatrix{0&0\cr 0&\tilde{B}_{22}}V^{\prime}
\eqntextwith B~11r×r.\displaystyle\eqntext{\mbox{with }\tilde{B}_{11}\in\mathbb{R}^{r\times r}.} (48)

The points (i)–(iii) are the statement of Lemma 3.4 in RFP(10), and the representation (iv) is provided in its proof.


Proof of Theorem 2. First note that there exists a decomposition A^=A^(1)+A^(2)\hat{A}=\hat{A}^{(1)}+\hat{A}^{(2)} with the following properties:

(i) \operatornamerank(A^(1)A)2\operatornamerank(A)=2r\operatorname{rank}(\hat{A}^{(1)}-A^{*})\leq 2\operatorname{rank}(A^{*})=2r,

(ii) A(A^(2))=0A^{*}(\hat{A}^{(2)})^{\prime}=0, (A)A^(2)=0(A^{*})^{\prime}\hat{A}^{(2)}=0,

(iii) \operatornametr((A^(1)A)A^(2))=0\operatorname{tr}((\hat{A}^{(1)}-A^{*})^{\prime}\hat{A}^{(2)})=0. This follows from Lemma 7 with A=AA=A^{*} and B=A^AB=\hat{A}-A^{*}. In the notation of Lemma 7, we have B1=A^(1)AB_{1}=\hat{A}^{(1)}-A^{*} and B2=A^(2)B_{2}=\hat{A}^{(2)}.

From the basic inequalities (8) and (3) with δ=1/2\delta=1/2, we find

(1I{0<p<1}/2)d^2,N(A^,A)2\displaystyle\bigl{(}1-I_{\{0<p<1\}}/2\bigr{)}\hat{d}_{2,N}(\hat{A},A^{*})^{2}
(49)
22pτA^ASpp+4τ(ASppA^Spp).\displaystyle\qquad\leq 2^{2-p}\tau\|\hat{A}-A^{*}\|_{S_{p}}^{p}+4\tau(\|A^{*}\|_{S_{p}}^{p}-\|\hat{A}\|_{S_{p}}^{p}).

In particular, for the case p=1p=1,

d^2,N(A^,A)22τA^ASpp+4τ(ASppA^Spp).\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 2\tau\|\hat{A}-A^{*}\|_{S_{p}}^{p}+4\tau(\|A^{*}\|_{S_{p}}^{p}-\|\hat{A}\|_{S_{p}}^{p}). (50)

For brevity, we will conduct the proof with the numerical constants given in (50), that is, with those for p=1p=1. The proof for general pp differs only in the values of the constants, but their expressions become cumbersome.

Using (3), we get

d^2,N(A^,A)2\displaystyle\hat{d}_{2,N}(\hat{A},A^{*})^{2}
(51)
2τA^(1)ASpp+4τASpp+2τA^(2)Spp4τA^Spp.\displaystyle\qquad\leq 2\tau\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}+4\tau\|A^{*}\|_{S_{p}}^{p}+2\tau\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}^{p}-4\tau\|\hat{A}\|_{S_{p}}^{p}.

By (3) again and by Lemma 6,

A^Spp\displaystyle\|\hat{A}\|_{S_{p}}^{p} \displaystyle\geq A+A^(2)SppA^(1)ASpp\displaystyle\bigl{\|}A^{*}+\hat{A}^{(2)}\bigr{\|}_{S_{p}}^{p}-\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}
=\displaystyle= ASpp+A^(2)SppA^(1)ASpp,\displaystyle\|A^{*}\|_{S_{p}}^{p}+\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}^{p}-\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p},

since (A)A^(2)=0(A^{*})^{\prime}\hat{A}^{(2)}=0 and A(A^(2))=0A^{*}(\hat{A}^{(2)})^{\prime}=0 by construction. Together with (9) this yields

d^2,N(A^,A)22τA^(1)ASpp2τA^(2)Spp+4τA^(1)ASpp,\qquad\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 2\tau\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}-2\tau\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}^{p}+4\tau\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p},\hskip-5.0pt (52)

from which one may deduce

d^2,N(A^,A)26τA^(1)ASpp\hat{d}_{2,N}(\hat{A},A^{*})^{2}\leq 6\tau\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p} (53)

and

A^(2)Spp3A^(1)ASpp.\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}^{p}\leq 3\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}. (54)

Consider now the following decomposition of the matrix A^(2)\hat{A}^{(2)}. First, recall that A^(2)\hat{A}^{(2)} is of the form

A^(2)=U(000B~22)V.\hat{A}^{(2)}=U\pmatrix{0&0\cr 0&\tilde{B}_{22}}V^{\prime}.

Write B~22=W1Λ(B~22)W2\tilde{B}_{22}=W_{1}\Lambda(\tilde{B}_{22})W_{2}^{\prime} with diagonal matrix Λ(B~22)\Lambda(\tilde{B}_{22}) of dimension rr^{\prime} and W1W1=W2W2=Ir×rW_{1}^{\prime}W_{1}=W_{2}^{\prime}W_{2}=I_{r^{\prime}\times r^{\prime}} for some rmin(m,T)r^{\prime}\leq\min(m,T). In the next step, W1W_{1} and W2W_{2} are complemented to orthogonal matrices W¯1\bar{W}_{1} and W¯2\bar{W}_{2} of dimension min(m,T)×min(m,T)\min(m,T)\times\min(m,T). For instance, set

W¯2=(0W2)min(m,T)×min(m,T),\bar{W}_{2}^{\prime}=\pmatrix{&&0\cr*&&\cr&&W_{2}^{\prime}}\in\mathbb{R}^{\min(m,T)\times\min(m,T)},

where * complements the columns of the matrix (0W2){{0}\choose{W_{2}^{\prime}}} to an orthonormal basis of min(m,T)\mathbb{R}^{\min(m,T)}, and proceed analogously with W1W_{1}. In particular, W¯1W¯1=W¯2W¯2=Imin(m,T)×min(m,T)\bar{W}_{1}^{\prime}\bar{W}_{1}=\bar{W}_{2}^{\prime}\bar{W}_{2}=I_{\min(m,T)\times\min(m,T)}. Also

A^(2)=U(000W1Λ(B~22)W2)V=UW¯1(000Λ(B~22))W¯2V=:UW¯1DW¯2V.\hat{A}^{(2)}=U\pmatrix{0\!&0\cr 0\!&W_{1}\Lambda(\tilde{B}_{22})W_{2}^{\prime}}V^{\prime}=U\bar{W}_{1}\pmatrix{0\!&0\cr 0\!&\Lambda(\tilde{B}_{22})}\bar{W}_{2}^{\prime}V^{\prime}=:U\bar{W}_{1}D\bar{W}_{2}^{\prime}V^{\prime}.

We now represent A^(2)\hat{A}^{(2)} as a finite sum of matrices A^(2)=j=1RA^j(2)\hat{A}^{(2)}=\sum_{j=1}^{R^{\prime}}\hat{A}^{(2)}_{j} with

A^i(2)=UW¯1DiW¯2V\hat{A}^{(2)}_{i}=U\bar{W}_{1}D_{i}\bar{W}_{2}^{\prime}V^{\prime}

and

Di=(000Λi),D_{i}=\pmatrix{0&0\cr 0&\Lambda_{i}},

where the r×rr^{\prime}\times r^{\prime} diagonal matrix Λi\Lambda_{i} has the form Λi=\operatornamediag(λjI{jIi})\Lambda_{i}=\operatorname{diag}(\lambda_{j}I_{\{j\in I_{i}\}}), i1i\geq 1. We denote here by I1I_{1} the set of arar indices from {1,,min(m,T)}\{1,\dots,\min(m,T)\} corresponding to the arar largest in absolute value diagonal entries of Λ\Lambda, by I2I_{2} the set of indices corresponding to the next arar largest in absolute value diagonal entries λj\lambda_{j}, etc. Clearly, the matrices A^k(2)\hat{A}^{(2)}_{k} are mutually orthogonal: \operatornametr((A^j(2))A^k(2))=0\operatorname{tr}((\hat{A}^{(2)}_{j})^{\prime}\hat{A}^{(2)}_{k})=0 for jkj\not=k and \operatornamerank(A^j(2))ar\operatorname{rank}(\hat{A}^{(2)}_{j})\leq ar. Moreover, A^i(2)\hat{A}^{(2)}_{i} is orthogonal to A^(1)A\hat{A}^{(1)}-A^{*}.

Let σ1σ2\sigma_{1}\geq\sigma_{2}\geq\cdots be the singular values of A^(2)\hat{A}^{(2)}, then σ1σar\sigma_{1}\geq\cdots\geq\sigma_{ar} are the singular values of A^1(2)\hat{A}^{(2)}_{1}, σar+1σ2ar\sigma_{ar+1}\geq\cdots\geq\sigma_{2ar} those of A^2(2)\hat{A}^{(2)}_{2}, etc. By construction, we have \operatornameCard(Ii)=ar\operatorname{Card}(I_{i})=ar for all ii, and for all kIi+1k\in I_{i+1}

σkminjIiσj(1arjIiσjp)1/p.\sigma_{k}\leq\min_{j\in I_{i}}\sigma_{j}\leq\biggl{(}\frac{1}{ar}\sum_{j\in I_{i}}\sigma_{j}^{p}\biggr{)}^{1/p}.

Thus,

kIi+1σk2ar(1arjIiσjp)2/p\sum_{k\in I_{i+1}}\sigma_{k}^{2}\leq ar\biggl{(}\frac{1}{ar}\sum_{j\in I_{i}}\sigma_{j}^{p}\biggr{)}^{2/p}

from which one can deduce for all j2j\geq 2:

A^j(2)S2=(kIjσk2)1/2(ar)1/21/p(kIj1σkp)1/p=(ar)1/21/pA^j1(2)Sp\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{2}}=\biggl{(}\sum_{k\in I_{j}}\sigma_{k}^{2}\biggr{)}^{1/2}\leq(ar)^{1/2-1/p}\biggl{(}\sum_{k\in I_{j-1}}\sigma_{k}^{p}\biggr{)}^{1/p}=(ar)^{1/2-1/p}\bigl{\|}\hat{A}_{j-1}^{(2)}\bigr{\|}_{S_{p}}

and consequently

j2A^j(2)S2(ar)1/21/pj1A^j(2)Sp.\sum_{j\geq 2}\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{2}}\leq(ar)^{1/2-1/p}\sum_{j\geq 1}\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{p}}.
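The grouping argument above can be illustrated numerically. The sketch below is illustrative only: k plays the role of ar, and the sorted values stand in for singular values; it checks block by block that the ℓ₂ norm of each group is bounded by k^{1/2−1/p} times the ℓ_p quasi-norm of the preceding group.

```python
import numpy as np

# Illustrative check of the grouping step: sort values decreasingly, split them
# into consecutive blocks of size k, and verify the block-wise l2 / lp bound.
rng = np.random.default_rng(2)
p, k = 0.7, 4
sigma = np.sort(rng.exponential(size=24))[::-1]
blocks = [sigma[i:i + k] for i in range(0, len(sigma), k)]

ok = all(
    np.sqrt(np.sum(blocks[j] ** 2))
    <= k ** (0.5 - 1 / p) * np.sum(blocks[j - 1] ** p) ** (1 / p) + 1e-12
    for j in range(1, len(blocks))
)
print(ok)   # True
```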

Because of the elementary inequality x1/p+y1/p(x+y)1/px^{1/p}+y^{1/p}\leq(x+y)^{1/p} for any nonnegative x,yx,y and 0<p10<p\leq 1,

j2A^j(2)Sp\displaystyle\sum_{j\geq 2}\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{p}} =\displaystyle= j2(kIjσkp)1/p(j2kIjσkp)1/p\displaystyle\sum_{j\geq 2}\biggl{(}\sum_{k\in I_{j}}\sigma_{k}^{p}\biggr{)}^{1/p}\leq\biggl{(}\sum_{j\geq 2}\sum_{k\in I_{j}}\sigma_{k}^{p}\biggr{)}^{1/p}
\displaystyle\leq (kσkp)1/p=A^(2)Sp.\displaystyle\biggl{(}\sum_{k}\sigma_{k}^{p}\biggr{)}^{1/p}=\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}.

Therefore,

j2A^j(2)S2\displaystyle\sum_{j\geq 2}\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{2}} \displaystyle\leq (ar)1/21/pA^(2)Sp\displaystyle(ar)^{1/2-1/p}\bigl{\|}\hat{A}^{(2)}\bigr{\|}_{S_{p}}
\displaystyle\leq 31/p(ar)1/21/pA^(1)ASp[using inequality (54)]\displaystyle 3^{1/p}(ar)^{1/2-1/p}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}\qquad\mbox{[using inequality (\ref{eq: 1--2})]}
\displaystyle\leq 31/p(ar)1/21/p(2r)1/p1/2A^(1)AS2,\displaystyle 3^{1/p}(ar)^{1/2-1/p}(2r)^{1/p-1/2}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}},

where the last inequality results from \operatornamerank(A^(1)A)2r\operatorname{rank}(\hat{A}^{(1)}-A^{*})\leq 2r and

(12rk2rσkp)1/p(12rk2rσk2)1/2.\biggl{(}\frac{1}{2r}\sum_{k\leq 2r}\sigma_{k}^{p}\biggr{)}^{1/p}\leq\biggl{(}\frac{1}{2r}\sum_{k\leq 2r}\sigma_{k}^{2}\biggr{)}^{1/2}.

Finally,

j2A^j(2)S231/p(a2)1/21/pA^(1)AS2.\sum_{j\geq 2}\bigl{\|}\hat{A}_{j}^{(2)}\bigr{\|}_{S_{2}}\leq 3^{1/p}\biggl{(}\frac{a}{2}\biggr{)}^{1/2-1/p}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}. (55)

We now proceed with the final argument. First, note that \operatornamerank((A^(1)A)+A^1(2))(2+a)r\operatorname{rank}((\hat{A}^{(1)}-A^{*})+\hat{A}^{(2)}_{1})\leq(2+a)r. Next, by the triangle inequality, the restricted isometry condition and the orthogonality of A^j(2)\hat{A}^{(2)}_{j} and A^(1)A\hat{A}^{(1)}-A^{*}, we obtain

νd^2,N(A^,A)\displaystyle\nu\hat{d}_{2,N}(\hat{A},A^{*}) =\displaystyle= ν|(A^A)|2\displaystyle\nu|\mathcal{L}(\hat{A}-A^{*})|_{2}
\displaystyle\geq ν|(A^(1)A+A^1(2))|2νj2|(A^j(2))|2\displaystyle\nu\bigl{|}\mathcal{L}\bigl{(}\hat{A}^{(1)}-A^{*}+\hat{A}^{(2)}_{1}\bigr{)}\bigr{|}_{2}-\nu\sum_{j\geq 2}\bigl{|}\mathcal{L}\bigl{(}\hat{A}^{(2)}_{j}\bigr{)}\bigr{|}_{2}
\displaystyle\geq (1δ(2+a)r)A^(1)A+A^1(2)S2(1+δar)j2A^j(2)S2\displaystyle\bigl{(}1-\delta_{(2+a)r}\bigr{)}\bigl{\|}\hat{A}^{(1)}-A^{*}+\hat{A}^{(2)}_{1}\bigr{\|}_{S_{2}}-(1+\delta_{ar})\sum_{j\geq 2}\bigl{\|}\hat{A}^{(2)}_{j}\bigr{\|}_{S_{2}}
\displaystyle\geq A^(1)AS2((1δ(2+a)r)(1+δar)31/p(a2)1/21/p).\displaystyle\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}\biggl{(}\bigl{(}1-\delta_{(2+a)r}\bigr{)}-(1+\delta_{ar})3^{1/p}\biggl{(}\frac{a}{2}\biggr{)}^{1/2-1/p}\biggr{)}.

Define

a=a(p)=min{k\dvtxk>(61/p/2)2p/(2p)}.a=a(p)=\min\bigl{\{}k\in\mathbb{N}\dvtx k>\bigl{(}6^{1/p}/\sqrt{2}\bigr{)}^{2p/(2-p)}\bigr{\}}.

Then 131/p(a/2)1/21/p>01-3^{1/p}(a/2)^{1/2-1/p}>0. Now, δ(2+a)rδar\delta_{(2+a)r}\geq\delta_{ar}, and thus

(1δ(2+a)r)(1+δar)31/p(a2)1/21/p(131/p(a2)1/21/p)2δ(2+a)r>0\bigl{(}1-\delta_{(2+a)r}\bigr{)}-(1+\delta_{ar})3^{1/p}\biggl{(}\!\frac{a}{2}\!\biggr{)}^{\!1/2-1/p}\geq\biggl{(}\!1-3^{1/p}\biggl{(}\frac{a}{2}\biggr{)}^{\!1/2-1/p}\biggr{)}-2\delta_{(2+a)r}>0

whenever

δ(2+a)r<12(131/p(a2)1/21/p).\delta_{(2+a)r}<\frac{1}{2}\biggl{(}1-3^{1/p}\biggl{(}\frac{a}{2}\biggr{)}^{1/2-1/p}\biggr{)}. (57)
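For orientation, the constant a(p) and the resulting admissible range for δ_{(2+a)r} in (57) can be computed explicitly; the following Python lines (illustrative only) do this for a few values of p and confirm that the right-hand side of (57) is positive.

```python
import math

# Illustrative computation of a(p) and of the upper bound on delta_{(2+a)r}
# required in (57); the printed threshold is positive by the choice of a(p).
def a_of_p(p):
    bound = (6 ** (1 / p) / math.sqrt(2)) ** (2 * p / (2 - p))
    return math.floor(bound) + 1          # smallest integer strictly larger

for p in (0.5, 0.9, 1.0):
    a = a_of_p(p)
    threshold = 0.5 * (1 - 3 ** (1 / p) * (a / 2) ** (0.5 - 1 / p))
    print(p, a, round(threshold, 4))
```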

In case of (57), there exists a constant κ=κ(p)>0\kappa=\kappa(p)>0 such that

ν2d^2,N(A^,A)2κA^(1)AS22.\nu^{2}\hat{d}_{2,N}(\hat{A},A^{*})^{2}\geq\kappa\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}^{2}. (58)

Now, inequalities (53) and (58) yield

κA^(1)AS226τν2A^(1)ASpp6τν2(2r)1p/2A^(1)AS2p,\quad\quad\kappa\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}^{2}\leq 6\tau\nu^{2}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}\leq 6\tau\nu^{2}(2r)^{1-p/2}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}^{p},\hskip-15.0pt (59)

where the second inequality results from the fact that \operatornamerank(A^(1)A)2r\operatorname{rank}(\hat{A}^{(1)}-A^{*})\leq 2r, which implies

A^(1)ASp(2r)1/p1/2A^(1)AS2.\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}\leq(2r)^{1/p-1/2}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}. (60)

From (59), we obtain

κA^(1)AS22p6τν2(2r)1p/2.\kappa\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}^{2-p}\leq 6\tau\nu^{2}(2r)^{1-p/2}. (61)

Furthermore, from (53), (60) and (61) we find

d^2,N(A^,A)2\displaystyle\hat{d}_{2,N}(\hat{A},A^{*})^{2} \displaystyle\leq 6τ(2r)1p/2A^(1)AS2p\displaystyle 6\tau(2r)^{1-p/2}\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{2}}^{p}
\displaystyle\leq 2r(6τ)2/(2p)κp/(2p)ν2p/(2p).\displaystyle 2r(6\tau)^{2/(2-p)}\kappa^{-p/(2-p)}\nu^{2p/(2-p)}.

This proves (11). It remains to prove (12). We first demonstrate (12) for q=2q=2, then for q=pq=p, and finally obtain (12) for all q[p,2]q\in[p,2] by Schatten norm interpolation.

Using (55), (9), (9), we find

(1δ(2+a)r)A^(1)A+A^1(2)S2\displaystyle\bigl{(}1-\delta_{(2+a)r}\bigr{)}\bigl{\|}\hat{A}^{(1)}-A^{*}+\hat{A}^{(2)}_{1}\bigr{\|}_{S_{2}} \displaystyle\leq νd^2,N(A^,A)+(1+δar)j2A^j(2)S2\displaystyle\nu\hat{d}_{2,N}(\hat{A},A^{*})+(1+\delta_{ar})\sum_{j\geq 2}\bigl{\|}\hat{A}^{(2)}_{j}\bigr{\|}_{S_{2}}
\displaystyle\leq Crτ1/(2p)ν2/(2p)\displaystyle C\sqrt{r}\tau^{1/(2-p)}\nu^{2/(2-p)}

for some constant C=C(p)>0C=C(p)>0. This and again (55) yield

A^AS2A^(1)A+A^1(2)S2+j2A^j(2)S2Crτ1/(2p)ν2/(2p)\|\hat{A}-A^{*}\|_{S_{2}}\leq\bigl{\|}\hat{A}^{(1)}-A^{*}+\hat{A}^{(2)}_{1}\bigr{\|}_{S_{2}}+\sum_{j\geq 2}\bigl{\|}\hat{A}^{(2)}_{j}\bigr{\|}_{S_{2}}\leq C^{\prime}\sqrt{r}\tau^{1/(2-p)}\nu^{2/(2-p)}

for some constant C=C(p)>0C^{\prime}=C^{\prime}(p)>0. Thus, we have proved (12) for q=2q=2. Next, using inequalities (3) and (54) we obtain

A^ASppA^(1)ASpp+A^(2)Spp4A^(1)ASpp.\|\hat{A}-A^{*}\|_{S_{p}}^{p}\leq\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}+\bigl{\|}\hat{A}^{(2)}\bigl{\|}_{S_{p}}^{p}\leq 4\bigl{\|}\hat{A}^{(1)}-A^{*}\bigr{\|}_{S_{p}}^{p}.

Combining this with (60) and (61) we get (12) for q=pq=p. Finally, (12) for arbitrary q[p,2]q\in[p,2] follows from the norm interpolation formula

ASqqASpp(2q)/(2p)AS22(qp)/(2p);\|A\|_{S_{q}}^{q}\leq\|A\|_{S_{p}}^{p(2-q)/(2-p)}\|A\|_{S_{2}}^{2(q-p)/(2-p)};

cf. Lemma 11 of Section 11 with θ=p(2q)q(2p)\theta=\frac{p(2-q)}{q(2-p)}.
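The interpolation inequality used in this last step is easy to check numerically; the following Python sketch (illustrative only; random matrices and random pairs p ≤ q ≤ 2) verifies the displayed bound.

```python
import numpy as np

# Illustrative check of the Schatten norm interpolation:
# ||A||_{S_q}^q <= ||A||_{S_p}^{p(2-q)/(2-p)} * ||A||_{S_2}^{2(q-p)/(2-p)}
# for 0 < p <= q <= 2.
rng = np.random.default_rng(3)

def schatten(A, r):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** r) ** (1 / r)

ok = True
for _ in range(1000):
    A = rng.standard_normal((5, 4))
    p = rng.uniform(0.2, 2.0)
    q = rng.uniform(p, 2.0)
    lhs = schatten(A, q) ** q
    rhs = (schatten(A, p) ** (p * (2 - q) / (2 - p))
           * schatten(A, 2) ** (2 * (q - p) / (2 - p)))
    ok &= lhs <= rhs * (1 + 1e-10)
print(ok)   # True
```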

10 Proofs of the lemmas

Proof of Lemma 1. First, observe that

𝐌S=supuT:|u|2=1|𝐌u|2mmax1jmsupuT:|u|2=1|uη¯j|,\|\mathbf{M}\|_{S_{\infty}}=\mathop{\sup_{u\in\mathbb{R}^{T}:}}_{|u|_{2}=1}|\mathbf{M}u|_{2}\leq\sqrt{m}\max_{1\leq j\leq m}\mathop{\sup_{u\in\mathbb{R}^{T}:}}_{|u|_{2}=1}|u^{\prime}\bar{\eta}_{j}|,

with vectors η¯j=N1i=1NξiXi(j,)\bar{\eta}_{j}=N^{-1}\sum_{i=1}^{N}\xi_{i}X_{i(j,\cdot)}. Consequently, for any t>0t>0,

(𝐌StmlogmN)\displaystyle\mathbb{P}\Biggl{(}\|\mathbf{M}\|_{S_{\infty}}\geq t\sqrt{\frac{m\log m}{N}}\Biggr{)} \displaystyle\leq (mmax1jm|η¯j|2tmlogmN)\displaystyle\mathbb{P}\Biggl{(}\sqrt{m}\max_{1\leq j\leq m}|\bar{\eta}_{j}|_{2}\geq t\sqrt{\frac{m\log m}{N}}\Biggr{)}
\displaystyle\leq mmax1jm(|η¯j|2tlogmN).\displaystyle m\max_{1\leq j\leq m}\mathbb{P}\Biggl{(}|\bar{\eta}_{j}|_{2}\geq t\sqrt{\frac{\log m}{N}}\Biggr{)}.

To proceed with the evaluation of the latter probability, we use the following concentration bound [Pinelis and Sakhanenko (1985)].

Lemma 8

Let ζ1,,ζN\zeta_{1},\ldots,\zeta_{N} be independent zero mean random variables in a separable Hilbert space \mathcal{H} such that

i=1N𝔼ζil12l!B2Ll2,l=2,3,,\sum_{i=1}^{N}\mathbb{E}\|\zeta_{i}\|_{\mathcal{H}}^{l}\leq\frac{1}{2}l!B^{2}L^{l-2},\qquad l=2,3,\ldots, (63)

with some finite constants B,L>0B,L>0. Then

(i=1Nζix)2exp(x22B2+2xL)x>0.\mathbb{P}\Biggl{(}\Biggl{\|}\sum_{i=1}^{N}\zeta_{i}\Biggr{\|}_{\mathcal{H}}\geq x\Biggr{)}\leq 2\exp\biggl{(}-\frac{x^{2}}{2B^{2}+2xL}\biggr{)}\qquad\forall x>0.

Setting ζi=ξiXi(j,)\zeta_{i}=\xi_{i}X_{i(j,\cdot)}, =T\mathcal{H}=\mathbb{R}^{T}, note first that, by the Bernstein condition (34),

i=1N𝔼ζil\displaystyle\sum_{i=1}^{N}\mathbb{E}\|\zeta_{i}\|_{\mathcal{H}}^{l} =\displaystyle= 𝔼|ξi|li=1N|Xi(j,)|2l\displaystyle\mathbb{E}|\xi_{i}|^{l}\sum_{i=1}^{N}\bigl{|}X_{i(j,\cdot)}\bigr{|}_{2}^{l}
\displaystyle\leq 12l!σ2Hl2(maxji=1N|Xi(j,)|22)maxi,j|Xi(j,)|2l2\displaystyle\frac{1}{2}l!\sigma^{2}H^{l-2}\Biggl{(}\max_{j}\sum_{i=1}^{N}\bigl{|}X_{i(j,\cdot)}\bigr{|}_{2}^{2}\Biggr{)}\max_{i,j}\bigl{|}X_{i(j,\cdot)}\bigr{|}_{2}^{l-2}
\displaystyle\leq 12l!B2Ll2,\displaystyle\frac{1}{2}l!B^{2}L^{l-2},

where B2=σ2Srow2NB^{2}=\sigma^{2}S_{\mathrm{row}}^{2}N and L=HrowHL=H_{\mathrm{row}}H, that is, condition (63) is satisfied. Now an application of Lemma 8 yields for any t>0t>0

(|η¯j|2tlogmN)\displaystyle\mathbb{P}\Biggl{(}|\bar{\eta}_{j}|_{2}\geq t\sqrt{\frac{\log m}{N}}\Biggr{)} =\displaystyle= (|1Ni=1NξiXi(j,)|2>tlogm)\mathbb{P}\Biggl{(}\Biggl{|}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\xi_{i}X_{i(j,\cdot)}\Biggr{|}_{2}>t\sqrt{\log m}\Biggr{)}
\displaystyle\leq 2exp(N(logm)t22B2+2tLNlogm)\displaystyle 2\exp\biggl{(}-\frac{N(\log m)t^{2}}{2B^{2}+2tL\sqrt{N\log m}}\biggr{)}
=\displaystyle= 2exp(N(logm)t22σ2Srow2N+2tLNlogm).\displaystyle 2\exp\biggl{(}-\frac{N(\log m)t^{2}}{2\sigma^{2}S_{\mathrm{row}}^{2}N+2tL\sqrt{N\log m}}\biggr{)}.

Define t=2Dσ2Srow2+2DLlogmNt=\sqrt{2D\sigma^{2}S_{\mathrm{row}}^{2}}+2DL\sqrt{\frac{\log m}{N}} for some D>1D>1. Then

t2B¯+L¯tD,where B¯=2σ2Srow2,L¯=2LlogmN.\frac{t^{2}}{\bar{B}+\bar{L}t}\geq D,\qquad\mbox{where }\bar{B}=2\sigma^{2}S_{\mathrm{row}}^{2},\bar{L}=2L\sqrt{\frac{\log m}{N}}.

With this choice of tt,

(|η¯j|2tlogmN)2exp(Dlogm)=2mD\mathbb{P}\Biggl{(}|\bar{\eta}_{j}|_{2}\geq t\sqrt{\frac{\log m}{N}}\Biggr{)}\leq 2\exp(-D\log m)=2m^{-D}

and therefore (𝐌Sτrow)2m1D\mathbb{P}(\|\mathbf{M}\|_{S_{\infty}}\geq\tau_{\mathrm{row}})\leq 2m^{1-D}, where

τrow=(2Dσ2Srow2+2DHrowHlogmN)mlogmN.\tau_{\mathrm{row}}=\Biggl{(}\sqrt{2D\sigma^{2}S_{\mathrm{row}}^{2}}+2DH_{\mathrm{row}}H\sqrt{\frac{\log m}{N}}\Biggr{)}\sqrt{\frac{m\log m}{N}}.

Similarly, using 𝐌S=sup|v|2=1|v𝐌|2\|\mathbf{M}\|_{S_{\infty}}=\sup_{|v|_{2}=1}|v^{\prime}\mathbf{M}|_{2}, and assuming (38) and (39), we get (𝐌Sτcol)2T1D\mathbb{P}(\|\mathbf{M}\|_{S_{\infty}}\geq\tau_{\mathrm{col}})\leq 2T^{1-D}, where

τcol=(2Dσ2Scol2+2DHcolHlogTN)TlogTN.\tau_{\mathrm{col}}=\Biggl{(}\sqrt{2D\sigma^{2}S_{\mathrm{col}}^{2}}+2DH_{\mathrm{col}}H\sqrt{\frac{\log T}{N}}\Biggr{)}\sqrt{\frac{T\log T}{N}}.

Proof of Lemma 3. The matrix 𝐌=1Ni=1NξiXi\mathbf{M}=\frac{1}{N}\sum_{i=1}^{N}\xi_{i}X_{i} is a sum of i.i.d. random matrices. Therefore, part (ii) of the lemma follows by direct application of the large deviations inequality of Nemirovski (2004).

To prove part (i) of the lemma, we use bounds on maximal eigenvalues of subgaussian matrices due to Mendelson, Pajor and Tomczak-Jaegermann (2007); see also Vershynin (2007). However, direct application of these bounds (based on the overall subgaussianity) does not lead to rates that are accurate enough for our purposes. We therefore need to refine the argument using the specific structure of the matrices. Note first that

𝐌S=maxv𝒮T1|𝐌v|2=maxu𝒮m1,v𝒮T1u𝐌v,\|\mathbf{M}\|_{S_{\infty}}=\max_{v\in\mathcal{S}^{T-1}}|\mathbf{M}v|_{2}=\max_{u\in\mathcal{S}^{m-1},v\in\mathcal{S}^{T-1}}u^{\prime}\mathbf{M}v,

where 𝒮m1\mathcal{S}^{m-1} is the unit sphere in m\mathbb{R}^{m}. Therefore, denoting by m\mathcal{M}_{m} and T\mathcal{M}_{T} the minimal 1/21/2-nets in Euclidean metric on 𝒮m1\mathcal{S}^{m-1} and 𝒮T1\mathcal{S}^{T-1}, respectively, we easily get

𝐌S2maxvT|𝐌v|24maxum,vT|u𝐌v|.\|\mathbf{M}\|_{S_{\infty}}\leq 2\max_{v\in\mathcal{M}_{T}}|\mathbf{M}v|_{2}\leq 4\max_{u\in\mathcal{M}_{m},v\in\mathcal{M}_{T}}|u^{\prime}\mathbf{M}v|.

Now, \operatornameCard(m)5m\operatorname{Card}(\mathcal{M}_{m})\leq 5^{m} [cf. Kolmogorov and Tikhomirov (1959)] so that by the union bound, for any τ>0\tau>0,

(𝐌Sτ)5m+Tmaxum,vT(|u𝐌v|τ/4).\mathbb{P}(\|\mathbf{M}\|_{S_{\infty}}\geq\tau)\leq 5^{m+T}\max_{u\in\mathcal{M}_{m},v\in\mathcal{M}_{T}}\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4). (64)

It remains to bound the last probability in (64) for fixed u,vu,v. Let us fix some u𝒮m1,v𝒮T1u\in\mathcal{S}^{m-1},v\in\mathcal{S}^{T-1} and introduce the random event

𝒜={1Ni=1N(uXiv)25(m+T)N}.\mathcal{A}=\Biggl{\{}\frac{1}{N}\sum_{i=1}^{N}(u^{\prime}X_{i}v)^{2}\leq\frac{5(m+T)}{N}\Biggr{\}}.

Note that 𝔼(uXiv)2=k=1ml=1Tuk2vl2(X1=ek(m)el(T))=(mT)1|u|22×|v|22=(mT)1\mathbb{E}(u^{\prime}X_{i}v)^{2}=\sum_{k=1}^{m}\sum_{l=1}^{T}u_{k}^{2}v_{l}^{2}\mathbb{P}(X_{1}=e_{k}(m)e_{l}^{\prime}(T))=(mT)^{-1}|u|_{2}^{2}\times|v|_{2}^{2}=(mT)^{-1}, and consider the zero-mean random variables ηi=(uXiv)2𝔼(uXiv)2=(uXiv)2(mT)1\eta_{i}=(u^{\prime}X_{i}v)^{2}-\mathbb{E}(u^{\prime}X_{i}v)^{2}=(u^{\prime}X_{i}v)^{2}-(mT)^{-1}. We have |ηi|2maxi(uXiv)22|u|22|v|22=2|\eta_{i}|\!\leq\!2\max_{i}(u^{\prime}X_{i}v)^{2}\!\leq\!2|u|_{2}^{2}|v|_{2}^{2}=2. Furthermore,

𝔼(ηi2)\displaystyle\mathbb{E}(\eta_{i}^{2}) \displaystyle\leq 𝔼(uXiv)4k=1ml=1Tuk4vl4(X1=ek(m)el(T))\displaystyle\mathbb{E}(u^{\prime}X_{i}v)^{4}\leq\sum_{k=1}^{m}\sum_{l=1}^{T}u_{k}^{4}v_{l}^{4}\mathbb{P}\bigl{(}X_{1}=e_{k}(m)e_{l}^{\prime}(T)\bigr{)}
=\displaystyle= (mT)1k=1muk4l=1Tvl4(mT)1.\displaystyle(mT)^{-1}\sum_{k=1}^{m}u_{k}^{4}\sum_{l=1}^{T}v_{l}^{4}\leq(mT)^{-1}.

Therefore, using Bernstein’s inequality and the condition (m+T)/N>(mT)1(m+T)/N>(mT)^{-1} we get

(𝒜c)\displaystyle\mathbb{P}(\mathcal{A}^{c}) \displaystyle\leq 2exp(N(4(m+T)/N)22(mT)1+(4/3)(4(m+T)/N))\displaystyle 2\exp\biggl{(}-\frac{N(4(m+T)/N)^{2}}{2(mT)^{-1}+(4/3)(4(m+T)/N)}\biggr{)}
\displaystyle\leq 2exp(2(m+T)),\displaystyle 2\exp\bigl{(}-2(m+T)\bigr{)},

where 𝒜c\mathcal{A}^{c} is the complement of 𝒜\mathcal{A}. We now bound the conditional probability

(|u𝐌v|τ/4|X1,,XN)=(|1Ni=1Nξi(uXiv)|τ/4|X1,,XN).\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4|X_{1},\dots,X_{N})=\mathbb{P}\Biggl{(}\Biggl{|}\frac{1}{N}\sum_{i=1}^{N}\xi_{i}(u^{\prime}X_{i}v)\Biggr{|}\geq\tau/4\bigl{|}X_{1},\dots,X_{N}\Biggr{)}.

Note that conditionally on X1,,XNX_{1},\dots,X_{N}, the ξi(uXiv)\xi_{i}(u^{\prime}X_{i}v) are independent zero-mean random variables with

i=1N𝔼(|ξi(uXiv)|l|X1,,XN)𝔼|ξ1|li=1N|uXiv|2l2,\sum_{i=1}^{N}\mathbb{E}(|\xi_{i}(u^{\prime}X_{i}v)|^{l}|X_{1},\ldots,X_{N})\leq\mathbb{E}|\xi_{1}|^{l}\sum_{i=1}^{N}|u^{\prime}X_{i}v|^{2}\qquad\forall l\geq 2,

where we used the fact that |uXiv|l2(|u|2|v|2)l2=1|u^{\prime}X_{i}v|^{l-2}\leq(|u|_{2}|v|_{2})^{l-2}=1 for l2l\geq 2. This and the Bernstein condition (34) yield that, for (X1,,XN)𝒜(X_{1},\dots,X_{N})\in\mathcal{A},

i=1N𝔼(|ξi(uXiv)|l|X1,,XN)l!2B2Hl2\sum_{i=1}^{N}\mathbb{E}(|\xi_{i}(u^{\prime}X_{i}v)|^{l}|X_{1},\dots,X_{N})\leq\frac{l!}{2}B^{2}H^{l-2}

with B2=5(m+T)σ2B^{2}=5(m+T)\sigma^{2}. Therefore, by Lemma 8, for (X1,,XN)𝒜(X_{1},\dots,X_{N})\in\mathcal{A} we have

(|u𝐌v|τ/4|X1,,XN)2exp(N2τ2/1610σ2(m+T)+NτH/2).\quad\quad\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4|X_{1},\dots,X_{N})\leq 2\exp\biggl{(}-\frac{N^{2}\tau^{2}/16}{10\sigma^{2}(m+T)+N\tau H/2}\biggr{)}.\hskip-10.0pt (66)

For τ\tau defined in (42) the last expression does not exceed 2exp(D(m+T))2\exp(-D(m+T)). Together with (64) and the bound on (𝒜c)\mathbb{P}(\mathcal{A}^{c}) above, this proves the lemma.


Proof of Lemma 2. We act as in the proof of Lemma 3 but since the matrices XiX_{i} are now deterministic, we do not need to introduce the event 𝒜\mathcal{A}. By the definition of ϕmax(1)\phi_{\max}(1),

1Ni=1N(uXiv)2=|(uv)|22ϕmax2(1)uvS22=ϕmax2(1)\frac{1}{N}\sum_{i=1}^{N}(u^{\prime}X_{i}v)^{2}=|\mathcal{L}(uv^{\prime})|_{2}^{2}\leq\phi_{\max}^{2}(1)\|uv^{\prime}\|_{S_{2}}^{2}=\phi_{\max}^{2}(1)

for all u𝒮m1,v𝒮T1u\in\mathcal{S}^{m-1},v\in\mathcal{S}^{T-1}. Hence, 1Ni=1Nξi(uXiv)\frac{1}{N}\sum_{i=1}^{N}\xi_{i}(u^{\prime}X_{i}v) is a zero-mean Gaussian random variable with variance not larger than ϕmax2(1)σ2/N\phi_{\max}^{2}(1)\sigma^{2}/N. Therefore,

(|u𝐌v|τ/4)2exp(Nτ232ϕmax2(1)σ2).\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4)\leq 2\exp\biggl{(}-\frac{N\tau^{2}}{32\phi_{\max}^{2}(1)\sigma^{2}}\biggr{)}.

For τ\tau as in (41) the last expression does not exceed 2exp(D(m+T))2\exp(-D(m+T)). Combining this with (64), we get the lemma.


Proof of Lemma 4. We proceed again as in the proofs of Lemmas 3 and 2. Denote by Ω\Omega the set of pairs (k,l)(k,l) such that {X1,,XN}={ek(m)el(T),(k,l)Ω}\{X_{1},\dots,X_{N}\}=\{e_{k}(m)e_{l}^{\prime}(T),(k,l)\in\Omega\} (recall that all XiX_{i} are distinct by assumption). Then

i=1N(uXiv)2=(k,l)Ωuk2vl2|u|22|v|22=1\sum_{i=1}^{N}(u^{\prime}X_{i}v)^{2}=\sum_{(k,l)\in\Omega}u_{k}^{2}v_{l}^{2}\leq|u|_{2}^{2}|v|_{2}^{2}=1 (67)

for any u𝒮m1,v𝒮T1u\in\mathcal{S}^{m-1},v\in\mathcal{S}^{T-1}. Hence, under the assumptions of part (i) of the lemma,

(|u𝐌v|τ/4)2exp(N2τ232σ2)\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4)\leq 2\exp\biggl{(}-\frac{N^{2}\tau^{2}}{32\sigma^{2}}\biggr{)}

which does not exceed 2exp(D(m+T))2\exp(-D(m+T)) for τ\tau defined in (44). Combining this with (64) we get part (i) of the lemma. To prove part (ii) we note that, as in the proof of Lemma 3, |uXiv|l21|u^{\prime}X_{i}v|^{l-2}\leq 1 for l2l\geq 2. This and (67) yield

i=1N𝔼(|ξi(uXiv)|l)l!2B2Hl2l2,\sum_{i=1}^{N}\mathbb{E}(|\xi_{i}(u^{\prime}X_{i}v)|^{l})\leq\frac{l!}{2}B^{2}H^{l-2}\qquad\forall l\geq 2,

with B2=σ2B^{2}=\sigma^{2}. Therefore, by Lemma 8, we have

(|u𝐌v|τ/4)2exp(N2τ2/162σ2+NτH/2),\mathbb{P}(|u^{\prime}\mathbf{M}v|\geq\tau/4)\leq 2\exp\biggl{(}-\frac{N^{2}\tau^{2}/16}{2\sigma^{2}+N\tau H/2}\biggr{)},

and we complete the proof of (ii) in the same way as in Lemmas 3 and 2.

Part (iii) follows by an application of Theorem 2.1, Tropp (2010), after replacing every XiX_{i} by its self-adjoint dilation [see Paulsen (1986)].

For the proof of Lemma 5 we will need some notation. The ppth Schatten class of M×MM\times M-matrices is denoted by SpMS_{p}^{M}, and we write

(SpM)={AM×M\dvtxASp1}\mathcal{B}(S_{p}^{M})=\{A\in\mathbb{R}^{M\times M}\dvtx\|A\|_{S_{p}}\leq 1\}

for the corresponding closed Schatten-pp unit ball in M×M\mathbb{R}^{M\times M}. For any pseudo-metric space (𝒯,d)(\mathcal{T},d) and any ε>0\varepsilon>0, we define the covering number

𝒩(𝒯,d,ε)=min{\operatornameCard(𝒯0)\dvtx𝒯0𝒯 and infs𝒯0d(t,s)ε for all t𝒯}.\mathcal{N}(\mathcal{T},d,\varepsilon)=\min\Bigl{\{}\operatorname{Card}(\mathcal{T}_{0})\dvtx\mathcal{T}_{0}\subset\mathcal{T}\mbox{ and }\inf_{s\in\mathcal{T}_{0}}d(t,s)\leq\varepsilon\mbox{ for all }t\in\mathcal{T}\Bigr{\}}.

In other words, 𝒩(𝒯,d,ε)\mathcal{N}(\mathcal{T},d,\varepsilon) is the smallest number of closed balls of radius ε\varepsilon in the metric dd needed to cover the set 𝒯\mathcal{T}. We will sometimes write 𝒩(𝒯,,ε)\mathcal{N}(\mathcal{T},\|\cdot\|,\varepsilon) instead of 𝒩(𝒯,d,ε)\mathcal{N}(\mathcal{T},d,\varepsilon) if the metric dd is associated with the norm \|\cdot\|. The empirical norm 2,N\|\cdot\|_{2,N} corresponds to d^2,N\hat{d}_{2,N}, that is, for all AM×MA\in\mathbb{R}^{M\times M},

A2,N2=1Nj=1N\operatornametr(AXj)2.\|A\|_{2,N}^{2}=\frac{1}{N}\sum_{j=1}^{N}\operatorname{tr}(A^{\prime}X_{j})^{2}.
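To fix ideas, covering numbers in the empirical metric can be bounded empirically by a greedy construction; the Python sketch below is illustrative only (arbitrary random masks and a random finite set of matrices) and produces an ε-net whose cardinality is an upper bound on the corresponding covering number.

```python
import numpy as np

# Illustrative greedy construction of an eps-net for a finite set of matrices in
# the empirical metric induced by ||.||_{2,N}; its size upper-bounds N(T, d, eps).
rng = np.random.default_rng(4)
m, T_dim, N, n_points, eps = 3, 3, 10, 200, 0.8

X = rng.standard_normal((N, m, T_dim))          # masks X_1, ..., X_N
points = rng.standard_normal((n_points, m, T_dim))

def emp_norm(A):
    return np.sqrt(np.mean(np.einsum('ij,nij->n', A, X) ** 2))

centers = []
for A in points:                                # keep A as a new center if it is
    if all(emp_norm(A - c) > eps for c in centers):   # eps-far from all centers
        centers.append(A)
print(len(centers))                             # size of the constructed eps-net
```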

Proof of Lemma 5. Let us first assume that m=TMm=T\equiv M. Since

supBM×M|(1/N)i=1Nξi\operatornametr(BXi)B2,N1p/(2p)BSpp/(2p)|=supB(SpM)|(1/N)i=1Nξi\operatornametr(BXi)B2,N1p/(2p)|,\sup_{B\in\mathbb{R}^{M\times M}}\biggl{|}\frac{(1/\sqrt{N})\sum_{i=1}^{N}\xi_{i}\operatorname{tr}(B^{\prime}X_{i})}{\|B\|_{2,N}^{1-p/(2-p)}\|B\|_{S_{p}}^{p/(2-p)}}\biggr{|}=\sup_{B\in\mathcal{B}(S_{p}^{M})}\biggl{|}\frac{(1/\sqrt{N})\sum_{i=1}^{N}\xi_{i}\operatorname{tr}(B^{\prime}X_{i})}{\|B\|_{2,N}^{1-p/(2-p)}}\biggr{|},

the expression on the LHS of (46) is not greater than

MpNd^2,N(A^,A)1p/(2p)A^ASpp/(2p)\displaystyle\frac{\sqrt{M}}{\sqrt{p}\sqrt{N}}\hat{d}_{2,N}(\hat{A},A^{*})^{1-p/(2-p)}\|\hat{A}-A^{*}\|_{S_{p}}^{p/(2-p)}
×supB(SpM)|(M/p)(p2)/(2p)N1/2i=1Nξi\operatornametr(BXi)((M/p)(p2)/(2p)B2,N)1p/(2p)|.\displaystyle\qquad{}\times\sup_{B\in\mathcal{B}(S_{p}^{M})}\biggl{|}\frac{(M/p)^{(p-2)/(2p)}N^{-1/2}\sum_{i=1}^{N}\xi_{i}\operatorname{tr}(B^{\prime}X_{i})}{((M/p)^{(p-2)/(2p)}\|B\|_{2,N})^{1-p/(2-p)}}\biggr{|}.

Due to the linear dependence in MM of the ε\varepsilon-entropies of the quasi-convex Schatten class embeddings SpMS2MS_{p}^{M}\hookrightarrow S_{2}^{M} (cf. Corollary 7) and the fact that the required bound should be uniform in MM and in pp for p0p\searrow 0, we introduced an additional weighting by (M/p)(p2)/2p(M/p)^{(p-2)/2p}. Now define

𝒢M,p={AM×M\dvtx(M/p)(2p)/(2p)A(SpM)}.\mathcal{G}_{M,p}=\bigl{\{}A\in\mathbb{R}^{M\times M}\dvtx(M/p)^{(2-p)/(2p)}A\in\mathcal{B}(S_{p}^{M})\bigr{\}}.

By the entropy bound of Corollary 7 and the uniform boundedness condition (5),

log𝒩(𝒢M,p,d^2,N,ε)log𝒩(𝒢M,p,c0S2,ε)pα0(p)(ε/c0)2p/(2p)\log\mathcal{N}(\mathcal{G}_{M,p},\hat{d}_{2,N},\varepsilon)\leq\log\mathcal{N}\bigl{(}\mathcal{G}_{M,p},\sqrt{c_{0}}\|\cdot\|_{S_{2}},\varepsilon\bigr{)}\leq p\alpha_{0}(p)\bigl{(}\varepsilon/\sqrt{c_{0}}\bigr{)}^{-2p/(2-p)}

whence

0δlog𝒩(𝒢M,p,d^2,N,ε)𝑑εc0p/(2(2p))pα0(p)2p22pδ1p/(2p).\quad\quad\int_{0}^{\delta}\sqrt{\log\mathcal{N}(\mathcal{G}_{M,p},\hat{d}_{2,N},\varepsilon)}\,d\varepsilon\leq c_{0}^{p/(2(2-p))}p\alpha_{0}(p)\frac{2-p}{2-2p}\delta^{1-p/(2-p)}.\hskip-10.0pt (68)

We remark that due to the order specification of α0\alpha_{0} in Corollary 7, the expression

c0p/(2(2p))pα0(p)2p22pc_{0}^{p/(2(2-p))}p\alpha_{0}(p)\frac{2-p}{2-2p} (69)

is uniformly bounded as long as pp stays uniformly bounded away from 11. Note that for p=1p=1 the entropy integral on the LHS in (68) does not converge.

Claim 1.

For any q(0,1)q\in(0,1), there exist constants c(q)c(q) and c(q)c^{\prime}(q), such that for all 0<pq0<p\leq q, all 0<δc00<\delta\leq\sqrt{c_{0}} and uniformly in MM and NN,

(supB𝒢M,p:B2,Nδ|1Nj=1Nξj\operatornametr(XjB)|T)c(q)exp(T2c(q)2δ2)\mathbb{P}\Biggl{(}\mathop{\sup_{B\in\mathcal{G}_{M,p}:}}_{\|B\|_{2,N}\leq\delta}\Biggl{|}\frac{1}{\sqrt{N}}\sum_{j=1}^{N}\xi_{j}\operatorname{tr}(X_{j}^{\prime}B)\Biggr{|}\geq T\Biggr{)}\leq c(q)\exp\biggl{(}-\frac{T^{2}}{c(q)^{2}\delta^{2}}\biggr{)} (70)

for all Tc(q)δ1p/(2p)T\geq c^{\prime}(q)\delta^{1-p/(2-p)}.

Proof.

The bound is essentially stated in van de Geer (2000) as Lemma 3.2 [further referred to as VG(00)]. The constant in VG(00) depends neither on the 2,N\|\cdot\|_{2,N}-diameter of the function class nor on the function class itself and is valid, in particular, for ε=0\varepsilon=0, in the notation of VG(00). The uniformity in 0<pq0<p\leq q follows from the uniform boundedness of (69) over p(0,q]p\in(0,q]. The required case corresponds to K=K=\infty in the notation of VG(00). Its proof follows by taking ε=0\varepsilon=0 and applying the theorem of monotone convergence as KK\rightarrow\infty, since the RHS of the inequality is independent of KK.

Claim 2.

For any q(0,1)q\in(0,1), there exists a constant C(q)C(q) such that for any 0<pq0<p\leq q

(supB𝒢M,p|(1/N)j=1Nξj\operatornametr(BXj)B2,N1p/(2p)|T)C(q)exp(T2M/C(q)2)\quad\quad\mathbb{P}\biggl{(}\sup_{B\in\mathcal{G}_{M,p}}\biggl{|}\frac{(1/\sqrt{N})\sum_{j=1}^{N}\xi_{j}\operatorname{tr}(B^{\prime}X_{j})}{\|B\|_{2,N}^{1-p/(2-p)}}\biggr{|}\geq T\biggr{)}\leq C(q)\exp\bigl{(}-T^{2}M/C(q)^{2}\bigr{)}\hskip-15.0pt (71)

for all TC(q)T\geq C(q).

Proof.

First, observe that

supA𝒢M,pA2,N\displaystyle\sup_{A\in\mathcal{G}_{M,p}}\|A\|_{2,N} \displaystyle\leq c0supA𝒢M,pAS2\displaystyle\sqrt{c_{0}}\sup_{A\in\mathcal{G}_{M,p}}\|A\|_{S_{2}}
\displaystyle\leq c0(M/p)(p2)/(2p)supA(S2M)AS2=c0(M/p)(p2)/(2p),\displaystyle\sqrt{c_{0}}(M/p)^{(p-2)/(2p)}\sup_{A\in\mathcal{B}(S_{2}^{M})}\|A\|_{S_{2}}=\sqrt{c}_{0}(M/p)^{(p-2)/(2p)},

where the last inequality follows from (SpM)(S2M)\mathcal{B}(S_{p}^{M})\subset\mathcal{B}(S_{2}^{M}). Define the decomposition of 𝒢M,p\mathcal{G}_{M,p}

𝒢M,p(k)={A𝒢M,p\dvtx(1/2)kc0(M/p)(p2)/(2p)\displaystyle\mathcal{G}_{M,p}^{(k)}=\bigl{\{}A\in\mathcal{G}_{M,p}\dvtx(1/2)^{k}\sqrt{c_{0}}(M/p)^{(p-2)/(2p)}
A2,N(1/2)k1c0(M/p)(p2)/(2p)},\displaystyle\hphantom{\mathcal{G}_{M,p}^{(k)}=\bigl{\{}A\in\mathcal{G}_{M,p}\dvtx}\qquad\leq\|A\|_{2,N}\leq(1/2)^{k-1}\sqrt{c_{0}}(M/p)^{(p-2)/(2p)}\bigr{\}},
\eqntextk.\displaystyle\eqntext{k\in\mathbb{N}.} (72)

Then, by peeling off the class 𝒢M,p\mathcal{G}_{M,p}, we obtain together with Claim 1, for all Tc(q)T\geq c^{\prime}(q),

(supB𝒢M,p|(1/N)j=1Nξj\operatornametr(BXj)B2,N1p/(2p)|T)\displaystyle\mathbb{P}\biggl{(}\sup_{B\in\mathcal{G}_{M,p}}\biggl{|}\frac{(1/\sqrt{N})\sum_{j=1}^{N}\xi_{j}\operatorname{tr}(B^{\prime}X_{j})}{\|B\|_{2,N}^{1-p/(2-p)}}\biggr{|}\geq T\biggr{)}
k=1(supB𝒢M,p(k)|1Nj=1Nξj\operatornametr(BXj)|\displaystyle\qquad\leq\sum_{k=1}^{\infty}\mathbb{P}\Biggl{(}\sup_{B\in\mathcal{G}_{M,p}^{(k)}}\Biggl{|}\frac{1}{\sqrt{N}}\sum_{j=1}^{N}\xi_{j}\operatorname{tr}(B^{\prime}X_{j})\Biggr{|}
T((1/2)kc0(M/p)(p2)/(2p))1p/(2p))\displaystyle\qquad\hphantom{\qquad\leq\sum_{k=1}^{\infty}\mathbb{P}\Biggl{(}}\geq T\bigl{(}(1/2)^{k}\sqrt{c_{0}}(M/p)^{(p-2)/(2p)}\bigr{)}^{1-p/(2-p)}\Biggr{)}
k=1c(q)exp(T2(1/2)2((1/2)kc0(M/p)(p2)/(2p))2p/(2p)c(q)2)\displaystyle\qquad\leq\sum_{k=1}^{\infty}c(q)\exp\biggl{(}-\frac{T^{2}(1/2)^{2}((1/2)^{k}\sqrt{c_{0}}(M/p)^{(p-2)/(2p)})^{-2p/(2-p)}}{c(q)^{2}}\biggr{)}
k=1c(q)exp(T2M2k(2p)/(2p)C0(q)4pc(q)2)\displaystyle\qquad\leq\sum_{k=1}^{\infty}c(q)\exp\biggl{(}-\frac{T^{2}M2^{k(2p)/(2-p)}C_{0}(q)}{4pc(q)^{2}}\biggr{)} (73)

with the definition

C0(q)=inf0<pqc0p/(2p).C_{0}(q)=\inf_{0<p\leq q}c_{0}^{-p/(2-p)}.

It remains to note that the last sum in (73) is bounded by C(q)exp(T2M/C(q)2)C(q)\exp(-T^{2}M/C(q)^{2}) uniformly in 0<pq0<p\leq q whenever TC(q)T\geq C(q) for some suitable constant C(q)C(q). This follows from the fact that

k=1exp(p12k(2p)/(2p))k=11p12k(2p)/(2p)+1p1(1/2)(2p)/(2p),\sum_{k=1}^{\infty}\exp\bigl{(}-p^{-1}2^{k(2p)/(2-p)}\bigr{)}\leq\sum_{k=1}^{\infty}\frac{1}{p^{-1}2^{k(2p)/(2-p)}+1}\leq\frac{p}{1-(1/2)^{(2p)/(2-p)}},

and the latter expression is bounded uniformly in 0<pq0<p\leq q.

In particular, the result reveals that the LHS of (46) is bounded by

d^2,N(A^,A)1p/(2p)A^ASpp/(2p)ϑ/p(MN)1/2\hat{d}_{2,N}(\hat{A},A^{*})^{1-p/(2-p)}\|\hat{A}-A^{*}\|_{S_{p}}^{p/(2-p)}\sqrt{\vartheta/p}\biggl{(}\frac{M}{N}\biggr{)}^{1/2} (74)

with probability at least 1Cexp(ϑM/C2)1-C\exp(-\vartheta M/C^{2}) for any ϑC(q)\sqrt{\vartheta}\geq C(q).

We now use the following simple consequence of the concavity of the logarithm which is stated, for instance, in Tsybakov and van de Geer (2005) (Lemma 5).

Lemma 9

For any positive vv, tt and any κ1\kappa\geq 1, δ>0\delta>0 we have

vt1/(2κ)(δ/2)t+cκδ1/(2κ1)v2κ/(2κ1),vt^{1/(2\kappa)}\leq(\delta/2)t+c_{\kappa}\delta^{-1/(2\kappa-1)}v^{2\kappa/(2\kappa-1)},

where cκ=(2κ1)(2κ)κ1/(2κ1)c_{\kappa}=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)}.
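As a sanity check, the inequality of Lemma 9 with the constant c_κ exactly as written can be verified numerically on a random grid; the Python lines below are illustrative only and do not address sharpness of the constant.

```python
import numpy as np

# Illustrative numerical check of Lemma 9 on a random grid of (v, t, delta, kappa).
rng = np.random.default_rng(5)

def c_kappa(k):
    return (2 * k - 1) * (2 * k) * k ** (-1 / (2 * k - 1))

ok = True
for _ in range(10000):
    v, t, delta = rng.uniform(0.01, 5.0, size=3)
    k = rng.uniform(1.0, 5.0)
    lhs = v * t ** (1 / (2 * k))
    rhs = (delta / 2) * t + c_kappa(k) * delta ** (-1 / (2 * k - 1)) * v ** (2 * k / (2 * k - 1))
    ok &= lhs <= rhs + 1e-12
print(ok)   # True
```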

Taking in Lemma 9

t=\hat{d}_{2,N}(\hat{A},A^{*})^{2},\qquad v=\|\hat{A}-A^{*}\|_{S_{p}}^{p/(2-p)}\sqrt{\vartheta/p}\biggl(\frac{M}{N}\biggr)^{1/2},

and \kappa=(2-p)/(2-2p) shows that for any \delta>0

(74)\leq(\delta/2)\hat{d}_{2,N}(\hat{A},A^{*})^{2}+\tau_{7}\delta^{p-1}\|\hat{A}-A^{*}\|_{S_{p}}^{p}

with probability at least 1-C\exp(-\vartheta M/C^{2}).
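For the reader's convenience, the exponent bookkeeping behind this application of Lemma 9 is the following elementary computation: with \kappa=(2-p)/(2-2p)\geq 1 we have

\frac{1}{2\kappa}=\frac{1-p}{2-p},\qquad\frac{2\kappa}{2\kappa-1}=2-p,\qquad\frac{1}{2\kappa-1}=1-p,

so that t^{1/(2\kappa)}=\hat{d}_{2,N}(\hat{A},A^{*})^{1-p/(2-p)}, \delta^{-1/(2\kappa-1)}=\delta^{p-1}, which is the power of \delta appearing in the last display, and

v^{2\kappa/(2\kappa-1)}=\|\hat{A}-A^{*}\|_{S_{p}}^{p}\biggl(\frac{\vartheta M}{pN}\biggr)^{(2-p)/2}.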

The case m\not=T can be deduced from the above result by the following observation. For any matrix B=(b_{ij})\in\mathbb{R}^{m\times T}, define the extension \tilde{B}=(\tilde{b}_{ij})\in\mathbb{R}^{M\times M} with M=\max(m,T) as follows: \tilde{b}_{ij}=b_{ij} for 1\leq i\leq m, 1\leq j\leq T, and \tilde{b}_{ij}=0 otherwise. Then one easily checks that \|\tilde{B}\|_{S_{p}}=\|B\|_{S_{p}} for all p\in[0,\infty]. Furthermore, \operatorname{tr}(B^{\prime}X_{i})=\operatorname{tr}(\tilde{B}^{\prime}\tilde{X}_{i}) and

\sup_{A\in\mathbb{R}^{M\times M}\setminus\{0\}}\frac{N^{-1}\sum_{k=1}^{N}\operatorname{tr}(\tilde{X}_{k}^{\prime}A)^{2}}{\|A\|_{S_{2}}^{2}}=\sup_{A\in\mathbb{R}^{m\times T}\setminus\{0\}}\frac{\|A\|_{2,N}^{2}}{\|A\|_{S_{2}}^{2}}\leq c_{0}.

Consequently, the result follows from the proof already established for the case m=T.
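As an elementary check of the zero-padding step (written with possibly empty zero blocks, depending on whether M=m or M=T),

\tilde{B}=\begin{pmatrix}B&0\\0&0\end{pmatrix}\in\mathbb{R}^{M\times M},\qquad\tilde{B}^{\prime}\tilde{B}=\begin{pmatrix}B^{\prime}B&0\\0&0\end{pmatrix},

so the nonzero singular values of \tilde{B} coincide with those of B, whence \|\tilde{B}\|_{S_{p}}=\|B\|_{S_{p}}; moreover, \operatorname{tr}(\tilde{B}^{\prime}\tilde{X}_{i}) involves only the upper-left m\times T block of \tilde{X}_{i}, which is X_{i}.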

11 Entropy numbers for quasi-convex Schatten class embeddings

Here we derive bounds for the kth entropy numbers of the embeddings S_{p}^{M}\hookrightarrow S_{2}^{M} for 0<p<1, where S_{p}^{M} denotes the pth Schatten class of real M\times M-matrices. Corresponding results for the l_{p}^{M}\hookrightarrow l_{2}^{M}-embeddings were first given by Edmunds and Triebel (1989), but their proof does not carry over to the Schatten spaces. Pajor (1998) provides bounds for the S_{p}^{M}\hookrightarrow S_{2}^{M}-embeddings in the convex case p\geq 1. His approach is based on trace duality (Hölder's inequality for p^{-1}+q^{-1}=1) and the geometric formulation of Sudakov's minoration

\varepsilon\sqrt{\log\mathcal{N}(A,|\cdot|_{2},\varepsilon)}\leq c\,\mathbb{E}\sup_{t\in A}\langle G,t\rangle

for some positive constant c, with a d-dimensional standard Gaussian vector G and an arbitrary subset A of \mathbb{R}^{d}. Here |\cdot|_{2} is the Euclidean norm in \mathbb{R}^{d} and \langle\cdot,\cdot\rangle is the corresponding scalar product. Guédon and Litvak (2000) derive a slightly sharper bound for the l_{p}\hookrightarrow l_{q}-embeddings than Edmunds and Triebel (1989) with a different technique. In addition, they prove lower bounds. We adjust their ideas concerning finite \ell_{p} spaces to the nonconvex Schatten spaces.

We denote by e_{k}(\mathit{id}_{p,r}^{M}) the kth entropy number of the embedding S_{p}^{M}\hookrightarrow S_{r}^{M} for 0<p<r\leq\infty, that is, the infimum of all \varepsilon>0 such that there exist 2^{k-1} balls in S_{r}^{M} of radius \varepsilon that cover \mathcal{B}(S_{p}^{M}). For the general definition of the kth entropy numbers e_{k}(T\colon F\rightarrow E) of bounded linear operators T between quasi-Banach spaces F and E, we refer to Edmunds and Triebel (1996).

Recall that a homogeneous nonnegative functional \|\cdot\| is called a C-quasi-norm if it satisfies, for all x,y, the inequality \|x+y\|\leq C\max(\|x\|,\|y\|). In particular, any p-norm is a C-quasi-norm with C=2^{1/p} [cf., e.g., Edmunds and Triebel (1996), page 2]. We will use the following lemma.
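For instance, the last claim is the one-line consequence of the p-triangle inequality:

\|x+y\|^{p}\leq\|x\|^{p}+\|y\|^{p}\leq 2\max(\|x\|,\|y\|)^{p},\qquad\mbox{hence}\qquad\|x+y\|\leq 2^{1/p}\max(\|x\|,\|y\|).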

Lemma 10 (Guédon and Litvak (2000))

Assume that \|\cdot\|_{i} are symmetric C_{i}-quasi-norms on \mathbb{R}^{n} for i=0,1, and that for some \theta\in(0,1), \|\cdot\|_{\theta} is a quasi-norm on \mathbb{R}^{n} such that \|x\|_{\theta}\leq\|x\|_{0}^{\theta}\|x\|_{1}^{1-\theta} for all x\in\mathbb{R}^{n}. Then for any quasi-normed space F, any linear operator T\colon F\rightarrow\mathbb{R}^{n}, and all integers k and m, we have

e_{m+k-1}(T\colon F\rightarrow E_{\theta})\leq\bigl(C_{0}e_{m}(T\colon F\rightarrow E_{0})\bigr)^{\theta}\bigl(C_{1}e_{k}(T\colon F\rightarrow E_{1})\bigr)^{1-\theta},

where E_{t} stands for \mathbb{R}^{n} equipped with the quasi-norm \|\cdot\|_{t}, t\in\{0,\theta,1\}.

Guédon and Litvak (2000) did not specify the notion of symmetry they used. So we have to clarify that here a (quasi-)norm \|\cdot\| is called symmetric if (\mathbb{R}^{n},\|\cdot\|) is isometrically isomorphic to a symmetrically (quasi-)normed operator ideal. This includes the diagonal operator spaces (finite \ell_{p}) as well as the Schatten spaces. The proof of Lemma 10 follows the lines of Pietsch (1980), Proposition 12.1.12, replacing the triangle inequality by the quasi-triangle inequality. Recall that the Schatten classes S_{p} form interpolation couples like their commutative analogs \ell_{p}.

Lemma 11 (Interpolation inequality)

For 0<p<q<r<\infty, let \theta\in[0,1] be such that

\frac{\theta}{p}+\frac{1-\theta}{r}=\frac{1}{q}.

Then, for all A\in\mathbb{R}^{m\times T},

\|A\|_{S_{q}}\leq\|A\|_{S_{p}}^{\theta}\|A\|_{S_{r}}^{1-\theta}.

The proof is immediate in view of the inequalities

\sum_{j}a_{j}^{q}=\sum_{j}a_{j}^{\theta q}a_{j}^{(1-\theta)q}\leq\biggl(\sum_{j}a_{j}^{p}\biggr)^{\theta q/p}\biggl(\sum_{j}a_{j}^{r}\biggr)^{(1-\theta)q/r}

valid for any nonnegative a_{j}'s.
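For completeness, the displayed bound is Hölder's inequality applied, with the singular values a_{j} of A, to the exponents p/(\theta q) and r/((1-\theta)q), which are conjugate because

\frac{\theta q}{p}+\frac{(1-\theta)q}{r}=q\biggl(\frac{\theta}{p}+\frac{1-\theta}{r}\biggr)=\frac{q}{q}=1;

then \sum_{j}a_{j}^{q}=\|A\|_{S_{q}}^{q}, \sum_{j}a_{j}^{p}=\|A\|_{S_{p}}^{p} and \sum_{j}a_{j}^{r}=\|A\|_{S_{r}}^{r} give Lemma 11.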

Proposition 1 (Entropy numbers)

Let 0<p<1 and p<r\leq\infty. Then there exists an absolute constant \beta, independent of p and r, such that for all integers k and M we have

e_{k}(\mathit{id}_{p,r}^{M})\leq\min\biggl\{1,\alpha(\beta,p,r)\biggl(\frac{M}{k}\biggr)^{1/p-1/r}\biggr\}

with

\alpha(\beta,p,r)\leq 2^{1+1/r}\biggl(\frac{\beta}{p}\biggr)^{1/p-1/r}\biggl(\frac{1}{1-p}\biggr)^{(1/p-1)(1/p-1/r)}.
Proof.

The fact that e_{k}(\mathit{id}_{p,r}^{M}) is bounded by 1 is obvious, since \mathcal{B}(S_{p}^{M})\subset\mathcal{B}(S_{r}^{M}). Consider the other case. We start with r=\infty and then extend the result to r<\infty by interpolation. Fix some number L>M and let D=D(M,L,p) be the smallest constant which satisfies, for all 1\leq k\leq L,

e_{k}(\mathit{id}_{p,\infty}^{M})\leq D\biggl(\frac{M}{k}\biggr)^{1/p}. \qquad (75)

Let us show that \alpha=\sup_{M,L}D(M,L,p) is finite. Since \|\cdot\|_{S_{p}}, p<1, can be viewed as a quasi-norm on \mathbb{R}^{M^{2}} (isomorphic to \mathbb{R}^{M\times M}), Lemma 10 applies with F=E_{0}=S_{p}^{M}, E_{1}=S_{\infty}^{M}, \theta=p, E_{\theta}=S_{1}^{M} and m=1. This gives

e_{k}(\mathit{id}_{p,1}^{M})\leq 4\bigl(e_{k}(\mathit{id}_{p,\infty}^{M})\bigr)^{1-p}. \qquad (76)

Here the factor 4 follows from the relations C_{1}=2 and C_{0}^{p}\leq 2. Now, (76) and the factorization theorem for entropy numbers of bounded linear operators between quasi-Banach spaces [see, e.g., Edmunds and Triebel (1996), page 8], with factorization via S_{1}^{M}, lead to the bound

e_{k}(\mathit{id}_{p,\infty}^{M})\leq e_{[(1-p)k]}(\mathit{id}_{p,1}^{M})\,e_{[pk]}(\mathit{id}_{1,\infty}^{M})\leq 4\bigl(e_{[(1-p)k]}(\mathit{id}_{p,\infty}^{M})\bigr)^{1-p}e_{[pk]}(\mathit{id}_{1,\infty}^{M}), \qquad (77)

where, for any x\in(0,\infty), [x] denotes the smallest integer which is larger than or equal to x. Proposition 5 of Pajor (1998) entails \log\mathcal{N}(\mathcal{B}(S_{1}^{M}),\|\cdot\|_{S_{\infty}},\varepsilon)\leq cM/\varepsilon for all \varepsilon>0, and hence

e_{k}(\mathit{id}_{1,\infty}^{M})\leq c^{\prime}M/k \qquad (78)

with constants c and c^{\prime} independent of M, \varepsilon and k. Note that, in contrast to the l_{1}^{M}\hookrightarrow l_{\infty}^{M}-embedding, for which the kth entropy numbers are bounded by c^{\prime\prime}k^{-1}\log(1+M/k) with some c^{\prime\prime}>0 and \log_{2}M\leq k\leq M [see, e.g., Edmunds and Triebel (1996), page 98], the upper bound in (78) depends on M not logarithmically but linearly. Plugging (75) and (78) into (77) yields

e_{k}(\mathit{id}_{p,\infty}^{M})\leq 4\biggl(D\biggl(\frac{M}{(1-p)k}\biggr)^{1/p}\biggr)^{1-p}\frac{c^{\prime}M}{pk}=\frac{4c^{\prime}}{p}\biggl(\frac{1}{1-p}\biggr)^{(1-p)/p}D^{1-p}\biggl(\frac{M}{k}\biggr)^{1/p}.

Thus, by definition of D,

D^{p}\leq\frac{4c^{\prime}}{p}\biggl(\frac{1}{1-p}\biggr)^{(1-p)/p},

which shows that D is uniformly bounded in M and L. This proves the proposition for r=\infty.

Consider now the case r<\infty. In view of Lemma 11 with \theta=p/r, we can apply Lemma 10 with F=E_{0}=S_{p}^{M}, E_{1}=S_{\infty}^{M}, \theta=p/r, E_{\theta}=S_{r}^{M} and m=1. This yields

e_{k}(\mathit{id}_{p,r}^{M})\leq 2^{1+1/r}\bigl(e_{k}(\mathit{id}_{p,\infty}^{M})\bigr)^{1-p/r}\leq 2^{1+1/r}D^{1-p/r}\biggl(\frac{M}{k}\biggr)^{1/p-1/r}.
∎
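For convenience, we record the elementary bookkeeping that turns the bound on D into the constant \alpha(\beta,p,r) of the statement (with \beta=4c^{\prime}):

D\leq\biggl(\frac{\beta}{p}\biggr)^{1/p}\biggl(\frac{1}{1-p}\biggr)^{(1-p)/p^{2}},\qquad D^{1-p/r}\leq\biggl(\frac{\beta}{p}\biggr)^{1/p-1/r}\biggl(\frac{1}{1-p}\biggr)^{(1/p-1)(1/p-1/r)},

since (1-p/r)/p=1/p-1/r and (1-p)(1-p/r)/p^{2}=(1/p-1)(1/p-1/r); together with the last display of the proof this is exactly the bound stated for \alpha(\beta,p,r).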
Corollary 7

For any p\in(0,1), there exists a positive constant \alpha_{0}(p) such that for all integers M\geq 1 and any \varepsilon\in(0,1],

\log\mathcal{N}(\mathcal{B}(S_{p}^{M}),\|\cdot\|_{S_{2}},\varepsilon)\leq\alpha_{0}(p)M\varepsilon^{-2p/(2-p)}.

Moreover, \alpha_{0}(p)=\mathrm{O}(1/p) as p\searrow 0.

Proof.

The result follows by transforming the entropy number bound of Proposition 1 into an entropy bound. Specification of the constant in Proposition 1 yields

\alpha_{0}(p)=\mathrm{O}\biggl(\frac{\beta}{p}\biggl(1+\frac{p}{1-p}\biggr)^{(1-p)/p}\biggr)=\mathrm{O}(1/p)

as p\searrow 0.
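The transformation from entropy numbers to metric entropy used here can be sketched as follows, with absolute constants absorbed into \alpha_{0}(p): by Proposition 1 with r=2, e_{k}(\mathit{id}_{p,2}^{M})\leq\varepsilon as soon as k\geq M(\alpha(\beta,p,2)/\varepsilon)^{2p/(2-p)}, because 1/(1/p-1/2)=2p/(2-p). Hence

\log\mathcal{N}(\mathcal{B}(S_{p}^{M}),\|\cdot\|_{S_{2}},\varepsilon)\leq k\log 2\leq\bigl(M(\alpha(\beta,p,2)/\varepsilon)^{2p/(2-p)}+1\bigr)\log 2,

and, since (1/p-1/2)\cdot 2p/(2-p)=1, we have \alpha(\beta,p,2)^{2p/(2-p)}=\mathrm{O}\bigl((\beta/p)(1/(1-p))^{(1-p)/p}\bigr).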

Acknowledgment

We are grateful to Alain Pajor for pointing out reference Guédon and Litvak (2000).

References

  • (1) Abernethy, J., Bach, F., Evgeniou, T. and Vert, J.-P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res. 10 803–826.
  • (2) Amini, A. and Wainwright, M. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921. \MR2541450
  • (3) Argyriou, A., Evgeniou, T. and Pontil, M. (2008). Convex multi-task feature learning. Mach. Learn. 73 243–272.
  • (4) Argyriou, A., Micchelli, C. A. and Pontil, M. (2010). On spectral learning. J. Mach. Learn. Res. To appear. \MR2600635
  • (5) Argyriou, A., Micchelli, C. A., Pontil, M. and Ying, Y. (2008). A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20 (J.C. Platt, et al., eds.) 25–32. MIT Press, Cambridge, MA.
  • (6) Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res. 9 1019–1048. \MR2417263
  • (7) Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227. \MR2387969
  • (8) Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. \MR2533469
  • (9) Bunea, F., She, Y. and Wegkamp, M. H. (2010). Optimal selection of reduced rank estimators of high-dimensional matrices. Available at arXiv:1004.2995.
  • (10) Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. \MR2351101
  • (11) Cai, T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 2118–2144. \MR2676885
  • (12) Candès, E. J. and Plan, Y. (2010a). Matrix completion with noise. Proc. IEEE 98 925–936.
  • (13) Candès, E. J. and Plan, Y. (2010b). Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy random measurements. Available at arXiv:1001.0339.
  • (14) Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772. \MR2565240
  • (15) Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215. \MR2243152
  • (16) Candès, E. J. and Tao, T. (2009). The power of convex relaxation: Near-optimal matrix completion. Unpublished manuscript.
  • (17) McCarthy, C. A. (1967). C_{p}. Israel J. Math. 5 249–272. \MR0225140
  • (18) Dalalyan, A. and Tsybakov, A. (2008). Aggregation by exponential weighting, sharp oracle inequalities and sparsity. Mach. Learn. 72 39–61.
  • (19) Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object. J. Roy. Statist. Soc. Ser. B 54 41–81. \MR1157714
  • (20) Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Univ. Press, Cambridge. \MR1410258
  • (21) Edmunds, D. E. and Triebel, H. (1989). Entropy numbers and approximation numbers in function spaces. Proc. London Math. Soc. 58 137–152. \MR0969551
  • (22) Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975. \MR1329177
  • (23) Guédon, O. and Litvak, A. E. (2000). Euclidean projections of a p-convex body. In Geometric Aspects of Functional Analysis, Israel Seminar (GAFA) 1996–2000 (V. D. Milman and G. Schechtman, eds.). Lecture Notes in Mathematics 1745 95–108. Springer, Berlin. \MR1796715
  • (24) Gross, D. (2009). Recovering low-rank matrices from few coefficients in any basis. Available at arXiv:0910.1879.
  • (25) Keshavan, R. H., Montanari, A. and Oh, S. (2009). Matrix completion from noisy entries. Available at arXiv:0906.2027.
  • (26) Kolmogorov, A. N. and Tikhomirov, V. M. (1959). The \varepsilon-entropy and \varepsilon-capacity of sets in function spaces. Uspekhi Matem. Nauk 14 3–86. \MR0112032
  • (27) Koltchinskii, V. (2008). Oracle inequalities in empirical risk minimization and sparse recovery problems. Ecole d’Eté de Probabilités de Saint-Flour, Lecture Notes. Preprint.
  • (28) Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of COLT-2009.
  • (29) Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2010). Oracle inequalities and optimal inference under group sparsity. Available at arXiv:1007.1771.
  • (30) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. \MR2278363
  • (31) Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17 1248–1282. \MR2373017
  • (32) Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, NIPS-2009.
  • (33) Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low rank matrices with noise and high-dimensional scaling. Ann. Statist. To appear. Available at arXiv:0912.5100.
  • (34) Nemirovski, A. (2004). Regular Banach spaces and large deviations of random sums. Unpublished manuscript.
  • (35) Pajor, A. (1998). Metric entropy of the Grassmann manifold. Convex Geom. Anal. 34 181–188. \MR1665590
  • (36) Paulsen, V. I. (1986). Completely Bounded Maps and Dilations. In Pitman Research Notes in Mathematics 146. Longman, New York. \MR0868472
  • (37) Pietsch, A. (1980). Operator Ideals. Elsevier, Amsterdam. \MR0582655
  • (38) Pinelis, I. F. and Sakhanenko, A. I. (1985). Remarks on inequalities for the probabilities of large deviations. Theory Probab. Appl. 30 143–148. \MR0779438
  • (39) Ravikumar, P., Wainwright, M., Raskutti, G. and Yu, B. (2008). High-dimensional covariance estimation by minimizing \ell_{1}-penalized log-determinant divergence. Unpublished manuscript.
  • (40) Recht, B. (2009). A simpler approach to matrix completion. Available at arXiv:0910.0651.
  • (41) Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501. \MR2680543
  • (42) Rigollet, P. and Tsybakov, A. B. (2010). Exponential screening and optimal rates of sparse estimation. Available at arXiv:1003.2654.
  • (43) Rotfeld, S. Y. (1969). The singular numbers of the sum of completely continuous operators. In Topics in Mathematical Physics (M. S. Birman, ed.). Spectral Theory 3 73–78. English version published by Consultants Bureau, New York.
  • (44) Srebro, N., Rennie, J. and Jaakkola, T. (2005). Maximum margin matrix factorization. In Advances in Neural Information Processing Systems 17 (L. Saul, Y. Weiss and L. Bottou, eds.) 1329–1336. MIT Press, Cambridge, MA.
  • (45) Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Learning Theory, Proceedings of COLT-2005. Lecture Notes in Comput. Sci. 3559 545–560. Springer, Berlin. \MR2203286
  • (46) Tropp, J. A. (2010). User-friendly tail bounds for sums of random matrices. Available at arXiv:1004.4389.
  • (47) Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer, New York. \MR2724359
  • (48) Tsybakov, A. and van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224. \MR2195633
  • (49) van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge Univ. Press, Cambridge.
  • (50) Vershynin, R. (2007). Some problems in asymptotic convex geometry and random matrices motivated by numerical algorithms. In Banach Spaces and Their Applications in Analysis (B. Randrianantoanina and N. Randrianantoanina, eds.) 209–218. de Gruyter, Berlin. \MR2374708
  • (51) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563. \MR2274449