Estimation of high-dimensional low-rank matrices
Abstract
Suppose that we observe entries or, more generally, linear combinations of entries of an unknown $m\times T$-matrix $A_0$ corrupted by noise. We are particularly interested in the high-dimensional setting where the number $mT$ of unknown entries can be much larger than the sample size $n$. Motivated by several applications, we consider estimation of matrix $A_0$ under the assumption that it has small rank. This can be viewed as a dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-$p$ quasi-norm penalty term, $0<p\le 1$. We study these estimators under two possible assumptions—a modified version of the restricted isometry condition and a uniform bound on the ratio “empirical norm induced by the sampling operator/Frobenius norm.” The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-$q$ risk of the estimators, where $q\in[p,2]$. The rates that we obtain for the prediction risk are of the form $rm/n$ (for $m=T$), up to logarithmic factors, where $r$ is the rank of $A_0$. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the $k$th entropy numbers of the quasi-convex Schatten class embeddings $S_p^M\hookrightarrow S_2^M$, $p<1$, which are of independent interest.
doi: 10.1214/10-AOS860
Supported in part by the Grant ANR-06-BLAN-0194, ANR “Parcimonie” and by the PASCAL-2 Network of Excellence.
1 Introduction
Consider the observations $(X_i,Y_i)$, $i=1,\dots,n$, satisfying the model
(1) $\quad Y_i=\operatorname{tr}(X_i^{T}A_0)+\xi_i,\qquad i=1,\dots,n,$
where $X_i$ are given $m\times T$ matrices ($m$ rows, $T$ columns), $A_0\in\mathbb{R}^{m\times T}$ is an unknown matrix, $\xi_i$ are i.i.d. random errors, $\operatorname{tr}(B)$ denotes the trace of a square matrix $B$ and $X^{T}$ stands for the transpose of $X$. Our aim is to estimate the matrix $A_0$ and to predict the future $Y$-values based on the sample $(X_i,Y_i)$, $i=1,\dots,n$.
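To make the observation scheme concrete, here is a minimal simulation sketch of the trace regression model; the sizes, the rank and the noise level are arbitrary illustrative choices, and the dense Gaussian masks are used only for simplicity (the structured masks of the examples below are sparse).

```python
import numpy as np

rng = np.random.default_rng(0)
m, T, r, n, sigma = 20, 30, 2, 500, 0.1   # illustrative sizes, rank and noise level

# rank-r target matrix A0
A0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, T))

# masks X_i (dense Gaussian here purely for illustration)
X = rng.standard_normal((n, m, T))

# trace regression observations: Y_i = tr(X_i^T A0) + xi_i
Y = np.einsum('imt,mt->i', X, A0) + sigma * rng.standard_normal(n)
```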
We will call model (1) the trace regression model. Clearly, in the vector case it reduces to the standard regression model. The “design” matrices $X_i$ will be called masks. This name is motivated by the fact that we focus on the applications of trace regression where the $X_i$ are very sparse, that is, contain only a small percentage of nonzero entries. Therefore, multiplication of $A_0$ by $X_i$ masks most of the entries of $A_0$. The following two examples are of particular interest.
(i) Point masks. For some, typically small, integer the point masks are defined as elements of the set
where are the canonical basis vectors of . In particular, for the point masks are matrices that have only one nonzero entry, which equals 1. The problem of estimation of $A_0$ in this case becomes the problem of matrix completion; the observations are just some selected entries of $A_0$ corrupted by noise, and the aim is to reconstruct all the entries of $A_0$. The problem of matrix completion dates back at least to Srebro, Rennie and Jaakkola (2005), Srebro and Shraibman (2005) and is mainly motivated by applications in recommendation systems. We will analyze the following two special cases of matrix completion:
– USR (Uniform Sampling at Random) matrix completion. The masks $X_i$ are independent, uniformly distributed on the set of point masks with one nonzero entry and independent from the errors $\xi_i$.
– Collaborative sampling (CS) matrix completion. The masks $X_i$ (random or deterministic) belong to the set of point masks with one nonzero entry, are all distinct and independent from the errors $\xi_i$.
The CS matrix completion model is natural for describing recommendation systems where every user rates every product only once. The USR matrix completion can be used for transmission of a large-dimensional matrix through a noisy communication channel; only a chosen small number of entries is transmitted, and nevertheless the original matrix can be reconstructed by the receiver. An important feature of the real-world matrix completion problems is that the number $n$ of observed entries is much smaller than the size $mT$ of the matrix, whereas $mT$ can be very large. For example, $mT$ is of the order of hundreds of millions for the Netflix problem.
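The two sampling schemes can be illustrated as follows; the helper `point_mask` and all sizes are hypothetical illustration devices, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T, n = 50, 40, 300                     # illustrative dimensions, n much smaller than m*T

def point_mask(j, k, m, T):
    """Point mask with a single entry equal to 1 at position (j, k)."""
    X = np.zeros((m, T))
    X[j, k] = 1.0
    return X

# USR matrix completion: positions drawn uniformly at random (repetitions possible)
usr_masks = [point_mask(rng.integers(m), rng.integers(T), m, T) for _ in range(n)]

# CS matrix completion: all observed positions are distinct (each entry rated at most once)
positions = rng.choice(m * T, size=n, replace=False)
cs_masks = [point_mask(p // T, p % T, m, T) for p in positions]
```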
(ii) Column or row masks. If $X_i$ has only a small number of nonzero columns or rows, it is called a column or row mask, respectively. We suppose here that this number is much smaller than $m$ and $T$. A remarkable case is that of masks with a single nonzero column, covering the problem known in Statistics and Econometrics as longitudinal (or panel, or cross-section) data analysis and in Machine Learning as multi-task learning. In what follows, we will designate this problem as multi-task learning, to avoid ambiguity. In the simplest version of multi-task learning, we have $n=NT$ where $T$ is the number of tasks (for instance, in image detection each task is associated with a particular type of visual object, e.g., face, car, chair, etc.), and $N$ is the number of observations per task. The tasks are characterized by vectors of parameters $a_t\in\mathbb{R}^m$, $t=1,\dots,T$, which constitute the columns of matrix $A_0$:
The are column masks, each containing only one nonzero column (with the convention that is the th column):
The column is interpreted as the vector of predictor variables corresponding to th observation for the th task. Thus, for each there exists a pair with , such that
(2) |
If we denote by and the corresponding values and , then the trace regression model (1) can be written as a collection of standard regression models:
This is the usual formulation of the multi-task learning model in the literature.
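The following sketch illustrates the column-mask structure: placing the predictor vector of an observation in the column of its task makes the trace measurement coincide with the usual per-task inner product. The helper `column_mask` and the sizes are illustrative choices, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T = 10, 4                               # predictors per task, number of tasks
A0 = rng.standard_normal((m, T))           # column t holds the parameter vector of task t

def column_mask(x, t, T):
    """Column mask: predictor vector x placed in column t, all other columns zero."""
    X = np.zeros((len(x), T))
    X[:, t] = x
    return X

x, t = rng.standard_normal(m), 2           # one observation of task t with predictors x
X_i = column_mask(x, t, T)
# the trace regression response reduces to the usual per-task regression x^T a_t
assert np.isclose(np.trace(X_i.T @ A0), x @ A0[:, t])
```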
For both examples given above, the matrices are sparse in the sense that they have only a small portion of nonzero entries. On the other hand, such a sparsity property is not necessarily granted for the target matrix . Nevertheless, we can always characterize by its rank , and say that a matrix is sparse if it has small rank; cf. Recht, Fazel and Parrilo (2010). For example, the problem of estimation of a square matrix is a parametric problem which is formally of dimension but has only free parameters. If is small as compared to , then the intrinsic dimension of the problem is of the order . In other words, the rank sparsity assumption is a dimension reduction assumption. This assumption will be crucial for the interpretation of our results. Another sparsity assumption that we will consider is that the Schatten- norm of (see the definition in Section 2 below) is small for some . This is an analog of sparsity expressed in terms of the norm, , in vector estimation problems.
Estimation of high-dimensional matrices has been recently studied by several authors in settings different from ours [cf., e.g., Meinshausen and Bühlmann (2006), Bickel and Levina (2008), Ravikumar et al. (2008), Amini and Wainwright (2009), Cai, Zhang and Zhou (2010) and the references cited therein]. Most of the attention was devoted to estimation of a large covariance matrix or its inverse. In these papers, sparsity is characterized by the number of nonzero entries of a matrix.
Candès and Recht (2009), Candès and Tao (2009), Gross (2009), Recht (2009) considered the noiseless setting ($\xi_i\equiv 0$) of the matrix completion problem under the condition that the singular vectors of $A_0$ are sufficiently spread out on the unit sphere or “incoherent.” They focused on exact recovery of $A_0$. Until now, the sharpest results are those of Gross (2009) and Recht (2009), who showed that under the “incoherence condition” exact recovery is possible with high probability if with some constant when we observe entries of a matrix with locations uniformly sampled at random. Candès and Plan (2010a), Keshavan, Montanari and Oh (2009) explored the same setting in the presence of noise, proposed estimators of $A_0$ and evaluated their Frobenius norm error. The best bounds are those of Keshavan, Montanari and Oh (2009), who suggest an estimator such that, for and with , the squared error is of the order with probability close to 1 when the noise is i.i.d. Gaussian.
In this paper, we consider the general noisy setting of the trace regression problem. We study a class of Schatten- estimators , that is, the penalized least squares estimators with a penalty proportional to Schatten- norm; cf. (7). The special case corresponds to the “matrix Lasso.” We study the convergence properties of their prediction error
and of their Schatten- error. The main contributions of this paper are the following.
(a) For all , under various assumptions on the masks (no assumption, USR matrix completion, CS matrix completion) we obtain different bounds on the prediction error of Schatten- estimators involving the Schatten- norm of .
(b) For sufficiently close to 0, under a mild assumption on , we show that Schatten- estimators achieve the prediction error rate of convergence , up to a logarithmic factor. This result is valid for matrices whose singular values are not exponentially large in . It covers the matrix completion and high-dimensional multi-task learning problems.
(c) For all , we obtain upper bounds for the prediction error under the matrix Restricted Isometry (RI) condition on the masks , which is a rather strong condition, and under the assumption that . We also derive the bounds for the Schatten- error of . The rate in the bounds for the prediction error is when the RI condition is satisfied with scaling factor 1 (i.e., for the case not related to matrix completion and high-dimensional multi-task learning).
(d) We prove the lower bounds showing that the rate is minimax optimal for the prediction error and Schatten-2 (i.e., Frobenius) norm estimation error under the RI condition on the class of matrices of rank smaller than . Our result is even more general because we prove our lower bound on the intersection of the Schatten-0 ball with the Schatten- ball for any , which allows us to show minimax optimality of the upper bounds of (a) as well. Furthermore, we prove minimax lower bounds for collaborative sampling and USR matrix completion problems.
The main point of this paper is to show that the suitably tuned Schatten estimators attain the optimal rate of prediction error up to logarithmic factors. The striking fact is that we can achieve this not only under very restrictive assumptions, such as the RI condition, but also under very mild assumptions on the masks .
Finally, it is useful to compare the results for matrix estimation when the sparsity is expressed by the rank with those for the high-dimensional vector estimation when the sparsity is expressed by the number of nonzero components of the vector. For the vector estimation, we have the linear model
where , and, for example, are i.i.d. random variables. Consider the high-dimensional case . (This is analogous to the assumption in the matrix problem and means that the nominal dimension is much larger than the sample size.) The sparsity assumption for the vector case has the form , where is the number of nonzero components, or the intrinsic dimension of . Let be an estimator of . Then the optimal rate of convergence of the prediction risk on the class of vectors with given is of the order , up to logarithmic factors. This rate is shown to be attained, up to logarithmic factors, for many estimators, such as the BIC, the Lasso, the Dantzig selector, Sparse Exponential Weighting, etc.; cf., for example, Bunea, Tsybakov and Wegkamp (2007), Koltchinskii (2008), Bickel, Ritov and Tsybakov (2009), Dalalyan and Tsybakov (2008). Note that this rate is of the form , up to a logarithmic factor. The general interpretation is therefore completely analogous to that of the matrix case: Assume for simplicity that is a square matrix with . As mentioned above, the intrinsic dimension (the number of parameters to be estimated to recover ) is then , which is of the order if . An interesting difference is that the logarithmic risk inflation factor is inevitable in the vector case [cf. Donoho et al. (1992), Foster and George (1994)], but not in the matrix problem, as our results reveal.
This paper is organized as follows. In Section 2, we introduce notation, some definitions, basic facts about the Schatten quasi-norms and define the Schatten- estimators. Section 3 describes elementary steps in their convergence analysis and presents two general approaches to upper bounds on the estimation and prediction error (cf. Theorems 1 and 2) depending on the efficient noise level . Our main results are stated in Sections 4, 5 (matrix completion), 6 (multi-task learning). They are obtained from Theorems 1 and 2 by specifying the effective noise level under particular assumptions on the masks . Concentration bounds for certain random matrices leading to the expressions for the effective noise level are collected in Section 8. Section 7 is devoted to minimax lower bounds. Sections 9 and 10 contain the main proofs. Finally, in Section 11 we establish bounds for the th entropy numbers of the quasi-convex Schatten class embeddings , , which are needed for our proofs and are of independent interest.
2 Preliminaries
2.1 Notation, definitions and basic facts
We will write for the Euclidean norm in for any integer . For any matrix , we denote by for its th row and write for its th column, . We denote by the singular values of . The (quasi-)norm of some (quasi-) Banach space is canonically denoted by . In particular, for any matrix and we consider the Schatten (quasi-)norms
The Schatten spaces are defined as spaces of all matrices equipped with quasi-norm . In particular, the Schatten-2 norm coincides with the Frobenius norm
where denote the elements of matrix . Recall that for the Schatten spaces are not normed but only quasi-normed, and satisfies the inequality
(3) $\quad \|A+B\|_p^p\le\|A\|_p^p+\|B\|_p^p$
for any $0<p\le1$ and any two matrices $A$ and $B$; cf. McCarthy (1967) and Rotfeld (1969). We will use the following well-known trace duality property: $|\operatorname{tr}(A^{T}B)|\le\|A\|_1\|B\|_\infty$ for all $m\times T$ matrices $A$ and $B$, where $\|B\|_\infty$ denotes the largest singular value of $B$.
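A short numerical sketch of these facts, assuming the standard conventions that the Schatten-$q$ (quasi-)norm is the $\ell_q$ (quasi-)norm of the singular values, that the Schatten-1 norm is the nuclear norm and that the spectral norm is the largest singular value; sizes are arbitrary.

```python
import numpy as np

def schatten(A, q):
    """Schatten-q (quasi-)norm: the l_q (quasi-)norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** q).sum() ** (1.0 / q)

rng = np.random.default_rng(3)
A, B = rng.standard_normal((5, 7)), rng.standard_normal((5, 7))

# q = 2 coincides with the Frobenius norm
assert np.isclose(schatten(A, 2), np.linalg.norm(A, 'fro'))

# trace duality: |tr(A^T B)| <= (nuclear norm of A) * (spectral norm of B)
spec_B = np.linalg.svd(B, compute_uv=False).max()
assert abs(np.trace(A.T @ B)) <= schatten(A, 1) * spec_B + 1e-9

# for q < 1 the q-th power satisfies the triangle-type inequality (3)
q = 0.5
assert schatten(A + B, q) ** q <= schatten(A, q) ** q + schatten(B, q) ** q + 1e-9
```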
2.2 Characteristics of the sampling operator
Let be the sampling operator, that is, the linear mapping defined by
We have
Depending on the context, we also write for , where and are any matrices in . Unless the reverse is explicitly stated, we will tacitly assume that the matrices are nonrandom.
We will denote by the maximal rank-1 restricted eigenvalue of :
(4) |
We now introduce two basic assumptions on the sampling operator that will be used in the sequel. The sampling operator will be called uniformly bounded if there exists a constant such that
(5) |
Clearly, if is uniformly bounded, then . Condition (5) is trivially satisfied with for USR matrix completion and with for CS matrix completion.
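A minimal sketch of the sampling operator and of the ratio appearing in the uniform boundedness condition. The normalization $\|\mathcal L(A)\|_2/(\sqrt n\,\|A\|_2)$ used below is my reading of the “empirical norm induced by the sampling operator/Frobenius norm” ratio from the abstract; the Monte Carlo check over random directions is only an illustration, not a verification of (5), and the masks and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
m, T, n = 15, 12, 60
masks = [rng.standard_normal((m, T)) for _ in range(n)]   # illustrative (dense) masks X_i

def sampling_operator(A, masks):
    """L(A) = (tr(X_1^T A), ..., tr(X_n^T A))."""
    return np.array([np.trace(X.T @ A) for X in masks])

# ratio "empirical norm / Frobenius norm" evaluated on random directions A
ratios = [
    np.linalg.norm(sampling_operator(A, masks)) / (np.sqrt(n) * np.linalg.norm(A))
    for A in (rng.standard_normal((m, T)) for _ in range(200))
]
print("largest ratio over the sampled directions:", max(ratios))
```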
The sampling operator is said to satisfy the Restricted Isometry condition RI (,) for some integer and some if there exists a constant such that
(6) |
for all matrices of rank at most .
A difference of this condition from the Restricted Isometry condition, introduced by Candès and Tao (2005) in the vector case or from its analog for the matrix case suggested by Recht, Fazel and Parrilo (2010), is that we state it with a scaling factor . This factor is introduced to account for the fact that the masks are typically very sparse, so that they do not induce isometries with coefficient close to one. Indeed, will be large in the examples that we consider below.
2.3 Least squares estimators with Schatten penalty
In this paper, we study the estimators $\hat A$ defined as a solution of the minimization problem
(7) $\quad \hat A\in\operatorname*{argmin}_{A\in\mathbb{R}^{m\times T}}\Big\{\frac1n\sum_{i=1}^n\big(Y_i-\operatorname{tr}(X_i^{T}A)\big)^2+\lambda\|A\|_p^p\Big\}$
with some fixed $0<p\le1$ and $\lambda>0$. The case $p=1$ (matrix Lasso) is of outstanding interest since the minimization problem is then convex and thus can be efficiently solved in polynomial time. We call $\hat A$ the Schatten-$p$ estimator. Such estimators have been recently considered by many authors motivated by applications to multi-task learning and recommendation systems. Probably, the first study is due to Srebro, Rennie and Jaakkola (2005) who dealt with binary classification and considered the Schatten-1 estimator with the hinge loss rather than squared loss. Argyriou et al. (2008), Argyriou, Evgeniou and Pontil (2008), Argyriou, Micchelli and Pontil (2010), Bach (2008), Abernethy et al. (2009) discussed connections of (7) to other related minimization problems, along with characterizations of the solutions and computational issues, mainly focusing on the convex case $p=1$. Also for the nonconvex case ($0<p<1$), Argyriou et al. (2008), Argyriou, Evgeniou and Pontil (2008) suggested an algorithm of approximate computation of the Schatten-$p$ estimator or its analogs. However, for $p<1$ these methods can find only a local minimum in (7), so that Schatten estimators with such $p$ remain for the moment mainly of theoretical value. In particular, analyzing these estimators reveals which rates of convergence can, in principle, be attained.
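For the convex case $p=1$, one standard way to approximate the matrix Lasso is proximal gradient descent, whose proximal step soft-thresholds the singular values. The sketch below is a generic illustration of this approach, not the algorithm analyzed in the paper; the step size, iteration count and helper names are arbitrary choices.

```python
import numpy as np

def svd_soft_threshold(B, tau):
    """Proximal operator of tau * (nuclear norm): soft-threshold the singular values of B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def matrix_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient descent for (1/n) * sum_i (Y_i - tr(X_i^T A))^2 + lam * ||A||_1."""
    n, m, T = X.shape
    A = np.zeros((m, T))
    # crude Lipschitz constant of the gradient of the quadratic part
    L = 2.0 * np.linalg.norm(X.reshape(n, -1), 2) ** 2 / n
    for _ in range(n_iter):
        resid = np.einsum('imt,mt->i', X, A) - Y            # tr(X_i^T A) - Y_i
        grad = (2.0 / n) * np.einsum('i,imt->mt', resid, X)
        A = svd_soft_threshold(A - grad / L, lam / L)
    return A
```

Applied to data generated as in the simulation sketch of the Introduction, `matrix_lasso(X, Y, lam)` returns an estimate whose singular values are shrunk toward zero, which is the low-rank shrinkage effect discussed above.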
The statistical properties of Schatten estimators are not yet well understood. To our knowledge, the only previous study is that of Bach (2008) showing that for $p=1$, under some condition on the $X_i$’s [analogous to the strong irrepresentability condition in the vector case; cf. Meinshausen and Bühlmann (2006), Zhao and Yu (2006)], the rank of $A_0$ is consistently recovered by $\hat A$ when $m$ and $T$ are fixed and $n\to\infty$. Our results are of a different kind. They are nonasymptotic and meaningful in the case . Furthermore, we do not consider the recovery of the rank, but rather the estimation and prediction properties of Schatten-$p$ estimators.
After this paper was submitted, we became aware of interesting contemporaneous and independent works by Candès and Plan (2010b), Negahban et al. (2009) and Negahban and Wainwright (2011). Those papers focus on the bounds for the Schatten-2 (i.e., Frobenius) norm error of the matrix Lasso estimator under the matrix RI condition. This is related to the particular instance of our results in item (c) above with and . Their analysis of this case is complementary to ours in several aspects. Negahban and Wainwright (2011) derive their bound under the assumption that are matrices with i.i.d. standard Gaussian elements and belongs to a Schatten- ball with , which leads to rates different from ours if . An assumption used in this context in Negahban and Wainwright (2011) is that (in our notation), which excludes the high-dimensional case that we are mainly interested in. Candès and Plan (2010b) consider approximately low-rank matrices, explore the closely related matrix Dantzig selector and provide lower bounds corresponding to a special case of item (d) above. The results of these papers do not cover the matrix completion and multi-task learning problems, which are the main focus of our study. We also mention a more recent work by Bunea, She and Wegkamp (2010) dealing with a special case of our model and analyzing matrix estimators penalized by the rank.
3 Two schemes of analyzing Schatten estimators
In this section, we discuss two schemes of proving upper bounds on the prediction error of . The first bound involves only the Schatten- norm of matrix . The second involves only the rank of but needs the RI condition on the sampling operator.
We start by sketching elementary steps in the convergence analysis of Schatten- estimators. By the definition of ,
Recalling that $Y_i=\operatorname{tr}(X_i^{T}A_0)+\xi_i$, we can transform this by simple algebra to
(8) $\quad \frac1n\sum_{i=1}^n\big(\operatorname{tr}(X_i^{T}(\hat A-A_0))\big)^2\le\frac2n\sum_{i=1}^n\xi_i\operatorname{tr}\big(X_i^{T}(\hat A-A_0)\big)+\lambda\big(\|A_0\|_p^p-\|\hat A\|_p^p\big).$
In the sequel, inequality (8) will be referred to as basic inequality and the random variable will be called the stochastic term. The core in the analysis of Schatten- estimators consists in proving tight bounds for the right-hand side of the basic inequality (8). For this purpose, we first need a control of the stochastic term. Section 8 below demonstrates that such a control strongly depends on the properties of , that is, of the problem at hand. In summary, Section 8 establishes that, under suitable conditions, for any the stochastic term can be bounded for all with probability close to 1 as follows:
(9)
where depends on and . The quantity plays a crucial role in this bound. We will call the effective noise level. Exact expressions for under various assumptions on the sampling operator and on the noise are derived in Section 8. In Table 1, we present the values of for three important examples under the assumption that are i.i.d. Gaussian random variables. In the cases listed in Table 1, inequality (3) holds with probability , where (first and third example) and (second example) with constants independent of .
| Assumptions on the masks | Assumptions on the noise | Value of the effective noise level |
|---|---|---|
| Uniformly bounded | | |
| USR matrix completion | , | |
| CS matrix completion | | |
The following two points will be important to understand the subsequent results:
• In this paper, we will always choose the regularization parameter in the form .
• With this choice of , the effective noise level characterizes the rate of convergence of the Schatten estimator. The smaller is , the faster is the rate.
In particular, the first line in Table 1 reveals that the largest effective noise level corresponds to $p=1$ and it becomes smaller when $p$ decreases to 0. This suggests that choosing Schatten-$p$ estimators with $p<1$, and especially with $p$ close to 0, might be advantageous. Note that the assumption of uniform boundedness of the sampling operator is very mild. For example, it is trivially satisfied for USR matrix completion and for CS matrix completion. However, in these two cases a specific analysis leads to sharper bounds on the effective noise level (i.e., on the rate of convergence of the estimators); cf. the second and third lines of Table 1.
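To get a feel for how the effective noise level arises in the convex case, one can simulate the matrix $(1/n)\sum_i\xi_iX_i$ whose spectral norm controls the stochastic term via trace duality (cf. Section 8.1). The following Monte Carlo sketch does this for USR point masks with arbitrary illustrative sizes; it is only an empirical illustration of the scaling, not a derivation of the constants.

```python
import numpy as np

rng = np.random.default_rng(5)
m, T, n, sigma, reps = 60, 60, 2000, 1.0, 50   # illustrative sizes

norms = []
for _ in range(reps):
    rows, cols = rng.integers(m, size=n), rng.integers(T, size=n)   # USR point-mask positions
    xi = sigma * rng.standard_normal(n)
    M = np.zeros((m, T))
    np.add.at(M, (rows, cols), xi)              # M = sum_i xi_i X_i for point masks
    norms.append(np.linalg.norm(M / n, 2))      # spectral norm of (1/n) sum_i xi_i X_i
print("mean spectral norm of the noise matrix:", np.mean(norms))
```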
In this section, we provide two bounds on the prediction error of with a general effective noise level . We then detail them in Sections 4–6 for particular values of depending on the assumptions on the . The first bound involves the Schatten- norm of matrix .
Theorem 1
The bound (10) depends on the magnitude of the elements of via . The next theorem shows that under the RI condition this dependence can be avoided, and only the rank of affects the rate of convergence.
Theorem 2
Let with , and let . Assume that (3) holds with probability at least for some and . Assume also that the Restricted Isometry condition RI , holds with some , with a sufficiently large depending only on and with for a sufficiently small depending only on .
Let be the Schatten- estimator defined as a minimizer of (7) with . Then with probability at least we have
(11)
(12)
where and are constants, depends only on and depends on and .
Proof of Theorem 2 is given in Section 9. The values and can be deduced from the proof. In particular, for it is sufficient to take .
Remark 1.
Note that if the rates in (11) and (12) do not depend on if we assume in addition the uniform boundedness of , which is a very mild condition. Indeed, taking the value of from the first line of Table 1 we see that for all . Thus, under the RI condition, using Schatten- estimators with does not improve the rate of convergence on the class of matrices of rank at most .
Discussion about the scaling factor . Remark 1 deals with the case , which seems to be not always appropriate for trace regression models. To our knowledge, the only available examples of matrices such that the sampling operator satisfies the RI condition with are complete matrices, that is, matrices with all nonzero entries, which are random and have specific distributions [typically, i.i.d. Rademacher or Gaussian entries; cf. Recht, Fazel and Parrilo (2010)]. Except for degenerate cases [such as , the distinct and of the form for ] the sampling operator defines typically a restricted isometry with only if the matrices contain a considerable number of (uniformly bounded) nonzero entries.
Let us now specify the form of the RI condition in the context of multi-task learning discussed in the Introduction. Using (2) for a matrix , we obtain
where is the Gram matrix of predictors for the th task. These matrices correspond to separate regression models. The standard assumption is that they are normalized so that all the diagonal elements of each are equal to 1. This suggests that the natural RI scaling factor for such model is of the order . For example, in the simplest case when all the matrices are just equal to the identity matrix, we find Similarly, we get the RI condition with scaling factor when the spectra of all the Gram matrices are included in a fixed interval with . However, this excludes the high-dimensional task regressions, such that the number of parameters is larger than the sample size, . In conclusion, application of the matrix RI techniques in multi-task learning is restricted to low-dimensional regression and the scaling factor is .
The reason for the failure of the RI approach is that the masks are sparse. The sparser are , the larger is . The extreme situation corresponds to matrix completion problems. Indeed, if , then there exists a matrix of rank in the null-space of the sampling operator and hence the RI condition cannot be satisfied. For we can have the RI condition with scaling factor , but means that essentially all the entries are observed, so that the very problem of completion does not arise.
4 Upper bounds under mild conditions on the sampling operator
The above discussion suggests that Theorem 2 and, in general, the argument based on the restricted isometry or related conditions are not well adapted for several interesting settings. Motivated by this, we propose another approach described in the next theorem, which requires only the comparably mild uniform boundedness condition (5). For simplicity, we focus on Gaussian errors . Set .
Theorem 3
Let be i.i.d. random variables. Assume that , and that the uniform boundedness condition (5) is satisfied. Let with and the maximal singular value for some . Set , where and
(13) |
for some and a universal positive constant independent of , and . Then the Schatten- estimator defined as a minimizer of (7) with as in (13) satisfies
(14) |
with probability at least where the positive constant is independent of , and .
Inequality (3) holds with probability at least by Lemma 5. We then use (10) and note that, under our choice of , for some constant , which does not depend on and , and
Finally, we give the following theorem quantifying the rates of convergence of the prediction risk in terms of the Schatten norms of . Its proof is straightforward in view of Theorem 1 and Lemmas 2 and 5.
Theorem 4
Let be i.i.d. random variables and . Then the Schatten- estimator has the following properties:
Let , and . Then
(15) |
with probability at least where is an absolute constant.
In Theorem 5 below we show that these rates are optimal in a minimax sense on the corresponding Schatten- balls for the sampling operators satisfying the RI condition.
5 Upper bounds for noisy matrix completion
As discussed in Section 3, for matrix completion problems the restricted isometry argument as in Theorem 2 is not applicable. We will therefore use Theorems 1 and 3. First, combining Theorem 1 with Lemma 3 of Section 8 we get the following corollary.
Corollary 1 ((USR matrix completion))
Corollary 2 ((USR matrix completion, nonconvex penalty))
Let be i.i.d. random variables. Assume that , and consider the USR matrix completion model. Let with and the maximal singular value for some . Set , where and
for some with a universal constant , independent of , and . Then the Schatten- estimator defined as a minimizer of (7) satisfies
(19) |
with probability at least , where the positive constant is also independent of , and .
Note that the bounds of Corollaries 1(i) and 2 achieve the rate , up to logarithmic factors under different conditions on the maximal singular value of . If then the condition in Corollary 2 does not imply more than a polynomial in growth on , which is a mild assumption. On the other hand, (17) requires uniform boundedness of by some constant to achieve the same rate. However, the estimators of Corollary 2 correspond to nonconvex penalty and are computationally hard.
We now turn to the collaborative sampling matrix completion. The next corollary follows from combination of Theorem 1 with Lemmas 3 and 4 of Section 8.
Corollary 3 ((Collaborative sampling))
Consider the problem of matrix completion with collaborative sampling.
Let be i.i.d. random variables. Let be given by (44) with some . Then the Schatten- estimator defined with satisfies
(20) |
with probability at least , where .
Remark 2.
Using the inequality for matrices of rank at most , we find that the bound (20) is minimax optimal on the class of matrices
for some constant , if the masks fulfill the dispersion condition of Theorem 7 below. It is further interesting to note that the construction in the proof of the lower bound in Theorem 7 fails if the radius in the restriction is of smaller order than .
6 Upper bounds for multi-task learning
For multi-task learning, we can apply both Theorems 2 and 3. Theorem 2 imposes a strong assumption on the masks , namely the RI condition. Nevertheless, the advantage is that Theorem 2 covers the computationally easy case .
Corollary 4 ((Multi-task learning; RI condition))
Let be i.i.d. random variables. Consider the multi-task learning problem with . Assume that the spectra of the Gram matrices are uniformly in bounded from above by a constant . Assume also that the Restricted Isometry condition RI (, ) holds with some and with for a sufficiently small . Set
Let be the Schatten- estimator with this parameter . Then with probability at least we have
where is an absolute constant and depends only on .
The proof of Corollary 4 is straightforward in view of Theorem 2, Lemma 2 and the fact that, under the premise of Corollary 4, we have for all matrices , so that the sampling operator is uniformly bounded [(5) holds with ], and thus .
Taking in the bounds of Corollary 4 the natural scaling factor , we obtain the following inequalities:
(22)
(23)
where the constants and do not depend on and .
A remarkable fact is that the rates in Corollary 4 are free of logarithmic inflation factor. This is one of the differences between the matrix estimation problems and vector estimation ones, where the logarithmic risk inflation is inevitable, as first noticed by Donoho et al. (1992), Foster and George (1994). For more details about optimal rates of sparse estimation in the vector case, see Rigollet and Tsybakov (2010).
Since the Group Lasso is a special case of the nuclear norm penalized minimization on block-diagonal matrices [cf., e.g., Bach (2008)], Corollary 4 and the bounds (22), (23) imply the corresponding bounds for the Group Lasso under the low-rank assumption. To see the difference from the previous results for the Group Lasso, we consider, for example, those obtained in the multi-task setting by Lounici et al. (2009, 2010). The main difference is that the sparsity index appearing in Lounici et al. (2009, 2010) is now replaced by . In Lounici et al. (2009, 2010), the columns of are supposed to be sparse, with the sets of nonzero elements of cardinality not more than , whereas here the sparsity is characterized by the rank of .
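The block-diagonal connection can be checked numerically: arranging the per-task coefficient vectors as the blocks of a block-diagonal matrix makes its nonzero singular values equal to the Euclidean norms of the blocks, so the nuclear norm of that matrix reproduces the Group Lasso penalty. A small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
m, T = 5, 3
betas = [rng.standard_normal(m) for _ in range(T)]   # per-task coefficient vectors

# block-diagonal arrangement: block t contains beta_t in rows t*m..(t+1)*m-1 of column t
D = np.zeros((m * T, T))
for t, b in enumerate(betas):
    D[t * m:(t + 1) * m, t] = b

nuclear = np.linalg.svd(D, compute_uv=False).sum()   # Schatten-1 (nuclear) norm of D
group_lasso = sum(np.linalg.norm(b) for b in betas)  # Group Lasso penalty
assert np.isclose(nuclear, group_lasso)
```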
Finally, we give the following result based on application of Theorem 3.
Corollary 5 ((Multi-task learning; uniformly bounded ))
Let be i.i.d. random variables, and assume that . Consider the multi-task learning problem with , , such that the maximal singular value for some . Assume that the spectra of the Gram matrices are uniformly in bounded from above by where is a constant. Set , where and
for some and a universal constant , independent of , and . Then the Schatten- estimator with this parameter satisfies
(24) |
with probability at least where , and the positive constant is independent of , and .
Corollary 5 follows from Theorem 3. Indeed, it suffices to remark that, under the premises of Corollary 5, we have for all matrices , so that the sampling operator is uniformly bounded; cf. (5).
For , we can write (24) in the form
(25) |
Clearly, this bound achieves the optimal rate “intrinsic dimension/sample size” , up to logarithms (recall that in the multi-task learning). The bounds (22) and (23) achieve this rate in a more precise sense because they are free of extra logarithmic factors.
Another remark concerns the possible range of . It follows from the discussion in Section 3 that the “dimension larger than the sample size” framework is not covered by Corollary 4 since this corollary relies on the RI condition. In contrast, the bounds of Corollary 5 make sense when the dimension is larger than the sample size of each task; we only need to have for Corollary 5 to be meaningful. Corollary 5 holds when the RI assumption is violated and under a mild condition on the masks . The price to pay is to assume that the singular values of do not grow exponentially fast. Also, the estimator of Corollary 5 corresponds to , so it is computationally hard.
7 Minimax lower bounds
In this section, we derive lower bounds for the prediction error, which show that the upper bounds that we have proved are optimal in a minimax sense for two scenarios: (i) under the RI condition and (ii) for matrix completion with collaborative sampling. We also provide a lower bound for USR matrix completion. Under the RI condition with , minimax lower bounds for the Frobenius norm on “Schatten-0” balls are derived in Candès and Plan (2010b) with a technique different from ours, which does not allow one to include further boundedness constraints on in addition to that it has rank at most . Specifically, they prove their lower bound by passage to Bayes risk with an unbounded support prior (Gaussian prior). Our lower bounds are more general in the sense that they are obtained on smaller sets, namely, the intersections of Schatten-0 and Schatten- balls. This is similar in spirit to Rigollet and Tsybakov (2010) establishing minimax lower bounds on the intersection of and balls for the vector sparsity scenario. In what follows, we denote by the infimum over all estimators based on , and for any we denote by the probability distribution of satisfying (1) with .
Theorem 5 ((Lower bound—Restricted Isometry))
Let be i.i.d. random variables for some . Let , , and for define
Assume that the sampling operator satisfies the right-hand side inequality in the RI -condition (6) for some . Then for any , , ,
(26) |
where is such that as .
Remark 3.
It is worth noting that and do not depend on the constant of the RI condition.
Proof of Theorem 5 Without loss of generality, we assume that . For a constant and an integer , both to be specified later, define
By construction, any element of as well as the difference of any two elements of has rank at most . Due to the Varshamov–Gilbert bound [cf. Lemma 2.9 in Tsybakov (2009)], there exists a subset of cardinality containing such that for any two distinct elements and of ,
(28) |
where the first inequality follows from the left-hand side inequality in the RI condition (6) and is only used to prove (27). We will prove (27); the proof of (26) is analogous in view of the second inequality in (28).
Then, for any , the Kullback–Leibler divergence between and satisfies
(29) |
where we used again the RI condition. We now apply Theorem 2.5 in Tsybakov (2009). Fix some . Note that the condition
(30) |
is satisfied for . Define
and consider separately the following three cases.
The case . In this case, for any , and . Set
Then for all , i.e., is contained in the set
Now, inequality (28) shows that for any two distinct elements , while implies that . Hence, condition (30) is satisfied.
The case . In this case, the rate is equal to . We consider the set with some to be specified below. For , we have . Since also when , it follows that whenever
(31) |
Now define
Then (30) is satisfied and fulfills also the constraint (31), since , . Thus, is a subset of matrices with and . Finally, (28) implies that
for any two distinct elements of .
The case . In this case, . The conditions required in Theorem 2.5 of Tsybakov (2009) follow immediately as above, this time with the set of matrices , where .
Remark 4.
Theorem 5 implies that the rates of convergence in Theorem 4 are optimal in a minimax sense on Schatten- balls under the RI condition and natural assumptions on and . Indeed, using Theorem 5 with no restriction on the rank [i.e., when ], and putting for simplicity , we find that the rate in the lower bound is of the order . For and this minimum equals , which coincides with the upper bound of Theorem 4.
The lower bound for the prediction error (27) in the above theorem does not apply to matrix completion with since then the Restricted Isometry condition cannot be satisfied, as discussed in Section 3. However, for the bound (26) we only need the right-hand side inequality in the RI condition. For example, the latter is trivially satisfied for CS matrix completion with and . This yields the following corollary.
Corollary 6 ((Lower bound—CS matrix completion))
Let be i.i.d. random variables for some . Let , , , and consider the problem of CS matrix completion. Then for any , , ,
where and , are as in Theorem 5.
The model of uniform sampling without replacement considered in Candès and Recht (2009) is a particular case of CS matrix completion. In the noisy case, Keshavan, Montanari and Oh (2009) obtain upper bounds under such a sampling scheme with the rate , up to logarithmic factors. The lower bound of Corollary 6 is of the same order when , that is, for the class of matrices of rank smaller than . However, Keshavan, Montanari and Oh (2009) obtained their bounds on some subclasses of this class characterized by additional strong restrictions.
It is useful to note that for bounds of the type (26) it is enough to have a condition on in expectation, as specified in the next theorem.
Theorem 6
Let be i.i.d. random variables for some . Let , , , and assume that are random matrices independent of , and the sampling operator satisfies for some and all such that . Then for any , , ,
where and , are as in Theorem 5.
We proceed as in Theorem 5, with the only difference in the bound on the Kullback–Leibler divergence. Indeed, under our assumptions, instead of (29) we have
(32) |
Theorem 6 applies to USR matrix completion with . Indeed, in that case . In particular, Theorem 6 with shows that on the class of matrices of rank smaller than the lower bound of estimation in the squared Frobenius norm for USR matrix completion is of the order .
The next theorem gives a lower bound for the prediction error under collaborative sampling without the RI condition. Instead, we only impose a rather natural condition that the observed noisy entries are sufficiently well dispersed, that is, there exist rows or columns with more than observations for some fixed . We state the result with an additional constraint on the Frobenius norm of , in order to fit the corresponding upper bound (cf. Remark 2 in Section 5).
Theorem 7 ((Lower bound—CS matrix completion))
Let be i.i.d. random variables for some and assume that the masks are pairwise different. Let and for some fixed , where . Assume furthermore that the following dispersion condition holds: there exist numbers or such that either the set or the set has cardinality at least . Define Then for any and ,
with a function as and .
We proceed as in the case , , of Theorem 5, taking a different set instead. Let, for definiteness, the dispersion condition be satisfied with the set of indices . Then there exists a subset of with cardinality . We define
Any element of as well as the difference of any two elements of has rank at most , and , . So, if . As in Theorem 5, the Varshamov–Gilbert bound implies that there exists a subset of cardinality containing such that for any two distinct elements and of ,
Instead of the bound (29), we now have the inequality for any . Finally, we choose . With these modifications, the rest of the proof is the same as that of Theorem 5 in the case .
8 Control of the stochastic term
We consider two approaches for bounding the stochastic term on the right-hand side of the basic inequality (8). The first one used for consists in application of the trace duality
(33) |
with and then of suitable exponential bounds for the spectral norm of under different conditions on , . The second approach used to treat the case (nonconvex penalties) (cf. Section 8.2) is based on refined empirical process techniques. Proofs of the results of this section are deferred to Section 10.
8.1 Tail bounds for the spectral norm of random matrices
We say that the random variables $\xi_i$, $i=1,\dots,n$, satisfy the Bernstein condition if
(34) $\quad \mathbb{E}|\xi_i|^k\le\frac{k!}{2}\sigma^2H^{k-2},\qquad k=2,3,\ldots,$
with some finite constants $\sigma>0$ and $H>0$, and we say that they satisfy the light tail condition if
(35) |
for some positive constant .
Lemma 1
Let the i.i.d. zero-mean random variables satisfy the Bernstein condition (34). Let also either
(36) |
and
(37) |
or the conditions
(38) |
and
(39) |
hold true with some constants . Let . Then, respectively, with probability at least or at least we have
(40) |
where if (36) and (37) are satisfied or if (38) and (39) hold. Here
Lemma 2
Let be i.i.d. random variables. Then, for any ,
(41) |
with probability at least , where is the maximal rank 1 eigenvalue of the sampling operator .
If and have the same order of magnitude, the bound of Lemma 2 is better, since it does not contain extra logarithmic factors. On the other hand, if and differ dramatically, for example, , then Lemma 1 can provide a significant improvement. Indeed, the “column” version of Lemma 1 guarantees the rate which in this case is much smaller than . In all the cases, the concentration rate in Lemma 2 is exponential and thus faster than in Lemma 1.
The next lemma treats the stochastic term for USR matrix completion.
Lemma 3 ((USR matrix completion))
(i) Let the i.i.d. zero-mean random variables satisfy the Bernstein condition (34). Consider the USR matrix completion problem and assume that . Then, for any ,
(42) |
with probability at least .
(ii) Assume that the i.i.d. zero-mean random variables satisfy the light tail condition (35) for some . Then for any ,
(43) |
with probability at least for some constant which does not depend on and .
The proof of part (i) is based on a refinement of a technique in Vershynin (2007), whereas that of part (ii) follows immediately from the large deviations inequality of Nemirovski (2004). For example, if , in which case both results apply, the bound (ii) is tighter than (i) for sample sizes which is the most interesting case for matrix completion.
Much tighter bounds are available when the masks are constrained to be pairwise different. Besides, it is noteworthy that the rates in (44) and (45) below are different for Gaussian and Bernstein errors.
Lemma 4 ((Collaborative sampling))
Consider the problem of CS matrix completion.
Let be i.i.d. random variables. Then, for any,
(44) |
with probability at least .
Let be i.i.d. zero-mean random variables satisfying the Bernstein condition (34). Then, for any and
(45) |
with probability at least .
Let be i.i.d. random variables. Then for any ,
with probability at least .
Since the masks are distinct, the maximum appearing in (iii) is bounded by ; in case it is attained, the bound (44) is slightly stronger since it is free from the logarithmic factor. For the tightness of the bound in (iii) depends strongly on the geometry of the ’s and the maximum can be significantly smaller than . Note also that the concentration in (44) is exponential, while it is only polynomial in (iii).
8.2 Concentration bounds for the stochastic term under nonconvex penalties
The last bound in this section applies in the case . It is given in the following lemma.
Lemma 5
Let be i.i.d. random variables, and . Assume that the sampling operator is uniformly bounded; cf. (5). Set where . Then for any fixed , and we have
(46) |
with probability at least for some constant which is independent of and and satisfies for all .
Note at this point that we cannot base the proof of Lemma 5 directly on the trace duality and norm interpolation (cf. Lemma 11), that is, on the inequalities
Indeed, one may think that we could have bounded here the -norm of in the same way as in Section 8.1, and then the proof would be complete after suitable decoupling if we were able to bound from above by times a constant factor. However, this is not possible. Even the Restricted Isometry condition cannot help here because is not necessarily of small rank. Nevertheless, we will show that by other techniques it is possible to derive an inequality similar to (8.2) with instead of . Further details are given in Sections 10 and 11.
9 Proof of Theorem 2
Preliminaries
We first give two lemmas on matrix decomposition needed in our proof, which are essentially provided by Recht, Fazel and Parrilo (2010) [subsequently, RFP(10) for short].
Lemma 6
Let and be matrices of the same dimension. If , , then
For the result is Lemma 2.3 in RFP(10). The argument obviously extends to any since RFP(10) show that the singular values of are equal to the union (with repetition) of the singular values of and .
Lemma 7
Let with and singular value decomposition . Let be arbitrary. Then there exists a decomposition with the following properties:
,
, ,
,
and are of the form
(48) |
The points (i)–(iii) are the statement of Lemma 3.4 in RFP(10); the representation (iv) is provided in its proof.
Proof of Theorem 2 First note that there exists a decomposition with the following properties:
,
, ,
From the basic inequalities (8) and (3) with , we find
(49) | |||
In particular, for the case ,
(50) |
For brevity, we will conduct the proof with the numerical constants given in (50), that is, with those for . The proof for general differs only in the values of the constants, but their expressions become cumbersome.
Using (3), we get
(51) | |||
since and by construction. Together with (9) this yields
(52) |
from which one may deduce
(53) |
and
(54) |
Consider now the following decomposition of the matrix . First, recall that is of the form
Write with diagonal matrix of dimension and for some . In the next step, and are complemented to orthogonal matrices and of dimension . For instance, set
where complements the columns of the matrix to an orthonormal basis in , and proceed analogously with . In particular, . Also
We now represent as a finite sum of matrices with
and
where the diagonal matrix has the form , . We denote here by the set of indices from corresponding to the largest in absolute value diagonal entries of , by the set of indices corresponding to the next largest in absolute value diagonal entries , etc. Clearly, the matrices are mutually orthogonal: for and . Moreover, is orthogonal to .
Let be the singular values of , then are the singular values of , those of , etc. By construction, we have for all , and for all
Thus,
from which one can deduce for all :
and consequently
Because of the elementary inequality for any nonnegative and ,
Therefore,
where the last inequality results from and
Finally,
(55) |
We now proceed with the final argument. First, note that . Next, by the triangle inequality, the restricted isometry condition and the orthogonality of and , we obtain
Define
Then . Now, , and thus
whenever
(57) |
In case of (57), there exists a universal constant such that
(58) |
Now, inequalities (53) and (58) yield
(59) |
where the second inequality results from the fact that , which implies
(60) |
From (59), we obtain
(61) |
Furthermore, from (53), (60) and (61) we find
This proves (11). It remains to prove (12). We first demonstrate (12) for , then for , and finally obtain (12) for all by Schatten norm interpolation.
10 Proofs of the lemmas
Proof of Lemma 1 First, observe that
with vectors . Consequently, for any ,
To proceed with the evaluation of the latter probability, we use the following concentration bound [Pinelis and Sakhanenko (1985)].
Lemma 8
Let be independent zero mean random variables in a separable Hilbert space such that
(63) |
with some finite constants . Then
Setting , , note first that, by the Bernstein condition (34),
where and , that is, condition (63) is satisfied. Now an application of Lemma 8 yields for any
Define for some . Then
With this choice of ,
and therefore , where
Similarly, using , and assuming (38) and (39), we get , where
Proof of Lemma 3 The matrix is a sum of i.i.d. random matrices. Therefore, part (ii) of the lemma follows by direct application of the large deviations inequality of Nemirovski (2004).
To prove part (i) of the lemma, we use bounds on maximal eigenvalues of subgaussian matrices due to Mendelson, Pajor and Tomczak-Jaegermann (2007); see also Vershynin (2007). However, direct application of these bounds (based on the overall subgaussianity) does not lead to rates that are accurate enough for our purposes. We therefore need to refine the argument using the specific structure of the matrices. Note first that
where is the unit sphere in . Therefore, denoting by and the minimal -nets in Euclidean metric on and , respectively, we easily get
Now, [cf. Kolmogorov and Tikhomirov (1959)] so that by the union bound, for any ,
(64) |
It remains to bound the last probability in (64) for fixed . Let us fix some and introduce the random event
Note that , and consider the zero-mean random variables . We have . Furthermore,
Therefore, using Bernstein’s inequality and the condition we get
where is the complement of . We now bound the conditional probability
Note that conditionally on , the are independent zero-mean random variables with
where we used the fact that for . This and the Bernstein condition (34) yield that, for ,
with . Therefore, by Lemma 8, for we have
(66) |
For defined in (42) the last expression does not exceed . Together with (64) and (10), this proves the lemma.
Proof of Lemma 2 We act as in the proof of Lemma 3 but since the matrices are now deterministic, we do not need to introduce the event . By the definition of ,
for all . Hence, is a zero-mean Gaussian random variable with variance not larger than . Therefore,
For as in (41) the last expression does not exceed . Combining this with (64), we get the lemma.
Proof of Lemma 4 We proceed again as in the proof of Lemmas 3 and 2. Denote by the set of pairs such that (recall that all are distinct by assumption). Then
(67) |
for any . Hence, under the assumptions of part (i) of the lemma,
which does not exceed for defined in (44). Combining this with (64) we get part (i) of the lemma. To prove part (ii) we note that, as in the proof of Lemma 3, for . This and (67) yield
with . Therefore, by Lemma 8, we have
and we complete the proof of (ii) in the same way as in Lemmas 3 and 2.
Part (iii) follows by an application of Theorem 2.1, Tropp (2010), after replacing every by its self-adjoint dilation [see Paulsen (1986)].
For the proof of Lemma 5 we will need some notation. The th Schatten class of -matrices is denoted by , and we write
for the corresponding closed Schatten- unit ball in . For any pseudo-metric space and any , we define the covering number
In other words, is the smallest number of closed balls of radius in the metric needed to cover the set . We will sometimes write instead of if the metric is associated with the norm . The empirical norm corresponds to , that is, for all ,
Proof of Lemma 5 Let us first assume that . Since
the expression on the LHS of (46) is not greater than
Due to the linear dependence in of the -entropies of the quasi-convex Schatten class embeddings (cf. Corollary 7) and the fact that the required bound should be uniform in and in for , we introduced an additional weighting by . Now define
By the entropy bound of Corollary 7 and the uniform boundedness condition (5),
whence
(68) |
We remark that due to the order specification of in Corollary 7, the expression
(69) |
is uniformly bounded as long as stays uniformly bounded away from . Note that for the entropy integral on the LHS in (68) does not converge.
Claim 1.
For any , there exist constants and , such that for all , all and uniformly in and ,
(70) |
for all .
The bound is essentially stated in van de Geer (2000) as Lemma 3.2 [further referred to as VG(00)]. The constant in VG(00) depends neither on the -diameter of the function class nor on the function class itself and is valid, in particular, for , in the notation of VG(00). The uniformity in follows from the uniform boundedness of (69) over . The required case corresponds to in the notation of VG(00). Its proof follows by taking and applying the theorem of monotone convergence as , since the RHS of the inequality is independent of .
Claim 2.
For any , there exists a constant such that for any
(71) |
for all .
First, observe that
where the last inequality follows from . Define the decomposition of
(72) |
Then by peeling off the class , we obtain, together with Claim 1, for all
(73) |
with the definition
It remains to note that the last sum in (10) is bounded by uniformly in whenever for some suitable constant . This follows from the fact that
and the latter expression is bounded uniformly in .
In particular, the result reveals that the LHS of (46) is bounded by
(74) |
with probability at least for any .
We now use the following simple consequence of the concavity of the logarithm which is stated, for instance, in Tsybakov and van de Geer (2005) (Lemma 5).
Lemma 9
For any positive , and any , we have
where .
The case can be deduced from the above result by the following observation. For any matrix , define the extension with as follows: for , and otherwise. Then one easily checks that for all . Furthermore, and
Consequently, the result follows now from the already established proof for the case .
11 Entropy numbers for quasi-convex Schatten class embeddings
Here we derive bounds for the th entropy numbers of the embeddings for , where denotes the th Schatten class of real -matrices. Corresponding results for the -embeddings were first given by Edmunds and Triebel (1989), but their proof does not carry over to the Schatten spaces. Pajor (1998) provides bounds for the -embeddings in the convex case, . His approach is based on the trace duality (Hölder’s inequality for ) and the geometric formulation of Sudakov’s minoration
for some positive constant , with a -dimensional standard Gaussian vector and an arbitrary subset of . Here is the Euclidean norm in and is the corresponding scalar product. Guédon and Litvak (2000) derive a slightly sharper bound for the -embeddings than Edmunds and Triebel (1989) with a different technique. In addition, they prove lower bounds. We adjust their ideas concerning finite spaces to the nonconvex Schatten spaces.
We denote by the th entropy number of the embedding for , that is, the infimum of all such that there exist balls in of radius that cover . For the general definition of th entropy numbers of bounded linear operators between quasi-Banach spaces and , we refer to Edmunds and Triebel (1996).
Recall that a homogeneous nonnegative functional is called -quasi-norm, if it satisfies for all the inequality . Finally, any -norm is a -quasi-norm with [cf., e.g., Edmunds and Triebel (1996), page 2]. We will use the following lemma.
Lemma 10 ([Guédon and Litvak (2000)])
Assume that are symmetric -quasi-norms on for , and for some , is a quasi-norm on such that for all . Then for any quasi-normed space , any linear operator , and all integers and , we have
where stands for equipped with quasi-norm , .
Guédon and Litvak (2000) did not specify the notion of symmetry they used. So we have to clarify that here a (quasi-)norm is called symmetric if is isometrically isomorphic to a symmetrically (quasi-)normed operator ideal. This includes the diagonal operator spaces (finite ) as well as the Schatten spaces. The proof of Lemma 10 follows the lines of Pietsch (1980), Proposition 12.1.12, replacing the triangle inequality by the quasi-triangle inequality. Recall that the Schatten classes form interpolation couples like their commutative analogs .
Lemma 11 ((Interpolation inequality))
For , let be such that
Then, for all ,
Proof is immediate in view of the inequalities
valid for any nonnegative ’s.
Proposition 1 ((Entropy numbers))
Let , . Then there exists an absolute constant independent of and , such that for all integers and we have
with
The fact that is bounded by is obvious, since . Consider the other case. We start with and then extend the result to by interpolation. Fix some number and let be the smallest constant which satisfies, for all ,
(75) |
Let us show that is finite. Since , , can be viewed as a quasi-norm on (isomorphic to ), Lemma 10 applies with , , , and . This gives
(76) |
Here the factor 4 follows from the relations and . Now, (76) and the factorization theorem for entropy numbers of bounded linear operators between quasi-Banach spaces [see, e.g., Edmunds and Triebel (1996), page 8], with factorization via , leads to the bound
where for any , denotes the smallest integer which is larger or equal to . Proposition 5 of Pajor (1998) entails and hence
(78) |
with constants and independent of , and . Note that, in contrast to the -embedding, for which the th entropy numbers are bounded by with some and [see, e.g., Edmunds and Triebel (1996), page 98], we have in (78) not a logarithmic but linear dependence of in the upper bound. Plugging (75) and (78) into (11) yields
Thus, by definition of ,
which shows that is uniformly bounded in and . This proves the proposition for .
Consider now the case . In view of Lemma 11 with , we can apply Lemma 10 with , , , and . This yields
Corollary 7
For any , there exists a positive constant such that for all integers and any ,
Moreover, for .
Acknowledgment
We are grateful to Alain Pajor for pointing out the reference Guédon and Litvak (2000).
References
- (1) Abernethy, J., Bach, F., Evgeniou, T. and Vert, J.-P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res. 10 803–826.
- (2) Amini, A. and Wainwright, M. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921. \MR2541450
- (3) Argyriou, A., Evgeniou, T. and Pontil, M. (2008). Convex multi-task feature learning. Mach. Learn. 73 243–272.
- (4) Argyriou, A., Micchelli, C. A. and Pontil, M. (2010). On spectral learning. J. Mach. Learn. Res. To appear. \MR2600635
- (5) Argyriou, A., Micchelli, C. A., Pontil, M. and Ying, Y. (2008). A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20 (J. C. Platt et al., eds.) 25–32. MIT Press, Cambridge, MA.
- (6) Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res. 9 1019–1048. \MR2417263
- (7) Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227. \MR2387969
- (8) Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. \MR2533469
- (9) Bunea, F., She, Y. and Wegkamp, M. H. (2010). Optimal selection of reduced rank estimators of high-dimensional matrices. Available at arXiv:1004.2995.
- (10) Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. \MR2351101
- (11) Cai, T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 2118–2144. \MR2676885
- (12) Candès, E. J. and Plan, Y. (2010a). Matrix completion with noise. Proc. IEEE 98 925–936.
- (13) Candès, E. J. and Plan, Y. (2010b). Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy random measurements. Available at arXiv:1001.0339.
- (14) Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. Comput. Math. 9 717–772. \MR2565240
- (15) Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203–4215. \MR2243152
- (16) Candès, E. J. and Tao, T. (2009). The power of convex relaxation: Near-optimal matrix completion. Unpublished manuscript.
- (17) McCarthy, C. A. (1967). $c_p$. Israel J. Math. 5 249–272. \MR0225140
- (18) Dalalyan, A. and Tsybakov, A. (2008). Aggregation by exponential weighting, sharp oracle inequalities and sparsity. Mach. Learn. 72 39–61.
- (19) Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object. J. Roy. Statist. Soc. Ser. B 54 41–81. \MR1157714
- (20) Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Univ. Press, Cambridge. \MR1410258
- (21) Edmunds, D. E. and Triebel, H. (1989). Entropy numbers and approximation numbers in function spaces. Proc. London Math. Soc. 58 137–152. \MR0969551
- (22) Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975. \MR1329177
- (23) Guédon, O. and Litvak, A. E. (2000). Euclidean projections of a $p$-convex body. In Geometric Aspects of Functional Analysis, Israel Seminar (GAFA) 1996–2000 (V. D. Milman and G. Schechtman, eds.). Lecture Notes in Mathematics 1745 95–108. Springer, Berlin. \MR1796715
- (24) Gross, D. (2009). Recovering low-rank matrices from few coefficients in any basis. Available at arXiv:0910.1879.
- (25) Keshavan, R. H., Montanari, A. and Oh, S. (2009). Matrix completion from noisy entries. Available at arXiv:0906.2027.
- (26) Kolmogorov, A. N. and Tikhomirov, V. M. (1959). The $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspekhi Matem. Nauk 14 3–86. \MR0112032
- (27) Koltchinskii, V. (2008). Oracle inequalities in empirical risk minimization and sparse recovery problems. Ecole d’Eté de Probabilités de Saint-Flour, Lecture Notes. Preprint.
- (28) Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of COLT-2009.
- (29) Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2010). Oracle inequalities and optimal inference under group sparsity. Available at arXiv:1007.1771.
- (30) Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. \MR2278363
- (31) Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17 1248–1282. \MR2373017
- (32) Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, NIPS-2009.
- (33) Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low rank matrices with noise and high-dimensional scaling. Ann. Statist. To appear. Available at arXiv:0912.5100.
- (34) Nemirovski, A. (2004). Regular Banach spaces and large deviations of random sums. Unpublished manuscript.
- (35) Pajor, A. (1998). Metric entropy of the Grassmann manifold. Convex Geom. Anal. 34 181–188. \MR1665590
- (36) Paulsen, V. I. (1986). Completely Bounded Maps and Dilations. Pitman Research Notes in Mathematics 146. Longman, New York. \MR0868472
- (37) Pietsch, A. (1980). Operator Ideals. Elsevier, Amsterdam. \MR0582655
- (38) Pinelis, I. F. and Sakhanenko, A. I. (1985). Remarks on inequalities for the probabilities of large deviations. Theory Probab. Appl. 30 143–148. \MR0779438
- (39) Ravikumar, P., Wainwright, M., Raskutti, G. and Yu, B. (2008). High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Unpublished manuscript.
- (40) Recht, B. (2009). A simpler approach to matrix completion. Available at arXiv:0910.0651.
- (41) Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501. \MR2680543
- (42) Rigollet, P. and Tsybakov, A. B. (2010). Exponential screening and optimal rates of sparse estimation. Available at arXiv:1003.2654.
- (43) Rotfeld, S. Y. (1969). The singular numbers of the sum of completely continuous operators. In Topics in Mathematical Physics (M. S. Birman, ed.). Spectral Theory 3 73–78. English version published by Consultants Bureau, New York.
- (44) Srebro, N., Rennie, J. and Jaakkola, T. (2005). Maximum margin matrix factorization. In Advances in Neural Information Processing Systems 17 (L. Saul, Y. Weiss and L. Bottou, eds.) 1329–1336. MIT Press, Cambridge, MA.
- (45) Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Learning Theory, Proceedings of COLT-2005. Lecture Notes in Comput. Sci. 3559 545–560. Springer, Berlin. \MR2203286
- (46) Tropp, J. A. (2010). User-friendly tail bounds for sums of random matrices. Available at arXiv:1004.4389.
- (47) Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer, New York. \MR2724359
- (48) Tsybakov, A. and van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224. \MR2195633
- (49) van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge Univ. Press, Cambridge.
- (50) Vershynin, R. (2007). Some problems in asymptotic convex geometry and random matrices motivated by numerical algorithms. In Banach Spaces and Their Applications in Analysis (B. Randrianantoanina and N. Randrianantoanina, eds.) 209–218. de Gruyter, Berlin. \MR2374708
- (51) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541–2563. \MR2274449