Zeroth-order Low-rank Hessian Estimation via Matrix Recovery
Abstract
A zeroth-order Hessian estimator aims to recover the Hessian matrix of an objective function at any given point, using minimal finite-difference computations. This paper studies zeroth-order Hessian estimation for low-rank Hessians from a matrix recovery perspective. Our challenge lies in the fact that traditional matrix recovery techniques are not directly suitable for our scenario: they either demand incoherence assumptions (or variants thereof), or require an impractical number of finite-difference computations in our setting. To overcome these hurdles, we employ zeroth-order Hessian estimators aligned with proper matrix measurements, and prove new recovery guarantees for these estimators. More specifically, we prove that, for an $n \times n$ Hessian matrix $H$ of rank $r$, a proper set of zeroth-order finite-difference computations ensures exact recovery of $H$ with high probability. Compared to existing methods, our method greatly reduces the number of finite-difference computations and does not require any incoherence assumptions.
1 Introduction
In machine learning, optimization, and many other mathematical programming problems, the Hessian matrix plays an important role, since it describes the landscape of the objective function. However, in many real-world scenarios, although we can access function values, the lack of an analytic form for the objective function precludes direct Hessian computation. It is therefore important to develop zeroth-order finite-difference Hessian estimators, i.e., estimators of the Hessian matrix that rely only on function evaluations and finite differences.
Finite-difference Hessian estimation has a long history dating back to Newton's time. In recent years, the rise of large models and big data has made the high dimensionality of objective functions a primary challenge in finite-difference Hessian estimation. To address this, stochastic Hessian estimators, such as those of (Balasubramanian and Ghadimi, 2021; Wang, 2023; Feng and Wang, 2023; Li et al., 2023), have emerged to reduce the required number of function-value samples. The efficiency of a Hessian estimator is measured by its sample complexity, which quantifies the number of finite-difference computations needed.
Despite this high dimensionality, low-rank structure is prevalent in machine learning with high-dimensional datasets (Fefferman et al., 2016; Udell and Townsend, 2019). Numerous research directions, such as manifold learning (e.g., Ghojogh et al., 2023) and recommender systems (e.g., Resnick and Varian, 1997), actively leverage this low-rank structure. While there are many studies on stochastic Hessian estimators, as we detail in Section 1.4, none of them exploit the low-rank structure of the Hessian matrix. This omission can lead to overly conservative results and hinder the overall efficiency and effectiveness of the optimization or learning algorithms.
To fill this gap, in this work we develop an efficient finite-difference Hessian estimation method for low-rank Hessians via matrix recovery. While a substantial body of literature studies the sample complexity of low-rank matrix recovery, we emphasize that none of it is directly applicable to our scenario, either because of an overly restrictive global incoherence assumption or because of a prohibitively large number of finite-difference computations, as we discuss in detail in Section 1.2. We develop a new method and prove that, without any incoherence assumption, an $n \times n$ Hessian matrix of rank $r$ can be exactly recovered with high probability from a proper set of zeroth-order finite-difference computations.
In the rest of this section, we present our problem formulation, discuss why existing matrix recovery methods fail on our problem, and summarize our contributions.
1.1 Hessian Estimation via Compressed Sensing Formulation
To recover an $n \times n$ low-rank Hessian matrix using finite-difference operations, we use the following trace-norm minimization approach (Fazel, 2002; Recht et al., 2010; Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012):
(1) $\displaystyle \min_{X \in \mathbb{R}^{n \times n}} \ \|X\|_{*} \quad \text{subject to} \quad \mathcal{M}(X) = \mathcal{M}(H),$
where $H = \nabla^{2} f(x)$ is the Hessian to be recovered and $\mathcal{M}$ is a matrix measurement operation that can be obtained via finite-difference computations. For our problem, it is worth emphasizing that $\mathcal{M}$ must satisfy the following requirements.
• (R1) The measurement operation $\mathcal{M}$ must be different from the entrywise sampling operation used for matrix completion; otherwise an incoherence assumption is needed. See (M1) in Section 1.2 for more details.
• (R2) The measurement operation $\mathcal{M}$ cannot involve the inner product between the Hessian matrix and a general measurement matrix, since this operation cannot be efficiently obtained through finite-difference computations. See (M2) in Section 1.2 for more details.
Due to the above two requirements, existing theory for matrix recovery fails to provide satisfactory guarantees for low-rank Hessian estimation.
1.2 Existing Matrix Recovery Methods
Existing methods for low-rank matrix recovery can be divided into two categories: matrix completion methods, and matrix recovery via linear measurements (matrix-regression-type methods). Unfortunately, both groups of methods are unsuitable for Hessian estimation tasks.
(M1) Matrix completion methods: A candidate class of methods for low-rank Hessian estimation is matrix completion (Fazel, 2002; Cai et al., 2010; Candes and Plan, 2010; Candès and Tao, 2010; Keshavan et al., 2010; Lee and Bresler, 2010; Fornasier et al., 2011; Gross, 2011; Recht, 2011; Candes and Recht, 2012; Hu et al., 2012; Mohan and Fazel, 2012; Negahban and Wainwright, 2012; Wen et al., 2012; Vandereycken, 2013; Wang et al., 2014; Chen, 2015; Tanner and Wei, 2016; Gotoh et al., 2018; Chen et al., 2020; Ahn et al., 2023).
The motivation for matrix completion tasks originated from the Netflix prize, where the challenge was to predict the ratings of all users on all movies based on observing only the ratings of some users on some movies. In order to tackle such problems, it is necessary to assume that the nontrivial singular vectors of the matrix and the observation basis are “incoherent”. Incoherence (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012; Chen, 2015; Negahban and Wainwright, 2012), or its alternatives (e.g., Negahban and Wainwright, 2012), implies that there is a sufficiently large angle between the singular vectors and the observation basis. The rationale behind this assumption can be explained as follows. Consider an $n \times n$ matrix with a one in a single entry and zeros elsewhere. If we randomly observe a small fraction of the entries, it is highly likely that we will miss the nonzero entry, making it difficult to fully recover the matrix. Therefore, an incoherence parameter is assumed between the given canonical basis and the singular vectors of the target matrix, as illustrated in Figure 1. In the context of zeroth-order optimization, it is often necessary to recover the Hessian at any given point. However, assuming that the Hessian is incoherent with the given basis at all points in the domain is overly restrictive.
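To make the sampling difficulty concrete, here is a minimal Monte Carlo sketch in Python (the matrix size and number of observed entries are arbitrary illustrative choices, not quantities from this paper): it estimates how often uniform entrywise sampling even observes the single nonzero entry of such a spiky matrix. A completion method that never sees that entry has no information about it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 50, 500, 2000   # matrix size, observed entries per trial, trials

observed = 0
for _ in range(trials):
    i, j = rng.integers(n, size=2)      # location of the single nonzero entry
    rows = rng.integers(n, size=m)      # m entries sampled uniformly at random
    cols = rng.integers(n, size=m)      # (with replacement, for simplicity)
    observed += bool(np.any((rows == i) & (cols == j)))

print(f"observed the nonzero entry in {observed / trials:.1%} of trials")
# With m = 500 of the n^2 = 2500 entries sampled, the nonzero entry is seen
# in only about 1 - (1 - 1/n^2)^m of the trials, roughly 18% here.
```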
(M2) Matrix recovery via linear measurements (matrix-regression-type recovery): In the context of matrix recovery using linear measurements (Tan et al., 2011; Eldar et al., 2012; Chandrasekaran et al., 2012; Rong et al., 2021), we observe the inner products of the target matrix with a set of measurement matrices, and the goal is to recover the target matrix from these observations. In certain scenarios, there may be additional constraints on the target matrix, and the measurements might be corrupted by noise (Rohde and Tsybakov, 2011; Fan et al., 2021; Xiaojun Mao and Wong, 2019), a setting that has received more attention from the statistics community. Eldar et al. (2012) proved that when the entries of the measurement matrices are independently and identically distributed (i.i.d.) Gaussian, sufficiently many linear measurements ensure exact recovery of the target matrix. Rong et al. (2021) showed that exact recovery is likewise guaranteed when the measurement matrices are drawn from an absolutely continuous distribution.
Despite these elegant results on matrix recovery from linear measurements, they are not applicable to Hessian estimation tasks. This limitation arises from the fact that a general linear measurement cannot be approximated by a zeroth-order estimator. To further illustrate this fact, consider the Taylor approximation, which, by the fundamental theorem of calculus, is the foundation of zeroth-order estimation. In the Taylor approximation of $f$ at $x$, the Hessian matrix only ever appears through bilinear forms $u^{\top} \nabla^{2} f(x)\, v$ of the perturbation directions. Therefore, a linear measurement $\langle A, \nabla^{2} f(x) \rangle$ for a general measurement matrix $A$ cannot be read off from a Taylor approximation of $f$ at $x$. In the language of optimization and numerical analysis, for a general measurement matrix $A$, one linear measurement may require far more than a constant number of finite-difference computations. Consequently, the theory providing guarantees for linear measurements does not extend to zeroth-order Hessian estimation.
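To put a rough number on this gap, the sketch below uses a toy quadratic objective (the helper name `bilinear_fd` and all sizes are ad hoc choices for illustration): a single bilinear measurement costs four function evaluations, whereas a general linear measurement $\langle A, H \rangle$ has to be decomposed into roughly $\mathrm{rank}(A)$ bilinear forms, e.g. via the SVD of $A$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # a symmetric "Hessian"
f = lambda z: 0.5 * z @ H @ z                        # quadratic test objective
x0, delta = np.zeros(n), 1e-3

def bilinear_fd(f, x, u, v, delta):
    # u^T (Hessian of f at x) v from exactly four function evaluations.
    return (f(x + delta*u + delta*v) - f(x + delta*u - delta*v)
            - f(x - delta*u + delta*v) + f(x - delta*u - delta*v)) / (4 * delta**2)

# One rank-one bilinear measurement: 4 evaluations.
u, v = rng.standard_normal(n), rng.standard_normal(n)
print(abs(bilinear_fd(f, x0, u, v, delta) - u @ H @ v))   # exact up to rounding

# A general linear measurement <A, H>: decompose A = sum_k s_k a_k b_k^T (SVD)
# and pay four evaluations per rank-one term, i.e. about 4 * rank(A) in total.
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)
est = sum(s[k] * bilinear_fd(f, x0, U[:, k], Vt[k], delta) for k in range(n))
print(abs(est - np.sum(A * H)), "from", 4 * n, "function evaluations")
```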
1.3 Our Contribution
In this paper, we introduce a low-rank Hessian estimation mechanism that simultaneously satisfies (R1) and (R2). More specifically,
• We prove that a proper finite-difference scheme guarantees exact recovery of the low-rank Hessian matrix with high probability, while requiring far fewer finite-difference computations than existing approaches. Our approach simultaneously overcomes the limitations of (M1) and (M2).
In the realm of zeroth-order Hessian estimation, no prior art provides high-probability guarantees for low-rank Hessian estimation tasks; see Section 1.4 for more discussion.
1.4 Prior Art on Hessian Estimation
Zeroth-order Hessian estimation dates back to the birth of calculus. In recent years, researchers from various fields have contributed to this topic (e.g., Broyden et al., 1973; Fletcher, 2000; Spall, 2000; Balasubramanian and Ghadimi, 2021; Li et al., 2023).
In quasi-Newton-type methods (e.g., Goldfarb, 1970; Shanno, 1970; Broyden et al., 1973; Ren-Pu and Powell, 1983; Davidon, 1991; Fletcher, 2000; Spall, 2000; Xu and Zhang, 2001; Rodomanov and Nesterov, 2022), gradient-based Hessian estimators were used in iterative optimization algorithms. Based on Stein's identity (Stein, 1981), Balasubramanian and Ghadimi (2021) introduced a Stein-type Hessian estimator and combined it with the cubic-regularized Newton method (Nesterov and Polyak, 2006) for non-convex optimization. Li et al. (2023) generalized the Stein-type Hessian estimators to Riemannian manifolds. In parallel to (Balasubramanian and Ghadimi, 2021; Li et al., 2023), Wang (2023) and Feng and Wang (2023) investigated the Hessian estimator that inspires the current work.
Yet, prior to our work, no method from the zeroth-order Hessian estimation literature focused on low-rank Hessian estimation.
2 Notations and Conventions
Before proceeding to the main results, we lay out some conventions and notations that will be used throughout the paper. We use the following notations for matrix norms:
• $\| \cdot \|_{\mathrm{op}}$ is the operator norm (Schatten-$\infty$ norm);
• $\| \cdot \|_{F}$ is the Euclidean (Frobenius) norm (Schatten-$2$ norm);
• $\| \cdot \|_{*}$ is the trace norm (Schatten-$1$ norm).
The norm notation is also overloaded for vectors and tensors. For a vector $v$, $\|v\|$ is its Euclidean norm; for a higher-order tensor, $\|\cdot\|$ denotes the corresponding Schatten norm. For any matrix $A$ with singular value decomposition $A = U \Sigma V^{\top}$ and a scalar function $f$, we define $f(A) := U f(\Sigma) V^{\top}$, where $f(\Sigma)$ applies $f$ to each diagonal entry of $\Sigma$.
For a vector and a positive number , we define notations
Also, we use $c$ and $C$ to denote unimportant absolute constants that do not depend on the dimension $n$ or the rank $r$. The constants $c$ and $C$ may or may not take the same value at each occurrence.
3 Main Results
We start with a finite-difference scheme that can be viewed as a matrix measurement operation. The Hessian of a function $f$ at a given point $x$ can be estimated as follows (Wang, 2023; Feng and Wang, 2023):
(2) $\displaystyle \widehat{H}_{\delta}(x; u, v) \;=\; \frac{f(x + \delta u + \delta v) - f(x + \delta u - \delta v) - f(x - \delta u + \delta v) + f(x - \delta u - \delta v)}{4 \delta^{2}} \cdot \frac{u v^{\top} + v u^{\top}}{2},$
where $\delta > 0$ is the finite-difference granularity, and $u, v$ are finite-difference directions. Different choices of the laws of $u$ and $v$ lead to different Hessian estimators. For example, $u$ and $v$ can be independent vectors uniformly distributed over the canonical basis $\{ e_1, e_2, \ldots, e_n \}$.
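As a concrete sketch, one sample of an estimator of the form (2) can be computed from four function evaluations; the helper below is hypothetical (names and the test function are ad hoc), and both of the direction laws mentioned above are shown.

```python
import numpy as np

def hessian_sample(f, x, u, v, delta):
    """One sample of the finite-difference Hessian estimator in (2): a four-point
    difference quotient (approximating u^T Hess f(x) v) times the symmetric
    rank-<=2 direction matrix (u v^T + v u^T) / 2."""
    quot = (f(x + delta*u + delta*v) - f(x + delta*u - delta*v)
            - f(x - delta*u + delta*v) + f(x - delta*u - delta*v)) / (4 * delta**2)
    return quot * 0.5 * (np.outer(u, v) + np.outer(v, u))

rng = np.random.default_rng(0)
n, delta = 10, 1e-4
x = rng.standard_normal(n)
f = lambda z: np.exp(0.1 * z[0] * z[1]) + z @ z      # smooth test function

# Law 1: u, v independent and uniform over the canonical basis {e_1, ..., e_n}.
E = np.eye(n)
sample_basis = hessian_sample(f, x, E[rng.integers(n)], E[rng.integers(n)], delta)

# Law 2: u, v independent and uniform on the unit sphere.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
sample_sphere = hessian_sample(f, x, u, v, delta)

# Each sample is a symmetric matrix of rank at most two, obtained without any
# gradient or Hessian oracle.
print(np.linalg.matrix_rank(sample_basis), np.linalg.matrix_rank(sample_sphere))
```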
We start our discussion by showing that the Hessian estimator (2) can indeed be viewed as a matrix measurement.
Proposition 1.
Consider the estimator defined in (2). Let the underlying function $f$ be twice continuously differentiable, and let $u, v$ be two random direction vectors. Then, for any fixed $x$,
$$\widehat{H}_{\delta}(x; u, v) \;\Longrightarrow\; \big( u^{\top} \nabla^{2} f(x)\, v \big)\, \frac{u v^{\top} + v u^{\top}}{2} \qquad \text{as } \delta \to 0,$$
where $\Longrightarrow$ denotes convergence in distribution.
Proof.
By Taylor's theorem (with integral form of the remainder) and the symmetry of the Hessian matrix, we have
$$f(x + \delta u + \delta v) - f(x + \delta u - \delta v) - f(x - \delta u + \delta v) + f(x - \delta u - \delta v) \;=\; 4 \delta^{2}\, u^{\top} \nabla^{2} f(x)\, v + o(\delta^{2}).$$
As $\delta \to 0$, the estimator (2) therefore converges to $\big( u^{\top} \nabla^{2} f(x)\, v \big)\, \frac{u v^{\top} + v u^{\top}}{2}$ in distribution.
∎
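A quick numerical sanity check of Proposition 1 (with an arbitrary smooth test function): the difference quotient in (2) approaches the bilinear form $u^{\top} \nabla^{2} f(x)\, v$ at the expected $O(\delta^{2})$ rate as $\delta$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
x = rng.standard_normal(n)
f = lambda z: np.sum(np.cos(z)) + 0.5 * np.sum(z) ** 2   # smooth test function
true_hess = np.diag(-np.cos(x)) + np.ones((n, n))        # its exact Hessian at x

u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
target = u @ true_hess @ v

for delta in [1e-1, 1e-2, 1e-3]:
    quot = (f(x + delta*u + delta*v) - f(x + delta*u - delta*v)
            - f(x - delta*u + delta*v) + f(x - delta*u - delta*v)) / (4 * delta**2)
    print(delta, abs(quot - target))    # error shrinks roughly like delta^2
```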
With Proposition 1 in place, we see that matrix measurements of the form
$$\mathcal{M}(H) \;=\; \big( u_i^{\top} H\, v_i \big)_{i=1}^{m}$$
for direction vectors $u_1, v_1, \ldots, u_m, v_m$ can be efficiently computed via finite-difference computations. For the convex program (1) with sampling operators of the above form, we have the following guarantee.
Theorem 1.
As a direct consequence of Theorem 1, we have the following result.
Corollary 1.
Let the finite-difference granularity $\delta$ be small, let $x \in \mathbb{R}^{n}$, and let $f$ be twice continuously differentiable. Suppose there exists a matrix $H$ with $\mathrm{rank}(H) \le r$ such that $\nabla^{2} f(x) = H$, and suppose the estimator (2) uses directions $u, v$ that are independent and uniformly distributed on the unit sphere, so that their law is rotation- and reflection-invariant in the sense of distributional equivalence. Then there exists an absolute constant $C$ such that, if sufficiently many zeroth-order finite-difference computations are obtained, then with high probability the solution to (1) is exactly $H$.
By Proposition 1, we know that, as $\delta \to 0$, the finite-difference measurements converge in distribution to the corresponding bilinear measurements of the Hessian. Therefore, Corollary 1 implies that the estimator (2), together with the convex program (1), provides a sample-efficient low-rank Hessian estimator. Corollary 1 also implies a guarantee for approximately low-rank Hessians.
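The following end-to-end sketch mirrors the pipeline suggested by Corollary 1 on a synthetic quadratic whose Hessian has rank $r$. It assumes the third-party cvxpy package for the trace-norm program (1); the problem size and the number of measurements are illustrative choices, not the constants from Theorem 1.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r, m, delta = 15, 2, 120, 1e-4

# Synthetic objective with a rank-r Hessian H = B B^T.
B = rng.standard_normal((n, r))
H = B @ B.T
f = lambda z: 0.5 * z @ H @ z
x0 = rng.standard_normal(n)

def bilinear_fd(f, x, u, v, delta):
    # Four function evaluations approximating u^T Hess f(x) v.
    return (f(x + delta*u + delta*v) - f(x + delta*u - delta*v)
            - f(x - delta*u + delta*v) + f(x - delta*u - delta*v)) / (4 * delta**2)

# m finite-difference measurements with spherical directions (4m evaluations).
U = rng.standard_normal((m, n)); U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.standard_normal((m, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
b = np.array([bilinear_fd(f, x0, U[i], V[i], delta) for i in range(m)])

# Trace-norm minimization subject to matching the measurements, as in (1).
X = cp.Variable((n, n), symmetric=True)
constraints = [cp.sum(cp.multiply(np.outer(U[i], V[i]), X)) == b[i] for i in range(m)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

print("relative recovery error:",
      np.linalg.norm(X.value - H) / np.linalg.norm(H))   # small, up to solver accuracy
```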
3.1 Preparations
To describe the recovery argument for a symmetric low-rank matrix $H$ with $\mathrm{rank}(H) = r$, we consider the eigenvalue decomposition $H = U \Lambda U^{\top}$ (with $U \in \mathbb{R}^{n \times r}$ and $\Lambda \in \mathbb{R}^{r \times r}$ diagonal), and the subspace $T$ of $\mathbb{R}^{n \times n}$ defined by
$$T := \big\{ P_U Z + Z' P_U \;:\; Z, Z' \in \mathbb{R}^{n \times n} \big\},$$
where $P_U := U U^{\top}$ is the orthogonal projection onto the column space of $U$. We also define the orthogonal projection $\mathcal{P}_T$ onto $T$:
$$\mathcal{P}_T(Z) := P_U Z + Z P_U - P_U Z P_U, \qquad Z \in \mathbb{R}^{n \times n}.$$
Let $X^{\sharp}$ be the solution of (1) and let $\Delta := X^{\sharp} - H$. We start with the following lemma, which can be extracted from the matrix completion literature (e.g., Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).
Lemma 1.
Proof.
Since $\|X^{\sharp}\|_{*} \le \|H\|_{*}$, we have the chain of inequalities (4)–(5), where the first inequality uses the “pinching” inequality (Exercises II.5.4 and II.5.5 in Bhatia, 1997).
Continuing the above computation, we get (6). On the second line of (6), we use Hölder's inequality; on the third line, we use a standard norm inequality that holds for any real matrix.
3.2 The High Level Roadmap
With the estimator (2) and Lemma 1 in place, we are ready to present the high-level roadmap of our argument. At a high level, the rest of the paper aims to establish the following two claims:
• (A1): With high probability, the sampling operator associated with the measurements (2) is well conditioned when restricted to the subspace $T$.
• (A2): With high probability, there exists an (approximate) dual certificate for the program (1).
Once (A1) and (A2) are in place, we can quickly prove Theorem 1.
Sketch of proof of Theorem 1 with (A1) and (A2) assumed.
Now, by Lemma 1 and (A1), we have, with high probability, a bound that forces $\mathcal{P}_{T^{\perp}} \Delta = 0$. Another application of (A1) then implies $\mathcal{P}_{T} \Delta = 0$ w.h.p., so that $X^{\sharp} = H$, which concludes the proof. ∎
Therefore, the core argument reduces to proving (A1) and (A2). In the next subsection, we prove (A1) and (A2) for the random measurements obtained by the Hessian estimator (2), without any incoherence-type assumptions.
3.3 The Concentration Arguments
For the concentration argument, we need several observations. One key observation is that the spherical measurements are rotation-invariant and reflection-invariant. More specifically, for the random measurement $\big( u^{\top} H v \big) \frac{u v^{\top} + v u^{\top}}{2}$ with $u, v$ independent and uniformly distributed on the unit sphere, we have
$$\big( u^{\top} H v \big)\, \frac{u v^{\top} + v u^{\top}}{2} \;\overset{d}{=}\; Q \Big( \big( u^{\top} Q^{\top} H Q\, v \big)\, \frac{u v^{\top} + v u^{\top}}{2} \Big) Q^{\top}$$
for any orthogonal matrix $Q$, where $\overset{d}{=}$ denotes distributional equivalence. With a properly chosen $Q$, we have $Q^{\top} H Q = \Lambda$, where $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $H$. This observation makes calculating the moments of the measurement operator possible. With the moments of the random matrices properly controlled, we can use the matrix-valued Cramér–Chernoff method to arrive at matrix concentration inequalities.
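This invariance is easy to check empirically. The sketch below compares a few moments of the scalar measurement $u^{\top} H v$ for a fixed symmetric matrix $H$ and for its diagonalization $\Lambda$ (the dimension and sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 6, 200_000

A = rng.standard_normal((n, n)); H = (A + A.T) / 2   # symmetric test matrix
evals, Q = np.linalg.eigh(H)                         # H = Q diag(evals) Q^T
Lam = np.diag(evals)

def sphere(size, k):
    Z = rng.standard_normal((size, k))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

U, V = sphere(trials, n), sphere(trials, n)
meas_H   = np.einsum('ti,ij,tj->t', U, H,   V)       # u^T H v
meas_Lam = np.einsum('ti,ij,tj->t', U, Lam, V)       # u^T Lambda v

# Rotation/reflection invariance of (u, v) implies the two measurement samples
# share the same distribution; their empirical moments should agree.
print(np.mean(meas_H**2), np.mean(meas_Lam**2))      # both near ||H||_F^2 / n^2
print(np.mean(meas_H**4), np.mean(meas_Lam**4))
```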
Another useful tool is the Kronecker product and the vectorization of matrices. Let $\mathrm{vec}(\cdot)$ be the vectorization operation of a matrix. Then, as per how the measurement is defined, we have, for any $X \in \mathbb{R}^{n \times n}$,
(7) $\quad \mathrm{vec}\Big( \big(u^{\top} X v\big)\, \tfrac{1}{2}\big(u v^{\top} + v u^{\top}\big) \Big) \;=\; \tfrac{1}{2}\big( v \otimes u + u \otimes v \big)\big( v \otimes u \big)^{\top} \mathrm{vec}(X).$
The above formula implies that the measurement map $X \mapsto \big(u^{\top} X v\big)\,\tfrac{1}{2}\big(u v^{\top} + v u^{\top}\big)$ can be represented as a matrix of size $n^{2} \times n^{2}$. Similarly, the averaged measurement operators used below can also be represented as matrices of size $n^{2} \times n^{2}$. Compared to the matrix completion problem, the vectorized representation and the Kronecker product play a more pronounced role in our case. The reason is again the absence of an incoherence-type assumption. More specifically, a vectorized representation is useful in controlling the cumulant generating function of the random matrices associated with the spherical measurements.
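Identity (7) and the resulting $n^{2} \times n^{2}$ representation can be verified directly with a column-stacking vectorization; the sketch below uses ad hoc variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))
u, v = rng.standard_normal(n), rng.standard_normal(n)

vec = lambda M: M.flatten(order='F')   # column-stacking vectorization

# Scalar part of (7): u^T X v = (v kron u)^T vec(X).
print(abs(u @ X @ v - np.kron(v, u) @ vec(X)))        # ~ machine precision

# The measurement operator X -> (u^T X v)(u v^T + v u^T)/2 as an n^2 x n^2 matrix.
M = 0.5 * np.outer(np.kron(v, u) + np.kron(u, v), np.kron(v, u))
direct = (u @ X @ v) * 0.5 * (np.outer(u, v) + np.outer(v, u))
print(np.linalg.norm(M @ vec(X) - vec(direct)))       # ~ machine precision
```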
Finally, some additional care is needed to properly control the higher moments of the measurements. Such care is showcased in the inequality stated below in Lemma 2. An easy upper bound for the left-hand side of (8) is available; however, such a crude bound would eventually cost an extra dimension-dependent factor in the final bound. Overall, tight control is needed in several places in order to obtain the final recovery bound in Theorem 1.
Lemma 2.
Let and be positive integers. Then it holds that
(8)
Proof.
Case I: . Note that
(9)
Since the function is concave, Jensen’s inequality gives
(10)
which implies
where the last inequality uses .
Case II: . For this case, we first show that the maximum of is attained when for all . To show this, suppose there exist and such that . Without loss of generality, let . Then
Therefore, we can increase the value of until for all . By the above argument, we have, for ,
Therefore, we have
∎
With all the above preparation in place, we next present Lemma 3, which is the key step leading to (A1).
Lemma 3.
Let
where and are regarded as matrices of size $n^{2} \times n^{2}$. Pick any . Then there exists some constant , such that when , it holds that .
The operators in Lemma 3 can be represented as matrices of size $n^{2} \times n^{2}$. Therefore, we can apply a matrix-valued Cramér–Chernoff-type argument (or matrix Laplace argument (Lieb, 1973)) to derive a concentration bound. A master matrix concentration inequality is presented in (Tropp, 2012; Tropp et al., 2015); this result is stated below in Theorem 2.
Theorem 2 (Tropp et al., (2015)).
Consider a finite sequence of independent, random, Hermitian matrices of the same size. Then for all ,
and
For our purpose, a more convenient form is the matrix concentration inequality with Bernstein’s conditions on the moments. Such results may be viewed as corollaries to Theorem 2, and a version is stated below in Theorem 3.
Theorem 3 (Zhu, (2012); Zhang et al., (2014)).
If all matrices in a finite sequence of independent, random, self-adjoint matrices of a common dimension satisfy Bernstein's moment condition, i.e.,
where is a positive constant and is a positive semi-definite matrix, then,
for each .
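As a small illustration of the type of bound Theorem 3 delivers, the sketch below uses the simpler rank-one matrices $u_i u_i^{\top}$ (not the exact operators analyzed in our proofs) and shows the operator-norm deviation of their empirical average from its mean $I/n$ shrinking at roughly the rate predicted by Bernstein-type inequalities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 50

def sphere(size, k, rng):
    Z = rng.standard_normal((size, k))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

for m in [100, 400, 1600, 6400]:
    devs = []
    for _ in range(trials):
        U = sphere(m, n, rng)
        S = U.T @ U / m                               # average of u_i u_i^T; mean I/n
        devs.append(np.linalg.norm(S - np.eye(n) / n, ord=2))
    # Matrix Bernstein-type bounds predict deviations of order sqrt(log(n)/(n m));
    # quadrupling m roughly halves the observed deviation.
    print(m, np.mean(devs))
```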
Another useful ingredient is the moment formula for coordinates of spherical random vectors, stated below in Proposition 2. The proof of Proposition 2 is in the Appendix.
Proposition 2.
Let $u = (u_1, \ldots, u_n)$ be uniformly sampled from the unit sphere $\mathbb{S}^{n-1}$ ($n \ge 2$). It holds that
$$\mathbb{E}\big[ u_i^{p} \big] \;=\; \frac{(p-1)!!}{n (n+2) \cdots (n+p-2)}$$
for all $i \in \{1, \ldots, n\}$ and any positive even integer $p$.
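The moment formula above (the standard closed form for coordinates of a uniform spherical vector) can be checked by simulation; the quick Monte Carlo sketch below uses an arbitrary dimension and sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 7, 500_000

Z = rng.standard_normal((samples, n))
U = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # uniform on the unit sphere

def closed_form(n, p):
    # (p - 1)!! / (n (n + 2) ... (n + p - 2)) for even p.
    num = np.prod(np.arange(1, p, 2, dtype=float))
    den = np.prod(np.arange(n, n + p - 1, 2, dtype=float))
    return num / den

for p in [2, 4, 6]:
    print(p, np.mean(U[:, 0] ** p), closed_form(n, p))   # empirical vs closed form
```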
With the above results in place, we can now prove Lemma 3.
Proof of Lemma 3.
Fix , and let for some absolute constant . Following reasoning similar to that for (7), we can represent as
(11)
where .
Thus, by viewing and as matrices of size $n^{2} \times n^{2}$, we have
Let be an orthogonal matrix such that
Since the distributions of and are rotation-invariant and reflection-invariant, we know
(12)
where denotes distributional equivalence.
Therefore, it suffices to study the distribution of
For simplicity, introduce notation
and we have
For simplicity, introduce
Next we will show that the averages of independent copies of these quantities concentrate around their respective expectations. To do this, we bound the moments and apply Theorem 3.
Bounding and . The second moment of is
where the last inequality follows from Proposition 2. Thus the centered second moment is bounded by
For , we have
which, by the operator Jensen inequality, implies
When applying the operator Jensen inequality, we use as the decomposition of the identity.
Let be independent copies of . Since , Theorem 3 implies that
(13)
The bound for follows similarly. Let be independent copies of , and we have
(14)
The -th power of is
and the -th power of is
Thus by Proposition 2, we have
For (), we notice that
since these terms only involve odd powers of the entries of . Therefore
(15)
Let be independent copies of , and for some absolute constant . By (15), we know , and all of the above moments are centered moments. Now we apply Theorem 3 to conclude that
(16)
where is the orthogonal matrix as introduced in (12). We take a union bound over (13), (14) and (16) to conclude the proof.
∎
Lemma 4.
Suppose is true. Let be the solution of the constrained optimization problem, and let . Then .
Proof.
Represent as a matrix of size $n^{2} \times n^{2}$. Let be defined as a canonical matrix function. That is, and share the same eigenvectors, and the eigenvalues of are the square roots of the eigenvalues of . Clearly,
(17)
We also have
Next we turn to prove (A2), whose core argument relies on Lemma 5.
Lemma 5.
Let be fixed. Pick any . Then there exists a constant , such that when , it holds that
Proof.
There exists an orthogonal matrix , such that , where
is a diagonal matrix consisting of the eigenvalues of . Let be the operator defined as in (11); we will study its behavior and then apply Theorem 3. Since the distribution of is rotation-invariant and reflection-invariant, we have
where denotes distributional equivalence. Thus it suffices to study the behavior of . For the matrix , we consider
Next we study the moments of the measurement. Its second power can be computed directly, and by Proposition 2 we have
and similarly, . For even moments of , we first compute and for . For this, we have
(19)
where the last inequality uses the fact that expectations of odd powers of or are zero. Note that
(20)
where the inequality on the last line uses Lemma 2. Now we combine (19) and (20) to obtain
(21)
(22)
where the inequality on the last line uses Proposition 2. Similarly, we have
Therefore, we have obtained a bound on even moments of :
for , and thus a bound on the centered even moments of :
Next we upper bound the odd moments of . Since
it suffices to study and . Since
using the arguments leading to (22), we have
(23)
Since , the above two inequalities in (23) imply
and thus
Now we have established moment bounds for , thus also for . From here we apply Theorem 3 to conclude the proof.
∎
The next lemma will essentially establish (A2). This argument relies on the existence of a dual certificate (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).
Lemma 6.
Pick . Define
Let . Let for some constant . If for some constant , then .
Proof.
Following Gross (2011), we define random projectors () such that
Then define
From the above definition, we have
Now we apply Lemma 3 to , and get, when event is true for all ,
(24)
Note that with probability exceeding , is true for all , . Since are mutually independent, is independent of for each . In view of this, we can apply Lemma 5 to followed by a union bound, and get, with probability exceeding ,
(25)
∎
Now we are ready to prove Theorem 1.
4 Conclusion
In this paper, we consider the Hessian estimation problem via matrix recovery techniques. In particular, we show that the finite-difference method studied in (Feng and Wang, 2023; Wang, 2023), together with a convex program, guarantees a high-probability exact recovery of a rank-$r$ Hessian using a number of finite-difference operations that is, up to logarithmic and constant factors, far smaller than the $n^{2}$ evaluations needed to estimate every entry. Compared to matrix completion methods, we do not assume any incoherence between the coordinate system and the hidden singular space of the Hessian matrix. In a follow-up work, we apply this Hessian estimation mechanism to the cubic-regularized Newton method (Nesterov and Polyak, 2006; Nesterov, 2008) and design sample-efficient optimization algorithms for functions with (approximately) low-rank Hessians.
Acknowledgement
The authors thank Dr. Hehui Wu for insightful discussions and his contributions to Lemma 2, and Dr. Abiy Tasissa for helpful discussions.
References
- Ahn et al., (2023) Ahn, J., Elmahdy, A., Mohajer, S., and Suh, C. (2023). On the fundamental limits of matrix completion: Leveraging hierarchical similarity graphs. IEEE Transactions on Information Theory, pages 1–1.
- Balasubramanian and Ghadimi, (2021) Balasubramanian, K. and Ghadimi, S. (2021). Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42.
- Bhatia, (1997) Bhatia, R. (1997). Matrix analysis. Graduate Texts in Mathematics.
- Broyden et al., (1973) Broyden, C. G., Dennis Jr, J. E., and Moré, J. J. (1973). On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223–245.
- Cai et al., (2010) Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982.
- Candes and Recht, (2012) Candes, E. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.
- Candes and Plan, (2010) Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.
- Candès and Tao, (2010) Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
- Chandrasekaran et al., (2012) Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849.
- Chen, (2015) Chen, Y. (2015). Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923.
- Chen et al., (2020) Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098–3121.
- Davidon, (1991) Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on optimization, 1(1):1–17.
- Eldar et al., (2012) Eldar, Y., Needell, D., and Plan, Y. (2012). Uniqueness conditions for low-rank matrix recovery. Applied and Computational Harmonic Analysis, 33(2):309–314.
- Fan et al., (2021) Fan, J., Wang, W., and Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. The Annals of Statistics, 49(3):1239 – 1266.
- Fazel, (2002) Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, Stanford University.
- Fefferman et al., (2016) Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049.
- Feng and Wang, (2023) Feng, Y. and Wang, T. (2023). Stochastic zeroth-order gradient and Hessian estimators: variance reduction and refined bias bounds. Information and Inference: A Journal of the IMA, 12(3):1514–1545.
- Fletcher, (2000) Fletcher, R. (2000). Practical methods of optimization. John Wiley & Sons.
- Fornasier et al., (2011) Fornasier, M., Rauhut, H., and Ward, R. (2011). Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640.
- Ghojogh et al., (2023) Ghojogh, B., Crowley, M., Karray, F., and Ghodsi, A. (2023). Elements of dimensionality reduction and manifold learning. Springer Nature.
- Goldfarb, (1970) Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23–26.
- Gotoh et al., (2018) Gotoh, J.-y., Takeda, A., and Tono, K. (2018). DC formulations and algorithms for sparse optimization problems. Mathematical Programming, 169(1):141–176.
- Gross, (2011) Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566.
- Hu et al., (2012) Hu, Y., Zhang, D., Ye, J., Li, X., and He, X. (2012). Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE transactions on pattern analysis and machine intelligence, 35(9):2117–2130.
- Keshavan et al., (2010) Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998.
- Lee and Bresler, (2010) Lee, K. and Bresler, Y. (2010). Admira: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory, 56(9):4402–4416.
- Li et al., (2023) Li, J., Balasubramanian, K., and Ma, S. (2023). Stochastic zeroth-order Riemannian derivative estimation and optimization. Mathematics of Operations Research, 48(2):1183–1211.
- Lieb, (1973) Lieb, E. H. (1973). Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11(3):267–288.
- Mohan and Fazel, (2012) Mohan, K. and Fazel, M. (2012). Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441–3473.
- Negahban and Wainwright, (2012) Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697.
- Nesterov, (2008) Nesterov, Y. (2008). Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181.
- Nesterov and Polyak, (2006) Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.
- Recht, (2011) Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).
- Recht et al., (2010) Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.
- Ren-Pu and Powell, (1983) Ren-Pu, G. and Powell, M. J. (1983). The convergence of variable metric matrices in unconstrained optimization. Mathematical programming, 27:123–143.
- Resnick and Varian, (1997) Resnick, P. and Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3):56–58.
- Rodomanov and Nesterov, (2022) Rodomanov, A. and Nesterov, Y. (2022). Rates of superlinear convergence for classical quasi-Newton methods. Mathematical Programming, 194(1):159–190.
- Rohde and Tsybakov, (2011) Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887 – 930.
- Rong et al., (2021) Rong, Y., Wang, Y., and Xu, Z. (2021). Almost everywhere injectivity conditions for the matrix recovery problem. Applied and Computational Harmonic Analysis, 50:386–400.
- Shanno, (1970) Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of computation, 24(111):647–656.
- Spall, (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
- Stein, (1981) Stein, C. M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151.
- Tan et al., (2011) Tan, V. Y., Balzano, L., and Draper, S. C. (2011). Rank minimization over finite fields: Fundamental limits and coding-theoretic interpretations. IEEE transactions on information theory, 58(4):2018–2039.
- Tanner and Wei, (2016) Tanner, J. and Wei, K. (2016). Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis, 40(2):417–429.
- Tropp, (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12:389–434.
- Tropp et al., (2015) Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230.
- Udell and Townsend, (2019) Udell, M. and Townsend, A. (2019). Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160.
- Vandereycken, (2013) Vandereycken, B. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236.
- Wang, (2023) Wang, T. (2023). On sharp stochastic zeroth-order Hessian estimators over Riemannian manifolds. Information and Inference: A Journal of the IMA, 12(2):787–813.
- Wang et al., (2014) Wang, Z., Lai, M.-J., Lu, Z., Fan, W., Davulcu, H., and Ye, J. (2014). Rank-one matrix pursuit for matrix completion. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 91–99, Bejing, China. PMLR.
- Wen et al., (2012) Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.
- Xiaojun Mao and Wong, (2019) Xiaojun Mao, S. X. C. and Wong, R. K. W. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525):198–210.
- Xu and Zhang, (2001) Xu, C. and Zhang, J. (2001). A survey of quasi-Newton equations and quasi-Newton methods for optimization. Annals of Operations Research, 103:213–234.
- Zhang et al., (2014) Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2014). Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316.
- Zhu, (2012) Zhu, S. (2012). A short note on the tail bound of Wishart distribution. arXiv preprint arXiv:1212.5860.
Appendix A Auxiliary Propositions and Lemmas
Proof of Proposition 2.
Let be the spherical coordinate system. We have, for any and an even integer ,
where is the surface area of . Let
Clearly, . By integration by parts, we have . The above two equations give .
Thus we have . We conclude the proof by symmetry.
∎