
Zeroth-order Low-rank Hessian Estimation via Matrix Recovery

Tianyu Wang (wangtianyu@fudan.edu.cn), Zicheng Wang (22110840011@m.fudan.edu.cn), Jiajia Yu (jiajia.yu@duke.edu)
Abstract

A zeroth-order Hessian estimator aims to recover the Hessian matrix of an objective function at any given point, using minimal finite-difference computations. This paper studies zeroth-order Hessian estimation for low-rank Hessians from a matrix recovery perspective. Our challenge lies in the fact that traditional matrix recovery techniques are not directly suitable for our scenario: they either demand incoherence assumptions (or variants thereof), or require an impractical number of finite-difference computations in our setting. To overcome these hurdles, we employ zeroth-order Hessian estimations aligned with proper matrix measurements, and prove new recovery guarantees for these estimators. More specifically, we prove that for a Hessian matrix $H\in\mathbb{R}^{n\times n}$ of rank $r$, $\mathcal{O}(nr^{2}\log^{2}n)$ proper zeroth-order finite-difference computations ensure an exact recovery of $H$ with high probability. Compared to existing methods, our method can greatly reduce the number of finite-difference computations, and does not require any incoherence assumptions.

1 Introduction

In machine learning, optimization, and many other mathematical programming problems, the Hessian matrix plays an important role since it describes the landscape of the objective function. However, in many real-world scenarios, although we can access function values, the lack of an analytic form for the objective function precludes direct Hessian computation. Therefore, it is important to develop zeroth-order finite-difference Hessian estimators, i.e., to estimate the Hessian matrix using only function evaluations and finite differences.

Finite-difference Hessian estimation has a long history dating back to Newton's time. In recent years, the rise of large models and big data has made the high dimensionality of objective functions a primary challenge in finite-difference Hessian estimation. To address this, stochastic Hessian estimators (Balasubramanian and Ghadimi, 2021; Wang, 2023; Feng and Wang, 2023; Li et al., 2023) have emerged to reduce the required number of function value samples. The efficiency of a Hessian estimator is measured by its sample complexity, which quantifies the number of finite-difference computations needed.

Despite this high dimensionality, low-rank structure is prevalent in machine learning with high-dimensional datasets (Fefferman et al., 2016; Udell and Townsend, 2019). Numerous research directions, such as manifold learning (e.g., Ghojogh et al., 2023) and recommender systems (e.g., Resnick and Varian, 1997), actively leverage this low-rank structure. While there are many studies on stochastic Hessian estimators, as we detail in Section 1.4, none of them exploit the low-rank structure of the Hessian matrix. This omission can lead to overly conservative results and hinder the overall efficiency and effectiveness of the optimization or learning algorithms.

To fill this gap, in this work we develop an efficient finite-difference Hessian estimation method for low-rank Hessians via matrix recovery. While a substantial body of literature studies the sample complexity of low-rank matrix recovery, we emphasize that none of it is directly applicable to our scenario, either due to an overly restrictive global incoherence assumption or a prohibitively large number of finite-difference computations, as we discuss in detail in Section 1.2. We develop a new method and prove that, without any incoherence assumption, for an $n\times n$ Hessian matrix of rank $r$, we can exactly recover the matrix with high probability from $\mathcal{O}(nr^{2}\log^{2}n)$ proper zeroth-order finite-difference computations.

In the rest of this section, we present our problem formulation, discuss why existing matrix recovery methods fail on our problem and summarize our contribution.

1.1 Hessian Estimation via Compressed Sensing Formulation

To recover an $n\times n$ low-rank Hessian matrix $H$ using $\ll n^{2}$ finite-difference operations, we use the following trace norm minimization approach (Fazel, 2002; Recht et al., 2010; Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012):

$$\min_{\widehat{H}\in\mathbb{R}^{n\times n}}\|\widehat{H}\|_{1},\quad\text{subject to}\quad\mathcal{S}\widehat{H}=\mathcal{S}H,\tag{1}$$

where $\mathcal{S}:=\frac{1}{M}\sum_{i=1}^{M}\mathcal{P}_{i}$ and each $\mathcal{P}_{i}$ is a matrix measurement operation that can be obtained via $\mathcal{O}(1)$ finite-difference computations. For our problem, it is worth emphasizing that $\mathcal{P}_{i}$ must satisfy the following requirements.

  • (R1) $\mathcal{P}_{i}$ is different from the sampling operation used for matrix completion. Otherwise an incoherence assumption is needed. See (M1) in Section 1.2 for more details.

  • (R2) $\mathcal{P}_{i}$ cannot involve the inner product between the Hessian matrix and a general matrix, since this operation cannot be efficiently obtained through finite-difference computations. See (M2) in Section 1.2 for more details.

Due to the above two requirements, existing theory for matrix recovery fails to provide satisfactory guarantees for low-rank Hessian estimation.

1.2 Existing Matrix Recovery Methods

Existing methods for low-rank matrix recovery can be divided into two categories: matrix completion methods, and matrix recovery via linear measurements (or matrix regression type method). Unfortunately, both groups of methods are unsuitable for Hessian estimation tasks.

(M1) Matrix completion methods: A candidate class of methods for low-rank Hessian estimation is matrix completion (Fazel, 2002; Cai et al., 2010; Candes and Plan, 2010; Candès and Tao, 2010; Keshavan et al., 2010; Lee and Bresler, 2010; Fornasier et al., 2011; Gross, 2011; Recht, 2011; Candes and Recht, 2012; Hu et al., 2012; Mohan and Fazel, 2012; Negahban and Wainwright, 2012; Wen et al., 2012; Vandereycken, 2013; Wang et al., 2014; Chen, 2015; Tanner and Wei, 2016; Gotoh et al., 2018; Chen et al., 2020; Ahn et al., 2023).

The motivation for matrix completion tasks originated from the Netflix prize, where the challenge was to predict the ratings of all users on all movies based on observing only the ratings of some users on some movies. In order to tackle such problems, it is necessary to assume that the nontrivial singular vectors of the matrix $H$ and the observation basis $\mathcal{B}$ are "incoherent". Incoherence (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012; Chen, 2015; Negahban and Wainwright, 2012), or its alternatives (e.g., Negahban and Wainwright, 2012), implies that there is a sufficiently large angle between the singular vectors and the basis $\mathcal{B}$. The rationale behind this assumption can be explained as follows. Consider a matrix $H$ of size $n\times n$ with a one in its $(1,1)$ entry and zeros elsewhere. If we randomly observe a small fraction of the $n\times n$ entries, it is highly likely that we will miss the $(1,1)$ entry, making it difficult to fully recover the matrix. Therefore, an incoherence parameter $\nu$ is assumed between the given canonical basis $\mathcal{B}$ and the singular vectors of $H$, as illustrated in Figure 1. In the context of zeroth-order optimization, it is often necessary to recover the Hessian at any given point. However, assuming that the Hessian is incoherent with the given basis at all points in the domain is overly restrictive.

(M2) Matrix recovery via linear measurements (matrix regression type recovery): In the context of matrix recovery using linear measurements (Tan et al., 2011; Eldar et al., 2012; Chandrasekaran et al., 2012; Rong et al., 2021), we observe the inner products of the target matrix $H$ with a set of matrices $A_{1},A_{2},\cdots,A_{M}$. Specifically, we have the observations $\left<H,A_{i}\right>:=\mathrm{tr}(H^{*}A_{i})$, and our goal is to recover $H$. In certain scenarios, there may be additional constraints on the $A_{i}$, and the measurements might be corrupted by noise (Rohde and Tsybakov, 2011; Fan et al., 2021; Xiaojun Mao and Wong, 2019), a setting which receives more attention from the statistics community. Eldar et al. (2012) proved that when the entries of $A_{i}$ are independently and identically distributed ($iid$) Gaussian, having $M\geq 4nr-4r^{2}$ linear measurements ensures exact recovery of $H$. Rong et al. (2021) showed that when the density of $(A_{1},A_{2},\cdots,A_{M})$ is absolutely continuous, having $M>nr-r^{2}$ measurements guarantees exact recovery of $H$.

Despite the elegant results in matrix recovery using linear measurements, they are not applicable to Hessian estimation tasks. This limitation arises from the fact that a general linear measurement cannot be approximated by a zeroth-order estimation. To further illustrate this fact, let us consider the Taylor approximation, which, by the fundamental theorem of calculus, is the foundation for zeroth-order estimation. In the Taylor approximation of $f$ at $\mathbf{x}$, the Hessian matrix $\nabla^{2}f(\mathbf{x})$ will always appear as a bilinear form. Therefore, a linear measurement $\left<A,\nabla^{2}f(\mathbf{x})\right>$ for a general $A$ cannot be included in a Taylor approximation of $f$ at $\mathbf{x}$. In the language of optimization and numerical analysis, for a general measurement matrix $A$, one linear measurement $\left<A,H\right>$ may require far more than $\mathcal{O}(1)$ finite-difference computations. Consequently, the theory providing guarantees for linear measurements does not extend to zeroth-order Hessian estimation.

1.3 Our Contribution

In this paper, we introduce a low-rank Hessian estimation mechanism that simultaneously satisfies (R1) and (R2). More specifically,

  • We prove that, with a proper finite-difference scheme, $\mathcal{O}\left(nr^{2}\log^{2}n\right)$ finite-difference computations are sufficient for guaranteeing an exact recovery of the Hessian matrix with high probability. Our approach simultaneously overcomes the limitations of (M1) and (M2).

In the realm of zeroth-order Hessian estimation, no prior art provides high-probability estimation guarantees for low-rank Hessian estimation tasks; see Section 1.4 for more discussion.

1.4 Prior Arts on Hessian Estimation

Zeroth-order Hessian estimation dates back to the birth of calculus. In recent years, researchers from various fields have contributed to this topic (e.g., Broyden et al., 1973; Fletcher, 2000; Spall, 2000; Balasubramanian and Ghadimi, 2021; Li et al., 2023).

In quasi-Newton-type methods (e.g., Goldfarb, 1970; Shanno, 1970; Broyden et al., 1973; Ren-Pu and Powell, 1983; Davidon, 1991; Fletcher, 2000; Spall, 2000; Xu and Zhang, 2001; Rodomanov and Nesterov, 2022), gradient-based Hessian estimators were used for iterative optimization algorithms. Based on Stein's identity (Stein, 1981), Balasubramanian and Ghadimi (2021) introduced a Stein-type Hessian estimator, and combined it with the cubic regularized Newton's method (Nesterov and Polyak, 2006) for non-convex optimization. Li et al. (2023) generalized the Stein-type Hessian estimators to Riemannian manifolds. In parallel to (Balasubramanian and Ghadimi, 2021; Li et al., 2023), Wang (2023) and Feng and Wang (2023) investigated the Hessian estimator that inspires the current work.

Yet prior to our work, no method from the zeroth-order Hessian estimation community focused on low-rank Hessian estimation.

Figure 1: Incoherence condition for $\nabla^{2}f(\mathbf{x})$ at multiple points. When the Hessian of $f$ is low-rank or approximately low-rank, a matrix completion guarantee for $\nabla^{2}f(\mathbf{x})$ at all $\mathbf{x}$ requires an incoherence condition to hold uniformly over $\mathbf{x}$. As illustrated in the right subfigure, such a requirement is overly restrictive.

2 Notations and Conventions

Before proceeding to main results, we lay out some conventions and notations that will be used throughout the paper. We use the following notations for matrix norms:

  • $\|\cdot\|$ is the operator norm (Schatten $\infty$-norm);

  • $\|\cdot\|_{2}$ is the Euclidean norm (Schatten $2$-norm);

  • $\|\cdot\|_{1}$ is the trace norm (Schatten $1$-norm).

Also, the notation $\|\cdot\|$ is overloaded for vector norms and tensor norms. For a vector $\mathbf{v}\in\mathbb{R}^{n}$, $\|\mathbf{v}\|$ is its Euclidean norm; for a tensor $V\in\left(\mathbb{R}^{n}\right)^{\otimes p}$ ($p\geq 2$), $\|V\|$ is its Schatten $\infty$-norm. For any matrix $A$ with singular value decomposition $A=U\Sigma V^{\top}$, we define $\mathrm{sign}(A)=U\,\mathrm{sign}(\Sigma)V^{\top}$, where $\mathrm{sign}(\Sigma)$ applies the $\mathrm{sign}$ function to each entry of $\Sigma$.

For a vector $\mathbf{u}=\left(u_{1},u_{2},\cdots,u_{n}\right)^{\top}\in\mathbb{R}^{n}$ and a positive number $r\leq n$, we define the notations

$$\mathbf{u}_{:r}=\left(u_{1},u_{2},\cdots,u_{r},0,0,\cdots,0\right)^{\top}\quad\text{and}\quad\mathbf{u}_{r:}=\left(0,0,\cdots,0,u_{r},u_{r+1},\cdots,u_{n}\right)^{\top}.$$

Also, we use $C$ and $c$ to denote unimportant absolute constants that do not depend on $n$ or $r$. The numbers $C$ and $c$ may or may not take the same value at each occurrence.
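As a quick illustration of the $\mathrm{sign}(\cdot)$ convention above, the following minimal NumPy sketch (the rank-$3$ test matrix and all names are illustrative assumptions, not part of the paper) computes $\mathrm{sign}(A)$ from an SVD and checks the identity $\|A\|_{1}=\left<\mathrm{sign}(A),A\right>$ used later in the proof of Lemma 1.

```python
import numpy as np

def matrix_sign(A):
    # sign(A) = U sign(Sigma) V^T, from the SVD A = U Sigma V^T
    U, s, Vt = np.linalg.svd(A)
    return U @ np.diag(np.sign(s)) @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 5))  # a rank-3 matrix
S = matrix_sign(A)
# The trace norm ||A||_1 equals <sign(A), A> = tr(sign(A)^T A)
print(np.trace(S.T @ A), np.linalg.svd(A, compute_uv=False).sum())
```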

3 Main Results

We start with a finite-difference scheme that can be viewed as a matrix measurement operation. The Hessian of a function $f:\mathbb{R}^{n}\to\mathbb{R}$ at a given point $\mathbf{x}$ can be estimated as follows (Wang, 2023; Feng and Wang, 2023):

$$\widehat{\nabla}^{2}f(\mathbf{x}):=n^{2}\,\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top},\tag{2}$$

where $\delta$ is the finite-difference granularity, and $\mathbf{u},\mathbf{v}$ are finite-difference directions. Different choices of the laws of $\mathbf{u}$ and $\mathbf{v}$ lead to different Hessian estimators. For example, $\mathbf{u},\mathbf{v}$ can be independent vectors uniformly distributed over the canonical basis $\{\mathbf{e}_{1},\mathbf{e}_{2},\cdots,\mathbf{e}_{n}\}$.
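The estimator (2) translates directly into four function evaluations. The sketch below is a minimal illustration (the quadratic test function and all names are assumptions for demonstration); it computes one such measurement with $\mathbf{u},\mathbf{v}$ drawn uniformly from the unit sphere.

```python
import numpy as np

def hessian_measurement(f, x, delta, rng):
    """One zeroth-order measurement of the Hessian of f at x, as in (2)."""
    n = x.size
    u = rng.standard_normal(n); u /= np.linalg.norm(u)   # u ~ Unif(S^{n-1})
    v = rng.standard_normal(n); v /= np.linalg.norm(v)   # v ~ Unif(S^{n-1})
    quotient = (f(x + delta * v + delta * u) - f(x - delta * v + delta * u)
                - f(x + delta * v - delta * u) + f(x - delta * v - delta * u)) / (4 * delta ** 2)
    return n ** 2 * quotient * np.outer(u, v)

# Example: a quadratic with a known rank-2 Hessian (illustrative assumption)
rng = np.random.default_rng(0)
n = 10
W = rng.standard_normal((n, 2))
H = W @ W.T
f = lambda x: 0.5 * x @ H @ x
P_H = hessian_measurement(f, np.zeros(n), 1e-4, rng)   # approximates n^2 (u^T H v) u v^T
```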

We start our discussion by showing that the Hessian estimator (2) can indeed be viewed as a matrix measurement.

Proposition 1.

Consider an estimator defined in (2). Let the underlying function $f$ be twice continuously differentiable. Let $\mathbf{u},\mathbf{v}$ be two random vectors such that $\|\mathbf{u}\|,\|\mathbf{v}\|<\infty$ a.s. Then for any fixed $\mathbf{x}\in\mathbb{R}^{n}$,

$$\widehat{\nabla}^{2}f(\mathbf{x})\to_{d}n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$$

as $\delta\to 0_{+}$, where $\to_{d}$ denotes convergence in distribution.

Proof.

By Taylor’s Theorem (with integrable remainder) and that the Hessian matrix is symmetric, we have

$$\begin{aligned}\widehat{\nabla}^{2}f(\mathbf{x})=&\;\frac{n^{2}}{4}\left(\left(\mathbf{v}+\mathbf{u}\right)^{\top}\nabla^{2}f(\mathbf{x})\left(\mathbf{v}+\mathbf{u}\right)-\left(\mathbf{v}-\mathbf{u}\right)^{\top}\nabla^{2}f(\mathbf{x})\left(\mathbf{v}-\mathbf{u}\right)\right)\mathbf{u}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right)\\=&\;n^{2}\,\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{v}\,\mathbf{u}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right)\\=&\;n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right).\end{aligned}$$

As $\delta\to 0_{+}$, the estimator (2) converges to $n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$ in distribution. ∎

With Proposition 1 in place, we see that matrix measurements of the form

$$\mathcal{P}:H\mapsto n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}$$

for some $\mathbf{u},\mathbf{v}$ can be efficiently computed via finite-difference computations. For the convex program (1) with sampling operators taking the above form, we have the following guarantee.

Theorem 1.

Consider the problem (1). Let the sampler $\mathcal{S}=\frac{1}{M}\sum_{i=1}^{M}\mathcal{P}_{i}$ be constructed with $\mathcal{P}_{i}:A\mapsto n^{2}\,\mathbf{u}_{i}\mathbf{u}_{i}^{\top}A\,\mathbf{v}_{i}\mathbf{v}_{i}^{\top}$ and $\mathbf{u}_{i},\mathbf{v}_{i}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$. Then there exists an absolute constant $C$ such that if the number of samples satisfies $M\geq C\cdot nr^{2}\log^{2}(n)$, where $r:=\mathrm{rank}(H)$, then with probability larger than $1-\frac{1}{n}$, the solution to (1), denoted by $\widehat{H}$, satisfies $\widehat{H}=H$.

As a direct consequence of Theorem 1, we have the following result.

Corollary 1.

Let the finite-difference granularity $\delta>0$ be small. Let $\mathbf{x}\in\mathbb{R}^{n}$ and let $f$ be twice continuously differentiable. Suppose there exists $H$ with $\mathrm{rank}(H)=r$ such that $\|H-\nabla^{2}f(\mathbf{x})\|\leq\epsilon$ for some $\epsilon\geq 0$, and the estimator (2) with $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$ satisfies

$$\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top}=_{d}\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top},$$

where $=_{d}$ denotes distributional equivalence. Then there exists an absolute constant $C$ such that if more than $C\cdot nr^{2}\log^{2}n$ zeroth-order finite-difference computations are obtained, then with probability exceeding $1-\frac{1}{n}$, the solution $\widehat{H}$ to (1) satisfies $\|\widehat{H}-\nabla^{2}f(\mathbf{x})\|\leq\epsilon$.

By Proposition 1, we know that, as $\delta\to 0_{+}$,

$$\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top}$$

converges to $\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$ in distribution. Therefore, Corollary 1 implies that the estimator (2), together with the convex program (1), provides a sample-efficient low-rank Hessian estimator. Corollary 1 also implies a guarantee for approximately low-rank Hessians.
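To make the pipeline concrete, here is a minimal end-to-end sketch under stated assumptions: it uses CVXPY for the trace-norm program, reads the bilinear measurements $\mathbf{u}_{i}^{\top}H\mathbf{v}_{i}$ directly from a synthetic rank-$r$ matrix $H$ (in practice they would come from the finite-difference quotient in (2)), and enforces each measurement individually, which in particular implies the averaged constraint $\mathcal{S}\widehat{H}=\mathcal{S}H$ in (1). All names and problem sizes are illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
W = rng.standard_normal((n, r))
H = W @ W.T                                          # synthetic rank-r "Hessian"

M = 6 * n * r                                        # number of measurements
U = rng.standard_normal((M, n)); U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.standard_normal((M, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
b = np.array([U[i] @ H @ V[i] for i in range(M)])    # u_i^T H v_i

X = cp.Variable((n, n))
constraints = [cp.sum(cp.multiply(np.outer(U[i], V[i]), X)) == b[i] for i in range(M)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
print("relative error:", np.linalg.norm(X.value - H) / np.linalg.norm(H))
```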

The rest of this section is devoted to proving Theorem 1 and thus also Corollary 1.

3.1 Preparations

To describe the recovery argument for a symmetric low-rank matrix $H\in\mathbb{R}^{n\times n}$ with $\mathrm{rank}(H)=r$, we consider the eigenvalue decomposition $H=U\Lambda U^{\top}$ ($U\in\mathbb{R}^{n\times r}$ and $\Lambda\in\mathbb{R}^{r\times r}$), and a subspace of $\mathbb{R}^{n\times n}$ defined by

$$T:=\{A\in\mathbb{R}^{n\times n}:\left(I-P_{U}\right)A\left(I-P_{U}\right)=0\},$$

where $P_{U}$ is the projection onto the column space of $U$. We also define a projection operation onto $T$:

$$\mathcal{P}_{T}:A\mapsto P_{U}A+AP_{U}-P_{U}AP_{U}.$$
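As a small sanity check (a sketch under the assumption that NumPy is available; all names are illustrative), one can verify numerically that $\mathcal{P}_{T}$ is idempotent and that $(I-P_{U})(\mathcal{P}_{T}A)(I-P_{U})=0$, i.e. that $\mathcal{P}_{T}$ indeed maps into $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))     # orthonormal basis of col(U)
P_U = U @ U.T

def P_T(A):
    return P_U @ A + A @ P_U - P_U @ A @ P_U

A = rng.standard_normal((n, n))
B = P_T(A)
assert np.allclose(P_T(B), B)                                      # idempotence
assert np.allclose((np.eye(n) - P_U) @ B @ (np.eye(n) - P_U), 0)   # P_T A lies in T
```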

Let $\widehat{H}$ be the solution of (1) and let $\Delta:=\widehat{H}-H$. We start with the following lemma, which can be extracted from the matrix completion literature (e.g., Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).

Lemma 1.

Let $\widehat{H}$ be the solution of the program (1) and let $\Delta:=\widehat{H}-H$. Then it holds that

$$\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\leq 0,\tag{3}$$

where $\Delta_{T}^{\perp}:=\mathcal{P}_{T}^{\perp}\Delta$.

Proof.

Since $H\in T$, we have

$$\begin{aligned}\|H+\Delta\|_{1}\geq&\;\|P_{U}(H+\Delta)P_{U}\|_{1}+\|P_{U}^{\perp}(H+\Delta)P_{U}^{\perp}\|_{1}\qquad(4)\\=&\;\|H+P_{U}\Delta P_{U}\|_{1}+\|\Delta_{T}^{\perp}\|_{1},\qquad(5)\end{aligned}$$

where the first inequality uses the "pinching" inequality (Exercise II.5.4 & II.5.5 in (Bhatia, 1997)).

Since $\|\mathrm{sign}(H)\|=1$, we continue the above computation, and get

$$\begin{aligned}(5)=&\;\|\mathrm{sign}(H)\|\,\|H+P_{U}\Delta P_{U}\|_{1}+\|\Delta_{T}^{\perp}\|_{1}\\ \geq&\;\left<\mathrm{sign}(H),H+P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\\=&\;\|H\|_{1}+\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}.\qquad(6)\end{aligned}$$

On the second line, we use Hölder's inequality. On the third line, we use that $\|A\|_{1}=\left<\mathrm{sign}(A),A\right>$ for any real matrix $A$.

Since $\widehat{H}$ solves (1), we know $\|H\|_{1}\geq\|\widehat{H}\|_{1}=\|H+\Delta\|_{1}$. Thus rearranging terms in (6) finishes the proof. ∎

3.2 The High Level Roadmap

With the estimator (2) and Lemma 1 in place, we are ready to present the high-level roadmap of our argument. On a high level, the rest of the paper aims to prove the following two claims:

  • (A1): With high probability, $\|\Delta_{T}\|_{2}\leq 2n\|\Delta_{T}^{\perp}\|_{2}$, where $\Delta_{T}:=\mathcal{P}_{T}\Delta$.

  • (A2): With high probability, $\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>\geq-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}-\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}$, where $\Delta_{T}^{\perp}:=\Delta-\Delta_{T}$.

Once (A1) and (A2) are in place, we can quickly prove Theorem 1.

Sketch of proof of Theorem 1 with (A1) and (A2) assumed.

Now, by Lemma 1 and (A1), we have, with high probability,

$$\begin{aligned}0\overset{\text{by Lemma 1}}{\geq}&\;\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\\ \overset{\text{by (A2)}}{\geq}&\;\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}\\ \overset{\text{by (A1)}}{\geq}&\;\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{2}{n^{18}}\|\Delta_{T}^{\perp}\|_{1},\end{aligned}$$

which implies $\|\Delta_{T}^{\perp}\|_{1}=0$ w.h.p. Finally, another use of (A1) implies $\|\Delta\|_{1}=0$ w.h.p., which concludes the proof. ∎

Therefore, the core argument reduces to proving (A1) and (A2). In the next subsection, we prove (A1) and (A2) for the random measurements obtained by the Hessian estimator (2), without any incoherence-type assumptions.

3.3 The Concentration Arguments

For the concentration argument, we need to make several observations. One of the key observations is that the spherical measurements are rotation-invariant and reflection-invariant. More specifically, for the random measurement $\mathcal{P}H=n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}$ with $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$, we have

$$n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}Q^{\top}HQ\mathbf{v}\mathbf{v}^{\top}Q^{\top}$$

for any orthogonal matrix $Q$, where $=_{d}$ denotes distributional equivalence. With a properly chosen $Q$, we have

$$n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top},$$

where $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $H$. This observation makes calculating the moments of $\mathcal{P}H$ possible. With the moments of the random matrices properly controlled, we can use the matrix-valued Cramér–Chernoff method to arrive at matrix concentration inequalities.

Another useful tool is the Kronecker product and the vectorization of matrices. Let $\mathrm{\texttt{vec}}\left(\cdot\right)$ be the vectorization operation of a matrix. Then, as per how $\mathcal{P}_{T}$ is defined, we have, for any $A\in\mathbb{R}^{n\times n}$,

$$\mathrm{\texttt{vec}}\left(\mathcal{P}_{T}A\right)=\mathrm{\texttt{vec}}\left(P_{U}A+AP_{U}-P_{U}AP_{U}\right)=\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right)\mathrm{\texttt{vec}}\left(A\right).\tag{7}$$

The above formula implies that $\mathcal{P}_{T}$ can be represented as a matrix of size $n^{2}\times n^{2}$. Similarly, the measurement operator $\mathcal{P}:A\mapsto n^{2}\,\mathbf{u}\mathbf{u}^{\top}A\,\mathbf{v}\mathbf{v}^{\top}$ can also be represented as a matrix of size $n^{2}\times n^{2}$. Compared to the matrix completion problem, the importance of the vectorized representation and the Kronecker product is more pronounced in our case. The reason is again the absence of an incoherence-type assumption. More specifically, the vectorized representation is useful in controlling the cumulant generating functions of the random matrices associated with the spherical measurements.
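The identity (7) can be checked numerically. The sketch below is a minimal illustration under the assumption of column-major vectorization (the convention for which $\mathrm{\texttt{vec}}(AXB)=(B^{\top}\otimes A)\,\mathrm{\texttt{vec}}(X)$); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
P_U = U @ U.T
A = rng.standard_normal((n, n))

vec = lambda M: M.flatten(order="F")                 # column-major vectorization
K = np.kron(P_U, np.eye(n)) + np.kron(np.eye(n), P_U) - np.kron(P_U, P_U)
lhs = vec(P_U @ A + A @ P_U - P_U @ A @ P_U)
assert np.allclose(K @ vec(A), lhs)                  # identity (7)
```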

Finally, some additional care is needed to properly control the high moments of $\mathcal{P}H$. Such additional care is showcased in the inequality stated below in Lemma 2. An easy upper bound for the LHS of (8) is $\mathcal{O}(r^{p})$. However, an $\mathcal{O}(r^{p})$ bound for the LHS of (8) would eventually result in the loss of a factor of $r$ in the final bound. Overall, tight control is needed in several different places in order to obtain the final recovery bound in Theorem 1.

Lemma 2.

Let $r$ and $p\geq 2$ be positive integers. Then it holds that

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(100r)^{p-1}.\tag{8}$$
Proof.

Case I: $r\leq\frac{1}{2}50^{p-1}$. Note that

$$\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq\frac{1}{(\frac{\alpha_{i}}{2})^{(\frac{\alpha_{i}}{2})}}\quad\text{ and thus }\quad\log\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq-\frac{\alpha_{i}}{2}\log\left(\frac{\alpha_{i}}{2}\right).\tag{9}$$

Since the function $x\mapsto-x\log x$ is concave, Jensen's inequality gives

$$\frac{-\sum_{i=1}^{r}\frac{\alpha_{i}}{2}\log(\frac{\alpha_{i}}{2})}{r}\leq-\frac{\frac{\sum_{i=1}^{r}\alpha_{i}}{r}}{2}\log\left(\frac{\frac{\sum_{i=1}^{r}\alpha_{i}}{r}}{2}\right)=-\frac{p}{r}\log\frac{p}{r}.\tag{10}$$

Combining (9) and (10) gives

$$\log\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq-\sum_{i=1}^{r}\frac{\alpha_{i}}{2}\log\left(\frac{\alpha_{i}}{2}\right)\leq-p\log\frac{p}{r},$$

which implies

$$\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(2p)^{p}\left(\frac{r}{p}\right)^{p}\leq(2r)^{p}\leq(100r)^{p-1},$$

where the last inequality uses $r\leq\frac{1}{2}50^{p-1}$.

Case II: $r>\frac{1}{2}50^{p-1}$. For this case, we first show that the maximum of $\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}$ is attained when $|\alpha_{i}-\alpha_{j}|\leq 2$ for all $i,j$. To show this, suppose there exist $\alpha_{k}$ and $\alpha_{j}$ such that $|\alpha_{k}-\alpha_{j}|>2$. Without loss of generality, let $\alpha_{k}>\alpha_{j}+2$. Then

$$\frac{(\frac{\alpha_{k}}{2})!}{\alpha_{k}!}\cdot\frac{(\frac{\alpha_{j}}{2})!}{\alpha_{j}!}\leq\frac{(\frac{\alpha_{k}-2}{2})!}{(\alpha_{k}-2)!}\cdot\frac{(\frac{\alpha_{j}+2}{2})!}{(\alpha_{j}+2)!}.$$

Therefore, we can increase the value of $\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}$ until $|\alpha_{i}-\alpha_{j}|\leq 2$ for all $i,j$. By the above argument, we have, for $r>\frac{1}{2}50^{p-1}\geq p$,

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq\left(\frac{1}{2}\right)^{p}\cdot\left(\frac{0!}{0!}\right)^{r-p}=\frac{1}{2^{p}}.$$

Therefore, we have

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(2p)^{p}\cdot 2^{-p}=p^{p}\leq(50\cdot 50^{p-1})^{p-1}\leq(100r)^{p-1}.$$

∎
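For small parameters, the bound (8) can also be checked directly by brute-force enumeration. The sketch below is a minimal sanity check (all names are illustrative; the enumeration is exponential in $r$ and only meant for tiny instances).

```python
from itertools import product
from math import factorial

def lemma2_lhs(r, p):
    """Maximize (2p)!/p! * prod_i (a_i/2)!/a_i! over even a_i >= 0 summing to 2p."""
    best = 0.0
    for halves in product(range(p + 1), repeat=r):   # halves[i] = a_i / 2
        if sum(halves) != p:
            continue
        val = factorial(2 * p) / factorial(p)
        for h in halves:
            val *= factorial(h) / factorial(2 * h)
        best = max(best, val)
    return best

for r in range(1, 5):
    for p in range(2, 5):
        assert lemma2_lhs(r, p) <= (100 * r) ** (p - 1)
```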

With all the above preparation in place, we next present Lemma 3, which is the key step leading to (A1).

Lemma 3.

Let

$$\mathcal{E}_{1}:=\left\{\left\|\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|\leq\frac{1}{4}\right\},$$

where $\mathcal{P}_{T}$ and $\mathcal{S}$ are regarded as matrices of size $n^{2}\times n^{2}$. Pick any $\delta\in(0,1)$. Then there exists some constant $C$ such that when $M\geq Cnr\log(1/\delta)$, it holds that $\mathbb{P}\left(\mathcal{E}_{1}\right)\geq 1-\delta$.

The operators $\mathcal{P}_{T}$ and $\mathcal{S}$ can be represented as matrices of size $n^{2}\times n^{2}$. Therefore, we can apply a matrix-valued Cramér–Chernoff-type argument (or matrix Laplace argument (Lieb, 1973)) to derive a concentration bound. In (Tropp, 2012; Tropp et al., 2015), a master matrix concentration inequality is presented. This result is stated below as Theorem 2.

Theorem 2 (Tropp et al., (2015)).

Consider a finite sequence $\{X_{k}\}$ of independent, random, Hermitian matrices of the same size. Then for all $t\in\mathbb{R}$,

$$\mathbb{P}\left(\lambda_{\max}\left(\sum_{k}X_{k}\right)\geq t\right)\leq\inf_{\theta>0}e^{-\theta t}\,\mathrm{tr}\exp\left(\sum_{k}\log\mathbb{E}e^{\theta X_{k}}\right),$$

and

$$\mathbb{P}\left(\lambda_{\min}\left(\sum_{k}X_{k}\right)\leq t\right)\leq\inf_{\theta<0}e^{-\theta t}\,\mathrm{tr}\exp\left(\sum_{k}\log\mathbb{E}e^{\theta X_{k}}\right).$$

For our purpose, a more convenient form is the matrix concentration inequality with Bernstein’s conditions on the moments. Such results may be viewed as corollaries to Theorem 2, and a version is stated below in Theorem 3.

Theorem 3 (Zhu, (2012); Zhang et al., (2014)).

Let $\{X_{k}:k=1,\cdots,K\}$ be a finite sequence of independent, random, self-adjoint matrices of dimension $n$, all of which satisfy the Bernstein moment condition, i.e.,

$$\mathbb{E}\left[X_{k}^{p}\right]\preceq\frac{p!}{2}B^{p-2}\Sigma_{2},\quad\text{ for }p\geq 2,$$

where $B$ is a positive constant and $\Sigma_{2}$ is a positive semi-definite matrix. Then

$$\mathbb{P}\left(\lambda_{1}\left(\sum_{k}X_{k}\right)\geq\lambda_{1}\left(\sum_{k}\mathbb{E}X_{k}\right)+\sqrt{2K\theta\lambda_{1}\left(\Sigma_{2}\right)}+\theta B\right)\leq n\exp\left(-\theta\right),$$

for each $\theta>0$.

Another useful ingredient is the moments of spherical random variables, stated below in Proposition 2. The proof of Proposition 2 is in the Appendix.

Proposition 2.

Let $\mathbf{v}$ be uniformly sampled from $\mathbb{S}^{n-1}$ ($n\geq 2$). It holds that

$$\mathbb{E}\left[v_{i}^{p}\right]=\frac{(p-1)(p-3)\cdots 1}{n(n+2)\cdots(n+p-2)}$$

for all $i=1,2,\cdots,n$ and any positive even integer $p$.
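Proposition 2 is easy to spot-check by Monte Carlo. The following minimal sketch (parameter values are illustrative assumptions) compares the empirical fourth moment of one coordinate of a uniformly random unit vector with the closed form $\frac{3\cdot 1}{n(n+2)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 1_000_000
g = rng.standard_normal((N, n))
v = g / np.linalg.norm(g, axis=1, keepdims=True)     # rows uniform on S^{n-1}
empirical = np.mean(v[:, 0] ** 4)
exact = 3.0 / (n * (n + 2))                          # Proposition 2 with p = 4
print(empirical, exact)                              # agree to ~3 decimal places
```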

With the above results in place, we can now prove Lemma 3.

Proof of Lemma 3.

Fix $\delta\in(0,1)$, and let $M>Cnr\log(1/\delta)$ for some absolute constant $C$. Following reasoning similar to that for (7), we can represent $\mathcal{P}$ as

$$\mathcal{P}=n^{2}\,\mathbf{u}\mathbf{u}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top},\tag{11}$$

where $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$.

Thus, by viewing $\mathcal{P}$ and $\mathcal{P}_{T}$ as matrices of size $n^{2}\times n^{2}$, we have

$$\mathcal{P}_{T}\mathcal{P}\mathcal{P}_{T}=n^{2}\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right)\left(\mathbf{u}\mathbf{u}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right)\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right).$$

Let $Q$ be an orthogonal matrix such that

$$QP_{U}Q^{\top}=I_{n}^{:r}:=\begin{bmatrix}I&0_{r\times(n-r)}\\ 0_{(n-r)\times r}&0_{(n-r)\times(n-r)}\end{bmatrix}.$$

Since the distributions of $\mathbf{u}$ and $\mathbf{v}$ are rotation-invariant and reflection-invariant, we know

$$\begin{aligned}&\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\mathcal{P}\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\\=&\;\left(Q\otimes Q\right)\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right)\mathcal{P}\left(Q\otimes Q\right)\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right)\\=_{d}&\;\left(Q\otimes Q\right)\mathcal{P}_{T}\mathcal{P}\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right),\qquad(12)\end{aligned}$$

where $=_{d}$ denotes distributional equivalence.

Therefore, it suffices to study the distribution of

$$\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\mathcal{P}_{i}\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right).$$

For simplicity, introduce notation

$$\mathcal{R}_{T}:=I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}=I_{n}^{:r}\otimes I_{n}+I_{n}^{r+1:}\otimes I_{n}^{:r},$$

and we have

$$\begin{aligned}\mathcal{R}_{T}\mathcal{P}\mathcal{R}_{T}=&\;n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\\&+n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

For simplicity, introduce

$$\begin{aligned}X:=&\;n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top},\\ Y:=&\;n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top},\\ Z:=&\;n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

Next we will show that the average of $iid$ copies of $X$, $Y$, $Z$ concentrates around $\mathbb{E}X$, $\mathbb{E}Y$, $\mathbb{E}Z$, respectively. To do this, we bound the moments of $X$, $Y$ and $Z$, and apply Theorem 3.

Bounding $X$ and $Y$. The second moment of $X$ is

$$\mathbb{E}\left[X^{2}\right]=n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right]\preceq 3nr\,I_{n^{2}},$$

where the last inequality follows from Proposition 2. Thus the centralized second moment of $X$ is bounded by

$$\mathbb{E}\left[\left(X-\mathbb{E}X\right)^{2}\right]\preceq 3nr\,I_{n^{2}}.$$

For $p>2$, we have

$$\mathbb{E}\left[X^{p}\right]=n^{p}\,\mathbb{E}\left[\left(\sum_{i=1}^{r}u_{i}^{2}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right]\preceq\frac{p!}{2}(6n(r+2))^{p-1}I_{n^{2}},$$

which, by operator Jensen, implies

$$\mathbb{E}\left[\left(X-\mathbb{E}X\right)^{p}\right]\preceq\mathbb{E}\left[2^{p}X^{p}+2^{p}\left(\mathbb{E}X\right)^{p}\right]\preceq\frac{p!}{2}(24n(r+2))^{p-1}I_{n^{2}}.$$

When using the operator Jensen's inequality, we use $I_{n^{2}}=\frac{1}{2}I_{n^{2}}+\frac{1}{2}I_{n^{2}}$ as the decomposition of identity.

Let $X_{1},X_{2},\cdots,X_{M}$ be $iid$ copies of $X$. Since $M\geq Cnr\log(1/\delta)$, Theorem 3 implies that

$$\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)X_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[X\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3}.\tag{13}$$

The bound for $Y$ follows similarly. Let $Y_{1},Y_{2},\cdots,Y_{M}$ be $iid$ copies of $Y$, and we have

$$\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Y_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[Y\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3}.\tag{14}$$

Bounding $Z$. The second moment of $Z$ is

$$\begin{aligned}\mathbb{E}\left[Z^{2}\right]=&\;n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}\right)^{2}\right]\\=&\;n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)+\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)\right]\\ \preceq&\;\frac{n^{2}r}{(n+2)}I_{n^{2}}+3nr\,I_{n^{2}}\preceq 4nr\,I_{n^{2}},\end{aligned}$$

where the last line uses Proposition 2.

The $2p$-th power of $Z$ is

$$\begin{aligned}Z^{2p}=&\;n^{4p}\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)^{p}+n^{4p}\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\otimes\left(\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)^{p}\\ \preceq&\;n^{4p}\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\left(\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\right)^{p-1}\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\\&+n^{4p}\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)^{p-1}\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\left(\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\right)^{p}\mathbf{v}\mathbf{v}^{\top}\end{aligned}$$

and the $(2p+1)$-th power of $Z$ is

$$\begin{aligned}Z^{2p+1}=&\;n^{4p+2}\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)^{p}\mathbf{v}_{:r}\mathbf{v}^{\top}\\&+n^{4p+2}\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\left(\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)^{p}\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

Thus by Proposition 2, we have

$$\begin{aligned}\mathbb{E}\left[Z^{2p}\right]\preceq&\;n^{4p}\,\mathbb{E}\left[r^{p-1}\left(\sum_{i=1}^{r}u_{i}^{2p}\right)\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes r^{p-2}\left(\sum_{i=1}^{r}v_{i}^{2p-2}\right)\mathbf{v}\mathbf{v}^{\top}\right]\\&+n^{4p}\,\mathbb{E}\left[r^{p-2}\left(\sum_{i=1}^{r}u_{i}^{2p-2}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes r^{p-1}\left(\sum_{i=1}^{r}v_{i}^{2p}\right)\mathbf{v}\mathbf{v}^{\top}\right]\\ \preceq&\;2n^{4p}r^{2p-1}\cdot\frac{(2p+1)(2p-1)\cdots 1}{n(n+2)\cdots(n+2p)}\cdot\frac{(2p-1)(2p-3)\cdots 1}{n(n+2)\cdots(n+2p-2)}I_{n^{2}}\\ \preceq&\;\frac{(2p)!}{2}(8nr)^{2p-1}I_{n^{2}}.\end{aligned}$$

For $Z^{2p+1}$ ($p\in\mathbb{N}$), we notice that

$$\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right]=\mathbb{E}\left[\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right]=0,$$

since these terms only involve odd powers of the entries of $\mathbf{u}$. Therefore,

$$\mathbb{E}\left[Z^{2p+1}\right]=0,\quad\text{ for }p=0,1,2,\cdots\tag{15}$$

Let $Z_{1},Z_{2},\cdots,Z_{M}$ be $M$ $iid$ copies of $Z$, and let $M\geq Cnr\log(1/\delta)$ for some absolute constant $C$. By (15), we know $\mathbb{E}\left[Z\right]=0$, and all the above moments of $Z$ are centralized moments of $Z$. Now we apply Theorem 3 to conclude that

$$\begin{aligned}&\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}Z_{i}-\mathbb{E}Z\right\|\geq\frac{1}{6}\right)\\=&\;\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Z_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[Z\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\\=&\;\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Z_{i}(Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3},\qquad(16)\end{aligned}$$

where $Q$ is the orthogonal matrix introduced in (12). We take a union bound over (13), (14) and (16) to conclude the proof. ∎

Now, with Lemma 3 in place, we next state Lemma 4. This lemma proves (A1).

Lemma 4.

Suppose $\mathcal{E}_{1}$ is true. Let $\widehat{H}$ be the solution of the constrained optimization problem (1), and let $\Delta:=\widehat{H}-H$. Then $\|\mathcal{P}_{T}\Delta\|_{2}\leq 2n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}$.

Proof.

Represent $\mathcal{S}$ as a matrix of size $n^{2}\times n^{2}$. Let $\sqrt{\mathcal{S}}$ be defined as a canonical matrix function. That is, $\sqrt{\mathcal{S}}$ and $\mathcal{S}$ share the same eigenvectors, and the eigenvalues of $\sqrt{\mathcal{S}}$ are the square roots of the eigenvalues of $\mathcal{S}$. Clearly,

$$\|\sqrt{\mathcal{S}}\Delta\|_{2}=\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta+\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}\geq\|\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}-\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta\|_{2}.\tag{17}$$

Clearly we have

$$\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta\|_{2}\leq n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}.$$

Also, it holds that

$$\|\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}^{2}=\left<\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta,\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\right>=\left<\mathcal{P}_{T}\Delta,\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}\Delta\right>=\|\mathcal{P}_{T}\Delta\|_{2}^{2}-\left<\mathcal{P}_{T}\Delta-\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}\Delta,\mathcal{P}_{T}\Delta\right>\geq\frac{1}{2}\|\mathcal{P}_{T}\Delta\|_{2}^{2},\tag{18}$$

where the last inequality uses Lemma 3.

Since $\widehat{H}$ solves (1), we know $\mathcal{S}\Delta=0$, and thus $\sqrt{\mathcal{S}}\Delta=0$. Suppose, in order to get a contradiction, that $\|\mathcal{P}_{T}\Delta\|_{2}>2n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}$. Then (17) and (18) yield

$$\|\sqrt{\mathcal{S}}\Delta\|_{2}\geq\frac{1}{2}\|\mathcal{P}_{T}\Delta\|_{2}-n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}>0,$$

which leads to a contradiction. ∎

Next we turn to prove (A2), whose core argument relies on Lemma 5.

Lemma 5.

Let $G\in T$ be fixed. Pick any $\delta\in(0,1)$. Then there exists a constant $C$ such that when $M\geq Cnr^{2}\log(1/\delta)$, it holds that

$$\mathbb{P}\left(\left\|\mathcal{P}_{T}^{\perp}\mathcal{S}G\right\|\geq\frac{1}{4\sqrt{r}}\|G\|\right)\leq\delta.$$
Proof.

There exists an orthogonal matrix $Q$ such that $G=Q\Lambda Q^{\top}$, where

$$\Lambda=\mathrm{Diag}(\lambda_{1},\lambda_{2},\cdots,\lambda_{2r},0,0,\cdots,0)$$

is a diagonal matrix consisting of the eigenvalues of $G$. Let $\mathcal{P}$ be the operator defined as in (11); we will study the behavior of $\mathcal{P}G$ and then apply Theorem 3. Since the distribution of $\mathbf{u},\mathbf{v}\sim\mathrm{Unif}(\mathbb{S}^{n-1})$ is rotation-invariant and reflection-invariant, we have

$$\mathcal{P}G=n^{2}\,\mathbf{u}\mathbf{u}^{\top}G\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}Q^{\top}GQ\mathbf{v}\mathbf{v}^{\top}Q^{\top}=n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top},$$

where $=_{d}$ denotes distributional equivalence. Thus it suffices to study the behavior of $B:=n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top}$. For the matrix $B$, we consider

$$A:=\begin{bmatrix}0_{n\times n}&B\\ B^{\top}&0_{n\times n}\end{bmatrix}.$$

Next we study the moments of $A$. The second power of $A$ is $A^{2}=\begin{bmatrix}BB^{\top}&0_{n\times n}\\ 0_{n\times n}&B^{\top}B\end{bmatrix}$. By Proposition 2, we have

$$\begin{aligned}\mathbb{E}\left[BB^{\top}\right]=&\;n^{4}\,\mathbb{E}\left[Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}\Lambda\mathbf{u}\mathbf{u}^{\top}Q^{\top}\right]\\=&\;n^{3}\,Q\,\mathbb{E}\left[\mathbf{u}\mathbf{u}^{\top}\Lambda^{2}\mathbf{u}\mathbf{u}^{\top}\right]Q^{\top}\\ \preceq&\;n^{3}\,Q\,\mathbb{E}\left[\|G\|^{2}\mathbf{u}\left(\mathbf{u}_{:2r}^{\top}\mathbf{u}_{:2r}\right)\mathbf{u}^{\top}\right]Q^{\top}\preceq 4nr\|G\|^{2}I_{n},\end{aligned}$$

and similarly, $\mathbb{E}\left[B^{\top}B\right]\preceq 4nr\|G\|^{2}I_{n}$. For even moments of $A$, we first compute $\mathbb{E}\left[\left(BB^{\top}\right)^{p}\right]$ and $\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]$ for $p\geq 2$. For this, we have

$$\begin{aligned}\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]=&\;Q\,\mathbb{E}\left[n^{4p}\left(\sum_{i=1}^{2r}\lambda_{i}v_{i}u_{i}\right)^{2p}\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\=&\;n^{4p}\,Q\,\mathbb{E}\left[\left(\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\=&\;n^{4p}\,Q\,\mathbb{E}\left[\left(\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top},\qquad(19)\end{aligned}$$

where the last equality uses that expectations of odd powers of $v_{i}$ or $u_{i}$ are zero. Note that

$$\begin{aligned}&\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\\=&\;\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{2r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}{p\choose\frac{\alpha_{1}}{2},\frac{\alpha_{2}}{2},\cdots,\frac{\alpha_{2r}}{2}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\\ \leq&\;(200r)^{p-1}\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=p}{p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}^{2}v_{i}^{2}u_{i}^{2}\right)^{\alpha_{i}}\\=&\;(200r)^{p-1}\left(\sum_{i=1}^{2r}\lambda_{i}^{2}u_{i}^{2}v_{i}^{2}\right)^{p},\qquad(20)\end{aligned}$$

where the inequality on the last line uses Lemma 2. Now we combine (19) and (20) to obtain

$$\begin{aligned}\mathbb{E}\left[(B^{\top}B)^{p}\right]\preceq&\;n^{4p}(200r)^{p-1}\,Q\,\mathbb{E}\left[\left(\sum_{i=1}^{2r}\lambda_{i}^{2}u_{i}^{2}v_{i}^{2}\right)^{p}\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\qquad(21)\\ \preceq&\;n^{4p}(200r)^{2p-2}\,Q\,\mathbb{E}\left[\left(\sum_{i=1}^{2r}\lambda_{i}^{2p}u_{i}^{2p}v_{i}^{2p}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\ \preceq&\;\frac{(2p)!}{2}\max_{i}\lambda_{i}^{2p}(Cnr)^{2p-1}I_{n}=\frac{(2p)!}{2}\|G\|^{2p}(Cnr)^{2p-1}I_{n},\qquad(22)\end{aligned}$$

where the inequality on the last line uses Proposition 2. Similarly, we have

$$\mathbb{E}\left[(BB^{\top})^{p}\right]\preceq\frac{(2p)!}{2}\|G\|^{2p}(200nr)^{2p-1}I_{n}.$$

Therefore, we have obtained a bound on the even moments of $A$:

$$\mathbb{E}\left[A^{2p}\right]=\begin{bmatrix}\mathbb{E}\left[\left(BB^{\top}\right)^{p}\right]&0_{n\times n}\\ 0_{n\times n}&\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]\end{bmatrix}\preceq\frac{(2p)!}{2}\|G\|^{2p}(200nr)^{2p-1}I_{2n},$$

for $p=2,3,4,\cdots$, and thus a bound on the centralized even moments of $A$:

$$\mathbb{E}\left[\left(A-\mathbb{E}A\right)^{2p}\right]\preceq\frac{(2p)!}{2}\|G\|^{2p}(400nr)^{2p-1}I_{2n},\quad p=2,3,4,\cdots$$

Next we upper bound the odd moments of $A$. Since

$$\mathbb{E}\left[A^{2p+1}\right]=\begin{bmatrix}0_{n\times n}&\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]\\ \mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]&0_{n\times n}\end{bmatrix},$$

it suffices to study $\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]$ and $\mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]$. Since

$$\left(BB^{\top}\right)^{p}B=n^{4p+2}\left(\sum_{i=1}^{2r}\lambda_{i}v_{i}u_{i}\right)^{2p}Q\mathbf{v}\mathbf{v}^{\top}\Lambda\mathbf{u}\mathbf{u}^{\top}Q^{\top},$$

using the arguments leading to (22), we have

$$\begin{aligned}\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]\preceq&\;\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{n},\\ \mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]\preceq&\;\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{n}.\qquad(23)\end{aligned}$$

Since $\begin{bmatrix}0_{n\times n}&I_{n}\\ I_{n}&0_{n\times n}\end{bmatrix}\preceq 2I_{2n}$, the two inequalities in (23) imply

$$\mathbb{E}\left[A^{2p+1}\right]\preceq\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{2n},$$

and thus

$$\mathbb{E}\left[\left(A-\mathbb{E}A\right)^{2p+1}\right]\preceq\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{2n}.$$

Now we have established moment bounds for $A$, and thus also for $\mathcal{P}_{T}^{\perp}\mathcal{P}G$. From here we apply Theorem 3 to conclude the proof. ∎

The next lemma will essentially establish (A2). This argument relies on the existence of a dual certificate (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).

Lemma 6.

Pick $\delta>0$. Define

$$\mathcal{E}_{2}:=\left\{\exists\;Y\in\mathrm{range}(\mathcal{S}):\|\mathcal{P}_{T}Y-\mathrm{sign}(H)\|_{2}\leq\frac{1}{n^{21}}\quad\text{and}\quad\|\mathcal{P}_{T}^{\perp}Y\|\leq\frac{1}{2}\right\}.$$

Let $L=12\log_{2}n$ and let $m\geq c\cdot nr^{2}\log\left(\frac{L}{\delta}\right)$ for some constant $c$. If $M=mL\geq c\cdot nr^{2}\log n\log\left(\frac{\log n}{\delta}\right)$ for some constant $c$, then $\mathbb{P}\left(\mathcal{E}_{2}\right)\geq 1-\delta$.

Proof.

Following (Gross, 2011), we define random projectors $\widetilde{\mathcal{S}}_{l}$ ($1\leq l\leq L$), such that

$$\widetilde{\mathcal{S}}_{l}:=\frac{1}{m}\sum_{j=1}^{m}\mathcal{P}_{m(l-1)+j}.$$

Then define

$$X_{0}=\mathrm{sign}(H),\quad Y_{i}=\sum_{j=1}^{i}\widetilde{\mathcal{S}}_{j}\mathcal{P}_{T}X_{j-1},\quad X_{i}=\mathrm{sign}(H)-\mathcal{P}_{T}Y_{i},\quad\forall i\geq 1.$$

From the above definition, we have

$$X_{i}=(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{i}\mathcal{P}_{T})(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{i-1}\mathcal{P}_{T})\cdots(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{1}\mathcal{P}_{T})X_{0},\quad\forall i\geq 1.$$

Now we apply Lemma 3 to $\widetilde{\mathcal{S}}_{1},\widetilde{\mathcal{S}}_{2},\cdots,\widetilde{\mathcal{S}}_{L}$, and get, when the event $\mathcal{E}_{1}$ is true for all $\widetilde{\mathcal{S}}_{i}$, $i=1,2,\cdots,L$,

$$\|X_{i}\|_{2}\leq\frac{1}{4}\|X_{i-1}\|_{2}\leq\cdots\leq\frac{\sqrt{r}}{4^{i}},\quad\forall i=1,2,\cdots,L.\tag{24}$$

Note that with probability exceeding $1-\frac{\delta}{2}$, $\mathcal{E}_{1}$ is true for all $\widetilde{\mathcal{S}}_{i}$, $i=1,2,\cdots,L$. Since the $\widetilde{\mathcal{S}}_{i}$ are mutually independent, $\widetilde{\mathcal{S}}_{i+1}$ is independent of $X_{i}$ for each $i\in\{0,1,\cdots,L-1\}$. In view of this, we can apply Lemma 5 to $\mathcal{P}_{T}^{\perp}Y_{L}$ followed by a union bound, and get, with probability exceeding $1-\frac{\delta}{2}$,

$$\|\mathcal{P}_{T}^{\perp}Y_{L}\|\leq\sum_{i=1}^{L}\frac{1}{4\sqrt{r}}\|X_{i-1}\|_{2}\leq\frac{1}{4}\sum_{i=1}^{L}\frac{1}{4^{i-1}}\leq\frac{1}{2}.\tag{25}$$

Now combining (24) and (25) finishes the proof. ∎

Now we are ready to prove Theorem 1.

Proof of Theorem 1.

Suppose \mathcal{E}_{2} holds, and let Y be the matrix it provides. Since \mathcal{S}\Delta=0 and Y lies in the range of \mathcal{S}, we have \left<Y,\Delta\right>=0. Thus we have

sign(H),PUΔPU=sign(H),Δ=sign(H)Y,Δ\displaystyle\;\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>=\left<\mathrm{sign}(H),\Delta\right>=\left<\mathrm{sign}(H)-Y,\Delta\right>
=\displaystyle= 𝒫T(sign(H)Y),ΔT+𝒫T(sign(H)Y),ΔT\displaystyle\;\left<\mathcal{P}_{T}\left(\mathrm{sign}(H)-Y\right),\Delta_{T}\right>+\left<\mathcal{P}_{T}^{\perp}\left(\mathrm{sign}(H)-Y\right),\Delta_{T}^{\perp}\right>
=\displaystyle= sign(H)𝒫TY,ΔT𝒫TY,ΔT\displaystyle\;\left<\mathrm{sign}(H)-\mathcal{P}_{T}Y,\Delta_{T}\right>-\left<\mathcal{P}_{T}^{\perp}Y,\Delta_{T}^{\perp}\right>
\displaystyle\geq 1n21ΔT212ΔT1,\displaystyle\;-\frac{1}{n^{21}}\|\Delta_{T}\|_{2}-\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1},

where the last inequality uses the two properties of Y guaranteed by Lemma 6, together with the Cauchy-Schwarz inequality and the duality between the spectral norm and the nuclear norm.

Now, by Lemma 1 and Lemma 4, we have

012ΔT11n21ΔT212ΔT11n20ΔT112ΔT12n18ΔT1,\displaystyle 0\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{21}}\|\Delta_{T}\|_{2}\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{2}{n^{18}}\|\Delta_{T}^{\perp}\|_{1},

which forces \|\Delta_{T}^{\perp}\|_{1}=0, since \frac{1}{2}-\frac{2}{n^{18}}>0. Finally, another application of Lemma 4 gives \|\Delta\|_{1}=0, i.e. \Delta=0, which concludes the proof.

Theorem 1, together with Proposition 1, establishes Corollary 1.

4 Conclusion

In this paper, we consider the Hessian estimation problem via matrix recovery techniques. In particular, we show that the finite-difference method studied in (Feng and Wang, 2023; Wang, 2023), together with a convex program, guarantees exact recovery, with high probability, of a rank-r Hessian using nr^{2} (up to logarithmic and constant factors) finite-difference operations. Unlike matrix completion methods, we do not assume any incoherence between the coordinate system and the hidden singular space of the Hessian matrix. In a follow-up work, we apply this Hessian estimation mechanism to the cubic-regularized Newton method (Nesterov and Polyak, 2006; Nesterov, 2008), and design sample-efficient optimization algorithms for functions with (approximately) low-rank Hessians.
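For readers who wish to experiment, the recovery step can be prototyped as nuclear-norm minimization subject to the measurement constraints. The sketch below uses cvxpy, with generic symmetric Gaussian measurement matrices standing in for the zeroth-order finite-difference measurements; the sizes n, r, M and all variable names are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, r, M = 20, 2, 180   # dimension, rank, number of (illustrative) measurements

# Ground-truth low-rank symmetric matrix playing the role of the Hessian.
U = rng.standard_normal((n, r))
H = U @ U.T

# Hypothetical symmetric measurement matrices and noiseless observations <G_i, H>.
G = [rng.standard_normal((n, n)) for _ in range(M)]
G = [(Gi + Gi.T) / 2 for Gi in G]
b = [float(np.sum(Gi * H)) for Gi in G]

# Nuclear-norm minimization subject to the measurement constraints.
X = cp.Variable((n, n), symmetric=True)
constraints = [cp.sum(cp.multiply(Gi, X)) == bi for Gi, bi in zip(G, b)]
problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
problem.solve()

print("relative error:", np.linalg.norm(X.value - H) / np.linalg.norm(H))
```

With enough generic measurements, the relative error should be at the level of the solver tolerance.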

Acknowledgement

The authors thank Dr. Hehui Wu for insightful discussions and his contributions to Lemma 2, and Dr. Abiy Tasissa for helpful discussions.

References

  • Ahn et al., (2023) Ahn, J., Elmahdy, A., Mohajer, S., and Suh, C. (2023). On the fundamental limits of matrix completion: Leveraging hierarchical similarity graphs. IEEE Transactions on Information Theory, pages 1–1.
  • Balasubramanian and Ghadimi, (2021) Balasubramanian, K. and Ghadimi, S. (2021). Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42.
  • Bhatia, (1997) Bhatia, R. (1997). Matrix analysis. Graduate Texts in Mathematics.
  • Broyden et al., (1973) Broyden, C. G., Dennis Jr, J. E., and Moré, J. J. (1973). On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223–245.
  • Cai et al., (2010) Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982.
  • Candes and Recht, (2012) Candes, E. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.
  • Candes and Plan, (2010) Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.
  • Candès and Tao, (2010) Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
  • Chandrasekaran et al., (2012) Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849.
  • Chen, (2015) Chen, Y. (2015). Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923.
  • Chen et al., (2020) Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098–3121.
  • Davidon, (1991) Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on optimization, 1(1):1–17.
  • Eldar et al., (2012) Eldar, Y., Needell, D., and Plan, Y. (2012). Uniqueness conditions for low-rank matrix recovery. Applied and Computational Harmonic Analysis, 33(2):309–314.
  • Fan et al., (2021) Fan, J., Wang, W., and Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. The Annals of Statistics, 49(3):1239 – 1266.
  • Fazel, (2002) Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, PhD thesis, Stanford University.
  • Fefferman et al., (2016) Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049.
  • Feng and Wang, (2023) Feng, Y. and Wang, T. (2023). Stochastic zeroth-order gradient and Hessian estimators: variance reduction and refined bias bounds. Information and Inference: A Journal of the IMA, 12(3):1514–1545.
  • Fletcher, (2000) Fletcher, R. (2000). Practical methods of optimization. John Wiley & Sons.
  • Fornasier et al., (2011) Fornasier, M., Rauhut, H., and Ward, R. (2011). Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640.
  • Ghojogh et al., (2023) Ghojogh, B., Crowley, M., Karray, F., and Ghodsi, A. (2023). Elements of dimensionality reduction and manifold learning. Springer Nature.
  • Goldfarb, (1970) Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23–26.
  • Gotoh et al., (2018) Gotoh, J.-y., Takeda, A., and Tono, K. (2018). DC formulations and algorithms for sparse optimization problems. Mathematical Programming, 169(1):141–176.
  • Gross, (2011) Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566.
  • Hu et al., (2012) Hu, Y., Zhang, D., Ye, J., Li, X., and He, X. (2012). Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE transactions on pattern analysis and machine intelligence, 35(9):2117–2130.
  • Keshavan et al., (2010) Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998.
  • Lee and Bresler, (2010) Lee, K. and Bresler, Y. (2010). ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory, 56(9):4402–4416.
  • Li et al., (2023) Li, J., Balasubramanian, K., and Ma, S. (2023). Stochastic zeroth-order Riemannian derivative estimation and optimization. Mathematics of Operations Research, 48(2):1183–1211.
  • Lieb, (1973) Lieb, E. H. (1973). Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11(3):267–288.
  • Mohan and Fazel, (2012) Mohan, K. and Fazel, M. (2012). Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441–3473.
  • Negahban and Wainwright, (2012) Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697.
  • Nesterov, (2008) Nesterov, Y. (2008). Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181.
  • Nesterov and Polyak, (2006) Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.
  • Recht, (2011) Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).
  • Recht et al., (2010) Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.
  • Ren-Pu and Powell, (1983) Ren-Pu, G. and Powell, M. J. (1983). The convergence of variable metric matrices in unconstrained optimization. Mathematical programming, 27:123–143.
  • Resnick and Varian, (1997) Resnick, P. and Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3):56–58.
  • Rodomanov and Nesterov, (2022) Rodomanov, A. and Nesterov, Y. (2022). Rates of superlinear convergence for classical quasi-newton methods. Mathematical Programming, 194(1):159–190.
  • Rohde and Tsybakov, (2011) Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887 – 930.
  • Rong et al., (2021) Rong, Y., Wang, Y., and Xu, Z. (2021). Almost everywhere injectivity conditions for the matrix recovery problem. Applied and Computational Harmonic Analysis, 50:386–400.
  • Shanno, (1970) Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of computation, 24(111):647–656.
  • Spall, (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
  • Stein, (1981) Stein, C. M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151.
  • Tan et al., (2011) Tan, V. Y., Balzano, L., and Draper, S. C. (2011). Rank minimization over finite fields: Fundamental limits and coding-theoretic interpretations. IEEE transactions on information theory, 58(4):2018–2039.
  • Tanner and Wei, (2016) Tanner, J. and Wei, K. (2016). Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis, 40(2):417–429.
  • Tropp, (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12:389–434.
  • Tropp et al., (2015) Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230.
  • Udell and Townsend, (2019) Udell, M. and Townsend, A. (2019). Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160.
  • Vandereycken, (2013) Vandereycken, B. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236.
  • Wang, (2023) Wang, T. (2023). On sharp stochastic zeroth-order Hessian estimators over Riemannian manifolds. Information and Inference: A Journal of the IMA, 12(2):787–813.
  • Wang et al., (2014) Wang, Z., Lai, M.-J., Lu, Z., Fan, W., Davulcu, H., and Ye, J. (2014). Rank-one matrix pursuit for matrix completion. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 91–99, Beijing, China. PMLR.
  • Wen et al., (2012) Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.
  • Mao et al., (2019) Mao, X., Chen, S. X., and Wong, R. K. W. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525):198–210.
  • Xu and Zhang, (2001) Xu, C. and Zhang, J. (2001). A survey of quasi-Newton equations and quasi-Newton methods for optimization. Annals of Operations research, 103:213–234.
  • Zhang et al., (2014) Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2014). Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316.
  • Zhu, (2012) Zhu, S. (2012). A short note on the tail bound of Wishart distribution. arXiv preprint arXiv:1212.5860.

Appendix A Auxiliary Propositions and Lemmas

Proof of Proposition 2.

Let (r,\varphi_{1},\varphi_{2},\cdots,\varphi_{n-1}) be the spherical coordinate system on \mathbb{R}^{n}. For an even integer p, we have

𝔼[v1p]=\displaystyle\mathbb{E}\left[v_{1}^{p}\right]= 1An02π0π0πcosp(φ1)sinn2(φ1)sinn3(φ2)sin(φn2)𝑑φ1𝑑φ2𝑑φn1,\displaystyle\;\frac{1}{A_{n}}\int_{0}^{2\pi}\int_{0}^{\pi}\cdots\int_{0}^{\pi}\cos^{p}(\varphi_{1})\sin^{n-2}(\varphi_{1})\sin^{n-3}(\varphi_{2})\cdots\sin(\varphi_{n-2})\,d\varphi_{1}\,d\varphi_{2}\cdots d\varphi_{n-1},

where AnA_{n} is the surface area of 𝕊n1\mathbb{S}^{n-1}. Let

I(n,p):=0πsinn(x)cosp(x)𝑑x.\displaystyle I(n,p):=\int_{0}^{\pi}\sin^{n}(x)\cos^{p}(x)\,dx.

Since \cos^{2}(x)=1-\sin^{2}(x), we have I(n,p)=I(n,p-2)-I(n+2,p-2). Integrating the identity \frac{d}{dx}\left[\sin^{n+1}(x)\cos^{p-1}(x)\right]=(n+1)\sin^{n}(x)\cos^{p}(x)-(p-1)\sin^{n+2}(x)\cos^{p-2}(x) over [0,\pi] gives I(n+2,p-2)=\frac{n+1}{p-1}I(n,p). Combining the two identities yields I(n,p)=\frac{p-1}{n+p}I(n,p-2).
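As a quick numerical check of this reduction formula (a Python sketch using scipy; the particular values n=6 and p=4 are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def I(n, p):
    # I(n, p) = \int_0^pi sin^n(x) cos^p(x) dx, computed numerically.
    return quad(lambda x: np.sin(x) ** n * np.cos(x) ** p, 0, np.pi)[0]

n, p = 6, 4
print(I(n, p), (p - 1) / (n + p) * I(n, p - 2))   # the two values should agree
```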

Thus we have 𝔼[v1p]=I(n2,p)I(n2,0)=I(n2,p)I(n2,p2)I(n2,p2)I(n2,p4)I(n2,2)I(n2,0)=(p1)(p3)1n(n+2)(n+p2)\mathbb{E}\left[v_{1}^{p}\right]=\frac{I(n-2,p)}{I(n-2,0)}=\frac{I(n-2,p)}{I(n-2,p-2)}\frac{I(n-2,p-2)}{I(n-2,p-4)}\cdots\frac{I(n-2,2)}{I(n-2,0)}=\frac{(p-1)(p-3)\cdots 1}{n(n+2)\cdots(n+p-2)}. We conclude the proof by symmetry.
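The resulting moment formula can likewise be checked by Monte Carlo, sampling uniformly from the unit sphere by normalizing Gaussian vectors (a sketch; the values of n, p, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, N = 8, 4, 200_000   # dimension, even moment order, number of samples

# Uniform samples on the unit sphere S^{n-1}.
v = rng.standard_normal((N, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)

empirical = np.mean(v[:, 0] ** p)

# (p-1)(p-3)...1 / (n (n+2) ... (n+p-2)) for even p.
predicted = np.prod(np.arange(p - 1, 0, -2)) / np.prod(np.arange(n, n + p - 1, 2))
print(empirical, predicted)   # should agree up to Monte Carlo error
```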