
Zeroth-order Low-rank Hessian Estimation via Matrix Recovery

Tianyu Wang (wangtianyu@fudan.edu.cn), Zicheng Wang (22110840011@m.fudan.edu.cn), Jiajia Yu (jiajia.yu@duke.edu)
Abstract

A zeroth-order Hessian estimator aims to recover the Hessian matrix of an objective function at any given point, using minimal finite-difference computations. This paper studies zeroth-order Hessian estimation for low-rank Hessians from a matrix recovery perspective. Our challenge lies in the fact that traditional matrix recovery techniques are not directly suitable for our scenario: they either demand incoherence assumptions (or variants thereof), or require an impractical number of finite-difference computations in our setting. To overcome these hurdles, we employ zeroth-order Hessian estimations aligned with proper matrix measurements, and prove new recovery guarantees for these estimators. More specifically, we prove that for a Hessian matrix $H\in\mathbb{R}^{n\times n}$ of rank $r$, $\mathcal{O}(nr^{2}\log^{2}n)$ proper zeroth-order finite-difference computations ensure an exact recovery of $H$ with high probability. Compared to existing methods, our method can greatly reduce the number of finite-difference computations, and does not require any incoherence assumptions.

1 Introduction

In machine learning, optimization, and many other mathematical programming problems, the Hessian matrix plays an important role since it describes the landscape of the objective function. However, in many real-world scenarios, although we can access function values, the lack of an analytic form for the objective function precludes direct Hessian computation. Therefore, it is important to develop zeroth-order finite-difference Hessian estimators, i.e., to estimate the Hessian matrix using only function evaluations and finite differences.

Finite-difference Hessian estimation has a long history dating back to Newton's time. In recent years, the rise of large models and big data has made the high dimensionality of objective functions a primary challenge in finite-difference Hessian estimation. To address this, stochastic Hessian estimators (Balasubramanian and Ghadimi, 2021; Wang, 2023; Feng and Wang, 2023; Li et al., 2023) have emerged to reduce the required number of function value samples. The efficiency of a Hessian estimator is measured by its sample complexity, which quantifies the number of finite-difference computations needed.

Despite this high dimensionality, low-rank structure is prevalent in machine learning with high-dimensional datasets (Fefferman et al., 2016; Udell and Townsend, 2019). Numerous research directions, such as manifold learning (e.g., Ghojogh et al., 2023) and recommender systems (e.g., Resnick and Varian, 1997), actively leverage this low-rank structure. While there are many studies on stochastic Hessian estimators, as we detail in Section 1.4, none of them exploit the low-rank structure of the Hessian matrix. This omission can lead to overly conservative results and hinder the overall efficiency and effectiveness of the optimization or learning algorithms.

To fill this gap, in this work we develop an efficient finite-difference Hessian estimation method for low-rank Hessians via matrix recovery. While a substantial body of literature studies the sample complexity of low-rank matrix recovery, we emphasize that none of it is directly applicable to our scenario, either due to an overly restrictive global incoherence assumption or a prohibitively large number of finite-difference computations, as we discuss in detail in Section 1.2. We develop a new method and prove that, without any incoherence assumption, for an $n\times n$ Hessian matrix of rank $r$, we can exactly recover the matrix with high probability from $\mathcal{O}(nr^{2}\log^{2}n)$ proper zeroth-order finite-difference computations.

In the rest of this section, we present our problem formulation, discuss why existing matrix recovery methods fail on our problem and summarize our contribution.

1.1 Hessian Estimation via Compressed Sensing Formulation

To recover an $n\times n$ low-rank Hessian matrix $H$ using $\ll n^{2}$ finite-difference operations, we use the following trace norm minimization approach (Fazel, 2002; Recht et al., 2010; Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012):

$$\min_{\widehat{H}\in\mathbb{R}^{n\times n}}\|\widehat{H}\|_{1},\quad\text{subject to}\quad\mathcal{S}\widehat{H}=\mathcal{S}H,\tag{1}$$

where $\mathcal{S}:=\frac{1}{M}\sum_{i=1}^{M}\mathcal{P}_{i}$ and each $\mathcal{P}_{i}$ is a matrix measurement operation that can be obtained via $\mathcal{O}(1)$ finite-difference computations. For our problem, it is worth emphasizing that $\mathcal{P}_{i}$ must satisfy the following requirements.

  • (R1) $\mathcal{P}_{i}$ is different from the sampling operation used for matrix completion. Otherwise an incoherence assumption is needed. See (M1) in Section 1.2 for more details.

  • (R2) $\mathcal{P}_{i}$ cannot involve the inner product between the Hessian matrix and a general matrix, since this operation cannot be efficiently obtained through finite-difference computations. See (M2) in Section 1.2 for more details.

Due to the above two requirements, existing theory for matrix recovery fails to provide satisfactory guarantees for low-rank Hessian estimation.

1.2 Existing Matrix Recovery Methods

Existing methods for low-rank matrix recovery can be divided into two categories: matrix completion methods, and matrix recovery via linear measurements (or matrix regression type method). Unfortunately, both groups of methods are unsuitable for Hessian estimation tasks.

(M1) Matrix completion methods: A candidate class of methods for low-rank Hessian estimation is matrix completion (Fazel, 2002; Cai et al., 2010; Candes and Plan, 2010; Candès and Tao, 2010; Keshavan et al., 2010; Lee and Bresler, 2010; Fornasier et al., 2011; Gross, 2011; Recht, 2011; Candes and Recht, 2012; Hu et al., 2012; Mohan and Fazel, 2012; Negahban and Wainwright, 2012; Wen et al., 2012; Vandereycken, 2013; Wang et al., 2014; Chen, 2015; Tanner and Wei, 2016; Gotoh et al., 2018; Chen et al., 2020; Ahn et al., 2023).

The motivation for matrix completion tasks originated from the Netflix prize, where the challenge was to predict the ratings of all users on all movies based on observing only the ratings of some users on some movies. In order to tackle such problems, it is necessary to assume that the nontrivial singular vectors of the matrix $H$ and the observation basis $\mathcal{B}$ are "incoherent". Incoherence (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012; Chen, 2015; Negahban and Wainwright, 2012), or its alternatives (e.g., Negahban and Wainwright, 2012), implies that there is a sufficiently large angle between the singular vectors and the basis $\mathcal{B}$. The rationale behind this assumption can be explained as follows. Consider a matrix $H$ of size $n\times n$ with a one in its $(1,1)$ entry and zeros elsewhere. If we randomly observe a small fraction of the $n\times n$ entries, it is highly likely that we will miss the $(1,1)$ entry, making it difficult to fully recover the matrix. Therefore, an incoherence parameter $\nu$ is assumed between the given canonical basis $\mathcal{B}$ and the singular vectors of $H$, as illustrated in Figure 1. In the context of zeroth-order optimization, it is often necessary to recover the Hessian at any given point. However, assuming that the Hessian is incoherent with the given basis at all points in the domain is overly restrictive.

(M2) Matrix recovery via linear measurements (matrix regression type recovery): In the context of matrix recovery using linear measurements (Tan et al., 2011; Eldar et al., 2012; Chandrasekaran et al., 2012; Rong et al., 2021), we observe the inner products of the target matrix $H$ with a set of matrices $A_{1},A_{2},\cdots,A_{M}$. Specifically, we have the observations $\left<H,A_{i}\right>:=\mathrm{tr}(H^{*}A_{i})$, and our goal is to recover $H$. In certain scenarios, there may be additional constraints on the $A_{i}$, and the measurements might be corrupted by noise (Rohde and Tsybakov, 2011; Fan et al., 2021; Xiaojun Mao and Wong, 2019), a setting which receives more attention from the statistics community. Eldar et al. (2012) proved that when the entries of $A_{i}$ are independently and identically distributed ($iid$) Gaussian, having $M\geq 4nr-4r^{2}$ linear measurements ensures exact recovery of $H$. Rong et al. (2021) showed that when the density of $(A_{1},A_{2},\cdots,A_{M})$ is absolutely continuous, having $M>nr-r^{2}$ measurements guarantees exact recovery of $H$.

Despite the elegant results in matrix recovery using linear measurements, they are not applicable to Hessian estimation tasks. This limitation arises from the fact that a general linear measurement cannot be approximated by a zeroth-order estimation. To further illustrate this fact, let us consider the Taylor approximation, which, by the fundamental theorem of calculus, is the foundation for zeroth-order estimation. In the Taylor approximation of $f$ at $\mathbf{x}$, the Hessian matrix $\nabla^{2}f(\mathbf{x})$ will always appear as a bilinear form. Therefore, a linear measurement $\left<A,\nabla^{2}f(\mathbf{x})\right>$ for a general $A$ cannot be included in a Taylor approximation of $f$ at $\mathbf{x}$. In the language of optimization and numerical analysis, for a general measurement matrix $A$, one linear measurement $\left<A,H\right>$ may require far more than $\mathcal{O}(1)$ finite-difference computations. Consequently, the theory providing guarantees for linear measurements does not extend to zeroth-order Hessian estimation.

1.3 Our Contribution

In this paper, we introduce a low-rank Hessian estimation mechanism that simultaneously satisfies (R1) and (R2). More specifically,

  • We prove that, with a proper finite-difference scheme, $\mathcal{O}\left(nr^{2}\log^{2}n\right)$ finite-difference computations are sufficient for guaranteeing an exact recovery of the Hessian matrix with high probability. Our approach simultaneously overcomes the limitations of (M1) and (M2).

In the realm of zeroth-order Hessian estimation, no prior art provides high-probability estimation guarantees for low-rank Hessian estimation tasks; see Section 1.4 for more discussion.

1.4 Prior Arts on Hessian Estimation

Zeroth-order Hessian estimation dates back to the birth of calculus. In recent years, researchers from various fields have contributed to this topic (e.g., Broyden et al., 1973; Fletcher, 2000; Spall, 2000; Balasubramanian and Ghadimi, 2021; Li et al., 2023).

In quasi-Newton-type methods (e.g., Goldfarb, 1970; Shanno, 1970; Broyden et al., 1973; Ren-Pu and Powell, 1983; Davidon, 1991; Fletcher, 2000; Spall, 2000; Xu and Zhang, 2001; Rodomanov and Nesterov, 2022), gradient-based Hessian estimators were used for iterative optimization algorithms. Based on Stein's identity (Stein, 1981), Balasubramanian and Ghadimi (2021) introduced a Stein-type Hessian estimator, and combined it with the cubic regularized Newton's method (Nesterov and Polyak, 2006) for non-convex optimization. Li et al. (2023) generalized the Stein-type Hessian estimators to Riemannian manifolds. In parallel to (Balasubramanian and Ghadimi, 2021; Li et al., 2023), Wang (2023) and Feng and Wang (2023) investigated the Hessian estimator that inspires the current work.

Yet prior to our work, no method from the zeroth-order Hessian estimation community focused on low-rank Hessian estimation.

Figure 1: Incoherence condition for $\nabla^{2}f(\mathbf{x})$ at multiple points. When the Hessian of $f$ is low-rank or approximately low-rank, a matrix completion guarantee for $\nabla^{2}f(\mathbf{x})$ at all $\mathbf{x}$ requires an incoherence condition to hold uniformly over $\mathbf{x}$. As illustrated in the right subfigure, such a requirement is overly restrictive.

2 Notations and Conventions

Before proceeding to main results, we lay out some conventions and notations that will be used throughout the paper. We use the following notations for matrix norms:

  • $\|\cdot\|$ is the operator norm (Schatten $\infty$-norm);

  • $\|\cdot\|_{2}$ is the Euclidean norm (Schatten $2$-norm);

  • $\|\cdot\|_{1}$ is the trace norm (Schatten $1$-norm).

Also, the notation $\|\cdot\|$ is overloaded for vector norms and tensor norms. For a vector $\mathbf{v}\in\mathbb{R}^{n}$, $\|\mathbf{v}\|$ is its Euclidean norm; for a tensor $V\in\left(\mathbb{R}^{n}\right)^{\otimes p}$ ($p\geq 2$), $\|V\|$ is its Schatten $\infty$-norm. For any matrix $A$ with singular value decomposition $A=U\Sigma V^{\top}$, we define $\mathrm{sign}(A)=U\,\mathrm{sign}(\Sigma)V^{\top}$, where $\mathrm{sign}(\Sigma)$ applies the $\mathrm{sign}$ function to each entry of $\Sigma$.

For a vector $\mathbf{u}=\left(u_{1},u_{2},\cdots,u_{n}\right)^{\top}\in\mathbb{R}^{n}$ and a positive number $r\leq n$, we define the notations

$$\mathbf{u}_{:r}=\left(u_{1},u_{2},\cdots,u_{r},0,0,\cdots,0\right)^{\top}\quad\text{and}\quad\mathbf{u}_{r:}=\left(0,0,\cdots,0,u_{r},u_{r+1},\cdots,u_{n}\right)^{\top}.$$

Also, we use $C$ and $c$ to denote unimportant absolute constants that do not depend on $n$ or $r$. The numbers $C$ and $c$ may or may not take the same value at each occurrence.
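As a quick illustration of the $\mathrm{sign}(\cdot)$ convention above, the following minimal NumPy sketch (the rank-$3$ test matrix and all names are illustrative assumptions, not part of the paper) computes $\mathrm{sign}(A)$ from an SVD and checks the identity $\|A\|_{1}=\left<\mathrm{sign}(A),A\right>$ used later in the proof of Lemma 1.

```python
import numpy as np

def matrix_sign(A):
    # sign(A) = U sign(Sigma) V^T, from the SVD A = U Sigma V^T
    U, s, Vt = np.linalg.svd(A)
    return U @ np.diag(np.sign(s)) @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 5))  # a rank-3 matrix
S = matrix_sign(A)
# The trace norm ||A||_1 equals <sign(A), A> = tr(sign(A)^T A)
print(np.trace(S.T @ A), np.linalg.svd(A, compute_uv=False).sum())
```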

3 Main Results

We start with a finite-difference scheme that can be viewed as a matrix measurement operation. The Hessian of a function $f:\mathbb{R}^{n}\to\mathbb{R}$ at a given point $\mathbf{x}$ can be estimated as follows (Wang, 2023; Feng and Wang, 2023):

$$\widehat{\nabla}^{2}f(\mathbf{x}):=n^{2}\,\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top},\tag{2}$$

where $\delta$ is the finite-difference granularity, and $\mathbf{u},\mathbf{v}$ are finite-difference directions. Different choices of the laws of $\mathbf{u}$ and $\mathbf{v}$ lead to different Hessian estimators. For example, $\mathbf{u},\mathbf{v}$ can be independent vectors uniformly distributed over the canonical basis $\{\mathbf{e}_{1},\mathbf{e}_{2},\cdots,\mathbf{e}_{n}\}$.
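The estimator (2) translates directly into four function evaluations. The sketch below is a minimal illustration (the quadratic test function and all names are assumptions for demonstration); it computes one such measurement with $\mathbf{u},\mathbf{v}$ drawn uniformly from the unit sphere.

```python
import numpy as np

def hessian_measurement(f, x, delta, rng):
    """One zeroth-order measurement of the Hessian of f at x, as in (2)."""
    n = x.size
    u = rng.standard_normal(n); u /= np.linalg.norm(u)   # u ~ Unif(S^{n-1})
    v = rng.standard_normal(n); v /= np.linalg.norm(v)   # v ~ Unif(S^{n-1})
    quotient = (f(x + delta * v + delta * u) - f(x - delta * v + delta * u)
                - f(x + delta * v - delta * u) + f(x - delta * v - delta * u)) / (4 * delta ** 2)
    return n ** 2 * quotient * np.outer(u, v)

# Example: a quadratic with a known rank-2 Hessian (illustrative assumption)
rng = np.random.default_rng(0)
n = 10
W = rng.standard_normal((n, 2))
H = W @ W.T
f = lambda x: 0.5 * x @ H @ x
P_H = hessian_measurement(f, np.zeros(n), 1e-4, rng)   # approximates n^2 (u^T H v) u v^T
```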

We start our discussion by showing that the Hessian estimator (2) can indeed be viewed as a matrix measurement.

Proposition 1.

Consider an estimator defined in (2). Let the underlying function $f$ be twice continuously differentiable. Let $\mathbf{u},\mathbf{v}$ be two random vectors such that $\|\mathbf{u}\|,\|\mathbf{v}\|<\infty$ a.s. Then for any fixed $\mathbf{x}\in\mathbb{R}^{n}$,

$$\widehat{\nabla}^{2}f(\mathbf{x})\to_{d}n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$$

as $\delta\to 0_{+}$, where $\to_{d}$ denotes convergence in distribution.

Proof.

By Taylor’s Theorem (with integrable remainder) and that the Hessian matrix is symmetric, we have

$$\begin{aligned}\widehat{\nabla}^{2}f(\mathbf{x})=&\;\frac{n^{2}}{4}\left(\left(\mathbf{v}+\mathbf{u}\right)^{\top}\nabla^{2}f(\mathbf{x})\left(\mathbf{v}+\mathbf{u}\right)-\left(\mathbf{v}-\mathbf{u}\right)^{\top}\nabla^{2}f(\mathbf{x})\left(\mathbf{v}-\mathbf{u}\right)\right)\mathbf{u}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right)\\=&\;n^{2}\,\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{v}\,\mathbf{u}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right)\\=&\;n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}+\mathcal{O}\left(\delta\left(\|\mathbf{v}\|+\|\mathbf{u}\|\right)^{3}\right).\end{aligned}$$

As $\delta\to 0_{+}$, the estimator (2) converges to $n^{2}\,\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$ in distribution. ∎

With Proposition 1 in place, we see that matrix measurements of the form

$$\mathcal{P}:H\mapsto n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}$$

for some $\mathbf{u},\mathbf{v}$ can be efficiently computed via finite-difference computations. For the convex program (1) with sampling operators taking the above form, we have the following guarantee.

Theorem 1.

Consider the problem (1). Let the sampler $\mathcal{S}=\frac{1}{M}\sum_{i=1}^{M}\mathcal{P}_{i}$ be constructed with $\mathcal{P}_{i}:A\mapsto n^{2}\,\mathbf{u}_{i}\mathbf{u}_{i}^{\top}A\,\mathbf{v}_{i}\mathbf{v}_{i}^{\top}$ and $\mathbf{u}_{i},\mathbf{v}_{i}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$. Then there exists an absolute constant $C$ such that if the number of samples satisfies $M\geq C\cdot nr^{2}\log^{2}(n)$, where $r:=\mathrm{rank}(H)$, then with probability larger than $1-\frac{1}{n}$, the solution to (1), denoted by $\widehat{H}$, satisfies $\widehat{H}=H$.

As a direct consequence of Theorem 1, we have the following result.

Corollary 1.

Let the finite-difference granularity $\delta>0$ be small. Let $\mathbf{x}\in\mathbb{R}^{n}$ and let $f$ be twice continuously differentiable. Suppose there exists $H$ with $\mathrm{rank}(H)=r$ such that $\|H-\nabla^{2}f(\mathbf{x})\|\leq\epsilon$ for some $\epsilon\geq 0$, and the estimator (2) with $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$ satisfies

$$\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top}=_{d}\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top},$$

where $=_{d}$ denotes distributional equivalence. Then there exists an absolute constant $C$ such that if more than $C\cdot nr^{2}\log^{2}n$ zeroth-order finite-difference computations are obtained, then with probability exceeding $1-\frac{1}{n}$, the solution $\widehat{H}$ to (1) satisfies $\|\widehat{H}-\nabla^{2}f(\mathbf{x})\|\leq\epsilon$.

By Proposition 1, we know that, as $\delta\to 0_{+}$,

$$\frac{f(\mathbf{x}+\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}-\delta\mathbf{v}+\delta\mathbf{u})-f(\mathbf{x}+\delta\mathbf{v}-\delta\mathbf{u})+f(\mathbf{x}-\delta\mathbf{v}-\delta\mathbf{u})}{4\delta^{2}}\,\mathbf{u}\mathbf{v}^{\top}$$

converges to $\mathbf{u}\mathbf{u}^{\top}\nabla^{2}f(\mathbf{x})\,\mathbf{v}\mathbf{v}^{\top}$ in distribution. Therefore, Corollary 1 implies that the estimator (2), together with the convex program (1), provides a sample-efficient low-rank Hessian estimator. Corollary 1 also implies a guarantee for approximately low-rank Hessians.
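To make the pipeline concrete, here is a minimal end-to-end sketch under stated assumptions: it uses CVXPY for the trace-norm program, reads the bilinear measurements $\mathbf{u}_{i}^{\top}H\mathbf{v}_{i}$ directly from a synthetic rank-$r$ matrix $H$ (in practice they would come from the finite-difference quotient in (2)), and enforces each measurement individually, which in particular implies the averaged constraint $\mathcal{S}\widehat{H}=\mathcal{S}H$ in (1). All names and problem sizes are illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
W = rng.standard_normal((n, r))
H = W @ W.T                                          # synthetic rank-r "Hessian"

M = 6 * n * r                                        # number of measurements
U = rng.standard_normal((M, n)); U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.standard_normal((M, n)); V /= np.linalg.norm(V, axis=1, keepdims=True)
b = np.array([U[i] @ H @ V[i] for i in range(M)])    # u_i^T H v_i

X = cp.Variable((n, n))
constraints = [cp.sum(cp.multiply(np.outer(U[i], V[i]), X)) == b[i] for i in range(M)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
print("relative error:", np.linalg.norm(X.value - H) / np.linalg.norm(H))
```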

The rest of this section is devoted to proving Theorem 1 and thus also Corollary 1.

3.1 Preparations

To describe the recovery argument for a symmetric low-rank matrix $H\in\mathbb{R}^{n\times n}$ with $\mathrm{rank}(H)=r$, we consider the eigenvalue decomposition $H=U\Lambda U^{\top}$ ($U\in\mathbb{R}^{n\times r}$ and $\Lambda\in\mathbb{R}^{r\times r}$), and a subspace of $\mathbb{R}^{n\times n}$ defined by

$$T:=\{A\in\mathbb{R}^{n\times n}:\left(I-P_{U}\right)A\left(I-P_{U}\right)=0\},$$

where $P_{U}$ is the projection onto the column space of $U$. We also define a projection operation onto $T$:

$$\mathcal{P}_{T}:A\mapsto P_{U}A+AP_{U}-P_{U}AP_{U}.$$
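As a small sanity check (a sketch under the assumption that NumPy is available; all names are illustrative), one can verify numerically that $\mathcal{P}_{T}$ is idempotent and that $(I-P_{U})(\mathcal{P}_{T}A)(I-P_{U})=0$, i.e. that $\mathcal{P}_{T}$ indeed maps into $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))     # orthonormal basis of col(U)
P_U = U @ U.T

def P_T(A):
    return P_U @ A + A @ P_U - P_U @ A @ P_U

A = rng.standard_normal((n, n))
B = P_T(A)
assert np.allclose(P_T(B), B)                                      # idempotence
assert np.allclose((np.eye(n) - P_U) @ B @ (np.eye(n) - P_U), 0)   # P_T A lies in T
```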

Let $\widehat{H}$ be the solution of (1) and let $\Delta:=\widehat{H}-H$. We start with the following lemma, which can be extracted from the matrix completion literature (e.g., Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).

Lemma 1.

Let $\widehat{H}$ be the solution of the program (1) and let $\Delta:=\widehat{H}-H$. Then it holds that

$$\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\leq 0,\tag{3}$$

where $\Delta_{T}^{\perp}:=\mathcal{P}_{T}^{\perp}\Delta$.

Proof.

Since $H\in T$, we have

$$\begin{aligned}\|H+\Delta\|_{1}\geq&\;\|P_{U}(H+\Delta)P_{U}\|_{1}+\|P_{U}^{\perp}(H+\Delta)P_{U}^{\perp}\|_{1}\qquad(4)\\=&\;\|H+P_{U}\Delta P_{U}\|_{1}+\|\Delta_{T}^{\perp}\|_{1},\qquad(5)\end{aligned}$$

where the first inequality uses the "pinching" inequality (Exercise II.5.4 & II.5.5 in (Bhatia, 1997)).

Since $\|\mathrm{sign}(H)\|=1$, we continue the above computation, and get

$$\begin{aligned}(5)=&\;\|\mathrm{sign}(H)\|\,\|H+P_{U}\Delta P_{U}\|_{1}+\|\Delta_{T}^{\perp}\|_{1}\\ \geq&\;\left<\mathrm{sign}(H),H+P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\\=&\;\|H\|_{1}+\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}.\qquad(6)\end{aligned}$$

On the second line, we use Hölder's inequality. On the third line, we use that $\|A\|_{1}=\left<\mathrm{sign}(A),A\right>$ for any real matrix $A$.

Since $\widehat{H}$ solves (1), we know $\|H\|_{1}\geq\|\widehat{H}\|_{1}=\|H+\Delta\|_{1}$. Thus rearranging terms in (6) finishes the proof. ∎

3.2 The High Level Roadmap

With the estimator (2) and Lemma 1 in place, we are ready to present the high-level roadmap of our argument. On a high level, the rest of the paper aims to prove the following two claims:

  • (A1): With high probability, $\|\Delta_{T}\|_{2}\leq 2n\|\Delta_{T}^{\perp}\|_{2}$, where $\Delta_{T}:=\mathcal{P}_{T}\Delta$.

  • (A2): With high probability, $\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>\geq-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}-\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}$, where $\Delta_{T}^{\perp}:=\Delta-\Delta_{T}$.

Once (A1) and (A2) are in place, we can quickly prove Theorem 1.

Sketch of proof of Theorem 1 with (A1) and (A2) assumed.

Now, by Lemma 1 and (A1), we have, with high probability,

$$\begin{aligned}0\overset{\text{by Lemma 1}}{\geq}&\;\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>+\|\Delta_{T}^{\perp}\|_{1}\\ \overset{\text{by (A2)}}{\geq}&\;\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}\\ \overset{\text{by (A1)}}{\geq}&\;\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{2}{n^{18}}\|\Delta_{T}^{\perp}\|_{1},\end{aligned}$$

which implies $\|\Delta_{T}^{\perp}\|_{1}=0$ w.h.p. Finally, another use of (A1) implies $\|\Delta\|_{1}=0$ w.h.p., which concludes the proof. ∎

Therefore, the core argument reduces to proving (A1) and (A2). In the next subsection, we prove (A1) and (A2) for the random measurements obtained by the Hessian estimator (2), without any incoherence-type assumptions.

3.3 The Concentration Arguments

For the concentration argument, we need to make several observations. One of the key observations is that the spherical measurements are rotation-invariant and reflection-invariant. More specifically, for the random measurement $\mathcal{P}H=n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}$ with $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$, we have

$$n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}Q^{\top}HQ\mathbf{v}\mathbf{v}^{\top}Q^{\top}$$

for any orthogonal matrix $Q$, where $=_{d}$ denotes distributional equivalence. With a properly chosen $Q$, we have

$$n^{2}\,\mathbf{u}\mathbf{u}^{\top}H\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top},$$

where $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $H$. This observation makes calculating the moments of $\mathcal{P}H$ possible. With the moments of the random matrices properly controlled, we can use the matrix-valued Cramér–Chernoff method to arrive at matrix concentration inequalities.

Another useful tool is the Kronecker product and the vectorization of matrices. Let $\mathrm{\texttt{vec}}\left(\cdot\right)$ be the vectorization operation of a matrix. Then, as per how $\mathcal{P}_{T}$ is defined, we have, for any $A\in\mathbb{R}^{n\times n}$,

$$\mathrm{\texttt{vec}}\left(\mathcal{P}_{T}A\right)=\mathrm{\texttt{vec}}\left(P_{U}A+AP_{U}-P_{U}AP_{U}\right)=\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right)\mathrm{\texttt{vec}}\left(A\right).\tag{7}$$

The above formula implies that $\mathcal{P}_{T}$ can be represented as a matrix of size $n^{2}\times n^{2}$. Similarly, the measurement operator $\mathcal{P}:A\mapsto n^{2}\,\mathbf{u}\mathbf{u}^{\top}A\,\mathbf{v}\mathbf{v}^{\top}$ can also be represented as a matrix of size $n^{2}\times n^{2}$. Compared to the matrix completion problem, the importance of the vectorized representation and the Kronecker product is more pronounced in our case. The reason is again the absence of an incoherence-type assumption. More specifically, the vectorized representation is useful in controlling the cumulant generating functions of the random matrices associated with the spherical measurements.
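The identity (7) can be checked numerically. The sketch below is a minimal illustration under the assumption of column-major vectorization (the convention for which $\mathrm{\texttt{vec}}(AXB)=(B^{\top}\otimes A)\,\mathrm{\texttt{vec}}(X)$); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
P_U = U @ U.T
A = rng.standard_normal((n, n))

vec = lambda M: M.flatten(order="F")                 # column-major vectorization
K = np.kron(P_U, np.eye(n)) + np.kron(np.eye(n), P_U) - np.kron(P_U, P_U)
lhs = vec(P_U @ A + A @ P_U - P_U @ A @ P_U)
assert np.allclose(K @ vec(A), lhs)                  # identity (7)
```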

Finally, some additional care is needed to properly control the high moments of $\mathcal{P}H$. Such additional care is showcased in the inequality stated below in Lemma 2. An easy upper bound for the LHS of (8) is $\mathcal{O}(r^{p})$. However, an $\mathcal{O}(r^{p})$ bound for the LHS of (8) would eventually result in the loss of a factor of $r$ in the final bound. Overall, tight control is needed in several different places in order to obtain the final recovery bound in Theorem 1.

Lemma 2.

Let $r$ and $p\geq 2$ be positive integers. Then it holds that

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(100r)^{p-1}.\tag{8}$$
Proof.

Case I: $r\leq\frac{1}{2}50^{p-1}$. Note that

$$\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq\frac{1}{(\frac{\alpha_{i}}{2})^{(\frac{\alpha_{i}}{2})}}\quad\text{ and thus }\quad\log\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq-\frac{\alpha_{i}}{2}\log\left(\frac{\alpha_{i}}{2}\right).\tag{9}$$

Since the function $x\mapsto-x\log x$ is concave, Jensen's inequality gives

$$\frac{-\sum_{i=1}^{r}\frac{\alpha_{i}}{2}\log(\frac{\alpha_{i}}{2})}{r}\leq-\frac{\frac{\sum_{i=1}^{r}\alpha_{i}}{r}}{2}\log\left(\frac{\frac{\sum_{i=1}^{r}\alpha_{i}}{r}}{2}\right)=-\frac{p}{r}\log\frac{p}{r}.\tag{10}$$

Combining (9) and (10) gives

$$\log\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq-\sum_{i=1}^{r}\frac{\alpha_{i}}{2}\log\left(\frac{\alpha_{i}}{2}\right)\leq-p\log\frac{p}{r},$$

which implies

$$\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(2p)^{p}\left(\frac{r}{p}\right)^{p}\leq(2r)^{p}\leq(100r)^{p-1},$$

where the last inequality uses $r\leq\frac{1}{2}50^{p-1}$.

Case II: $r>\frac{1}{2}50^{p-1}$. For this case, we first show that the maximum of $\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}$ is attained when $|\alpha_{i}-\alpha_{j}|\leq 2$ for all $i,j$. To show this, suppose there exist $\alpha_{k}$ and $\alpha_{j}$ such that $|\alpha_{k}-\alpha_{j}|>2$. Without loss of generality, let $\alpha_{k}>\alpha_{j}+2$. Then

$$\frac{(\frac{\alpha_{k}}{2})!}{\alpha_{k}!}\cdot\frac{(\frac{\alpha_{j}}{2})!}{\alpha_{j}!}\leq\frac{(\frac{\alpha_{k}-2}{2})!}{(\alpha_{k}-2)!}\cdot\frac{(\frac{\alpha_{j}+2}{2})!}{(\alpha_{j}+2)!}.$$

Therefore, we can increase the value of $\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}$ until $|\alpha_{i}-\alpha_{j}|\leq 2$ for all $i,j$. By the above argument, we have, for $r>\frac{1}{2}50^{p-1}\geq p$,

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq\left(\frac{1}{2}\right)^{p}\cdot\left(\frac{0!}{0!}\right)^{r-p}=\frac{1}{2^{p}}.$$

Therefore, we have

$$\max_{\alpha_{1},\alpha_{2},\cdots,\alpha_{r}\geq 0;\;\sum_{i=1}^{r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}\leq(2p)^{p}\cdot 2^{-p}=p^{p}\leq(50\cdot 50^{p-1})^{p-1}\leq(100r)^{p-1}.$$

∎
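For small parameters, the bound (8) can also be checked directly by brute-force enumeration. The sketch below is a minimal sanity check (all names are illustrative; the enumeration is exponential in $r$ and only meant for tiny instances).

```python
from itertools import product
from math import factorial

def lemma2_lhs(r, p):
    """Maximize (2p)!/p! * prod_i (a_i/2)!/a_i! over even a_i >= 0 summing to 2p."""
    best = 0.0
    for halves in product(range(p + 1), repeat=r):   # halves[i] = a_i / 2
        if sum(halves) != p:
            continue
        val = factorial(2 * p) / factorial(p)
        for h in halves:
            val *= factorial(h) / factorial(2 * h)
        best = max(best, val)
    return best

for r in range(1, 5):
    for p in range(2, 5):
        assert lemma2_lhs(r, p) <= (100 * r) ** (p - 1)
```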

With all the above preparation in place, we next present Lemma 3, which is the key step leading to (A1).

Lemma 3.

Let

$$\mathcal{E}_{1}:=\left\{\left\|\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}-\mathcal{P}_{T}\right\|\leq\frac{1}{4}\right\},$$

where $\mathcal{P}_{T}$ and $\mathcal{S}$ are regarded as matrices of size $n^{2}\times n^{2}$. Pick any $\delta\in(0,1)$. Then there exists some constant $C$ such that when $M\geq Cnr\log(1/\delta)$, it holds that $\mathbb{P}\left(\mathcal{E}_{1}\right)\geq 1-\delta$.

The operators $\mathcal{P}_{T}$ and $\mathcal{S}$ can be represented as matrices of size $n^{2}\times n^{2}$. Therefore, we can apply a matrix-valued Cramér–Chernoff-type argument (or matrix Laplace argument (Lieb, 1973)) to derive a concentration bound. In (Tropp, 2012; Tropp et al., 2015), a master matrix concentration inequality is presented. This result is stated below as Theorem 2.

Theorem 2 (Tropp et al., (2015)).

Consider a finite sequence $\{X_{k}\}$ of independent, random, Hermitian matrices of the same size. Then for all $t\in\mathbb{R}$,

$$\mathbb{P}\left(\lambda_{\max}\left(\sum_{k}X_{k}\right)\geq t\right)\leq\inf_{\theta>0}e^{-\theta t}\,\mathrm{tr}\exp\left(\sum_{k}\log\mathbb{E}e^{\theta X_{k}}\right),$$

and

$$\mathbb{P}\left(\lambda_{\min}\left(\sum_{k}X_{k}\right)\leq t\right)\leq\inf_{\theta<0}e^{-\theta t}\,\mathrm{tr}\exp\left(\sum_{k}\log\mathbb{E}e^{\theta X_{k}}\right).$$

For our purpose, a more convenient form is the matrix concentration inequality with Bernstein’s conditions on the moments. Such results may be viewed as corollaries to Theorem 2, and a version is stated below in Theorem 3.

Theorem 3 (Zhu, (2012); Zhang et al., (2014)).

Let $\{X_{k}:k=1,\cdots,K\}$ be a finite sequence of independent, random, self-adjoint matrices of dimension $n$, all of which satisfy the Bernstein moment condition, i.e.,

$$\mathbb{E}\left[X_{k}^{p}\right]\preceq\frac{p!}{2}B^{p-2}\Sigma_{2},\quad\text{ for }p\geq 2,$$

where $B$ is a positive constant and $\Sigma_{2}$ is a positive semi-definite matrix. Then

$$\mathbb{P}\left(\lambda_{1}\left(\sum_{k}X_{k}\right)\geq\lambda_{1}\left(\sum_{k}\mathbb{E}X_{k}\right)+\sqrt{2K\theta\lambda_{1}\left(\Sigma_{2}\right)}+\theta B\right)\leq n\exp\left(-\theta\right),$$

for each $\theta>0$.

Another useful ingredient is the moments of spherical random variables, stated below in Proposition 2. The proof of Proposition 2 is in the Appendix.

Proposition 2.

Let $\mathbf{v}$ be uniformly sampled from $\mathbb{S}^{n-1}$ ($n\geq 2$). It holds that

$$\mathbb{E}\left[v_{i}^{p}\right]=\frac{(p-1)(p-3)\cdots 1}{n(n+2)\cdots(n+p-2)}$$

for all $i=1,2,\cdots,n$ and any positive even integer $p$.
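Proposition 2 is easy to spot-check by Monte Carlo. The following minimal sketch (parameter values are illustrative assumptions) compares the empirical fourth moment of one coordinate of a uniformly random unit vector with the closed form $\frac{3\cdot 1}{n(n+2)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 1_000_000
g = rng.standard_normal((N, n))
v = g / np.linalg.norm(g, axis=1, keepdims=True)     # rows uniform on S^{n-1}
empirical = np.mean(v[:, 0] ** 4)
exact = 3.0 / (n * (n + 2))                          # Proposition 2 with p = 4
print(empirical, exact)                              # agree to ~3 decimal places
```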

With the above results in place, we can now prove Lemma 3.

Proof of Lemma 3.

Fix $\delta\in(0,1)$, and let $M>Cnr\log(1/\delta)$ for some absolute constant $C$. Following reasoning similar to that for (7), we can represent $\mathcal{P}$ as

$$\mathcal{P}=n^{2}\,\mathbf{u}\mathbf{u}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top},\tag{11}$$

where $\mathbf{u},\mathbf{v}\overset{iid}{\sim}\mathrm{Unif}(\mathbb{S}^{n-1})$.

Thus, by viewing $\mathcal{P}$ and $\mathcal{P}_{T}$ as matrices of size $n^{2}\times n^{2}$, we have

$$\mathcal{P}_{T}\mathcal{P}\mathcal{P}_{T}=n^{2}\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right)\left(\mathbf{u}\mathbf{u}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right)\left(P_{U}\otimes I_{n}+I_{n}\otimes P_{U}-P_{U}\otimes P_{U}\right).$$

Let $Q$ be an orthogonal matrix such that

$$QP_{U}Q^{\top}=I_{n}^{:r}:=\begin{bmatrix}I&0_{r\times(n-r)}\\ 0_{(n-r)\times r}&0_{(n-r)\times(n-r)}\end{bmatrix}.$$

Since the distributions of $\mathbf{u}$ and $\mathbf{v}$ are rotation-invariant and reflection-invariant, we know

$$\begin{aligned}&\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\mathcal{P}\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\\=&\;\left(Q\otimes Q\right)\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right)\mathcal{P}\left(Q\otimes Q\right)\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right)\\=_{d}&\;\left(Q\otimes Q\right)\mathcal{P}_{T}\mathcal{P}\mathcal{P}_{T}\left(Q^{\top}\otimes Q^{\top}\right),\qquad(12)\end{aligned}$$

where $=_{d}$ denotes distributional equivalence.

Therefore, it suffices to study the distribution of

$$\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right)\mathcal{P}_{i}\left(I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}\right).$$

For simplicity, introduce notation

$$\mathcal{R}_{T}:=I_{n}^{:r}\otimes I_{n}+I_{n}\otimes I_{n}^{:r}-I_{n}^{:r}\otimes I_{n}^{:r}=I_{n}^{:r}\otimes I_{n}+I_{n}^{r+1:}\otimes I_{n}^{:r},$$

and we have

$$\begin{aligned}\mathcal{R}_{T}\mathcal{P}\mathcal{R}_{T}=&\;n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\\&+n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

For simplicity, introduce

$$\begin{aligned}X:=&\;n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top},\\ Y:=&\;n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top},\\ Z:=&\;n^{2}\,\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+n^{2}\,\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

Next we will show that the average of $iid$ copies of $X$, $Y$, $Z$ concentrates around $\mathbb{E}X$, $\mathbb{E}Y$, $\mathbb{E}Z$, respectively. To do this, we bound the moments of $X$, $Y$ and $Z$, and apply Theorem 3.

Bounding $X$ and $Y$. The second moment of $X$ is

$$\mathbb{E}\left[X^{2}\right]=n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right]\preceq 3nr\,I_{n^{2}},$$

where the last inequality follows from Proposition 2. Thus the centralized second moment of $X$ is bounded by

$$\mathbb{E}\left[\left(X-\mathbb{E}X\right)^{2}\right]\preceq 3nr\,I_{n^{2}}.$$

For $p>2$, we have

$$\mathbb{E}\left[X^{p}\right]=n^{p}\,\mathbb{E}\left[\left(\sum_{i=1}^{r}u_{i}^{2}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}^{\top}\right]\preceq\frac{p!}{2}(6n(r+2))^{p-1}I_{n^{2}},$$

which, by operator Jensen, implies

$$\mathbb{E}\left[\left(X-\mathbb{E}X\right)^{p}\right]\preceq\mathbb{E}\left[2^{p}X^{p}+2^{p}\left(\mathbb{E}X\right)^{p}\right]\preceq\frac{p!}{2}(24n(r+2))^{p-1}I_{n^{2}}.$$

When using the operator Jensen's inequality, we use $I_{n^{2}}=\frac{1}{2}I_{n^{2}}+\frac{1}{2}I_{n^{2}}$ as the decomposition of identity.

Let $X_{1},X_{2},\cdots,X_{M}$ be $iid$ copies of $X$. Since $M\geq Cnr\log(1/\delta)$, Theorem 3 implies that

$$\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)X_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[X\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3}.\tag{13}$$

The bound for $Y$ follows similarly. Let $Y_{1},Y_{2},\cdots,Y_{M}$ be $iid$ copies of $Y$, and we have

$$\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Y_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[Y\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3}.\tag{14}$$

Bounding $Z$. The second moment of $Z$ is

$$\begin{aligned}\mathbb{E}\left[Z^{2}\right]=&\;n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}_{:r}\mathbf{v}^{\top}+\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}\right)^{2}\right]\\=&\;n^{4}\,\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)+\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)\right]\\ \preceq&\;\frac{n^{2}r}{(n+2)}I_{n^{2}}+3nr\,I_{n^{2}}\preceq 4nr\,I_{n^{2}},\end{aligned}$$

where the last line uses Proposition 2.

The $2p$-th power of $Z$ is

$$\begin{aligned}Z^{2p}=&\;n^{4p}\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)^{p}+n^{4p}\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\otimes\left(\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)^{p}\\ \preceq&\;n^{4p}\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes\left(\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\right)^{p-1}\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\\&+n^{4p}\left(\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\right)^{p-1}\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes\left(\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\right)^{p}\mathbf{v}\mathbf{v}^{\top}\end{aligned}$$

and the $(2p+1)$-th power of $Z$ is

$$\begin{aligned}Z^{2p+1}=&\;n^{4p+2}\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\otimes\left(\mathbf{v}_{:r}\mathbf{v}_{:r}^{\top}\right)^{p}\mathbf{v}_{:r}\mathbf{v}^{\top}\\&+n^{4p+2}\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\otimes\left(\mathbf{v}\mathbf{v}_{:r}^{\top}\mathbf{v}_{:r}\mathbf{v}^{\top}\right)^{p}\mathbf{v}\mathbf{v}_{:r}^{\top}.\end{aligned}$$

Thus by Proposition 2, we have

$$\begin{aligned}\mathbb{E}\left[Z^{2p}\right]\preceq&\;n^{4p}\,\mathbb{E}\left[r^{p-1}\left(\sum_{i=1}^{r}u_{i}^{2p}\right)\mathbf{u}_{r+1:}\mathbf{u}_{r+1:}^{\top}\otimes r^{p-2}\left(\sum_{i=1}^{r}v_{i}^{2p-2}\right)\mathbf{v}\mathbf{v}^{\top}\right]\\&+n^{4p}\,\mathbb{E}\left[r^{p-2}\left(\sum_{i=1}^{r}u_{i}^{2p-2}\right)\mathbf{u}_{:r}\mathbf{u}_{:r}^{\top}\otimes r^{p-1}\left(\sum_{i=1}^{r}v_{i}^{2p}\right)\mathbf{v}\mathbf{v}^{\top}\right]\\ \preceq&\;2n^{4p}r^{2p-1}\cdot\frac{(2p+1)(2p-1)\cdots 1}{n(n+2)\cdots(n+2p)}\cdot\frac{(2p-1)(2p-3)\cdots 1}{n(n+2)\cdots(n+2p-2)}I_{n^{2}}\\ \preceq&\;\frac{(2p)!}{2}(8nr)^{2p-1}I_{n^{2}}.\end{aligned}$$

For $Z^{2p+1}$ ($p\in\mathbb{N}$), we notice that

$$\mathbb{E}\left[\left(\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right)^{p}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right]=\mathbb{E}\left[\left(\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\mathbf{u}_{r+1:}\mathbf{u}_{:r}^{\top}\right)^{p}\mathbf{u}_{:r}\mathbf{u}_{r+1:}^{\top}\right]=0,$$

since these terms only involve odd powers of the entries of $\mathbf{u}$. Therefore,

$$\mathbb{E}\left[Z^{2p+1}\right]=0,\quad\text{ for }p=0,1,2,\cdots\tag{15}$$

Let $Z_{1},Z_{2},\cdots,Z_{M}$ be $M$ $iid$ copies of $Z$, and let $M\geq Cnr\log(1/\delta)$ for some absolute constant $C$. By (15), we know $\mathbb{E}\left[Z\right]=0$, and all the above moments of $Z$ are centralized moments of $Z$. Now we apply Theorem 3 to conclude that

$$\begin{aligned}&\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}Z_{i}-\mathbb{E}Z\right\|\geq\frac{1}{6}\right)\\=&\;\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Z_{i}(Q^{\top}\otimes Q^{\top})-(Q\otimes Q)\mathbb{E}\left[Z\right](Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\\=&\;\mathbb{P}\left(\left\|\frac{1}{M}\sum_{i=1}^{M}(Q\otimes Q)Z_{i}(Q^{\top}\otimes Q^{\top})\right\|\geq\frac{1}{6}\right)\leq\frac{\delta}{3},\qquad(16)\end{aligned}$$

where $Q$ is the orthogonal matrix introduced in (12). We take a union bound over (13), (14) and (16) to conclude the proof. ∎

Now, with Lemma 3 in place, we next state Lemma 4. This lemma proves (A1).

Lemma 4.

Suppose $\mathcal{E}_{1}$ is true. Let $\widehat{H}$ be the solution of the constrained optimization problem (1), and let $\Delta:=\widehat{H}-H$. Then $\|\mathcal{P}_{T}\Delta\|_{2}\leq 2n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}$.

Proof.

Represent $\mathcal{S}$ as a matrix of size $n^{2}\times n^{2}$. Let $\sqrt{\mathcal{S}}$ be defined as a canonical matrix function. That is, $\sqrt{\mathcal{S}}$ and $\mathcal{S}$ share the same eigenvectors, and the eigenvalues of $\sqrt{\mathcal{S}}$ are the square roots of the eigenvalues of $\mathcal{S}$. Clearly,

$$\|\sqrt{\mathcal{S}}\Delta\|_{2}=\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta+\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}\geq\|\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}-\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta\|_{2}.\tag{17}$$

Clearly we have

$$\|\sqrt{\mathcal{S}}\mathcal{P}_{T}^{\perp}\Delta\|_{2}\leq n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}.$$

Also, it holds that

$$\|\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\|_{2}^{2}=\left<\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta,\sqrt{\mathcal{S}}\mathcal{P}_{T}\Delta\right>=\left<\mathcal{P}_{T}\Delta,\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}\Delta\right>=\|\mathcal{P}_{T}\Delta\|_{2}^{2}-\left<\mathcal{P}_{T}\Delta-\mathcal{P}_{T}\mathcal{S}\mathcal{P}_{T}\Delta,\mathcal{P}_{T}\Delta\right>\geq\frac{1}{2}\|\mathcal{P}_{T}\Delta\|_{2}^{2},\tag{18}$$

where the last inequality uses Lemma 3.

Since $\widehat{H}$ solves (1), we know $\mathcal{S}\Delta=0$, and thus $\sqrt{\mathcal{S}}\Delta=0$. Suppose, in order to get a contradiction, that $\|\mathcal{P}_{T}\Delta\|_{2}>2n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}$. Then (17) and (18) yield

$$\|\sqrt{\mathcal{S}}\Delta\|_{2}\geq\frac{1}{2}\|\mathcal{P}_{T}\Delta\|_{2}-n\|\mathcal{P}_{T}^{\perp}\Delta\|_{2}>0,$$

which leads to a contradiction. ∎

Next we turn to prove (A2), whose core argument relies on Lemma 5.

Lemma 5.

Let $G\in T$ be fixed. Pick any $\delta\in(0,1)$. Then there exists a constant $C$ such that when $M\geq Cnr^{2}\log(1/\delta)$, it holds that

$$\mathbb{P}\left(\left\|\mathcal{P}_{T}^{\perp}\mathcal{S}G\right\|\geq\frac{1}{4\sqrt{r}}\|G\|\right)\leq\delta.$$
Proof.

There exists an orthogonal matrix $Q$ such that $G=Q\Lambda Q^{\top}$, where

$$\Lambda=\mathrm{Diag}(\lambda_{1},\lambda_{2},\cdots,\lambda_{2r},0,0,\cdots,0)$$

is a diagonal matrix consisting of the eigenvalues of $G$. Let $\mathcal{P}$ be the operator defined as in (11); we will study the behavior of $\mathcal{P}G$ and then apply Theorem 3. Since the distribution of $\mathbf{u},\mathbf{v}\sim\mathrm{Unif}(\mathbb{S}^{n-1})$ is rotation-invariant and reflection-invariant, we have

$$\mathcal{P}G=n^{2}\,\mathbf{u}\mathbf{u}^{\top}G\,\mathbf{v}\mathbf{v}^{\top}=_{d}n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}Q^{\top}GQ\mathbf{v}\mathbf{v}^{\top}Q^{\top}=n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top},$$

where $=_{d}$ denotes distributional equivalence. Thus it suffices to study the behavior of $B:=n^{2}\,Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}Q^{\top}$. For the matrix $B$, we consider

$$A:=\begin{bmatrix}0_{n\times n}&B\\ B^{\top}&0_{n\times n}\end{bmatrix}.$$

Next we study the moments of $A$. The second power of $A$ is $A^{2}=\begin{bmatrix}BB^{\top}&0_{n\times n}\\ 0_{n\times n}&B^{\top}B\end{bmatrix}$. By Proposition 2, we have

$$\begin{aligned}\mathbb{E}\left[BB^{\top}\right]=&\;n^{4}\,\mathbb{E}\left[Q\mathbf{u}\mathbf{u}^{\top}\Lambda\mathbf{v}\mathbf{v}^{\top}\Lambda\mathbf{u}\mathbf{u}^{\top}Q^{\top}\right]\\=&\;n^{3}\,Q\,\mathbb{E}\left[\mathbf{u}\mathbf{u}^{\top}\Lambda^{2}\mathbf{u}\mathbf{u}^{\top}\right]Q^{\top}\\ \preceq&\;n^{3}\,Q\,\mathbb{E}\left[\|G\|^{2}\mathbf{u}\left(\mathbf{u}_{:2r}^{\top}\mathbf{u}_{:2r}\right)\mathbf{u}^{\top}\right]Q^{\top}\preceq 4nr\|G\|^{2}I_{n},\end{aligned}$$

and similarly, $\mathbb{E}\left[B^{\top}B\right]\preceq 4nr\|G\|^{2}I_{n}$. For even moments of $A$, we first compute $\mathbb{E}\left[\left(BB^{\top}\right)^{p}\right]$ and $\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]$ for $p\geq 2$. For this, we have

$$\begin{aligned}\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]=&\;Q\,\mathbb{E}\left[n^{4p}\left(\sum_{i=1}^{2r}\lambda_{i}v_{i}u_{i}\right)^{2p}\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\=&\;n^{4p}\,Q\,\mathbb{E}\left[\left(\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\=&\;n^{4p}\,Q\,\mathbb{E}\left[\left(\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top},\qquad(19)\end{aligned}$$

where the last equality uses that expectations of odd powers of $v_{i}$ or $u_{i}$ are zero. Note that

$$\begin{aligned}&\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}{2p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\\=&\;\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=2p;\;\alpha_{i}\text{ even}}\frac{(2p)!}{p!}\prod_{i=1}^{2r}\frac{(\frac{\alpha_{i}}{2})!}{\alpha_{i}!}{p\choose\frac{\alpha_{1}}{2},\frac{\alpha_{2}}{2},\cdots,\frac{\alpha_{2r}}{2}}\prod_{i=1}^{2r}\left(\lambda_{i}v_{i}u_{i}\right)^{\alpha_{i}}\\ \leq&\;(200r)^{p-1}\sum_{\alpha_{1},\cdots,\alpha_{2r}\geq 0;\;\sum_{i=1}^{2r}\alpha_{i}=p}{p\choose\alpha_{1},\alpha_{2},\cdots,\alpha_{2r}}\prod_{i=1}^{2r}\left(\lambda_{i}^{2}v_{i}^{2}u_{i}^{2}\right)^{\alpha_{i}}\\=&\;(200r)^{p-1}\left(\sum_{i=1}^{2r}\lambda_{i}^{2}u_{i}^{2}v_{i}^{2}\right)^{p},\qquad(20)\end{aligned}$$

where the inequality on the last line uses Lemma 2. Now we combine (19) and (20) to obtain

$$\begin{aligned}\mathbb{E}\left[(B^{\top}B)^{p}\right]\preceq&\;n^{4p}(200r)^{p-1}\,Q\,\mathbb{E}\left[\left(\sum_{i=1}^{2r}\lambda_{i}^{2}u_{i}^{2}v_{i}^{2}\right)^{p}\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\qquad(21)\\ \preceq&\;n^{4p}(200r)^{2p-2}\,Q\,\mathbb{E}\left[\left(\sum_{i=1}^{2r}\lambda_{i}^{2p}u_{i}^{2p}v_{i}^{2p}\right)\mathbf{v}\mathbf{v}^{\top}\right]Q^{\top}\\ \preceq&\;\frac{(2p)!}{2}\max_{i}\lambda_{i}^{2p}(Cnr)^{2p-1}I_{n}=\frac{(2p)!}{2}\|G\|^{2p}(Cnr)^{2p-1}I_{n},\qquad(22)\end{aligned}$$

where the inequality on the last line uses Proposition 2. Similarly, we have

$$\mathbb{E}\left[(BB^{\top})^{p}\right]\preceq\frac{(2p)!}{2}\|G\|^{2p}(200nr)^{2p-1}I_{n}.$$

Therefore, we have obtained a bound on the even moments of $A$:

$$\mathbb{E}\left[A^{2p}\right]=\begin{bmatrix}\mathbb{E}\left[\left(BB^{\top}\right)^{p}\right]&0_{n\times n}\\ 0_{n\times n}&\mathbb{E}\left[\left(B^{\top}B\right)^{p}\right]\end{bmatrix}\preceq\frac{(2p)!}{2}\|G\|^{2p}(200nr)^{2p-1}I_{2n},$$

for $p=2,3,4,\cdots$, and thus a bound on the centralized even moments of $A$:

$$\mathbb{E}\left[\left(A-\mathbb{E}A\right)^{2p}\right]\preceq\frac{(2p)!}{2}\|G\|^{2p}(400nr)^{2p-1}I_{2n},\quad p=2,3,4,\cdots$$

Next we upper bound the odd moments of $A$. Since

$$\mathbb{E}\left[A^{2p+1}\right]=\begin{bmatrix}0_{n\times n}&\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]\\ \mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]&0_{n\times n}\end{bmatrix},$$

it suffices to study $\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]$ and $\mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]$. Since

$$\left(BB^{\top}\right)^{p}B=n^{4p+2}\left(\sum_{i=1}^{2r}\lambda_{i}v_{i}u_{i}\right)^{2p}Q\mathbf{v}\mathbf{v}^{\top}\Lambda\mathbf{u}\mathbf{u}^{\top}Q^{\top},$$

using the arguments leading to (22), we have

$$\begin{aligned}\mathbb{E}\left[\left(BB^{\top}\right)^{p}B\right]\preceq&\;\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{n},\\ \mathbb{E}\left[\left(B^{\top}B\right)^{p}B^{\top}\right]\preceq&\;\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{n}.\qquad(23)\end{aligned}$$

Since $\begin{bmatrix}0_{n\times n}&I_{n}\\ I_{n}&0_{n\times n}\end{bmatrix}\preceq 2I_{2n}$, the two inequalities in (23) imply

$$\mathbb{E}\left[A^{2p+1}\right]\preceq\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{2n},$$

and thus

$$\mathbb{E}\left[\left(A-\mathbb{E}A\right)^{2p+1}\right]\preceq\frac{(2p+1)!}{2}(Cnr)^{2p}\|G\|^{2p+1}I_{2n}.$$

Now we have established moment bounds for $A$, and thus also for $\mathcal{P}_{T}^{\perp}\mathcal{P}G$. From here we apply Theorem 3 to conclude the proof. ∎

The next lemma will essentially establish (A2). This argument relies on the existence of a dual certificate (Candès and Tao, 2010; Gross, 2011; Candes and Recht, 2012).

Lemma 6.

Pick $\delta>0$. Define

$$\mathcal{E}_{2}:=\left\{\exists\;Y\in\mathrm{range}(\mathcal{S}):\|\mathcal{P}_{T}Y-\mathrm{sign}(H)\|_{2}\leq\frac{1}{n^{21}}\quad\text{and}\quad\|\mathcal{P}_{T}^{\perp}Y\|\leq\frac{1}{2}\right\}.$$

Let $L=12\log_{2}n$ and let $m\geq c\cdot nr^{2}\log\left(\frac{L}{\delta}\right)$ for some constant $c$. If $M=mL\geq c\cdot nr^{2}\log n\log\left(\frac{\log n}{\delta}\right)$ for some constant $c$, then $\mathbb{P}\left(\mathcal{E}_{2}\right)\geq 1-\delta$.

Proof.

Following (Gross, 2011), we define random projectors $\widetilde{\mathcal{S}}_{l}$ ($1\leq l\leq L$), such that

$$\widetilde{\mathcal{S}}_{l}:=\frac{1}{m}\sum_{j=1}^{m}\mathcal{P}_{m(l-1)+j}.$$

Then define

$$X_{0}=\mathrm{sign}(H),\quad Y_{i}=\sum_{j=1}^{i}\widetilde{\mathcal{S}}_{j}\mathcal{P}_{T}X_{j-1},\quad X_{i}=\mathrm{sign}(H)-\mathcal{P}_{T}Y_{i},\quad\forall i\geq 1.$$

From the above definition, we have

$$X_{i}=(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{i}\mathcal{P}_{T})(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{i-1}\mathcal{P}_{T})\cdots(\mathcal{P}_{T}-\mathcal{P}_{T}\widetilde{\mathcal{S}}_{1}\mathcal{P}_{T})X_{0},\quad\forall i\geq 1.$$

Now we apply Lemma 3 to $\widetilde{\mathcal{S}}_{1},\widetilde{\mathcal{S}}_{2},\cdots,\widetilde{\mathcal{S}}_{L}$, and get, when the event $\mathcal{E}_{1}$ is true for all $\widetilde{\mathcal{S}}_{i}$, $i=1,2,\cdots,L$,

$$\|X_{i}\|_{2}\leq\frac{1}{4}\|X_{i-1}\|_{2}\leq\cdots\leq\frac{\sqrt{r}}{4^{i}},\quad\forall i=1,2,\cdots,L.\tag{24}$$

Note that with probability exceeding $1-\frac{\delta}{2}$, $\mathcal{E}_{1}$ is true for all $\widetilde{\mathcal{S}}_{i}$, $i=1,2,\cdots,L$. Since the $\widetilde{\mathcal{S}}_{i}$ are mutually independent, $\widetilde{\mathcal{S}}_{i+1}$ is independent of $X_{i}$ for each $i\in\{0,1,\cdots,L-1\}$. In view of this, we can apply Lemma 5 to $\mathcal{P}_{T}^{\perp}Y_{L}$ followed by a union bound, and get, with probability exceeding $1-\frac{\delta}{2}$,

$$\|\mathcal{P}_{T}^{\perp}Y_{L}\|\leq\sum_{i=1}^{L}\frac{1}{4\sqrt{r}}\|X_{i-1}\|_{2}\leq\frac{1}{4}\sum_{i=1}^{L}\frac{1}{4^{i-1}}\leq\frac{1}{2}.\tag{25}$$

Now combining (24) and (25) finishes the proof. ∎

Now we are ready to prove Theorem 1.

Proof of Theorem 1.

Suppose \mathcal{E}_{2} holds, and let Y be the matrix it provides. Since \mathcal{S}\Delta=0 and Y lies in the range of \mathcal{S}, we have \left<Y,\Delta\right>=0. Thus we have

sign(H),PUΔPU=sign(H),Δ=sign(H)Y,Δ\displaystyle\;\left<\mathrm{sign}(H),P_{U}\Delta P_{U}\right>=\left<\mathrm{sign}(H),\Delta\right>=\left<\mathrm{sign}(H)-Y,\Delta\right>
=\displaystyle= 𝒫T(sign(H)Y),ΔT+𝒫T(sign(H)Y),ΔT\displaystyle\;\left<\mathcal{P}_{T}\left(\mathrm{sign}(H)-Y\right),\Delta_{T}\right>+\left<\mathcal{P}_{T}^{\perp}\left(\mathrm{sign}(H)-Y\right),\Delta_{T}^{\perp}\right>
=\displaystyle= sign(H)𝒫TY,ΔT𝒫TY,ΔT\displaystyle\;\left<\mathrm{sign}(H)-\mathcal{P}_{T}Y,\Delta_{T}\right>-\left<\mathcal{P}_{T}^{\perp}Y,\Delta_{T}^{\perp}\right>
\displaystyle\geq 1n21ΔT212ΔT1,\displaystyle\;-\frac{1}{n^{21}}\|\Delta_{T}\|_{2}-\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1},

where the last inequality uses the two properties of Y guaranteed by Lemma 6, together with the Cauchy-Schwarz inequality and the duality between the spectral norm and the nuclear norm.

Now, by Lemma 1 and Lemma 4, we have

012ΔT11n21ΔT212ΔT11n20ΔT112ΔT12n18ΔT1,\displaystyle 0\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{21}}\|\Delta_{T}\|_{2}\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{1}{n^{20}}\|\Delta_{T}\|_{1}\geq\frac{1}{2}\|\Delta_{T}^{\perp}\|_{1}-\frac{2}{n^{18}}\|\Delta_{T}^{\perp}\|_{1},

which forces \|\Delta_{T}^{\perp}\|_{1}=0, since \frac{1}{2}-\frac{2}{n^{18}}>0. Finally, another application of Lemma 4 gives \|\Delta\|_{1}=0, i.e. \Delta=0, which concludes the proof.

Theorem 1, together with Proposition 1, establishes Corollary 1.

4 Conclusion

In this paper, we consider the Hessian estimation problem via matrix recovery techniques. In particular, we show that the finite-difference method studied in (Feng and Wang, 2023; Wang, 2023), together with a convex program, guarantees exact recovery, with high probability, of a rank-r Hessian using nr^{2} (up to logarithmic and constant factors) finite-difference operations. Unlike matrix completion methods, we do not assume any incoherence between the coordinate system and the hidden singular space of the Hessian matrix. In a follow-up work, we apply this Hessian estimation mechanism to the cubic-regularized Newton method (Nesterov and Polyak, 2006; Nesterov, 2008), and design sample-efficient optimization algorithms for functions with (approximately) low-rank Hessians.
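For readers who wish to experiment, the recovery step can be prototyped as nuclear-norm minimization subject to the measurement constraints. The sketch below uses cvxpy, with generic symmetric Gaussian measurement matrices standing in for the zeroth-order finite-difference measurements; the sizes n, r, M and all variable names are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, r, M = 20, 2, 180   # dimension, rank, number of (illustrative) measurements

# Ground-truth low-rank symmetric matrix playing the role of the Hessian.
U = rng.standard_normal((n, r))
H = U @ U.T

# Hypothetical symmetric measurement matrices and noiseless observations <G_i, H>.
G = [rng.standard_normal((n, n)) for _ in range(M)]
G = [(Gi + Gi.T) / 2 for Gi in G]
b = [float(np.sum(Gi * H)) for Gi in G]

# Nuclear-norm minimization subject to the measurement constraints.
X = cp.Variable((n, n), symmetric=True)
constraints = [cp.sum(cp.multiply(Gi, X)) == bi for Gi, bi in zip(G, b)]
problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
problem.solve()

print("relative error:", np.linalg.norm(X.value - H) / np.linalg.norm(H))
```

With enough generic measurements, the relative error should be at the level of the solver tolerance.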

Acknowledgement

The authors thank Dr. Hehui Wu for insightful discussions and his contributions to Lemma 2, and Dr. Abiy Tasissa for helpful discussions.

References

  • Ahn et al., (2023) Ahn, J., Elmahdy, A., Mohajer, S., and Suh, C. (2023). On the fundamental limits of matrix completion: Leveraging hierarchical similarity graphs. IEEE Transactions on Information Theory, pages 1–1.
  • Balasubramanian and Ghadimi, (2021) Balasubramanian, K. and Ghadimi, S. (2021). Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42.
  • Bhatia, (1997) Bhatia, R. (1997). Matrix analysis. Graduate Texts in Mathematics.
  • Broyden et al., (1973) Broyden, C. G., Dennis Jr, J. E., and Moré, J. J. (1973). On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223–245.
  • Cai et al., (2010) Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982.
  • Candes and Recht, (2012) Candes, E. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.
  • Candes and Plan, (2010) Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.
  • Candès and Tao, (2010) Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
  • Chandrasekaran et al., (2012) Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849.
  • Chen, (2015) Chen, Y. (2015). Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909–2923.
  • Chen et al., (2020) Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098–3121.
  • Davidon, (1991) Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on optimization, 1(1):1–17.
  • Eldar et al., (2012) Eldar, Y., Needell, D., and Plan, Y. (2012). Uniqueness conditions for low-rank matrix recovery. Applied and Computational Harmonic Analysis, 33(2):309–314.
  • Fan et al., (2021) Fan, J., Wang, W., and Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. The Annals of Statistics, 49(3):1239 – 1266.
  • Fazel, (2002) Fazel, M. (2002). Matrix rank minimization with applications. PhD thesis, PhD thesis, Stanford University.
  • Fefferman et al., (2016) Fefferman, C., Mitter, S., and Narayanan, H. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049.
  • Feng and Wang, (2023) Feng, Y. and Wang, T. (2023). Stochastic zeroth-order gradient and Hessian estimators: variance reduction and refined bias bounds. Information and Inference: A Journal of the IMA, 12(3):1514–1545.
  • Fletcher, (2000) Fletcher, R. (2000). Practical methods of optimization. John Wiley & Sons.
  • Fornasier et al., (2011) Fornasier, M., Rauhut, H., and Ward, R. (2011). Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640.
  • Ghojogh et al., (2023) Ghojogh, B., Crowley, M., Karray, F., and Ghodsi, A. (2023). Elements of dimensionality reduction and manifold learning. Springer Nature.
  • Goldfarb, (1970) Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23–26.
  • Gotoh et al., (2018) Gotoh, J.-y., Takeda, A., and Tono, K. (2018). DC formulations and algorithms for sparse optimization problems. Mathematical Programming, 169(1):141–176.
  • Gross, (2011) Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566.
  • Hu et al., (2012) Hu, Y., Zhang, D., Ye, J., Li, X., and He, X. (2012). Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE transactions on pattern analysis and machine intelligence, 35(9):2117–2130.
  • Keshavan et al., (2010) Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998.
  • Lee and Bresler, (2010) Lee, K. and Bresler, Y. (2010). ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory, 56(9):4402–4416.
  • Li et al., (2023) Li, J., Balasubramanian, K., and Ma, S. (2023). Stochastic zeroth-order Riemannian derivative estimation and optimization. Mathematics of Operations Research, 48(2):1183–1211.
  • Lieb, (1973) Lieb, E. H. (1973). Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11(3):267–288.
  • Mohan and Fazel, (2012) Mohan, K. and Fazel, M. (2012). Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441–3473.
  • Negahban and Wainwright, (2012) Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1):1665–1697.
  • Nesterov, (2008) Nesterov, Y. (2008). Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1):159–181.
  • Nesterov and Polyak, (2006) Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.
  • Recht, (2011) Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).
  • Recht et al., (2010) Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.
  • Ren-Pu and Powell, (1983) Ren-Pu, G. and Powell, M. J. (1983). The convergence of variable metric matrices in unconstrained optimization. Mathematical programming, 27:123–143.
  • Resnick and Varian, (1997) Resnick, P. and Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3):56–58.
  • Rodomanov and Nesterov, (2022) Rodomanov, A. and Nesterov, Y. (2022). Rates of superlinear convergence for classical quasi-newton methods. Mathematical Programming, 194(1):159–190.
  • Rohde and Tsybakov, (2011) Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887 – 930.
  • Rong et al., (2021) Rong, Y., Wang, Y., and Xu, Z. (2021). Almost everywhere injectivity conditions for the matrix recovery problem. Applied and Computational Harmonic Analysis, 50:386–400.
  • Shanno, (1970) Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of computation, 24(111):647–656.
  • Spall, (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
  • Stein, (1981) Stein, C. M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151.
  • Tan et al., (2011) Tan, V. Y., Balzano, L., and Draper, S. C. (2011). Rank minimization over finite fields: Fundamental limits and coding-theoretic interpretations. IEEE transactions on information theory, 58(4):2018–2039.
  • Tanner and Wei, (2016) Tanner, J. and Wei, K. (2016). Low rank matrix completion by alternating steepest descent methods. Applied and Computational Harmonic Analysis, 40(2):417–429.
  • Tropp, (2012) Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12:389–434.
  • Tropp et al., (2015) Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230.
  • Udell and Townsend, (2019) Udell, M. and Townsend, A. (2019). Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160.
  • Vandereycken, (2013) Vandereycken, B. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236.
  • Wang, (2023) Wang, T. (2023). On sharp stochastic zeroth-order Hessian estimators over Riemannian manifolds. Information and Inference: A Journal of the IMA, 12(2):787–813.
  • Wang et al., (2014) Wang, Z., Lai, M.-J., Lu, Z., Fan, W., Davulcu, H., and Ye, J. (2014). Rank-one matrix pursuit for matrix completion. In Xing, E. P. and Jebara, T., editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 91–99, Beijing, China. PMLR.
  • Wen et al., (2012) Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.
  • Mao et al., (2019) Mao, X., Chen, S. X., and Wong, R. K. W. (2019). Matrix completion with covariate information. Journal of the American Statistical Association, 114(525):198–210.
  • Xu and Zhang, (2001) Xu, C. and Zhang, J. (2001). A survey of quasi-Newton equations and quasi-Newton methods for optimization. Annals of Operations research, 103:213–234.
  • Zhang et al., (2014) Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2014). Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316.
  • Zhu, (2012) Zhu, S. (2012). A short note on the tail bound of Wishart distribution. arXiv preprint arXiv:1212.5860.

Appendix A Auxiliary Propositions and Lemmas

Proof of Proposition 2.

Let (r,\varphi_{1},\varphi_{2},\cdots,\varphi_{n-1}) be the spherical coordinate system on \mathbb{R}^{n}. For an even integer p, we have

𝔼[v1p]=\displaystyle\mathbb{E}\left[v_{1}^{p}\right]= 1An02π0π0πcosp(φ1)sinn2(φ1)sinn3(φ2)sin(φn2)𝑑φ1𝑑φ2𝑑φn1,\displaystyle\;\frac{1}{A_{n}}\int_{0}^{2\pi}\int_{0}^{\pi}\cdots\int_{0}^{\pi}\cos^{p}(\varphi_{1})\sin^{n-2}(\varphi_{1})\sin^{n-3}(\varphi_{2})\cdots\sin(\varphi_{n-2})\,d\varphi_{1}\,d\varphi_{2}\cdots d\varphi_{n-1},

where AnA_{n} is the surface area of 𝕊n1\mathbb{S}^{n-1}. Let

I(n,p):=0πsinn(x)cosp(x)𝑑x.\displaystyle I(n,p):=\int_{0}^{\pi}\sin^{n}(x)\cos^{p}(x)\,dx.

Since \cos^{2}(x)=1-\sin^{2}(x), we have I(n,p)=I(n,p-2)-I(n+2,p-2). Integrating the identity \frac{d}{dx}\left[\sin^{n+1}(x)\cos^{p-1}(x)\right]=(n+1)\sin^{n}(x)\cos^{p}(x)-(p-1)\sin^{n+2}(x)\cos^{p-2}(x) over [0,\pi] gives I(n+2,p-2)=\frac{n+1}{p-1}I(n,p). Combining the two identities yields I(n,p)=\frac{p-1}{n+p}I(n,p-2).
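As a quick numerical check of this reduction formula (a Python sketch using scipy; the particular values n=6 and p=4 are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def I(n, p):
    # I(n, p) = \int_0^pi sin^n(x) cos^p(x) dx, computed numerically.
    return quad(lambda x: np.sin(x) ** n * np.cos(x) ** p, 0, np.pi)[0]

n, p = 6, 4
print(I(n, p), (p - 1) / (n + p) * I(n, p - 2))   # the two values should agree
```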

Thus we have 𝔼[v1p]=I(n2,p)I(n2,0)=I(n2,p)I(n2,p2)I(n2,p2)I(n2,p4)I(n2,2)I(n2,0)=(p1)(p3)1n(n+2)(n+p2)\mathbb{E}\left[v_{1}^{p}\right]=\frac{I(n-2,p)}{I(n-2,0)}=\frac{I(n-2,p)}{I(n-2,p-2)}\frac{I(n-2,p-2)}{I(n-2,p-4)}\cdots\frac{I(n-2,2)}{I(n-2,0)}=\frac{(p-1)(p-3)\cdots 1}{n(n+2)\cdots(n+p-2)}. We conclude the proof by symmetry.
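The resulting moment formula can likewise be checked by Monte Carlo, sampling uniformly from the unit sphere by normalizing Gaussian vectors (a sketch; the values of n, p, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, N = 8, 4, 200_000   # dimension, even moment order, number of samples

# Uniform samples on the unit sphere S^{n-1}.
v = rng.standard_normal((N, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)

empirical = np.mean(v[:, 0] ** p)

# (p-1)(p-3)...1 / (n (n+2) ... (n+p-2)) for even p.
predicted = np.prod(np.arange(p - 1, 0, -2)) / np.prod(np.arange(n, n + p - 1, 2))
print(empirical, predicted)   # should agree up to Monte Carlo error
```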