
Testing Positive Semidefiniteness Using Linear Measurements

(Authors DN and WS were partially supported by NSF DMS #2011140 and NSF DMS #2108479. DW was supported by NSF CCF #1815840 and Office of Naval Research Grant N00014-18-1-2562.)

Deanna Needell
UCLA
   William Swartworth
UCLA
   David P. Woodruff
CMU
Abstract

We study the problem of testing whether a symmetric $d\times d$ input matrix $A$ is symmetric positive semidefinite (PSD), or is $\epsilon$-far from the PSD cone, meaning that $\lambda_{\min}(A)\leq-\epsilon\|A\|_{p}$, where $\|A\|_{p}$ is the Schatten-$p$ norm of $A$. In applications one often needs to quickly tell if an input matrix is PSD, and a small distance from the PSD cone may be tolerable. We consider two well-studied query models for measuring efficiency, namely, the matrix-vector and vector-matrix-vector query models. We first consider one-sided testers, which are testers that correctly classify any PSD input, but may fail on a non-PSD input with a tiny failure probability. Up to logarithmic factors, in the matrix-vector query model we show a tight $\widetilde{\Theta}(1/\epsilon^{p/(2p+1)})$ bound, while in the vector-matrix-vector query model we show a tight $\widetilde{\Theta}(d^{1-1/p}/\epsilon)$ bound, for every $p\geq 1$. We also show a strong separation between one-sided and two-sided testers in the vector-matrix-vector model, where a two-sided tester may fail on both PSD and non-PSD inputs with a tiny failure probability. In particular, for the important case of the Frobenius norm, we show that any one-sided tester requires $\widetilde{\Omega}(\sqrt{d}/\epsilon)$ queries. However, we introduce a bilinear sketch for two-sided testing, from which we construct a Frobenius-norm tester achieving the optimal $\widetilde{O}(1/\epsilon^{2})$ queries. We also give a number of additional separations between adaptive and non-adaptive testers. Our techniques have implications beyond testing, providing new methods to approximate the spectrum of a matrix with Frobenius-norm error using dimensionality reduction in a way that preserves the signs of eigenvalues.

1 Introduction

A real-valued matrix $A\in\mathbb{R}^{n\times n}$ is said to be positive semidefinite (PSD) if it defines a non-negative quadratic form, namely, if $x^{T}Ax\geq 0$ for all $x$. If $A$ is symmetric, the setting on which we focus, this is equivalent to the eigenvalues of $A$ being non-negative. Multiple works [KS03, Han+17, BCJ20] have studied the problem of testing whether a real matrix is PSD, or is far from being PSD, and this testing problem has numerous applications, including to faster algorithms for linear systems and linear algebra problems, detecting the existence of community structure, ascertaining local convexity, and differential equations; we refer the reader to [BCJ20] and the references therein.

We study two fundamental query models. In the matrix-vector model, one is given implicit access to a matrix $A$ and may query $A$ by choosing a vector $v$ and receiving the vector $Av$. In the vector-matrix-vector model, one chooses a pair of vectors $(v,w)$ and queries the bilinear form associated with $A$; in other words, the value of the query is $v^{T}Aw$. In both models, multiple, adaptively-chosen queries can be made, and the goal is to minimize the number of queries needed to solve a given task. These models are standard computational models in the numerical linear algebra community, see, e.g., [Han+17], where PSD testing was studied in the matrix-vector query model. These models were recently formalized in the theoretical computer science community in [Sun+19, RWZ20], though similar models have been studied in numerous fields, for instance as the number of measurements in compressed sensing, or the sketching dimension of a streaming algorithm. The matrix-vector and vector-matrix-vector query models are particularly relevant when the input matrix $A$ is not given explicitly.

A natural situation occurs when $A$ is presented implicitly as the Hessian of a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ at a point $x_{0}$, where $f$ could be the loss function of a neural network, for example. One might want to quickly distinguish between a proposed optimum of $f$ truly being a minimum, or being a saddle point with a direction of steep downward curvature. Our query model is quite natural in this context. A Hessian-vector product is efficient to compute using automatic differentiation techniques. A vector-matrix-vector product corresponds to a single second derivative computation, $D^{2}f(v,w)$. This can be approximated using $4$ function queries by the finite difference approximation
$$D^{2}f(v,w)\approx\frac{f(x_{0}+hv+hw)-f(x_{0}+hv)-f(x_{0}+hw)+f(x_{0})}{h^{2}},$$
where $h$ is small.
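For instance, a vector-matrix-vector query oracle for a Hessian can be built from function evaluations alone. The following is a minimal NumPy sketch of this finite-difference query; the quadratic test function and the step size $h$ are illustrative choices, not taken from the paper:

```python
import numpy as np

def hessian_query(f, x0, v, w, h=1e-4):
    """Approximate v^T (Hessian of f at x0) w with the four-point
    finite difference above; one vector-matrix-vector query costs
    four evaluations of f."""
    return (f(x0 + h*(v + w)) - f(x0 + h*v) - f(x0 + h*w) + f(x0)) / h**2

# Sanity check on a quadratic f(x) = x^T A x / 2, whose Hessian is A.
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
f = lambda x: 0.5 * x @ A @ x
v, w = rng.standard_normal(d), rng.standard_normal(d)
print(hessian_query(f, np.zeros(d), v, w), v @ A @ w)  # nearly equal
```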

While there are numerically stable methods for computing the spectrum of a symmetric matrix, and thus determining if it is PSD, these methods can be prohibitively slow for very large matrices, and require a large number of matrix-vector or vector-matrix-vector products. Our goal is to obtain significantly more efficient algorithms in these models, and we approach this problem from a property testing perspective. In particular, we focus on the following version of the PSD-testing problem. In what follows, $\|A\|_{p}=\left(\sum_{i=1}^{n}\sigma_{i}^{p}\right)^{1/p}$ is the Schatten-$p$ norm of $A$, where the $\sigma_{i}$ are the singular values of $A$.
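For concreteness, the Schatten norms used throughout can be computed directly from the singular values (a small NumPy helper, included only for illustration):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten-p norm of A: the l_p norm of the vector of singular
    values; p = np.inf recovers the operator norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.linalg.norm(s, ord=p)
```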

Definition 1.

For $p\in[1,\infty]$, an $(\epsilon,\ell_{p})$-tester is an algorithm that makes either matrix-vector or vector-matrix-vector queries to a real symmetric matrix $A$, and outputs True with probability at least $2/3$ if $A$ is PSD, and outputs False with probability at least $2/3$ if $A$ is $\epsilon\left\|A\right\|_{p}$-far in spectral distance from the PSD cone, or equivalently, if the minimum eigenvalue satisfies $\lambda_{\min}(A)\leq-\epsilon\left\|A\right\|_{p}$. If the tester is guaranteed to output True on all PSD inputs (even if the input is generated by an adversary with access to the random coins of the tester), then the tester has one-sided error. Otherwise it has two-sided error. When $\epsilon$ is clear from the context we will often drop the $\epsilon$ and simply refer to an $\ell_{p}$-tester.

Our work fits more broadly into the growing body of work on property testing for linear algebra problems, see, for example, [Bal+19, BCJ20, BMR21]. However, a key difference is that we focus on the matrix-vector and vector-matrix-vector query models, which might be more appropriate than the model in the above works, which charges a cost of $1$ for reading a single entry. Indeed, such models must assume that the entries of the input are bounded by a constant or a slow-growing function of $n$, as otherwise strong impossibility results hold. This can severely limit the applicability of such algorithms to real-life matrices that do not have bounded entries; indeed, even a graph Laplacian with a single large-degree vertex would not fit into the above models. In contrast, we use the matrix-vector and vector-matrix-vector models, which are ideally suited for modern machines such as graphics processing units and for settings where the input matrix cannot fit into RAM, and are standard models in scientific computing, see, e.g., [BFG96].

While we focus on vector-matrix-vector queries, our results shed light on several other natural settings. Many of our results are in fact tight for general linear measurements, which vectorize the input matrix and apply adaptively chosen linear forms to it. For long enough streams, the best known single- or multi-pass algorithms for any problem in the turnstile streaming model form a sketch using general linear measurements, and with some additional restrictions, it can be shown that the optimal multi-pass streaming algorithm simply adaptively chooses general linear measurements [Ai+16]. Therefore, it is quite plausible that many of our vector-matrix-vector algorithms give tight single-pass streaming bounds, given that vector-matrix-vector queries are a special case of general linear measurements, and that many of our lower bounds are tight even for general linear measurements.

Moreover, our vector-matrix-vector algorithms lead to efficient communication protocols for deciding whether a distributed sum of matrices is PSD, provided that exact vector-matrix-vector products may be communicated. While we expect our methods to be stable under small perturbations (i.e., when the vector-matrix-vector products are slightly inexact), we leave the full communication complexity analysis to future work.

We note that our PSD-testing problem is closely related to that of approximating the largest eigenvalue of a PSD matrix. Indeed, by appropriately negating and shifting the input matrix, it is essentially equivalent to estimating the largest eigenvalue of a PSD matrix $A$ up to additive error $\epsilon\left(\sum_{i}\left|\lambda_{\max}(A)-\lambda_{i}(A)\right|^{p}\right)^{1/p}$. However, this problem is much less natural, as real-world matrices often have many small eigenvalues but only a few large eigenvalues.

1.1 Our Contributions

Vector-matrix-vector queries:
- Adaptive, one-sided, $\ell_{p}$: $\widetilde{\Theta}(\frac{1}{\epsilon}d^{1-1/p})$ (Corollary 3, Theorem 15)
- Non-adaptive, one-sided, $\ell_{p}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}}d^{2-2/p})$ (Corollary 42, Theorem 13)
- Adaptive, two-sided, $\ell_{2}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}})^{*}$ (Proposition 29, Corollary 33)
- Non-adaptive, two-sided, $\ell_{2}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{4}})^{*}$ (Theorem 28, Theorem 31)
- Adaptive, two-sided, $\ell_{p}$, $2\leq p<\infty$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}}d^{1-2/p})^{*}$ (Corollary 30, Corollary 33)

Matrix-vector queries:
- Adaptive, one-sided, $\ell_{p}$: $\widetilde{O}((1/\epsilon)^{p/(2p+1)}\log d)$, $\Omega((1/\epsilon)^{p/(2p+1)})$ (Theorem 17, Theorem 20)
- Adaptive, one-sided, $\ell_{1}$: $\widetilde{\Theta}((1/\epsilon)^{1/3})$ (Theorem 17, Theorem 20)
- Non-adaptive, one-sided, $\ell_{p}$: $\Theta(\frac{1}{\epsilon}d^{1-1/p})$ (Proposition 43, Corollary 46)

Table 1: Our upper and lower bounds for the matrix-vector and vector-matrix-vector query models. $^{*}$ indicates that the lower bound holds for general linear measurements.

We study PSD-testing in the matrix-vector and vector-matrix-vector models. In particular, given a real symmetric matrix $A$ and $p\in[1,\infty]$, we are interested in deciding between (i) $A$ is PSD, and (ii) $A$ has an eigenvalue less than $-\epsilon\left\|A\right\|_{p}$, where $\left\|A\right\|_{p}$ is the Schatten $p$-norm of $A$.

Tight Bounds for One-sided Testers.

We make particular note of the distinction between one-sided and two-sided testers. In some settings one is interested in a tester that produces one-sided error. When such a tester outputs False, it must be able to produce a proof that $A$ is not PSD. The simplest such proof is a witness vector $v$ such that $v^{T}Av<0$, and indeed we observe that in the matrix-vector model, any one-sided tester can produce such a $v$ when it outputs False. This may be a desirable feature if one wishes to apply these techniques to saddle point detection, for example: given a point that is not a local minimum, it would be useful to produce a descent direction so that optimization may continue. In the vector-matrix-vector model the situation is somewhat more complicated in general, but all of our one-sided testers produce a witness vector whenever they output False.

We provide optimal bounds for one-sided testers in both the matrix-vector and vector-matrix-vector models. The bounds below are stated for constant probability algorithms. Here $\widetilde{O}(f)=f\cdot\text{poly}(\log f)$.

1. In the matrix-vector query model, we show that up to a factor of $\log d$, $\widetilde{\Theta}(1/\epsilon^{p/(2p+1)})$ queries are necessary and sufficient for an $\ell_{p}$-tester for any $p\geq 1$. In the $p=1$ case, we note that the $\log d$ factor may be removed.

2. In the vector-matrix-vector query model, we show that $\widetilde{\Theta}(d^{1-1/p}/\epsilon)$ queries are necessary and sufficient for an $\ell_{p}$-tester for any $p\geq 1$. Note that when $p=1$ we obtain a very efficient $\widetilde{O}(1/\epsilon)$-query algorithm. In particular, our tester for $p=1$ has query complexity independent of the matrix dimensions, and we show a sharp phase transition for $p>1$, showing in some sense that $p=1$ is the largest value of $p$ possible for one-sided queries.

The matrix-vector query complexity is very different from the vector-matrix-vector query complexity, as the query complexity is $\text{poly}(1/\epsilon)$ for any $p\geq 1$. This captures the fact that each matrix-vector query response reveals more information than that of a vector-matrix-vector query, though a priori it was not clear that such responses in the matrix-vector model could not be compressed using vector-matrix-vector queries.

An Optimal Bilinear Sketch for Two-Sided Testing.

Our main technical contribution for two-sided testers is a bilinear sketch for PSD-testing with respect to the Frobenius norm, i.e., $p=2$. We consider a Gaussian sketch $G^{T}AG$, where $G$ has small dimension $\widetilde{O}(\frac{1}{\epsilon^{2}})$. By looking at the smallest eigenvalue of the sketch, we are able to distinguish between $A$ being PSD and being $\epsilon$-far from PSD. Notably, this tester may reject even when $\lambda_{\min}(G^{T}AG)>0$, which results in a two-sided error guarantee. This sketch allows us to obtain tight two-sided bounds in the vector-matrix-vector model for $p\geq 2$, both for adaptive and non-adaptive queries.

Separation Between One-Sided and Two-Sided Testers.

Surprisingly, we show a separation between one-sided and two-sided testers in the vector-matrix-vector model. For the important case of the Frobenius norm, i.e., $p=2$, we utilize our bilinear sketch to construct an $\widetilde{O}(1/\epsilon^{2})$-query two-sided tester, whereas by our results above, any adaptive one-sided tester requires at least $\Omega(\sqrt{d}/\epsilon)$ queries.

We also show that for any $p>2$, any (possibly adaptive) two-sided tester requires $d^{\Omega(1)}$ queries for constant $\epsilon$; thus, in some sense, $p=2$ is the largest value of $p$ possible for two-sided queries.

On the Importance of Adaptivity.

We also study the role of adaptivity in both matrix-vector and vector-matrix-vector models. In both the one-sided and two-sided vector-matrix-vector models we show a quadratic separation between adaptive and non-adaptive testers, which is the largest gap possible for any vector-matrix-vector problem [Sun+19].

In the matrix-vector model, each query reveals more information about AA than in the vector-matrix-vector model, allowing for even better choices for future queries. Thus we have an even larger gap between adaptive and non-adaptive testers in this setting.

Spectrum Estimation.

While the two-sided tester discussed above yields optimal bounds for PSD testing, it does not immediately give a way to estimate the negative eigenvalue when it exists. Via a different approach, we show how to give such an approximation with $\epsilon\left\|A\right\|_{F}$ additive error. In fact, we show how to approximate all of the top $k$ eigenvalues of $A$ using $O(k^{2}\,\text{poly}\frac{1}{\epsilon})$ non-adaptive vector-matrix-vector queries, which may be of independent interest.

We note that this gives an $O(k^{2}\,\text{poly}\frac{1}{\epsilon})$-space streaming algorithm for estimating the top $k$ eigenvalues of $A$ to within additive Frobenius error. Prior work yields a similar guarantee for the singular values [AN13], but cannot recover the signs of eigenvalues.

1.2 Our Techniques

Matrix-Vector Queries.

For the case of adaptive matrix-vector queries, we show that Krylov iteration starting with a single random vector yields an optimal $\ell_{p}$-tester for all $p$. Interestingly, our analysis is able to beat the usual Krylov matrix-vector query bound for approximating the top eigenvalue, as we modify the usual polynomial analyzed for eigenvalue estimation to implicitly implement a deflation step for all eigenvalues above a certain threshold. We do not need to explicitly know the values of the large eigenvalues in order to deflate them; rather, it suffices that there exists a low degree polynomial in the Krylov space that implements this deflation.

We note that this idea of implicitly deflating the top eigenvalues first appeared in [AL86]. More recently this observation was applied in a setting very similar to ours by [SW09], who deflated eigenvalues that are large relative to the trace.

Further, we show that this technique is tight for all $p\geq 1$ by showing that any smaller number of matrix-vector products would violate a recent lower bound of [Bra+20] for approximating the smallest eigenvalue of a Wishart matrix. This lower bound applies even to two-sided testers.

Vector-Matrix-Vector Queries.

We start by describing our result for $p=1$. We give one of the first examples of an algorithm in the vector-matrix-vector query model that leverages adaptivity in an interesting way. Most known algorithms in this model work non-adaptively, either by applying a bilinear sketch to the matrix, or by making many independent queries in the case of Hutchinson's trace estimator [Hut89]. Indeed, the algorithm of [AN13] works by computing $G^{T}AG$ for a Gaussian matrix $G$ with $1/\epsilon$ columns, and arguing that all eigenvalues that are at least $\epsilon\|A\|_{1}$ can be estimated from the sketch. The issue with this approach is that it uses $\Omega(1/\epsilon^{2})$ queries, and this bound is tight for non-adaptive testers! One could improve this by running our earlier matrix-vector algorithm on top of this sketch, without ever explicitly forming the $1/\epsilon\times 1/\epsilon$ matrix $G^{T}AG$; however, this would only give an $O(1/\epsilon^{4/3})$-query algorithm.

To achieve our optimal $\widetilde{O}(1/\epsilon)$ complexity, our algorithm instead performs a novel twist on Oja's algorithm [Oja82], the latter being a stochastic gradient descent (SGD) algorithm applied to optimizing the quadratic form $f(x)=x^{T}Ax$ over the sphere. In typical applications, the randomness of SGD arises via randomly sampling from a set of training data. In our setting, we instead artificially introduce randomness at each step, by computing the projection of the gradient onto a randomly chosen direction. This idea is implemented via the iteration

$$x^{(k+1)}=x^{(k)}-\eta(g^{T}Ax^{(k)})g\quad\text{where}\quad g\sim\mathcal{N}(0,I)\qquad(1)$$

for a well-chosen step size $\eta$. If $f$ ever becomes negative before reaching the maximum number of iterations, then the algorithm outputs False; otherwise it outputs True. For $p=1$, we show that this scheme results in an optimal tester (up to logarithmic factors). Our proof uses a second moment analysis of a random walk that is similar in style to [Jai+16], though our analysis is quite different. Whereas [Jai+16] considers an arbitrary i.i.d. stream of unbiased estimators of $A$ (with bounded variance), our estimators are simply $gg^{T}A$, which do not seem to have been considered before. We leverage this special structure to obtain a better variance bound on the iterates throughout the first $\widetilde{O}(1/\epsilon)$ iterations, where each iteration can be implemented with a single vector-matrix-vector query. Our algorithm and analysis give a new method for the fundamental problem of approximating eigenvalues.

Our result for general $p>1$ follows by relating the Schatten-$p$ norm to the Schatten-$1$ norm and invoking the algorithm above with a different setting of $\epsilon$. We show our method is optimal by proving an $\Omega(d^{2-2/p}/\epsilon^{2})$ lower bound for non-adaptive one-sided testers, and then using a theorem in [RWZ20] which shows that adaptive one-sided testers can give at most a quadratic improvement. We note that one could instead use a recent streaming lower bound of [INW22] to prove this lower bound, though such a lower bound would depend on the bit complexity.

Two-Sided Testers.

The key technical ingredient behind all of our two-sided testers is a bilinear sketch for PSD-testing. Specifically, we show that a sketch of the form $G^{T}AG$ with $G\in\mathbb{R}^{d\times k}$ is sufficient for obtaining a two-sided tester for $p=2$. In contrast to the $p=1$ case, we do not simply output False when $\lambda_{\min}:=\lambda_{\min}(G^{T}AG)<0$, as such an algorithm would automatically be one-sided. Instead we require a criterion to detect when $\lambda_{\min}$ is suspiciously small. For this we require two results.

The first is a concentration inequality for $\lambda_{\min}(G^{T}AG)$ when $A$ is PSD. We show that $\lambda_{\min}\geq\operatorname{Tr}(A)-\widetilde{O}(\sqrt{k})\left\|A\right\|_{F}$ with very good probability. This result is equivalent to bounding the smallest singular value of $A^{1/2}G$, which is a Gaussian matrix whose rows have different variances. Although many similar bounds for constant variances exist in the literature [Lit+05, Ver18], we were not able to find a bound for general covariances. In particular, most existing bounds do not seem to give the concentration around $\operatorname{Tr}(A)$ that we require.

When $A$ has a negative eigenvalue of $-\epsilon$, we show that $\lambda_{\min}\leq\operatorname{Tr}(A)-\epsilon\,O(k)$. By combining these two results, we are able to take $k=\widetilde{O}(1/\epsilon^{2})$, yielding a tight bound for non-adaptive testers in the vector-matrix-vector model. In fact this bound is even tight for general linear sketches, as we show by applying the results in [LW16].
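To make the acceptance rule concrete, here is a minimal NumPy sketch of the resulting tester. It assumes the normalization $\|A\|_{F}=1$, forms the sketch from an explicit matrix (in the query model, each entry of $G^{T}AG$ is a single vector-matrix-vector query), and uses placeholder constants rather than the tuned threshold from the paper's analysis:

```python
import numpy as np

def two_sided_frobenius_test(A, eps, rng=np.random.default_rng()):
    """Illustrative two-sided tester for p = 2, assuming ||A||_F = 1.
    Each entry of the sketch M = G^T A G corresponds to one
    vector-matrix-vector query g_i^T A g_j."""
    d = A.shape[0]
    k = int(np.ceil(np.log(2 / eps)**2 / eps**2))    # k = O~(1/eps^2)
    G = rng.standard_normal((d, k))
    M = G.T @ A @ G                                  # the bilinear sketch
    lam_min = np.linalg.eigvalsh((M + M.T) / 2).min()
    tr_est = np.mean(np.diag(M))                     # Hutchinson estimate of Tr(A)
    # For PSD inputs, lam_min concentrates above Tr(A) - O~(sqrt(k));
    # an eigenvalue of -eps pulls it down by Omega(eps * k), which is
    # larger by a polylog factor.  Accept iff lam_min is not suspiciously small.
    return lam_min >= tr_est - 4 * np.sqrt(k) * np.log(k + 1)
```

The constant $4$ and the $\log$ factor in the threshold are assumptions for illustration; the analysis pins down the precise polylogarithmic factors.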

We also utilize this bilinear sketch to give tight bounds for adaptive vector-matrix-vector queries, and indeed for general linear measurements. By first (implicitly) applying the sketch, and then shifting by an appropriate multiple of the identity, we are able to reduce to the $(\epsilon^{2},\ell_{1})$-testing problem, which as described above may be solved using $\widetilde{O}(1/\epsilon^{2})$ queries.

Spectrum Estimation.

A natural approach for approximating the eigenvalues of an $n\times n$ matrix $A$ is to first compute a sketch $G^{T}AG$ or a sketch $G^{T}AH$ for Gaussian matrices $G$ and $H$ with a small number of columns. Both of these sketches appear in [AN13]. As noted above, $G^{T}AG$ is a useful non-adaptive sketch for spectrum approximation, but the error in approximating each eigenvalue is proportional to the Schatten-$1$ norm of $A$. One could instead try to make the error depend on the Frobenius norm $\|A\|_{2}$ of $A$ by computing $G^{T}AH$ for independent Gaussian matrices $G$ and $H$, but now $G^{T}AH$ is no longer symmetric and it is not clear how to extract the signs of the eigenvalues of $A$ from $G^{T}AH$. Indeed, [AN13] are only able to show that the singular values of $G^{T}AH$ are approximately the same as those of $A$, up to additive $\epsilon\|A\|_{2}$ error. We thus need a new way to preserve the sign information of eigenvalues.

To do this, we show how to use results for providing the best PSD low-rank approximation to an input matrix $A$, where $A$ need not be PSD and need not even be symmetric. In particular, in [CW17a] it was argued that if $G$ is a Gaussian matrix with $O(k/\epsilon)$ columns, and one sets up the optimization problem $\min_{\textrm{rank-}k\text{ PSD }Y}\|AGYG^{T}A^{T}-A\|_{F}^{2}$, then the cost will be at most $(1+\epsilon)\|A_{k,+}-A\|_{F}^{2}$, where $A_{k,+}$ is the best rank-$k$ PSD approximation to $A$. By further sketching on the left and right with so-called affine embeddings $S$ and $T$, which have $\text{poly}(k/\epsilon)$ rows and columns respectively, one can reduce this problem to $\min_{\textrm{rank-}k\text{ PSD }Y}\|SAGYG^{T}A^{T}T-SAT\|_{F}^{2}$; now $SAG$, $G^{T}A^{T}T$ and $SAT$ are all $\text{poly}(k/\epsilon)\times\text{poly}(k/\epsilon)$ matrices, so they can be computed with $\text{poly}(k/\epsilon)$ vector-matrix-vector products. At this point the optimal $Y$ can be found with no additional queries and its cost can be evaluated. By subtracting this cost from $\left\|A\right\|_{F}^{2}$, we approximate $\left\|A_{+,i}\right\|_{F}^{2}$ and $\left\|A_{-,i}\right\|_{F}^{2}$ for all $i\in[k]$, which in turn allows us to produce (signed) estimates for the eigenvalues of $A$.

When $A$ is PSD, we note that Theorem 1.2 in [AN13] is able to reproduce our spectral approximation guarantee using sketching dimension $O(\frac{k^{2}}{\epsilon^{8}})$, compared to our sketch of dimension $O(\frac{k^{2}}{\epsilon^{12}})$. However, as mentioned above, our guarantee is stronger in that it allows for the signs of the eigenvalues to be recovered, i.e., our guarantee holds even when $A$ is not PSD. Additionally, we are able to achieve $O(\frac{k^{2}}{\epsilon^{8}})$ using just a single round of adaptivity.

Lower Bounds for One-sided Testers.

To prove lower bounds for one-sided non-adaptive testers, we first show that a one-sided tester must be able to produce a witness whenever it outputs False. In the matrix-vector model, the witness is a vector $v$ with $v^{T}Av<0$, and in the vector-matrix-vector model, the witness is a PSD matrix $M$ with $\left\langle M,A\right\rangle<0$. In both cases we show that, even for the simplest non-PSD spectrum $(-\lambda,1,\ldots,1)$, it takes many queries to produce a witness when $\lambda$ is small. In the matrix-vector model, our approach is simply to show that the $-\lambda$ eigenvector is typically far from the span of all queried vectors when the number of queries is small. This implies that $A$ is non-negative on the queried subspace, which precludes the tester from producing a witness. In the vector-matrix-vector model our approach is similar; however, now the queries take the form of inner products against rank-one matrices $x_{i}x_{i}^{T}$. We therefore need to work within the space of symmetric matrices, and this requires a more delicate argument.

1.3 Additional Related Work

Numerous other works have considered matrix-vector and vector-matrix-vector queries, see, e.g., [Mey+21, Bra+20, Sun+19, SER18, MM15, WWZ14]. We outline a few core areas here.

Oja’s Algorithm.

Several works have considered Oja's algorithm in the context of streaming PCA [Sha16, Jai+16, AL17]. [Jai+16] gives a tight convergence rate for iteratively approximating the top eigenvector of a PSD matrix, given an eigengap, and [AL17] extends this to a gap-free result for $k$-PCA.

PSD Testing.

As mentioned above, PSD-testing has been investigated in the bounded entry model, where one assumes that the entries of $A$ are bounded by $1$ [BCJ20], and one is allowed to query the entries of $A$. This is a restriction of the vector-matrix-vector model that we consider, in which only coordinate vectors may be queried. However, since we consider a more general query model, we are able to give a better adaptive tester: for us, $\widetilde{O}(1/\epsilon)$ vector-matrix-vector queries suffice, beating the $\Omega(1/\epsilon^{2})$ lower bound given in [BCJ20] for entry queries.

Another work on PSD-testing is that of [Han+17], who construct a PSD-tester in the matrix-vector model. They first show how to approximate a general trace function $\sum f(\lambda_{i})$ for sufficiently smooth $f$, by using a Chebyshev polynomial construction to approximate $f$ in the sup-norm over an interval. This allows them to construct an $\ell_{\infty}$-tester by taking $f$ to be a smooth approximation of a shifted Heaviside function. Unfortunately this approach is limited to $\ell_{\infty}$-testers, and does not achieve the optimal bound; they require $\Omega((\log d)/\epsilon)$ matrix-vector queries compared to the $\widetilde{O}((\log d)/\sqrt{\epsilon})$ queries achieved by Krylov iteration.

Spectrum Estimation.

The closely-related problem of spectrum estimation has been considered several times, in the context of sketching the largest $k$ elements of the spectrum [AN13], as discussed above, and approximating the entire spectrum from entry queries in the bounded entry model [BMR21a].

1.4 Notation

A symmetric matrix $A$ is positive semidefinite (PSD) if all of its eigenvalues are non-negative. We use $\Delta^{d}_{+}$ to represent the PSD cone, which is the subset of $d\times d$ symmetric matrices that are PSD.

For a matrix $A$ we use $\left\|A\right\|_{p}$ to denote the Schatten $p$-norm, which is the $\ell_{p}$ norm of the vector of singular values of $A$. The Frobenius norm will play a special role in several places, so we sometimes use the notation $\left\|A\right\|_{F}$ to emphasize this. Additionally, $\left\|A\right\|$ without a subscript indicates the operator norm (which equals $\left\|A\right\|_{\infty}$).

We always use $d$ to indicate the dimension of the matrix being tested, and use $\epsilon<1$ to indicate the parameter in Definition 1.

When applied to vectors, $\left\langle\cdot,\cdot\right\rangle$ indicates the standard inner product on $\mathbb{R}^{n}$. When applied to matrices, it indicates the Frobenius inner product $\left\langle X,Y\right\rangle:=\operatorname{Tr}(X^{T}Y)$.

$S^{d-1}$ indicates the set of all unit vectors in $\mathbb{R}^{d}$.

We use the notation $X^{\dagger}$ to indicate the Moore-Penrose pseudoinverse of $X$.

For a symmetric matrix $A\in\mathbb{R}^{d\times d}$ with eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{d}$, we let $A_{k}$ denote the matrix $A$ with all but the top $k$ eigenvalues zeroed out. Formally, if $U$ is an orthogonal matrix diagonalizing $A$, then $A_{k}=U^{T}\operatorname{diag}(\lambda_{1},\ldots,\lambda_{k},0,\ldots,0)U$. We also let $A_{-k}=A-A_{k}$.

Throughout, we use $c$ to indicate an absolute constant. The value of $c$ may change between instances.

2 Vector-matrix-vector queries

2.1 An optimal one-sided tester.

To construct our vector-matrix-vector tester, we analyze the iteration

$$x^{(k+1)}=x^{(k)}-\eta\left((g^{(k)})^{T}Ax^{(k)}\right)g^{(k)}=\left(I-\eta g^{(k)}(g^{(k)})^{T}A\right)x^{(k)},\qquad(2)$$

where $g^{(k)}\sim\mathcal{N}(0,I_{d})$ and $x^{(0)}\sim\mathcal{N}(0,I_{d})$.

Our algorithm is essentially to run this scheme for a fixed number of iterations with a well-chosen step size $\eta$. If the value of $(x^{(k)})^{T}Ax^{(k)}$ ever becomes negative, then we output False; otherwise we output True. A minimal sketch of the resulting procedure follows.
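In the sketch below, the `query` oracle, step size $\eta$, and iteration count $N$ are assumptions of the illustration (Theorem 7 gives the precise setting), and the running value of $f(x^{(k)})$ is maintained via the update identity of Proposition 5 rather than re-queried, at a cost of two queries per iteration:

```python
import numpy as np

def one_sided_l1_tester(query, d, eta, N, rng=np.random.default_rng()):
    """Illustrative one-sided tester running iteration (2), where
    query(u, v) returns u^T A v (one vector-matrix-vector query).
    Returns (False, x) with a witness x satisfying x^T A x < 0 when
    a negative direction is found, and (True, None) otherwise."""
    x = rng.standard_normal(d)
    fx = query(x, x)                       # f(x) = x^T A x
    for _ in range(N):
        if fx < 0:
            return False, x                # certificate that A is not PSD
        g = rng.standard_normal(d)
        gAx, gAg = query(g, x), query(g, g)
        # Maintain f(x) via Proposition 5, then take the step (2).
        fx -= eta * gAx**2 * (2 - eta * gAg)
        x = x - eta * gAx * g
    return (True, None) if fx >= 0 else (False, x)
```

Using this approach we prove the following.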

Theorem 2.

There exists a one-sided adaptive $\ell_{1}$-tester that makes $O(\frac{1}{\epsilon}\log^{3}\frac{1}{\epsilon})$ vector-matrix-vector queries to $A$.

As an immediate corollary we obtain a bound for $\ell_{p}$-testers.

Corollary 3.

There is a one-sided adaptive $\ell_{p}$-tester that makes $O(\frac{1}{\epsilon}d^{1-1/p}\log^{3}(\frac{1}{\epsilon}d^{1-1/p}))$ vector-matrix-vector queries.

Proof.

This follows from the previous result along with the bound $\left\|A\right\|_{p}\geq d^{1/p-1}\left\|A\right\|_{1}$. ∎

We now turn to the proof of Theorem 2. Since our iterative scheme is rotation-invariant, we assume without loss of generality that $A=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{d})$. For now, we assume that $\left\|A\right\|_{1}\leq 1$, and that the smallest eigenvalue of $A$ is $\lambda_{1}=-\epsilon$. We consider running the algorithm for $N$ iterations. We will show that our iteration finds an $x$ with $x^{T}Ax<0$ in $N=\widetilde{O}(1/\epsilon)$ iterations. We will use $c$ to denote absolute constants that we do not track, and that may vary between uses.

Our key technical lemma shows that the first coordinate (which is associated to the $-\epsilon$ eigenvalue) grows fairly quickly with good probability.

Lemma 4.

Suppose $\eta$ and $N$ satisfy the following assumptions: (1) $\eta\leq\frac{1}{4}$; (2) $\eta^{2}\epsilon N\leq\frac{1}{8}$; (3) $(1+\eta^{2}\epsilon^{2})^{N}\leq\frac{5}{4}$; (4) $(1+\eta\epsilon)^{N}\geq\frac{10}{\epsilon^{2}}$. Then $x_{1}^{(N)}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$.

Proof.

Following [Jai+16] we define the matrix $B_{k}=\prod_{i=1}^{k}\left(I-\eta g^{(i)}(g^{(i)})^{T}A\right)$, where the $g^{(i)}$ are independent $\mathcal{N}(0,I)$ Gaussians. Note that $x^{(k)}=B_{k}x^{(0)}$. We will show that $B_{k}^{T}e_{1}$ has large norm with good probability (in fact we will show that $\left\langle B_{k}^{T}e_{1},e_{1}\right\rangle$ is large). This will then imply that $\left\langle B_{k}x^{(0)},e_{1}\right\rangle$ is large with high probability, where $x^{(0)}\sim\mathcal{N}(0,I)$.

Step 1: Deriving a recurrence for the second moments.

Let $y^{(k)}=B_{k}^{T}e_{1}$ and let $u^{(k)}_{i}$ be the second moment of the coordinate $y_{i}^{(k)}$. Note that $u^{(0)}_{i}=\delta_{1i}$ (where $\delta$ is the Kronecker delta). To simplify the notation, we drop the superscript on the $g$. We compute
$$y_{i}^{(k+1)}=\left((I-\eta Agg^{T})y^{(k)}\right)_{i}=y^{(k)}_{i}-\eta(Ag)_{i}(g_{1}y^{(k)}_{1}+\ldots+g_{d}y^{(k)}_{d})=y_{i}^{(k)}-\eta\lambda_{i}g_{i}(g_{1}y^{(k)}_{1}+\ldots+g_{d}y^{(k)}_{d}).$$

Next we observe that (after grouping terms) the coefficients of the $y_{i}^{(k)}$ terms are pairwise uncorrelated. Using this, along with the fact that the $g_{i}$'s are independent of the $y_{i}^{(k)}$'s, gives

$$u_{i}^{(k+1)}=\mathbb{E}(1-\eta\lambda_{i}g_{i}^{2})^{2}u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j\neq i}u_{j}^{(k)}=(1-2\eta\lambda_{i}+3\eta^{2}\lambda_{i}^{2})u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j\neq i}u_{j}^{(k)}=(1-2\eta\lambda_{i}+2\eta^{2}\lambda_{i}^{2})u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j=1}^{d}u_{j}^{(k)}.$$

Let $S^{(k)}=u^{(k)}_{1}+\ldots+u^{(k)}_{d}$ and $\gamma_{i}=1-2\eta\lambda_{i}+2\eta^{2}\lambda_{i}^{2}$. Then we can write the recurrence as $u^{(k+1)}_{i}=\gamma_{i}u^{(k)}_{i}+\eta^{2}\lambda_{i}^{2}S^{(k)}$. Iterating this recurrence gives
$$u^{(k)}_{i}=\delta_{1i}\gamma_{i}^{k}+\eta^{2}\lambda_{i}^{2}\left(\gamma_{i}^{k-1}S^{(0)}+\gamma_{i}^{k-2}S^{(1)}+\ldots+S^{(k-1)}\right).\qquad(3)$$

Step 2: Bounding $S^{(k)}$.

Summing the above equation over $i$ allows us to write a recurrence for the $S^{(k)}$'s: $S^{(k)}=\gamma_{1}^{k}+\alpha_{k-1}S^{(0)}+\alpha_{k-2}S^{(1)}+\ldots+\alpha_{0}S^{(k-1)}$, where we define $\alpha_{j}:=\sum_{i=1}^{d}\eta^{2}\lambda_{i}^{2}\gamma_{i}^{j}$.

We split $\alpha_{j}$ into two parts, $\alpha_{j}^{+}$ and $\alpha_{j}^{-}$, corresponding to the terms in the sum where $\lambda_{i}$ is positive or negative respectively. We now use the recurrence to bound $S^{(k)}$. First, by Hölder's inequality,
$$S^{(k)}\leq\gamma_{1}^{k}+\max(S^{(0)},\ldots,S^{(k-1)})(\alpha_{0}^{+}+\ldots+\alpha_{k-1}^{+})+(\alpha_{k-1}^{-}S^{(0)}+\alpha_{k-2}^{-}S^{(1)}+\ldots+\alpha_{0}^{-}S^{(k-1)}).$$

We calculate
$$\sum_{j=0}^{k-1}\alpha_{j}^{+}=\sum_{j=0}^{k-1}\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\gamma_{i}^{j}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\sum_{j=0}^{k-1}\gamma_{i}^{j}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\frac{1-\gamma_{i}^{k}}{1-\gamma_{i}}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\frac{1-\gamma_{i}^{k}}{2\eta\lambda_{i}-2\eta^{2}\lambda_{i}^{2}}=\sum_{i:\lambda_{i}>0}\eta\lambda_{i}\frac{1-\gamma_{i}^{k}}{2-2\eta\lambda_{i}}\leq\sum_{i:\lambda_{i}>0}\eta\lambda_{i}\leq\eta,$$

where we used that $\eta\lambda_{i}\leq 1/2$ (which is a consequence of Assumption 1), that $\gamma_{i}<1$ (which holds since $\lambda_{i}>0$), and that $\sum_{i:\lambda_{i}>0}\lambda_{i}\leq 1$. Since we assume that $-\epsilon$ is the smallest eigenvalue,

$$\alpha_{j}^{-}\leq\eta^{2}\gamma_{1}^{j}\sum_{i:\lambda_{i}<0}\lambda_{i}^{2}\leq\eta^{2}\gamma_{1}^{j}\epsilon\sum_{i:\lambda_{i}<0}\left|\lambda_{i}\right|\leq\eta^{2}\gamma_{1}^{j}\epsilon.$$

Let $\tilde{S}^{(k)}=\max(S^{(0)},\ldots,S^{(k)})$. Then combining our bounds gives
$$\tilde{S}^{(k)}\leq\max\left(\tilde{S}^{(k-1)},\gamma_{1}^{k}+\eta\tilde{S}^{(k-1)}+\eta^{2}\epsilon(\gamma_{1}^{k-1}\tilde{S}^{(0)}+\gamma_{1}^{k-2}\tilde{S}^{(1)}+\ldots+\tilde{S}^{(k-1)})\right).$$

The next step is to use this recurrence to bound $\tilde{S}^{(k)}$. For this, define $c^{(k)}$ such that $\tilde{S}^{(k)}=c^{(k)}\gamma_{1}^{k}$. Plugging in to the above and dividing through by $\gamma_{1}^{k}$, we get that $c^{(k)}$ satisfies
$$c^{(k)}\leq\max\left(\frac{c^{(k-1)}}{\gamma_{1}},1+\frac{\eta}{\gamma_{1}}c^{(k-1)}+\frac{\eta^{2}\epsilon}{\gamma_{1}}(c^{(0)}+\ldots+c^{(k-1)})\right)\leq\max\left(c^{(k-1)},1+\eta c^{(k-1)}+\eta^{2}\epsilon(c^{(0)}+\ldots+c^{(k-1)})\right),$$

where we used the fact that $\gamma_{1}\geq 1$. Now set $\tilde{c}^{(k)}=\max(c^{(0)},\ldots,c^{(k)})$. By Assumptions 1 and 2, $\eta+\eta^{2}\epsilon k\leq 1/2$. This gives

$$\tilde{c}^{(k)}\leq\max\left(\tilde{c}^{(k-1)},1+\eta\tilde{c}^{(k-1)}+\eta^{2}\epsilon k\tilde{c}^{(k-1)}\right)\leq\max\left(\tilde{c}^{(k-1)},1+\frac{1}{2}\tilde{c}^{(k-1)}\right).$$

Note that $c^{(0)}=S^{(0)}=1$, so a straightforward induction using the above recurrence shows that $\tilde{c}^{(k)}\leq 2$ for all $k$. It follows that $S^{(k)}\leq 2\gamma_{1}^{k}$.

Step 3: Bounding the second moment. Plugging the bound above into (3) gives
$$u_{1}^{(k)}\leq\gamma_{1}^{k}+2k\eta^{2}\epsilon^{2}\gamma_{1}^{k-1}\leq\left(1+2k\eta^{2}\epsilon^{2}\right)\gamma_{1}^{k}.$$

Step 4: Applying Chebyshev. We focus on the first coordinate, $y_{1}^{(k)}$. Note that $I-\eta Agg^{T}$ has expectation $I-\eta A$, so a straightforward induction shows that $\mathbb{E}y_{1}^{(k)}=(1+\eta\epsilon)^{k}$.

Using the bound for the second moment of the first coordinate, we get
$$\frac{u_{1}^{(k)}}{\left(\mathbb{E}y_{1}^{(k)}\right)^{2}}\leq\frac{(1+2k\eta^{2}\epsilon^{2})\gamma_{1}^{k}}{(1+\eta\epsilon)^{2k}}=(1+2k\eta^{2}\epsilon^{2})\left(\frac{1+2\eta\epsilon+2\eta^{2}\epsilon^{2}}{1+2\eta\epsilon+\eta^{2}\epsilon^{2}}\right)^{k}=(1+2k\eta^{2}\epsilon^{2})\left(1+\frac{\eta^{2}\epsilon^{2}}{1+2\eta\epsilon+\eta^{2}\epsilon^{2}}\right)^{k}\leq(1+2k\eta^{2}\epsilon^{2})(1+\eta^{2}\epsilon^{2})^{k}.$$

By Assumptions 2 and 3, $N\eta^{2}\epsilon^{2}\leq 1/8$ and $(1+\eta^{2}\epsilon^{2})^{N}\leq 5/4$, so we get that $u_{1}^{(k)}\leq\frac{25}{16}\left(\mathbb{E}y_{1}^{(k)}\right)^{2}$.

Thus by Chebyshev's inequality, $\mathbb{P}\left(\left|y_{1}^{(k)}-\mathbb{E}(y_{1}^{(k)})\right|\geq 0.9\,\mathbb{E}(y_{1}^{(k)})\right)\leq\frac{25}{36}$. So with probability at least $0.3$, $y_{1}^{(N)}\geq\frac{1}{10}\mathbb{E}(y_{1}^{(N)})=\frac{1}{10}(1+\eta\epsilon)^{N}$.

Under Assumption 4, $(1+\eta\epsilon)^{N}\geq\frac{10}{\epsilon^{2}}$, which means that $y_{1}^{(N)}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.3$.

Step 5: Concluding the argument. We showed that $\left\langle B_{N}^{T}e_{1},e_{1}\right\rangle\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.3$. In particular this implies that $\left\|B_{N}^{T}e_{1}\right\|\geq\frac{1}{\epsilon^{2}}$. Now since $x^{(0)}$ is distributed as $\mathcal{N}(0,I)$, $\left\langle B_{N}x^{(0)},e_{1}\right\rangle=\left\langle x^{(0)},B_{N}^{T}e_{1}\right\rangle\sim\mathcal{N}(0,\left\|B_{N}^{T}e_{1}\right\|^{2})$, which is at least $\left\|B_{N}^{T}e_{1}\right\|$ in magnitude with $0.67$ probability. It follows that $x^{(N)}_{1}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$. ∎

Let $f(x)=x^{T}Ax$. We next understand how the value of $f(x^{(k)})$ is updated on each iteration.

Proposition 5.

For $g\sim\mathcal{N}(0,I)$, we have $f(x^{(k)})-f(x^{(k+1)})=\eta(g^{T}Ax^{(k)})^{2}(2-\eta g^{T}Ag)$.

Proof.

Plugging in the update rule and expanding gives

$$f(x^{(k+1)})=(x^{(k)})^{T}Ax^{(k)}-2\eta(g^{T}Ax^{(k)})^{2}+\eta^{2}(g^{T}Ax^{(k)})^{2}g^{T}Ag=(x^{(k)})^{T}Ax^{(k)}-\eta(g^{T}Ax^{(k)})^{2}(2-\eta g^{T}Ag),$$

from which the proposition follows. ∎

A consequence of this update is that the sequence $f(x^{(k)})$ is almost guaranteed to be decreasing, as long as $\eta$ is chosen small enough.

Proposition 6.

Assume that $\operatorname{Tr}(A)\leq 1$ and that $\eta<c$. After $N$ iterations, $f(x^{(N)})\leq f(x^{(0)})$ with probability at least $99/100$, provided that $\eta\leq\frac{c}{\log N+1}$.

Proof.

We show something stronger; namely, that for the first $N$ iterations, the sequence $f(x^{(k)})$ is decreasing. By Proposition 5, $f(x^{(k+1)})\leq f(x^{(k)})$ as long as $g^{T}Ag\leq\frac{2}{\eta}$. The probability that this does not occur is $\Pr\left(\sum\lambda_{i}g_{i}^{2}\geq\frac{2}{\eta}\right)\leq\Pr\left(\sum\lambda_{i}(g_{i}^{2}-1)\geq\frac{2}{\eta}-1\right)$.

The $g_{i}^{2}-1$ terms are independent subexponential random variables. So by Bernstein's inequality (see [Ver18] Theorem 2.8.2 for the version used here), this probability is bounded by $2\exp(-c/\eta)$ as long as $\eta$ is a sufficiently small constant. Taking a union bound gives that $f(x^{(N)})\leq f(x^{(0)})$ with probability at least $1-2N\exp(-c/\eta)$, which is at least $99/100$ under the conditions given. ∎

Theorem 7.

Suppose that $\left\|A\right\|_{1}\leq 1$, $\epsilon<1/2$, and that $A$ has $-\epsilon$ as an eigenvalue. If we take $\eta$ such that $c\epsilon^{2}\leq\eta\leq\min\left(\frac{1}{32\log(10/\epsilon^{2})},\frac{c}{\log\frac{1}{\epsilon}}\right)$, then for some $N=\Theta\left(\frac{1}{\epsilon\eta}\log\frac{1}{\epsilon}\right)$ we have $f(x^{(N)})<0$ with probability at least $1/10$.

Proof.

Given an $\eta$ as in the statement of the theorem, choose $N=\left\lceil\frac{2}{\eta\epsilon}\log\frac{10}{\epsilon^{2}}\right\rceil$, which satisfies the assumptions of Lemma 4. Then $x^{(N)}_{1}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$. By Proposition 6, $f(x^{(N)})\leq f(x^{(0)})\leq 2$ with probability at least $0.99$, using the fact that $\eta\leq\frac{c}{\log\frac{1}{\epsilon}}$ for an appropriately chosen absolute constant $c$, such that the hypothesis of Proposition 6 holds.

If $f(x^{(N)})<0$, then the algorithm has already terminated. Otherwise, conditioned on the events in the above paragraph, we have with probability at least $0.8$ that $2-\eta(g^{(N)})^{T}Ag^{(N)}\geq\frac{1}{2}$ and that $(g^{(N)})^{T}Ax^{(N)}\sim\mathcal{N}\left(0,\|Ax^{(N)}\|^{2}\right)$ has magnitude at least
$$\frac{1}{3}\left\|Ax^{(N)}\right\|\geq\frac{1}{3}|\lambda_{1}|x^{(N)}_{1}\geq\frac{1}{3\epsilon^{2}}|\lambda_{1}|\geq\frac{1}{3\epsilon}.$$

Then by Proposition 5 it follows that $f(x^{(N+1)})\leq f(x^{(N)})-\frac{\eta}{20\epsilon^{2}}\leq 2-\frac{\eta}{20\epsilon^{2}}<0$. ∎

We also observe that we can reduce the dimension of the problem by using a result of Andoni and Nguyen. This allows us to avoid a $\log d$ dependence.

Proposition 8.

Suppose that $A$ satisfies $\lambda_{\min}(A)<-\alpha\left\|A\right\|_{1}$, and let $G\in\mathbb{R}^{d\times m}$ have independent $\mathcal{N}(0,\frac{1}{d})$ entries. Then we can choose $m=O(1/\alpha)$ such that $\lambda_{\min}(G^{T}AG)<-\alpha/2$ and $\left\|G^{T}AG\right\|_{1}\leq 2\left\|A\right\|_{1}$.

Proof.

For the first claim, we simply apply Theorem 1.1 in [AN13] and (in their notation) set $\epsilon=O(1)$ and $k=O(1/\alpha)$.

To show that the Schatten $1$-norm does not grow too much under the sketch, we first write $A=A_{+}+A_{-}$, where the nonzero eigenvalues of $A_{+}$ are exactly the positive eigenvalues of $A$. Then using the usual analysis of Hutchinson's trace estimator (see [Mey+21] for example), we have
$$\left\|G^{T}AG\right\|_{1}\leq\left\|G^{T}A_{+}G\right\|_{1}+\left\|G^{T}A_{-}G\right\|_{1}=\operatorname{Tr}(G^{T}A_{+}G)-\operatorname{Tr}(G^{T}A_{-}G)$$
$$=(1\pm O(1/\sqrt{m}))\operatorname{Tr}(A_{+})-(1\pm O(1/\sqrt{m}))\operatorname{Tr}(A_{-})\leq(1+O(1/\sqrt{m}))\left\|A\right\|_{1}.\qquad\text{∎}$$

We are now ready to give the proof of Theorem 2.

Proof.

The above result applies after scaling the $\eta$ given in Theorem 7 by $1/\left\|A\right\|_{1}$. So it suffices to choose $\eta$ to be bounded above by
$$\frac{1}{\left\|A\right\|_{1}}\min\left(\frac{1}{32\log(10/\epsilon^{2})},\frac{c}{\log\frac{1}{\epsilon}}\right),$$
and within a constant factor of this value.

To choose an $\eta$, pick a standard Gaussian vector $g$, and compute $Ag$ using $1/\epsilon$ vector-matrix-vector queries. Then with constant probability, $\lambda_{\max}(A)\leq\left\|Ag\right\|\leq 2d\lambda_{\max}$. Given this, we have
$$d\left\|Ag\right\|\geq\left\|A\right\|_{1}\geq\frac{\left\|Ag\right\|}{2d},\qquad(4)$$
which allows us to approximate $\left\|A\right\|_{1}$ to within a factor of $d^{2}$ with constant probability. Given this, one may simply try the above algorithm with an $\eta$ at each of $O(\log(d^{2}))=O(\log d)$ different scales, at the cost of an extra $\log d$ factor.

Finally, we may improve the $\log d$ factor to a $\log(1/\epsilon)$ factor by using Proposition 8 to sketch $A$, and then applying the above analysis to $G^{T}AG$. Note that the sketch may be used implicitly; once $G$ is chosen, a vector-matrix-vector query to $G^{T}AG$ can be simulated with a single vector-matrix-vector query to $A$. ∎
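Putting the pieces together, a hypothetical driver following this proof might look as follows; for simplicity it forms $Ag$ with $d$ elementary queries (rather than the $O(1/\epsilon)$ queries available after sketching with Proposition 8), it reuses the `one_sided_l1_tester` sketch from above, and its constants are illustrative:

```python
import numpy as np

def l1_test_all_scales(query, d, eps, rng=np.random.default_rng()):
    """Hypothetical driver for Theorem 2: estimate ||A||_1 to within a
    factor of d^2 via ||Ag|| (inequality (4)), then run the tester at
    O(log d) geometric scales of eta.  Assumes eps < 1/2."""
    g = rng.standard_normal(d)
    Ag = np.array([query(e, g) for e in np.eye(d)])  # simplification: d queries
    lo, hi = np.linalg.norm(Ag) / (2 * d), d * np.linalg.norm(Ag)
    base = min(1 / (32 * np.log(10 / eps**2)), 0.1 / np.log(1 / eps))
    N = int(np.ceil(2 / (base * eps) * np.log(10 / eps**2)))
    norm_guess = lo
    while norm_guess <= hi:                          # O(log d) scales
        psd, x = one_sided_l1_tester(query, d, base / norm_guess, N, rng)
        if not psd:
            return False, x                          # witness found
        norm_guess *= 2
    return True, None
```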

2.2 Lower bounds

We will show a bound for two-sided testers which will imply that the bound for $\ell_{1}$-testers given in Theorem 2 is tight up to $\log$ factors. If we require the tester to have one-sided error, then we additionally show that the bound in Corollary 3 is tight for all $p$. Note that this distinction between one-sided and two-sided testers is necessary given Theorem 29.

In order to obtain these lower bounds for adaptive testers, we first show corresponding lower bounds for non-adaptive testers. A minor modification to Lemma 3.1 in [Sun+19] shows that an adaptive tester can have at most quadratic improvement over a non-adaptive tester. This will allow us to obtain our adaptive lower bounds as a consequence of the non-adaptive bounds.

2.2.1 Non-adaptive lower bounds

We first observe that a one-sided tester must always be able to produce a witness matrix $X$ that at least certifies that $A$ is not positive definite.

Proposition 9.

If a one-sided tester makes a series of symmetric linear measurements $\left\langle M_{i},A\right\rangle$ of $A$, and outputs False on a given instance, then there must exist a nonzero $X\in\operatorname{span}(M_{1},\ldots,M_{k})$ such that $X$ is PSD and $\left\langle X,A\right\rangle\leq 0$.

Proof.

We work within the space of symmetric matrices. Let $W=\operatorname{span}(M_{1},\ldots,M_{k})$, and let $\phi(X)=\left\langle A,X\right\rangle$ be the linear functional associated with $A$. Now suppose that $\phi$ is strictly positive for all nonzero $X\in W\cap\Delta^{d}_{+}$. We will construct $\widetilde{\phi}$ that agrees with $\phi$ on $W$ and is non-negative on $\Delta^{d}_{+}$.

Let $W^{\prime}=\ker(\phi)\cap W$, and note that $W^{\prime}\cap\Delta^{d}_{+}=\{0\}$. Now by convexity of $\Delta^{d}_{+}$, there exists a hyperplane $H$ and associated half-space $H^{+}$ such that (i) $H$ contains $W^{\prime}$, (ii) $H\cap\Delta^{d}_{+}=\{0\}$, (iii) $H^{+}\supseteq\Delta^{d}_{+}$, and (iv) $\phi$ is non-negative on $H^{+}\cap W$. Moreover, since $W^{\prime}$ intersects $\Delta^{d}_{+}$ trivially, $H$ can be chosen such that $H\cap W=W^{\prime}$. Now let $\Pi$ be a projection onto $W$ that maps $H$ to $W^{\prime}$, and choose $\widetilde{\phi}=\phi\circ\Pi$.

The linear functional $\widetilde{\phi}$ is represented by the inner product against some symmetric matrix $B$. By construction of $\widetilde{\phi}$, we have $\left\langle B,M_{i}\right\rangle=\left\langle A,M_{i}\right\rangle$ for all $i$, and also $\left\langle B,X\right\rangle\geq 0$ for all PSD $X$. So in particular $\left\langle B,xx^{T}\right\rangle=x^{T}Bx\geq 0$ for all $x$, which implies that $B$ is PSD. Given the existence of the PSD matrix $B$ consistent with all measurements, the one-sided tester must not reject. ∎

We are now able to give an explicit non-PSD spectrum which is hard for any one-sided tester. Specifically, we show that it is hard for any vector-matrix-vector query algorithm to produce a witness $X$ in the sense of the proposition above.

Theorem 10.

Let $\lambda>0$ and suppose that, for all matrices $A$ with spectrum $(-\lambda,1,\ldots,1)$, a non-adaptive one-sided tester $\mathcal{T}$ outputs False with probability $2/3$. Then $\mathcal{T}$ must make at least $\frac{1}{9}\left(\frac{d}{1+\lambda}\right)^{2}$ vector-matrix-vector queries.

Proof.

By the polarization identity
$$x^{T}Ay=\frac{1}{2}\left((x+y)^{T}A(x+y)-y^{T}Ay-x^{T}Ax\right),$$
we may assume that all queries are of the form $x_{i}^{T}Ax_{i}$, at the cost of at most a factor-of-three increase in the number of queries.

We set $A=I-(1+\lambda)vv^{T}$ where $v$ is uniform over $S^{d-1}$, and let $W=\operatorname{span}(x_{1}x_{1}^{T},\ldots,x_{k}x_{k}^{T})$. By Proposition 9, the tester may only reject if there is an $X$ in $W\cap\Delta^{d}_{+}$ with $\left\|X\right\|_{F}=1$ such that $\left\langle X,A\right\rangle\leq 0$. For such an $X$ we have
$$\left\langle vv^{T},X\right\rangle\geq\frac{\operatorname{Tr}(X)}{1+\lambda}\geq\frac{1}{1+\lambda}.\qquad(5)$$

But since $vv^{T}$ and $X$ both have unit norm and $X\in W$, this condition implies that $\left\|\Pi_{W}(vv^{T})\right\|_{F}\geq\frac{1}{1+\lambda}$.

Now we turn to understanding $\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})$. Indeed we have the following:

Lemma 11.

Let $v$ be drawn uniformly from $S^{d-1}$, and let $W$ be a $k$-dimensional subspace of the $d\times d$ symmetric matrices. Let $\alpha_{4}=\mathbb{E}(v_{1}^{4})$ and $\alpha_{22}=\mathbb{E}(v_{1}^{2}v_{2}^{2})$. Then
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=(\alpha_{4}-\alpha_{22})k+\alpha_{22}\left\|\Pi_{W}(I)\right\|_{F}^{2},$$
where $I$ is the identity matrix.

Proof.

Let $M_{1},\ldots,M_{k}$ be an orthonormal basis for $W$. By the Pythagorean theorem,
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=\sum_{i=1}^{k}\mathbb{E}(\left\|\Pi_{M_{i}}(vv^{T})\right\|_{F}^{2}).\qquad(6)$$

For fixed $M$ we have
$$\mathbb{E}(\left\|\Pi_{M}(vv^{T})\right\|_{F}^{2})=\mathbb{E}(\left\langle vv^{T},M\right\rangle^{2})=\mathbb{E}(\operatorname{Tr}(vv^{T}M)^{2})=\mathbb{E}((v^{T}Mv)^{2}).\qquad(7)$$

Since $M$ is symmetric, we can diagonalize $M$ to $D=\operatorname{diag}(a_{1},\ldots,a_{d})$ in some orthonormal basis. Since $M$ has unit norm, $a_{1}^{2}+\ldots+a_{d}^{2}=1$. Then we have

$$\mathbb{E}((v^{T}Mv)^{2})=\mathbb{E}((v^{T}Dv)^{2})=\mathbb{E}((a_{1}v_{1}^{2}+\ldots+a_{d}v_{d}^{2})^{2})=\alpha_{4}(a_{1}^{2}+\ldots+a_{d}^{2})+2\alpha_{22}\sum_{i<j}a_{i}a_{j}=\alpha_{4}+2\alpha_{22}\sum_{i<j}a_{i}a_{j}.$$

Next observe that
$$\operatorname{Tr}(M)^{2}=(a_{1}+\ldots+a_{d})^{2}=a_{1}^{2}+\ldots+a_{d}^{2}+2\sum_{i<j}a_{i}a_{j}=1+2\sum_{i<j}a_{i}a_{j},$$
so that
$$\mathbb{E}((v^{T}Mv)^{2})=\alpha_{4}+\alpha_{22}(\operatorname{Tr}(M)^{2}-1).$$

Combining with (6) gives
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=\sum_{i=1}^{k}\left(\alpha_{4}+\alpha_{22}(\operatorname{Tr}(M_{i})^{2}-1)\right)=(\alpha_{4}-\alpha_{22})k+\alpha_{22}\sum_{i=1}^{k}\operatorname{Tr}(M_{i})^{2}.$$

Finally, observe that $\sum_{i=1}^{k}\operatorname{Tr}(M_{i})^{2}=\sum_{i=1}^{k}\left\langle I,M_{i}\right\rangle^{2}=\left\|\Pi_{W}(I)\right\|_{F}^{2}$ by the Pythagorean theorem, which finishes the proof. ∎

Remark 12.

While approximations would suffice, this result gives a quick way to compute $\alpha_{4}$ and $\alpha_{22}$. Set $W$ to be the entire space of $d\times d$ symmetric matrices, so that $k=d(d+1)/2$. The previous result gives
$$1=\frac{d(d+1)}{2}(\alpha_{4}-\alpha_{22})+d\alpha_{22}.$$

On the other hand, by expanding $1=(v_{1}^{2}+\ldots+v_{d}^{2})^{2}$, we have
$$1=d\alpha_{4}+d(d-1)\alpha_{22}.$$

Solving the system yields $\alpha_{4}=\frac{3}{d(d+2)}$ and $\alpha_{22}=\frac{1}{d(d+2)}$.
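As a quick sanity check, these values can be verified by Monte Carlo (an illustrative snippet, not part of the argument):

```python
import numpy as np

# Check alpha_4 = 3/(d(d+2)) and alpha_22 = 1/(d(d+2)) for v uniform
# on the sphere S^{d-1}.
d, n = 10, 200_000
rng = np.random.default_rng(1)
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)        # uniform on S^{d-1}
print(np.mean(v[:, 0]**4), 3 / (d * (d + 2)))        # both ~ 0.025
print(np.mean(v[:, 0]**2 * v[:, 1]**2), 1 / (d * (d + 2)))  # both ~ 0.0083
```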

To finish the proof of the theorem, we recall that $W$ is spanned by the matrices $x_{1}x_{1}^{T},\ldots,x_{k}x_{k}^{T}$, each of which has rank one. Therefore each matrix in $W$, and in particular $\Pi_{W}(I)$, has rank at most $k$.

We recall, for a general matrix $A$, that $\operatorname{argmin}_{\operatorname{rk}(U)\leq k}\left\|A-U\right\|_{F}$ is obtained by truncating all but the largest $k$ singular values of $A$. Applying this to the identity matrix, when $k\leq d$, we see that
$$\left\|\Pi_{W}(I)\right\|_{F}^{2}=\left\|I\right\|_{F}^{2}-\left\|\Pi_{W^{\perp}}(I)\right\|_{F}^{2}\leq d-(d-k)=k,$$
since $\left\|\Pi_{W^{\perp}}(I)\right\|_{F}^{2}\geq\min_{\operatorname{rk}(U)\leq k}\left\|I-U\right\|_{F}^{2}=d-k$. Since $\left\|I\right\|_{F}^{2}=d$, we always have $\left\|\Pi_{W}(I)\right\|_{F}^{2}\leq k$.

Combining this fact with Lemma 11 gives
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})\leq(\alpha_{4}-\alpha_{22})k+\alpha_{22}k=k\alpha_{4}=\frac{3k}{d(d+2)}\leq\frac{3k}{d^{2}},$$
and by Markov's inequality,
$$\mathbb{P}\left(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2}>\frac{9k}{d^{2}}\right)\leq\frac{1}{3}.$$

So with probability $2/3$, $\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2}\leq\frac{9k}{d^{2}}$. But for $\mathcal{T}$ to be correct, we saw that we must have $\left\|\Pi_{W}(vv^{T})\right\|_{F}\geq\frac{1}{1+\lambda}$ with probability $2/3$. It follows that
$$\left(\frac{1}{1+\lambda}\right)^{2}\leq\frac{9k}{d^{2}},$$
which implies that
$$k\geq\frac{1}{9}\left(\frac{d}{1+\lambda}\right)^{2}.\qquad\text{∎}$$

In particular, this result implies that a one-sided non-adaptive p\ell_{p}-tester with poly(1/ϵ)\text{poly}(1/\epsilon) query complexity can only exist for p=1.p=1.

Theorem 13.

A one-sided non-adaptive p\ell_{p}-tester must make at least Ω(1ϵ2d22/p)\Omega(\frac{1}{\epsilon^{2}}d^{2-2/p}) vector-matrix-vector queries.

Proof.

This follows as a corollary of Theorem 10; simply apply that result to the spectrum (ϵ(d1)1/p,1,,1)(-\epsilon(d-1)^{1/p},1,\ldots,1) where there are d1d-1 11’s. ∎

2.2.2 Adaptive lower bounds

As remarked earlier, our adaptive lower bounds follow as a corollary of our non-adaptive bounds, and a slightly modified version of Lemma 3.1 in [Sun+19], which we give here.

Lemma 14.

Let A=XΣXTA=X\Sigma X^{T} be a random symmetric d×dd\times d real-valued matrix, with Σ\Sigma diagonal, and where XX is orthogonal and sampled from the rotationally invariant (Haar) distribution. Any ss adaptive vector-matrix-vector queries to AA may be simulated by O(s2)O(s^{2}) non-adaptive vector-matrix-vector queries.

Proof.

(Sketch) First note that the adaptive protocol may be simulated by 3s3s adaptive quadratic form queries, of the form xTAxx^{T}Ax by the polarization identity

xTAy=12((x+y)TA(x+y)xTAxyTAy).x^{T}Ay=\frac{1}{2}\left((x+y)^{T}A(x+y)-x^{T}Ax-y^{T}Ay\right). (8)

These queries in turn may be simulated by 9s29s^{2} non-adaptive queries by following exactly the same proof as Lemma 3.1 in [Sun+19] (but now with ui=viu_{i}=v_{i} in their proof). ∎
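To illustrate the polarization step in (8), the following minimal sketch (our illustration) simulates one bilinear query from three quadratic-form queries; note that polarization recovers x^T A y exactly only when A is symmetric, which is the setting here.

    import numpy as np

    def quad(A, x):
        # A single quadratic-form query x^T A x.
        return x @ A @ x

    def bilinear(A, x, y):
        # Simulates x^T A y using three quadratic-form queries, via polarization.
        return 0.5 * (quad(A, x + y) - quad(A, x) - quad(A, y))

    rng = np.random.default_rng(1)
    M = rng.standard_normal((6, 6))
    A = (M + M.T) / 2                      # symmetric input
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    assert np.isclose(bilinear(A, x, y), x @ A @ y)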

As a direct consequence of this fact and our Theorem 13 we obtain the following.

Theorem 15.

An adaptive one-sided p\ell_{p}-tester must make at least Ω(1ϵd11/p)\Omega(\frac{1}{\epsilon}d^{1-1/p}) vector-matrix-vector queries.

3 Adaptive matrix-vector queries

We analyze random Krylov iteration. Namely, we begin with a random g𝒩(0,Id)g\sim\mathcal{N}(0,I_{d}) and construct the sequence of iterates g,Ag,A2g,,Akgg,Ag,A^{2}g,\ldots,A^{k}g using kk adaptive matrix-vector queries. The span of these vectors is denoted 𝒦k(g)\mathcal{K}_{k}(g) and referred to as the kth{k}^{\text{th}} Krylov subspace.

Krylov iteration suggests a very simple algorithm. First compute g,Ag,,Ak+1g.g,Ag,\ldots,A^{k+1}g. If 𝒦k(g)\mathcal{K}_{k}(g) contains a vector vv such that vTAv<0v^{T}Av<0 then output False, otherwise output True. (Note that one can compute AvAv and hence vTAvv^{T}Av for all such vv, given the k+1k+1 matrix-vector queries.) We show that this simple algorithm is in fact optimal.

As a point of implementation, we note that the above condition on 𝒦k(g)\mathcal{K}_{k}(g) can be checked algorithmically. One first uses Gram-Schmidt to compute the projection Π\Pi onto 𝒦k(g)\mathcal{K}_{k}(g). The existence of a v𝒦k(g)v\in\mathcal{K}_{k}(g) with vTAv<0v^{T}Av<0 is equivalent to the condition λmin(ΠAΠ)<0\lambda_{\min}(\Pi A\Pi)<0. When AA is ϵ\epsilon-far from PSD, the proof below will show that in fact λmin(ΠAΠ)<Ω(ϵ)Ap\lambda_{\min}(\Pi A\Pi)<-\Omega(\epsilon)\left\|A\right\|_{p}, so it suffices to estimate λmin(ΠAΠ)\lambda_{\min}(\Pi A\Pi) to within O(ϵ)ApO(\epsilon)\left\|A\right\|_{p} accuracy.
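The following is a minimal sketch of this check (our illustration, assuming access to A through a matvec callback; for simplicity it spends k+1 extra matrix-vector queries re-applying A to the orthonormalized basis rather than reusing the stored iterates):

    import numpy as np

    def krylov_psd_check(matvec, d, k, rng):
        # Build the Krylov basis g, Ag, ..., A^k g with k matrix-vector queries.
        V = [rng.standard_normal(d)]
        for _ in range(k):
            V.append(matvec(V[-1]))
        # Gram-Schmidt via QR gives an orthonormal basis Q of the Krylov space.
        Q, _ = np.linalg.qr(np.column_stack(V))
        # Compression of A to the Krylov space; a negative eigenvalue here
        # witnesses a v in the Krylov space with v^T A v < 0.
        H = Q.T @ np.column_stack([matvec(q) for q in Q.T])
        H = (H + H.T) / 2                  # symmetrize against round-off
        return np.linalg.eigvalsh(H).min() >= 0  # True = consistent with PSD

    rng = np.random.default_rng(2)
    d = 200
    eigs = np.linspace(0.0, 1.0, d)
    eigs[0] = -0.5                         # one well-separated negative eigenvalue
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = (U * eigs) @ U.T
    print(krylov_psd_check(lambda x: A @ x, d=d, k=15, rng=rng))  # False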

Proposition 16.

For r>0r>0, α>0\alpha>0 and δ>0\delta>0 there exists a polynomial pp of degree O(rαlog1δ)O(\frac{\sqrt{r}}{\sqrt{\alpha}}\log\frac{1}{\delta}), such that p(α)=1p(-\alpha)=1 and |p(x)|δ\left|p(x)\right|\leq\delta for all x[0,r].x\in[0,r].

Proof.

Recall that the degree dd Chebyshev polynomial TdT_{d} is bounded by 11 in absolute value on [1,1][-1,1] and satisfies

Td(1+γ)2dγ1.T_{d}(1+\gamma)\geq 2^{d\sqrt{\gamma}-1}.

(See [MM15] for example.) The proposition follows by shifting and scaling TdT_{d}. ∎
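Concretely, the shift-and-scale can be realized as follows (our sketch; the degree constant is illustrative rather than optimized). It maps [0, r] to [-1, 1], normalizes by the value at -alpha, and checks the two properties numerically.

    import numpy as np
    from numpy.polynomial.chebyshev import chebval

    def tester_poly(r, alpha, delta):
        # Degree as in Proposition 16, up to an unoptimized constant.
        m = int(np.ceil(np.sqrt(r / alpha) * np.log(4.0 / delta)))
        c = np.zeros(m + 1)
        c[m] = 1.0                                    # coefficients of T_m
        denom = chebval((2.0 * (-alpha) - r) / r, c)  # T_m at the image of -alpha
        def p(x):
            return chebval((2.0 * np.asarray(x) - r) / r, c) / denom
        return p, m

    p, m = tester_poly(r=1.0, alpha=0.01, delta=1e-3)
    xs = np.linspace(0.0, 1.0, 2000)
    print(p(-0.01))              # exactly 1 by construction
    print(np.abs(p(xs)).max())   # below delta = 1e-3 on [0, r]
    print(m)                     # degree O(sqrt(r/alpha) log(1/delta))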

Theorem 17.

Suppose that AA has an eigenvalue λmin\lambda_{\min} with λminϵAp.\lambda_{\min}\leq-\epsilon\left\|A\right\|_{p}. When p=1p=1, the Krylov subspace 𝒦k(g)\mathcal{K}_{k}(g) contains a vector vv with vTAv<0v^{T}Av<0 for k=O((1ϵ)13log1ϵ).k=O\left(\left(\frac{1}{\epsilon}\right)^{\frac{1}{3}}\log\frac{1}{\epsilon}\right). When p(1,]p\in(1,\infty], the same conclusion holds for k=O((1ϵ)p2p+1log1ϵlogd).k=O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\log d\right).

Proof.

Without loss of generality, assume that Ap1\left\|A\right\|_{p}\leq 1. Fix a value TT to be determined later, effectively corresponding to the number of top eigenvalues that we deflate. By Proposition 16 we can construct a polynomial qq, such that q(λmin)=1q(\lambda_{\min})=1 and |q(x)|ϵ/10d11/p\left|q(x)\right|\leq\sqrt{\frac{\epsilon/10}{d^{1-1/p}}} for x[0,T1/p]x\in[0,T^{-1/p}] with

deg(q)CT1/(2p)ϵlog(d11/pϵ/10),\deg(q)\leq C\frac{T^{-1/(2p)}}{\sqrt{\epsilon}}\log\left(\sqrt{\frac{d^{1-1/p}}{\epsilon/10}}\right), (9)

where CC is an absolute constant.

Now set

p(x)=q(x)i:λi>T1/pλixλiλmin.p(x)=q(x)\prod_{i:\lambda_{i}>T^{-1/p}}\frac{\lambda_{i}-x}{\lambda_{i}-\lambda_{\min}}. (10)

Since we assume Ap1,\left\|A\right\|_{p}\leq 1, there are at most TT terms in the product, so

deg(p)T+CT1/(2p)ϵlog(d11/pϵ/10).\deg(p)\leq T+C\frac{T^{-1/(2p)}}{\sqrt{\epsilon}}\log\left(\sqrt{\frac{d^{1-1/p}}{\epsilon/10}}\right). (11)

By setting T=ϵp/(2p+1),T=\epsilon^{-p/(2p+1)}, we get

deg(p)={O((1ϵ)p2p+1log1ϵ)if p=1O((1ϵ)p2p+1log1ϵlogd)if p>1\deg(p)=\begin{cases}O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\right)\quad&\text{if }p=1\\ O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\log d\right)\quad&\text{if }p>1\end{cases} (12)

As long as kk is at least deg(p)\deg(p), then v=p(A)gv=p(A)g lies in 𝒦k(g),\mathcal{K}_{k}(g), and

vTAv=gTp(A)2Ag.v^{T}Av=g^{T}p(A)^{2}Ag. (13)

By construction, p(λmin)=1p(\lambda_{\min})=1. Also for all xx in [0,T1/p],[0,T^{-1/p}], |p(x)||q(x)|(ϵ/10)d1/p1.|p(x)|\leq|q(x)|\leq\sqrt{(\epsilon/10)d^{1/p-1}}.

Therefore the matrix p(A)2Ap(A)^{2}A has at least one eigenvalue less than ϵ-\epsilon, and the positive eigenvalues sum to at most

i:λi>0ϵ10d1/p1λiϵ10,\sum_{i:\lambda_{i}>0}\frac{\epsilon}{10}d^{1/p-1}\lambda_{i}\leq\frac{\epsilon}{10}, (14)

by using Hölder’s inequality along with the fact that Ap1.\left\|A\right\|_{p}\leq 1. The ϵ-\epsilon eigendirection contributes ϵg02-\epsilon g_{0}^{2} for a standard Gaussian coefficient g0g_{0}, while the positive contribution has expectation at most ϵ/10\epsilon/10, so gTp(A)2Ag<0g^{T}p(A)^{2}Ag<0 with constant probability, which may be amplified to 2/32/3 as desired. ∎

Remark 18.

For 1<p<1<p<\infty, the logd\log d dependence can be removed by simply applying the p=1p=1 tester to Ap,A^{\lceil p\rceil}, as a matrix-vector query to ApA^{\lceil p\rceil} may be simulated via p\lceil p\rceil matrix-vector queries to A.A. However this comes at the cost of a (1ϵ)p/3\left(\frac{1}{\epsilon}\right)^{\lceil p\rceil/3} dependence, and is therefore only an improvement when dd is extremely large.

Remark 19.

While we observe that deflation of the top eigenvalues can be carried out implicitly within the Krylov space, this can also be done explicitly using block Krylov iteration, along with the guarantee given in Theorem 1 of [MM15].

We showed above that we could improve upon the usual analysis of Krylov iteration in our context. We next establish a matching lower bound that shows our analysis is tight up to log\log factors. This is a corollary of the proof of Theorem 3.1 presented in [Bra+20].

Theorem 20.

A two-sided, adaptive p\ell_{p}-tester in the matrix-vector model must in general make at least Ω(1ϵp/(2p+1))\Omega(\frac{1}{\epsilon^{p/(2p+1)}}) queries.

Proof.

We make use of the proof of Theorem 3.1 given in [Bra+20]. We consider an algorithm 𝒜\mathcal{A} that receives a matrix WW sampled from the Wishart distribution, makes at most (1β)d(1-\beta)d queries, and outputs either True or False, depending on whether λmin(W)\lambda_{\min}(W) is greater than or less than tt (where t=1/(2d2)t=1/(2d^{2}) is defined as in [Bra+20]). We say that 𝒜\mathcal{A} fails on a given instance if either (i) 𝒜\mathcal{A} outputs True and t14d2λmin(W)t-\frac{1}{4d^{2}}\geq\lambda_{\min}(W) or (ii) 𝒜\mathcal{A} outputs False and λmin(W)t+14d2.\lambda_{\min}(W)\geq t+\frac{1}{4d^{2}}. Exactly the same proof given in [Bra+20] shows that 𝒜\mathcal{A} must fail with probability at least cwishβc_{\text{wish}}\sqrt{\beta} where cwish>0c_{\text{wish}}>0 is an absolute constant, as long as dd is chosen sufficiently large depending on β.\beta. Taking β=1/4\beta=1/4, say, means that any such algorithm fails with probability at least cwish/2c_{\text{wish}}/2 as long as dd is a large enough constant.

Now consider an p\ell_{p}-tester 𝒯\mathcal{T} with d=1/(4ϵp/(2p+1))d=1/(4\epsilon^{p/(2p+1)}), applied to the random matrix WtI.W-tI. While our definition allows 𝒯\mathcal{T} to fail with 1/31/3 probability we can reduce this failure probability to cwish/2c_{\text{wish}}/2 by running a constant number of independent instances and taking a majority vote. So from here on we assume that 𝒯\mathcal{T} fails on a given instance with probability at most cwish/2c_{\text{wish}}/2.

First recall that WXXTW\sim XX^{T} where each entry of XX is i.i.d. 𝒩(0,1/d).\mathcal{N}(0,1/d). Then with high probability, the operator norm of XX is bounded, say, by 2\sqrt{2}, and the eigenvalues of WW are bounded by 2.2.

Therefore with high probability, Wp2d1/p,\left\|W\right\|_{p}\leq 2d^{1/p}, and so WtIp3d1/p.\left\|W-tI\right\|_{p}\leq 3d^{1/p}. It follows that 1/(4d2)=4ϵ(4d)1/pϵWtIp.1/(4d^{2})=4\epsilon(4d)^{1/p}\geq\epsilon\left\|W-tI\right\|_{p}. This means that 𝒯\mathcal{T} can solve the problem above, and by correctness of the tester, fails with at most cwish/2c_{\text{wish}}/2 probability. For ϵ\epsilon sufficiently small, the above analysis implies that 𝒯\mathcal{T} must make at least Ω(d)=Ω(1/ϵp/(2p+1))\Omega(d)=\Omega(1/\epsilon^{p/(2p+1)}) queries.

4 An optimal bilinear sketch

In this section we analyze a bilinear sketch for PSD-testing which will also yield an optimal 2\ell_{2}-tester in the vector-matrix-vector model.

Our sketch is very simple. We choose Gd×kG\in\mathbb{R}^{d\times k} to have independent 𝒩(0,1)\mathcal{N}(0,1) entries and take our sketch to be GTAG.G^{T}AG. In parallel we construct estimates α\alpha and β\beta for the trace and Frobenius norm of AA respectively, such that β\beta is accurate to within a multiplicative error of 22, and α\alpha is accurate to within AF\left\|A\right\|_{F} additive error. (Note that this may be done at the cost of increasing the sketching dimension by O(1)O(1).)

If GTAGG^{T}AG is not PSD then we automatically reject. Otherwise, we then consider the quantity

γ:=αλmin(GTAG)βklogk\gamma:=\frac{\alpha-\lambda_{\min}(G^{T}AG)}{\beta\sqrt{k}\log k} (15)

If γ\gamma is at most cpsdc_{\text{psd}} for some absolute constant cpsdc_{\text{psd}}, then the tester outputs True; otherwise it outputs False.

We first show that a large negative eigenvalue of AA causes the smallest sketched eigenvalue to be at most Tr(A)Ω(ϵk).\operatorname{Tr}(A)-\Omega(\epsilon k). On the other hand, when AA is PSD, we will show that λmin(GTAG)\lambda_{\min}(G^{T}AG) is substantially larger.
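Here is a minimal end-to-end sketch of the tester (ours; the trace and Frobenius-norm estimates are computed exactly as stand-ins for the O(1)-column sketched estimators, and the constant c_psd is illustrative, not the paper's tuned value):

    import numpy as np

    def psd_test_sketch(A, eps, c_psd=1.0, rng=None):
        rng = rng or np.random.default_rng()
        d = A.shape[0]
        k = max(8, int(np.ceil(eps ** -2 * np.log(1.0 / eps) ** 2)))
        G = rng.standard_normal((d, k))
        S = G.T @ A @ G
        S = (S + S.T) / 2
        lam_min = np.linalg.eigvalsh(S).min()
        if lam_min < 0:
            return False          # the sketch of a PSD matrix is always PSD
        alpha = np.trace(A)                  # stand-in for the trace estimate
        beta = np.linalg.norm(A, 'fro')      # stand-in for the norm estimate
        gamma = (alpha - lam_min) / (beta * np.sqrt(k) * np.log(k))
        return gamma <= c_psd     # small gamma is consistent with A being PSD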

4.1 Upper bound on λmin(GTAG)\lambda_{\min}(G^{T}AG)

We start with the following result on trace estimators which we will need below.

Proposition 21.

Let MM be a symmetric matrix with eigenvalues λ1,,λd,\lambda_{1},\ldots,\lambda_{d}, and let uu be a random unit vector with respect to the spherical measure. Then

Var(uTMu)=2d+2(λ12++λd2d(λ1++λd)2d2):=2d+2Var(λ1,,λd).\operatorname{Var}(u^{T}Mu)=\frac{2}{d+2}\left(\frac{\lambda_{1}^{2}+\ldots+\lambda_{d}^{2}}{d}-\frac{(\lambda_{1}+\ldots+\lambda_{d})^{2}}{d^{2}}\right):=\frac{2}{d+2}\operatorname{Var}(\lambda_{1},\ldots,\lambda_{d}).
Proof.

By the spectral theorem, it suffices to prove the result when MM is diagonal. Then

Var(uTMu)=𝔼(λ1u12++λdud2)2(𝔼(λ1u12++λdud2))2.\operatorname{Var}(u^{T}Mu)=\mathbb{E}(\lambda_{1}u_{1}^{2}+\ldots+\lambda_{d}u_{d}^{2})^{2}-\left(\mathbb{E}(\lambda_{1}u_{1}^{2}+\ldots+\lambda_{d}u_{d}^{2})\right)^{2}.

By Remark 12, we have 𝔼(ui2)=1d,\mathbb{E}(u_{i}^{2})=\frac{1}{d}, 𝔼(ui4)=3d(d+2)\mathbb{E}(u_{i}^{4})=\frac{3}{d(d+2)} and 𝔼(ui2uj2)=1d(d+2)\mathbb{E}(u_{i}^{2}u_{j}^{2})=\frac{1}{d(d+2)} for ij.i\neq j. The result follows by expanding using linearity of expectation, and then applying these facts. ∎
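A quick Monte Carlo check of this variance formula (ours, with M diagonal):

    import numpy as np

    rng = np.random.default_rng(2)
    d = 20
    lam = rng.standard_normal(d)            # eigenvalues of a diagonal M
    u = rng.standard_normal((200_000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    empirical = np.var(u ** 2 @ lam)        # Var(u^T M u) over random unit u
    predicted = 2 / (d + 2) * (np.mean(lam ** 2) - np.mean(lam) ** 2)
    print(empirical, predicted)             # agree to Monte Carlo accuracy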

The next two results will give an upper bound on the smallest eigenvalue of the Gaussian sketch. For the proof of Lemma 23 we will start with random orthogonal projections, from which the Gaussian result will quickly follow. We include a technical hypothesis that essentially enforces non-negativity of Tr(A).\operatorname{Tr}(A). We write the hypothesis in the form below simply to streamline the argument.

Lemma 22.

Suppose that AF=1\left\|A\right\|_{F}=1 and that vv is an eigenvector of AA with associated eigenvalue ϵ.-\epsilon. Let Πd×d\Pi\in\mathbb{R}^{d\times d} be a projection onto a random kk-dimensional subspace SS of d\mathbb{R}^{d}, sampled from the rotationally invariant measure. Also suppose that xTAx0x^{T}Ax\geq 0 with probability 0.9990.999 when xx is a random unit vector. Then

1Πv2(Πv)TA(Πv)1d(0.5ϵk+Tr(A)+O(1))\frac{1}{\left\|\Pi v\right\|^{2}}(\Pi v)^{T}A(\Pi v)\leq\frac{1}{d}(-0.5\epsilon k+\operatorname{Tr}(A)+O(1))

with probability at least 0.99exp(ck)0.99-\exp(-ck).

Proof.

Let u=ΠvΠv.u=\frac{\Pi v}{\left\|\Pi v\right\|}. The subspace SS was chosen randomly, so with probability at least 1exp(ck)1-\exp(-ck),

u,v2=Πv20.5kd.\left\langle u,v\right\rangle^{2}=\left\|\Pi v\right\|^{2}\geq 0.5\frac{k}{d}.

Let uu^{\prime} be the projection of uu onto the hyperplane vv^{\perp} orthogonal to v.v. Observe by symmetry that u/uu^{\prime}/\left\|u^{\prime}\right\| is distributed uniformly over the sphere in vv^{\perp}.

Let A=AϵvvTA^{\prime}=A-\epsilon vv^{T} be the matrix AA with the ϵ-\epsilon eigenvalue zeroed out. Then

uTAu0.5kdϵ+(u)TAu0.5kdϵ+(u/u)TA(u/u),u^{T}Au\leq-0.5\frac{k}{d}\epsilon+(u^{\prime})^{T}A^{\prime}u^{\prime}\leq-0.5\frac{k}{d}\epsilon+(u^{\prime}/\left\|u^{\prime}\right\|)^{T}A^{\prime}(u^{\prime}/\left\|u^{\prime}\right\|),

as long as (u)TAu0(u^{\prime})^{T}A^{\prime}u^{\prime}\geq 0, which holds with probability at least 0.9990.999 as a consequence of the hypothesis that xTAx0x^{T}Ax\geq 0 for a random unit vector x.x. The latter term is a trace estimator for 1dA\frac{1}{d}A^{\prime} with variance bounded by 2d2AF22d2\frac{2}{d^{2}}\left\|A^{\prime}\right\|_{F}^{2}\leq\frac{2}{d^{2}} (for example by Proposition 21). So with 0.9990.999 probability

(u/u)TA(u/u)1d(Tr(A)+O(1))=1d(Tr(A)+ϵ+O(1))1d(Tr(A)+O(1)),(u^{\prime}/\left\|u^{\prime}\right\|)^{T}A^{\prime}(u^{\prime}/\left\|u^{\prime}\right\|)\leq\frac{1}{d}(\operatorname{Tr}(A^{\prime})+O(1))=\frac{1}{d}(\operatorname{Tr}(A)+\epsilon+O(1))\leq\frac{1}{d}(\operatorname{Tr}(A)+O(1)),

and the result follows. ∎

In the following lemma, we introduce the technical assumption that k<cdk<cd. However this will be unimportant later, as any sketch with kcdk\geq cd might as well have sketching dimension dd, at which point the testing problem is trivial.

Lemma 23.

Suppose that Ad×dA\in\mathbb{R}^{d\times d} has an eigenvalue of ϵ-\epsilon, AF=1\left\|A\right\|_{F}=1, and Gd×kG\in\mathbb{R}^{d\times k} with k<dk<d has iid 𝒩(0,1)\mathcal{N}(0,1) entries. Also suppose that xTAx0x^{T}Ax\geq 0 with probability at least 0.9990.999 for a random unit vector x,x, and that k<cdk<cd for some absolute constant c.c. Then with probability at least 0.993exp(ck)0.99-3\exp(-ck),

λmin(GTAG)0.4ϵk+O(1)+Tr(A)+ckd|Tr(A)|.\lambda_{\min}(G^{T}AG)\leq-0.4\epsilon k+O(1)+\operatorname{Tr}(A)+c\frac{\sqrt{k}}{\sqrt{d}}\left|\operatorname{Tr}(A)\right|.
Proof.

Let ΠG=GG\Pi_{G}=GG^{\dagger} denote projection onto the image of GG.

Let vv be an eigenvector of AA with associated eigenvalue less than or equal to ϵ,-\epsilon, and set u=Gv/Gvu=G^{\dagger}v/\left\|G^{\dagger}v\right\|. We then have

λmin(GTAG)uT(GTAG)u=1Gv2(ΠGv)TA(ΠGv).\lambda_{\min}(G^{T}AG)\leq u^{T}(G^{T}AG)u=\frac{1}{\left\|G^{\dagger}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v).

We also have

ΠGv=GGvGopGv.\left\|\Pi_{G}v\right\|=\left\|GG^{\dagger}v\right\|\leq\left\|G\right\|_{\text{op}}\left\|G^{\dagger}v\right\|.

By Theorem 4.6.1 in [Ver18], Gopd+ck\left\|G\right\|_{\text{op}}\leq\sqrt{d}+c\sqrt{k} with probability at least 12exp(k)1-2\exp(-k). Conditional on this occurring,

GvΠGvd+ck,\left\|G^{\dagger}v\right\|\geq\frac{\left\|\Pi_{G}v\right\|}{\sqrt{d}+c\sqrt{k}},

from which it follows that

λmin(GTAG)(d+cdk)1ΠGv2(ΠGv)TA(ΠGv),\lambda_{\min}(G^{T}AG)\leq(d+c\sqrt{d}\sqrt{k})\frac{1}{\left\|\Pi_{G}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v),

as long as the quantity on the right-hand side is non-negative. If this quantity is negative, then we similarly have

λmin(GTAG)(dcdk)1ΠGv2(ΠGv)TA(ΠGv),\lambda_{\min}(G^{T}AG)\leq(d-c\sqrt{d}\sqrt{k})\frac{1}{\left\|\Pi_{G}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v),

using the analogous bound on the smallest singular value of G.G.

Since GG is Gaussian, the image of GG is distributed with respect to the rotationally invariant measure on kk-dimensional subspaces. Therefore Lemma 22 applies, and the result follows after collecting terms, and using the assumption that kcdk\leq cd in the negative case. ∎

4.2 Lower bound on λmin(GTAG)\lambda_{\min}(G^{T}AG)

We follow a standard protocol for bounding the extreme eigenvalues of a random matrix. We first show that uT(GTAG)uu^{T}(G^{T}AG)u is reasonably large for a fixed vector uu with high probability. Then by taking a union bound over an ϵ\epsilon-net we upgrade this to a uniform bound over the sphere.

We require two additional tricks. Our lower bound on uT(GTAG)uu^{T}(G^{T}AG)u arises from Bernstein’s inequality, which is hampered by the existence of large eigenvalues of AA. Therefore in order to get a guarantee that holds with high enough probability, we first prune the large eigenvalues of A.A.

Second, the mesh size of our ϵ\epsilon-net needs to be inversely proportional to the Lipschitz constant of xxT(GTAG)xx\mapsto x^{T}(G^{T}AG)x as xx ranges over the sphere. A priori, the Lipschitz constant might be as bad as GTAGop\left\|G^{T}AG\right\|_{\text{op}}, which is typically larger than d\sqrt{d}. This would ultimately give rise to an additional log(d)\log(d) factor in the final sketching dimension. However we show that the Lipschitz constant is in fact bounded by O(k)O(k) with good probability, avoiding the need for any dd dependence in the sketching dimension.

Proposition 24.

Let QQ be a symmetric matrix, and let xx and yy be unit vectors. Then

|xTQxyTQy|2(λmax(Q)λmin(Q))xy\left|x^{T}Qx-y^{T}Qy\right|\leq 2(\lambda_{\max}(Q)-\lambda_{\min}(Q))\left\|x-y\right\|
Proof.

We first reduce to the 22-dimensional case. Let WW be a 22-dimensional subspace passing through xx and yy. The largest and smallest eigenvalues of the restriction to WW of the quadratic form associated to QQ are bounded from above and below by λmax(Q)\lambda_{\max}(Q) and λmin(Q)\lambda_{\min}(Q) respectively. It therefore suffices to prove the result when QQ has dimension 2.2.

Since the result we wish to show is invariant under shifting QQ by multiples of the identity, it suffices to consider the case when λ2=0.\lambda_{2}=0. After these reductions, we have

xTQxyTQy=λ1(x12y12)=λ1(x1+y1)(x1y1).x^{T}Qx-y^{T}Qy=\lambda_{1}(x_{1}^{2}-y_{1}^{2})=\lambda_{1}(x_{1}+y_{1})(x_{1}-y_{1}).

Since xx and yy are unit vectors, |x1+y1|2\left|x_{1}+y_{1}\right|\leq 2 and |x1y1|xy\left|x_{1}-y_{1}\right|\leq\left\|x-y\right\| and the result follows. ∎

Lemma 25.

Let S=GTAGS=G^{T}AG where Gd×kG\in\mathbb{R}^{d\times k} has iid 𝒩(0,1)\mathcal{N}(0,1) entries and AF=1.\left\|A\right\|_{F}=1. Then

λmax(S)λmin(S)t\lambda_{\max}(S)-\lambda_{\min}(S)\leq t

with probability at least 14k(k+2)t21-\frac{4k(k+2)}{t^{2}}.

Proof.

Consider the random quantity α=uTGTAGu\alpha=u^{T}G^{T}AGu, where uu is a random unit vector in k\mathbb{R}^{k}, independent from GG. Note that GuGu is distributed as a standard Gaussian, so α\alpha is a trace estimator for AA with variance 22 [AT11].

On the other hand, one can also study the variance of α\alpha conditional on GG by using Proposition 21. Let EE be the event that λmax(S)λmin(S)t\lambda_{\max}(S)-\lambda_{\min}(S)\geq t. If EE occurs too often, then Var(α)\operatorname{Var}(\alpha) would be too large. Specifically, in the notation of Proposition 21 when EE occurs, we necessarily have Var(λ1(S),,λk(S))1k(t/2)2\operatorname{Var}(\lambda_{1}(S),\ldots,\lambda_{k}(S))\geq\frac{1}{k}(t/2)^{2}, so Var(α|E)t22k(k+2).\operatorname{Var}(\alpha|E)\geq\frac{t^{2}}{2k(k+2)}. Thus we have

2=Var(α)Pr(E)Var(α|E)Pr(E)t22k(k+2),2=\operatorname{Var}(\alpha)\geq\Pr(E)\operatorname{Var}(\alpha|E)\geq\Pr(E)\frac{t^{2}}{2k(k+2)}, (16)

and so Pr(E)4k(k+2)/t2\Pr(E)\leq 4k(k+2)/t^{2} as desired. ∎

Lemma 26.

Suppose that AA is PSD with AF=1,\left\|A\right\|_{F}=1, and that vv consists of iid 𝒩(0,1)\mathcal{N}(0,1) entries. Then for t2kt\geq 2\sqrt{k} we have

Pr(vTAvTr(A)t)exp(ck(tk)).\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t)\leq\exp(-c\sqrt{k}(t-\sqrt{k})).
Proof.

Write AkA_{k} for the matrix retaining only the kk largest eigenvalues of AA, and Ak=AAk.A_{-k}=A-A_{k}. We have

Pr(vTAvTr(A)t)\displaystyle\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t) Pr(vTAkvTr(A)t)\displaystyle\leq\Pr(v^{T}A_{-k}v\leq\operatorname{Tr}(A)-t) (17)
=Pr(vTAkvTr(Ak)(tTr(Ak)))\displaystyle=\Pr(v^{T}A_{-k}v\leq\operatorname{Tr}(A_{-k})-(t-\operatorname{Tr}(A_{k}))) (18)

Note that vTAkvv^{T}A_{-k}v has expectation Tr(Ak)\operatorname{Tr}(A_{-k}), so by Bernstein’s inequality (or Hanson-Wright) [Ver18],

Pr(vTAvTr(A)t)exp(cmin((tTr(Ak))2AkF2,tTr(Ak)λmax(Ak))),\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t)\leq\exp\left(-c\min\left(\frac{(t-\operatorname{Tr}(A_{k}))^{2}}{\left\|A_{-k}\right\|_{F}^{2}},\frac{t-\operatorname{Tr}(A_{k})}{\lambda_{\max}(A_{-k})}\right)\right),

for tTr(Ak).t\geq\operatorname{Tr}(A_{k}).

Now note that AkFAF1\left\|A_{-k}\right\|_{F}\leq\left\|A\right\|_{F}\leq 1, and that λmax(Ak)1k,\lambda_{\max}(A_{-k})\leq\frac{1}{\sqrt{k}}, since AF=1.\left\|A\right\|_{F}=1. Additionally, Tr(Ak)kAkFk.\operatorname{Tr}(A_{k})\leq\sqrt{k}\left\|A_{k}\right\|_{F}\leq\sqrt{k}. These bounds imply that

(tTr(Ak))2AkF2(tk)2\frac{(t-\operatorname{Tr}(A_{k}))^{2}}{\left\|A_{-k}\right\|_{F}^{2}}\geq(t-\sqrt{k})^{2}

and

tTr(Ak)λmax(Ak)k(tk)\frac{t-\operatorname{Tr}(A_{k})}{\lambda_{\max}(A_{-k})}\geq\sqrt{k}(t-\sqrt{k})

for tTr(Ak).t\geq\operatorname{Tr}(A_{k}). When t>2kt>2\sqrt{k}, the latter expression is smaller, and the conclusion follows.

Theorem 27.

Suppose that AA is PSD with AF=1\left\|A\right\|_{F}=1, and that Gd×kG\in\mathbb{R}^{d\times k} has iid 𝒩(0,1)\mathcal{N}(0,1) entries and that k5k\geq 5. Then with at least 0.990.99 probability,

λmin(GTAG)Tr(A)cklog(k)\lambda_{\min}(G^{T}AG)\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k)

for some absolute constant cc.

Proof.

For any fixed unit vector uku\in\mathbb{R}^{k}, GuGu is distributed as a standard Gaussian, and so Lemma 26 applies. Therefore, for a suitable choice of the constant cc,

uT(GTAG)uTr(A)cklog(k)u^{T}(G^{T}AG)u\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k) (19)

with probability at least 1exp(10klogk).1-\exp(-10k\log k).

Let 𝒩\mathcal{N} be a net for the sphere in k\mathbb{R}^{k} with mesh size 1/k,1/k, which can be taken to have at most (3k)k(3k)^{k} elements [Ver18]. By taking a union bound, (19) holds over 𝒩\mathcal{N} with probability at least

1(3k)kexp(10klogk)1exp(k)1-(3k)^{k}\exp(-10k\log k)\geq 1-\exp(-k)

for k2.k\geq 2.

By choosing t=100kt=100k in Lemma 25, and applying Proposition 24, we get that

|xT(GTAG)xyT(GTAG)y|200kxy\left|x^{T}(G^{T}AG)x-y^{T}(G^{T}AG)y\right|\leq 200k\left\|x-y\right\|

with probability at least 0.9990.999. Since 𝒩\mathcal{N} has mesh size 1/k1/k, we have that

uT(GTAG)uTr(A)cklog(k)O(1)u^{T}(G^{T}AG)u\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k)-O(1)

for all unit vectors uu in k\mathbb{R}^{k} with probability at least 0.999exp(k).0.999-\exp(-k).

As a consequence of Theorem 27 and Lemma 23 we obtain our main result.

Theorem 28.

There is a bilinear sketch GTAGG^{T}AG with sketching dimension k=O(1ϵ2log21ϵ)k=O(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}) that yields a two-sided 2\ell_{2}-tester that is correct with at least 0.90.9 probability.

Proof.

If λmin(GTAG)<0\lambda_{\min}(G^{T}AG)<0 then we automatically reject. Otherwise we first use O(1)O(1) columns of the sketching matrices to compute an estimate α\alpha of Tr(A)\operatorname{Tr}(A) to within an additive error of AF\left\|A\right\|_{F} with 0.010.01 failure probability. We then use another O(1)O(1) columns of the sketching matrices to construct an approximation β\beta of AF\left\|A\right\|_{F} with 12AFβ2AF\frac{1}{2}\left\|A\right\|_{F}\leq\beta\leq 2\left\|A\right\|_{F} with 0.010.01 failure probability (see for example [MSW19]).

Now consider the quantity

γ=αλmin(GTAG)βklogk.\gamma=\frac{\alpha-\lambda_{\min}(G^{T}AG)}{\beta\sqrt{k}\log k}.

If AA is PSD, then by Theorem 27,

λmin(GTAG)Tr(A)cklogkAF,\lambda_{\min}(G^{T}AG)\geq\operatorname{Tr}(A)-c\sqrt{k}\log k\left\|A\right\|_{F},

which implies that

γcpsd,\gamma\leq c_{\text{psd}},

for some absolute constant cpsd.c_{\text{psd}}.

On the other hand, if AA has a negative eigenvalue less than or equal to ϵ-\epsilon, then by Lemma 23

λmin(GTAG)0.4ϵkAF+(1+ckd)Tr(A)+O(1)AF,\lambda_{\min}(G^{T}AG)\leq-0.4\epsilon k\left\|A\right\|_{F}+\left(1+c\frac{\sqrt{k}}{\sqrt{d}}\right)\operatorname{Tr}(A)+O(1)\left\|A\right\|_{F},

which implies that

γcfar(ϵkc)/logk,\gamma\geq c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k,

for some absolute constant cfar.c_{\text{far}}.

Finally by taking k=Θ(1ϵ2log21ϵ)k=\Theta(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}), we have cpsd<cfar(ϵkc)/logkc_{\text{psd}}<c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k, which implies that the tester is correct if it outputs True precisely when γcpsd.\gamma\leq c_{\text{psd}}.

Note that this result immediately gives a non-adaptive vector-matrix-vector tester which makes O~(1/ϵ4)\widetilde{O}(1/\epsilon^{4}) queries.

4.3 Application to adaptive vector-matrix-vector queries

By combining our bilinear sketch with Theorem 2 we achieve tight bounds for adaptive queries.

Theorem 29.

There is a two-sided adaptive 2\ell_{2}-tester in the vector-matrix-vector model, which makes O~(1/ϵ2)\widetilde{O}(1/\epsilon^{2}) queries.

Proof.

To handle the technical condition in Lemma 23, we first compute xTAxx^{T}Ax for a constant number of independent Gaussian vectors x.x. If xTAxx^{T}Ax is ever negative, then we automatically reject.

We showed in the proof of Theorem 28 that with 0.90.9 probability, γcpsd\gamma\leq c_{psd} if AA is PSD, and γcfar(ϵkc)/logk:=Cfar(k)\gamma\geq c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k:=C_{\text{far}}(k) if AA is ϵ\epsilon-far from PSD. By choosing some k=O(1ϵ2log21ϵ)k=O(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}) we can arrange for Cfar(k)cpsd1,C_{\text{far}}(k)-c_{\text{psd}}\geq 1, and also for Cfar(k)=Θ(1).C_{\text{far}}(k)=\Theta(1).

Next we compute estimates α\alpha and β\beta of Tr(A)\operatorname{Tr}(A) and AF\left\|A\right\|_{F} as above, and (implicitly) form the matrix

Γ=GTAGαIkβklogk+Cfar(k)IkIk.\Gamma=\frac{G^{T}AG-\alpha I_{k}}{\beta\sqrt{k}\log k}+C_{\text{far}}(k)I_{k}-I_{k}.

If AA is PSD, then with very good probability,

λmin(Γ)=γ+Cfar(k)1cpsd+Cfar(k)10.\lambda_{\text{min}}(\Gamma)=-\gamma+C_{\text{far}}(k)-1\geq-c_{\text{psd}}+C_{\text{far}}(k)-1\geq 0.

Similarly, if AA is ϵ\epsilon-far from PSD, then

λmin(Γ)=γ+Cfar(k)1Cfar(k)+Cfar(k)1=1.\lambda_{\text{min}}(\Gamma)=-\gamma+C_{\text{far}}(k)-1\leq-C_{\text{far}}(k)+C_{\text{far}}(k)-1=-1.

Thus it suffices to distinguish Γ\Gamma being PSD from Γ\Gamma having a negative eigenvalue less than or equal to 1.-1.

For this we will utilize our adaptive 1\ell_{1}-tester, so we must bound Γ1.\left\|\Gamma\right\|_{1}. Note that Tr(GTAG)\operatorname{Tr}(G^{T}AG) is an estimator for kTr(A)k\operatorname{Tr}(A) with variance O(k)AF2O(k)\left\|A\right\|_{F}^{2} [AN13]. Therefore Tr(GTAG)=k(Tr(A)±O(1)AF).\operatorname{Tr}(G^{T}AG)=k(\operatorname{Tr}(A)\pm O(1)\left\|A\right\|_{F}). Define M=1β(GTAGαIk)M=\frac{1}{\beta}(G^{T}AG-\alpha I_{k}), so that Tr(M)=O(k).\operatorname{Tr}(M)=O(k). The negative eigenvalues of MM sum to at most k|λmin(M)|k\left|\lambda_{\min}(M)\right| in magnitude, and so the bound on Tr(M)\operatorname{Tr}(M) implies that M12k|λmin(M)|+O(k).\left\|M\right\|_{1}\leq 2k\left|\lambda_{\min}(M)\right|+O(k). Write Γ=1klogkM+Cfar(k)II\Gamma=\frac{1}{\sqrt{k}\log k}M+C_{\text{far}}(k)I-I, so that Γ11klogkM1+O(k).\left\|\Gamma\right\|_{1}\leq\frac{1}{\sqrt{k}\log k}\left\|M\right\|_{1}+O(k). Note that |λmin(M)|klogk|λmin(Γ)|+O(klogk).\left|\lambda_{\min}(M)\right|\leq\sqrt{k}\log k\left|\lambda_{\min}(\Gamma)\right|+O(\sqrt{k}\log k). Therefore Γ12k|λmin(Γ)|+O(k).\left\|\Gamma\right\|_{1}\leq 2k\left|\lambda_{\min}(\Gamma)\right|+O(k). From this we have

1|λmin(Γ)|Γ12k(|λmin(Γ)|+O(1)|λmin(Γ)|)+O(k)|λmin(Γ)|O(k)\frac{1}{\left|\lambda_{\min}(\Gamma)\right|}\left\|\Gamma\right\|_{1}\leq 2k\left(\frac{\left|\lambda_{\min}(\Gamma)\right|+O(1)}{\left|\lambda_{\min}(\Gamma)\right|}\right)+\frac{O(k)}{\left|\lambda_{\min}(\Gamma)\right|}\leq O(k)

as long as |λmin(Γ)|Ω(1)|\lambda_{\min}(\Gamma)|\geq\Omega(1), which it is by assumption.

Therefore Theorem 2 gives an adaptive vector-matrix-vector tester for Γ\Gamma which requires only O~(k)=O~(1ϵ2)\widetilde{O}(k)=\widetilde{O}(\frac{1}{\epsilon^{2}}) queries.

As a consequence we also obtain a two-sided p\ell_{p}-tester for all p2.p\geq 2.

Corollary 30.

For p2p\geq 2, there is a two-sided adaptive p\ell_{p}-tester in the vector-matrix-vector model, which makes O~(1/ϵ2)d12/p\widetilde{O}(1/\epsilon^{2})d^{1-2/p} queries.

Proof.

Apply Theorem 29 along with the bound Apd1p12AF.\left\|A\right\|_{p}\geq d^{\frac{1}{p}-\frac{1}{2}}\left\|A\right\|_{F}.

4.4 Lower bounds for two-sided testers

Our lower bounds for two-sided testers come from the spiked Gaussian model introduced in [LW16]. As before, our adaptive lower bounds will come as a consequence of the corresponding non-adaptive bounds.

Theorem 31.

A two-sided p\ell_{p}-tester that makes non-adaptive vector-matrix-vector queries requires at least

  • Ω(1ϵ2p)\Omega(\frac{1}{\epsilon^{2p}}) queries for 1p21\leq p\leq 2

  • Ω(1ϵ4d24/p)\Omega(\frac{1}{\epsilon^{4}}d^{2-4/p}) queries for 2<p<2<p<\infty as long as dd can be taken to be Ω(1/ϵp).\Omega(1/\epsilon^{p}).

  • Ω(d2)\Omega(d^{2}) queries for p=.p=\infty.

Proof.

First, take GG to be a d×dd\times d matrix with 𝒩(0,1)\mathcal{N}(0,1) entries, where dd will be chosen later as a function of ϵ\epsilon and p.p. Also let G~=G+suvT\widetilde{G}=G+suv^{T} where uu and vv have 𝒩(0,1)\mathcal{N}(0,1) entries, and ss is to be chosen later. We will show that a PSD-tester can be used to distinguish GG and G~\widetilde{G}, while this is hard for any algorithm that uses only a linear sketch.

Recall that GG has spectral norm at most cdc\sqrt{d} with probability at least 12ed,1-2e^{-d}, where cc is an absolute constant. (We will use cc throughout to indicate absolute constants that we do not track – it may have different values between uses, even within the same equation.) Set

Gsym=[0GGT0]G_{\text{sym}}=\begin{bmatrix}0&G\\ G^{T}&0\end{bmatrix} (20)

and define G~sym\widetilde{G}_{\text{sym}} similarly. Note that the eigenvalues of GsymG_{\text{sym}} are precisely {±σi}\{\pm\sigma_{i}\} where the σi\sigma_{i} are singular values of G.G. Therefore Gsym+cdIG_{\text{sym}}+c\sqrt{d}I is PSD with high probability.

On the other hand, uvTcd\left\|uv^{T}\right\|\geq cd with high probability so

G~suvTGcsdcd,\left\|\widetilde{G}\right\|\geq s\left\|uv^{T}\right\|-\left\|G\right\|\geq csd-c\sqrt{d}, (21)

which implies that G~sym+cdI\widetilde{G}_{\text{sym}}+c\sqrt{d}I has a negative eigenvalue with magnitude at least csdcdcd=csdcd.csd-c\sqrt{d}-c\sqrt{d}=csd-c\sqrt{d}.

We also have that G~sympcdd1/p,\left\|\widetilde{G}_{\text{sym}}\right\|_{p}\leq c\sqrt{d}d^{1/p}, since the operator norm of GG is bounded by cdc\sqrt{d} with high probability. Hence if

csdddd1/p=csd1d1/pϵ,c\frac{sd-\sqrt{d}}{\sqrt{d}d^{1/p}}=c\frac{s\sqrt{d}-1}{d^{1/p}}\geq\epsilon, (22)

then a two-sided PSD-tester can distinguish between GG and G~\widetilde{G} with constant probability of failure.

On the other hand, Theorem 4 in [LW16] implies that any sketch that distinguishes these distributions with constant probability must have sketching dimension at least c/s4.c/s^{4}.

It remains to choose values of ss and dd for which the inequality in equation (22) holds. When 1p21\leq p\leq 2, we take d=Θ(ϵp)d=\Theta(\epsilon^{-p}) and s=O(ϵp/2)s=O(\epsilon^{p/2}) giving a lower bound of Ω(1/ϵ2p).\Omega(1/\epsilon^{2p}). When 2<p<2<p<\infty, we take d=Ω(1/ϵp)d=\Omega(1/\epsilon^{p}) and s=O(ϵd1/p1/2)s=O(\epsilon d^{1/p-1/2}) giving a lower bound of Ω(1ϵ4d24/p).\Omega(\frac{1}{\epsilon^{4}}d^{2-4/p}). Finally, when p=p=\infty we take s=O(1/d)s=O(1/\sqrt{d}) giving a lower bound of Ω(d2).\Omega(d^{2}).

Remark 32.

The argument above applies equally well to arbitrary linear sketches, of which a series of non-adaptive vector-matrix-vector queries is a special case.

Corollary 33.

A two-sided adaptive p\ell_{p}-tester in the vector-matrix-vector model requires at least

  • Ω(1ϵp)\Omega(\frac{1}{\epsilon^{p}}) queries for 1p21\leq p\leq 2

  • Ω(1ϵ2d12/p)\Omega(\frac{1}{\epsilon^{2}}d^{1-2/p}) queries for 2<p<2<p<\infty as long as dd can be taken to be Ω(1/ϵp).\Omega(1/\epsilon^{p}).

  • Ω(d)\Omega(d) queries for p=.p=\infty.

Proof.

Apply Theorem 31 along with Lemma 14. ∎

For adaptive measurements, we supply a second proof via communication complexity which has the advantage of applying to general linear measurements, albeit at the cost of an additional bit complexity term.

Proposition 34.

Let p[1,)p\in[1,\infty), ϵ<12\epsilon<\frac{1}{2} and d(p/ϵ)pd\geq(p/\epsilon)^{p}. An adaptive two-sided p\ell_{p}-tester taking general linear measurements Mi,A\left\langle M_{i},A\right\rangle of AA, where each MiM_{i} has integer entries in (2b1,2b1](-2^{b-1},2^{b-1}], must make at least cb+logd1ϵ2d12/p\frac{c}{b+\log d}\frac{1}{\epsilon^{2}}d^{1-2/p} queries.

Proof.

We reduce from the multiplayer set disjointness problem [KPW21]. Let the kk players have sets S1,,Sk[d]S_{1},\ldots,S_{k}\subseteq[d] which either are (i) all pairwise disjoint or (ii) all share precisely one common element. Distinguishing between (i) and (ii) with 2/32/3 probability in the blackboard model of communication requires Ω(d/k)\Omega(d/k) bits. We will choose k=max(4ϵd1/p,4p)=4ϵd1/p.k=\left\lceil\max(4\epsilon d^{1/p},4p)\right\rceil=\lceil 4\epsilon d^{1/p}\rceil.

For each ii let χi\chi_{i} be the characteristic vector of SiS_{i} and let Ai=diag(χi).A_{i}=\operatorname{diag}(\chi_{i}). Consider the matrix A:=Idi=1kAi.A:=I_{d}-\sum_{i=1}^{k}A_{i}.

In situation (i), AA is PSD. In situation (ii), Appkp+d\left\|A\right\|_{p}^{p}\leq k^{p}+d and λmin(A)=(k1).\lambda_{\min}(A)=-(k-1). We have

|λmin(A)|p=(k1)p=kp(11k)pkp(1pk)34kp\left|\lambda_{\min}(A)\right|^{p}=(k-1)^{p}=k^{p}\left(1-\frac{1}{k}\right)^{p}\geq k^{p}\left(1-\frac{p}{k}\right)\geq\frac{3}{4}k^{p}

and

ϵpAppϵp(kp+d)=ϵpkp+ϵpd12kp+ϵpd34kp.\epsilon^{p}\left\|A\right\|_{p}^{p}\leq\epsilon^{p}(k^{p}+d)=\epsilon^{p}k^{p}+\epsilon^{p}d\leq\frac{1}{2}k^{p}+\epsilon^{p}d\leq\frac{3}{4}k^{p}.

Given query access to AA, an p\ell_{p}-tester can therefore distinguish between (i) and (ii) with 2/32/3 probability. Note that a single linear measurement M,A\left\langle M,A\right\rangle may be simulated in the blackboard model using O(k(b+logd))O(k(b+\log d)) bits; each player simply computes and communicates M,Ai\left\langle M,A_{i}\right\rangle, an integer of magnitude at most d2b1d2^{b-1}, and the players add the resulting measurements. The tester must therefore make at least Ω(1k(b+logd)dk)\Omega(\frac{1}{k(b+\log d)}\cdot\frac{d}{k}) queries to solve the PSD-testing problem. ∎
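For concreteness, a toy instantiation of the reduction (ours) confirms the two spectral cases:

    import numpy as np

    def disjointness_matrix(sets, d):
        # A = I_d - sum_i diag(chi_i), as in the reduction above.
        A = np.eye(d)
        for S in sets:
            idx = list(S)
            A[idx, idx] -= 1.0
        return A

    d = 12
    disjoint = [{0, 1}, {2, 3}, {4}]        # case (i): pairwise disjoint
    shared = [{0, 1}, {0, 3}, {0, 4}]       # case (ii): common element 0
    print(np.linalg.eigvalsh(disjointness_matrix(disjoint, d)).min())  # 0.0 (PSD)
    print(np.linalg.eigvalsh(disjointness_matrix(shared, d)).min())    # -2 = -(k-1)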

5 Spectrum Estimation

We make use of the following result, which is Lemma 11 of [CW17a] specialized to our setting.

Lemma 35.

For a symmetric matrix Ad×dA\in\mathbb{R}^{d\times d}, there is a distribution over an oblivious sketching matrix Rd×mR\in\mathbb{R}^{d\times m} with m=O(kϵ)m=O(\frac{k}{\epsilon}) so that with at least 0.90.9 probability,

minYrankk,PSD(AR)Y(AR)TAF2(1+ϵ)Ak,+AF2,\min_{Y^{*}\in\,\textup{rank}\,k,\textup{PSD}}\left\|(AR)Y^{*}(AR)^{T}-A\right\|_{F}^{2}\leq(1+\epsilon)\left\|A_{k,+}-A\right\|_{F}^{2}, (23)

where Ak,+A_{k,+} is the optimal rank-kk PSD approximation to AA in Frobenius norm.

Remark 36.

In our setting one can simply take RR to be Gaussian since the guarantee above must hold when AA is drawn from a rotationally invariant distribution. In many situations, structured or sparse matrices are useful, but we do not need this here.

We also recall the notion of an affine embedding [CW17].

Definition 37.

SS is an affine embedding for matrices AA and BB if for all matrices XX of the appropriate dimensions, we have

S(AXB)F2=(1±ϵ)AXBF2.\left\|S(AX-B)\right\|_{F}^{2}=(1\pm\epsilon)\left\|AX-B\right\|_{F}^{2}. (24)

We also recall that when AA is promised to have rank at most rr, there is a distribution over SS with O(ϵ2r)O(\epsilon^{-2}r) rows such that (24) holds with constant probability for any choice of AA and BB [CW17].

Lemma 38.

There is an algorithm which makes O(k2ϵ6log1δ)O(\frac{k^{2}}{\epsilon^{6}}\log\frac{1}{\delta}) vector-matrix-vector queries to AA and with at least 1δ1-\delta probability outputs an approximation of Ak,+F2\left\|A_{k,+}\right\|_{F}^{2}, accurate to within ϵAF2\epsilon\left\|A\right\|_{F}^{2} additive error.

Proof.

We run two subroutines in parallel.

Subroutine 1. Approximate Ak,+AF2\left\|A_{k,+}-A\right\|_{F}^{2} up to O(ϵ)O(\epsilon) multiplicative error.

Our algorithm first draws affine embedding matrices S1S_{1} and S2S_{2} for r=k/ϵr=k/\epsilon, and with ϵ\epsilon distortion, each with O(kϵ3)O(\frac{k}{\epsilon^{3}}) rows. We also draw a matrix RR as in Lemma 35 with m=O(kϵ)m=O(\frac{k}{\epsilon}) columns.

We then compute S1ARS_{1}AR and S2ARS_{2}AR, each requiring k2ϵ4\frac{k^{2}}{\epsilon^{4}} vector-matrix-vector queries, and compute S1AS2TS_{1}AS_{2}^{T} requiring k2ϵ6\frac{k^{2}}{\epsilon^{6}} queries.

Let YkY_{k} be arbitrary with the appropriate dimensions (later we will optimize YkY_{k} over rank kk PSD matrices). By using the affine embedding property along with the fact that RR has rank at most kϵ\frac{k}{\epsilon}, we have

(S1AR)Yk(S2AR)TS1AS2TF2\displaystyle\left\|(S_{1}AR)Y_{k}(S_{2}AR)^{T}-S_{1}AS_{2}^{T}\right\|_{F}^{2} =(1±ϵ)ARYk(S2AR)TAS2TF2\displaystyle=(1\pm\epsilon)\left\|ARY_{k}(S_{2}AR)^{T}-AS_{2}^{T}\right\|_{F}^{2}
=(1±ϵ)S2ARYkRTAS2AF2\displaystyle=(1\pm\epsilon)\left\|S_{2}ARY_{k}R^{T}A-S_{2}A\right\|_{F}^{2}
=(1±3ϵ)ARYkRTAAF2.\displaystyle=(1\pm 3\epsilon)\left\|ARY_{k}R^{T}A-A\right\|_{F}^{2}.

As a consequence of this, and the property held by RR, we have

minrk(Yk)k,YkPSD(S1AR)Yk(S2AR)TS1AS2TF2\displaystyle\min_{\operatorname{rk}(Y_{k})\leq k,Y_{k}\,\text{PSD}}\left\|(S_{1}AR)Y_{k}(S_{2}AR)^{T}-S_{1}AS_{2}^{T}\right\|_{F}^{2} =(1±3ϵ)minYkARYkRTAAF2\displaystyle=(1\pm 3\epsilon)\min_{Y_{k}}\left\|ARY_{k}R^{T}A-A\right\|_{F}^{2} (25)
=(1±7ϵ)Ak,+AF2.\displaystyle=(1\pm 7\epsilon)\left\|A_{k,+}-A\right\|_{F}^{2}. (26)

Thus by computing the quantity on the left-hand side above, our algorithm computes an O(ϵ)O(\epsilon) multiplicative approximation using O(k2/ϵ6)O(k^{2}/\epsilon^{6}) vector-matrix-vector queries.

Subroutine 2. Approximate AF2\left\|A\right\|_{F}^{2} up to O(ϵ)O(\epsilon) multiplicative error.

We simply apply Theorem 2.2 of [MSW19] with q=2,q=2, and note that the entries of the sketch correspond to vector-matrix-vector products. By their bound we require O(ϵ2log(1/ϵ))O(\epsilon^{-2}\log(1/\epsilon)) vector-matrix-vector queries.

Since Ak,+F2=AF2Ak,+AF2\left\|A_{k,+}\right\|_{F}^{2}=\left\|A\right\|_{F}^{2}-\left\|A_{k,+}-A\right\|_{F}^{2}, we obtain an additive O(ϵ)AF2O(\epsilon)\left\|A\right\|_{F}^{2} approximation to Ak,+F2\left\|A_{k,+}\right\|_{F}^{2} by running the two subroutines above and subtracting their results.

Finally, by repeating the above procedure O(log1δ)O(\log\frac{1}{\delta}) times in parallel and taking the median of the trials, we obtain a failure probability of at most δ.\delta. ∎

The matrices S1ARS_{1}AR and S2ARS_{2}AR in Subroutine 1 each have rank at most k/ϵk/\epsilon whereas S1AS2TS_{1}AS_{2}^{T} has dimensions k/ϵ3×k/ϵ3.k/\epsilon^{3}\times k/\epsilon^{3}. The matrix S1AS2TS_{1}AS_{2}^{T} therefore contains a large amount of data that will not play a role when optimizing over YkY_{k}. If S1ARS_{1}AR and S2ARS_{2}AR were known ahead of time, then we could choose to compute only the portion of S1AS2TS_{1}AS_{2}^{T} that is relevant to the optimization step, and simply estimate the Frobenius error incurred by the rest. This allows us to construct a slightly more efficient two-pass protocol.

Proposition 39.

By using a single round of adaptivity, the guarantee of Lemma 38 may be achieved using O(k2ϵ4log1δ)O(\frac{k^{2}}{\epsilon^{4}}\log\frac{1}{\delta}) vector-matrix-vector queries.

Proof.

As described above, we modify Subroutine 1. Write MiM_{i} for SiARS_{i}AR and QQ for S1AS2T.S_{1}AS_{2}^{T}. Instead of computing M1M_{1}, M2,M_{2}, and QQ at once, we instead compute M1M_{1} and M2M_{2} first using k2/ϵ4k^{2}/\epsilon^{4} vector-matrix-vector queries.

We wish to estimate minYkM1YkM2TQF2\min_{Y_{k}}\left\|M_{1}Y_{k}M_{2}^{T}-Q\right\|_{F}^{2}, where the minimum is over PSD matrices YkY_{k} of rank at most kk. Let Πi\Pi_{i} denote orthogonal projection onto the image of MiM_{i}, and set Πi=IΠi.\Pi_{i}^{\perp}=I-\Pi_{i}. Then for fixed YY, we use the Pythagorean theorem to write

M1YM2TQF2\displaystyle\left\|M_{1}YM_{2}^{T}-Q\right\|_{F}^{2} =Π1M1YM2TΠ2QF2\displaystyle=\left\|\Pi_{1}M_{1}YM_{2}^{T}\Pi_{2}-Q\right\|_{F}^{2} (27)
=Π1(M1YM2TQ)Π2Π1QΠ2Π1QΠ2Π1QΠ2F2\displaystyle=\left\|\Pi_{1}(M_{1}YM_{2}^{T}-Q)\Pi_{2}-\Pi_{1}^{\perp}Q\Pi_{2}-\Pi_{1}Q\Pi_{2}^{\perp}-\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2} (28)
=Π1(M1YM2TQ)Π2F2+Π1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2\displaystyle=\left\|\Pi_{1}(M_{1}YM_{2}^{T}-Q)\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}Q\Pi_{2}^{\perp}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2} (29)
=M1YM2TΠ1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2.\displaystyle=\left\|M_{1}YM_{2}^{T}-\Pi_{1}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}Q\Pi_{2}^{\perp}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2}. (30)

Note that each of the last three terms can be estimated to within O(ϵ)O(\epsilon) multiplicative error using Subroutine 2, since a vector-matrix-vector query to one of these matrices may be simulated with a single query to AA. Also since each MiM_{i} has rank O(k/ϵ)O(k/\epsilon), the Πi\Pi_{i}’s are projections onto O(k/ϵ)O(k/\epsilon) dimensional subspaces. Since the Πi\Pi_{i}’s are known to the algorithm, we may compute Π1QΠ2\Pi_{1}Q\Pi_{2} explicitly using k2/ϵ2k^{2}/\epsilon^{2} vector-matrix-vector queries, as it suffices to query Π1QΠ2\Pi_{1}Q\Pi_{2} over the Cartesian product of bases for the images of Π1\Pi_{1} and Π2.\Pi_{2}. By optimizing the first term over YY, we thus obtain an O(ϵ)O(\epsilon) multiplicative approximation to minYkM1YkM2TQF2\min_{Y_{k}}\left\|M_{1}Y_{k}M_{2}^{T}-Q\right\|_{F}^{2} as desired. This gives a version of Subroutine 1 that makes O(k2/ϵ4)O(k^{2}/\epsilon^{4}) queries. ∎
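The orthogonal decomposition in (27)-(30) is easy to confirm numerically; the following check (ours) uses small random matrices and exact projections:

    import numpy as np

    rng = np.random.default_rng(3)
    M1, M2 = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
    Q, Y = rng.standard_normal((9, 9)), rng.standard_normal((4, 4))
    P1 = M1 @ np.linalg.pinv(M1)            # projection onto the image of M1
    P2 = M2 @ np.linalg.pinv(M2)
    I = np.eye(9)
    lhs = np.linalg.norm(M1 @ Y @ M2.T - Q, 'fro') ** 2
    rhs = (np.linalg.norm(M1 @ Y @ M2.T - P1 @ Q @ P2, 'fro') ** 2
           + np.linalg.norm((I - P1) @ Q @ P2, 'fro') ** 2
           + np.linalg.norm(P1 @ Q @ (I - P2), 'fro') ** 2
           + np.linalg.norm((I - P1) @ Q @ (I - P2), 'fro') ** 2)
    assert np.isclose(lhs, rhs)             # the four pieces are orthogonal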

We note that we immediately obtain a poly(1/ϵ)\text{poly}(1/\epsilon) query 2\ell_{2}-tester by applying Lemma 38 to approximate A1,F2.\left\|A_{1,-}\right\|_{F}^{2}. However this yields a worse ϵ\epsilon dependence than Theorem 28. Perhaps more interestingly, these techniques also give a way to approximate the top kk (in magnitude) eigenvalues of AA while preserving their signs. We note a minor caveat. If λk\lambda_{k} and λk+1\lambda_{k+1} are very close in magnitude, but have opposite signs, then we cannot guarantee that we approximate λk\lambda_{k}. Therefore in the statement below, we only promise to approximate eigenvalues with magnitude at least |λk|+2ϵ|\lambda_{k}|+2\epsilon.

Theorem 40.

Let λ1,λ2,\lambda_{1},\lambda_{2},\ldots be the (signed) eigenvalues of AA sorted in decreasing order of magnitude.

There is an algorithm that makes O(k2ϵ12logk)O(\frac{k^{2}}{\epsilon^{12}}\log k) non-adaptive vector-matrix-vector queries to AA, and with probability at least 0.90.9, outputs λ~1,,λ~k\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{k} such that

(i) There exists a permutation σ\sigma on [k][k] so that for all ii with |λi||λk|+2ϵ|\lambda_{i}|\geq|\lambda_{k}|+2\epsilon, |λ~σ(i)λi|ϵAF|\tilde{\lambda}_{\sigma(i)}-\lambda_{i}|\leq\epsilon\left\|A\right\|_{F}

(ii) For all ii, there exists jj with |λj||λk|ϵ|\lambda_{j}|\geq|\lambda_{k}|-\epsilon and |λ~iλj|ϵAF|\tilde{\lambda}_{i}-\lambda_{j}|\leq\epsilon\left\|A\right\|_{F}

With one additional round of adaptivity the number of measurements can be reduced to O(k2ϵ8logk).O(\frac{k^{2}}{\epsilon^{8}}\log k).

Proof.

We set δ=120k\delta=\frac{1}{20k} in Lemma 38 and use it to approximate A1,+F2,,Ak,+F2\left\|A_{1,+}\right\|_{F}^{2},\ldots,\left\|A_{k,+}\right\|_{F}^{2}, along with A1,F2,,Ak,F2\left\|A_{1,-}\right\|_{F}^{2},\ldots,\left\|A_{k,-}\right\|_{F}^{2}, each to within (ϵ2/2)AF2(\epsilon^{2}/2)\left\|A\right\|_{F}^{2} additive error. Note that we may use the same sketching matrices for each of these 2k2k tasks, and then take a union bound to obtain a failure probability of at most 0.10.1. Thus we require only O(k2ϵ12logk)O(\frac{k^{2}}{\epsilon^{12}}\log k) queries in total. With an additional round of adaptivity, Proposition 39 reduces this bound to O(k2ϵ8logk).O(\frac{k^{2}}{\epsilon^{8}}\log k).

Let λi,+\lambda_{i,+} be the ithi^{\text{th}} largest positive eigenvalue of AA if it exists, and 0 otherwise. Define λi,\lambda_{i,-} similarly. Note that λi,+2=Ai,+F2Ai1,+F2\lambda_{i,+}^{2}=\left\|A_{i,+}\right\|_{F}^{2}-\left\|A_{i-1,+}\right\|_{F}^{2} for i2i\geq 2, and that λ1,+2=A1,+F2\lambda_{1,+}^{2}=\left\|A_{1,+}\right\|_{F}^{2}. This allows us to compute approximations λ~i,+0\tilde{\lambda}_{i,+}\geq 0 such that |λ~i,+2λi,+2|ϵ2AF2|\tilde{\lambda}_{i,+}^{2}-\lambda_{i,+}^{2}|\leq\epsilon^{2}\left\|A\right\|_{F}^{2}, and similarly for the λi,\lambda_{i,-}’s with λ~i,0\tilde{\lambda}_{i,-}\leq 0. Note that this bound implies |λ~i,+λi,+|ϵAF.|\tilde{\lambda}_{i,+}-\lambda_{i,+}|\leq\epsilon\left\|A\right\|_{F}.

Our algorithm then simply returns the kk largest magnitude elements of {λ~1,+,,λ~k,+,λ~1,,,λ~k,}.\{\tilde{\lambda}_{1,+},\ldots,\tilde{\lambda}_{k,+},\tilde{\lambda}_{1,-},\ldots,\tilde{\lambda}_{k,-}\}.
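The recovery step can be phrased as the following small routine (our illustration; it consumes the approximate prefix norms ||A_{i,+}||_F^2 and ||A_{i,-}||_F^2 and is agnostic to how they were estimated):

    import numpy as np

    def signed_topk(pos_prefix2, neg_prefix2, k):
        # pos_prefix2[i] approximates ||A_{i+1,+}||_F^2; successive differences
        # recover the squared eigenvalues, and clipping guards against noise
        # making a difference slightly negative.
        pos = np.sqrt(np.maximum(np.diff(np.concatenate(([0.0], pos_prefix2))), 0.0))
        neg = -np.sqrt(np.maximum(np.diff(np.concatenate(([0.0], neg_prefix2))), 0.0))
        cand = np.concatenate((pos, neg))
        return cand[np.argsort(-np.abs(cand))][:k]

    lam = np.array([3.0, -2.5, 2.0, -1.0, 0.5])     # true spectrum, for a test
    k = 3
    pos2 = np.cumsum(np.sort(lam[lam > 0])[::-1][:k] ** 2)   # exact prefix norms
    neg2 = np.cumsum(np.sort(-lam[lam < 0])[::-1][:k] ** 2)
    print(signed_topk(pos2, neg2, k))               # [ 3.  -2.5  2. ]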

6 Non-adaptive testers

6.1 Non-adaptive vector-matrix-vector queries

We gave a lower bound for one-sided testers earlier in Theorem 13. Here we observe that the sketch of Andoni and Nguyen [AN13] provides a matching upper bound.

Proposition 41.

There is a one-sided non-adaptive 1\ell_{1}-tester that makes O(1/ϵ2)O(1/\epsilon^{2}) vector-matrix-vector queries to AA.

Proof.

We simply apply Proposition 8. Note that the sketch is of the form GTAGG^{T}AG, where Gd×mG\in\mathbb{R}^{d\times m} with m=O(1/ϵ)m=O(1/\epsilon) in our case. Each entry of GTAGG^{T}AG, of which there are m2m^{2}, can be computed with a single vector-matrix-vector query. ∎

Corollary 42.

There is a one-sided non-adaptive p\ell_{p}-tester that makes O(1ϵ2d22/p)O(\frac{1}{\epsilon^{2}}d^{2-2/p}) vector-matrix-vector queries to AA.

Proof.

Apply the previous proposition along with the bound Apd1/p1A1.\left\|A\right\|_{p}\geq d^{1/p-1}\left\|A\right\|_{1}.

6.2 Non-adaptive matrix-vector queries

As a simple corollary of the algorithm given by Corollary 42 we have the following.

Proposition 43.

There exists a one-sided non-adaptive p\ell_{p}-tester making O(1ϵd11/p)O(\frac{1}{\epsilon}d^{1-1/p}) matrix-vector queries.

Proof.

Simply note that a k×kk\times k bilinear sketch may be simulated with kk matrix-vector queries. ∎

We next show that this bound is tight. While we consider the case where the tester queries the standard basis vectors, this is done essentially without loss of generality as any non-adaptive tester may be implemented by querying on an orthonormal set.

Proposition 44.

Suppose that a one-sided matrix-vector tester queries on the standard basis vectors e1,,eke_{1},\ldots,e_{k} and outputs False. Let UU be the top k×kk\times k submatrix of [Ae1,,Aek].[Ae_{1},\ldots,Ae_{k}]. Then if UU is non-singular, there must exist a “witness vector” vspan(e1,,ek)v\in\operatorname{span}(e_{1},\ldots,e_{k}) such that vTAv<0v^{T}Av<0.

Proof.

Let QQ be the matrix with columns Aei,Ae_{i}, and decompose it as

Q=(UB)Q=\begin{pmatrix}U\\ B\end{pmatrix} (31)

where Uk×kU\in\mathbb{R}^{k\times k} and B(dk)×k.B\in\mathbb{R}^{(d-k)\times k}. Suppose that there does not exist a vv as in the statement of the proposition. Note that this implies that UU is PSD (UU is symmetric since AA is), and in fact positive definite by the assumption that UU was non-singular. Now consider the block matrix

A~=(UBTBλI)\widetilde{A}=\begin{pmatrix}U&B^{T}\\ B&\lambda I\end{pmatrix} (32)

for some choice of λ>0.\lambda>0. For arbitrary vv and ww of the appropriate dimensions, we have

(vTwT)(UBTBλI)(vw)\displaystyle\begin{pmatrix}v^{T}&w^{T}\end{pmatrix}\begin{pmatrix}U&B^{T}\\ B&\lambda I\end{pmatrix}\begin{pmatrix}v\\ w\end{pmatrix} =vTUv+2wTBv+λwTw\displaystyle=v^{T}Uv+2w^{T}Bv+\lambda w^{T}w (33)
v2σmin(U)+λw22vwσmax(B).\displaystyle\geq\left\|v\right\|^{2}\sigma_{\min}(U)+\lambda\left\|w\right\|^{2}-2\left\|v\right\|\left\|w\right\|\sigma_{\max}(B). (34)

Since σmin(U)0\sigma_{\min}(U)\neq 0 this expression viewed as a quadratic form in v\left\|v\right\| and w\left\|w\right\| is positive definite for large enough λ.\lambda. This implies that A~\widetilde{A} is positive definite as well. Since A~ei=Aei\widetilde{A}e_{i}=Ae_{i} by construction, this shows that the queries are consistent with a PSD matrix. So a one-sided tester that cannot produce a witness vector in this case must not output False. ∎
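The proof is constructive; the following sketch (ours) builds the completion (32) with lambda chosen so that lambda * sigma_min(U) exceeds sigma_max(B)^2, and checks that it is PSD:

    import numpy as np

    def psd_completion(A, k):
        # Returns the block matrix (32) built from the first k columns of A.
        U, B = A[:k, :k], A[k:, :k]
        sigma_min = np.linalg.eigvalsh((U + U.T) / 2).min()
        assert sigma_min > 0, "requires the top block to be positive definite"
        lam = 2.0 * np.linalg.norm(B, 2) ** 2 / sigma_min + 1.0
        n = A.shape[0] - k
        return np.block([[U, B.T], [B, lam * np.eye(n)]])

    rng = np.random.default_rng(4)
    M = rng.standard_normal((8, 8))
    A = M @ M.T + 0.1 * np.eye(8)            # PSD input, so its first k columns
    C = psd_completion(A, k=3)               # must be consistent with some PSD matrix
    print(np.linalg.eigvalsh(C).min() >= 0)  # True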

Theorem 45.

Set D=diag(λ,1,,1),D=\operatorname{diag}(-\lambda,1,\ldots,1), let SS be a random orthogonal matrix, and take A=STDSA=S^{T}DS. In the matrix-vector model, a one-sided non-adaptive tester must make at least 12d1+λ\frac{1}{2}\frac{d}{1+\lambda} queries to be correct on this distribution with 2/32/3 probability.

Proof.

Given this distribution we may assume without loss of generality that the tester queries on e1,,ek,e_{1},\ldots,e_{k}, whose span we call Ek.E_{k}. Let uu denote the λ-\lambda eigen-direction of A,A, which is distributed uniformly over Sd1.S^{d-1}. For unit vectors xx, the quadratic form associated to AA is negative exactly when x,u2>11+λ.\left\langle x,u\right\rangle^{2}>\frac{1}{1+\lambda}. Also the UU as in Proposition 44 is non-singular with probability 1.1. In this case, by Proposition 44 the tester can only succeed if ΠEku211+λ.\left\|\Pi_{E_{k}}u\right\|^{2}\geq\frac{1}{1+\lambda}. On the other hand 𝔼ΠEku2=k/d\mathbb{E}\left\|\Pi_{E_{k}}u\right\|^{2}=k/d, so by Markov, ΠEku22k/d\left\|\Pi_{E_{k}}u\right\|^{2}\leq 2k/d with probability at least 1/2.1/2. Therefore a tester that succeeds with 2/32/3 probability must have 2k/d1/(1+λ)2k/d\geq 1/(1+\lambda). ∎

Corollary 46.

In the matrix-vector model, a one-sided non-adaptive p\ell_{p}-tester must make at least Ω(1ϵd11/p)\Omega(\frac{1}{\epsilon}d^{1-1/p}) queries.

Proof.

Apply Theorem 45 with λ=ϵd1/p.\lambda=\epsilon d^{1/p}.

7 Conclusion and Open Problems

We gave a series of tight bounds for PSD-testing in both the matrix-vector and vector-matrix-vector models. We provided tight bounds as well as a separation between one and two-sided testers in the latter model. There are a number of additional questions that may yield interesting future work.

  • Our adaptive vector-matrix-vector algorithm for p=1p=1 uses Ω(1/ϵ)\Omega(1/\epsilon) rounds of adaptivity, but this may not always be desirable in practice, since the queries cannot be run in parallel. Are there good algorithms that use less adaptivity? What is the optimal trade-off between query complexity and the number of rounds of adaptivity?

  • One could modify our testing model and consider testers which should output False whenever the p\ell_{p} norm of the negative eigenvalues is at least an ϵ\epsilon fraction of the p\ell_{p} norm of positive eigenvalues. Is it possible to give tight bounds for this problem in the models that we considered?

  • Is it possible to use the ideas behind our two-sided bilinear sketch to give better bounds for spectral estimation with additive Frobenius error?

8 Acknowledgements

The authors would like to thank Sushant Sachdeva for pointing out that our analysis for matrix-vector queries was similar to the application of [AL86] in [SW09].

References

  • [Oja82] Erkki Oja “Simplified neuron model as a principal component analyzer” In Journal of mathematical biology 15.3 Springer, 1982, pp. 267–273
  • [AL86] Owe Axelsson and Gunhild Lindskog “On the rate of convergence of the preconditioned conjugate gradient method” In Numerische Mathematik 48 Springer, 1986, pp. 499–523
  • [Hut89] Michael F Hutchinson “A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines” In Communications in Statistics-Simulation and Computation 18.3 Taylor & Francis, 1989, pp. 1059–1076
  • [BFG96] Zhaojun Bai, Gark Fahey and Gene Golub “Some large-scale matrix computation problems” In Journal of Computational and Applied Mathematics 74.1-2 Elsevier, 1996, pp. 71–89
  • [KS03] Robert Krauthgamer and Ori Sasson “Property testing of data dimensionality” In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA ACM/SIAM, 2003, pp. 18–27
  • [Lit+05] Alexander E Litvak, Alain Pajor, Mark Rudelson and Nicole Tomczak-Jaegermann “Smallest singular value of random matrices and geometry of random polytopes” In Advances in Mathematics 195.2 Elsevier, 2005, pp. 491–523
  • [SW09] Daniel A Spielman and Jaeoh Woo “A note on preconditioning by low-stretch spanning trees” In arXiv preprint arXiv:0903.2816, 2009
  • [AT11] Haim Avron and Sivan Toledo “Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix” In Journal of the ACM (JACM) 58.2 ACM New York, NY, USA, 2011, pp. 1–34
  • [AN13] Alexandr Andoni and Huy L Nguyen “Eigenvalues of a matrix in the streaming model” In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, 2013, pp. 1729–1737 SIAM
  • [WWZ14] Karl Wimmer, Yi Wu and Peng Zhang “Optimal query complexity for estimating the trace of a matrix” In International Colloquium on Automata, Languages, and Programming, 2014, pp. 1051–1062 Springer
  • [MM15] Cameron Musco and Christopher Musco “Randomized block krylov methods for stronger and faster approximate singular value decomposition” In arXiv preprint arXiv:1504.05477, 2015
  • [Ai+16] Yuqing Ai, Wei Hu, Yi Li and David P Woodruff “New characterizations in turnstile streams with applications” In 31st Conference on Computational Complexity (CCC 2016), 2016 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
  • [Jai+16] Prateek Jain et al. “Streaming pca: Matching matrix bernstein and near-optimal finite sample guarantees for oja’s algorithm” In Conference on learning theory, 2016, pp. 1147–1164 PMLR
  • [LW16] Yi Li and David P Woodruff “Tight bounds for sketching the operator norm, Schatten norms, and subspace embeddings” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016), 2016 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
  • [Sha16] Ohad Shamir “Convergence of stochastic gradient descent for PCA” In International Conference on Machine Learning, 2016, pp. 257–265 PMLR
  • [AL17] Zeyuan Allen-Zhu and Yuanzhi Li “First efficient convergence for streaming k-pca: a global, gap-free, and near-optimal rate” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 487–492 IEEE
  • [CW17] Kenneth L Clarkson and David P Woodruff “Low-rank approximation and regression in input sparsity time” In Journal of the ACM (JACM) 63.6 ACM New York, NY, USA, 2017, pp. 1–45
  • [CW17a] Kenneth L Clarkson and David P Woodruff “Low-rank PSD approximation in input-sparsity time” In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 2017, pp. 2061–2072 SIAM
  • [Han+17] Insu Han, Dmitry Malioutov, Haim Avron and Jinwoo Shin “Approximating spectral sums of large-scale matrices using stochastic chebyshev approximations” In SIAM Journal on Scientific Computing 39.4 SIAM, 2017, pp. A1558–A1585
  • [SER18] Max Simchowitz, Ahmed El Alaoui and Benjamin Recht “Tight query complexity lower bounds for PCA via finite sample deformed Wigner law” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018, pp. 1249–1259
  • [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge university press, 2018
  • [Bal+19] Maria-Florina Balcan, Yi Li, David P Woodruff and Hongyang Zhang “Testing matrix rank, optimally” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 727–746 SIAM
  • [MSW19] Michela Meister, Tamas Sarlos and David Woodruff “Tight dimensionality reduction for sketching low degree polynomial kernels”, 2019
  • [Sun+19] Xiaoming Sun, David P Woodruff, Guang Yang and Jialin Zhang “Querying a matrix through matrix-vector products” In arXiv preprint arXiv:1906.05736, 2019
  • [BCJ20] Ainesh Bakshi, Nadiia Chepurko and Rajesh Jayaram “Testing positive semi-definiteness via random submatrices” In arXiv preprint arXiv:2005.06441, 2020
  • [Bra+20] Mark Braverman, Elad Hazan, Max Simchowitz and Blake Woodworth “The gradient complexity of linear regression” In Conference on Learning Theory, 2020, pp. 627–647 PMLR
  • [RWZ20] Cyrus Rashtchian, David P Woodruff and Hanlin Zhu “Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems” In arXiv preprint arXiv:2006.14015, 2020
  • [BMR21] Rajarshi Bhattacharjee, Cameron Musco and Archan Ray “Sublinear Time Eigenvalue Approximation via Random Sampling” In CoRR abs/2109.07647, 2021
  • [BMR21a] Rajarshi Bhattacharjee, Cameron Musco and Archan Ray “Sublinear Time Eigenvalue Approximation via Random Sampling” In arXiv preprint arXiv:2109.07647, 2021
  • [KPW21] Akshay Kamath, Eric Price and David P Woodruff “A simple proof of a new set disjointness with applications to data streams” In arXiv preprint arXiv:2105.11338, 2021
  • [Mey+21] Raphael A Meyer, Cameron Musco, Christopher Musco and David P Woodruff “Hutch++: Optimal Stochastic Trace Estimation” In Symposium on Simplicity in Algorithms (SOSA), 2021, pp. 142–155 SIAM
  • [INW22] Piotr Indyk, Shyam Narayanan and David P. Woodruff “Frequency Estimation with One-Sided Error” In SODA, 2022