
Testing Positive Semidefiniteness Using Linear Measurements

(Authors DN and WS were partially supported by NSF DMS #2011140 and NSF DMS #2108479. DW was supported by NSF CCF #1815840 and Office of Naval Research Grant N00014-18-1-2562.)

Deanna Needell
UCLA
   William Swartworth
UCLA
   David P. Woodruff
CMU
Abstract

We study the problem of testing whether a symmetric $d\times d$ input matrix $A$ is symmetric positive semidefinite (PSD), or is $\epsilon$-far from the PSD cone, meaning that $\lambda_{\min}(A)\leq-\epsilon\|A\|_{p}$, where $\|A\|_{p}$ is the Schatten-$p$ norm of $A$. In applications one often needs to quickly tell if an input matrix is PSD, and a small distance from the PSD cone may be tolerable. We consider two well-studied query models for measuring efficiency, namely, the matrix-vector and vector-matrix-vector query models. We first consider one-sided testers, which are testers that correctly classify any PSD input, but may fail on a non-PSD input with a tiny failure probability. Up to logarithmic factors, in the matrix-vector query model we show a tight $\widetilde{\Theta}(1/\epsilon^{p/(2p+1)})$ bound, while in the vector-matrix-vector query model we show a tight $\widetilde{\Theta}(d^{1-1/p}/\epsilon)$ bound, for every $p\geq 1$. We also show a strong separation between one-sided and two-sided testers in the vector-matrix-vector model, where a two-sided tester may fail on both PSD and non-PSD inputs with a tiny failure probability. In particular, for the important case of the Frobenius norm, we show that any one-sided tester requires $\widetilde{\Omega}(\sqrt{d}/\epsilon)$ queries. However, we introduce a bilinear sketch for two-sided testing, from which we construct a Frobenius-norm tester achieving the optimal $\widetilde{O}(1/\epsilon^{2})$ queries. We also give a number of additional separations between adaptive and non-adaptive testers. Our techniques have implications beyond testing, providing new methods to approximate the spectrum of a matrix with Frobenius-norm error using dimensionality reduction in a way that preserves the signs of eigenvalues.

1 Introduction

A real-valued matrix $A\in\mathbb{R}^{n\times n}$ is said to be positive semidefinite (PSD) if it defines a non-negative quadratic form, namely, if $x^{T}Ax\geq 0$ for all $x$. If $A$ is symmetric, the setting on which we focus, this is equivalent to the eigenvalues of $A$ being non-negative. Multiple works [KS03, Han+17, BCJ20] have studied the problem of testing whether a real matrix is PSD, or is far from being PSD, and this testing problem has numerous applications, including to faster algorithms for linear systems and linear algebra problems, detecting the existence of community structure, ascertaining local convexity, and differential equations; we refer the reader to [BCJ20] and the references therein.

We study two fundamental query models. In the matrix-vector model, one is given implicit access to a matrix $A$ and may query $A$ by choosing a vector $v$ and receiving the vector $Av$. In the vector-matrix-vector model, one chooses a pair of vectors $(v,w)$ and queries the bilinear form associated with $A$; in other words, the value of the query is $v^{T}Aw$. In both models, multiple, adaptively-chosen queries can be made, and the goal is to minimize the number of queries needed to solve a given task. These models are standard computational models in the numerical linear algebra community, see, e.g., [Han+17], where PSD testing was studied in the matrix-vector query model. These models were recently formalized in the theoretical computer science community in [Sun+19, RWZ20], though similar models have been studied in numerous fields, for instance as the number of measurements in compressed sensing, or the sketching dimension of a streaming algorithm. The matrix-vector and vector-matrix-vector query models are particularly relevant when the input matrix $A$ is not given explicitly.

A natural situation occurs when $A$ is presented implicitly as the Hessian of a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ at a point $x_{0}$, where $f$ could be the loss function of a neural network, for example. One might want to quickly distinguish between a proposed optimum of $f$ truly being a minimum, or being a saddle point with a direction of steep downward curvature. Our query model is quite natural in this context. A Hessian-vector product is efficient to compute using automatic differentiation techniques. A vector-matrix-vector product corresponds to a single second derivative computation, $D^{2}f(v,w)$. This can be approximated using $4$ function queries by the finite difference approximation
$$D^{2}f(v,w)\approx\frac{f(x_{0}+hv+hw)-f(x_{0}+hv)-f(x_{0}+hw)+f(x_{0})}{h^{2}},$$
where $h$ is small.
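For instance, a vector-matrix-vector query oracle for a Hessian can be built from function evaluations alone. The following is a minimal NumPy sketch of this finite-difference query; the quadratic test function and the step size $h$ are illustrative choices, not taken from the paper:

```python
import numpy as np

def hessian_query(f, x0, v, w, h=1e-4):
    """Approximate v^T (Hessian of f at x0) w with the four-point
    finite difference above; one vector-matrix-vector query costs
    four evaluations of f."""
    return (f(x0 + h*(v + w)) - f(x0 + h*v) - f(x0 + h*w) + f(x0)) / h**2

# Sanity check on a quadratic f(x) = x^T A x / 2, whose Hessian is A.
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
f = lambda x: 0.5 * x @ A @ x
v, w = rng.standard_normal(d), rng.standard_normal(d)
print(hessian_query(f, np.zeros(d), v, w), v @ A @ w)  # nearly equal
```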

While there are numerically stable methods for computing the spectrum of a symmetric matrix, and thus determining if it is PSD, these methods can be prohibitively slow for very large matrices, and require a large number of matrix-vector or vector-matrix-vector products. Our goal is to obtain significantly more efficient algorithms in these models, and we approach this problem from a property testing perspective. In particular, we focus on the following version of the PSD-testing problem. In what follows, $\|A\|_{p}=\left(\sum_{i=1}^{n}\sigma_{i}^{p}\right)^{1/p}$ is the Schatten-$p$ norm of $A$, where the $\sigma_{i}$ are the singular values of $A$.
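For concreteness, the Schatten norms used throughout can be computed directly from the singular values (a small NumPy helper, included only for illustration):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten-p norm of A: the l_p norm of the vector of singular
    values; p = np.inf recovers the operator norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.linalg.norm(s, ord=p)
```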

Definition 1.

For $p\in[1,\infty]$, an $(\epsilon,\ell_{p})$-tester is an algorithm that makes either matrix-vector or vector-matrix-vector queries to a real symmetric matrix $A$, and outputs True with probability at least $2/3$ if $A$ is PSD, and outputs False with probability at least $2/3$ if $A$ is $\epsilon\left\|A\right\|_{p}$-far in spectral distance from the PSD cone, or equivalently, if the minimum eigenvalue satisfies $\lambda_{\min}(A)\leq-\epsilon\left\|A\right\|_{p}$. If the tester is guaranteed to output True on all PSD inputs (even if the input is generated by an adversary with access to the random coins of the tester), then the tester has one-sided error. Otherwise it has two-sided error. When $\epsilon$ is clear from the context we will often drop the $\epsilon$ and simply refer to an $\ell_{p}$-tester.

Our work fits more broadly into the growing body of work on property testing for linear algebra problems, see, for example, [Bal+19, BCJ20, BMR21]. However, a key difference is that we focus on the matrix-vector and vector-matrix-vector query models, which might be more appropriate than the model in the above works, which charges a cost of $1$ for reading a single entry. Indeed, such models must assume that the entries of the input are bounded by a constant or a slow-growing function of $n$, as otherwise strong impossibility results hold. This can severely limit the applicability of such algorithms to real-life matrices that do not have bounded entries; indeed, even a graph Laplacian with a single large-degree vertex would not fit into the above models. In contrast, we use the matrix-vector and vector-matrix-vector models, which are ideally suited for modern machines such as graphics processing units and for settings where the input matrix cannot fit into RAM, and are standard models in scientific computing, see, e.g., [BFG96].

While we focus on vector-matrix-vector queries, our results shed light on several other natural settings. Many of our results are in fact tight for general linear measurements, which vectorize the input matrix and apply adaptively chosen linear forms to it. For long enough streams, the best known single- or multi-pass algorithms for any problem in the turnstile streaming model form a sketch using general linear measurements, and with some additional restrictions, it can be shown that the optimal multi-pass streaming algorithm simply adaptively chooses general linear measurements [Ai+16]. Therefore, it is quite plausible that many of our vector-matrix-vector algorithms give tight single-pass streaming bounds, given that vector-matrix-vector queries are a special case of general linear measurements, and that many of our lower bounds are tight even for general linear measurements.

Moreover, our vector-matrix-vector algorithms lead to efficient communication protocols for deciding whether a distributed sum of matrices is PSD, provided that exact vector-matrix-vector products may be communicated. While we expect our methods to be stable under small perturbations (i.e., when the vector-matrix-vector products are slightly inexact), we leave the full communication complexity analysis to future work.

We note that our PSD-testing problem is closely related to that of approximating the largest eigenvalue of a PSD matrix. Indeed, by appropriately negating and shifting the input matrix, it is essentially equivalent to estimating the largest eigenvalue of a PSD matrix $A$ up to additive error $\epsilon\left(\sum_{i}\left|\lambda_{\max}(A)-\lambda_{i}(A)\right|^{p}\right)^{1/p}$. However, this problem is much less natural, as real-world matrices often have many small eigenvalues but only a few large eigenvalues.

1.1 Our Contributions

Vector-matrix-vector queries:
- Adaptive, one-sided, $\ell_{p}$: $\widetilde{\Theta}(\frac{1}{\epsilon}d^{1-1/p})$ (Corollary 3, Theorem 15)
- Non-adaptive, one-sided, $\ell_{p}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}}d^{2-2/p})$ (Corollary 42, Theorem 13)
- Adaptive, two-sided, $\ell_{2}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}})^{*}$ (Proposition 29, Corollary 33)
- Non-adaptive, two-sided, $\ell_{2}$: $\widetilde{\Theta}(\frac{1}{\epsilon^{4}})^{*}$ (Theorem 28, Theorem 31)
- Adaptive, two-sided, $\ell_{p}$, $2\leq p<\infty$: $\widetilde{\Theta}(\frac{1}{\epsilon^{2}}d^{1-2/p})^{*}$ (Corollary 30, Corollary 33)

Matrix-vector queries:
- Adaptive, one-sided, $\ell_{p}$: $\widetilde{O}((1/\epsilon)^{p/(2p+1)}\log d)$, $\Omega((1/\epsilon)^{p/(2p+1)})$ (Theorem 17, Theorem 20)
- Adaptive, one-sided, $\ell_{1}$: $\widetilde{\Theta}((1/\epsilon)^{1/3})$ (Theorem 17, Theorem 20)
- Non-adaptive, one-sided, $\ell_{p}$: $\Theta(\frac{1}{\epsilon}d^{1-1/p})$ (Proposition 43, Corollary 46)

Table 1: Our upper and lower bounds for the matrix-vector and vector-matrix-vector query models. $^{*}$ indicates that the lower bound holds for general linear measurements.

We study PSD-testing in the matrix-vector and vector-matrix-vector models. In particular, given a real symmetric matrix $A$ and $p\in[1,\infty]$, we are interested in deciding between (i) $A$ is PSD, and (ii) $A$ has an eigenvalue less than $-\epsilon\left\|A\right\|_{p}$, where $\left\|A\right\|_{p}$ is the Schatten $p$-norm of $A$.

Tight Bounds for One-sided Testers.

We make particular note of the distinction between one-sided and two-sided testers. In some settings one is interested in a tester that produces one-sided error. When such a tester outputs False, it must be able to produce a proof that $A$ is not PSD. The simplest such proof is a witness vector $v$ such that $v^{T}Av<0$, and indeed we observe that in the matrix-vector model, any one-sided tester can produce such a $v$ when it outputs False. This may be a desirable feature if one wishes to apply these techniques to saddle point detection, for example: given a point that is not a local minimum, it would be useful to produce a descent direction so that optimization may continue. In the vector-matrix-vector model the situation is somewhat more complicated in general, but all of our one-sided testers produce a witness vector whenever they output False.

We provide optimal bounds for one-sided testers in both the matrix-vector and vector-matrix-vector models. The bounds below are stated for constant probability algorithms. Here $\widetilde{O}(f)=f\cdot\text{poly}(\log f)$.

1. In the matrix-vector query model, we show that up to a factor of $\log d$, $\widetilde{\Theta}(1/\epsilon^{p/(2p+1)})$ queries are necessary and sufficient for an $\ell_{p}$-tester for any $p\geq 1$. In the $p=1$ case, we note that the $\log d$ factor may be removed.

2. In the vector-matrix-vector query model, we show that $\widetilde{\Theta}(d^{1-1/p}/\epsilon)$ queries are necessary and sufficient for an $\ell_{p}$-tester for any $p\geq 1$. Note that when $p=1$ we obtain a very efficient $\widetilde{O}(1/\epsilon)$-query algorithm. In particular, our tester for $p=1$ has query complexity independent of the matrix dimensions, and we show a sharp phase transition for $p>1$, showing in some sense that $p=1$ is the largest value of $p$ possible for one-sided queries.

The matrix-vector query complexity is very different from the vector-matrix-vector query complexity, as the query complexity is $\text{poly}(1/\epsilon)$ for any $p\geq 1$. This captures the fact that each matrix-vector query response reveals more information than that of a vector-matrix-vector query, though a priori it was not clear that such responses in the matrix-vector model could not be compressed using vector-matrix-vector queries.

An Optimal Bilinear Sketch for Two-Sided Testing.

Our main technical contribution for two-sided testers is a bilinear sketch for PSD-testing with respect to the Frobenius norm, i.e., $p=2$. We consider a Gaussian sketch $G^{T}AG$, where $G$ has small dimension $\widetilde{O}(\frac{1}{\epsilon^{2}})$. By looking at the smallest eigenvalue of the sketch, we are able to distinguish between $A$ being PSD and being $\epsilon$-far from PSD. Notably, this tester may reject even when $\lambda_{\min}(G^{T}AG)>0$, which results in a two-sided error guarantee. This sketch allows us to obtain tight two-sided bounds in the vector-matrix-vector model for $p\geq 2$, both for adaptive and non-adaptive queries.

Separation Between One-Sided and Two-Sided Testers.

Surprisingly, we show a separation between one-sided and two-sided testers in the vector-matrix-vector model. For the important case of the Frobenius norm, i.e., $p=2$, we utilize our bilinear sketch to construct an $\widetilde{O}(1/\epsilon^{2})$-query two-sided tester, whereas by our results above, any adaptive one-sided tester requires at least $\Omega(\sqrt{d}/\epsilon)$ queries.

We also show that for any $p>2$, any (possibly adaptive) two-sided tester requires $d^{\Omega(1)}$ queries for constant $\epsilon$; thus, in some sense, $p=2$ is the largest value of $p$ possible for two-sided queries.

On the Importance of Adaptivity.

We also study the role of adaptivity in both matrix-vector and vector-matrix-vector models. In both the one-sided and two-sided vector-matrix-vector models we show a quadratic separation between adaptive and non-adaptive testers, which is the largest gap possible for any vector-matrix-vector problem [Sun+19].

In the matrix-vector model, each query reveals more information about AA than in the vector-matrix-vector model, allowing for even better choices for future queries. Thus we have an even larger gap between adaptive and non-adaptive testers in this setting.

Spectrum Estimation.

While the two-sided tester discussed above yields optimal bounds for PSD testing, it does not immediately give a way to estimate the negative eigenvalue when it exists. Via a different approach, we show how to give such an approximation with $\epsilon\left\|A\right\|_{F}$ additive error. In fact, we show how to approximate all of the top $k$ eigenvalues of $A$ using $O(k^{2}\,\text{poly}\frac{1}{\epsilon})$ non-adaptive vector-matrix-vector queries, which may be of independent interest.

We note that this gives an $O(k^{2}\,\text{poly}\frac{1}{\epsilon})$-space streaming algorithm for estimating the top $k$ eigenvalues of $A$ to within additive Frobenius error. Prior work yields a similar guarantee for the singular values [AN13], but cannot recover the signs of eigenvalues.

1.2 Our Techniques

Matrix-Vector Queries.

For the case of adaptive matrix-vector queries, we show that Krylov iteration starting with a single random vector yields an optimal $\ell_{p}$-tester for all $p$. Interestingly, our analysis is able to beat the usual Krylov matrix-vector query bound for approximating the top eigenvalue, as we modify the usual polynomial analyzed for eigenvalue estimation to implicitly implement a deflation step for all eigenvalues above a certain threshold. We do not need to explicitly know the values of the large eigenvalues in order to deflate them; rather, it suffices that there exists a low degree polynomial in the Krylov space that implements this deflation.

We note that this idea of implicitly deflating the top eigenvalues first appeared in [AL86]. More recently this observation was applied in a setting very similar to ours by [SW09], who deflated eigenvalues that are large relative to the trace.

Further, we show that this technique is tight for all $p\geq 1$ by showing that any smaller number of matrix-vector products would violate a recent lower bound of [Bra+20] for approximating the smallest eigenvalue of a Wishart matrix. This lower bound applies even to two-sided testers.

Vector-Matrix-Vector Queries.

We start by describing our result for $p=1$. We give one of the first examples of an algorithm in the vector-matrix-vector query model that leverages adaptivity in an interesting way. Most known algorithms in this model work non-adaptively, either by applying a bilinear sketch to the matrix, or by making many independent queries in the case of Hutchinson's trace estimator [Hut89]. Indeed, the algorithm of [AN13] works by computing $G^{T}AG$ for a Gaussian matrix $G$ with $1/\epsilon$ columns, and arguing that all eigenvalues that are at least $\epsilon\|A\|_{1}$ can be estimated from the sketch. The issue with this approach is that it uses $\Omega(1/\epsilon^{2})$ queries, and this bound is tight for non-adaptive testers! One could improve this by running our earlier matrix-vector algorithm on top of this sketch, without ever explicitly forming the $1/\epsilon\times 1/\epsilon$ matrix $G^{T}AG$; however, this would only give an $O(1/\epsilon^{4/3})$-query algorithm.

To achieve our optimal $\widetilde{O}(1/\epsilon)$ complexity, our algorithm instead performs a novel twist on Oja's algorithm [Oja82], the latter being a stochastic gradient descent (SGD) algorithm applied to optimizing the quadratic form $f(x)=x^{T}Ax$ over the sphere. In typical applications, the randomness of SGD arises via randomly sampling from a set of training data. In our setting, we instead artificially introduce randomness at each step, by computing the projection of the gradient onto a randomly chosen direction. This idea is implemented via the iteration

$$x^{(k+1)}=x^{(k)}-\eta(g^{T}Ax^{(k)})g\quad\text{where}\quad g\sim\mathcal{N}(0,I)\qquad(1)$$

for a well-chosen step size $\eta$. If $f$ ever becomes negative before reaching the maximum number of iterations, then the algorithm outputs False; otherwise it outputs True. For $p=1$, we show that this scheme results in an optimal tester (up to logarithmic factors). Our proof uses a second moment analysis of a random walk that is similar in style to [Jai+16], though our analysis is quite different. Whereas [Jai+16] considers an arbitrary i.i.d. stream of unbiased estimators of $A$ (with bounded variance), our estimators are simply $gg^{T}A$, which do not seem to have been considered before. We leverage this special structure to obtain a better variance bound on the iterates throughout the first $\widetilde{O}(1/\epsilon)$ iterations, where each iteration can be implemented with a single vector-matrix-vector query. Our algorithm and analysis give a new method for the fundamental problem of approximating eigenvalues.

Our result for general $p>1$ follows by relating the Schatten-$p$ norm to the Schatten-$1$ norm and invoking the algorithm above with a different setting of $\epsilon$. We show our method is optimal by proving an $\Omega(d^{2-2/p}/\epsilon^{2})$ lower bound for non-adaptive one-sided testers, and then using a theorem in [RWZ20] which shows that adaptive one-sided testers can give at most a quadratic improvement. We note that one could instead use a recent streaming lower bound of [INW22] to prove this lower bound, though such a lower bound would depend on the bit complexity.

Two-Sided Testers.

The key technical ingredient behind all of our two-sided testers is a bilinear sketch for PSD-testing. Specifically, we show that a sketch of the form $G^{T}AG$ with $G\in\mathbb{R}^{d\times k}$ is sufficient for obtaining a two-sided tester for $p=2$. In contrast to the $p=1$ case, we do not simply output False when $\lambda_{\min}:=\lambda_{\min}(G^{T}AG)<0$, as such an algorithm would automatically be one-sided. Instead we require a criterion to detect when $\lambda_{\min}$ is suspiciously small. For this we require two results.

The first is a concentration inequality for $\lambda_{\min}(G^{T}AG)$ when $A$ is PSD. We show that $\lambda_{\min}\geq\operatorname{Tr}(A)-\widetilde{O}(\sqrt{k})\left\|A\right\|_{F}$ with very good probability. This result is equivalent to bounding the smallest singular value of $A^{1/2}G$, which is a Gaussian matrix whose rows have different variances. Although many similar bounds for constant variances exist in the literature [Lit+05, Ver18], we were not able to find a bound for general covariances. In particular, most existing bounds do not seem to give the concentration around $\operatorname{Tr}(A)$ that we require.

When $A$ has a negative eigenvalue of $-\epsilon$, we show that $\lambda_{\min}\leq\operatorname{Tr}(A)-\epsilon\,O(k)$. By combining these two results, we are able to take $k=\widetilde{O}(1/\epsilon^{2})$, yielding a tight bound for non-adaptive testers in the vector-matrix-vector model. In fact this bound is even tight for general linear sketches, as we show by applying the results in [LW16].
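To make the acceptance rule concrete, here is a minimal NumPy sketch of the resulting tester. It assumes the normalization $\|A\|_{F}=1$, forms the sketch from an explicit matrix (in the query model, each entry of $G^{T}AG$ is a single vector-matrix-vector query), and uses placeholder constants rather than the tuned threshold from the paper's analysis:

```python
import numpy as np

def two_sided_frobenius_test(A, eps, rng=np.random.default_rng()):
    """Illustrative two-sided tester for p = 2, assuming ||A||_F = 1.
    Each entry of the sketch M = G^T A G corresponds to one
    vector-matrix-vector query g_i^T A g_j."""
    d = A.shape[0]
    k = int(np.ceil(np.log(2 / eps)**2 / eps**2))    # k = O~(1/eps^2)
    G = rng.standard_normal((d, k))
    M = G.T @ A @ G                                  # the bilinear sketch
    lam_min = np.linalg.eigvalsh((M + M.T) / 2).min()
    tr_est = np.mean(np.diag(M))                     # Hutchinson estimate of Tr(A)
    # For PSD inputs, lam_min concentrates above Tr(A) - O~(sqrt(k));
    # an eigenvalue of -eps pulls it down by Omega(eps * k), which is
    # larger by a polylog factor.  Accept iff lam_min is not suspiciously small.
    return lam_min >= tr_est - 4 * np.sqrt(k) * np.log(k + 1)
```

The constant $4$ and the $\log$ factor in the threshold are assumptions for illustration; the analysis pins down the precise polylogarithmic factors.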

We also utilize this bilinear sketch to give tight bounds for adaptive vector-matrix-vector queries, and indeed for general linear measurements. By first (implicitly) applying the sketch, and then shifting by an appropriate multiple of the identity, we are able to reduce to the $(\epsilon^{2},\ell_{1})$-testing problem, which as described above may be solved using $\widetilde{O}(1/\epsilon^{2})$ queries.

Spectrum Estimation.

A natural approach for approximating the eigenvalues of an $n\times n$ matrix $A$ is to first compute a sketch $G^{T}AG$ or a sketch $G^{T}AH$ for Gaussian matrices $G$ and $H$ with a small number of columns. Both of these sketches appear in [AN13]. As noted above, $G^{T}AG$ is a useful non-adaptive sketch for spectrum approximation, but the error in approximating each eigenvalue is proportional to the Schatten-$1$ norm of $A$. One could instead try to make the error depend on the Frobenius norm $\|A\|_{2}$ of $A$ by computing $G^{T}AH$ for independent Gaussian matrices $G$ and $H$, but now $G^{T}AH$ is no longer symmetric and it is not clear how to extract the signs of the eigenvalues of $A$ from $G^{T}AH$. Indeed, [AN13] are only able to show that the singular values of $G^{T}AH$ are approximately the same as those of $A$, up to additive $\epsilon\|A\|_{2}$ error. We thus need a new way to preserve the sign information of eigenvalues.

To do this, we show how to use results for providing the best PSD low-rank approximation to an input matrix $A$, where $A$ need not be PSD and need not even be symmetric. In particular, in [CW17a] it was argued that if $G$ is a Gaussian matrix with $O(k/\epsilon)$ columns, and one sets up the optimization problem $\min_{\textrm{rank-}k\text{ PSD }Y}\|AGYG^{T}A^{T}-A\|_{F}^{2}$, then the cost will be at most $(1+\epsilon)\|A_{k,+}-A\|_{F}^{2}$, where $A_{k,+}$ is the best rank-$k$ PSD approximation to $A$. By further sketching on the left and right with so-called affine embeddings $S$ and $T$, which have $\text{poly}(k/\epsilon)$ rows and columns respectively, one can reduce this problem to $\min_{\textrm{rank-}k\text{ PSD }Y}\|SAGYG^{T}A^{T}T-SAT\|_{F}^{2}$; now $SAG$, $G^{T}A^{T}T$ and $SAT$ are all $\text{poly}(k/\epsilon)\times\text{poly}(k/\epsilon)$ matrices, so they can be computed with $\text{poly}(k/\epsilon)$ vector-matrix-vector products. At this point the optimal $Y$ can be found with no additional queries and its cost can be evaluated. By subtracting this cost from $\left\|A\right\|_{F}^{2}$, we approximate $\left\|A_{+,i}\right\|_{F}^{2}$ and $\left\|A_{-,i}\right\|_{F}^{2}$ for all $i\in[k]$, which in turn allows us to produce (signed) estimates for the eigenvalues of $A$.

When $A$ is PSD, we note that Theorem 1.2 in [AN13] is able to reproduce our spectral approximation guarantee using sketching dimension $O(\frac{k^{2}}{\epsilon^{8}})$, compared to our sketch of dimension $O(\frac{k^{2}}{\epsilon^{12}})$. However, as mentioned above, our guarantee is stronger in that it allows for the signs of the eigenvalues to be recovered, i.e., our guarantee holds even when $A$ is not PSD. Additionally, we are able to achieve $O(\frac{k^{2}}{\epsilon^{8}})$ using just a single round of adaptivity.

Lower Bounds for One-sided Testers.

To prove lower bounds for one-sided non-adaptive testers, we first show that a one-sided tester must be able to produce a witness whenever it outputs False. In the matrix-vector model, the witness is a vector $v$ with $v^{T}Av<0$, and in the vector-matrix-vector model, the witness is a PSD matrix $M$ with $\left\langle M,A\right\rangle<0$. In both cases we show that, even for the simplest non-PSD spectrum $(-\lambda,1,\ldots,1)$, it takes many queries to produce a witness when $\lambda$ is small. In the matrix-vector model, our approach is simply to show that the $-\lambda$ eigenvector is typically far from the span of all queried vectors when the number of queries is small. This implies that $A$ is non-negative on the queried subspace, which precludes the tester from producing a witness. In the vector-matrix-vector model our approach is similar; however, now the queries take the form of inner products against rank-one matrices $x_{i}x_{i}^{T}$. We therefore need to work within the space of symmetric matrices, and this requires a more delicate argument.

1.3 Additional Related Work

Numerous other works have considered matrix-vector and vector-matrix-vector queries, see, e.g., [Mey+21, Bra+20, Sun+19, SER18, MM15, WWZ14]. We outline a few core areas here.

Oja’s Algorithm.

Several works have considered Oja's algorithm in the context of streaming PCA [Sha16, Jai+16, AL17]. [Jai+16] gives a tight convergence rate for iteratively approximating the top eigenvector of a PSD matrix, given an eigengap, and [AL17] extends this to a gap-free result for $k$-PCA.

PSD Testing.

As mentioned above, PSD-testing has been investigated in the bounded entry model, where one assumes that the entries of $A$ are bounded by $1$ [BCJ20], and one is allowed to query the entries of $A$. This is a restriction of the vector-matrix-vector model that we consider, in which only coordinate vectors may be queried. However, since we consider a more general query model, we are able to give a better adaptive tester: for us, $\widetilde{O}(1/\epsilon)$ vector-matrix-vector queries suffice, beating the $\Omega(1/\epsilon^{2})$ lower bound given in [BCJ20] for entry queries.

Another work on PSD-testing is that of [Han+17], who construct a PSD-tester in the matrix-vector model. They first show how to approximate a general trace function $\sum f(\lambda_{i})$ for sufficiently smooth $f$, by using a Chebyshev polynomial construction to approximate $f$ in the sup-norm over an interval. This allows them to construct an $\ell_{\infty}$-tester by taking $f$ to be a smooth approximation of a shifted Heaviside function. Unfortunately this approach is limited to $\ell_{\infty}$-testers, and does not achieve the optimal bound; they require $\Omega((\log d)/\epsilon)$ matrix-vector queries compared to the $\widetilde{O}((\log d)/\sqrt{\epsilon})$ queries achieved by Krylov iteration.

Spectrum Estimation.

The closely-related problem of spectrum estimation has been considered several times, in the context of sketching the largest $k$ elements of the spectrum [AN13], as discussed above, and approximating the entire spectrum from entry queries in the bounded entry model [BMR21a].

1.4 Notation

A symmetric matrix $A$ is positive semidefinite (PSD) if all of its eigenvalues are non-negative. We use $\Delta^{d}_{+}$ to represent the PSD cone, which is the subset of $d\times d$ symmetric matrices that are PSD.

For a matrix $A$ we use $\left\|A\right\|_{p}$ to denote the Schatten $p$-norm, which is the $\ell_{p}$ norm of the vector of singular values of $A$. The Frobenius norm will play a special role in several places, so we sometimes use the notation $\left\|A\right\|_{F}$ to emphasize this. Additionally, $\left\|A\right\|$ without a subscript indicates the operator norm (which equals $\left\|A\right\|_{\infty}$).

We always use $d$ to indicate the dimension of the matrix being tested, and use $\epsilon<1$ to indicate the parameter in Definition 1.

When applied to vectors, $\left\langle\cdot,\cdot\right\rangle$ indicates the standard inner product on $\mathbb{R}^{n}$. When applied to matrices, it indicates the Frobenius inner product $\left\langle X,Y\right\rangle:=\operatorname{Tr}(X^{T}Y)$.

$S^{d-1}$ indicates the set of all unit vectors in $\mathbb{R}^{d}$.

We use the notation $X^{\dagger}$ to indicate the Moore-Penrose pseudoinverse of $X$.

For a symmetric matrix $A\in\mathbb{R}^{d\times d}$ with eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{d}$, we let $A_{k}$ denote the matrix $A$ with all but the top $k$ eigenvalues zeroed out. Formally, if $U$ is an orthogonal matrix diagonalizing $A$, then $A_{k}=U^{T}\operatorname{diag}(\lambda_{1},\ldots,\lambda_{k},0,\ldots,0)U$. We also let $A_{-k}=A-A_{k}$.

Throughout, we use $c$ to indicate an absolute constant. The value of $c$ may change between instances.

2 Vector-matrix-vector queries

2.1 An optimal one-sided tester.

To construct our vector-matrix-vector tester, we analyze the iteration

$$x^{(k+1)}=x^{(k)}-\eta\left((g^{(k)})^{T}Ax^{(k)}\right)g^{(k)}=\left(I-\eta g^{(k)}(g^{(k)})^{T}A\right)x^{(k)},\qquad(2)$$

where $g^{(k)}\sim\mathcal{N}(0,I_{d})$ and $x^{(0)}\sim\mathcal{N}(0,I_{d})$.

Our algorithm is essentially to run this scheme for a fixed number of iterations with a well-chosen step size $\eta$. If the value of $(x^{(k)})^{T}Ax^{(k)}$ ever becomes negative, then we output False; otherwise we output True. A minimal sketch of the resulting procedure follows.
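In the sketch below, the `query` oracle, step size $\eta$, and iteration count $N$ are assumptions of the illustration (Theorem 7 gives the precise setting), and the running value of $f(x^{(k)})$ is maintained via the update identity of Proposition 5 rather than re-queried, at a cost of two queries per iteration:

```python
import numpy as np

def one_sided_l1_tester(query, d, eta, N, rng=np.random.default_rng()):
    """Illustrative one-sided tester running iteration (2), where
    query(u, v) returns u^T A v (one vector-matrix-vector query).
    Returns (False, x) with a witness x satisfying x^T A x < 0 when
    a negative direction is found, and (True, None) otherwise."""
    x = rng.standard_normal(d)
    fx = query(x, x)                       # f(x) = x^T A x
    for _ in range(N):
        if fx < 0:
            return False, x                # certificate that A is not PSD
        g = rng.standard_normal(d)
        gAx, gAg = query(g, x), query(g, g)
        # Maintain f(x) via Proposition 5, then take the step (2).
        fx -= eta * gAx**2 * (2 - eta * gAg)
        x = x - eta * gAx * g
    return (True, None) if fx >= 0 else (False, x)
```

Using this approach we prove the following.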

Theorem 2.

There exists a one-sided adaptive $\ell_{1}$-tester that makes $O(\frac{1}{\epsilon}\log^{3}\frac{1}{\epsilon})$ vector-matrix-vector queries to $A$.

As an immediate corollary we obtain a bound for $\ell_{p}$-testers.

Corollary 3.

There is a one-sided adaptive $\ell_{p}$-tester that makes $O(\frac{1}{\epsilon}d^{1-1/p}\log^{3}(\frac{1}{\epsilon}d^{1-1/p}))$ vector-matrix-vector queries.

Proof.

This follows from the previous result along with the bound $\left\|A\right\|_{p}\geq d^{1/p-1}\left\|A\right\|_{1}$. ∎

We now turn to the proof of Theorem 2. Since our iterative scheme is rotation-invariant, we assume without loss of generality that $A=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{d})$. For now, we assume that $\left\|A\right\|_{1}\leq 1$, and that the smallest eigenvalue of $A$ is $\lambda_{1}=-\epsilon$. We consider running the algorithm for $N$ iterations. We will show that our iteration finds an $x$ with $x^{T}Ax<0$ in $N=\widetilde{O}(1/\epsilon)$ iterations. We will use $c$ to denote absolute constants that we do not track, and that may vary between uses.

Our key technical lemma shows that the first coordinate (which is associated to the $-\epsilon$ eigenvalue) grows fairly quickly with good probability.

Lemma 4.

Suppose $\eta$ and $N$ satisfy the following assumptions: (1) $\eta\leq\frac{1}{4}$; (2) $\eta^{2}\epsilon N\leq\frac{1}{8}$; (3) $(1+\eta^{2}\epsilon^{2})^{N}\leq\frac{5}{4}$; (4) $(1+\eta\epsilon)^{N}\geq\frac{10}{\epsilon^{2}}$. Then $x_{1}^{(N)}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$.

Proof.

Following [Jai+16] we define the matrix $B_{k}=\prod_{i=1}^{k}\left(I-\eta g^{(i)}(g^{(i)})^{T}A\right)$, where the $g^{(i)}$ are independent $\mathcal{N}(0,I)$ Gaussians. Note that $x^{(k)}=B_{k}x^{(0)}$. We will show that $B_{k}^{T}e_{1}$ has large norm with good probability (in fact we will show that $\left\langle B_{k}^{T}e_{1},e_{1}\right\rangle$ is large). This will then imply that $\left\langle B_{k}x^{(0)},e_{1}\right\rangle$ is large with high probability, where $x^{(0)}\sim\mathcal{N}(0,I)$.

Step 1: Deriving a recurrence for the second moments.

Let $y^{(k)}=B_{k}^{T}e_{1}$ and let $u^{(k)}_{i}$ be the second moment of the coordinate $y_{i}^{(k)}$. Note that $u^{(0)}_{i}=\delta_{1i}$ (where $\delta$ is the Kronecker delta). To simplify the notation, we drop the superscript on the $g$. We compute
$$y_{i}^{(k+1)}=\left((I-\eta Agg^{T})y^{(k)}\right)_{i}=y^{(k)}_{i}-\eta(Ag)_{i}(g_{1}y^{(k)}_{1}+\ldots+g_{d}y^{(k)}_{d})=y_{i}^{(k)}-\eta\lambda_{i}g_{i}(g_{1}y^{(k)}_{1}+\ldots+g_{d}y^{(k)}_{d}).$$

Next we observe that (after grouping terms) the coefficients of the $y_{i}^{(k)}$ terms are pairwise uncorrelated. Using this, along with the fact that the $g_{i}$'s are independent of the $y_{i}^{(k)}$'s, gives

$$u_{i}^{(k+1)}=\mathbb{E}(1-\eta\lambda_{i}g_{i}^{2})^{2}u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j\neq i}u_{j}^{(k)}=(1-2\eta\lambda_{i}+3\eta^{2}\lambda_{i}^{2})u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j\neq i}u_{j}^{(k)}=(1-2\eta\lambda_{i}+2\eta^{2}\lambda_{i}^{2})u_{i}^{(k)}+\eta^{2}\lambda_{i}^{2}\sum_{j=1}^{d}u_{j}^{(k)}.$$

Let $S^{(k)}=u^{(k)}_{1}+\ldots+u^{(k)}_{d}$ and $\gamma_{i}=1-2\eta\lambda_{i}+2\eta^{2}\lambda_{i}^{2}$. Then we can write the recurrence as $u^{(k+1)}_{i}=\gamma_{i}u^{(k)}_{i}+\eta^{2}\lambda_{i}^{2}S^{(k)}$. Iterating this recurrence gives
$$u^{(k)}_{i}=\delta_{1i}\gamma_{i}^{k}+\eta^{2}\lambda_{i}^{2}\left(\gamma_{i}^{k-1}S^{(0)}+\gamma_{i}^{k-2}S^{(1)}+\ldots+S^{(k-1)}\right).\qquad(3)$$

Step 2: Bounding $S^{(k)}$.

Summing the above equation over $i$ allows us to write a recurrence for the $S^{(k)}$'s: $S^{(k)}=\gamma_{1}^{k}+\alpha_{k-1}S^{(0)}+\alpha_{k-2}S^{(1)}+\ldots+\alpha_{0}S^{(k-1)}$, where we define $\alpha_{j}:=\sum_{i=1}^{d}\eta^{2}\lambda_{i}^{2}\gamma_{i}^{j}$.

We split $\alpha_{j}$ into two parts, $\alpha_{j}^{+}$ and $\alpha_{j}^{-}$, corresponding to the terms in the sum where $\lambda_{i}$ is positive or negative respectively. We now use the recurrence to bound $S^{(k)}$. First, by Hölder's inequality,
$$S^{(k)}\leq\gamma_{1}^{k}+\max(S^{(0)},\ldots,S^{(k-1)})(\alpha_{0}^{+}+\ldots+\alpha_{k-1}^{+})+(\alpha_{k-1}^{-}S^{(0)}+\alpha_{k-2}^{-}S^{(1)}+\ldots+\alpha_{0}^{-}S^{(k-1)}).$$

We calculate
$$\sum_{j=0}^{k-1}\alpha_{j}^{+}=\sum_{j=0}^{k-1}\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\gamma_{i}^{j}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\sum_{j=0}^{k-1}\gamma_{i}^{j}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\frac{1-\gamma_{i}^{k}}{1-\gamma_{i}}=\sum_{i:\lambda_{i}>0}\eta^{2}\lambda_{i}^{2}\frac{1-\gamma_{i}^{k}}{2\eta\lambda_{i}-2\eta^{2}\lambda_{i}^{2}}=\sum_{i:\lambda_{i}>0}\eta\lambda_{i}\frac{1-\gamma_{i}^{k}}{2-2\eta\lambda_{i}}\leq\sum_{i:\lambda_{i}>0}\eta\lambda_{i}\leq\eta,$$

where we used that $\eta\lambda_{i}\leq 1/2$ (which is a consequence of Assumption 1), that $\gamma_{i}<1$ (which holds since $\lambda_{i}>0$), and that $\sum_{i:\lambda_{i}>0}\lambda_{i}\leq 1$. Since we assume that $-\epsilon$ is the smallest eigenvalue,

$$\alpha_{j}^{-}\leq\eta^{2}\gamma_{1}^{j}\sum_{i:\lambda_{i}<0}\lambda_{i}^{2}\leq\eta^{2}\gamma_{1}^{j}\epsilon\sum_{i:\lambda_{i}<0}\left|\lambda_{i}\right|\leq\eta^{2}\gamma_{1}^{j}\epsilon.$$

Let $\tilde{S}^{(k)}=\max(S^{(0)},\ldots,S^{(k)})$. Then combining our bounds gives
$$\tilde{S}^{(k)}\leq\max\left(\tilde{S}^{(k-1)},\gamma_{1}^{k}+\eta\tilde{S}^{(k-1)}+\eta^{2}\epsilon(\gamma_{1}^{k-1}\tilde{S}^{(0)}+\gamma_{1}^{k-2}\tilde{S}^{(1)}+\ldots+\tilde{S}^{(k-1)})\right).$$

The next step is to use this recurrence to bound $\tilde{S}^{(k)}$. For this, define $c^{(k)}$ such that $\tilde{S}^{(k)}=c^{(k)}\gamma_{1}^{k}$. Plugging in to the above and dividing through by $\gamma_{1}^{k}$, we get that $c^{(k)}$ satisfies
$$c^{(k)}\leq\max\left(\frac{c^{(k-1)}}{\gamma_{1}},1+\frac{\eta}{\gamma_{1}}c^{(k-1)}+\frac{\eta^{2}\epsilon}{\gamma_{1}}(c^{(0)}+\ldots+c^{(k-1)})\right)\leq\max\left(c^{(k-1)},1+\eta c^{(k-1)}+\eta^{2}\epsilon(c^{(0)}+\ldots+c^{(k-1)})\right),$$

where we used the fact that $\gamma_{1}\geq 1$. Now set $\tilde{c}^{(k)}=\max(c^{(0)},\ldots,c^{(k)})$. By Assumptions 1 and 2, $\eta+\eta^{2}\epsilon k\leq 1/2$. This gives

$$\tilde{c}^{(k)}\leq\max\left(\tilde{c}^{(k-1)},1+\eta\tilde{c}^{(k-1)}+\eta^{2}\epsilon k\tilde{c}^{(k-1)}\right)\leq\max\left(\tilde{c}^{(k-1)},1+\frac{1}{2}\tilde{c}^{(k-1)}\right).$$

Note that $c^{(0)}=S^{(0)}=1$, so a straightforward induction using the above recurrence shows that $\tilde{c}^{(k)}\leq 2$ for all $k$. It follows that $S^{(k)}\leq 2\gamma_{1}^{k}$.

Step 3: Bounding the second moment. Plugging the bound above into (3) gives
$$u_{1}^{(k)}\leq\gamma_{1}^{k}+2k\eta^{2}\epsilon^{2}\gamma_{1}^{k-1}\leq\left(1+2k\eta^{2}\epsilon^{2}\right)\gamma_{1}^{k}.$$

Step 4: Applying Chebyshev. We focus on the first coordinate, $y_{1}^{(k)}$. Note that $I-\eta Agg^{T}$ has expectation $I-\eta A$, so a straightforward induction shows that $\mathbb{E}y_{1}^{(k)}=(1+\eta\epsilon)^{k}$.

Using the bound for the second moment of the first coordinate, we get
$$\frac{u_{1}^{(k)}}{\left(\mathbb{E}y_{1}^{(k)}\right)^{2}}\leq\frac{(1+2k\eta^{2}\epsilon^{2})\gamma_{1}^{k}}{(1+\eta\epsilon)^{2k}}=(1+2k\eta^{2}\epsilon^{2})\left(\frac{1+2\eta\epsilon+2\eta^{2}\epsilon^{2}}{1+2\eta\epsilon+\eta^{2}\epsilon^{2}}\right)^{k}=(1+2k\eta^{2}\epsilon^{2})\left(1+\frac{\eta^{2}\epsilon^{2}}{1+2\eta\epsilon+\eta^{2}\epsilon^{2}}\right)^{k}\leq(1+2k\eta^{2}\epsilon^{2})(1+\eta^{2}\epsilon^{2})^{k}.$$

By Assumptions 2 and 3, $N\eta^{2}\epsilon^{2}\leq 1/8$ and $(1+\eta^{2}\epsilon^{2})^{N}\leq 5/4$, so we get that $u_{1}^{(k)}\leq\frac{25}{16}\left(\mathbb{E}y_{1}^{(k)}\right)^{2}$.

Thus by Chebyshev's inequality, $\mathbb{P}\left(\left|y_{1}^{(k)}-\mathbb{E}(y_{1}^{(k)})\right|\geq 0.9\,\mathbb{E}(y_{1}^{(k)})\right)\leq\frac{25}{36}$. So with probability at least $0.3$, $y_{1}^{(N)}\geq\frac{1}{10}\mathbb{E}(y_{1}^{(N)})=\frac{1}{10}(1+\eta\epsilon)^{N}$.

Under Assumption 4, $(1+\eta\epsilon)^{N}\geq\frac{10}{\epsilon^{2}}$, which means that $y_{1}^{(N)}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.3$.

Step 5: Concluding the argument. We showed that $\left\langle B_{N}^{T}e_{1},e_{1}\right\rangle\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.3$. In particular this implies that $\left\|B_{N}^{T}e_{1}\right\|\geq\frac{1}{\epsilon^{2}}$. Now since $x^{(0)}$ is distributed as $\mathcal{N}(0,I)$, $\left\langle B_{N}x^{(0)},e_{1}\right\rangle=\left\langle x^{(0)},B_{N}^{T}e_{1}\right\rangle\sim\mathcal{N}(0,\left\|B_{N}^{T}e_{1}\right\|^{2})$, which is at least $\left\|B_{N}^{T}e_{1}\right\|$ in magnitude with $0.67$ probability. It follows that $x^{(N)}_{1}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$. ∎

Let $f(x)=x^{T}Ax$. We next understand how the value of $f(x^{(k)})$ is updated on each iteration.

Proposition 5.

For $g\sim\mathcal{N}(0,I)$, we have $f(x^{(k)})-f(x^{(k+1)})=\eta(g^{T}Ax^{(k)})^{2}(2-\eta g^{T}Ag)$.

Proof.

Plugging in the update rule and expanding gives

$$f(x^{(k+1)})=(x^{(k)})^{T}Ax^{(k)}-2\eta(g^{T}Ax^{(k)})^{2}+\eta^{2}(g^{T}Ax^{(k)})^{2}g^{T}Ag=(x^{(k)})^{T}Ax^{(k)}-\eta(g^{T}Ax^{(k)})^{2}(2-\eta g^{T}Ag),$$

from which the proposition follows. ∎

A consequence of this update is that the sequence $f(x^{(k)})$ is almost guaranteed to be decreasing, as long as $\eta$ is chosen small enough.

Proposition 6.

Assume that $\operatorname{Tr}(A)\leq 1$ and that $\eta<c$. After $N$ iterations, $f(x^{(N)})\leq f(x^{(0)})$ with probability at least $99/100$, provided that $\eta\leq\frac{c}{\log N+1}$.

Proof.

We show something stronger; namely, that for the first $N$ iterations, the sequence $f(x^{(k)})$ is decreasing. By Proposition 5, $f(x^{(k+1)})\leq f(x^{(k)})$ as long as $g^{T}Ag\leq\frac{2}{\eta}$. The probability that this does not occur is $\Pr\left(\sum\lambda_{i}g_{i}^{2}\geq\frac{2}{\eta}\right)\leq\Pr\left(\sum\lambda_{i}(g_{i}^{2}-1)\geq\frac{2}{\eta}-1\right)$.

The $g_{i}^{2}-1$ terms are independent subexponential random variables. So by Bernstein's inequality (see [Ver18] Theorem 2.8.2 for the version used here), this probability is bounded by $2\exp(-c/\eta)$ as long as $\eta$ is a sufficiently small constant. Taking a union bound gives that $f(x^{(N)})\leq f(x^{(0)})$ with probability at least $1-2N\exp(-c/\eta)$, which is at least $99/100$ under the conditions given. ∎

Theorem 7.

Suppose that $\left\|A\right\|_{1}\leq 1$, $\epsilon<1/2$, and that $A$ has $-\epsilon$ as an eigenvalue. If we take $\eta$ such that $c\epsilon^{2}\leq\eta\leq\min\left(\frac{1}{32\log(10/\epsilon^{2})},\frac{c}{\log\frac{1}{\epsilon}}\right)$, then for some $N=\Theta\left(\frac{1}{\epsilon\eta}\log\frac{1}{\epsilon}\right)$ we have $f(x^{(N)})<0$ with probability at least $1/10$.

Proof.

Given an $\eta$ as in the statement of the theorem, choose $N=\left\lceil\frac{2}{\eta\epsilon}\log\frac{10}{\epsilon^{2}}\right\rceil$, which satisfies the assumptions of Lemma 4. Then $x^{(N)}_{1}\geq\frac{1}{\epsilon^{2}}$ with probability at least $0.2$. By Proposition 6, $f(x^{(N)})\leq f(x^{(0)})\leq 2$ with probability at least $0.99$, using the fact that $\eta\leq\frac{c}{\log\frac{1}{\epsilon}}$ for an appropriately chosen absolute constant $c$, such that the hypothesis of Proposition 6 holds.

If $f(x^{(N)})<0$, then the algorithm has already terminated. Otherwise, conditioned on the events in the above paragraph, we have with probability at least $0.8$ that $2-\eta(g^{(N)})^{T}Ag^{(N)}\geq\frac{1}{2}$ and that $(g^{(N)})^{T}Ax^{(N)}\sim\mathcal{N}\left(0,\|Ax^{(N)}\|^{2}\right)$ has magnitude at least
$$\frac{1}{3}\left\|Ax^{(N)}\right\|\geq\frac{1}{3}|\lambda_{1}|x^{(N)}_{1}\geq\frac{1}{3\epsilon^{2}}|\lambda_{1}|\geq\frac{1}{3\epsilon}.$$

Then by Proposition 5 it follows that $f(x^{(N+1)})\leq f(x^{(N)})-\frac{\eta}{20\epsilon^{2}}\leq 2-\frac{\eta}{20\epsilon^{2}}<0$. ∎

We also observe that we can reduce the dimension of the problem by using a result of Andoni and Nguyen. This allows us to avoid a $\log d$ dependence.

Proposition 8.

Suppose that $A$ satisfies $\lambda_{\min}(A)<-\alpha\left\|A\right\|_{1}$, and let $G\in\mathbb{R}^{d\times m}$ have independent $\mathcal{N}(0,\frac{1}{d})$ entries. Then we can choose $m=O(1/\alpha)$ such that $\lambda_{\min}(G^{T}AG)<-\alpha/2$ and $\left\|G^{T}AG\right\|_{1}\leq 2\left\|A\right\|_{1}$.

Proof.

For the first claim, we simply apply Theorem 1.1 in [AN13] and (in their notation) set $\epsilon=O(1)$ and $k=O(1/\alpha)$.

To show that the Schatten $1$-norm does not grow too much under the sketch, we first write $A=A_{+}+A_{-}$, where the nonzero eigenvalues of $A_{+}$ are exactly the positive eigenvalues of $A$. Then using the usual analysis of Hutchinson's trace estimator (see [Mey+21] for example), we have
$$\left\|G^{T}AG\right\|_{1}\leq\left\|G^{T}A_{+}G\right\|_{1}+\left\|G^{T}A_{-}G\right\|_{1}=\operatorname{Tr}(G^{T}A_{+}G)-\operatorname{Tr}(G^{T}A_{-}G)$$
$$=(1\pm O(1/\sqrt{m}))\operatorname{Tr}(A_{+})-(1\pm O(1/\sqrt{m}))\operatorname{Tr}(A_{-})\leq(1+O(1/\sqrt{m}))\left\|A\right\|_{1}.\qquad\text{∎}$$

We are now ready to give the proof of Theorem 2.

Proof.

The above result applies after scaling the $\eta$ given in Theorem 7 by $1/\left\|A\right\|_{1}$. So it suffices to choose $\eta$ to be bounded above by
$$\frac{1}{\left\|A\right\|_{1}}\min\left(\frac{1}{32\log(10/\epsilon^{2})},\frac{c}{\log\frac{1}{\epsilon}}\right),$$
and within a constant factor of this value.

To choose an $\eta$, pick a standard Gaussian vector $g$, and compute $Ag$ using $1/\epsilon$ vector-matrix-vector queries. Then with constant probability, $\lambda_{\max}(A)\leq\left\|Ag\right\|\leq 2d\lambda_{\max}$. Given this, we have
$$d\left\|Ag\right\|\geq\left\|A\right\|_{1}\geq\frac{\left\|Ag\right\|}{2d},\qquad(4)$$
which allows us to approximate $\left\|A\right\|_{1}$ to within a factor of $d^{2}$ with constant probability. Given this, one may simply try the above algorithm with an $\eta$ at each of $O(\log(d^{2}))=O(\log d)$ different scales, at the cost of an extra $\log d$ factor.

Finally, we may improve the $\log d$ factor to a $\log(1/\epsilon)$ factor by using Proposition 8 to sketch $A$, and then applying the above analysis to $G^{T}AG$. Note that the sketch may be used implicitly; once $G$ is chosen, a vector-matrix-vector query to $G^{T}AG$ can be simulated with a single vector-matrix-vector query to $A$. ∎
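Putting the pieces together, a hypothetical driver following this proof might look as follows; for simplicity it forms $Ag$ with $d$ elementary queries (rather than the $O(1/\epsilon)$ queries available after sketching with Proposition 8), it reuses the `one_sided_l1_tester` sketch from above, and its constants are illustrative:

```python
import numpy as np

def l1_test_all_scales(query, d, eps, rng=np.random.default_rng()):
    """Hypothetical driver for Theorem 2: estimate ||A||_1 to within a
    factor of d^2 via ||Ag|| (inequality (4)), then run the tester at
    O(log d) geometric scales of eta.  Assumes eps < 1/2."""
    g = rng.standard_normal(d)
    Ag = np.array([query(e, g) for e in np.eye(d)])  # simplification: d queries
    lo, hi = np.linalg.norm(Ag) / (2 * d), d * np.linalg.norm(Ag)
    base = min(1 / (32 * np.log(10 / eps**2)), 0.1 / np.log(1 / eps))
    N = int(np.ceil(2 / (base * eps) * np.log(10 / eps**2)))
    norm_guess = lo
    while norm_guess <= hi:                          # O(log d) scales
        psd, x = one_sided_l1_tester(query, d, base / norm_guess, N, rng)
        if not psd:
            return False, x                          # witness found
        norm_guess *= 2
    return True, None
```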

2.2 Lower bounds

We will show a bound for two-sided testers which will imply that the bound for $\ell_{1}$-testers given in Theorem 2 is tight up to $\log$ factors. If we require the tester to have one-sided error, then we additionally show that the bound in Corollary 3 is tight for all $p$. Note that this distinction between one-sided and two-sided testers is necessary given Theorem 29.

In order to obtain these lower bounds for adaptive testers, we first show corresponding lower bounds for non-adaptive testers. A minor modification to Lemma 3.1 in [Sun+19] shows that an adaptive tester can have at most quadratic improvement over a non-adaptive tester. This will allow us to obtain our adaptive lower bounds as a consequence of the non-adaptive bounds.

2.2.1 Non-adaptive lower bounds

We first observe that a one-sided tester must always be able to produce a witness matrix $X$ that at least certifies that $A$ is not positive definite.

Proposition 9.

If a one-sided tester makes a series of symmetric linear measurements $\left\langle M_{i},A\right\rangle$ of $A$, and outputs False on a given instance, then there must exist a nonzero $X\in\operatorname{span}(M_{1},\ldots,M_{k})$ such that $X$ is PSD and $\left\langle X,A\right\rangle\leq 0$.

Proof.

We work within the space of symmetric matrices. Let $W=\operatorname{span}(M_{1},\ldots,M_{k})$, and let $\phi(X)=\left\langle A,X\right\rangle$ be the linear functional associated with $A$. Now suppose that $\phi$ is strictly positive for all nonzero $X\in W\cap\Delta^{d}_{+}$. We will construct $\widetilde{\phi}$ that agrees with $\phi$ on $W$ and is non-negative on $\Delta^{d}_{+}$.

Let $W^{\prime}=\ker(\phi)\cap W$, and note that $W^{\prime}\cap\Delta^{d}_{+}=\{0\}$. Now by convexity of $\Delta^{d}_{+}$, there exists a hyperplane $H$ and associated half-space $H^{+}$ such that (i) $H$ contains $W^{\prime}$, (ii) $H\cap\Delta^{d}_{+}=\{0\}$, (iii) $H^{+}\supseteq\Delta^{d}_{+}$, and (iv) $\phi$ is non-negative on $H^{+}\cap W$. Moreover, since $W^{\prime}$ intersects $\Delta^{d}_{+}$ trivially, $H$ can be chosen such that $H\cap W=W^{\prime}$. Now let $\Pi$ be a projection onto $W$ that maps $H$ to $W^{\prime}$, and choose $\widetilde{\phi}=\phi\circ\Pi$.

The linear functional $\widetilde{\phi}$ is represented by the inner product against some symmetric matrix $B$. By construction of $\widetilde{\phi}$, we have $\left\langle B,M_{i}\right\rangle=\left\langle A,M_{i}\right\rangle$ for all $i$, and also $\left\langle B,X\right\rangle\geq 0$ for all PSD $X$. So in particular $\left\langle B,xx^{T}\right\rangle=x^{T}Bx\geq 0$ for all $x$, which implies that $B$ is PSD. Given the existence of the PSD matrix $B$ consistent with all measurements, the one-sided tester must not reject. ∎

We are now able to give an explicit non-PSD spectrum which is hard for any one-sided tester. Specifically, we show that it is hard for any vector-matrix-vector query algorithm to produce a witness $X$ in the sense of the proposition above.

Theorem 10.

Let $\lambda>0$ and suppose that, for all matrices $A$ with spectrum $(-\lambda,1,\ldots,1)$, a non-adaptive one-sided tester $\mathcal{T}$ outputs False with probability $2/3$. Then $\mathcal{T}$ must make at least $\frac{1}{9}\left(\frac{d}{1+\lambda}\right)^{2}$ vector-matrix-vector queries.

Proof.

By the polarization identity
$$x^{T}Ay=\frac{1}{2}\left((x+y)^{T}A(x+y)-y^{T}Ay-x^{T}Ax\right),$$
we may assume that all queries are of the form $x_{i}^{T}Ax_{i}$, at the cost of at most a factor-of-three increase in the number of queries.

We set $A=I-(1+\lambda)vv^{T}$ where $v$ is uniform over $S^{d-1}$, and let $W=\operatorname{span}(x_{1}x_{1}^{T},\ldots,x_{k}x_{k}^{T})$. By Proposition 9, the tester may only reject if there is an $X$ in $W\cap\Delta^{d}_{+}$ with $\left\|X\right\|_{F}=1$ such that $\left\langle X,A\right\rangle\leq 0$. For such an $X$ we have
$$\left\langle vv^{T},X\right\rangle\geq\frac{\operatorname{Tr}(X)}{1+\lambda}\geq\frac{1}{1+\lambda}.\qquad(5)$$

But since $vv^{T}$ and $X$ both have unit norm and $X\in W$, this condition implies that $\left\|\Pi_{W}(vv^{T})\right\|_{F}\geq\frac{1}{1+\lambda}$.

Now we turn to understanding $\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})$. Indeed we have the following:

Lemma 11.

Let $v$ be drawn uniformly from $S^{d-1}$, and let $W$ be a $k$-dimensional subspace of the $d\times d$ symmetric matrices. Let $\alpha_{4}=\mathbb{E}(v_{1}^{4})$ and $\alpha_{22}=\mathbb{E}(v_{1}^{2}v_{2}^{2})$. Then
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=(\alpha_{4}-\alpha_{22})k+\alpha_{22}\left\|\Pi_{W}(I)\right\|_{F}^{2},$$
where $I$ is the identity matrix.

Proof.

Let $M_{1},\ldots,M_{k}$ be an orthonormal basis for $W$. By the Pythagorean theorem,
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=\sum_{i=1}^{k}\mathbb{E}(\left\|\Pi_{M_{i}}(vv^{T})\right\|_{F}^{2}).\qquad(6)$$

For fixed $M$ we have
$$\mathbb{E}(\left\|\Pi_{M}(vv^{T})\right\|_{F}^{2})=\mathbb{E}(\left\langle vv^{T},M\right\rangle^{2})=\mathbb{E}(\operatorname{Tr}(vv^{T}M)^{2})=\mathbb{E}((v^{T}Mv)^{2}).\qquad(7)$$

Since $M$ is symmetric, we can diagonalize $M$ to $D=\operatorname{diag}(a_{1},\ldots,a_{d})$ in some orthonormal basis. Since $M$ has unit norm, $a_{1}^{2}+\ldots+a_{d}^{2}=1$. Then we have

$$\mathbb{E}((v^{T}Mv)^{2})=\mathbb{E}((v^{T}Dv)^{2})=\mathbb{E}((a_{1}v_{1}^{2}+\ldots+a_{d}v_{d}^{2})^{2})=\alpha_{4}(a_{1}^{2}+\ldots+a_{d}^{2})+2\alpha_{22}\sum_{i<j}a_{i}a_{j}=\alpha_{4}+2\alpha_{22}\sum_{i<j}a_{i}a_{j}.$$

Next observe that
$$\operatorname{Tr}(M)^{2}=(a_{1}+\ldots+a_{d})^{2}=a_{1}^{2}+\ldots+a_{d}^{2}+2\sum_{i<j}a_{i}a_{j}=1+2\sum_{i<j}a_{i}a_{j},$$
so that
$$\mathbb{E}((v^{T}Mv)^{2})=\alpha_{4}+\alpha_{22}(\operatorname{Tr}(M)^{2}-1).$$

Combining with (6) gives
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})=\sum_{i=1}^{k}\left(\alpha_{4}+\alpha_{22}(\operatorname{Tr}(M_{i})^{2}-1)\right)=(\alpha_{4}-\alpha_{22})k+\alpha_{22}\sum_{i=1}^{k}\operatorname{Tr}(M_{i})^{2}.$$

Finally, observe that $\sum_{i=1}^{k}\operatorname{Tr}(M_{i})^{2}=\sum_{i=1}^{k}\left\langle I,M_{i}\right\rangle^{2}=\left\|\Pi_{W}(I)\right\|_{F}^{2}$ by the Pythagorean theorem, which finishes the proof. ∎

Remark 12.

While approximations would suffice, this result gives a quick way to compute $\alpha_{4}$ and $\alpha_{22}$. Set $W$ to be the entire space of $d\times d$ symmetric matrices, so that $k=d(d+1)/2$. The previous result gives
$$1=\frac{d(d+1)}{2}(\alpha_{4}-\alpha_{22})+d\alpha_{22}.$$

On the other hand, by expanding $1=(v_{1}^{2}+\ldots+v_{d}^{2})^{2}$, we have
$$1=d\alpha_{4}+d(d-1)\alpha_{22}.$$

Solving the system yields $\alpha_{4}=\frac{3}{d(d+2)}$ and $\alpha_{22}=\frac{1}{d(d+2)}$.
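As a quick sanity check, these values can be verified by Monte Carlo (an illustrative snippet, not part of the argument):

```python
import numpy as np

# Check alpha_4 = 3/(d(d+2)) and alpha_22 = 1/(d(d+2)) for v uniform
# on the sphere S^{d-1}.
d, n = 10, 200_000
rng = np.random.default_rng(1)
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)        # uniform on S^{d-1}
print(np.mean(v[:, 0]**4), 3 / (d * (d + 2)))        # both ~ 0.025
print(np.mean(v[:, 0]**2 * v[:, 1]**2), 1 / (d * (d + 2)))  # both ~ 0.0083
```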

To finish the proof of the theorem, we recall that $W$ is spanned by the matrices $x_{1}x_{1}^{T},\ldots,x_{k}x_{k}^{T}$, each of which has rank one. Therefore each matrix in $W$, and in particular $\Pi_{W}(I)$, has rank at most $k$.

We recall, for a general matrix $A$, that $\operatorname{argmin}_{\operatorname{rk}(U)\leq k}\left\|A-U\right\|_{F}$ is obtained by truncating all but the largest $k$ singular values of $A$. Applying this to the identity matrix, when $k\leq d$, we see that
$$\left\|\Pi_{W}(I)\right\|_{F}^{2}=\left\|I\right\|_{F}^{2}-\left\|\Pi_{W^{\perp}}(I)\right\|_{F}^{2}\leq d-(d-k)=k,$$
since $\left\|\Pi_{W^{\perp}}(I)\right\|_{F}^{2}\geq\min_{\operatorname{rk}(U)\leq k}\left\|I-U\right\|_{F}^{2}=d-k$. Since $\left\|I\right\|_{F}^{2}=d$, we always have $\left\|\Pi_{W}(I)\right\|_{F}^{2}\leq k$.

Combining this fact with Lemma 11 gives
$$\mathbb{E}(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2})\leq(\alpha_{4}-\alpha_{22})k+\alpha_{22}k=k\alpha_{4}=\frac{3k}{d(d+2)}\leq\frac{3k}{d^{2}},$$
and by Markov's inequality,
$$\mathbb{P}\left(\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2}>\frac{9k}{d^{2}}\right)\leq\frac{1}{3}.$$

So with probability $2/3$, $\left\|\Pi_{W}(vv^{T})\right\|_{F}^{2}\leq\frac{9k}{d^{2}}$. But for $\mathcal{T}$ to be correct, we saw that we must have $\left\|\Pi_{W}(vv^{T})\right\|_{F}\geq\frac{1}{1+\lambda}$ with probability $2/3$. It follows that
$$\left(\frac{1}{1+\lambda}\right)^{2}\leq\frac{9k}{d^{2}},$$
which implies that
$$k\geq\frac{1}{9}\left(\frac{d}{1+\lambda}\right)^{2}.\qquad\text{∎}$$

In particular, this result implies that a one-sided non-adaptive p\ell_{p}-tester with poly(1/ϵ)\text{poly}(1/\epsilon) query complexity can only exist for p=1.p=1.

Theorem 13.

A one-sided non-adaptive p\ell_{p}-tester must make at least Ω(1ϵ2d22/p)\Omega(\frac{1}{\epsilon^{2}}d^{2-2/p}) vector-matrix-vector queries.

Proof.

This follows as a corollary of Theorem 10; simply apply that result to the spectrum (ϵ(d1)1/p,1,,1)(-\epsilon(d-1)^{1/p},1,\ldots,1) where there are d1d-1 11’s. ∎

2.2.2 Adaptive lower bounds

As remarked earlier, our adaptive lower bounds follow as a corollary of our non-adaptive bounds, and a slightly modified version of Lemma 3.1 in [Sun+19], which we give here.

Lemma 14.

Let A=XΣXTA=X\Sigma X^{T} be a random symmetric d×dd\times d real-valued matrix, with Σ\Sigma diagonal, and where XX is orthogonal and sampled from the rotationally invariant (Haar) distribution. Any ss adaptive vector-matrix-vector queries to AA may be simulated by O(s2)O(s^{2}) non-adaptive vector-matrix-vector queries.

Proof.

(Sketch) First note that the adaptive protocol may be simulated by 3s3s adaptive quadratic form queries, of the form xTAxx^{T}Ax by the polarization identity

xTAy=12((x+y)TA(x+y)xTAxyTAy).x^{T}Ay=\frac{1}{2}\left((x+y)^{T}A(x+y)-x^{T}Ax-y^{T}Ay\right). (8)

These queries in turn may be simulated by 9s29s^{2} non-adaptive queries by following exactly the same proof as Lemma 3.1 in [Sun+19] (but now with ui=viu_{i}=v_{i} in their proof). ∎
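To illustrate the polarization step in (8), the following minimal sketch (our illustration) simulates one bilinear query from three quadratic-form queries; note that polarization recovers x^T A y exactly only when A is symmetric, which is the setting here.

    import numpy as np

    def quad(A, x):
        # A single quadratic-form query x^T A x.
        return x @ A @ x

    def bilinear(A, x, y):
        # Simulates x^T A y using three quadratic-form queries, via polarization.
        return 0.5 * (quad(A, x + y) - quad(A, x) - quad(A, y))

    rng = np.random.default_rng(1)
    M = rng.standard_normal((6, 6))
    A = (M + M.T) / 2                      # symmetric input
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    assert np.isclose(bilinear(A, x, y), x @ A @ y)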

As a direct consequence of this fact and our Theorem 13 we obtain the following.

Theorem 15.

An adaptive one-sided p\ell_{p}-tester must make at least Ω(1ϵd11/p)\Omega(\frac{1}{\epsilon}d^{1-1/p}) vector-matrix-vector queries.

3 Adaptive matrix-vector queries

We analyze random Krylov iteration. Namely, we begin with a random g𝒩(0,Id)g\sim\mathcal{N}(0,I_{d}) and construct the sequence of iterates g,Ag,A2g,,Akgg,Ag,A^{2}g,\ldots,A^{k}g using kk adaptive matrix-vector queries. The span of these vectors is denoted 𝒦k(g)\mathcal{K}_{k}(g) and referred to as the kth{k}^{\text{th}} Krylov subspace.

Krylov iteration suggests a very simple algorithm. First compute g,Ag,,Ak+1g.g,Ag,\ldots,A^{k+1}g. If 𝒦k(g)\mathcal{K}_{k}(g) contains a vector vv such that vTAv<0v^{T}Av<0 then output False, otherwise output True. (Note that one can compute AvAv and hence vTAvv^{T}Av for all such vv, given the k+1k+1 matrix-vector queries.) We show that this simple algorithm is in fact optimal.

As a point of implementation, we note that the above condition on 𝒦k(g)\mathcal{K}_{k}(g) can be checked algorithmically. One first uses Gram-Schmidt to compute the projection Π\Pi onto 𝒦k(g)\mathcal{K}_{k}(g). The existence of a v𝒦k(g)v\in\mathcal{K}_{k}(g) with vTAv<0v^{T}Av<0 is equivalent to the condition λmin(ΠAΠ)<0\lambda_{\min}(\Pi A\Pi)<0. When AA is ϵ\epsilon-far from PSD, the proof below will show that in fact λmin(ΠAΠ)<Ω(ϵ)Ap\lambda_{\min}(\Pi A\Pi)<-\Omega(\epsilon)\left\|A\right\|_{p}, so it suffices to estimate λmin(ΠAΠ)\lambda_{\min}(\Pi A\Pi) to within O(ϵ)ApO(\epsilon)\left\|A\right\|_{p} accuracy.
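The following is a minimal sketch of this check (our illustration, assuming access to A through a matvec callback; for simplicity it spends k+1 extra matrix-vector queries re-applying A to the orthonormalized basis rather than reusing the stored iterates):

    import numpy as np

    def krylov_psd_check(matvec, d, k, rng):
        # Build the Krylov basis g, Ag, ..., A^k g with k matrix-vector queries.
        V = [rng.standard_normal(d)]
        for _ in range(k):
            V.append(matvec(V[-1]))
        # Gram-Schmidt via QR gives an orthonormal basis Q of the Krylov space.
        Q, _ = np.linalg.qr(np.column_stack(V))
        # Compression of A to the Krylov space; a negative eigenvalue here
        # witnesses a v in the Krylov space with v^T A v < 0.
        H = Q.T @ np.column_stack([matvec(q) for q in Q.T])
        H = (H + H.T) / 2                  # symmetrize against round-off
        return np.linalg.eigvalsh(H).min() >= 0  # True = consistent with PSD

    rng = np.random.default_rng(2)
    d = 200
    eigs = np.linspace(0.0, 1.0, d)
    eigs[0] = -0.5                         # one well-separated negative eigenvalue
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = (U * eigs) @ U.T
    print(krylov_psd_check(lambda x: A @ x, d=d, k=15, rng=rng))  # False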

Proposition 16.

For r>0r>0, α>0\alpha>0 and δ>0\delta>0 there exists a polynomial pp of degree O(rαlog1δ)O(\frac{\sqrt{r}}{\sqrt{\alpha}}\log\frac{1}{\delta}), such that p(α)=1p(-\alpha)=1 and |p(x)|δ\left|p(x)\right|\leq\delta for all x[0,r].x\in[0,r].

Proof.

Recall that the degree dd Chebyshev polynomial TdT_{d} is bounded by 11 in absolute value on [1,1][-1,1] and satisfies

Td(1+γ)2dγ1.T_{d}(1+\gamma)\geq 2^{d\sqrt{\gamma}-1}.

(See [MM15] for example.) The proposition follows by shifting and scaling TdT_{d}. ∎
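Concretely, the shift-and-scale can be realized as follows (our sketch; the degree constant is illustrative rather than optimized). It maps [0, r] to [-1, 1], normalizes by the value at -alpha, and checks the two properties numerically.

    import numpy as np
    from numpy.polynomial.chebyshev import chebval

    def tester_poly(r, alpha, delta):
        # Degree as in Proposition 16, up to an unoptimized constant.
        m = int(np.ceil(np.sqrt(r / alpha) * np.log(4.0 / delta)))
        c = np.zeros(m + 1)
        c[m] = 1.0                                    # coefficients of T_m
        denom = chebval((2.0 * (-alpha) - r) / r, c)  # T_m at the image of -alpha
        def p(x):
            return chebval((2.0 * np.asarray(x) - r) / r, c) / denom
        return p, m

    p, m = tester_poly(r=1.0, alpha=0.01, delta=1e-3)
    xs = np.linspace(0.0, 1.0, 2000)
    print(p(-0.01))              # exactly 1 by construction
    print(np.abs(p(xs)).max())   # below delta = 1e-3 on [0, r]
    print(m)                     # degree O(sqrt(r/alpha) log(1/delta))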

Theorem 17.

Suppose that AA has an eigenvalue λmin\lambda_{\min} with λminϵAp.\lambda_{\min}\leq-\epsilon\left\|A\right\|_{p}. When p=1p=1, the Krylov subspace 𝒦k(g)\mathcal{K}_{k}(g) contains a vector vv with vTAv<0v^{T}Av<0 for k=O((1ϵ)13log1ϵ).k=O\left(\left(\frac{1}{\epsilon}\right)^{\frac{1}{3}}\log\frac{1}{\epsilon}\right). When p(1,]p\in(1,\infty], the same conclusion holds for k=O((1ϵ)p2p+1log1ϵlogd).k=O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\log d\right).

Proof.

Without loss of generality, assume that Ap1\left\|A\right\|_{p}\leq 1. Fix a value TT to be determined later, effectively corresponding to the number of top eigenvalues that we deflate. By Proposition 16 we can construct a polynomial qq, such that q(λmin)=1q(\lambda_{\min})=1 and |q(x)|ϵ/10d11/p\left|q(x)\right|\leq\sqrt{\frac{\epsilon/10}{d^{1-1/p}}} for x[0,T1/p]x\in[0,T^{-1/p}] with

deg(q)CT1/(2p)ϵlog(d11/pϵ/10),\deg(q)\leq C\frac{T^{-1/(2p)}}{\sqrt{\epsilon}}\log\left(\sqrt{\frac{d^{1-1/p}}{\epsilon/10}}\right), (9)

where CC is an absolute constant.

Now set

p(x)=q(x)i:λi>T1/pλixλiλmin.p(x)=q(x)\prod_{i:\lambda_{i}>T^{-1/p}}\frac{\lambda_{i}-x}{\lambda_{i}-\lambda_{\min}}. (10)

Since we assume Ap1,\left\|A\right\|_{p}\leq 1, there are at most TT terms in the product, so

deg(p)T+CT1/(2p)ϵlog(d11/pϵ/10).\deg(p)\leq T+C\frac{T^{-1/(2p)}}{\sqrt{\epsilon}}\log\left(\sqrt{\frac{d^{1-1/p}}{\epsilon/10}}\right). (11)

By setting T=ϵp/(2p+1),T=\epsilon^{-p/(2p+1)}, we get

deg(p)={O((1ϵ)p2p+1log1ϵ)if p=1O((1ϵ)p2p+1log1ϵlogd)if p>1\deg(p)=\begin{cases}O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\right)\quad&\text{if }p=1\\ O\left(\left(\frac{1}{\epsilon}\right)^{\frac{p}{2p+1}}\log\frac{1}{\epsilon}\log d\right)\quad&\text{if }p>1\end{cases} (12)

As long as kk is at least deg(p)\deg(p), then v=p(A)gv=p(A)g lies in 𝒦k(g),\mathcal{K}_{k}(g), and

vTAv=gTp(A)2Ag.v^{T}Av=g^{T}p(A)^{2}Ag. (13)

By construction, p(λmin)=1p(\lambda_{\min})=1. Also for all xx in [0,T1/p],[0,T^{-1/p}], |p(x)||q(x)|(ϵ/10)d1/p1.|p(x)|\leq|q(x)|\leq\sqrt{(\epsilon/10)d^{1/p-1}}.

Therefore the matrix p(A)2Ap(A)^{2}A has at least one eigenvalue less than ϵ-\epsilon, and the positive eigenvalues sum to at most

i:λi>0ϵ10d1/p1λiϵ10,\sum_{i:\lambda_{i}>0}\frac{\epsilon}{10}d^{1/p-1}\lambda_{i}\leq\frac{\epsilon}{10}, (14)

by using Hölder’s inequality along with the fact that Ap1.\left\|A\right\|_{p}\leq 1. The ϵ-\epsilon eigendirection contributes ϵg02-\epsilon g_{0}^{2} for a standard Gaussian coefficient g0g_{0}, while the positive contribution has expectation at most ϵ/10\epsilon/10, so gTp(A)2Ag<0g^{T}p(A)^{2}Ag<0 with constant probability, which may be amplified to 2/32/3 as desired. ∎

Remark 18.

For 1<p<1<p<\infty, the logd\log d dependence can be removed by simply applying the p=1p=1 tester to Ap,A^{\lceil p\rceil}, as a matrix-vector query to ApA^{\lceil p\rceil} may be simulated via p\lceil p\rceil matrix-vector queries to A.A. However this comes at the cost of a (1ϵ)p/3\left(\frac{1}{\epsilon}\right)^{\lceil p\rceil/3} dependence, and is therefore only an improvement when dd is extremely large.

Remark 19.

While we observe that deflation of the top eigenvalues can be carried out implicitly within the Krylov space, this can also be done explicitly using block Krylov iteration, along with the guarantee given in Theorem 1 of [MM15].

We showed above that we could improve upon the usual analysis of Krylov iteration in our context. We next establish a matching lower bound that shows our analysis is tight up to log\log factors. This is a corollary of the proof of Theorem 3.1 presented in [Bra+20].

Theorem 20.

A two-sided, adaptive p\ell_{p}-tester in the matrix-vector model must in general make at least Ω(1ϵp/(2p+1))\Omega(\frac{1}{\epsilon^{p/(2p+1)}}) queries.

Proof.

We make use of the proof of Theorem 3.1 given in [Bra+20]. We consider an algorithm 𝒜\mathcal{A} that receives a matrix WW sampled from the Wishart distribution, makes at most (1β)d(1-\beta)d queries, and outputs either True or False, depending on whether λmin(W)\lambda_{\min}(W) is greater than or less than tt (where t=1/(2d2)t=1/(2d^{2}) is defined as in [Bra+20]). We say that 𝒜\mathcal{A} fails on a given instance if either (i) 𝒜\mathcal{A} outputs True and t14d2λmin(W)t-\frac{1}{4d^{2}}\geq\lambda_{\min}(W) or (ii) 𝒜\mathcal{A} outputs False and λmin(W)t+14d2.\lambda_{\min}(W)\geq t+\frac{1}{4d^{2}}. Exactly the same proof given in [Bra+20] shows that 𝒜\mathcal{A} must fail with probability at least cwishβc_{\text{wish}}\sqrt{\beta} where cwish>0c_{\text{wish}}>0 is an absolute constant, as long as dd is chosen sufficiently large depending on β.\beta. Taking β=1/4\beta=1/4, say, means that any such algorithm fails with probability at least cwish/2c_{\text{wish}}/2 as long as dd is a large enough constant.

Now consider an p\ell_{p}-tester 𝒯\mathcal{T} with d=1/(4ϵp/(2p+1))d=1/(4\epsilon^{p/(2p+1)}), applied to the random matrix WtI.W-tI. While our definition allows 𝒯\mathcal{T} to fail with 1/31/3 probability we can reduce this failure probability to cwish/2c_{\text{wish}}/2 by running a constant number of independent instances and taking a majority vote. So from here on we assume that 𝒯\mathcal{T} fails on a given instance with probability at most cwish/2c_{\text{wish}}/2.

First recall that WXXTW\sim XX^{T} where each entry of XX is i.i.d. 𝒩(0,1/d).\mathcal{N}(0,1/d). Then with high probability, the operator norm of XX is bounded, say, by 2\sqrt{2}, and the eigenvalues of WW are bounded by 2.2.

Therefore with high probability, Wp2d1/p,\left\|W\right\|_{p}\leq 2d^{1/p}, and so WtIp3d1/p.\left\|W-tI\right\|_{p}\leq 3d^{1/p}. It follows that 1/(4d2)=4ϵ(4d)1/pϵWtIp.1/(4d^{2})=4\epsilon(4d)^{1/p}\geq\epsilon\left\|W-tI\right\|_{p}. This means that 𝒯\mathcal{T} can solve the problem above, and by correctness of the tester, fails with at most cwish/2c_{\text{wish}}/2 probability. For ϵ\epsilon sufficiently small, the above analysis implies that 𝒯\mathcal{T} must make at least Ω(d)=Ω(1/ϵp/(2p+1))\Omega(d)=\Omega(1/\epsilon^{p/(2p+1)}) queries.

4 An optimal bilinear sketch

In this section we analyze a bilinear sketch for PSD-testing which will also yield an optimal 2\ell_{2}-tester in the vector-matrix-vector model.

Our sketch is very simple. We choose Gd×kG\in\mathbb{R}^{d\times k} to have independent 𝒩(0,1)\mathcal{N}(0,1) entries and take our sketch to be GTAG.G^{T}AG. In parallel we construct estimates α\alpha and β\beta for the trace and Frobenius norm of AA respectively, such that β\beta is accurate to within a multiplicative error of 22, and α\alpha is accurate to within AF\left\|A\right\|_{F} additive error. (Note that this may be done at the cost of increasing the sketching dimension by O(1)O(1).)

If GTAGG^{T}AG is not PSD then we automatically reject. Otherwise, we then consider the quantity

γ:=αλmin(GTAG)βklogk\gamma:=\frac{\alpha-\lambda_{\min}(G^{T}AG)}{\beta\sqrt{k}\log k} (15)

If γ\gamma is at most cpsdc_{\text{psd}} for some absolute constant cpsdc_{\text{psd}}, then the tester outputs True; otherwise it outputs False.

We first show that a large negative eigenvalue of AA causes the smallest sketched eigenvalue to be at most Tr(A)Ω(ϵk).\operatorname{Tr}(A)-\Omega(\epsilon k). On the other hand, when AA is PSD, we will show that λmin(GTAG)\lambda_{\min}(G^{T}AG) is substantially larger.
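Here is a minimal end-to-end sketch of the tester (ours; the trace and Frobenius-norm estimates are computed exactly as stand-ins for the O(1)-column sketched estimators, and the constant c_psd is illustrative, not the paper's tuned value):

    import numpy as np

    def psd_test_sketch(A, eps, c_psd=1.0, rng=None):
        rng = rng or np.random.default_rng()
        d = A.shape[0]
        k = max(8, int(np.ceil(eps ** -2 * np.log(1.0 / eps) ** 2)))
        G = rng.standard_normal((d, k))
        S = G.T @ A @ G
        S = (S + S.T) / 2
        lam_min = np.linalg.eigvalsh(S).min()
        if lam_min < 0:
            return False          # the sketch of a PSD matrix is always PSD
        alpha = np.trace(A)                  # stand-in for the trace estimate
        beta = np.linalg.norm(A, 'fro')      # stand-in for the norm estimate
        gamma = (alpha - lam_min) / (beta * np.sqrt(k) * np.log(k))
        return gamma <= c_psd     # small gamma is consistent with A being PSD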

4.1 Upper bound on λmin(GTAG)\lambda_{\min}(G^{T}AG)

We start with the following result on trace estimators which we will need below.

Proposition 21.

Let MM be a symmetric matrix with eigenvalues λ1,,λd,\lambda_{1},\ldots,\lambda_{d}, and let uu be a random unit vector with respect to the spherical measure. Then

Var(uTMu)=2d+2(λ12++λd2d(λ1++λd)2d2):=2d+2Var(λ1,,λd).\operatorname{Var}(u^{T}Mu)=\frac{2}{d+2}\left(\frac{\lambda_{1}^{2}+\ldots+\lambda_{d}^{2}}{d}-\frac{(\lambda_{1}+\ldots+\lambda_{d})^{2}}{d^{2}}\right):=\frac{2}{d+2}\operatorname{Var}(\lambda_{1},\ldots,\lambda_{d}).
Proof.

By the spectral theorem, it suffices to prove the result when MM is diagonal. Then

Var(uTMu)=𝔼(λ1u12++λdud2)2(𝔼(λ1u12++λdud2))2.\operatorname{Var}(u^{T}Mu)=\mathbb{E}(\lambda_{1}u_{1}^{2}+\ldots+\lambda_{d}u_{d}^{2})^{2}-\left(\mathbb{E}(\lambda_{1}u_{1}^{2}+\ldots+\lambda_{d}u_{d}^{2})\right)^{2}.

By Remark 12, we have 𝔼(ui2)=1d,\mathbb{E}(u_{i}^{2})=\frac{1}{d}, 𝔼(ui4)=3d(d+2)\mathbb{E}(u_{i}^{4})=\frac{3}{d(d+2)} and 𝔼(ui2uj2)=1d(d+2)\mathbb{E}(u_{i}^{2}u_{j}^{2})=\frac{1}{d(d+2)} for ij.i\neq j. The result follows by expanding using linearity of expectation, and then applying these facts. ∎
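A quick Monte Carlo check of this variance formula (ours, with M diagonal):

    import numpy as np

    rng = np.random.default_rng(2)
    d = 20
    lam = rng.standard_normal(d)            # eigenvalues of a diagonal M
    u = rng.standard_normal((200_000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    empirical = np.var(u ** 2 @ lam)        # Var(u^T M u) over random unit u
    predicted = 2 / (d + 2) * (np.mean(lam ** 2) - np.mean(lam) ** 2)
    print(empirical, predicted)             # agree to Monte Carlo accuracy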

The next two results will give an upper bound on the smallest eigenvalue of the Gaussian sketch. For the proof of Lemma 23 we will start with random orthogonal projections, from which the Gaussian result will quickly follow. We include a technical hypothesis that essentially enforces non-negativity of Tr(A).\operatorname{Tr}(A). We write the hypothesis in the form below simply to streamline the argument.

Lemma 22.

Suppose that AF=1\left\|A\right\|_{F}=1 and that vv is an eigenvector of AA with associated eigenvalue ϵ.-\epsilon. Let Πd×d\Pi\in\mathbb{R}^{d\times d} be a projection onto a random kk-dimensional subspace SS of d\mathbb{R}^{d}, sampled from the rotationally invariant measure. Also suppose that xTAx0x^{T}Ax\geq 0 with probability 0.9990.999 when xx is a random unit vector. Then

1Πv2(Πv)TA(Πv)1d(0.5ϵk+Tr(A)+O(1))\frac{1}{\left\|\Pi v\right\|^{2}}(\Pi v)^{T}A(\Pi v)\leq\frac{1}{d}(-0.5\epsilon k+\operatorname{Tr}(A)+O(1))

with probability at least 0.99exp(ck)0.99-\exp(-ck).

Proof.

Let u=ΠvΠv.u=\frac{\Pi v}{\left\|\Pi v\right\|}. The subspace SS was chosen randomly, so with probability at least 1exp(ck)1-\exp(-ck),

u,v2=Πv20.5kd.\left\langle u,v\right\rangle^{2}=\left\|\Pi v\right\|^{2}\geq 0.5\frac{k}{d}.

Let uu^{\prime} be the projection of uu onto the hyperplane vv^{\perp} orthogonal to v.v. Observe by symmetry that u/uu^{\prime}/\left\|u^{\prime}\right\| is distributed uniformly over the sphere in vv^{\perp}.

Let A=AϵvvTA^{\prime}=A-\epsilon vv^{T} be the matrix AA with the ϵ-\epsilon eigenvalue zeroed out. Then

uTAu0.5kdϵ+(u)TAu0.5kdϵ+(u/u)TA(u/u),u^{T}Au\leq-0.5\frac{k}{d}\epsilon+(u^{\prime})^{T}A^{\prime}u^{\prime}\leq-0.5\frac{k}{d}\epsilon+(u^{\prime}/\left\|u^{\prime}\right\|)^{T}A^{\prime}(u^{\prime}/\left\|u^{\prime}\right\|),

as long as (u)TAu0(u^{\prime})^{T}A^{\prime}u^{\prime}\geq 0, which holds with probability at least 0.9990.999 as a consequence of the hypothesis that xTAx0x^{T}Ax\geq 0 for a random unit vector x.x. The latter term is a trace estimator for 1dA\frac{1}{d}A^{\prime} with variance bounded by 2d2AF22d2\frac{2}{d^{2}}\left\|A^{\prime}\right\|_{F}^{2}\leq\frac{2}{d^{2}} (for example by Proposition 21). So with 0.9990.999 probability

(u/u)TA(u/u)1d(Tr(A)+O(1))=1d(Tr(A)+ϵ+O(1))1d(Tr(A)+O(1)),(u^{\prime}/\left\|u^{\prime}\right\|)^{T}A^{\prime}(u^{\prime}/\left\|u^{\prime}\right\|)\leq\frac{1}{d}(\operatorname{Tr}(A^{\prime})+O(1))=\frac{1}{d}(\operatorname{Tr}(A)+\epsilon+O(1))\leq\frac{1}{d}(\operatorname{Tr}(A)+O(1)),

and the result follows. ∎

In the following lemma, we introduce the technical assumption that k<cdk<cd. However this will be unimportant later, as any sketch with kcdk\geq cd might as well have sketching dimension dd, at which point the testing problem is trivial.

Lemma 23.

Suppose that Ad×dA\in\mathbb{R}^{d\times d} has an eigenvalue of ϵ-\epsilon, AF=1\left\|A\right\|_{F}=1, and Gd×kG\in\mathbb{R}^{d\times k} with k<dk<d has iid 𝒩(0,1)\mathcal{N}(0,1) entries. Also suppose that xTAx0x^{T}Ax\geq 0 with probability at least 0.9990.999 for a random unit vector x,x, and that k<cdk<cd for some absolute constant c.c. Then with probability at least 0.993exp(ck)0.99-3\exp(-ck),

λmin(GTAG)0.4ϵk+O(1)+Tr(A)+ckd|Tr(A)|.\lambda_{\min}(G^{T}AG)\leq-0.4\epsilon k+O(1)+\operatorname{Tr}(A)+c\frac{\sqrt{k}}{\sqrt{d}}\left|\operatorname{Tr}(A)\right|.
Proof.

Let ΠG=GG\Pi_{G}=GG^{\dagger} denote projection onto the image of GG.

Let vv be an eigenvector of AA with associated eigenvalue less than or equal to ϵ,-\epsilon, and set u=Gv/Gvu=G^{\dagger}v/\left\|G^{\dagger}v\right\|. We then have

λmin(GTAG)uT(GTAG)u=1Gv2(ΠGv)TA(ΠGv).\lambda_{\min}(G^{T}AG)\leq u^{T}(G^{T}AG)u=\frac{1}{\left\|G^{\dagger}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v).

We also have

ΠGv=GGvGopGv.\left\|\Pi_{G}v\right\|=\left\|GG^{\dagger}v\right\|\leq\left\|G\right\|_{\text{op}}\left\|G^{\dagger}v\right\|.

By Theorem 4.6.1 in [Ver18], Gopd+ck\left\|G\right\|_{\text{op}}\leq\sqrt{d}+c\sqrt{k} with probability at least 12exp(k)1-2\exp(-k). Conditional on this occurring,

GvΠGvd+ck,\left\|G^{\dagger}v\right\|\geq\frac{\left\|\Pi_{G}v\right\|}{\sqrt{d}+c\sqrt{k}},

from which it follows that

λmin(GTAG)(d+cdk)1ΠGv2(ΠGv)TA(ΠGv),\lambda_{\min}(G^{T}AG)\leq(d+c\sqrt{d}\sqrt{k})\frac{1}{\left\|\Pi_{G}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v),

as long as the quantity on the right-hand side is non-negative. If this quantity is negative, then we similarly have

λmin(GTAG)(dcdk)1ΠGv2(ΠGv)TA(ΠGv),\lambda_{\min}(G^{T}AG)\leq(d-c\sqrt{d}\sqrt{k})\frac{1}{\left\|\Pi_{G}v\right\|^{2}}(\Pi_{G}v)^{T}A(\Pi_{G}v),

using the analogous bound on the smallest singular value of G.G.

Since GG is Gaussian, the image of GG is distributed with respect to the rotationally invariant measure on kk-dimensional subspaces. Therefore Lemma 22 applies, and the result follows after collecting terms, and using the assumption that kcdk\leq cd in the negative case. ∎

4.2 Lower bound on λmin(GTAG)\lambda_{\min}(G^{T}AG)

We follow a standard protocol for bounding the extreme eigenvalues of a random matrix. We first show that uT(GTAG)uu^{T}(G^{T}AG)u is reasonably large for a fixed vector uu with high probability. Then by taking a union bound over an ϵ\epsilon-net we upgrade this to a uniform bound over the sphere.

We require two additional tricks. Our lower bound on uT(GTAG)uu^{T}(G^{T}AG)u arises from Bernstein’s inequality, which is hampered by the existence of large eigenvalues of AA. Therefore in order to get a guarantee that holds with high enough probability, we first prune the large eigenvalues of A.A.

Second, the mesh size of our ϵ\epsilon-net needs to be inversely proportional to the Lipschitz constant of xxT(GTAG)xx\mapsto x^{T}(G^{T}AG)x as xx ranges over the sphere. A priori, the Lipschitz constant might be as bad as GTAGop\left\|G^{T}AG\right\|_{\text{op}}, which is typically larger than d\sqrt{d}. This would ultimately give rise to an additional log(d)\log(d) factor in the final sketching dimension. However we show that the Lipschitz constant is in fact bounded by O(k)O(k) with good probability, avoiding the need for any dd dependence in the sketching dimension.

Proposition 24.

Let QQ be a symmetric matrix, and let xx and yy be unit vectors. Then

|xTQxyTQy|2(λmax(Q)λmin(Q))xy\left|x^{T}Qx-y^{T}Qy\right|\leq 2(\lambda_{\max}(Q)-\lambda_{\min}(Q))\left\|x-y\right\|
Proof.

We first reduce to the 22-dimensional case. Let WW be a 22-dimensional subspace passing through xx and yy. The largest and smallest eigenvalues of the restriction to WW of the quadratic form associated to QQ are bounded from above and below by λmax(Q)\lambda_{\max}(Q) and λmin(Q)\lambda_{\min}(Q) respectively. It therefore suffices to prove the result when QQ has dimension 2.2.

Since the result we wish to show is invariant under shifting QQ by multiples of the identity, it suffices to consider the case when λ2=0.\lambda_{2}=0. After these reductions, we have

xTQxyTQy=λ1(x12y12)=λ1(x1+y1)(x1y1).x^{T}Qx-y^{T}Qy=\lambda_{1}(x_{1}^{2}-y_{1}^{2})=\lambda_{1}(x_{1}+y_{1})(x_{1}-y_{1}).

Since xx and yy are unit vectors, |x1+y1|2\left|x_{1}+y_{1}\right|\leq 2 and |x1y1|xy\left|x_{1}-y_{1}\right|\leq\left\|x-y\right\| and the result follows. ∎

Lemma 25.

Let S=GTAGS=G^{T}AG where Gd×kG\in\mathbb{R}^{d\times k} has iid 𝒩(0,1)\mathcal{N}(0,1) entries and AF=1.\left\|A\right\|_{F}=1. Then

λmax(S)λmin(S)t\lambda_{\max}(S)-\lambda_{\min}(S)\leq t

with probability at least 14k(k+2)t21-\frac{4k(k+2)}{t^{2}}.

Proof.

Consider the random quantity α=uTGTAGu\alpha=u^{T}G^{T}AGu, where uu is a random unit vector in k\mathbb{R}^{k}, independent from GG. Note that GuGu is distributed as a standard Gaussian, so α\alpha is a trace estimator for AA with variance 22 [AT11].

On the other hand, one can also study the variance of α\alpha conditional on GG by using Proposition 21. Let EE be the event that λmax(S)λmin(S)t\lambda_{\max}(S)-\lambda_{\min}(S)\geq t. If EE occurs too often, then Var(α)\operatorname{Var}(\alpha) would be too large. Specifically, in the notation of Proposition 21 when EE occurs, we necessarily have Var(λ1(S),,λk(S))1k(t/2)2\operatorname{Var}(\lambda_{1}(S),\ldots,\lambda_{k}(S))\geq\frac{1}{k}(t/2)^{2}, so Var(α|E)t22k(k+2).\operatorname{Var}(\alpha|E)\geq\frac{t^{2}}{2k(k+2)}. Thus we have

2=Var(α)Pr(E)Var(α|E)Pr(E)t22k(k+2),2=\operatorname{Var}(\alpha)\geq\Pr(E)\operatorname{Var}(\alpha|E)\geq\Pr(E)\frac{t^{2}}{2k(k+2)}, (16)

and so Pr(E)4k(k+2)/t2\Pr(E)\leq 4k(k+2)/t^{2} as desired. ∎

Lemma 26.

Suppose that AA is PSD with AF=1,\left\|A\right\|_{F}=1, and that vv consists of iid 𝒩(0,1)\mathcal{N}(0,1) entries. Then for t2kt\geq 2\sqrt{k} we have

Pr(vTAvTr(A)t)exp(ck(tk)).\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t)\leq\exp(-c\sqrt{k}(t-\sqrt{k})).
Proof.

Write AkA_{k} for the matrix retaining only the kk largest eigenvalues of AA, and Ak=AAk.A_{-k}=A-A_{k}. We have

Pr(vTAvTr(A)t)\displaystyle\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t) Pr(vTAkvTr(A)t)\displaystyle\leq\Pr(v^{T}A_{-k}v\leq\operatorname{Tr}(A)-t) (17)
=Pr(vTAkvTr(Ak)(tTr(Ak)))\displaystyle=\Pr(v^{T}A_{-k}v\leq\operatorname{Tr}(A_{-k})-(t-\operatorname{Tr}(A_{k}))) (18)

Note that vTAkvv^{T}A_{-k}v has expectation Tr(Ak)\operatorname{Tr}(A_{-k}), so by Bernstein’s inequality (or Hanson-Wright) [Ver18],

Pr(vTAvTr(A)t)exp(cmin((tTr(Ak))2AkF2,tTr(Ak)λmax(Ak))),\Pr(v^{T}Av\leq\operatorname{Tr}(A)-t)\leq\exp\left(-c\min\left(\frac{(t-\operatorname{Tr}(A_{k}))^{2}}{\left\|A_{-k}\right\|_{F}^{2}},\frac{t-\operatorname{Tr}(A_{k})}{\lambda_{\max}(A_{-k})}\right)\right),

for tTr(Ak).t\geq\operatorname{Tr}(A_{k}).

Now note that AkFAF1\left\|A_{-k}\right\|_{F}\leq\left\|A\right\|_{F}\leq 1, and that λmax(Ak)1k,\lambda_{\max}(A_{-k})\leq\frac{1}{\sqrt{k}}, since AF=1.\left\|A\right\|_{F}=1. Additionally, Tr(Ak)kAkFk.\operatorname{Tr}(A_{k})\leq\sqrt{k}\left\|A_{k}\right\|_{F}\leq\sqrt{k}. These bounds imply that

(tTr(Ak))2AkF2(tk)2\frac{(t-\operatorname{Tr}(A_{k}))^{2}}{\left\|A_{-k}\right\|_{F}^{2}}\geq(t-\sqrt{k})^{2}

and

tTr(Ak)λmax(Ak)k(tk)\frac{t-\operatorname{Tr}(A_{k})}{\lambda_{\max}(A_{-k})}\geq\sqrt{k}(t-\sqrt{k})

for tTr(Ak).t\geq\operatorname{Tr}(A_{k}). When t>2kt>2\sqrt{k}, the latter expression is smaller, and the conclusion follows.

Theorem 27.

Suppose that AA is PSD with AF=1\left\|A\right\|_{F}=1, and that Gd×kG\in\mathbb{R}^{d\times k} has iid 𝒩(0,1)\mathcal{N}(0,1) entries and that k5k\geq 5. Then with at least 0.990.99 probability,

λmin(GTAG)Tr(A)cklog(k)\lambda_{\min}(G^{T}AG)\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k)

for some absolute constant cc.

Proof.

For any fixed unit vector uku\in\mathbb{R}^{k}, GuGu is distributed as a standard Gaussian, and so Lemma 26 applies. Therefore, for a suitable choice of the constant cc,

uT(GTAG)uTr(A)cklog(k)u^{T}(G^{T}AG)u\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k) (19)

with probability at least 1exp(10klogk).1-\exp(-10k\log k).

Let 𝒩\mathcal{N} be a net for the sphere in k\mathbb{R}^{k} with mesh size 1/k,1/k, which can be taken to have at most (3k)k(3k)^{k} elements [Ver18]. By taking a union bound, (19) holds over 𝒩\mathcal{N} with probability at least

1(3k)kexp(10klogk)1exp(k)1-(3k)^{k}\exp(-10k\log k)\geq 1-\exp(-k)

for k2.k\geq 2.

By choosing t=100kt=100k in Lemma 25, and applying Proposition 24, we get that

|xT(GTAG)xyT(GTAG)y|200kxy\left|x^{T}(G^{T}AG)x-y^{T}(G^{T}AG)y\right|\leq 200k\left\|x-y\right\|

with probability at least 0.9990.999. Since 𝒩\mathcal{N} has mesh size 1/k1/k, we have that

uT(GTAG)uTr(A)cklog(k)O(1)u^{T}(G^{T}AG)u\geq\operatorname{Tr}(A)-c\sqrt{k}\log(k)-O(1)

for all unit vectors uu in k\mathbb{R}^{k} with probability at least 0.999exp(k).0.999-\exp(-k).

As a consequence of Theorem 27 and Lemma 23 we obtain our main result.

Theorem 28.

There is a bilinear sketch GTAGG^{T}AG with sketching dimension k=O(1ϵ2log21ϵ)k=O(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}) that yields a two-sided 2\ell_{2}-tester that is correct with at least 0.90.9 probability.

Proof.

If λmin(GTAG)<0\lambda_{\min}(G^{T}AG)<0 then we automatically reject. Otherwise we first use O(1)O(1) columns of the sketching matrices to compute an estimate α\alpha of Tr(A)\operatorname{Tr}(A) to within an additive error of AF\left\|A\right\|_{F} with 0.010.01 failure probability. We then use another O(1)O(1) columns of the sketching matrices to construct an approximation β\beta of AF\left\|A\right\|_{F} with 12AFβ2AF\frac{1}{2}\left\|A\right\|_{F}\leq\beta\leq 2\left\|A\right\|_{F} with 0.010.01 failure probability (see for example [MSW19]).

Now consider the quantity

γ=αλmin(GTAG)βklogk.\gamma=\frac{\alpha-\lambda_{\min}(G^{T}AG)}{\beta\sqrt{k}\log k}.

If AA is PSD, then by Theorem 27,

λmin(GTAG)Tr(A)cklogkAF,\lambda_{\min}(G^{T}AG)\geq\operatorname{Tr}(A)-c\sqrt{k}\log k\left\|A\right\|_{F},

which implies that

γcpsd,\gamma\leq c_{\text{psd}},

for some absolute constant cpsd.c_{\text{psd}}.

On the other hand, if AA has a negative eigenvalue less than or equal to ϵ-\epsilon, then by Lemma 23

λmin(GTAG)0.4ϵkAF+(1+ckd)Tr(A)+O(1)AF,\lambda_{\min}(G^{T}AG)\leq-0.4\epsilon k\left\|A\right\|_{F}+\left(1+c\frac{\sqrt{k}}{\sqrt{d}}\right)\operatorname{Tr}(A)+O(1)\left\|A\right\|_{F},

which implies that

γcfar(ϵkc)/logk,\gamma\geq c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k,

for some absolute constant cfar.c_{\text{far}}.

Finally by taking k=Θ(1ϵ2log21ϵ)k=\Theta(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}), we have cpsd<cfar(ϵkc)/logkc_{\text{psd}}<c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k, which implies that the tester is correct if it outputs True precisely when γcpsd.\gamma\leq c_{\text{psd}}.

Note that this result immediately gives a non-adaptive vector-matrix-vector tester which makes O~(1/ϵ4)\widetilde{O}(1/\epsilon^{4}) queries.

4.3 Application to adaptive vector-matrix-vector queries

By combining our bilinear sketch with Theorem 2 we achieve tight bounds for adaptive queries.

Theorem 29.

There is a two-sided adaptive 2\ell_{2}-tester in the vector-matrix-vector model, which makes O~(1/ϵ2)\widetilde{O}(1/\epsilon^{2}) queries.

Proof.

To handle the technical condition in Lemma 23, we first compute xTAxx^{T}Ax for a constant number of independent Gaussian vectors x.x. If xTAxx^{T}Ax is ever negative, then we automatically reject.

We showed in the proof of Theorem 28 that with 0.90.9 probability, γcpsd\gamma\leq c_{psd} if AA is PSD, and γcfar(ϵkc)/logk:=Cfar(k)\gamma\geq c_{\text{far}}(\epsilon\sqrt{k}-c)/\log k:=C_{\text{far}}(k) if AA is ϵ\epsilon-far from PSD. By choosing some k=O(1ϵ2log21ϵ)k=O(\frac{1}{\epsilon^{2}}\log^{2}\frac{1}{\epsilon}) we can arrange for Cfar(k)cpsd1,C_{\text{far}}(k)-c_{\text{psd}}\geq 1, and also for Cfar(k)=Θ(1).C_{\text{far}}(k)=\Theta(1).

Next we compute estimates α\alpha and β\beta of Tr(A)\operatorname{Tr}(A) and AF\left\|A\right\|_{F} as above, and (implicitly) form the matrix

Γ=GTAGαIkβklogk+Cfar(k)IkIk.\Gamma=\frac{G^{T}AG-\alpha I_{k}}{\beta\sqrt{k}\log k}+C_{\text{far}}(k)I_{k}-I_{k}.

If AA is PSD, then with very good probability,

λmin(Γ)=γ+Cfar(k)1cpsd+Cfar(k)10.\lambda_{\text{min}}(\Gamma)=-\gamma+C_{\text{far}}(k)-1\geq-c_{\text{psd}}+C_{\text{far}}(k)-1\geq 0.

Similarly, if AA is ϵ\epsilon-far from PSD, then

λmin(Γ)=γ+Cfar(k)1Cfar(k)+Cfar(k)1=1.\lambda_{\text{min}}(\Gamma)=-\gamma+C_{\text{far}}(k)-1\leq-C_{\text{far}}(k)+C_{\text{far}}(k)-1=-1.

Thus it suffices to distinguish Γ\Gamma being PSD from Γ\Gamma having a negative eigenvalue less than or equal to 1.-1.

For this we will utilize our adaptive 1\ell_{1}-tester, so we must bound Γ1.\left\|\Gamma\right\|_{1}. Note that Tr(GTAG)\operatorname{Tr}(G^{T}AG) is an estimator for kTr(A)k\operatorname{Tr}(A) with variance O(k)AF2O(k)\left\|A\right\|_{F}^{2} [AN13]. Therefore Tr(GTAG)=k(Tr(A)±O(1)AF).\operatorname{Tr}(G^{T}AG)=k(\operatorname{Tr}(A)\pm O(1)\left\|A\right\|_{F}). Define M=1β(GTAGαIk)M=\frac{1}{\beta}(G^{T}AG-\alpha I_{k}), so that Tr(M)=O(k).\operatorname{Tr}(M)=O(k). The negative eigenvalues of MM sum to at most k|λmin(M)|k\left|\lambda_{\min}(M)\right| in magnitude, and so the bound on Tr(M)\operatorname{Tr}(M) implies that M12k|λmin(M)|+O(k).\left\|M\right\|_{1}\leq 2k\left|\lambda_{\min}(M)\right|+O(k). Write Γ=1klogkM+Cfar(k)II\Gamma=\frac{1}{\sqrt{k}\log k}M+C_{\text{far}}(k)I-I, so that Γ11klogkM1+O(k).\left\|\Gamma\right\|_{1}\leq\frac{1}{\sqrt{k}\log k}\left\|M\right\|_{1}+O(k). Note that |λmin(M)|klogk|λmin(Γ)|+O(klogk).\left|\lambda_{\min}(M)\right|\leq\sqrt{k}\log k\left|\lambda_{\min}(\Gamma)\right|+O(\sqrt{k}\log k). Therefore Γ12k|λmin(Γ)|+O(k).\left\|\Gamma\right\|_{1}\leq 2k\left|\lambda_{\min}(\Gamma)\right|+O(k). From this we have

1|λmin(Γ)|Γ12k(|λmin(Γ)|+O(1)|λmin(Γ)|)+O(k)|λmin(Γ)|O(k)\frac{1}{\left|\lambda_{\min}(\Gamma)\right|}\left\|\Gamma\right\|_{1}\leq 2k\left(\frac{\left|\lambda_{\min}(\Gamma)\right|+O(1)}{\left|\lambda_{\min}(\Gamma)\right|}\right)+\frac{O(k)}{\left|\lambda_{\min}(\Gamma)\right|}\leq O(k)

as long as |λmin(Γ)|Ω(1)|\lambda_{\min}(\Gamma)|\geq\Omega(1), which it is by assumption.

Therefore Theorem 2 gives an adaptive vector-matrix-vector tester for Γ\Gamma which requires only O~(k)=O~(1ϵ2)\widetilde{O}(k)=\widetilde{O}(\frac{1}{\epsilon^{2}}) queries.

As a consequence we also obtain a two-sided p\ell_{p}-tester for all p2.p\geq 2.

Corollary 30.

For p2p\geq 2, there is a two-sided adaptive p\ell_{p}-tester in the vector-matrix-vector model, which makes O~(1/ϵ2)d12/p\widetilde{O}(1/\epsilon^{2})d^{1-2/p} queries.

Proof.

Apply Theorem 29 along with the bound Apd1p12AF.\left\|A\right\|_{p}\geq d^{\frac{1}{p}-\frac{1}{2}}\left\|A\right\|_{F}.

4.4 Lower bounds for two-sided testers

Our lower bounds for two-sided testers come from the spiked Gaussian model introduced in [LW16]. As before, our adaptive lower bounds will come as a consequence of the corresponding non-adaptive bounds.

Theorem 31.

A two-sided p\ell_{p}-tester that makes non-adaptive vector-matrix-vector queries requires at least

  • Ω(1ϵ2p)\Omega(\frac{1}{\epsilon^{2p}}) queries for 1p21\leq p\leq 2

  • Ω(1ϵ4d24/p)\Omega(\frac{1}{\epsilon^{4}}d^{2-4/p}) queries for 2<p<2<p<\infty as long as dd can be taken to be Ω(1/ϵp).\Omega(1/\epsilon^{p}).

  • Ω(d2)\Omega(d^{2}) queries for p=.p=\infty.

Proof.

First, take GG to be a d×dd\times d matrix with 𝒩(0,1)\mathcal{N}(0,1) entries, where dd will be chosen later as a function of ϵ\epsilon and p.p. Also let G~=G+suvT\widetilde{G}=G+suv^{T} where uu and vv have 𝒩(0,1)\mathcal{N}(0,1) entries, and ss is to be chosen later. We will show that a PSD-tester can be used to distinguish GG and G~\widetilde{G}, while this is hard for any algorithm that uses only a linear sketch.

Recall that GG has spectral norm at most cdc\sqrt{d} with probability at least 12ed,1-2e^{-d}, where cc is an absolute constant. (We will use cc throughout to indicate absolute constants that we do not track – it may have different values between uses, even within the same equation.) Set

Gsym=[0GGT0]G_{\text{sym}}=\begin{bmatrix}0&G\\ G^{T}&0\end{bmatrix} (20)

and define G~sym\widetilde{G}_{\text{sym}} similarly. Note that the eigenvalues of GsymG_{\text{sym}} are precisely {±σi}\{\pm\sigma_{i}\} where the σi\sigma_{i} are singular values of G.G. Therefore Gsym+cdIG_{\text{sym}}+c\sqrt{d}I is PSD with high probability.

On the other hand, uvTcd\left\|uv^{T}\right\|\geq cd with high probability so

G~suvTGcsdcd,\left\|\widetilde{G}\right\|\geq s\left\|uv^{T}\right\|-\left\|G\right\|\geq csd-c\sqrt{d}, (21)

which implies that G~sym+cdI\widetilde{G}_{\text{sym}}+c\sqrt{d}I has a negative eigenvalue with magnitude at least csdcdcd=csdcd.csd-c\sqrt{d}-c\sqrt{d}=csd-c\sqrt{d}.

We also have that G~sympcdd1/p,\left\|\widetilde{G}_{\text{sym}}\right\|_{p}\leq c\sqrt{d}d^{1/p}, since the operator norm of GG is bounded by cdc\sqrt{d} with high probability. Hence if

csdddd1/p=csd1d1/pϵ,c\frac{sd-\sqrt{d}}{\sqrt{d}d^{1/p}}=c\frac{s\sqrt{d}-1}{d^{1/p}}\geq\epsilon, (22)

then a two-sided PSD-tester can distinguish between GG and G~\widetilde{G} with constant probability of failure.

On the other hand, Theorem 4 in [LW16] implies that any sketch that distinguishes these distributions with constant probability must have sketching dimension at least c/s4.c/s^{4}.

It remains to choose values of ss and dd for which the inequality in equation (22) holds. When 1p21\leq p\leq 2, we take d=Θ(ϵp)d=\Theta(\epsilon^{-p}) and s=O(ϵp/2)s=O(\epsilon^{p/2}) giving a lower bound of Ω(1/ϵ2p).\Omega(1/\epsilon^{2p}). When 2<p<2<p<\infty, we take d=Ω(1/ϵp)d=\Omega(1/\epsilon^{p}) and s=O(ϵd1/p1/2)s=O(\epsilon d^{1/p-1/2}) giving a lower bound of Ω(1ϵ4d24/p).\Omega(\frac{1}{\epsilon^{4}}d^{2-4/p}). Finally, when p=p=\infty we take s=O(1/d)s=O(1/\sqrt{d}) giving a lower bound of Ω(d2).\Omega(d^{2}).

Remark 32.

The argument above applies equally well to arbitrary linear sketches, of which a series of non-adaptive vector-matrix-vector queries is a special case.

Corollary 33.

A two-sided adaptive p\ell_{p}-tester in the vector-matrix-vector model requires at least

  • Ω(1ϵp)\Omega(\frac{1}{\epsilon^{p}}) queries for 1p21\leq p\leq 2

  • Ω(1ϵ2d12/p)\Omega(\frac{1}{\epsilon^{2}}d^{1-2/p}) queries for 2<p<2<p<\infty as long as dd can be taken to be Ω(1/ϵp).\Omega(1/\epsilon^{p}).

  • Ω(d)\Omega(d) queries for p=.p=\infty.

Proof.

Apply Theorem 31 along with Lemma 14. ∎

For adaptive measurements, we supply a second proof via communication complexity which has the advantage of applying to general linear measurements, albeit at the cost of an additional bit complexity term.

Proposition 34.

Let p[1,)p\in[1,\infty), ϵ<12\epsilon<\frac{1}{2} and d(p/ϵ)pd\geq(p/\epsilon)^{p}. An adaptive two-sided p\ell_{p}-tester taking general linear measurements Mi,A\left\langle M_{i},A\right\rangle of AA, where each MiM_{i} has integer entries in (2b1,2b1](-2^{b-1},2^{b-1}], must make at least cb+logd1ϵ2d12/p\frac{c}{b+\log d}\frac{1}{\epsilon^{2}}d^{1-2/p} queries.

Proof.

We reduce from the multiplayer set disjointness problem [KPW21]. Let the kk players have sets S1,,Sk[d]S_{1},\ldots,S_{k}\subseteq[d] which either are (i) all pairwise disjoint or (ii) all share precisely one common element. Distinguishing between (i) and (ii) with 2/32/3 probability in the blackboard model of communication requires Ω(d/k)\Omega(d/k) bits. We will choose k=max(4ϵd1/p,4p)=4ϵd1/p.k=\left\lceil\max(4\epsilon d^{1/p},4p)\right\rceil=\lceil 4\epsilon d^{1/p}\rceil.

For each ii let χi\chi_{i} be the characteristic vector of SiS_{i} and let Ai=diag(χi).A_{i}=\operatorname{diag}(\chi_{i}). Consider the matrix A:=Idi=1kAi.A:=I_{d}-\sum_{i=1}^{k}A_{i}.

In situation (i), AA is PSD. In situation (ii), Appkp+d\left\|A\right\|_{p}^{p}\leq k^{p}+d and λmin(A)=(k1).\lambda_{\min}(A)=-(k-1). We have

|λmin(A)|p=(k1)p=kp(11k)pkp(1pk)34kp\left|\lambda_{\min}(A)\right|^{p}=(k-1)^{p}=k^{p}\left(1-\frac{1}{k}\right)^{p}\geq k^{p}\left(1-\frac{p}{k}\right)\geq\frac{3}{4}k^{p}

and

ϵpAppϵp(kp+d)=ϵpkp+ϵpd12kp+ϵpd34kp.\epsilon^{p}\left\|A\right\|_{p}^{p}\leq\epsilon^{p}(k^{p}+d)=\epsilon^{p}k^{p}+\epsilon^{p}d\leq\frac{1}{2}k^{p}+\epsilon^{p}d\leq\frac{3}{4}k^{p}.

Given query access to AA, an p\ell_{p}-tester can therefore distinguish between (i) and (ii) with 2/32/3 probability. Note that a single linear measurement M,A\left\langle M,A\right\rangle may be simulated in the blackboard model using O(k(b+logd))O(k(b+\log d)) bits; each player simply computes and communicates M,Ai\left\langle M,A_{i}\right\rangle, an integer of magnitude at most d2b1d2^{b-1}, and the players add the resulting measurements. The tester must therefore make at least Ω(1k(b+logd)dk)\Omega(\frac{1}{k(b+\log d)}\cdot\frac{d}{k}) queries to solve the PSD-testing problem. ∎
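For concreteness, a toy instantiation of the reduction (ours) confirms the two spectral cases:

    import numpy as np

    def disjointness_matrix(sets, d):
        # A = I_d - sum_i diag(chi_i), as in the reduction above.
        A = np.eye(d)
        for S in sets:
            idx = list(S)
            A[idx, idx] -= 1.0
        return A

    d = 12
    disjoint = [{0, 1}, {2, 3}, {4}]        # case (i): pairwise disjoint
    shared = [{0, 1}, {0, 3}, {0, 4}]       # case (ii): common element 0
    print(np.linalg.eigvalsh(disjointness_matrix(disjoint, d)).min())  # 0.0 (PSD)
    print(np.linalg.eigvalsh(disjointness_matrix(shared, d)).min())    # -2 = -(k-1)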

5 Spectrum Estimation

We make use of the following result, which is Lemma 11 of [CW17a] specialized to our setting.

Lemma 35.

For a symmetric matrix Ad×dA\in\mathbb{R}^{d\times d}, there is a distribution over an oblivious sketching matrix Rd×mR\in\mathbb{R}^{d\times m} with m=O(kϵ)m=O(\frac{k}{\epsilon}) so that with at least 0.90.9 probability,

minYrankk,PSD(AR)Y(AR)TAF2(1+ϵ)Ak,+AF2,\min_{Y^{*}\in\,\textup{rank}\,k,\textup{PSD}}\left\|(AR)Y^{*}(AR)^{T}-A\right\|_{F}^{2}\leq(1+\epsilon)\left\|A_{k,+}-A\right\|_{F}^{2}, (23)

where Ak,+A_{k,+} is the optimal rank-kk PSD approximation to AA in Frobenius norm.

Remark 36.

In our setting one can simply take RR to be Gaussian since the guarantee above must hold when AA is drawn from a rotationally invariant distribution. In many situations, structured or sparse matrices are useful, but we do not need this here.

We also recall the notion of an affine embedding [CW17].

Definition 37.

SS is an affine embedding for matrices AA and BB if for all matrices XX of the appropriate dimensions, we have

S(AXB)F2=(1±ϵ)AXBF2.\left\|S(AX-B)\right\|_{F}^{2}=(1\pm\epsilon)\left\|AX-B\right\|_{F}^{2}. (24)

We also recall that when AA is promised to have rank at most rr, there is a distribution over SS with O(ϵ2r)O(\epsilon^{-2}r) rows such that (24) holds with constant probability for any choice of AA and BB [CW17].

Lemma 38.

There is an algorithm which makes O(k2ϵ6log1δ)O(\frac{k^{2}}{\epsilon^{6}}\log\frac{1}{\delta}) vector-matrix-vector queries to AA and with at least 1δ1-\delta probability outputs an approximation of Ak,+F2\left\|A_{k,+}\right\|_{F}^{2}, accurate to within ϵAF2\epsilon\left\|A\right\|_{F}^{2} additive error.

Proof.

We run two subroutines in parallel.

Subroutine 1. Approximate Ak,+AF2\left\|A_{k,+}-A\right\|_{F}^{2} up to O(ϵ)O(\epsilon) multiplicative error.

Our algorithm first draws affine embedding matrices S1S_{1} and S2S_{2} for r=k/ϵr=k/\epsilon, and with ϵ\epsilon distortion, each with O(kϵ3)O(\frac{k}{\epsilon^{3}}) rows. We also draw a matrix RR as in Lemma 35 with m=O(kϵ)m=O(\frac{k}{\epsilon}) columns.

We then compute S1ARS_{1}AR and S2ARS_{2}AR, each requiring k2ϵ4\frac{k^{2}}{\epsilon^{4}} vector-matrix-vector queries, and compute S1AS2TS_{1}AS_{2}^{T} requiring k2ϵ6\frac{k^{2}}{\epsilon^{6}} queries.

Let YkY_{k} be arbitrary with the appropriate dimensions (later we will optimize YkY_{k} over rank kk PSD matrices). By using the affine embedding property along with the fact that RR has rank at most kϵ\frac{k}{\epsilon}, we have

(S1AR)Yk(S2AR)TS1AS2TF2\displaystyle\left\|(S_{1}AR)Y_{k}(S_{2}AR)^{T}-S_{1}AS_{2}^{T}\right\|_{F}^{2} =(1±ϵ)ARYk(S2AR)TAS2TF2\displaystyle=(1\pm\epsilon)\left\|ARY_{k}(S_{2}AR)^{T}-AS_{2}^{T}\right\|_{F}^{2}
=(1±ϵ)S2ARYkRTAS2AF2\displaystyle=(1\pm\epsilon)\left\|S_{2}ARY_{k}R^{T}A-S_{2}A\right\|_{F}^{2}
=(1±3ϵ)ARYkRTAAF2.\displaystyle=(1\pm 3\epsilon)\left\|ARY_{k}R^{T}A-A\right\|_{F}^{2}.

As a consequence of this, and the property held by RR, we have

minrk(Yk)k,YkPSD(S1AR)Yk(S2AR)TS1AS2TF2\displaystyle\min_{\operatorname{rk}(Y_{k})\leq k,Y_{k}\,\text{PSD}}\left\|(S_{1}AR)Y_{k}(S_{2}AR)^{T}-S_{1}AS_{2}^{T}\right\|_{F}^{2} =(1±3ϵ)minYkARYkRTAAF2\displaystyle=(1\pm 3\epsilon)\min_{Y_{k}}\left\|ARY_{k}R^{T}A-A\right\|_{F}^{2} (25)
=(1±7ϵ)Ak,+AF2.\displaystyle=(1\pm 7\epsilon)\left\|A_{k,+}-A\right\|_{F}^{2}. (26)

Thus by computing the quantity on the left-hand side above, our algorithm computes an O(ϵ)O(\epsilon) multiplicative approximation using O(k2/ϵ6)O(k^{2}/\epsilon^{6}) vector-matrix-vector queries.

Subroutine 2. Approximate AF2\left\|A\right\|_{F}^{2} up to O(ϵ)O(\epsilon) multiplicative error.

We simply apply Theorem 2.2 of [MSW19] with q=2,q=2, and note that the entries of the sketch correspond to vector-matrix-vector products. By their bound we require O(ϵ2log(1/ϵ))O(\epsilon^{-2}\log(1/\epsilon)) vector-matrix-vector queries.

Since Ak,+F2=AF2Ak,+AF2\left\|A_{k,+}\right\|_{F}^{2}=\left\|A\right\|_{F}^{2}-\left\|A_{k,+}-A\right\|_{F}^{2}, we obtain an additive O(ϵ)AF2O(\epsilon)\left\|A\right\|_{F}^{2} approximation to Ak,+F2\left\|A_{k,+}\right\|_{F}^{2} by running the two subroutines above and subtracting their results.

Finally, by repeating the above procedure O(log1δ)O(\log\frac{1}{\delta}) times in parallel and taking the median of the trials, we obtain a failure probability of at most δ.\delta. ∎

The matrices S1ARS_{1}AR and S2ARS_{2}AR in Subroutine 1 each have rank at most k/ϵk/\epsilon whereas S1AS2TS_{1}AS_{2}^{T} has dimensions k/ϵ3×k/ϵ3.k/\epsilon^{3}\times k/\epsilon^{3}. The matrix S1AS2TS_{1}AS_{2}^{T} therefore contains a large amount of data that will not play a role when optimizing over YkY_{k}. If S1ARS_{1}AR and S2ARS_{2}AR were known ahead of time, then we could choose to compute only the portion of S1AS2TS_{1}AS_{2}^{T} that is relevant to the optimization step, and simply estimate the Frobenius error incurred by the rest. This allows us to construct a slightly more efficient two-pass protocol.

Proposition 39.

By using a single round of adaptivity, the guarantee of Lemma 38 may be achieved using O(k2ϵ4log1δ)O(\frac{k^{2}}{\epsilon^{4}}\log\frac{1}{\delta}) vector-matrix-vector queries.

Proof.

As described above, we modify Subroutine 1. Write MiM_{i} for SiARS_{i}AR and QQ for S1AS2T.S_{1}AS_{2}^{T}. Instead of computing M1M_{1}, M2,M_{2}, and QQ at once, we instead compute M1M_{1} and M2M_{2} first using k2/ϵ4k^{2}/\epsilon^{4} vector-matrix-vector queries.

We wish to estimate minYkM1YkM2TQF2\min_{Y_{k}}\left\|M_{1}Y_{k}M_{2}^{T}-Q\right\|_{F}^{2}, where the minimum is over PSD matrices YkY_{k} of rank at most kk. Let Πi\Pi_{i} denote orthogonal projection onto the image of MiM_{i}, and set Πi=IΠi.\Pi_{i}^{\perp}=I-\Pi_{i}. Then for fixed YY, we use the Pythagorean theorem to write

M1YM2TQF2\displaystyle\left\|M_{1}YM_{2}^{T}-Q\right\|_{F}^{2} =Π1M1YM2TΠ2QF2\displaystyle=\left\|\Pi_{1}M_{1}YM_{2}^{T}\Pi_{2}-Q\right\|_{F}^{2} (27)
=Π1(M1YM2TQ)Π2Π1QΠ2Π1QΠ2Π1QΠ2F2\displaystyle=\left\|\Pi_{1}(M_{1}YM_{2}^{T}-Q)\Pi_{2}-\Pi_{1}^{\perp}Q\Pi_{2}-\Pi_{1}Q\Pi_{2}^{\perp}-\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2} (28)
=Π1(M1YM2TQ)Π2F2+Π1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2\displaystyle=\left\|\Pi_{1}(M_{1}YM_{2}^{T}-Q)\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}Q\Pi_{2}^{\perp}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2} (29)
=M1YM2TΠ1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2+Π1QΠ2F2.\displaystyle=\left\|M_{1}YM_{2}^{T}-\Pi_{1}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}\right\|_{F}^{2}+\left\|\Pi_{1}Q\Pi_{2}^{\perp}\right\|_{F}^{2}+\left\|\Pi_{1}^{\perp}Q\Pi_{2}^{\perp}\right\|_{F}^{2}. (30)

Note that each of the last three terms can be estimated to within O(ϵ)O(\epsilon) multiplicative error using Subroutine 2, since a vector-matrix-vector query to one of these matrices may be simulated with a single query to AA. Also since each MiM_{i} has rank O(k/ϵ)O(k/\epsilon), the Πi\Pi_{i}’s are projections onto O(k/ϵ)O(k/\epsilon) dimensional subspaces. Since the Πi\Pi_{i}’s are known to the algorithm, we may compute Π1QΠ2\Pi_{1}Q\Pi_{2} explicitly using k2/ϵ2k^{2}/\epsilon^{2} vector-matrix-vector queries, as it suffices to query Π1QΠ2\Pi_{1}Q\Pi_{2} over the Cartesian product of bases for the images of Π1\Pi_{1} and Π2.\Pi_{2}. By optimizing the first term over YY, we thus obtain an O(ϵ)O(\epsilon) multiplicative approximation to minYkM1YkM2TQF2\min_{Y_{k}}\left\|M_{1}Y_{k}M_{2}^{T}-Q\right\|_{F}^{2} as desired. This gives a version of Subroutine 1 that makes O(k2/ϵ4)O(k^{2}/\epsilon^{4}) queries. ∎
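The orthogonal decomposition in (27)-(30) is easy to confirm numerically; the following check (ours) uses small random matrices and exact projections:

    import numpy as np

    rng = np.random.default_rng(3)
    M1, M2 = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
    Q, Y = rng.standard_normal((9, 9)), rng.standard_normal((4, 4))
    P1 = M1 @ np.linalg.pinv(M1)            # projection onto the image of M1
    P2 = M2 @ np.linalg.pinv(M2)
    I = np.eye(9)
    lhs = np.linalg.norm(M1 @ Y @ M2.T - Q, 'fro') ** 2
    rhs = (np.linalg.norm(M1 @ Y @ M2.T - P1 @ Q @ P2, 'fro') ** 2
           + np.linalg.norm((I - P1) @ Q @ P2, 'fro') ** 2
           + np.linalg.norm(P1 @ Q @ (I - P2), 'fro') ** 2
           + np.linalg.norm((I - P1) @ Q @ (I - P2), 'fro') ** 2)
    assert np.isclose(lhs, rhs)             # the four pieces are orthogonal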

We note that we immediately obtain a poly(1/ϵ)\text{poly}(1/\epsilon) query 2\ell_{2}-tester by applying Lemma 38 to approximate A1,F2.\left\|A_{1,-}\right\|_{F}^{2}. However this yields a worse ϵ\epsilon dependence than Theorem 28. Perhaps more interestingly, these techniques also give a way to approximate the top kk (in magnitude) eigenvalues of AA while preserving their signs. We note a minor caveat. If λk\lambda_{k} and λk+1\lambda_{k+1} are very close in magnitude, but have opposite signs, then we cannot guarantee that we approximate λk\lambda_{k}. Therefore in the statement below, we only promise to approximate eigenvalues with magnitude at least |λk|+2ϵ|\lambda_{k}|+2\epsilon.

Theorem 40.

Let λ1,λ2,\lambda_{1},\lambda_{2},\ldots be the (signed) eigenvalues of AA sorted in decreasing order of magnitude.

There is an algorithm that makes O(k2ϵ12logk)O(\frac{k^{2}}{\epsilon^{12}}\log k) non-adaptive vector-matrix-vector queries to AA, and with probability at least 0.90.9, outputs λ~1,,λ~k\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{k} such that

(i) There exists a permutation σ\sigma on [k][k] so that for all ii with |λi||λk|+2ϵ|\lambda_{i}|\geq|\lambda_{k}|+2\epsilon, |λ~σ(i)λi|ϵAF|\tilde{\lambda}_{\sigma(i)}-\lambda_{i}|\leq\epsilon\left\|A\right\|_{F}

(ii) For all ii, there exists jj with |λj||λk|ϵ|\lambda_{j}|\geq|\lambda_{k}|-\epsilon and |λ~iλj|ϵAF|\tilde{\lambda}_{i}-\lambda_{j}|\leq\epsilon\left\|A\right\|_{F}

With one additional round of adaptivity the number of measurements can be reduced to O(k2ϵ8logk).O(\frac{k^{2}}{\epsilon^{8}}\log k).

Proof.

We set δ=120k\delta=\frac{1}{20k} in Lemma 38 and use it to approximate A1,+F2,,Ak,+F2\left\|A_{1,+}\right\|_{F}^{2},\ldots,\left\|A_{k,+}\right\|_{F}^{2}, along with A1,F2,,Ak,F2\left\|A_{1,-}\right\|_{F}^{2},\ldots,\left\|A_{k,-}\right\|_{F}^{2}, each to within (ϵ2/2)AF2(\epsilon^{2}/2)\left\|A\right\|_{F}^{2} additive error. Note that we may use the same sketching matrices for each of these 2k2k tasks, and then take a union bound to obtain a failure probability of at most 0.10.1. Thus we require only O(k2ϵ12logk)O(\frac{k^{2}}{\epsilon^{12}}\log k) queries in total. With an additional round of adaptivity, Proposition 39 reduces this bound to O(k2ϵ8logk).O(\frac{k^{2}}{\epsilon^{8}}\log k).

Let λi,+\lambda_{i,+} be the ithi^{\text{th}} largest positive eigenvalue of AA if it exists, and 0 otherwise. Define λi,\lambda_{i,-} similarly. Note that λi,+2=Ai,+F2Ai1,+F2\lambda_{i,+}^{2}=\left\|A_{i,+}\right\|_{F}^{2}-\left\|A_{i-1,+}\right\|_{F}^{2} for i2i\geq 2, and that λ1,+2=A1,+F2\lambda_{1,+}^{2}=\left\|A_{1,+}\right\|_{F}^{2}. This allows us to compute approximations λ~i,+0\tilde{\lambda}_{i,+}\geq 0 such that |λ~i,+2λi,+2|ϵ2AF2|\tilde{\lambda}_{i,+}^{2}-\lambda_{i,+}^{2}|\leq\epsilon^{2}\left\|A\right\|_{F}^{2}, and similarly for the λi,\lambda_{i,-}’s with λ~i,0\tilde{\lambda}_{i,-}\leq 0. Note that this bound implies |λ~i,+λi,+|ϵAF.|\tilde{\lambda}_{i,+}-\lambda_{i,+}|\leq\epsilon\left\|A\right\|_{F}.

Our algorithm then simply returns the kk largest magnitude elements of {λ~1,+,,λ~k,+,λ~1,,,λ~k,}.\{\tilde{\lambda}_{1,+},\ldots,\tilde{\lambda}_{k,+},\tilde{\lambda}_{1,-},\ldots,\tilde{\lambda}_{k,-}\}.
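The recovery step can be phrased as the following small routine (our illustration; it consumes the approximate prefix norms ||A_{i,+}||_F^2 and ||A_{i,-}||_F^2 and is agnostic to how they were estimated):

    import numpy as np

    def signed_topk(pos_prefix2, neg_prefix2, k):
        # pos_prefix2[i] approximates ||A_{i+1,+}||_F^2; successive differences
        # recover the squared eigenvalues, and clipping guards against noise
        # making a difference slightly negative.
        pos = np.sqrt(np.maximum(np.diff(np.concatenate(([0.0], pos_prefix2))), 0.0))
        neg = -np.sqrt(np.maximum(np.diff(np.concatenate(([0.0], neg_prefix2))), 0.0))
        cand = np.concatenate((pos, neg))
        return cand[np.argsort(-np.abs(cand))][:k]

    lam = np.array([3.0, -2.5, 2.0, -1.0, 0.5])     # true spectrum, for a test
    k = 3
    pos2 = np.cumsum(np.sort(lam[lam > 0])[::-1][:k] ** 2)   # exact prefix norms
    neg2 = np.cumsum(np.sort(-lam[lam < 0])[::-1][:k] ** 2)
    print(signed_topk(pos2, neg2, k))               # [ 3.  -2.5  2. ]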

6 Non-adaptive testers

6.1 Non-adaptive vector-matrix-vector queries

We gave a lower bound for one-sided testers earlier in Theorem 13. Here we observe that the sketch of Andoni and Nguyen [AN13] provides a matching upper bound.

Proposition 41.

There is a one-sided non-adaptive 1\ell_{1}-tester that makes O(1/ϵ2)O(1/\epsilon^{2}) vector-matrix-vector queries to AA.

Proof.

We simply apply Proposition 8. Note that the sketch is of the form GTAGG^{T}AG, where Gd×mG\in\mathbb{R}^{d\times m} with m=O(1/ϵ)m=O(1/\epsilon) in our case. Each entry of GTAGG^{T}AG, of which there are m2m^{2}, can be computed with a single vector-matrix-vector query. ∎

Corollary 42.

There is a one-sided non-adaptive p\ell_{p}-tester that makes O(1ϵ2d22/p)O(\frac{1}{\epsilon^{2}}d^{2-2/p}) vector-matrix-vector queries to AA.

Proof.

Apply the previous proposition along with the bound Apd1/p1A1.\left\|A\right\|_{p}\geq d^{1/p-1}\left\|A\right\|_{1}.

6.2 Non-adaptive matrix-vector queries

As a simple corollary of the algorithm given by Corollary 42 we have the following.

Proposition 43.

There exists a one-sided non-adaptive p\ell_{p}-tester making O(1ϵd11/p)O(\frac{1}{\epsilon}d^{1-1/p}) matrix-vector queries.

Proof.

Simply note that a k×kk\times k bilinear sketch may be simulated with kk matrix-vector queries. ∎

We next show that this bound is tight. While we consider the case where the tester queries the standard basis vectors, this is done essentially without loss of generality as any non-adaptive tester may be implemented by querying on an orthonormal set.

Proposition 44.

Suppose that a one-sided matrix-vector tester queries on the standard basis vectors e1,,eke_{1},\ldots,e_{k} and outputs False. Let UU be the top k×kk\times k submatrix of [Ae1,,Aek].[Ae_{1},\ldots,Ae_{k}]. Then if UU is non-singular, there must exist a “witness vector” vspan(e1,,ek)v\in\operatorname{span}(e_{1},\ldots,e_{k}) such that vTAv<0v^{T}Av<0.

Proof.

Let QQ be the matrix with columns Aei,Ae_{i}, and decompose it as

Q=(UB)Q=\begin{pmatrix}U\\ B\end{pmatrix} (31)

where Uk×kU\in\mathbb{R}^{k\times k} and B(dk)×k.B\in\mathbb{R}^{(d-k)\times k}. Suppose that there does not exist a vv as in the statement of the proposition. Note that this implies that UU is PSD (UU is symmetric since AA is), and in fact positive definite by the assumption that UU was non-singular. Now consider the block matrix

A~=(UBTBλI)\widetilde{A}=\begin{pmatrix}U&B^{T}\\ B&\lambda I\end{pmatrix} (32)

for some choice of λ>0.\lambda>0. For arbitrary vv and ww of the appropriate dimensions, we have

(vTwT)(UBTBλI)(vw)\displaystyle\begin{pmatrix}v^{T}&w^{T}\end{pmatrix}\begin{pmatrix}U&B^{T}\\ B&\lambda I\end{pmatrix}\begin{pmatrix}v\\ w\end{pmatrix} =vTUv+2wTBv+λwTw\displaystyle=v^{T}Uv+2w^{T}Bv+\lambda w^{T}w (33)
v2σmin(U)+λw22vwσmax(B).\displaystyle\geq\left\|v\right\|^{2}\sigma_{\min}(U)+\lambda\left\|w\right\|^{2}-2\left\|v\right\|\left\|w\right\|\sigma_{\max}(B). (34)

Since σmin(U)0\sigma_{\min}(U)\neq 0 this expression viewed as a quadratic form in v\left\|v\right\| and w\left\|w\right\| is positive definite for large enough λ.\lambda. This implies that A~\widetilde{A} is positive definite as well. Since A~ei=Aei\widetilde{A}e_{i}=Ae_{i} by construction, this shows that the queries are consistent with a PSD matrix. So a one-sided tester that cannot produce a witness vector in this case must not output False. ∎
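The proof is constructive; the following sketch (ours) builds the completion (32) with lambda chosen so that lambda * sigma_min(U) exceeds sigma_max(B)^2, and checks that it is PSD:

    import numpy as np

    def psd_completion(A, k):
        # Returns the block matrix (32) built from the first k columns of A.
        U, B = A[:k, :k], A[k:, :k]
        sigma_min = np.linalg.eigvalsh((U + U.T) / 2).min()
        assert sigma_min > 0, "requires the top block to be positive definite"
        lam = 2.0 * np.linalg.norm(B, 2) ** 2 / sigma_min + 1.0
        n = A.shape[0] - k
        return np.block([[U, B.T], [B, lam * np.eye(n)]])

    rng = np.random.default_rng(4)
    M = rng.standard_normal((8, 8))
    A = M @ M.T + 0.1 * np.eye(8)            # PSD input, so its first k columns
    C = psd_completion(A, k=3)               # must be consistent with some PSD matrix
    print(np.linalg.eigvalsh(C).min() >= 0)  # True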

Theorem 45.

Set D=diag(λ,1,,1),D=\operatorname{diag}(-\lambda,1,\ldots,1), let SS be a random orthogonal matrix, and take A=STDSA=S^{T}DS. In the matrix-vector model, a one-sided non-adaptive tester must make at least 12d1+λ\frac{1}{2}\frac{d}{1+\lambda} queries to be correct on this distribution with 2/32/3 probability.

Proof.

Given this distribution we may assume without loss of generality that the tester queries on e1,,ek,e_{1},\ldots,e_{k}, whose span we call Ek.E_{k}. Let uu denote the λ-\lambda eigen-direction of A,A, which is distributed uniformly over Sd1.S^{d-1}. For unit vectors xx, the quadratic form associated to AA is negative exactly when x,u2>11+λ.\left\langle x,u\right\rangle^{2}>\frac{1}{1+\lambda}. Also the UU as in Proposition 44 is non-singular with probability 1.1. In this case, by Proposition 44 the tester can only succeed if ΠEku211+λ.\left\|\Pi_{E_{k}}u\right\|^{2}\geq\frac{1}{1+\lambda}. On the other hand 𝔼ΠEku2=k/d\mathbb{E}\left\|\Pi_{E_{k}}u\right\|^{2}=k/d, so by Markov, ΠEku22k/d\left\|\Pi_{E_{k}}u\right\|^{2}\leq 2k/d with probability at least 1/2.1/2. Therefore a tester that succeeds with 2/32/3 probability must have 2k/d1/(1+λ)2k/d\geq 1/(1+\lambda). ∎

Corollary 46.

In the matrix-vector model, a one-sided non-adaptive p\ell_{p}-tester must make at least Ω(1ϵd11/p)\Omega(\frac{1}{\epsilon}d^{1-1/p}) queries.

Proof.

Apply Theorem 45 with λ=ϵd1/p.\lambda=\epsilon d^{1/p}.

7 Conclusion and Open Problems

We gave a series of tight bounds for PSD-testing in both the matrix-vector and vector-matrix-vector models. We provided tight bounds as well as a separation between one and two-sided testers in the latter model. There are a number of additional questions that may yield interesting future work.

  • Our adaptive vector-matrix-vector algorithm for p=1p=1 uses Ω(1/ϵ)\Omega(1/\epsilon) rounds of adaptivity, but this may not always be desirable in practice, since the queries cannot be run in parallel. Are there good algorithms that use less adaptivity? What is the optimal trade-off between query complexity and the number of rounds of adaptivity?

  • One could modify our testing model and consider testers which should output False whenever the p\ell_{p} norm of the negative eigenvalues is at least an ϵ\epsilon fraction of the p\ell_{p} norm of positive eigenvalues. Is it possible to give tight bounds for this problem in the models that we considered?

  • Is it possible to use the ideas behind our two-sided bilinear sketch to give better bounds for spectral estimation with additive Frobenius error?

8 Acknowledgements

The authors would like to thank Sushant Sachdeva for pointing out that our analysis for matrix-vector queries was similar to the application of [AL86] in [SW09].

References

  • [Oja82] Erkki Oja “Simplified neuron model as a principal component analyzer” In Journal of mathematical biology 15.3 Springer, 1982, pp. 267–273
  • [AL86] Owe Axelsson and Gunhild Lindskog “On the rate of convergence of the preconditioned conjugate gradient method” In Numerische Mathematik 48 Springer, 1986, pp. 499–523
  • [Hut89] Michael F Hutchinson “A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines” In Communications in Statistics-Simulation and Computation 18.3 Taylor & Francis, 1989, pp. 1059–1076
  • [BFG96] Zhaojun Bai, Gark Fahey and Gene Golub “Some large-scale matrix computation problems” In Journal of Computational and Applied Mathematics 74.1-2 Elsevier, 1996, pp. 71–89
  • [KS03] Robert Krauthgamer and Ori Sasson “Property testing of data dimensionality” In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA ACM/SIAM, 2003, pp. 18–27
  • [Lit+05] Alexander E Litvak, Alain Pajor, Mark Rudelson and Nicole Tomczak-Jaegermann “Smallest singular value of random matrices and geometry of random polytopes” In Advances in Mathematics 195.2 Elsevier, 2005, pp. 491–523
  • [SW09] Daniel A Spielman and Jaeoh Woo “A note on preconditioning by low-stretch spanning trees” In arXiv preprint arXiv:0903.2816, 2009
  • [AT11] Haim Avron and Sivan Toledo “Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix” In Journal of the ACM (JACM) 58.2 ACM New York, NY, USA, 2011, pp. 1–34
  • [AN13] Alexandr Andoni and Huy L Nguyen “Eigenvalues of a matrix in the streaming model” In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, 2013, pp. 1729–1737 SIAM
  • [WWZ14] Karl Wimmer, Yi Wu and Peng Zhang “Optimal query complexity for estimating the trace of a matrix” In International Colloquium on Automata, Languages, and Programming, 2014, pp. 1051–1062 Springer
  • [MM15] Cameron Musco and Christopher Musco “Randomized block krylov methods for stronger and faster approximate singular value decomposition” In arXiv preprint arXiv:1504.05477, 2015
  • [Ai+16] Yuqing Ai, Wei Hu, Yi Li and David P Woodruff “New characterizations in turnstile streams with applications” In 31st Conference on Computational Complexity (CCC 2016), 2016 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
  • [Jai+16] Prateek Jain et al. “Streaming pca: Matching matrix bernstein and near-optimal finite sample guarantees for oja’s algorithm” In Conference on learning theory, 2016, pp. 1147–1164 PMLR
  • [LW16] Yi Li and David P Woodruff “Tight bounds for sketching the operator norm, Schatten norms, and subspace embeddings” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016), 2016 Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
  • [Sha16] Ohad Shamir “Convergence of stochastic gradient descent for PCA” In International Conference on Machine Learning, 2016, pp. 257–265 PMLR
  • [AL17] Zeyuan Allen-Zhu and Yuanzhi Li “First efficient convergence for streaming k-pca: a global, gap-free, and near-optimal rate” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017, pp. 487–492 IEEE
  • [CW17] Kenneth L Clarkson and David P Woodruff “Low-rank approximation and regression in input sparsity time” In Journal of the ACM (JACM) 63.6 ACM New York, NY, USA, 2017, pp. 1–45
  • [CW17a] Kenneth L Clarkson and David P Woodruff “Low-rank PSD approximation in input-sparsity time” In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 2017, pp. 2061–2072 SIAM
  • [Han+17] Insu Han, Dmitry Malioutov, Haim Avron and Jinwoo Shin “Approximating spectral sums of large-scale matrices using stochastic chebyshev approximations” In SIAM Journal on Scientific Computing 39.4 SIAM, 2017, pp. A1558–A1585
  • [SER18] Max Simchowitz, Ahmed El Alaoui and Benjamin Recht “Tight query complexity lower bounds for PCA via finite sample deformed Wigner law” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018, pp. 1249–1259
  • [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge university press, 2018
  • [Bal+19] Maria-Florina Balcan, Yi Li, David P Woodruff and Hongyang Zhang “Testing matrix rank, optimally” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 727–746 SIAM
  • [MSW19] Michela Meister, Tamas Sarlos and David Woodruff “Tight dimensionality reduction for sketching low degree polynomial kernels”, 2019
  • [Sun+19] Xiaoming Sun, David P Woodruff, Guang Yang and Jialin Zhang “Querying a matrix through matrix-vector products” In arXiv preprint arXiv:1906.05736, 2019
  • [BCJ20] Ainesh Bakshi, Nadiia Chepurko and Rajesh Jayaram “Testing positive semi-definiteness via random submatrices” In arXiv preprint arXiv:2005.06441, 2020
  • [Bra+20] Mark Braverman, Elad Hazan, Max Simchowitz and Blake Woodworth “The gradient complexity of linear regression” In Conference on Learning Theory, 2020, pp. 627–647 PMLR
  • [RWZ20] Cyrus Rashtchian, David P Woodruff and Hanlin Zhu “Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems” In arXiv preprint arXiv:2006.14015, 2020
  • [BMR21] Rajarshi Bhattacharjee, Cameron Musco and Archan Ray “Sublinear Time Eigenvalue Approximation via Random Sampling” In CoRR abs/2109.07647, 2021
  • [BMR21a] Rajarshi Bhattacharjee, Cameron Musco and Archan Ray “Sublinear Time Eigenvalue Approximation via Random Sampling” In arXiv preprint arXiv:2109.07647, 2021
  • [KPW21] Akshay Kamath, Eric Price and David P Woodruff “A simple proof of a new set disjointness with applications to data streams” In arXiv preprint arXiv:2105.11338, 2021
  • [Mey+21] Raphael A Meyer, Cameron Musco, Christopher Musco and David P Woodruff “Hutch++: Optimal Stochastic Trace Estimation” In Symposium on Simplicity in Algorithms (SOSA), 2021, pp. 142–155 SIAM
  • [INW22] Piotr Indyk, Shyam Narayanan and David P. Woodruff “Frequency Estimation with One-Sided Error” In SODA, 2022