
A General Algorithm for Solving Rank-one Matrix Sensing

Lianke Qin lianke@ucsb.edu. UCSB.    Zhao Song zsong@adobe.com. Adobe Research.    Ruizhe Zhang ruizhe@utexas.edu. The University of Texas at Austin.

Matrix sensing has many real-world applications in science and engineering, such as system control, distance embedding, and computer vision. The goal of matrix sensing is to recover a matrix A_{\star}\in\mathbb{R}^{n\times n} from a sequence of measurements (u_{i},b_{i})\in\mathbb{R}^{n}\times\mathbb{R} such that u_{i}^{\top}A_{\star}u_{i}=b_{i}. Previous work [39] focused on the scenario where the matrix A_{\star} has small rank, e.g. rank-k. Their analysis relies heavily on the RIP assumption, making it unclear how to generalize to high-rank matrices. In this paper, we relax the rank-k assumption and solve a much more general matrix sensing problem. Given an accuracy parameter \delta\in(0,1), we can compute A\in\mathbb{R}^{n\times n} in \widetilde{O}(m^{3/2}n^{2}\delta^{-1}) time such that |u_{i}^{\top}Au_{i}-b_{i}|\leq\delta for all i\in[m]. We design an efficient algorithm with provable convergence guarantees using stochastic gradient descent for this problem.

1 Introduction

Matrix sensing is a generalization of the famous compressed sensing problem. Informally, the goal of matrix sensing is to reconstruct a matrix A\in\mathbb{R}^{n\times n} using a small number of quadratic measurements (i.e., u^{\top}Au). It has many real-world applications, including image processing [4, 37], quantum computing [1, 9, 20], systems [29] and sensor localization [16] problems. For this problem, there are two important theoretical questions:

  • Q1. Compression: how to design the sensing vectors u\in\mathbb{R}^{n} so that the matrix can be recovered with a small number of measurements?

  • Q2. Reconstruction: how fast can we recover the matrix given the measurements?

[39] initiated the study of the rank-one matrix sensing problem, where the ground-truth matrix A_{\star} has rank k and the measurements are of the form u_{i}^{\top}A_{\star}u_{i}. They ask for the smallest number of measurements m needed to recover the matrix A_{\star}. In our setting, we assume m is a fixed input parameter that we are not allowed to choose, and we show how to design a faster algorithm, for any m and n, for the optimization problem of finding A\approx A_{\star}. Thus, in some sense, previous work [39, 7] mainly focuses on problem Q1 with a low-rank assumption on A_{\star}, while our work focuses on Q2 without the low-rank assumption.

We observe that in many applications, the ground-truth matrix A_{\star} does not need to be recovered exactly (i.e., \|A-A_{\star}\|\leq n^{-c}). For example, for distance embedding, we would like to learn an embedding matrix between all the data points in a high-dimensional space. The embedding matrix is then used for calculating data points’ pairwise distances for a higher-level machine learning algorithm, such as k-nearest neighbor clustering. As long as we can recover a good approximation of the embedding matrix, the clustering algorithm can deliver the desired results. As we relax the accuracy constraints of the matrix sensing, we have the opportunity to speed up the matrix sensing time.

We formulate our problem in the following way:

Problem 1.1 (Approximate matrix sensing).

Given a ground-truth positive definite matrix A_{\star}\in\mathbb{R}^{n\times n} and m samples (u_{i},b_{i})\in\mathbb{R}^{n}\times\mathbb{R} such that u_{i}^{\top}A_{\star}u_{i}=b_{i}, let R=\max_{i\in[m]}|b_{i}|. For any accuracy parameter \delta\in(0,1), find a matrix A\in\mathbb{R}^{n\times n} such that

(u_{i}^{\top}Au_{i}-u_{i}^{\top}A_{\star}u_{i})^{2}\leq\delta,\quad\forall i\in[m] (1)

or

(1-\delta)A_{\star}\preceq A\preceq(1+\delta)A_{\star}. (2)

We make a few remarks about Problem 1.1. First, our formulation does not require the matrix A_{\star} to be low-rank, unlike the literature [39]. Second, we only need the measurement vectors u_{i} to be “approximately orthogonal” (i.e., |u_{i}^{\top}u_{j}| is small), while [39] makes much stronger assumptions for exact reconstruction. Third, the measurement approximation guarantee (Eq. (1)) does not imply the spectral approximation guarantee (Eq. (2)). We mainly focus on achieving the first guarantee and discuss the second one in the appendix.

This problem is interesting for two reasons. First, speeding up matrix sensing is salient for a wide range of applications where exact matrix recovery is not required. Second, we would like to understand the fundamental tradeoff between the accuracy parameter \delta and the running time. This tradeoff can give us insights into the fundamental computational complexity of matrix sensing.

This paper makes the following contributions:

  • We design a potential function to measure the distance between the approximate solution and the ground-truth matrix.

  • Based on the potential function, we show that gradient descent can efficiently find an approximate solution of the matrix sensing problem. We also prove the convergence rate of our algorithm.

  • Furthermore, we show that the cost-per-iteration can be improved by using stochastic gradient descent with a provable convergence guarantee, which is proved by generalizing the potential function to a randomized potential function.

Technically, our potential function applies a \cosh function to each “training loss” (i.e., u_{i}^{\top}Au_{i}-b_{i}), which is inspired by the potential function for linear programming [5]. We prove that the potential is decreasing for each iteration of gradient descent, and a small potential implies a good approximation. In this way, we can upper bound the number of iterations needed for the gradient descent algorithm.

To reduce the cost-per-iteration, we follow the idea of stochastic gradient descent and evaluate the gradient of the potential function on a subset of measurements. However, we still need to know the full gradient’s norm for normalization, which is a function of the training losses. It is too slow to naively compute each training loss. Instead, we use the idea of maintenance [5, 28, 3, 2, 19, 11, 34, 35, 13, 31] and show that the training loss at the (t+1)-th iteration (i.e., u_{i}^{\top}A_{t+1}u_{i}-b_{i}) can be obtained very efficiently from that at the t-th iteration (i.e., u_{i}^{\top}A_{t}u_{i}-b_{i}). Therefore, we first preprocess the initial full gradient’s norm, and in the following iterations, we update this quantity based on the previous iteration’s result.
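As an illustration of this maintenance idea, here is a minimal NumPy sketch (our own toy check, not the paper's code; it assumes orthonormal measurement vectors): after a rank-B update to the iterate, only the sampled losses change, each by an explicitly computable amount.

import numpy as np

# Toy check: after A <- A - sum_{j in S} c_j * u_j u_j^T with orthonormal u_i,
# only the sampled losses z_i = u_i^T A u_i - b_i change, and each by exactly c_i.
rng = np.random.default_rng(0)
n, m, B = 8, 6, 3
U, _ = np.linalg.qr(rng.standard_normal((n, m)))   # columns are orthonormal measurement vectors
b = rng.standard_normal(m)
A = np.eye(n)
z = np.array([U[:, i] @ A @ U[:, i] - b[i] for i in range(m)])   # O(m n^2) preprocessing
S = np.array([0, 2, 4])                                          # sampled batch of size B
c = rng.standard_normal(B)                                       # coefficients of the rank-one updates
A_new = A - sum(c[j] * np.outer(U[:, S[j]], U[:, S[j]]) for j in range(B))
z_maintained = z.copy()
z_maintained[S] -= c                                             # O(B) maintenance step
z_recomputed = np.array([U[:, i] @ A_new @ U[:, i] - b[i] for i in range(m)])
assert np.allclose(z_maintained, z_recomputed)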

We state our main result as follows:

Theorem 1.2 (Informal version of Theorem 6.1).

Given m measurements of the matrix sensing problem, there is an algorithm that outputs an n\times n matrix A in \widetilde{O}(m^{3/2}n^{2}R\delta^{-1}) time such that |u_{i}^{\top}Au_{i}-b_{i}|\leq\delta for all i\in[m].

2 Related Work

Linear Programming

Linear programming is one of the foundations of algorithm design and convex optimization. Many problems can be modeled as linear programs to take advantage of fast algorithms. There are many works on accelerating the runtime complexity of linear programming [25, 26, 5, 28, 3, 2, 33, 8, 19, 10].

Semi-definite Programming

Semidefinite programming optimizes a linear objective function over the intersection of the positive semidefinite cone with an affine space. It is a fundamental class of optimization problems, and many problems in machine learning and theoretical computer science can be modeled or approximated as semidefinite programs. There are many studies on speeding up the running time of semidefinite programming [30, 12, 27, 15, 14, 11, 10].

Matrix Sensing

Matrix sensing [22, 32, 17, 39, 7] is a generalization of the popular compressive sensing problem for sparse vectors and has applications in several domains such as control and vision. A set of universal Pauli measurements, used in quantum state tomography, has been shown to satisfy the RIP condition [23]. These measurement operators are Kronecker products of 2\times 2 matrices and thus have appealing computation and memory efficiency. Rank-one measurements with nuclear norm minimization are also used in other work [6, 21]. There is also previous work on low-rank matrix sensing that reconstructs a matrix exactly using a small number of linear measurements. ProcrustesFlow [36] designs an algorithm to recover a low-rank matrix from linear measurements. There are other low-rank matrix recovery algorithms based on non-convex optimization [38, 24].

3 Preliminary

Notations.

For a positive integer n, we use [n] to denote the set \{1,2,\cdots,n\}. We use \cosh(x)=\frac{1}{2}(e^{x}+e^{-x}) and \sinh(x)=\frac{1}{2}(e^{x}-e^{-x}). For a square matrix A, we use \operatorname{tr}[A] to denote the trace of A. An n\times n symmetric real matrix A is said to be positive definite if x^{\top}Ax>0 for all non-zero x\in\mathbb{R}^{n}, and positive semidefinite if x^{\top}Ax\geq 0 for all x\in\mathbb{R}^{n}. For any function f, we use \widetilde{O}(f)=f\cdot\operatorname{poly}(\log f).

3.1 Matrix hyperbolic functions

Definition 3.1 (Matrix function).

Let f:\mathbb{R}\rightarrow\mathbb{R} be a real function and A\in\mathbb{R}^{n\times n} be a real symmetric matrix with eigendecomposition

A=Q\Lambda Q^{-1},

where \Lambda\in\mathbb{R}^{n\times n} is a diagonal matrix. Then, we define

f(A):=Qf(\Lambda)Q^{-1},

where f(\Lambda)\in\mathbb{R}^{n\times n} is the matrix obtained by applying f to each diagonal entry of \Lambda.
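As a concrete illustration (our own sketch, not part of the paper), Definition 3.1 can be implemented directly from an eigendecomposition; for a symmetric A the matrix Q is orthogonal, so Q^{-1}=Q^{\top}.

import numpy as np
from scipy.linalg import expm

def matrix_function(f, A):
    # f(A) = Q f(Lambda) Q^T for a real symmetric A (Definition 3.1)
    eigvals, Q = np.linalg.eigh(A)
    return Q @ np.diag(f(eigvals)) @ Q.T

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                         # make A symmetric
cosh_A = matrix_function(np.cosh, A)
# consistency check with cosh(A) = (exp(A) + exp(-A)) / 2
assert np.allclose(cosh_A, (expm(A) + expm(-A)) / 2)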

We have the following lemma to bound \cosh(A); we defer the proof to Appendix A.3.

Lemma 3.2.

Let A be a real symmetric matrix. Then we have

\|\cosh(A)\|=\cosh(\|A\|)\leq\operatorname{tr}[\cosh(A)].

We also have

\|A\|\leq 1+\log(\operatorname{tr}[\cosh(A)]).

3.2 Properties of \sinh and \cosh

We have the following lemma for properties of \sinh and \cosh.

Lemma 3.3 (Scalar version).

Given a list of numbers x_{1},\cdots,x_{n}, we have

  • (\sum_{i=1}^{n}\cosh^{2}(x_{i}))^{1/2}\leq\sqrt{n}+(\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2},

  • (\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2}\geq\frac{1}{\sqrt{n}}(\sum_{i=1}^{n}\cosh(x_{i})-n).

Proof.

For the first inequality, we can bound (\sum_{i=1}^{n}\cosh^{2}(x_{i}))^{1/2} by:

(\sum_{i=1}^{n}\cosh^{2}(x_{i}))^{1/2}= (n+\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2}
\leq \sqrt{n}+(\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2},

where the first step follows from Fact A.5, and the second step follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}.

For the second inequality, we can bound (\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2} by:

(\sum_{i=1}^{n}\sinh^{2}(x_{i}))^{1/2}\geq \frac{1}{\sqrt{n}}\sum_{i=1}^{n}|\sinh(x_{i})|
= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sqrt{\cosh^{2}(x_{i})-1}
\geq \frac{1}{\sqrt{n}}(\sum_{i=1}^{n}\cosh(x_{i})-n),

where the first step follows from \|\cdot\|_{2}\geq\frac{1}{\sqrt{n}}\|\cdot\|_{1}, the second step follows from Fact A.5, and the third step follows from \sqrt{x^{2}-1}\geq x-1 for x\geq 1. ∎
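The two inequalities are also easy to check numerically; the following small script (ours, for illustration only) verifies them on random inputs.

import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.uniform(-3, 3, size=rng.integers(1, 10))
    n = len(x)
    # Part 1: (sum cosh^2)^{1/2} <= sqrt(n) + (sum sinh^2)^{1/2}
    assert np.sqrt(np.sum(np.cosh(x) ** 2)) <= np.sqrt(n) + np.sqrt(np.sum(np.sinh(x) ** 2)) + 1e-9
    # Part 2: (sum sinh^2)^{1/2} >= (sum cosh - n) / sqrt(n)
    assert np.sqrt(np.sum(np.sinh(x) ** 2)) >= (np.sum(np.cosh(x)) - n) / np.sqrt(n) - 1e-9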

We also have a lemma for the matrix version.

Lemma 3.4 (Matrix version).

For any real symmetric matrix A\in\mathbb{R}^{n\times n}, we have

  • (\operatorname{tr}[\cosh^{2}(A)])^{1/2}\leq\sqrt{n}+(\operatorname{tr}[\sinh^{2}(A)])^{1/2},

  • (\operatorname{tr}[\sinh^{2}(A)])^{1/2}\geq\frac{1}{\sqrt{n}}(\operatorname{tr}[\cosh(A)]-n).

Proof.

Part 1. We have

(\operatorname{tr}[\cosh^{2}(A)])^{1/2}= (n+\operatorname{tr}[\sinh^{2}(A)])^{1/2}
\leq \sqrt{n}+(\operatorname{tr}[\sinh^{2}(A)])^{1/2},

where the first step follows from \cosh^{2}(A)-\sinh^{2}(A)=I (Lemma A.6), and the second step follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}.

Part 2. Let \sigma_{1},\dots,\sigma_{n} denote the eigenvalues of \cosh(A), so that \sigma_{i}\geq 1 for all i. Then

(\operatorname{tr}[\sinh^{2}(A)])^{1/2}= (\operatorname{tr}[\cosh^{2}(A)]-n)^{1/2}
= (\sum_{i=1}^{n}(\sigma_{i}^{2}-1))^{1/2}
\geq \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sqrt{\sigma_{i}^{2}-1}
\geq \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\sigma_{i}-1)
= \frac{1}{\sqrt{n}}(\operatorname{tr}[\cosh(A)]-n),

where the first step follows from Lemma A.6, the third step follows from \|\cdot\|_{2}\geq\frac{1}{\sqrt{n}}\|\cdot\|_{1}, and the fourth step follows from \sqrt{\sigma_{i}^{2}-1}\geq\sigma_{i}-1 for \sigma_{i}\geq 1. ∎

4 Technique Overview

We first analyze the convergence guarantee of our matrix sensing algorithm based on gradient descent and improve its time complexity with stochastic gradient descent under the assumption that \{u_{i}\}_{i\in[m]} are orthogonal vectors. We then analyze the convergence guarantee of our matrix sensing algorithm under a more general assumption where \{u_{i}\}_{i\in[m]} are non-orthogonal vectors and |u_{i}^{\top}u_{j}|\leq\rho.

Gradient descent.

We begin with the case where \{u_{i}\}_{i\in[m]} are orthogonal vectors in \mathbb{R}^{n}. We use the following entry-wise potential function:

\Phi_{\lambda}(A):=\sum_{i=1}^{m}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i}))

and analyze its progress during gradient descent according to the update formula defined in Eq. (4) for each iteration. We split the second-order term of the potential change into a diagonal term and an off-diagonal term. We can upper bound the diagonal term and prove that the off-diagonal term is zero. Combining the two terms, we can upper bound the progress per iteration in Lemma 5.3 by:

\Phi_{\lambda}(A_{t+1})\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}.

By accumulating the per-iteration progress on the entry-wise potential function over T=\widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations, we have \Phi(A_{T+1})\leq O(m). This implies that our Algorithm 1 can output a matrix A_{T}\in\mathbb{R}^{n\times n} satisfying the guarantee in Eq. (23), and the corresponding cost per iteration is O(mn^{2}).

We then analyze gradient descent under the assumption that \{u_{i}\}_{i\in[m]} are non-orthogonal vectors in \mathbb{R}^{n} with |u_{i}^{\top}u_{j}|\leq\rho and \rho\leq\frac{1}{10m}. We can upper bound the diagonal entries and off-diagonal entries respectively and obtain the same per-iteration progress in Lemma D.1. Accumulating over T=\widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations, we can prove the approximation guarantee of the output matrix of our matrix sensing algorithm.

Stochastic gradient descent.

To further improve the time cost per iteration of our approximate matrix sensing, by uniformly sampling a subset {\cal B}\subset[m] of size B, we compute the gradient of the stochastic potential function:

\nabla\Phi_{\lambda}(A,{\cal B}):=\frac{m}{|{\cal B}|}\sum_{i\in{\cal B}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i})),

and update the potential function based on update formula defined in Eq. (8). We upper bound the diagonal and off-diagonal terms respectively and obtain the expected progress on the potential function in Lemma 6.3.

Over T=\widetilde{\Omega}(m^{3/2}B^{-1}R\delta^{-1}) iterations, we can upper bound \Phi(A_{T+1})\leq O(m) with high probability. With a similar argument to the gradient descent section, we can prove that the SGD matrix sensing algorithm outputs a solution matrix satisfying the same approximation guarantee with high success probability in Lemma 6.4. The resulting cost per iteration is O(Bn^{2}), where B is the SGD batch size.

For the more general assumption where \{u_{i}\}_{i\in[m]} are non-orthogonal vectors in \mathbb{R}^{n} and |u_{i}^{\top}u_{j}| has an upper bound, we also provide the cost-per-iteration analysis for stochastic gradient descent by bounding the diagonal entries and off-diagonal entries of the gradient matrix respectively. Then we prove that the progress on the expected potential satisfies the same guarantee as gradient descent in Lemma E.2. Therefore, our SGD matrix sensing algorithm outputs a matrix satisfying the approximation guarantee after

T=\widetilde{\Omega}(m^{3/2}B^{-1}R\delta^{-1})

iterations under the general assumption.

5 Gradient descent for entry-wise potential function

In this section, we show how to obtain an approximate solution of matrix sensing via gradient descent. For simplicity, we start from the case where \{u_{i}\}_{i\in[m]} are orthogonal vectors in \mathbb{R}^{n} (we note that A^{\prime}:=\sum_{i=1}^{m}b_{i}u_{i}u_{i}^{\top} is a solution satisfying u_{i}^{\top}A^{\prime}u_{i}=b_{i} for all i\in[m]; however, we pretend that we do not know this solution in this section). This case already conveys the key ideas of our algorithm and analysis, and we generalize the solution to the non-orthogonal case (see Appendix D). We show that \widetilde{\Omega}(\sqrt{m}/\delta) iterations of gradient descent can output a \delta-approximate solution, where each iteration takes O(mn^{2}) time. Below is the main theorem of this section:

Theorem 5.1 (Gradient descent for orthogonal measurements).

Suppose u_{1},\dots,u_{m}\in\mathbb{R}^{n} are orthogonal unit vectors, and suppose |b_{i}|\leq R for all i\in[m]. There exists an algorithm that, for any \delta\in(0,1), performs \widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations of gradient descent with O(mn^{2}) time per iteration and outputs a matrix A\in\mathbb{R}^{n\times n} satisfying:

|u_{i}^{\top}Au_{i}-b_{i}|\leq\delta\quad\forall i\in[m].

In Section 5.1, we introduce the algorithm and prove its time complexity. In Sections 5.2–5.4, we analyze the convergence of our algorithm.

5.1 Algorithm

The key idea of the gradient descent matrix sensing algorithm (Algorithm 1) is to follow the gradient of the entry-wise potential function defined as follows:

\Phi_{\lambda}(A):=\sum_{i=1}^{m}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i})). (3)

Then, we have the following solution update formula:

A_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (4)
Lemma 5.2 (Cost-per-iteration of gradient descent).

Each iteration of Algorithm 1 takes O(mn^{2}) time.

Proof.

In each iteration, we first evaluate u_{i}^{\top}A_{t}u_{i} for all i\in[m], which takes O(mn^{2}) time. Then, \nabla\Phi_{\lambda}(A_{t}) can be computed by summing m rank-1 matrices, which takes O(mn^{2}) time. Finally, at Line 6, the solution can be updated in O(n^{2}) time. Thus, the total running time for each iteration is O(mn^{2}). ∎

Algorithm 1 Matrix Sensing by Gradient Descent.
1:procedure GradientDescent(\{u_{i},b_{i}\}_{i\in[m]}) \triangleright Theorem 5.1
2:     \tau\leftarrow\max_{i\in[m]}b_{i}
3:     A_{1}\leftarrow\tau\cdot I
4:     for t=1\to T do
5:         \nabla\Phi_{\lambda}(A_{t})\leftarrow\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})) \triangleright Compute the gradient
6:         A_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}
7:     end for
8:     return A_{T+1}
9:end procedure
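For concreteness, here is a minimal NumPy sketch of Algorithm 1 under the orthonormal-measurement assumption. The constants in the choices of \lambda and \epsilon, the stopping rule, and all variable names are illustrative and are not tuned to match the theoretical analysis.

import numpy as np

def matrix_sensing_gd(U, b, delta=0.1, max_iters=20000):
    # U: n x m matrix whose columns are (assumed orthonormal) measurement vectors u_i
    # b: length-m vector of measurements b_i = u_i^T A_star u_i
    n, m = U.shape
    lam = np.log(m + 2) / delta                      # lambda = Theta(delta^{-1} log m)
    eps = 0.01 / lam                                 # step size with eps * lambda small
    A = np.max(b) * np.eye(n)                        # A_1 = tau * I with tau = max_i b_i
    for _ in range(max_iters):
        z = np.einsum('ij,jk,ki->i', U.T, A, U) - b  # losses u_i^T A u_i - b_i, O(m n^2)
        if np.max(np.abs(z)) <= delta:
            break
        coeffs = lam * np.sinh(lam * z)
        grad = (U * coeffs) @ U.T                    # sum_i coeffs_i * u_i u_i^T
        A -= eps * grad / np.linalg.norm(grad, 'fro')
    return A

# toy usage
rng = np.random.default_rng(0)
n, m = 16, 8
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
A_star = np.diag(rng.uniform(1, 2, size=n))
b = np.array([U[:, i] @ A_star @ U[:, i] for i in range(m)])
A_hat = matrix_sensing_gd(U, b, delta=0.1)
print(max(abs(U[:, i] @ A_hat @ U[:, i] - b[i]) for i in range(m)))   # typically <= 0.1 after convergence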

5.2 Analysis of One Iteration

Throughout this section, we suppose A\in\mathbb{R}^{n\times n} is a symmetric matrix.

We can compute the gradient of \Phi_{\lambda}(A) with respect to A as follows:

\nabla\Phi_{\lambda}(A)=\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh\left(\lambda(u_{i}^{\top}Au_{i}-b_{i})\right)\in\mathbb{R}^{n\times n}. (5)

We can compute the Hessian of \Phi_{\lambda}(A) with respect to A as follows:

\nabla^{2}\Phi_{\lambda}(A)=\sum_{i=1}^{m}(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\lambda^{2}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i})).

Here, \nabla^{2}\Phi_{\lambda}(A)\in\mathbb{R}^{n^{2}\times n^{2}} and \otimes is the Kronecker product.
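For completeness, Eq. (5) follows from a standard matrix-calculus step (a brief derivation we add here for the reader; it is not spelled out in the original text): since \frac{\partial}{\partial A}(u_{i}^{\top}Au_{i})=u_{i}u_{i}^{\top}, the chain rule gives

\frac{\partial}{\partial A}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i}))=\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i}))\cdot u_{i}u_{i}^{\top},

and summing over i\in[m] yields \nabla\Phi_{\lambda}(A). Differentiating once more with respect to the vectorized A turns each u_{i}u_{i}^{\top} into (u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top}) and \sinh into \lambda\cosh, which gives the Kronecker-product Hessian above.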

Lemma 5.3 (Progress on entry-wise potential).

Assume that u_{i}^{\top}u_{j}=0 for any i\neq j\in[m] and \|u_{i}\|^{2}=1 for all i\in[m]. Let c\in(0,1) denote a sufficiently small positive constant. Then, for any \epsilon,\lambda>0 such that \epsilon\lambda\leq c,

we have for any t>0,

\Phi_{\lambda}(A_{t+1})\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}
Proof.

We defer the proof to Appendix B.1. ∎

5.3 Technical Claims

We prove some technical claims below.

Claim 5.4.

For Q_{1} defined in Eq. (18), we have

Q_{1}\leq\big(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\big)\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}.
Proof.

For simplicity, we define z_{t,i} to be

z_{t,i}:=\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}).

Recall that

\nabla^{2}\Phi_{\lambda}(A_{t})=\lambda^{2}\cdot\sum_{i=1}^{m}(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\cosh(z_{t,i}).

For Q_{1}, we have

Q_{1}= \operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top})]
= \lambda^{2}\cdot\operatorname{tr}[\sum_{i=1}^{m}\cosh(z_{t,i})(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})]
= \lambda^{2}\cdot\sum_{i=1}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,i})(u_{i}u_{i}^{\top}u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top}u_{i}u_{i}^{\top})]
= \lambda^{2}\cdot\sum_{i=1}^{m}\cosh(z_{t,i})\sinh^{2}(z_{t,i})
\leq \lambda^{2}\cdot(\sum_{i=1}^{m}\cosh^{2}(z_{t,i}))^{1/2}\cdot(\sum_{i=1}^{m}\sinh^{4}(z_{t,i}))^{1/2}
=: \lambda^{2}\cdot B_{1}\cdot B_{2}, (6)

where the first step comes from the definition of Q_{1}, the second step comes from the definition of \nabla^{2}\Phi_{\lambda}(A_{t}), the third step follows from (A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD) and u_{i}^{\top}u_{j}=0 for i\neq j, the fourth step comes from \|u_{i}\|=1 and \operatorname{tr}[(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})]=1, and the fifth step follows from the Cauchy–Schwarz inequality.

For the term B_{1}, we have

B_{1}= (\sum_{i=1}^{m}\cosh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{1/2}
\leq \sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F},

where the second step follows from Part 1 of Lemma 3.3 together with \|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}=\lambda^{2}\sum_{i=1}^{m}\sinh^{2}(z_{t,i}).

For the term B_{2}, we have

B_{2}= (\sum_{i=1}^{m}\sinh^{4}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{1/2}
\leq \frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2},

where the second step follows from \|x\|_{4}^{2}\leq\|x\|_{2}^{2}. This implies that

Q_{1}\leq \lambda^{2}\cdot B_{1}\cdot B_{2}
\leq \lambda^{2}\cdot(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
= (\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}. ∎

Claim 5.5.

For Q_{2} defined in Eq. (19), we have Q_{2}=0.

Proof.

In Q_{2}, we have:

\sum_{\ell=1}^{m}(u_{\ell}u_{\ell}^{\top}\otimes u_{\ell}u_{\ell}^{\top})\sum_{i\neq j}(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top})
= \sum_{\ell=1}^{m}\sum_{i\neq j}(u_{\ell}u_{\ell}^{\top}u_{i}u_{i}^{\top})\otimes(u_{\ell}u_{\ell}^{\top}u_{j}u_{j}^{\top})
= 0, (7)

where the first step follows from (A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD), and the second step follows from u_{i}^{\top}u_{j}=0 for i\neq j together with the fact that \ell\neq i or \ell\neq j always holds in each term of Eq. (7), so at least one factor in every Kronecker term vanishes.

Therefore, we get that Q_{2}=0. ∎

5.4 Convergence for multiple iterations

The goal of this section is to prove the convergence of Algorithm 1:

Lemma 5.6 (Convergence of gradient descent).

Suppose the measurement vectors \{u_{i}\}_{i\in[m]} are orthogonal unit vectors, and suppose |b_{i}| is bounded by R for all i\in[m]. Then, for any \delta\in(0,1), if we take \lambda=\Omega(\delta^{-1}\log m) and \epsilon=O(\lambda^{-1}) in Algorithm 1, then for T=\widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations, the solution matrix A_{T} satisfies:

|u_{i}^{\top}A_{T}u_{i}-b_{i}|\leq\delta\quad\forall i\in[m].
Proof.

We defer the proof to Appendix B.2. ∎

Theorem 5.1 follows immediately from Lemma 5.2 and Lemma 5.6.

6 Stochastic gradient descent

In this section, we show that the cost-per-iteration of the approximate matrix sensing algorithm can be improved by using stochastic gradient descent (SGD). More specifically, SGD can obtain a \delta-approximate solution with O(Bn^{2}) time per iteration, where 0<B<m is the size of the mini-batch in SGD. Below is the main theorem of this section:

Theorem 6.1 (Stochastic gradient descent for orthogonal measurements).

Suppose u_{1},\dots,u_{m}\in\mathbb{R}^{n} are orthogonal unit vectors, and suppose |b_{i}|\leq R for all i\in[m]. There exists an algorithm that, for any \delta\in(0,1), performs \widetilde{O}(m^{3/2}B^{-1}R\delta^{-1}) iterations of stochastic gradient descent with O(Bn^{2}) time per iteration and outputs a matrix A\in\mathbb{R}^{n\times n} satisfying:

|u_{i}^{\top}Au_{i}-b_{i}|\leq\delta\quad\forall i\in[m].

The algorithm and its time complexity are provided in Section 6.1. The convergence is proved in Sections 6.2 and 6.3. The SGD algorithm for general measurements, without the assumption that \{u_{i}\}_{i\in[m]} are orthogonal vectors, is deferred to Appendix E.

6.1 Algorithm

We can use the stochastic gradient descent algorithm (Algorithm 2) for matrix sensing. More specifically, in each iteration, we will uniformly sample a subset {\cal B}\subset[m] of size B, and then compute the gradient of the stochastic potential function:

\nabla\Phi_{\lambda}(A,{\cal B}):=\frac{m}{|{\cal B}|}\sum_{i\in{\cal B}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i})), (8)

which is an n-by-n matrix. Then, we do the following gradient step:

A_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (9)
Lemma 6.2 (Running time of stochastic gradient descent).

Algorithm 2 takes O(mn^{2}) time for preprocessing, and each iteration takes O(Bn^{2}) time.

Proof.

The time-consuming step is to compute \|\nabla\Phi_{\lambda}(A_{t})\|_{F}. Since

\nabla\Phi_{\lambda}(A_{t})=\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh\left(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})\right),

and u_{i}\perp u_{j} for i\neq j\in[m], we know that u_{i} is an eigenvector of \nabla\Phi_{\lambda}(A_{t}) with eigenvalue \lambda\sinh\left(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})\right) for each i\in[m]. Thus, we have

\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}= \sum_{i=1}^{m}\lambda^{2}\sinh^{2}\left(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})\right)
= \sum_{i=1}^{m}\lambda^{2}\sinh^{2}(\lambda z_{t,i}),

where z_{t,i}:=u_{i}^{\top}A_{t}u_{i}-b_{i} for i\in[m]. Then, if we know \{z_{t,i}\}_{i\in[m]}, we can compute \|\nabla\Phi_{\lambda}(A_{t})\|_{F} in O(m) time.

Consider the change z_{t+1,i}-z_{t,i}:

z_{t+1,i}-z_{t,i}= u_{i}^{\top}(A_{t+1}-A_{t})u_{i}
= -\frac{\epsilon}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}}\cdot u_{i}^{\top}\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})u_{i}
= -\frac{\epsilon\lambda m}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B}\sum_{j\in{\cal B}_{t}}u_{i}^{\top}u_{j}u_{j}^{\top}u_{i}\cdot\sinh(\lambda z_{t,j})
= -\frac{\epsilon\lambda m\sinh(\lambda z_{t,i})}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B}\cdot{\bf 1}_{i\in{\cal B}_{t}},

where the last step follows from u_{i}\perp u_{j} for i\neq j. Hence, if we have already computed \{z_{t,i}\}_{i\in[m]} and \|\nabla\Phi_{\lambda}(A_{t})\|_{F}, the values \{z_{t+1,i}\}_{i\in[m]} can be obtained in O(B) time.

Therefore, we preprocess z_{1,i}=u_{i}^{\top}A_{1}u_{i}-b_{i} for all i\in[m] in O(mn^{2}) time. Then, in the t-th iteration (t>0), we first compute

\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})=\frac{m}{B}\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda z_{t,i})

in O(Bn^{2}) time. Next, we compute \|\nabla\Phi_{\lambda}(A_{t})\|_{F} using z_{t,i} in O(m) time. A_{t+1} can be obtained in O(n^{2}) time. Finally, we use O(B) time to update \{z_{t+1,i}\}_{i\in[m]}.

Hence, the total running time per iteration is

O(Bn^{2}+m+n^{2}+B)=O(Bn^{2}). ∎

Algorithm 2 Matrix Sensing by Stochastic Gradient Descent.
1:procedure SGD(\{u_{i},b_{i}\}_{i\in[m]}) \triangleright Theorem 6.1
2:     \tau\leftarrow\max_{i\in[m]}b_{i}
3:     A_{1}\leftarrow\tau\cdot I
4:     z_{i}\leftarrow u_{i}^{\top}A_{1}u_{i}-b_{i} for i\in[m]
5:     for t=1\to T do
6:         Sample {\cal B}_{t}\subset[m] of size B uniformly at random
7:         \nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})\leftarrow\frac{m}{B}\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda z_{i})
8:         \|\nabla\Phi_{\lambda}(A_{t})\|_{F}\leftarrow\left(\sum_{i=1}^{m}\lambda^{2}\sinh^{2}(\lambda z_{i})\right)^{1/2}
9:         A_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}
10:         for i\in{\cal B}_{t} do
11:               z_{i}\leftarrow z_{i}-\epsilon\lambda m\sinh(\lambda z_{i})/(\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B)
12:         end for
13:     end for
14:     return A_{T+1}
15:end procedure
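A minimal NumPy sketch of Algorithm 2 (again under the orthonormal-measurement assumption, with illustrative constants and variable names of our own choosing) makes the O(B) loss maintenance explicit.

import numpy as np

def matrix_sensing_sgd(U, b, delta=0.1, batch=4, max_iters=200000, seed=0):
    # U: n x m matrix of (assumed orthonormal) measurement vectors; b_i = u_i^T A_star u_i
    rng = np.random.default_rng(seed)
    n, m = U.shape
    lam = np.log(m + 2) / delta
    eps = 0.01 * batch / (lam * m)                   # eps = O(B / (lambda m))
    A = np.max(b) * np.eye(n)
    z = np.einsum('ij,jk,ki->i', U.T, A, U) - b      # O(m n^2) preprocessing of the losses
    for _ in range(max_iters):
        if np.max(np.abs(z)) <= delta:
            break
        S = rng.choice(m, size=batch, replace=False)
        grad_norm = lam * np.linalg.norm(np.sinh(lam * z))        # ||grad Phi_lambda(A_t)||_F from z, O(m)
        coeffs = (m / batch) * lam * np.sinh(lam * z[S])
        stoch_grad = (U[:, S] * coeffs) @ U[:, S].T               # O(B n^2)
        A -= eps * stoch_grad / grad_norm
        z[S] -= eps * coeffs / grad_norm                          # O(B) maintenance (Lines 10-12)
    return A

The per-iteration cost is dominated by forming the B rank-one terms, which matches the O(Bn^{2}) bound in Lemma 6.2.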

6.2 Analysis of One Iteration

Suppose A\in\mathbb{R}^{n\times n}. Let {\cal B}_{t} be a uniformly random B-subset of [m] at the t-th iteration, where B is a parameter.

We can compute the gradient of \Phi_{\lambda}(A,{\cal B}) with respect to A as follows:

\nabla\Phi_{\lambda}(A,{\cal B})=\frac{m}{|{\cal B}|}\sum_{i\in{\cal B}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i})),

where \nabla\Phi_{\lambda}(A,{\cal B})\in\mathbb{R}^{n\times n}.

We can also compute the Hessian of \Phi_{\lambda}(A,{\cal B}) with respect to A as follows:

\nabla^{2}\Phi_{\lambda}(A,{\cal B})=\frac{m}{|{\cal B}|}\sum_{i\in{\cal B}}(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\lambda^{2}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i})),

where \nabla^{2}\Phi_{\lambda}(A,{\cal B})\in\mathbb{R}^{n^{2}\times n^{2}} and \otimes is the Kronecker product.

It is easy to see the expectations of the gradient and Hessian of \Phi_{\lambda}(A,{\cal B}) over a random set {\cal B}:

\operatorname*{{\mathbb{E}}}_{{\cal B}\sim[m]}[\nabla\Phi_{\lambda}(A,{\cal B})]=\nabla\Phi_{\lambda}(A),
\operatorname*{{\mathbb{E}}}_{{\cal B}\sim[m]}[\nabla^{2}\Phi_{\lambda}(A,{\cal B})]=\nabla^{2}\Phi_{\lambda}(A).
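For completeness, the first identity can be verified in one line (a short calculation we add here; it uses only the fact that a uniformly random B-subset contains each fixed index with probability B/m):

\operatorname*{{\mathbb{E}}}_{{\cal B}\sim[m]}[\nabla\Phi_{\lambda}(A,{\cal B})]
= \frac{m}{B}\sum_{i=1}^{m}\Pr[i\in{\cal B}]\cdot u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i}))
= \frac{m}{B}\cdot\frac{B}{m}\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh(\lambda(u_{i}^{\top}Au_{i}-b_{i}))
= \nabla\Phi_{\lambda}(A),

and the Hessian identity follows in exactly the same way.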
Lemma 6.3 (Expected progress on potential).

Given m vectors u_{1},u_{2},\cdots,u_{m}\in\mathbb{R}^{n}, assume \langle u_{i},u_{j}\rangle=0 for any i\neq j\in[m] and \|u_{i}\|^{2}=1 for all i\in[m]. Let \epsilon\lambda\leq 0.01\frac{|{\cal B}_{t}|}{m} for all t>0.

Then, we have

\operatorname*{{\mathbb{E}}}[\Phi_{\lambda}(A_{t+1})]\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}.
Proof.

We first express the expectation as follows:

\operatorname*{{\mathbb{E}}}_{A_{t+1}}[\Phi_{\lambda}(A_{t+1})]-\Phi_{\lambda}(A_{t})
\leq \operatorname*{{\mathbb{E}}}_{A_{t+1}}[\langle\nabla\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\rangle]
+ O(1)\cdot\operatorname*{{\mathbb{E}}}_{A_{t+1}}[\langle\nabla^{2}\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})\rangle], (10)

which follows from Corollary A.2. As in the proof of Lemma 5.3, we write \Delta_{1}:=-\langle\nabla\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\rangle and \Delta_{2}:=\langle\nabla^{2}\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})\rangle.

We choose

A_{t+1}=A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}.

Then, we can bound

\operatorname*{{\mathbb{E}}}_{A_{t+1}}[-\operatorname{tr}[\nabla\Phi_{\lambda}(A_{t})\cdot(A_{t+1}-A_{t})]]
= \operatorname*{{\mathbb{E}}}_{{\cal B}_{t}}\Big[\operatorname{tr}\Big[\nabla\Phi_{\lambda}(A_{t})\cdot\frac{\epsilon\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}}\Big]\Big]
= \epsilon\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (11)

We define, for t>0 and i\in[m],

z_{t,i}:=u_{i}^{\top}A_{t}u_{i}-b_{i}.

We need to compute \Delta_{2}. For simplicity, we consider \Delta_{2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}:

\Delta_{2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
= \operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})]\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
= (\lambda\epsilon)^{2}\cdot(\frac{m}{|{\cal B}_{t}|})^{2}\cdot\operatorname{tr}\Big[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\sinh(\lambda z_{t,i}))\otimes(\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\sinh(\lambda z_{t,i}))\Big]. (12)

Ignoring the scalar factor in the above equation, we have

\operatorname{tr}\Big[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i,j\in{\cal B}_{t}}\sinh(\lambda z_{t,i})\sinh(\lambda z_{t,j})\cdot(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))\Big]
= \operatorname{tr}\Big[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\in{\cal B}_{t}}\sinh^{2}(\lambda z_{t,i})(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top}))\Big]
+ \operatorname{tr}\Big[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\neq j\in{\cal B}_{t}}\sinh(\lambda z_{t,i})\sinh(\lambda z_{t,j})\cdot(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))\Big]
=: \widetilde{Q}_{1}+\widetilde{Q}_{2}, (13)

where the first expression follows from expanding the Kronecker product of the two sums and extracting the scalar factors, the next step comes from splitting the sum into two parts based on whether i=j, and the last step is the definition of \widetilde{Q}_{1} and \widetilde{Q}_{2}, where \widetilde{Q}_{1} denotes the diagonal term and \widetilde{Q}_{2} denotes the off-diagonal term. Taking expectation, we have

\operatorname*{{\mathbb{E}}}[\Delta_{2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}]
= (\lambda\epsilon)^{2}\cdot(\frac{m}{|{\cal B}_{t}|})^{2}\operatorname*{{\mathbb{E}}}[\widetilde{Q}_{1}]
= (\lambda\epsilon)^{2}\cdot(\frac{m}{|{\cal B}_{t}|})^{2}\cdot\frac{|{\cal B}_{t}|}{m}\cdot Q_{1}
\leq (\lambda\epsilon)^{2}\cdot\frac{m}{|{\cal B}_{t}|}\cdot(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}, (14)

where the first step comes from extracting the constant terms from the expectation and Claim 5.5 (which shows the off-diagonal contribution vanishes), the second step follows from \operatorname*{{\mathbb{E}}}[\widetilde{Q}_{1}]=\frac{|{\cal B}_{t}|}{m}\cdot Q_{1}, and the third step comes from Claim 5.4. Therefore, we have:

\operatorname*{{\mathbb{E}}}[\Phi_{\lambda}(A_{t+1})]-\Phi_{\lambda}(A_{t})
\leq -\operatorname*{{\mathbb{E}}}[\Delta_{1}]+O(1)\cdot\operatorname*{{\mathbb{E}}}[\Delta_{2}]
\leq -\epsilon(1-O(\epsilon\lambda)\cdot\frac{m}{|{\cal B}_{t}|})\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(\epsilon\lambda)^{2}\sqrt{m}
\leq -0.9\epsilon\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(\epsilon\lambda)^{2}\sqrt{m}
\leq -0.9\epsilon\lambda\frac{1}{\sqrt{m}}(\Phi_{\lambda}(A_{t})-m)+O(\epsilon\lambda)^{2}\sqrt{m}
\leq -0.9\epsilon\lambda\frac{1}{\sqrt{m}}\Phi_{\lambda}(A_{t})+\epsilon\lambda\sqrt{m},

where the first step comes from Eq. (10), the second step comes from Eq. (11) and Eq. (14), the third step follows from \epsilon\leq 0.01\frac{|{\cal B}_{t}|}{\lambda m}, the fourth step follows from Eq. (B.1), and the last step follows from \epsilon\lambda\in(0,0.01). ∎

6.3 Convergence for multiple iterations

The goal of this section is to prove the convergence of Algorithm 2.

Lemma 6.4 (Convergence of stochastic gradient descent).

Suppose the measurement vectors \{u_{i}\}_{i\in[m]} are orthogonal unit vectors, and suppose |b_{i}| is bounded by R for all i\in[m]. Then, for any \delta\in(0,1), if we take \lambda=\Omega(\delta^{-1}\log m) and \epsilon=O(\lambda^{-1}m^{-1}B) in Algorithm 2, then for

T=\widetilde{\Omega}(m^{3/2}B^{-1}R\delta^{-1})

iterations, with high probability, the solution matrix A_{T+1} satisfies:

|u_{i}^{\top}A_{T+1}u_{i}-b_{i}|\leq\delta\quad\forall i\in[m].
Proof.

Similar to the proof of Lemma 5.6, we can bound the initial potential by:

\Phi(A_{1})\leq 2^{O(\lambda R)}.

In the following iterations, by Lemma 6.3, we have

\operatorname*{{\mathbb{E}}}[\Phi_{\lambda}(A_{t+1})]\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m},

as long as \epsilon\leq 0.01\frac{|{\cal B}_{t}|}{\lambda m}, where {\cal B}_{t} is a uniformly random subset of [m] of size B.

It suffices to take \epsilon=O(\lambda^{-1}m^{-1}B).

Now, we can apply Lemma 6.3 for T times and get that

\operatorname*{{\mathbb{E}}}[\Phi(A_{T+1})]\leq 2^{-\Omega(T\epsilon\lambda/\sqrt{m})+O(\lambda R)}+2m.

By taking

T=\widetilde{\Omega}(m^{3/2}B^{-1}R\delta^{-1}),

we have

\Phi(A_{T+1})\leq O(m)

holds with high probability. By the same argument as in the proof of Lemma 5.6, we have

|u_{i}^{\top}A_{T+1}u_{i}-b_{i}|\leq\delta\quad\forall i\in[m].

The lemma is thus proved. ∎

References

  • Aar [07] Scott Aaronson. The learnability of quantum states. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3089–3114, 2007.
  • BLSS [20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 775–788, 2020.
  • Bra [20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.
  • CLMW [11] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3), 2011.
  • CLS [19] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, page 938–942, New York, NY, USA, 2019. Association for Computing Machinery.
  • CZ [15] T Tony Cai and Anru Zhang. Rop: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138, 2015.
  • DLS [23] Yichuan Deng, Zhihang Li, and Zhao Song. An improved sample complexity for rank-1 matrix sensing. arXiv preprint arXiv:2303.06895, 2023.
  • DLY [21] Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: A multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1784–1797, 2021.
  • FGLE [12] Steven T Flammia, David Gross, Yi-Kai Liu, and Jens Eisert. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012.
  • GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
  • HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In FOCS, 2022.
  • HRVW [96] Christoph Helmberg, Franz Rendl, Robert J Vanderbei, and Henry Wolkowicz. An interior-point method for semidefinite programming. SIAM Journal on optimization, 6(2):342–361, 1996.
  • HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
  • JKL+ [20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 2020 IEEE 61st annual symposium on foundations of computer science (FOCS), pages 910–918. IEEE, 2020.
  • JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944–953, 2020.
  • JM [13] Adel Javanmard and Andrea Montanari. Localization from incomplete noisy distance measurements. Found. Comput. Math., 13(3):297–345, jun 2013.
  • JMD [10] Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. Advances in Neural Information Processing Systems, 23, 2010.
  • JN [08] Anatoli Juditsky and Arkadii S Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
  • KKD [15] Amir Kalev, Robert L Kosut, and Ivan H Deutsch. Quantum tomography protocols with positivity are compressed sensing protocols. npj Quantum Information, 1(1):1–6, 2015.
  • KRT [17] Richard Kueng, Holger Rauhut, and Ulrich Terstiege. Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116, 2017.
  • LB [09] Kiryung Lee and Yoram Bresler. Guaranteed minimum rank approximation from linear observations by nuclear norm minimization with an ellipsoidal constraint. arXiv preprint arXiv:0903.4742, 2009.
  • Liu [11] Yi-Kai Liu. Universal low-rank matrix recovery from pauli measurements. Advances in Neural Information Processing Systems, 24, 2011.
  • LMCC [19] Yuanxin Li, Cong Ma, Yuxin Chen, and Yuejie Chi. Nonconvex matrix factorization from rank-one measurements. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1496–1505. PMLR, 2019.
  • LS [14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in o (vrank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 424–433. IEEE, 2014.
  • LS [15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 230–249. IEEE, 2015.
  • LSW [15] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049–1065. IEEE, 2015.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019.
  • LV [10] Zhang Liu and Lieven Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2010.
  • NN [94] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
  • RFP [10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
  • SYZ [21] Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems, 34, 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • TBS+ [16] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, pages 964–973. PMLR, 2016.
  • WSB [11] Andrew Waters, Aswin Sankaranarayanan, and Richard Baraniuk. Sparcs: Recovering low-rank and sparse matrices from compressive measurements. Advances in neural information processing systems, 24, 2011.
  • WZG [17] Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, pages 981–990. PMLR, 2017.
  • ZJD [15] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Efficient matrix sensing using rank-1 gaussian measurements. In International conference on algorithmic learning theory, pages 3–18. Springer, 2015.

Appendix

Roadmap.

We first provide the proofs for matrix hyperbolic functions and properties of \sinh and \cosh in Appendix A. Then we provide the proofs for the gradient descent and stochastic gradient descent convergence analysis in Appendix B. We consider the spectral potential function with a ground-truth oracle in Appendix C. We analyze gradient descent with non-orthogonal measurements in Appendix D. We provide the cost-per-iteration analysis for stochastic gradient descent under non-orthogonal measurements in Appendix E.

Appendix A Proofs of Preliminary Lemmas

A.1 Calculus tools

We state a useful calculus tool from prior work.

Lemma A.1 (Proposition 3.1 in [18]).

Let \Delta be an open interval on the real axis, and let f be a C^{2} function on \Delta such that for certain \theta_{\pm},\mu_{\pm}\in\mathbb{R} one has

\forall(a<b,~a,b\in\Delta):
\theta_{-}\cdot\frac{f^{\prime\prime}(a)+f^{\prime\prime}(b)}{2}+\mu_{-}\leq\frac{f^{\prime}(b)-f^{\prime}(a)}{b-a},
\frac{f^{\prime}(b)-f^{\prime}(a)}{b-a}\leq\theta_{+}\cdot\frac{f^{\prime\prime}(a)+f^{\prime\prime}(b)}{2}+\mu_{+},

where f^{\prime} and f^{\prime\prime} denote the first- and second-order derivatives of f, respectively.

Let, further, {\cal X}_{n}(\Delta) be the set of all n\times n symmetric matrices with eigenvalues belonging to \Delta. Then {\cal X}_{n}(\Delta) is an open convex set in the space S^{n} of n\times n symmetric matrices, the function

F(X)=\operatorname{tr}[f(X)]:{\cal X}_{n}(\Delta)\rightarrow\mathbb{R}

is C^{2}, and for every X\in{\cal X}_{n}(\Delta) and every H\in S^{n} one has

\theta_{-}\cdot\operatorname{tr}[Hf^{\prime\prime}(X)H]+\mu_{-}\cdot\operatorname{tr}[H^{2}]\leq D^{2}F(X)[H,H],
D^{2}F(X)[H,H]\leq\theta_{+}\cdot\operatorname{tr}[Hf^{\prime\prime}(X)H]+\mu_{+}\cdot\operatorname{tr}[H^{2}],

where D denotes the directional derivative.

We will use the following corollary to compute the trace with a map f:\mathbb{R}\rightarrow\mathbb{R}.

Corollary A.2.

Let f:\mathbb{R}\rightarrow\mathbb{R} be a C^{2} function. Let A and B be two symmetric matrices. We have

\operatorname{tr}[f(A)]\leq \operatorname{tr}[f(B)]+\operatorname{tr}[f^{\prime}(B)(A-B)]+O(1)\cdot\operatorname{tr}[f^{\prime\prime}(B)(A-B)^{2}].

A.2 Kronecker product

Suppose we have two matrices A\in\mathbb{R}^{m\times n} and B\in\mathbb{R}^{p\times q}; we use A\otimes B to denote the Kronecker product:

A\otimes B=\left[\begin{array}{ccc}A_{1,1}B&\cdots&A_{1,n}B\\ \vdots&\ddots&\vdots\\ A_{m,1}B&\cdots&A_{m,n}B\end{array}\right].

We state the following fact and provide its proof below.

Fact A.3.

Suppose we have matrices A\in\mathbb{R}^{m\times n}, C\in\mathbb{R}^{n\times k}, and matrices B, D such that the product BD is well defined. Then we have

(A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD).
Proof.

From the definition of the Kronecker product we have:

(A\otimes B)\cdot(C\otimes D)
= \left[\begin{array}{ccc}A_{1,1}B&\ldots&A_{1,n}B\\ \vdots&\ddots&\vdots\\ A_{m,1}B&\ldots&A_{m,n}B\end{array}\right]\left[\begin{array}{ccc}C_{1,1}D&\ldots&C_{1,k}D\\ \vdots&\ddots&\vdots\\ C_{n,1}D&\ldots&C_{n,k}D\end{array}\right]
= \left[\begin{array}{ccc}(\sum_{i=1}^{n}A_{1,i}C_{i,1})BD&\cdots&(\sum_{i=1}^{n}A_{1,i}C_{i,k})BD\\ \vdots&\ddots&\vdots\\ (\sum_{i=1}^{n}A_{m,i}C_{i,1})BD&\cdots&(\sum_{i=1}^{n}A_{m,i}C_{i,k})BD\end{array}\right]
= \left[\begin{array}{ccc}(AC)_{1,1}BD&\cdots&(AC)_{1,k}BD\\ \vdots&\ddots&\vdots\\ (AC)_{m,1}BD&\cdots&(AC)_{m,k}BD\end{array}\right]
= (AC)\otimes(BD).

Thus we complete the proof. ∎
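A quick numerical check of Fact A.3 (illustrative only, with randomly chosen dimensions that make the products AC and BD well defined):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4)); C = rng.standard_normal((4, 2))   # AC is defined
B = rng.standard_normal((5, 2)); D = rng.standard_normal((2, 3))   # BD is defined
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))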

A.3 Proof of cosh(A)\cosh(A) upper bound

Lemma A.4 (Restatement of Lemma 3.2).

Let A be a real symmetric matrix. Then we have

\|\cosh(A)\|=\cosh(\|A\|)\leq\operatorname{tr}[\cosh(A)].

We also have \|A\|\leq 1+\log(\operatorname{tr}[\cosh(A)]).

Proof.

Note that each eigenvalue \lambda of A corresponds to the eigenvalue \cosh(\lambda) of \cosh(A); since \cosh is even and increasing on [0,\infty), the largest eigenvalue of \cosh(A) equals \cosh(\|A\|), which gives the first equality. The second inequality follows from the fact that \cosh(A) is positive semidefinite, so its largest eigenvalue is at most its trace.

For the second part, we know that \exp(x)/2\leq\cosh(x), hence \exp(\|A\|)/2\leq\cosh(\|A\|), and

\|A\|= \log(\exp(\|A\|))
\leq \log(2\cosh(\|A\|))
\leq 1+\log(\operatorname{tr}[\cosh(A)]),

where the second step is by the monotonicity of \log(\cdot) and \exp(\|A\|)\leq 2\cosh(\|A\|), and the last step is by \cosh(\|A\|)\leq\operatorname{tr}[\cosh(A)]. ∎

We state a fact as follows:

Fact A.5.

For any real number x, \cosh^{2}(x)-\sinh^{2}(x)=1.

From the definitions of \cosh(x) and \sinh(x) we have:

\cosh^{2}(x)-\sinh^{2}(x)= \frac{1}{4}(e^{2x}+2+e^{-2x})-\frac{1}{4}(e^{2x}-2+e^{-2x})
= 1.

We also have the following lemma for matrices.

Lemma A.6.

Let A be a real symmetric matrix. Then we have

\cosh^{2}(A)-\sinh^{2}(A)=I.
Proof.

Since A is real symmetric, we write it in the eigendecomposition form A=U\Lambda U^{\top}. Then

\cosh^{2}(A)-\sinh^{2}(A)
= U\cosh^{2}(\Lambda)U^{\top}-U\sinh^{2}(\Lambda)U^{\top}
= U(\cosh^{2}(\Lambda)-\sinh^{2}(\Lambda))U^{\top}
= UU^{\top}
= I,

where the first step follows from Definition 3.1 (applying \cosh and \sinh through the eigendecomposition), the second step groups terms, and the third step applies the identity \cosh^{2}(x)-\sinh^{2}(x)=1 entrywise to the diagonal of \Lambda. ∎
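The matrix identity is also easy to check numerically (an illustrative sketch using SciPy's matrix hyperbolic functions):

import numpy as np
from scipy.linalg import coshm, sinhm

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                                   # a random real symmetric matrix
assert np.allclose(coshm(A) @ coshm(A) - sinhm(A) @ sinhm(A), np.eye(5))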

Appendix B Proofs of GD and SGD convergence

In this section, we provide proofs of the convergence analysis of the gradient descent and stochastic gradient descent matrix sensing algorithms.

B.1 Proof of GD Progress on Potential Function

We start with the progress of gradient descent on the potential function in the lemma below.

Lemma B.1 (Restatement of Lemma 5.3).

Assume that u_{i}^{\top}u_{j}=0 for any i\neq j\in[m] and \|u_{i}\|^{2}=1 for all i\in[m]. Let c\in(0,1) denote a sufficiently small positive constant. Then, for any \epsilon,\lambda>0 such that \epsilon\lambda\leq c,

we have for any t>0,

\Phi_{\lambda}(A_{t+1})\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}
Proof.

We first Taylor expand Φλ(At+1)\Phi_{\lambda}(A_{t+1}) as follows:

Φλ(At+1)Φλ(At)\displaystyle~{}\Phi_{\lambda}(A_{t+1})-\Phi_{\lambda}(A_{t})
\displaystyle\leq Φλ(At),(At+1At)+O(1)2Φλ(At),(At+1At)(At+1At)\displaystyle~{}\langle\nabla\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\rangle+O(1)\langle\nabla^{2}\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})\rangle
:=\displaystyle:= Δ1+O(1)Δ2,\displaystyle~{}\Delta_{1}+O(1)\cdot\Delta_{2}, (15)

which follows from Lemma A.1.

We choose

At+1=AtϵΦλ(At)/Φλ(At)F.\displaystyle A_{t+1}=A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}.

We can bound

Δ1=\displaystyle\Delta_{1}= tr[Φλ(At)(At+1At)]\displaystyle~{}\operatorname{tr}[\nabla\Phi_{\lambda}(A_{t})(A_{t+1}-A_{t})]
=\displaystyle= ϵΦλ(At)F.\displaystyle~{}-\epsilon\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (16)

Next, we upper-bound Δ2\Delta_{2}. Define

zt,i:=λ(uiAtuibi).\displaystyle z_{t,i}:=\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}).

and consider Δ2(λϵ)2Φλ(At)F2\Delta_{2}\cdot(\lambda\epsilon)^{-2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}, which can be expressed as:

Δ2(λϵ)2Φλ(At)F2\displaystyle\Delta_{2}\cdot(\lambda\epsilon)^{-2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= (λϵ)2tr[2Φλ(At)(At+1At)(At+1At)]Φλ(At)F2\displaystyle~{}(\lambda\epsilon)^{-2}\operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})]\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= tr[2Φλ(At)(i=1muiuisinh(zt,i))(i=1muiuisinh(zt,i))]\displaystyle~{}\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(z_{t,i}))\otimes(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(z_{t,i}))\Big{]}
=\displaystyle= \operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i,j}\sinh(z_{t,i})\sinh(z_{t,j})\cdot(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top})\Big{]}
=\displaystyle= \operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})\cdot(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top})\Big{]}
+\displaystyle+ tr[2Φλ(At)(ijsinh(zt,i)sinh(zt,j)(uiuiujuj))]\displaystyle~{}\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\neq j}\sinh(z_{t,i})\sinh(z_{t,j})(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))\Big{]}
=:\displaystyle=: Q1+Q2,\displaystyle~{}Q_{1}+Q_{2}, (17)

where

\displaystyle Q_{1}:=\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})\cdot(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top})\Big{]} (18)

denotes the diagonal term, and

Q2:=\displaystyle Q_{2}:= tr[2Φλ(At)(ijsinh(zt,i)sinh(zt,j)(uiuiujuj))]\displaystyle~{}\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\neq j}\sinh(z_{t,i})\sinh(z_{t,j})(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))\Big{]} (19)

denotes the off-diagonal term. The first step comes from the definition of \Delta_{2}, the second step follows from replacing A_{t+1}-A_{t} using Eq. (4), the third step follows from extracting the scalar factors out of the Kronecker products, the fourth step comes from splitting into two partitions based on whether i=j, and the fifth step comes from the definitions of Q_{1} and Q_{2}.

Thus,

Δ2=\displaystyle\Delta_{2}= (ϵλ)2(Q1+Q2)/Φλ(At)F2\displaystyle~{}(\epsilon\lambda)^{2}(Q_{1}+Q_{2})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= (ϵλ)2(Q1+0)/Φλ(At)F2\displaystyle~{}(\epsilon\lambda)^{2}(Q_{1}+0)/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= (ϵλ)2(m+1λΦλ(At)F).\displaystyle~{}(\epsilon\lambda)^{2}\cdot(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}). (20)

where the second step follows from Claim 5.5, and the third step follows from Claim 5.4.

Hence, we have

Φλ(At+1)Φλ(At)\displaystyle~{}\Phi_{\lambda}(A_{t+1})-\Phi_{\lambda}(A_{t})
\displaystyle\leq Δ1+O(1)Δ2\displaystyle~{}\Delta_{1}+O(1)\cdot\Delta_{2}
\displaystyle\leq ϵΦλ(At)F+O(1)(ϵλ)2(m+1λΦλ(At)F)\displaystyle~{}-\epsilon\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(1)(\epsilon\lambda)^{2}(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})
\displaystyle\leq -0.9\epsilon\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(\epsilon\lambda)^{2}\sqrt{m}

where the first step follows from Eq. (15), the second step follows from Eq. (16) and Eq. (20), and the third step follows from \epsilon\lambda\in(0,0.01).

For \|\nabla\Phi_{\lambda}(A_{t})\|_{F}, we have

1λ2Φλ(At)F2\displaystyle~{}\frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= tr[(i=1muiuisinh(λ(uiAtuibi)))2]\displaystyle~{}\operatorname{tr}[(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{2}]
=\displaystyle= tr[i=1m(uiui)2sinh2(λ(uiAtuibi))]\displaystyle~{}\operatorname{tr}[\sum_{i=1}^{m}(u_{i}u_{i}^{\top})^{2}\sinh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))]
=\displaystyle= i=1msinh2(λ(uiAtuibi))\displaystyle~{}\sum_{i=1}^{m}\sinh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))
\displaystyle\geq 1m(i=1mcosh(λ(uiAtuibi))m)2\displaystyle~{}\frac{1}{m}(\sum_{i=1}^{m}\cosh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))-m)^{2}
=\displaystyle= 1m(Φλ(At)m)2,\displaystyle~{}\frac{1}{m}(\Phi_{\lambda}(A_{t})-m)^{2}, (21)

where the first step comes from Eq. (5.2), the second step follows from u_{i}^{\top}u_{j}=0 for i\neq j, the third step follows from \|u_{i}\|_{2}=1, the fourth step follows from Part 2 in Lemma 3.3, and the fifth step follows from the definition of \Phi_{\lambda}(A).

Thus, we get that

\displaystyle\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\geq~{}\lambda\cdot\frac{1}{\sqrt{m}}|\Phi_{\lambda}(A_{t})-m|, (22)

It implies that

Φλ(At+1)Φλ(At)\displaystyle\Phi_{\lambda}(A_{t+1})-\Phi_{\lambda}(A_{t})
\displaystyle\leq 0.9ϵλ1m|Φλ(At)m|+O(ϵλ)2m\displaystyle~{}-0.9\epsilon\lambda\frac{1}{\sqrt{m}}|\Phi_{\lambda}(A_{t})-m|+O(\epsilon\lambda)^{2}\sqrt{m}
\displaystyle\leq 0.9ϵλ1m|Φλ(At)m|+0.1ϵλm,\displaystyle~{}-0.9\epsilon\lambda\frac{1}{\sqrt{m}}|\Phi_{\lambda}(A_{t})-m|+0.1\epsilon\lambda\sqrt{m},

where the first step follows from Eq. (22), and the second step follows from \epsilon\lambda\leq c for a sufficiently small constant c, so that O(\epsilon\lambda)^{2}\sqrt{m}\leq 0.1\epsilon\lambda\sqrt{m}.

Then, when Φ(At)>m\Phi(A_{t})>m, we have

Φλ(At+1)(10.9λϵm)Φλ(At)+λϵm.\displaystyle\Phi_{\lambda}(A_{t+1})\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}.

When Φ(At)m\Phi(A_{t})\leq m, we have

Φλ(At+1)(1+0.9λϵm)Φλ(At)0.8λϵm.\displaystyle\Phi_{\lambda}(A_{t+1})\leq(1+0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})-0.8\lambda\epsilon\sqrt{m}.

Since \Phi_{\lambda}(A_{t})\leq m in the second case, the right-hand side is again at most (1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}, so the claimed bound holds in both cases. The lemma is then proved. ∎

B.2 Proof of GD Convergence

In this section, we provide proofs of convergence analysis of gradient descent matrix sensing algorithm.

Lemma B.2 (Restatement of Lemma 5.6).

Suppose the measurement vectors {ui}i[m]\{u_{i}\}_{i\in[m]} are orthogonal unit vectors, and suppose |bi||b_{i}| is bounded by RR for i[m]i\in[m]. Then, for any δ(0,1)\delta\in(0,1), if we take λ=Ω(δ1logm)\lambda=\Omega(\delta^{-1}\log m) and ϵ=O(λ1)\epsilon=O(\lambda^{-1}) in Algorithm 1, then for T=Ω~(mRδ1)T=\widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations, the solution matrix ATA_{T} satisfies:

|uiATuibi|δi[m].\displaystyle|u_{i}^{\top}A_{T}u_{i}-b_{i}|\leq\delta~{}~{}~{}\forall i\in[m].
Proof.

Let τ=maxi[m]bi\tau=\max_{i\in[m]}b_{i}. At the beginning, we choose the initial solution A1:=τInA_{1}:=\tau I_{n} where Inn×nI_{n}\in\mathbb{R}^{n\times n} is the identity matrix, and we have

Φ(A1)=\displaystyle\Phi(A_{1})= i=1mcosh(λ(τbi))\displaystyle~{}\sum_{i=1}^{m}\cosh(\lambda\cdot(\tau-b_{i}))
\displaystyle\leq eλτi=1meλbi2O(λR),\displaystyle~{}e^{\lambda\tau}\sum_{i=1}^{m}e^{-\lambda b_{i}}\leq 2^{O(\lambda R)},

where the last step follows from |bi|R|b_{i}|\leq R for all i[m]i\in[m].

After TT iterations, we have

Φ(AT+1)\displaystyle\Phi(A_{T+1})\leq (1ϵλm)TΦ(A1)+2m\displaystyle~{}(1-\frac{\epsilon\lambda}{\sqrt{m}})^{T}\Phi(A_{1})+2m
\displaystyle\leq (1ϵλm)T2O(λR)+2m\displaystyle~{}(1-\frac{\epsilon\lambda}{\sqrt{m}})^{T}\cdot 2^{O(\lambda R)}+2m
\displaystyle\leq 2Ω(Tϵλ/m)+O(λR)+2m\displaystyle~{}2^{-\Omega(T\epsilon\lambda/\sqrt{m})+O(\lambda R)}+2m

where the first step follows from applying Lemma 5.3 for TT times, and i=1T(1ϵλ/m)i1ϵλm2m\sum_{i=1}^{T}(1-\epsilon\lambda/\sqrt{m})^{i-1}\epsilon\lambda\sqrt{m}\leq 2m.

As long as T=Ω(Rm/ϵ)=Ω(Rmλ)T=\Omega(R\sqrt{m}/\epsilon)=\Omega(R\sqrt{m}\lambda), then we have

Φ(AT+1)O(m).\displaystyle\Phi(A_{T+1})\leq O(m).

This implies that for any i[m]i\in[m],

|uiAT+1uibi|\displaystyle|u_{i}^{\top}A_{T+1}u_{i}-b_{i}|\leq λ1cosh1(O(m))\displaystyle~{}\lambda^{-1}\cdot\cosh^{-1}(O(m))
=\displaystyle= λ1O(logm)\displaystyle~{}\lambda^{-1}\cdot O(\log m)
=\displaystyle= δ,\displaystyle~{}\delta,

where the last step follows from taking \lambda=\Omega(\delta^{-1}\log m).

Therefore, with T=Ω~(mRδ1)T=\widetilde{\Omega}(\sqrt{m}R\delta^{-1}) iterations, Algorithm 1 can achieve that

|uiAT+1uibi|δi[m].\displaystyle|u_{i}^{\top}A_{T+1}u_{i}-b_{i}|\leq\delta~{}~{}~{}\forall i\in[m]. (23)

The lemma is then proved. ∎
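To make the guarantees above concrete, here is a minimal NumPy sketch of the gradient descent analyzed in Lemmas B.1 and B.2, run on a synthetic instance with orthonormal measurement vectors. The instance, the variable names, and the concrete constants inside \lambda=\Theta(\delta^{-1}\log m) and \epsilon=\Theta(\lambda^{-1}) are our own illustrative choices, not taken from the paper.

```python
import numpy as np

# Sketch of the gradient descent in Lemma B.2 (orthogonal unit measurements).
rng = np.random.default_rng(3)
n, m, delta = 20, 8, 0.1

U = np.linalg.qr(rng.standard_normal((n, m)))[0]   # columns u_i, orthonormal
A_star = 0.3 * rng.standard_normal((n, n))
A_star = (A_star + A_star.T) / 2                   # symmetric ground truth
b = np.array([u @ A_star @ u for u in U.T])        # b_i = u_i^T A_star u_i

lam = np.log(m) / delta                            # lambda = Theta(delta^{-1} log m)
eps = 0.05 / lam                                   # eps = Theta(lambda^{-1})
A = np.max(b) * np.eye(n)                          # A_1 = tau * I_n

for t in range(50000):
    z = np.array([u @ A @ u for u in U.T]) - b     # residuals u_i^T A u_i - b_i
    if np.max(np.abs(z)) <= delta:
        break
    grad = lam * (U * np.sinh(lam * z)) @ U.T      # sum_i lam*sinh(lam z_i) u_i u_i^T
    A -= eps * grad / np.linalg.norm(grad, 'fro')  # normalized gradient step

print("iterations:", t, "max residual:", np.max(np.abs(z)))
```

The normalized step A_{t+1}=A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F} is exactly the update used in the proof of Lemma B.1; the loop stops once every residual is at most \delta.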

Appendix C Spectral Potential function with ground-truth oracle

In this section, we consider matrix sensing with a spectral approximation guarantee; that is, we want to obtain a matrix A\in\mathbb{R}^{n\times n} that is a \delta-spectral approximation of the ground-truth matrix A_{\star}, i.e.,

(1δ)AA(1+δ)A.\displaystyle(1-\delta)A_{\star}\preceq A\preceq(1+\delta)A_{\star}.

To do this, instead of performing a series of quadratic measurements, we assume that we have access to an oracle {\cal O}_{A_{\star}} such that for any matrix A\in\mathbb{R}^{n\times n}, the oracle outputs the matrix A_{\star}^{-1/2}AA_{\star}^{-1/2}. Algorithm 3 implements a matrix sensing algorithm with a spectral approximation guarantee under the assumption of access to the oracle {\cal O}_{A_{\star}}.

We define the spectral loss function as follows:

Ψλ(A):=tr[cosh(λ(I(A)1/2A(A)1/2))].\displaystyle\Psi_{\lambda}(A):=\operatorname{tr}[\cosh(\lambda(I-(A_{\star})^{-1/2}A(A_{\star})^{-1/2}))].

We will show that Ψλ(A)\Psi_{\lambda}(A) can characterize the spectral approximation of AA with respect to AA_{\star}.

It is easy to see that if we can query an arbitrary AA to the ground-truth oracle 𝒪A{\cal O}_{A_{\star}}, then we can definitely recover AA_{\star} exactly by querying 𝒪A(I){\cal O}_{A_{\star}}(I). Instead, in Algorithm 3, we focus on the following process: the initial matrix A1A_{1} is given, and in the tt-th iteration, we first compute

Xt=λ(IA1/2AtA1/2)\displaystyle X_{t}=\lambda(I-A_{\star}^{-1/2}A_{t}A_{\star}^{-1/2})

and do an eigendecomposition of X_{t} to obtain \Lambda_{t} such that X_{t}=Q_{t}\Lambda_{t}Q_{t}^{\top}. Then we update the matrix A_{t+1} by:

At+1=At+ϵA1/2sinh(Xt)A1/2/sinh(Xt)F.\displaystyle A_{t+1}=A_{t}+\epsilon\cdot A_{\star}^{1/2}\sinh(X_{t})A_{\star}^{1/2}/\|\sinh(X_{t})\|_{F}.

We are interested in the number of iterations needed for A_{t} to become a \delta-spectral approximation of A_{\star}. We believe this example will provide some insight into this problem, and we leave the question of spectrally approximate matrix sensing without the ground-truth oracle to future work.

Algorithm 3 Matrix Sensing with Spectral Approximation.
1:procedure GradientDescent(𝒪A{\cal O}_{A_{\star}}, A1A_{1})
2:     for t=1Tt=1\to T do
3:         Xtλ(In𝒪A(At))X_{t}\leftarrow\lambda\cdot(I_{n}-{\cal O}_{A_{\star}}(A_{t}))
4:         QtΛtQtQ_{t}\Lambda_{t}Q_{t}^{\top}\leftarrow Eigendecomposition of XtX_{t}\triangleright It takes O(nω)O(n^{\omega})-time
5:         YtQtsinh(Λt)QtY_{t}\leftarrow Q_{t}\cdot\sinh(\Lambda_{t})\cdot Q_{t}^{\top}\triangleright Yt=sinh(Xt)Y_{t}=\sinh(X_{t}). It takes O(n2)O(n^{2})-time
6:         At+1At+ϵ𝒪A(Yt)/YtFA_{t+1}\leftarrow A_{t}+\epsilon\cdot{\cal O}_{A_{\star}}(Y_{t})/\|Y_{t}\|_{F}\triangleright It takes O(n2)O(n^{2})-time
7:     end for
8:     return AT+1A_{T+1}
9:end procedure
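Below is a short NumPy sketch in the spirit of Algorithm 3, where the oracle {\cal O}_{A_{\star}} is simulated from an explicit positive definite ground truth. It follows the update rule A_{t+1}=A_{t}+\epsilon\cdot A_{\star}^{1/2}\sinh(X_{t})A_{\star}^{1/2}/\|\sinh(X_{t})\|_{F} stated before the pseudocode; the instance and the constants \lambda, \epsilon, \delta are illustrative assumptions of ours.

```python
import numpy as np

# Sketch of the spectral-approximation procedure: the oracle
# O_{A_star}(A) = A_star^{-1/2} A A_star^{-1/2} is simulated explicitly.
rng = np.random.default_rng(4)
n, lam, eps, delta = 12, 20.0, 0.02, 0.05

G = rng.standard_normal((n, n))
A_star = G @ G.T + n * np.eye(n)                   # positive definite ground truth
w, V = np.linalg.eigh(A_star)
A_star_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
A_star_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
oracle = lambda A: A_star_inv_sqrt @ A @ A_star_inv_sqrt

A = np.trace(A_star) / n * np.eye(n)               # initial matrix A_1
for t in range(20000):
    M = oracle(A)                                  # A_star^{-1/2} A A_star^{-1/2}
    if np.max(np.abs(np.linalg.eigvalsh(M) - 1)) <= delta:
        break                                      # delta-spectral approximation reached
    mu, Q = np.linalg.eigh(lam * (np.eye(n) - M))  # X_t and its eigendecomposition
    Y = Q @ np.diag(np.sinh(mu)) @ Q.T             # Y_t = sinh(X_t)
    A = A + eps * A_star_sqrt @ Y @ A_star_sqrt / np.linalg.norm(Y, 'fro')

print("iterations:", t)
```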
Lemma C.1 (Progress on the spectral potential function).

Let c(0,1)c\in(0,1) denote a sufficiently small positive constant. We define XtX_{t} as follows:

Xt:=λ(I(A)1/2At(A)1/2)\displaystyle X_{t}:=\lambda(I-(A_{\star})^{-1/2}A_{t}(A_{\star})^{-1/2})

Let

At+1=At+ϵλ(A)1/2sinh(Xt)(A)1/2/λsinh(Xt)F.\displaystyle A_{t+1}=A_{t}+\epsilon\cdot\lambda(A_{\star})^{1/2}\sinh(X_{t})(A_{\star})^{1/2}/\|\lambda\cdot\sinh(X_{t})\|_{F}.

For any \epsilon\in(0,1) and \lambda\geq 1 such that \lambda\epsilon\leq c, we have for any t>0,

Ψλ(At+1)(10.9ϵλ/n)Ψλ(At)+ϵλn.\displaystyle\Psi_{\lambda}(A_{t+1})\leq(1-0.9\epsilon\lambda/\sqrt{n})\Psi_{\lambda}(A_{t})+\epsilon\lambda\sqrt{n}.
Proof.

We can compute

Ψλ(At+1)Ψλ(At)\displaystyle~{}\Psi_{\lambda}(A_{t+1})-\Psi_{\lambda}(A_{t})
=\displaystyle= tr[cosh(Xt+1)]tr[cosh(Xt)]\displaystyle~{}\operatorname{tr}[\cosh(X_{t+1})]-\operatorname{tr}[\cosh(X_{t})]
\displaystyle\leq λtr[sinh(Xt)((A)1/2(At+1At)(A)1/2)]\displaystyle~{}-\lambda\cdot\operatorname{tr}[\sinh(X_{t})\cdot((A_{\star})^{-1/2}(A_{t+1}-A_{t})(A_{\star})^{-1/2})]
+\displaystyle+ O(1)λ2tr[cosh(Xt)((A)1/2(AtAt+1)(A)1/2)2]\displaystyle~{}O(1)\cdot\lambda^{2}\cdot\operatorname{tr}[\cosh(X_{t})\cdot((A_{\star})^{-1/2}(A_{t}-A_{t+1})(A_{\star})^{-1/2})^{2}]
=\displaystyle= Δ1+O(1)Δ2,\displaystyle~{}-\Delta_{1}+O(1)\cdot\Delta_{2}, (24)

the first step is by expanding by definition, the second step is by Taylor expanding the first term at the point I(A)1/2At(A)1/2I-(A_{\star})^{-1/2}A_{t}(A_{\star})^{-1/2} (via Lemma A.1), and the last step is by definition of Δ1\Delta_{1} and Δ2\Delta_{2}.

To further simplify proofs, we define

Ψλ(At):=\displaystyle\nabla\Psi_{\lambda}(A_{t}):= λ(A)1/2sinh(Xt)(A)1/2\displaystyle~{}\lambda\cdot(A_{\star})^{1/2}\sinh(X_{t})(A_{\star})^{1/2}
~Ψλ(At):=\displaystyle\widetilde{\nabla}\Psi_{\lambda}(A_{t}):= λsinh(Xt)\displaystyle~{}\lambda\cdot\sinh(X_{t})
Δ~Ψλ(At):=\displaystyle\widetilde{\Delta}\Psi_{\lambda}(A_{t}):= λcosh(Xt)\displaystyle~{}\lambda\cdot\cosh(X_{t})

To maximize the gradient progress, we should choose

At+1=At+ϵΨλ(At)/~Ψλ(At)F\displaystyle A_{t+1}=A_{t}+\epsilon\cdot\nabla\Psi_{\lambda}(A_{t})/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}

Then

Δ1=\displaystyle\Delta_{1}= (ϵλ2)tr[sinh2(Xt)]/~Ψλ(At)F\displaystyle~{}(\epsilon\lambda^{2})\cdot\operatorname{tr}[\sinh^{2}(X_{t})]/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}
=\displaystyle= ϵ~Ψλ(At)F2/~Ψλ(At)F\displaystyle~{}\epsilon\cdot\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}^{2}/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}
=\displaystyle= ϵ~Ψλ(At)F\displaystyle~{}\epsilon\cdot\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F} (25)

and

\displaystyle\Delta_{2}= \epsilon^{2}\lambda^{4}\cdot\operatorname{tr}[\cosh(X_{t})\cdot\sinh^{2}(X_{t})]/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= ϵ2λtr[Δ~Ψλ(At)~Ψλ(At)2]/~Ψλ(At)F2\displaystyle~{}\epsilon^{2}\lambda\cdot\operatorname{tr}[\widetilde{\Delta}\Psi_{\lambda}(A_{t})\cdot\widetilde{\nabla}\Psi_{\lambda}(A_{t})^{2}]/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}^{2}
\displaystyle\leq ϵ2λΔ~Ψλ(At)F~Ψλ(At)2F/~Ψλ(At)F2\displaystyle~{}\epsilon^{2}\lambda\cdot\|\widetilde{\Delta}\Psi_{\lambda}(A_{t})\|_{F}\cdot\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})^{2}\|_{F}/\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}^{2}
\displaystyle\leq ϵ2λΔ~Ψλ(At)F\displaystyle~{}\epsilon^{2}\lambda\cdot\|\widetilde{\Delta}\Psi_{\lambda}(A_{t})\|_{F}
\displaystyle\leq ϵ2λ(λn+~Ψλ(At)F)\displaystyle~{}\epsilon^{2}\lambda\cdot(\lambda\sqrt{n}+\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}) (26)

where the first step follows from the definition of \Delta_{2}, the second step comes from the definitions of \widetilde{\Delta}\Psi_{\lambda}(A_{t}) and \widetilde{\nabla}\Psi_{\lambda}(A_{t}), the third step follows from \|AB\|_{F}\leq\|A\|_{F}\|B\|_{F}, the fourth step follows from \|x\|_{4}^{2}\leq\|x\|_{2}^{2}, and the fifth step follows from Part 1 of Lemma 3.4.

Now, we need to lower bound \|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}. We have

~Ψλ(At)F=\displaystyle\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}= (tr[λ2sinh2(Xt)])1/2\displaystyle~{}(\operatorname{tr}[\lambda^{2}\sinh^{2}(X_{t})])^{1/2}
\displaystyle\geq λn(tr[cosh(Xt)]n)\displaystyle~{}\frac{\lambda}{\sqrt{n}}(\operatorname{tr}[\cosh(X_{t})]-n)
=\displaystyle= λn(Ψλ(At)n)\displaystyle~{}\frac{\lambda}{\sqrt{n}}(\Psi_{\lambda}(A_{t})-n) (27)

where the second step follows from Part 2 in Lemma 3.4.

Then, we have

Ψλ(At+1)Ψλ(At)\displaystyle~{}\Psi_{\lambda}(A_{t+1})-\Psi_{\lambda}(A_{t})
\displaystyle\leq -\epsilon\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}+\epsilon^{2}\lambda(\lambda\sqrt{n}+\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F})
\displaystyle\leq 0.9ϵ~Ψλ(At)F+ϵ2λ2n\displaystyle~{}-0.9\epsilon\|\widetilde{\nabla}\Psi_{\lambda}(A_{t})\|_{F}+\epsilon^{2}\lambda^{2}\sqrt{n}
\displaystyle\leq 0.9ϵλ1nΨλ(At)+ϵλn\displaystyle~{}-0.9\epsilon\lambda\frac{1}{\sqrt{n}}\Psi_{\lambda}(A_{t})+\epsilon\lambda\sqrt{n}

where the first step follows from Eq. (25) and Eq. (26), the second step comes from \epsilon\lambda\leq c for a sufficiently small constant c, and the third step comes from Eq. (27) and \epsilon\lambda\leq c.

Finally, we complete the proof. ∎

Lemma C.2 (Small spectral potential implies good spectral approximation).

Let An×nA\in\mathbb{R}^{n\times n} be symmetric, and λ>0\lambda>0. Suppose Ψλ(A)p\Psi_{\lambda}(A)\leq p for some p>1p>1. Then, we have

(1δ)AA(1+δ)A\displaystyle(1-\delta)A_{\star}\preceq A\preceq(1+\delta)A_{\star}

for δ=O(λ1logp)\delta=O(\lambda^{-1}\log p).

Proof.

By the definition of Ψλ(A)\Psi_{\lambda}(A), Ψλ(A)p\Psi_{\lambda}(A)\leq p implies that for any i[n]i\in[n],

cosh(λ(1λi(A1/2AA1/2)))p,\displaystyle\cosh(\lambda(1-\lambda_{i}(A_{\star}^{-1/2}AA_{\star}^{-1/2})))\leq p,

or equivalently,

|(1λi(A1/2AA1/2))|O(λ1logp).\displaystyle\left|(1-\lambda_{i}(A_{\star}^{-1/2}AA_{\star}^{-1/2}))\right|\leq O(\lambda^{-1}\log p).

Hence, we have

(1δ)InA1/2AA1/2(1+δ)In,\displaystyle(1-\delta)I_{n}\preceq A_{\star}^{-1/2}AA_{\star}^{-1/2}\preceq(1+\delta)I_{n},

where \delta:=O(\lambda^{-1}\log p). Therefore, by multiplying A_{\star}^{1/2} on both sides, we get that

(1δ)AA(1+δ)A,\displaystyle(1-\delta)A_{\star}\preceq A\preceq(1+\delta)A_{\star},

which completes the proof of the lemma. ∎

Appendix D Gradient descent with General Measurements

In this section, we analyze the potential decay of gradient descent with non-orthogonal measurements. The main result of this section is Lemma D.1 below.

We first recall the definition of the potential function Φλ(A)\Phi_{\lambda}(A):

Φλ(A):=i=1mcosh(λ(uiAuibi)),\displaystyle\Phi_{\lambda}(A):=\sum_{i=1}^{m}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i})),

its gradient Φλ(A)n×n\nabla\Phi_{\lambda}(A)\in\mathbb{R}^{n\times n}:

Φλ(A)=i=1muiuiλsinh(λ(uiAuibi)),\displaystyle\nabla\Phi_{\lambda}(A)=\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh\left(\lambda(u_{i}^{\top}Au_{i}-b_{i})\right), (28)

and its Hessian 2Φλ(A)n2×n2\nabla^{2}\Phi_{\lambda}(A)\in\mathbb{R}^{n^{2}\times n^{2}}:

2Φλ(A)=i=1m(uiui)(uiui)λ2cosh(λ(uiAuibi)).\displaystyle\nabla^{2}\Phi_{\lambda}(A)=\sum_{i=1}^{m}(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\lambda^{2}\cosh(\lambda(u_{i}^{\top}Au_{i}-b_{i})).
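Before turning to the lemma, here is a quick finite-difference consistency check (our own, not part of the analysis) of the gradient formula in Eq. (28), on a random instance with non-orthogonal unit measurement vectors.

```python
import numpy as np

# Finite-difference check of Eq. (28): grad Phi_lam(A) = sum_i u_i u_i^T * lam * sinh(lam(u_i^T A u_i - b_i)).
rng = np.random.default_rng(5)
n, m, lam = 8, 5, 2.0

U = rng.standard_normal((n, m))
U /= np.linalg.norm(U, axis=0)                     # unit (non-orthogonal) u_i
b = rng.standard_normal(m)
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

phi = lambda A: np.sum(np.cosh(lam * (np.einsum('ji,jk,ki->i', U, A, U) - b)))

z = np.einsum('ji,jk,ki->i', U, A, U) - b          # residuals u_i^T A u_i - b_i
grad = lam * (U * np.sinh(lam * z)) @ U.T          # closed-form gradient, Eq. (28)

h = 1e-6
num = np.zeros_like(A)
for p in range(n):
    for q in range(n):
        E = np.zeros((n, n)); E[p, q] = h
        num[p, q] = (phi(A + E) - phi(A - E)) / (2 * h)   # central difference

print("max abs deviation:", np.max(np.abs(grad - num)))
assert np.allclose(grad, num, rtol=1e-4, atol=1e-3)
```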
Lemma D.1 (Progress on entry-wise potential with general measurements).

Assume that |u_{i}^{\top}u_{j}|\leq\rho with \rho\leq\frac{1}{10m} for any i\neq j\in[m], and \|u_{i}\|_{2}=1 for all i\in[m]. Let c\in(0,1) denote a sufficiently small positive constant. Then, for any \epsilon,\lambda>0 such that \epsilon\lambda\leq c, we have for any t>0,

Φλ(At+1)(10.9λϵm)Φλ(At)+λϵm\displaystyle\Phi_{\lambda}(A_{t+1})\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}
Proof.

We first have

Φλ(At+1)Φλ(At)\displaystyle~{}\Phi_{\lambda}(A_{t+1})-\Phi_{\lambda}(A_{t})
\displaystyle\leq Φλ(At),(At+1At)+O(1)2Φλ(At),(At+1At)(At+1At)\displaystyle~{}\langle\nabla\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\rangle+O(1)\langle\nabla^{2}\Phi_{\lambda}(A_{t}),(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})\rangle
:=\displaystyle:= Δ1+O(1)Δ2,\displaystyle~{}-\Delta_{1}+O(1)\cdot\Delta_{2}, (29)

which follows from Corollary A.2.

We choose

At+1=AtϵΦλ(At)/Φλ(At)F.\displaystyle A_{t+1}=A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (30)

We can bound

Δ1=\displaystyle\Delta_{1}= tr[Φλ(At)(At+1At)]\displaystyle~{}-\operatorname{tr}[\nabla\Phi_{\lambda}(A_{t})(A_{t+1}-A_{t})]
=\displaystyle= ϵΦλ(At)F.\displaystyle~{}\epsilon\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}. (31)

For \|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}, we have

1λ2Φλ(At)F2\displaystyle~{}\frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= tr[(i=1muiuisinh(λ(uiAtuibi)))2]\displaystyle~{}\operatorname{tr}[(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{2}]
=\displaystyle= tr[i=1msinh2(λ(uiAtuibi))]\displaystyle~{}\operatorname{tr}[\sum_{i=1}^{m}\sinh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))]
+\displaystyle+ tr[i=1mjim(uiui)(ujuj)sinh(λ(uiAtuibi))sinh(λ(ujAtujbj))]\displaystyle~{}\operatorname{tr}[\sum_{i=1}^{m}\sum_{j\neq i}^{m}(u_{i}u_{i}^{\top})(u_{j}u_{j}^{\top})\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))\cdot\sinh(\lambda(u_{j}^{\top}A_{t}u_{j}-b_{j}))]
\displaystyle\geq 0.9tr[i=1msinh2(λ(uiAtuibi))]\displaystyle~{}0.9\operatorname{tr}[\sum_{i=1}^{m}\sinh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))]
\displaystyle\geq 0.91m(i=1mcosh(λ(uiAtuibi))m)2\displaystyle~{}0.9\frac{1}{m}(\sum_{i=1}^{m}\cosh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))-m)^{2}
=\displaystyle= 0.91m(Φλ(At)m)2,\displaystyle~{}0.9\frac{1}{m}(\Phi_{\lambda}(A_{t})-m)^{2}, (32)

where the first step follows from Eq. (28), the second step follows from partitioning the sum based on whether i=j and \|u_{i}\|_{2}=1, the third step comes from Claim D.2, the fourth step follows from Part 2 in Lemma 3.3, and the fifth step follows from the definition of \Phi_{\lambda}(A).

Thus,

Δ1=\displaystyle\Delta_{1}= tr[Φλ(At)(At+1At)]\displaystyle~{}-\operatorname{tr}[\nabla\Phi_{\lambda}(A_{t})(A_{t+1}-A_{t})]
\displaystyle\geq λϵ1m(Φλ(At)m).\displaystyle~{}\lambda\epsilon\cdot\frac{1}{\sqrt{m}}(\Phi_{\lambda}(A_{t})-m). (33)

For simplicity, we define

zt,i:=λ(uiAtuibi).\displaystyle z_{t,i}:=\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}).

Next, we upper-bound \Delta_{2}. For simplicity, we consider \Delta_{2}\cdot(\frac{1}{\epsilon\lambda})^{2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}, which can be expressed as:

Δ2(1ϵλ)2Φλ(At)F2\displaystyle\Delta_{2}\cdot(\frac{1}{\epsilon\lambda})^{2}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= 1(λϵ)2tr[2Φλ(At)(At+1At)(At+1At)]Φλ(At)F2\displaystyle~{}\frac{1}{(\lambda\epsilon)^{2}}\operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(A_{t+1}-A_{t})\otimes(A_{t+1}-A_{t})]\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= tr[2Φλ(At)(i=1muiuisinh(zt,i))(i=1muiuisinh(zt,i))]\displaystyle~{}\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(z_{t,i}))\otimes(\sum_{i=1}^{m}u_{i}u_{i}^{\top}\sinh(z_{t,i}))\Big{]}
=\displaystyle= \operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i,j}\sinh(z_{t,i})\sinh(z_{t,j})\cdot(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top})\Big{]}
=\displaystyle= \operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})\cdot(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top})\Big{]}
+\displaystyle+ tr[2Φλ(At)(ijsinh(zt,i)sinh(zt,j)(uiuiujuj))]\displaystyle~{}\operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})(\sum_{i\neq j}\sinh(z_{t,i})\sinh(z_{t,j})(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))]
=\displaystyle= Q1+Q2,\displaystyle~{}Q_{1}+Q_{2}, (34)

where

\displaystyle Q_{1}:=\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})\cdot(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top})\Big{]} (35)

denotes the diagonal term, and

Q2:=tr[2Φλ(At)(ijsinh(zt,i)sinh(zt,j)(uiuiujuj))]\displaystyle Q_{2}:=\operatorname{tr}\Big{[}\nabla^{2}\Phi_{\lambda}(A_{t})\cdot(\sum_{i\neq j}\sinh(z_{t,i})\sinh(z_{t,j})(u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top}))\Big{]} (36)

denotes the off-diagonal term. The first step comes from the definition of \Delta_{2}, the second step follows from replacing A_{t+1}-A_{t} using Eq. (30), the third step follows from extracting the scalar factors out of the Kronecker products, the fourth step comes from splitting into two partitions based on whether i=j, and the fifth step comes from the definitions of Q_{1} and Q_{2}.

Thus,

Δ2\displaystyle\Delta_{2}\leq (ϵλ)2(Q1+Q2)/Φλ(At)F2\displaystyle~{}(\epsilon\lambda)^{2}(Q_{1}+Q_{2})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
\displaystyle\leq 1.3(\epsilon\lambda)^{2}\cdot(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}). (37)

where the second step follows from Claim D.3 and Claim D.5.

Hence, we have

Φλ(At+1)Φλ(At)\displaystyle~{}\Phi_{\lambda}(A_{t+1})-\Phi_{\lambda}(A_{t})
\displaystyle\leq Δ1+O(1)Δ2\displaystyle~{}-\Delta_{1}+O(1)\cdot\Delta_{2}
\displaystyle\leq ϵΦλ(At)F+O(1)(ϵλ)2(m+1λΦλ(At)F)\displaystyle~{}-\epsilon\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(1)(\epsilon\lambda)^{2}(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})
\displaystyle\leq -0.9\epsilon\|\nabla\Phi_{\lambda}(A_{t})\|_{F}+O(\epsilon\lambda)^{2}\sqrt{m}
\displaystyle\leq 0.9ϵλ1m(Φλ(At)m)+O(ϵλ)2m\displaystyle~{}-0.9\epsilon\lambda\frac{1}{\sqrt{m}}(\Phi_{\lambda}(A_{t})-m)+O(\epsilon\lambda)^{2}\sqrt{m}
\displaystyle\leq 0.9ϵλ1mΦλ(At)+ϵλm,\displaystyle~{}-0.9\epsilon\lambda\frac{1}{\sqrt{m}}\Phi_{\lambda}(A_{t})+\epsilon\lambda\sqrt{m},

where the first step follows from Eq. (29), the second step follows from Eq. (31) and Eq. (37), the third step follows from \epsilon\lambda\in(0,0.01), the fourth step follows from Eq. (32), and the final step follows from 0.9\epsilon\lambda\sqrt{m}+O(\epsilon\lambda)^{2}\sqrt{m}\leq\epsilon\lambda\sqrt{m} for \epsilon\lambda sufficiently small.

The lemma is then proved. ∎

We prove some technical claims below.

Claim D.2.

It holds that:

\displaystyle\Big{|}\sum_{i\neq j\in[m]}\langle u_{i},u_{j}\rangle^{2}\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))\sinh(\lambda(u_{j}^{\top}A_{t}u_{j}-b_{j}))\Big{|}
\displaystyle\leq 0.1\sum_{i=1}^{m}\sinh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))
Proof.

We define Ri,jR_{i,j} and RR as follows:

Ri,j=\displaystyle R_{i,j}= sinh(λ(uiAtuibi))sinh(λ(ujAtujbj))\displaystyle~{}\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))\sinh(\lambda(u_{j}^{\top}A_{t}u_{j}-b_{j}))
R=\displaystyle R= tr[i=1mjim(uiui)(ujuj)sinh(λ(uiAtuibi))sinh(λ(ujAtujbj))]\displaystyle~{}\operatorname{tr}[\sum_{i=1}^{m}\sum_{j\neq i}^{m}(u_{i}u_{i}^{\top})(u_{j}u_{j}^{\top})\sinh(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}))\cdot\sinh(\lambda(u_{j}^{\top}A_{t}u_{j}-b_{j}))]

Then we can upper bound |R||R| by:

\displaystyle|R|\leq \sum_{i=1}^{m}\sum_{j\neq i}^{m}\operatorname{tr}[(u_{i}u_{i}^{\top})(u_{j}u_{j}^{\top})]\cdot|R_{i,j}|
\displaystyle\leq ρ2tr[i=1mjim|Ri,j|]\displaystyle~{}\rho^{2}\operatorname{tr}[\sum_{i=1}^{m}\sum_{j\neq i}^{m}|R_{i,j}|]
\displaystyle\leq ρ22tr[i=1mjim(Ri,i+Rj,j)]\displaystyle~{}\frac{\rho^{2}}{2}\operatorname{tr}[\sum_{i=1}^{m}\sum_{j\neq i}^{m}(R_{i,i}+R_{j,j})]
\displaystyle\leq mρ2tr[i=1mRi,i]\displaystyle~{}m\rho^{2}\operatorname{tr}[\sum_{i=1}^{m}R_{i,i}]
\displaystyle\leq 0.1tr[i=1mRi,i]\displaystyle~{}0.1\operatorname{tr}[\sum_{i=1}^{m}R_{i,i}]

where the first step follows from the triangle inequality and \operatorname{tr}[(u_{i}u_{i}^{\top})(u_{j}u_{j}^{\top})]=\langle u_{i},u_{j}\rangle^{2}, the second step follows from |u_{i}^{\top}u_{j}|\leq\rho, the third step follows from |ab|\leq\frac{a^{2}+b^{2}}{2}, the fourth step follows from the summation over j, and the fifth step comes from m\rho^{2}\leq 0.1. ∎

Claim D.3.

For Q1Q_{1} defined in Eq. (35), we have

Q11.1(m+1λΦλ(At)F)Φλ(At)F2.\displaystyle Q_{1}\leq 1.1\Big{(}\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\Big{)}\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}.
Proof.

For simplicity, we define zt,iz_{t,i} to be

zt,i:=λ(uiAtuibi).\displaystyle z_{t,i}:=\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i}).

Recall that

2Φλ(At)=λ2i=1m(uiui)(uiui)cosh(zt,i).\displaystyle\nabla^{2}\Phi_{\lambda}(A_{t})=\lambda^{2}\cdot\sum_{i=1}^{m}(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\cosh(z_{t,i}).

For Q1Q_{1}, we have

Q1=\displaystyle Q_{1}= tr[2Φλ(At)i=1msinh2(zt,i)(uiuiuiui))]\displaystyle~{}\operatorname{tr}[\nabla^{2}\Phi_{\lambda}(A_{t})\sum_{i=1}^{m}\sinh^{2}(z_{t,i})(u_{i}u_{i}^{\top}\otimes u_{i}u_{i}^{\top}))]
=\displaystyle= λ2tr[i=1mcosh(zt,i)(uiui)(uiui)i=1msinh2(zt,i)(uiui)(uiui)]\displaystyle~{}\lambda^{2}\cdot\operatorname{tr}[\sum_{i=1}^{m}\cosh(z_{t,i})(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})\cdot\sum_{i=1}^{m}\sinh^{2}(z_{t,i})(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})]
=\displaystyle= λ2i=1mtr[cosh(zt,i)sinh2(zt,i)(uiuiuiui)(uiuiuiui)]\displaystyle~{}\lambda^{2}\cdot\sum_{i=1}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,i})\cdot(u_{i}u_{i}^{\top}u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top}u_{i}u_{i}^{\top})]
+\displaystyle+ λ2i=1mjimtr[cosh(zt,i)sinh2(zt,j)(uiuiujuj)(ujujuiui)]\displaystyle~{}\lambda^{2}\cdot\sum_{i=1}^{m}\sum_{j\neq i}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,j})\cdot(u_{i}u_{i}^{\top}u_{j}u_{j}^{\top})\otimes(u_{j}u_{j}^{\top}u_{i}u_{i}^{\top})]
=\displaystyle= λ2i=1mcosh(zt,i)sinh2(zt,i)\displaystyle~{}\lambda^{2}\cdot\sum_{i=1}^{m}\cosh(z_{t,i})\sinh^{2}(z_{t,i})
+\displaystyle+ λ2i=1mjimtr[cosh(zt,i)sinh2(zt,j)(uiuiujuj)(ujujuiui)]\displaystyle~{}\lambda^{2}\cdot\sum_{i=1}^{m}\sum_{j\neq i}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,j})\cdot(u_{i}u_{i}^{\top}u_{j}u_{j}^{\top})\otimes(u_{j}u_{j}^{\top}u_{i}u_{i}^{\top})]
\displaystyle\leq 1.1λ2(i=1mcosh2(zt,i))1/2(i=1msinh4(zt,i))1/2\displaystyle~{}1.1\lambda^{2}\cdot(\sum_{i=1}^{m}\cosh^{2}(z_{t,i}))^{1/2}\cdot(\sum_{i=1}^{m}\sinh^{4}(z_{t,i}))^{1/2}
\displaystyle\leq 1.1λ2B1B2,\displaystyle~{}1.1\lambda^{2}\cdot B_{1}\cdot B_{2}, (38)

where the first step comes from the definition of Q_{1}, the second step comes from the definition of \nabla^{2}\Phi_{\lambda}(A_{t}), the third step follows from (A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD) and partitioning the terms based on whether i=j, the fourth step comes from \|u_{i}\|=1 and \operatorname{tr}[(u_{i}u_{i}^{\top})\otimes(u_{i}u_{i}^{\top})]=1, and the fifth step comes from the Cauchy–Schwarz inequality and Claim D.4, where B_{1}:=(\sum_{i=1}^{m}\cosh^{2}(z_{t,i}))^{1/2} and B_{2}:=(\sum_{i=1}^{m}\sinh^{4}(z_{t,i}))^{1/2}.

Claim D.4.

We can bound the off-diagonal entries by:

|λ2i=1mjimtr[cosh(zt,i)sinh2(zt,j)(uiuiujuj)(ujujuiui)]|\displaystyle~{}|\lambda^{2}\cdot\sum_{i=1}^{m}\sum_{j\neq i}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,j})\cdot(u_{i}u_{i}^{\top}u_{j}u_{j}^{\top})\otimes(u_{j}u_{j}^{\top}u_{i}u_{i}^{\top})]|
\displaystyle\leq 0.1\lambda^{2}\Big(\sum_{i=1}^{m}\cosh^{2}(z_{t,i})\Big)^{1/2}\cdot\Big(\sum_{i=1}^{m}\sinh^{4}(z_{t,i})\Big)^{1/2}
Proof.
|λ2i=1mjimtr[cosh(zt,i)sinh2(zt,j)(uiuiujuj)(ujujuiui)]|\displaystyle~{}|\lambda^{2}\cdot\sum_{i=1}^{m}\sum_{j\neq i}^{m}\operatorname{tr}[\cosh(z_{t,i})\sinh^{2}(z_{t,j})\cdot(u_{i}u_{i}^{\top}u_{j}u_{j}^{\top})\otimes(u_{j}u_{j}^{\top}u_{i}u_{i}^{\top})]|
\displaystyle\leq ρ2λ2|i=1mjimcosh(zt,i)sinh2(zt,j)|\displaystyle~{}\rho^{2}\lambda^{2}|\sum_{i=1}^{m}\sum_{j\neq i}^{m}\cosh(z_{t,i})\sinh^{2}(z_{t,j})|
\displaystyle\leq \rho^{2}\lambda^{2}\Big(\sum_{i=1}^{m}\sum_{j\neq i}^{m}\cosh^{2}(z_{t,i})\Big)^{1/2}\cdot\Big(\sum_{i=1}^{m}\sum_{j\neq i}^{m}\sinh^{4}(z_{t,j})\Big)^{1/2}
\displaystyle\leq m\rho^{2}\lambda^{2}\Big(\sum_{i=1}^{m}\cosh^{2}(z_{t,i})\Big)^{1/2}\cdot\Big(\sum_{i=1}^{m}\sinh^{4}(z_{t,i})\Big)^{1/2}
\displaystyle\leq 0.1\lambda^{2}\Big(\sum_{i=1}^{m}\cosh^{2}(z_{t,i})\Big)^{1/2}\cdot\Big(\sum_{i=1}^{m}\sinh^{4}(z_{t,i})\Big)^{1/2}

where the first step comes from |ui,uj|ρ|\langle u_{i},u_{j}\rangle|\leq\rho, the second step comes from Cauchy–Schwarz inequality, the third step follows from summation over mm terms, and the fourth step comes from ρ2m0.1\rho^{2}m\leq 0.1. ∎

For the term B1B_{1}, we have

B1=\displaystyle B_{1}= (i=1mcosh2(λ(uiAtuibi)))1/2\displaystyle~{}(\sum_{i=1}^{m}\cosh^{2}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{1/2}
\displaystyle\leq m+1λΦλ(At)F,\displaystyle~{}\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}, (39)

where the second step follows from Part 1 of Lemma 3.3.

For the term B2B_{2}, we have

B2=\displaystyle B_{2}= (i=1msinh4(λ(uiAtuibi)))1/2\displaystyle~{}(\sum_{i=1}^{m}\sinh^{4}(\lambda(u_{i}^{\top}A_{t}u_{i}-b_{i})))^{1/2}
\displaystyle\leq 1λ2Φλ(At)F2,\displaystyle~{}\frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}, (40)

where the second step follows from x42x22\|x\|_{4}^{2}\leq\|x\|_{2}^{2}. This implies that

Q1\displaystyle Q_{1}\leq 1.1λ2B1B2\displaystyle~{}1.1\lambda^{2}\cdot B_{1}\cdot B_{2}
\displaystyle\leq 1.1λ2(m+1λΦλ(At)F)1λ2Φλ(At)F2\displaystyle~{}1.1\lambda^{2}\cdot(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\frac{1}{\lambda^{2}}\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= 1.1(m+1λΦλ(At)F)Φλ(At)F2.\displaystyle~{}1.1(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}.

This completes the proof. ∎

Claim D.5.

For Q2Q_{2} defined in Eq. (36), we have:

\displaystyle Q_{2}\leq 0.2(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
Proof.

For Q_{2}, we have:

Q2=\displaystyle Q_{2}= λ2tr[=1m(cosh(zt,)uuuu)ijm(sinh(zt,i)sinh(zt,j)uiuiujuj)]\displaystyle~{}\lambda^{2}\operatorname{tr}[\sum_{\ell=1}^{m}(\cosh(z_{t,\ell})\cdot u_{\ell}u_{\ell}^{\top}\otimes u_{\ell}u_{\ell}^{\top})\cdot\sum_{i\neq j}^{m}(\sinh(z_{t,i})\sinh(z_{t,j})\cdot u_{i}u_{i}^{\top}\otimes u_{j}u_{j}^{\top})]
=\displaystyle= λ2tr[=1mijmcosh(zt,)sinh(zt,i)sinh(zt,j)(uuuiui)(uuujuj)]\displaystyle~{}\lambda^{2}\operatorname{tr}[\sum_{\ell=1}^{m}\sum_{i\neq j}^{m}\cosh(z_{t,\ell})\sinh(z_{t,i})\sinh(z_{t,j})\cdot(u_{\ell}u_{\ell}^{\top}u_{i}u_{i}^{\top})\otimes(u_{\ell}u_{\ell}^{\top}u_{j}u_{j}^{\top})]
\displaystyle\leq λ2ρ2=1mijmcosh(zt,)(sinh2(zt,i)+sinh2(zt,j))\displaystyle~{}\lambda^{2}\rho^{2}\sum_{\ell=1}^{m}\sum_{i\neq j}^{m}\cosh(z_{t,\ell})(\sinh^{2}(z_{t,i})+\sinh^{2}(z_{t,j}))
\displaystyle\leq 2mλ2ρ2=1mi=1mcosh(zt,)sinh2(zt,i)\displaystyle~{}2m\lambda^{2}\rho^{2}\sum_{\ell=1}^{m}\sum_{i=1}^{m}\cosh(z_{t,\ell})\sinh^{2}(z_{t,i})
\displaystyle\leq 2m2λ2ρ2i=1mcosh(zt,i)sinh2(zt,i)\displaystyle~{}2m^{2}\lambda^{2}\rho^{2}\sum_{i=1}^{m}\cosh(z_{t,i})\sinh^{2}(z_{t,i})
\displaystyle\leq 2m^{2}\lambda^{2}\rho^{2}\Big(\sum_{i=1}^{m}\cosh^{2}(z_{t,i})\Big)^{1/2}\cdot\Big(\sum_{i=1}^{m}\sinh^{4}(z_{t,i})\Big)^{1/2}
\displaystyle\leq 0.2(\sqrt{m}+\frac{1}{\lambda}\|\nabla\Phi_{\lambda}(A_{t})\|_{F})\cdot\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2} (41)

where the second step follows from (A\otimes B)\cdot(C\otimes D)=(AC)\otimes(BD), the third step follows from the Cauchy–Schwarz inequality and |\langle u_{i},u_{j}\rangle|\leq\rho, the fourth step follows from combining \sinh^{2}(z_{t,i}) and \sinh^{2}(z_{t,j}), the fifth step comes from the summation over m terms, the sixth step comes from the Cauchy–Schwarz inequality, and the seventh step follows from Eq. (39), Eq. (40), and m^{2}\rho^{2}\leq 0.1. ∎

Appendix E Stochastic Gradient Descent for General Measurements

In this section, we further extend the convergence analysis of the stochastic gradient descent matrix sensing algorithm to general measurements, where the vectors \{u_{i}\}_{i\in[m]} are non-orthogonal and satisfy |u_{i}^{\top}u_{j}|\leq\rho. Algorithm 4 implements the stochastic gradient descent version of the matrix sensing algorithm.

In Algorithm 4, at each iteration tt, we first compute the stochastic gradient descent by:

Φλ(At,t)mBituiuiλsinh(λzi)\displaystyle\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})\leftarrow\frac{m}{B}\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda z_{i})

then we update the matrix with the gradient:

At+1AtϵΦλ(At,t)/Φλ(At)F\displaystyle A_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}

At the end of each iteration, we update ziz_{i} by:

\displaystyle z_{i}\leftarrow z_{i}-\epsilon\lambda m\,w_{i,j}^{2}\sinh(\lambda z_{j})/(\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\,B)\quad\forall i\in[m],\;j\in{\cal B}_{t}

We are interested in studying the time complexity and convergence analysis under the general measurement assumption.

Lemma E.1 (Cost-per-iteration of stochastic gradient descent for general measurements).

Algorithm 4 takes O(mn2)O(mn^{2})-time for preprocessing and each iteration takes O(Bn2+m2)O(Bn^{2}+m^{2})-time.

Proof.

Since uiu_{i}’s are no longer orthogonal, we need to compute Φλ(At)F\|\nabla\Phi_{\lambda}(A_{t})\|_{F} in the following way:

Φλ(At)F2\displaystyle\|\nabla\Phi_{\lambda}(A_{t})\|_{F}^{2}
=\displaystyle= \operatorname{tr}\Big{[}\Big{(}\sum_{i=1}^{m}u_{i}u_{i}^{\top}\lambda\sinh(\lambda z_{t,i})\Big{)}^{2}\Big{]}
=\displaystyle= \lambda^{2}\sum_{i,j=1}^{m}\langle u_{i},u_{j}\rangle^{2}\sinh(\lambda z_{t,i})\sinh(\lambda z_{t,j})
=\displaystyle= \lambda^{2}\sum_{i,j=1}^{m}w_{i,j}^{2}\sinh(\lambda z_{t,i})\sinh(\lambda z_{t,j}).

Hence, with {zt,i}i[m]\{z_{t,i}\}_{i\in[m]}, we can compute Φλ(At)F\|\nabla\Phi_{\lambda}(A_{t})\|_{F} in O(m2)O(m^{2})-time.

Another difference from the orthogonal measurement case is the update for zt+1,iz_{t+1,i}. Now, we have

zt+1,izt,i\displaystyle z_{t+1,i}-z_{t,i}
=\displaystyle= ui(At+1At)ui\displaystyle~{}u_{i}^{\top}(A_{t+1}-A_{t})u_{i}
=\displaystyle= ϵΦλ(At)FuiΦλ(At,t)ui\displaystyle~{}-\frac{\epsilon}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}}\cdot u_{i}^{\top}\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})u_{i}
=\displaystyle= ϵλmΦλ(At)FBjtuiujujuisinh(λzt,j)\displaystyle~{}-\frac{\epsilon\lambda m}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B}\sum_{j\in{\cal B}_{t}}u_{i}^{\top}u_{j}u_{j}^{\top}u_{i}\cdot\sinh(\lambda z_{t,j})
=\displaystyle= ϵλmΦλ(At)FBjtwi,j2sinh(λzt,j).\displaystyle~{}-\frac{\epsilon\lambda m}{\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B}\sum_{j\in{\cal B}_{t}}w_{i,j}^{2}\cdot\sinh(\lambda z_{t,j}).

Hence, each zt+1,iz_{t+1,i} can be computed in O(B)O(B)-time. And it takes O(mB)O(mB)-time to update all zt+1,iz_{t+1,i}.

The other steps’ time costs are quite clear from Algorithm 4. ∎

Lemma E.2 (Progress on expected potential with general measurements).

Assume that |u_{i}^{\top}u_{j}|\leq\rho with \rho\leq\frac{1}{10m} for any i\neq j\in[m], and \|u_{i}\|_{2}=1 for all i\in[m]. Let c\in(0,1) denote a sufficiently small positive constant. Then, for any \epsilon,\lambda>0 such that \epsilon\lambda\leq c\frac{|{\cal B}_{t}|}{m}, we have for any t>0,

𝔼[Φλ(At+1)](10.9λϵm)Φλ(At)+λϵm\displaystyle\operatorname*{{\mathbb{E}}}[\Phi_{\lambda}(A_{t+1})]\leq(1-0.9\frac{\lambda\epsilon}{\sqrt{m}})\cdot\Phi_{\lambda}(A_{t})+\lambda\epsilon\sqrt{m}

The proof is a direct generalization of that of Lemma 6.3 and is very similar to the proof of Lemma D.1, so we omit it here.

Algorithm 4 Matrix Sensing with Stochastic Gradient Descent (General Measurements).
1:procedure SGD_General({ui,bi}i[m]\{u_{i},b_{i}\}_{i\in[m]}) \triangleright Lemma E.1
2:     τmaxi[m]bi\tau\leftarrow\max_{i\in[m]}b_{i}
3:     A1τIA_{1}\leftarrow\tau\cdot I
4:     ziuiA1uibiz_{i}\leftarrow u_{i}^{\top}A_{1}u_{i}-b_{i} for i[m]i\in[m]\triangleright zmz\in\mathbb{R}^{m}
5:     wi,jui,ujw_{i,j}\leftarrow\langle u_{i},u_{j}\rangle for i,j[m]i,j\in[m] \triangleright wm×mw\in\mathbb{R}^{m\times m}
6:     for t=1Tt=1\to T do
7:         Sample t[m]{\cal B}_{t}\subset[m] of size BB uniformly at random
8:         Φλ(At,t)mBituiuiλsinh(λzi)\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})\leftarrow\frac{m}{B}\sum_{i\in{\cal B}_{t}}u_{i}u_{i}^{\top}\lambda\sinh(\lambda z_{i})\triangleright It takes O(Bn2)O(Bn^{2})-time
9:         Φλ(At)Fλ(i,j=1mwi,j2sinh(λzi)sinh(λzj))1/2\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\leftarrow\lambda\left(\sum_{i,j=1}^{m}w_{i,j}^{2}\sinh(\lambda z_{i})\sinh(\lambda z_{j})\right)^{1/2}\triangleright It takes O(m2)O(m^{2})-time
10:         At+1AtϵΦλ(At,t)/Φλ(At)FA_{t+1}\leftarrow A_{t}-\epsilon\cdot\nabla\Phi_{\lambda}(A_{t},{\cal B}_{t})/\|\nabla\Phi_{\lambda}(A_{t})\|_{F}\triangleright It takes O(n2)O(n^{2})-time
11:         for i[m]i\in[m] do\triangleright Update zz. It takes O(mB)O(mB)-time
12:              for jtj\in{\cal B}_{t} do
13:                  ziziϵλmwi,j2sinh(λzj)/(Φλ(At)FB)z_{i}\leftarrow z_{i}-\epsilon\lambda mw_{i,j}^{2}\sinh(\lambda z_{j})/(\|\nabla\Phi_{\lambda}(A_{t})\|_{F}B)
14:              end for
15:         end for
16:     end for
17:     return AT+1A_{T+1}
18:end procedure
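For completeness, here is a compact NumPy sketch of Algorithm 4 on a synthetic instance. The nearly orthogonal measurement vectors, the batch size B, and the constants inside \lambda and \epsilon are our own illustrative choices; as in the pseudocode, the residuals z_{i} and the Gram matrix w_{i,j}=\langle u_{i},u_{j}\rangle are maintained so that each iteration costs O(Bn^{2}+m^{2}) time, matching Lemma E.1.

```python
import numpy as np

# Sketch of Algorithm 4 (SGD, general measurements) on a synthetic instance.
rng = np.random.default_rng(6)
n, m, B, delta = 40, 10, 4, 0.2

U = np.linalg.qr(rng.standard_normal((n, m)))[0]
U = U + 0.005 * rng.standard_normal((n, m))
U /= np.linalg.norm(U, axis=0)                        # unit, nearly orthogonal u_i
A_star = 0.3 * rng.standard_normal((n, n))
A_star = (A_star + A_star.T) / 2
b = np.array([u @ A_star @ u for u in U.T])

lam = np.log(m) / delta
eps = 0.05 * B / (m * lam)                            # eps * lam <= c * B / m

A = np.max(b) * np.eye(n)                             # A_1 = tau * I_n
z = np.array([u @ A @ u for u in U.T]) - b            # z_i = u_i^T A_1 u_i - b_i
w = U.T @ U                                           # Gram matrix w_{i,j} = <u_i, u_j>

for t in range(50000):
    if np.max(np.abs(z)) <= delta:
        break
    batch = rng.choice(m, size=B, replace=False)      # mini-batch B_t
    s = np.sinh(lam * z)
    grad_norm = lam * np.sqrt((w ** 2 * np.outer(s, s)).sum())          # ||grad Phi||_F in O(m^2)
    stoch_grad = (m / B) * lam * (U[:, batch] * s[batch]) @ U[:, batch].T
    A -= eps * stoch_grad / grad_norm                 # stochastic gradient step
    z -= eps * lam * m * (w[:, batch] ** 2 @ s[batch]) / (grad_norm * B)  # update z in O(mB)

print("iterations:", t, "max residual:", np.max(np.abs(z)))
```

Note that the incremental update of z mirrors the derivation in Lemma E.1, so u_{i}^{\top}A_{t}u_{i}-b_{i} never has to be recomputed from A_{t} inside the loop.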