
On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares

Stefan Steinerberger Department of Mathematics, University of Washington, Seattle steinerb@uw.edu
Abstract.

We study the behavior of stochastic gradient descent applied to $\|Ax-b\|_{2}^{2}\rightarrow\min$ for invertible $A\in\mathbb{R}^{n\times n}$. We show that there is an explicit constant $c_{A}$ depending (mildly) on $A$ such that

\[
\mathbb{E}\,\left\|Ax_{k+1}-b\right\|^{2}_{2}\leq\left(1+\frac{c_{A}}{\|A\|_{F}^{2}}\right)\left\|Ax_{k}-b\right\|^{2}_{2}-\frac{2}{\|A\|_{F}^{2}}\left\|A^{T}A(x_{k}-x)\right\|^{2}_{2}.
\]

This is a curious inequality: the last term has one more matrix applied to the residual $x_{k}-x$ than the remaining terms. If $x_{k}-x$ is mainly comprised of singular vectors associated to large singular values, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values smoothes.

Key words and phrases:
Stochastic Gradient Descent, Kaczmarz method, Least Squares
2010 Mathematics Subject Classification:
65F10, 65K10, 65K15, 90C06, 93E24
S.S. is supported by the NSF (DMS-1763179) and the Alfred P. Sloan Foundation.

1. Introduction

1.1. Stochastic Gradient Descent.

In this paper, we consider the finite-dimensional linear inverse problem

\[
Ax=b,
\]

where $A\in\mathbb{R}^{n\times n}$ is an invertible matrix, $x\in\mathbb{R}^{n}$ is the (unknown) signal of interest and $b$ is a given right-hand side. Throughout this paper, we will use $a_{1},\dots,a_{n}$ to denote the rows of $A$. Equivalently, we will try to solve the problem

\[
\|Ax-b\|^{2}=\sum_{i=1}^{n}\left(\left\langle a_{i},x\right\rangle-b_{i}\right)^{2}\rightarrow\min.
\]

Following Needell, Srebro & Ward [30], we can interpret this problem as

\[
\sum_{i=1}^{n}f_{i}(x)^{2}\rightarrow\min\qquad\mbox{where}\qquad f_{i}(x)=\left\langle a_{i},x\right\rangle-b_{i}.
\]

The Lipschitz constant of $f_{i}$ is $\|a_{i}\|_{\ell^{2}}$, which motivates the following basic form of stochastic gradient descent: pick one of the $n$ functions with probability proportional to the square of its Lipschitz constant, i.e. proportional to $\|a_{i}\|^{2}$, and then perform a gradient descent step on this much simpler function, resulting in the update rule

\[
x_{k+1}=x_{k}+\frac{b_{i}-\left\langle a_{i},x_{k}\right\rangle}{\|a_{i}\|^{2}}\,a_{i}.
\]

This is also known as the Algebraic Reconstruction Technique (ART) in computerized tomography [10, 14, 15, 26], as the Projection onto Convex Sets (POCS) method [3, 5, 6, 9, 38], and as the Randomized Kaczmarz method [2, 7, 8, 11, 13, 12, 20, 21, 22, 24, 25, 27, 28, 29, 30, 31, 33, 36, 39, 40, 41, 42, 43, 44]. Strohmer & Vershynin [41] showed that

\[
\mathbb{E}\,\|x_{k}-x\|^{2}\leq\left(1-\frac{1}{\|A^{-1}\|^{2}\|A\|_{F}^{2}}\right)^{k}\|x_{0}-x\|^{2},
\]

where $\|A\|_{F}$ denotes the Frobenius norm. In practice, the algorithm often converges considerably faster in the initial phase; this was studied in [17, 18, 39]. In particular, [39] obtains an identity describing the behavior with respect to the singular values and showing that singular vectors associated to large singular values are expected to undergo a more rapid decay. Motivated by this insight, we provide rigorous bounds that quantify this energy cascade from large singular values to small singular values by identifying an interesting inequality for SGD applied to Least Squares.
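For readers who want to experiment, a minimal Python/NumPy sketch of one possible implementation of the iteration described above is given below; the function name and interface are ours, purely illustrative, and not part of the original paper.

    import numpy as np

    def sgd_least_squares(A, b, x0, steps, rng=None):
        """Stochastic gradient descent for ||Ax - b||^2 -> min (randomized Kaczmarz).

        At each step, row i is sampled with probability ||a_i||^2 / ||A||_F^2
        and the iterate is projected onto the hyperplane <a_i, x> = b_i.
        """
        rng = np.random.default_rng() if rng is None else rng
        row_norms_sq = np.sum(A * A, axis=1)
        probs = row_norms_sq / row_norms_sq.sum()   # ||a_i||^2 / ||A||_F^2
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(steps):
            i = rng.choice(len(b), p=probs)
            a_i = A[i]
            x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i
        return x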

1.2. A Motivating Example.

We discuss a simple example that exemplifies the phenomenon that we are interested in. Let us take $A\in\mathbb{R}^{100\times 100}$ by picking each entry independently at random from $\mathcal{N}(0,1)$ and then normalizing the rows to $\|a_{i}\|=1$. The right-hand side is $b=(1,1,\dots,1)$ and we initialize with $x_{0}=\mathbf{0}\in\mathbb{R}^{100}$.
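This experiment is easy to reproduce; a short sketch follows (the random seed and the use of uniform row sampling, valid here since the rows are normalized, are implementation choices on our part):

    import numpy as np

    rng = np.random.default_rng(0)      # fixed seed, purely for reproducibility
    n = 100
    A = rng.standard_normal((n, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows: ||a_i|| = 1
    b = np.ones(n)
    x = np.zeros(n)

    errors = []
    for k in range(20000):
        i = rng.integers(n)             # rows normalized, so sampling is uniform
        x += (b[i] - A[i] @ x) * A[i]
        errors.append(np.linalg.norm(A @ x - b))

    # compare ||x_k|| with the norm of the true solution ||x||
    print(np.linalg.norm(x), np.linalg.norm(np.linalg.solve(A, b)))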

Figure 1. The size of $\|Ax_{k}-b\|_{\ell^{2}}$ for $k=1,\dots,10000$ (left), followed by the next $10000$ iterations, $\|Ax_{10000+k}-b\|_{\ell^{2}}$ (right). We observe rapid initial decay which then slows down.

The picture tells a very interesting story: the error $\|Ax_{k}-b\|$ decays quite rapidly at first before stabilizing in a certain regime. Moreover, for the example shown in Figure 1, $\|x_{0}\|=0$ and $\|x_{20000}\|\sim 28$, which is not even close to the true solution with $\|x\|\sim 128$ – nonetheless, $Ax_{k}$ approximates $b$ quite well. This leaves us with a curious conundrum: we have a good approximation $x_{k}$ of the true solution in the sense that $Ax_{k}\sim b$ even though $x_{k}$ is not very close to $x$. One way this can happen is if $x_{k}-x$ is mainly a linear combination of small singular vectors of $A$. This is related to the following result recently obtained by the author.

Theorem ([39]).

Let $v_{\ell}$ be a (right) singular vector of $A$ associated to the singular value $\sigma_{\ell}$. Then, for the sequence $(x_{k})_{k=0}^{\infty}$ obtained in this randomized manner,

\[
\mathbb{E}\left\langle x_{k}-x,v_{\ell}\right\rangle=\left(1-\frac{\sigma_{\ell}^{2}}{\|A\|_{F}^{2}}\right)^{k}\left\langle x_{0}-x,v_{\ell}\right\rangle.
\]

Here, $\|A\|_{F}$ denotes the Frobenius norm. This shows that we expect $x_{k}-x$ to indeed be mainly a linear combination of singular vectors associated to small singular values, since those are the ones undergoing the slowest decay. It also mirrors the bound obtained by Strohmer & Vershynin [41] since

\[
\sigma_{\ell}\geq\sigma_{n}=\|A^{-1}\|^{-1}.
\]

While interesting in itself, this identity does not fully explain the behavior shown above: it holds only in expectation, with no control of the variance. Moreover, the inner product does initially undergo some fluctuations. Taking the same type of matrix as above, we see an example of such fluctuations in Fig. 2.

Figure 2. The evolution of the normalized residual $\left\langle \frac{x_{k}-x}{\|x_{k}-x\|},v_{1}\right\rangle$ against the leading singular vector $v_{1}$: fluctuations around the mean.
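The mechanism behind the Theorem can nonetheless be checked directly: averaging a single step over the random choice of row shows that $\mathbb{E}\left[x_{k+1}-x\,\middle|\,x_{k}\right]=\left(I-A^{T}A/\|A\|_{F}^{2}\right)(x_{k}-x)$, and the right singular vectors are eigenvectors of this map. A minimal numerical sanity check (the matrix and the indices tested are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    A = rng.standard_normal((n, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    fro_sq = np.linalg.norm(A, 'fro')**2

    # One step of the method, averaged over the row choice, acts on the
    # residual r_k = x_k - x as the linear map  M = I - A^T A / ||A||_F^2.
    M = np.eye(n) - A.T @ A / fro_sq

    # Right singular vectors of A are eigenvectors of M with eigenvalue
    # 1 - sigma_ell^2 / ||A||_F^2, which is the content of the Theorem above.
    U, S, Vt = np.linalg.svd(A)
    for ell in (0, n // 2, n - 1):
        v = Vt[ell]
        print(np.allclose(M @ v, (1 - S[ell]**2 / fro_sq) * v))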

1.3. Related results.

This type of question is well studied; we refer to Ali, Dobriban & Tibshirani [1], Defossez & Bach [4], Jain, Kakade, Kidambi, Netrapalli, Pillutla & Sidford [16], Neu & Rosasco [32], Oymak & Soltanolkotabi [34], Schmidt, Le Roux & Bach [37] and references therein. The connection between Stochastic Gradient Descent applied to Least Squares problems and the Randomized Kaczmarz method was pointed out by Needell, Srebro & Ward [30]. We also mention the papers by Jiao, Jin & Lu [17] and Jin & Lu [18], who studied a similar question and noted that there is energy transfer from large singular values to small singular values.

2. Results

2.1. Main Result.

The main goal of this note is to provide a simple explanation for the rapid initial regularization: the expected size of $\|A(x_{k+1}-x)\|$ after one SGD step can be bounded from above by a quantity involving $\left\|A^{T}A(x_{k}-x)\right\|_{2}$: this is the same quantity except that the matrix has been applied one more time. Applying the matrix increases the norm of the underlying vector unless $A(x_{k}-x)$ is mainly a linear combination of singular vectors with small singular values. So as long as this is not the case, we inherit strong decay, and this leads to the rapid initial regularization. We now make this precise.

Theorem 1.

Let $A\in\mathbb{R}^{n\times n}$ be invertible and consider $\|Ax-b\|^{2}_{2}\rightarrow\min$ via the stochastic gradient descent method introduced above. Abbreviating

\[
\alpha=\max_{1\leq i\leq n}{\frac{\|Aa_{i}\|^{2}}{\|a_{i}\|^{2}}},
\]

we have

\[
\mathbb{E}\,\|Ax_{k+1}-b\|_{2}^{2}\leq\left(1+\frac{\alpha}{\|A\|_{F}^{2}}\right)\|Ax_{k}-b\|_{2}^{2}-\frac{2}{\|A\|_{F}^{2}}\left\|A^{T}(Ax_{k}-b)\right\|_{2}^{2}.
\]

The inequality also holds for $A\in\mathbb{R}^{m\times n}$ with $m\geq n$ as long as $Ax=b$ has a unique solution. We note that $\alpha\geq 1$ is usually quite small: for random matrices, we would expect $\alpha\sim 2$. This is also true for matrices discretizing PDEs via finite elements. The main point of the inequality is that the last term carries an additional factor of $A^{T}$: we can rewrite it as

\[
\left\|A^{T}(Ax_{k}-b)\right\|_{2}^{2}=\|A^{T}A(x_{k}-x)\|_{2}^{2}.
\]

This shows that the presence of large singular vectors in $x_{k}-x$ forces large decay on $\|A(x_{k}-x)\|_{2}^{2}$. Conversely, once the algorithm has reached the plateau phase (see Fig. 1), the terms

\[
\alpha\|A(x_{k}-x)\|_{2}^{2}\qquad\mbox{and}\qquad\|A^{T}A(x_{k}-x)\|_{2}^{2}
\]

are nearly comparable. This forces $x_{k}-x$ to be mostly orthogonal to the singular vectors corresponding to large singular values; that, however, means it is mainly comprised of small singular vectors, which explains why $\|A(x_{k}-x)\|\ll\|x_{k}-x\|$ is possible even when $x_{k}$ is far away from $x$. In particular, this suggests why the method can be effective for the problem of finding a vector $x$ such that $Ax\approx b$: initialize stochastic gradient descent with $x_{0}=0$ and run it for a while. Due to the difference in scales – the second-order quantity $\|A^{T}A(x_{k}-x)\|$ regularizing the first-order quantity $\|A(x_{k}-x)\|$ – we observe that $Ax_{k}$ converges quite rapidly; whether it converges to something sufficiently close to $b$ for the purpose at hand is a different question.
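Since the expectation in Theorem 1 is over a single random row choice, the inequality can be verified numerically by summing over all rows with their sampling probabilities; a minimal sketch (matrix size and seed are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    A = rng.standard_normal((n, n))
    x_true = rng.standard_normal(n)
    b = A @ x_true
    x_k = rng.standard_normal(n)

    row_norms_sq = np.sum(A * A, axis=1)
    fro_sq = row_norms_sq.sum()                                   # ||A||_F^2
    alpha = np.max(np.sum((A @ A.T)**2, axis=0) / row_norms_sq)   # max_i ||A a_i||^2 / ||a_i||^2

    # Exact expectation of ||A x_{k+1} - b||^2 over the random choice of row i.
    lhs = 0.0
    for i in range(n):
        p_i = row_norms_sq[i] / fro_sq
        x_next = x_k + (b[i] - A[i] @ x_k) / row_norms_sq[i] * A[i]
        lhs += p_i * np.linalg.norm(A @ x_next - b)**2

    res = A @ x_k - b
    rhs = (1 + alpha / fro_sq) * (res @ res) - (2 / fro_sq) * np.linalg.norm(A.T @ res)**2
    print(lhs <= rhs + 1e-9)    # Theorem 1 predicts True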

2.2. A Sobolev Space Explanation.

An interesting way to illustrate the result is in terms of partial differential equations. Suppose we try to solve $-\Delta u=f$ on some domain $\Omega\subset\mathbb{R}^{n}$. After a suitable discretization, this results in a discrete linear system $Lu=f$, where $L\in\mathbb{R}^{n\times n}$ is a discretization of the Laplacian $-\Delta$. By an abuse of notation, $u$ denotes a decent approximation of the continuous solution and $f$ a discretization of the continuous right-hand side. However, we also have more information: since $L$ discretizes the Laplacian, we expect that

\[
\left\langle Lu,u\right\rangle\sim\int_{\Omega}{|\nabla u|^{2}\,dx}\qquad\mbox{and}\qquad\left\langle Lu,Lu\right\rangle\sim\int_{\Omega}{|\Delta u|^{2}\,dx}.
\]

Here, the first term corresponds to the size of $u$ in the Sobolev space $\dot{H}^{1}$ while the second term is the size of $u$ in the Sobolev space $\dot{H}^{2}$. In fact, this is a common way to define discretized approximations of Sobolev spaces, also known as the spectral definition since they are defined in terms of the spectrum of $L$. Suppose now we compute a sequence of approximations $u_{k}$ via the method outlined above. Then Theorem 1 (applied with $A=L$) can be rephrased as

\[
\mathbb{E}\,\|u_{k+1}-u\|^{2}_{\dot{H}^{1}}\leq\left(1+\frac{\alpha}{\|L\|_{F}^{2}}\right)\|u_{k}-u\|_{\dot{H}^{1}}^{2}-\frac{2}{\|L\|_{F}^{2}}\left\|u_{k}-u\right\|_{\dot{H}^{2}}^{2}.
\]

What is of great interest here is that the decay of the error in $\dot{H}^{1}$ is driven by the size of the error in $\dot{H}^{2}$ (which is usually larger).
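To make this correspondence concrete, consider the standard second-order finite-difference Laplacian in one dimension; the following small sketch (our own illustration, not taken from the paper) checks that $\langle Lu,u\rangle$ and $\langle Lu,Lu\rangle$, suitably scaled by the mesh width, approximate the $\dot{H}^{1}$ and $\dot{H}^{2}$ norms of a smooth test function:

    import numpy as np

    # 1D finite-difference Laplacian -d^2/dx^2 on (0,1), Dirichlet boundary,
    # n interior grid points, mesh width h = 1/(n+1).
    n = 200
    h = 1.0 / (n + 1)
    L = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

    x = np.linspace(h, 1.0 - h, n)
    u = np.sin(np.pi * x)              # smooth test function vanishing at the boundary

    # <Lu, u> h  approximates  int |u'|^2 dx;   <Lu, Lu> h  approximates  int |u''|^2 dx
    print((L @ u) @ u * h, np.pi**2 / 2)            # exact value: pi^2 / 2
    print((L @ u) @ (L @ u) * h, np.pi**4 / 2)      # exact value: pi^4 / 2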

2.3. An Example.

We illustrate this with an example. Choose $A\in\mathbb{R}^{500\times 500}$ at random (and then, for convenience, normalize the rows to $\|a_{i}\|=1$) and solve $Ax=(1,1,\dots,1)$ starting with a random initial vector $x_{0}$ whose entries are chosen independently from a standard $\mathcal{N}(0,1)$ distribution. We consider both the evolution of $\|Ax_{k}-b\|_{2}^{2}$ across multiple runs as well as the size of

\[
\frac{\alpha}{\|A\|_{F}^{2}}\|Ax_{k}-b\|_{2}^{2}-\frac{2}{\|A\|_{F}^{2}}\left\|A^{T}(Ax_{k}-b)\right\|_{2}^{2},
\]

which is the term from our Theorem quantifying the expected decay at each step.

Figure 3. $\|Ax_{k}-b\|_{\ell^{2}}$ for $k=1,\dots,3000$ (left) and the decay guaranteed by the Theorem in expectation, $\frac{\alpha}{\|A\|_{F}^{2}}\|Ax_{k}-b\|_{2}^{2}-\frac{2}{\|A\|_{F}^{2}}\|A^{T}(Ax_{k}-b)\|_{2}^{2}$ (right), over multiple runs.

We see in Fig. 3 that over 3000 steps, the approximation decays roughly by a factor $\sim 800$ (with little variation across multiple runs). The bound in the Theorem implies an expected decay of $-0.23$ per time-step which, across 3000 time steps, leads to a total of roughly $\sim 696$ in decay.
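For readers who wish to reproduce the quantity plotted in the right panel of Fig. 3, a sketch of tracking the decay term along a single run (the seed and bookkeeping are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    A = rng.standard_normal((n, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # ||a_i|| = 1, hence ||A||_F^2 = n
    b = np.ones(n)
    x = rng.standard_normal(n)

    alpha = np.max(np.sum((A @ A.T)**2, axis=0))    # max_i ||A a_i||^2 since ||a_i|| = 1

    decay_term = []
    for k in range(3000):
        res = A @ x - b
        decay_term.append(alpha / n * (res @ res) - 2 / n * np.linalg.norm(A.T @ res)**2)
        i = rng.integers(n)                          # uniform sampling (rows normalized)
        x += (b[i] - A[i] @ x) * A[i]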

2.4. Higher Powers.

If the matrix $A\in\mathbb{R}^{n\times n}$ is symmetric, we can extend the result to higher powers of the matrix.

Theorem 2.

Let $A\in\mathbb{R}^{n\times n}$ be symmetric and invertible. When solving $\|Ax-b\|^{2}_{2}\rightarrow\min$ via the stochastic gradient descent method outlined above, we have, for any $\ell\in\mathbb{N}$, with

\[
\alpha_{\ell}=\max_{1\leq i\leq n}{\frac{\|A^{\ell}a_{i}\|^{2}}{\|a_{i}\|^{2}}},
\]

the estimate

\[
\mathbb{E}\,\|A^{\ell}(x_{k+1}-x)\|_{2}^{2}\leq\left(1+\frac{\alpha_{\ell}}{\|A\|_{F}^{2}}\right)\|A^{\ell}(x_{k}-x)\|_{2}^{2}-\frac{2}{\|A\|_{F}^{2}}\left\|A^{\ell+1}(x_{k}-x)\right\|_{2}^{2}.
\]

This shows that the same phenomenon does indeed happen at all scales of ‘smoothness’. The applicability of the result naturally depends on the growth of $\alpha_{\ell}$ in $\ell$; generically, though, one would not expect this to be badly behaved: a priori, there is no good reason to expect that a row of the matrix happens to be a linear combination of singular vectors associated to large singular values – though, naturally, this can happen (for example, if $A$ has one very large entry on the diagonal).
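A quick way to get a feeling for the growth of $\alpha_{\ell}$ is to compute it directly for a random symmetric matrix; a minimal sketch (the normalization of $A$ is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    B = rng.standard_normal((n, n))
    A = (B + B.T) / np.sqrt(2 * n)      # random symmetric matrix, spectrum roughly in [-2, 2]

    row_norms_sq = np.sum(A * A, axis=1)
    A_pow = np.eye(n)
    for ell in range(1, 6):
        A_pow = A_pow @ A               # A^ell
        alpha_ell = np.max(np.sum((A_pow @ A.T)**2, axis=0) / row_norms_sq)
        print(ell, alpha_ell)           # alpha_ell = max_i ||A^ell a_i||^2 / ||a_i||^2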

3. Proofs

3.1. Proof of Theorem 1

Proof.

To simplify exposition, we introduce the residual

\[
r_{k}=x_{k}-x.
\]

Plugging in, we obtain that if the $i$-th equation is chosen, then

\begin{align*}
x+r_{k+1} &= x_{k+1}=x_{k}+\frac{b_{i}-\left\langle a_{i},x_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}\\
&= x+r_{k}+\frac{b_{i}-\left\langle a_{i},x+r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}\\
&= x+r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}+\left(\frac{b_{i}-\left\langle a_{i},x\right\rangle}{\|a_{i}\|^{2}}a_{i}\right).
\end{align*}

Since $x$ is the exact solution, we have $b_{i}-\left\langle a_{i},x\right\rangle=0$ and

\[
r_{k+1}=r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}.
\]

Recalling that the $i$-th row is chosen with probability proportional to $\|a_{i}\|^{2}$,

\begin{align*}
\mathbb{E}\,\|Ar_{k+1}\|^{2} &= \mathbb{E}_{i}\,\left\|A\left(r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}\right)\right\|^{2}\\
&= \sum_{i=1}^{n}\frac{\|a_{i}\|^{2}}{\|A\|_{F}^{2}}\left\|Ar_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}Aa_{i}\right\|^{2}.
\end{align*}

This norm can be explicitly squared out as

\[
\left\|Ar_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}Aa_{i}\right\|^{2}=\|Ar_{k}\|^{2}-2\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}\left\langle Ar_{k},Aa_{i}\right\rangle+\frac{\left\langle a_{i},r_{k}\right\rangle^{2}}{\|a_{i}\|^{4}}\|Aa_{i}\|^{2}.
\]

This allows us to rewrite the summation as

\begin{align*}
\mathbb{E}\,\|Ar_{k+1}\|^{2} &= \sum_{i=1}^{n}\frac{\|a_{i}\|^{2}}{\|A\|_{F}^{2}}\left(\|Ar_{k}\|^{2}-2\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}\left\langle Ar_{k},Aa_{i}\right\rangle+\frac{\left\langle a_{i},r_{k}\right\rangle^{2}}{\|a_{i}\|^{4}}\|Aa_{i}\|^{2}\right)\\
&= \|Ar_{k}\|^{2}-\frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle Ar_{k},Aa_{i}\right\rangle+\frac{1}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}\frac{\|Aa_{i}\|^{2}}{\|a_{i}\|^{2}}.
\end{align*}

We have

\begin{align*}
\frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle Ar_{k},Aa_{i}\right\rangle &= \frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle A^{T}Ar_{k},a_{i}\right\rangle\\
&= \frac{2}{\|A\|_{F}^{2}}\left\langle Ar_{k},AA^{T}Ar_{k}\right\rangle\\
&= \frac{2}{\|A\|_{F}^{2}}\|A^{T}Ar_{k}\|^{2}.
\end{align*}

The last sum we bound from above, using $\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}=\|Ar_{k}\|^{2}$, via

\[
\frac{1}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}\frac{\|Aa_{i}\|^{2}}{\|a_{i}\|^{2}}\leq\frac{\max_{i}\|Aa_{i}\|^{2}/\|a_{i}\|^{2}}{\|A\|_{F}^{2}}\|Ar_{k}\|^{2}.
\]

This results in the desired estimate.
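The key algebraic identity used above, $\sum_{i=1}^{n}\langle a_{i},r\rangle\langle Ar,Aa_{i}\rangle=\|A^{T}Ar\|^{2}$, is also easy to sanity-check numerically; a small illustration with arbitrary data:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 30
    A = rng.standard_normal((n, n))
    r = rng.standard_normal(n)

    # sum_i <a_i, r> <A r, A a_i>  should equal  ||A^T A r||^2
    cross = sum((A[i] @ r) * ((A @ r) @ (A @ A[i])) for i in range(n))
    print(np.isclose(cross, np.linalg.norm(A.T @ A @ r)**2))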

3.2. Proof of Theorem 2

Proof.

We again reduce the problem to the study of the residual

\[
r_{k+1}=r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}.
\]

When looking at integer powers, we observe that, by the same reasoning,

\begin{align*}
\mathbb{E}\,\|A^{\ell}r_{k+1}\|^{2} &= \mathbb{E}\,\left\|A^{\ell}\left(r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}a_{i}\right)\right\|^{2}\\
&= \sum_{i=1}^{n}\frac{\|a_{i}\|^{2}}{\|A\|_{F}^{2}}\left\|A^{\ell}r_{k}-\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}A^{\ell}a_{i}\right\|^{2}\\
&= \sum_{i=1}^{n}\frac{\|a_{i}\|^{2}}{\|A\|_{F}^{2}}\left(\|A^{\ell}r_{k}\|^{2}-2\frac{\left\langle a_{i},r_{k}\right\rangle}{\|a_{i}\|^{2}}\left\langle A^{\ell}r_{k},A^{\ell}a_{i}\right\rangle+\frac{\left\langle a_{i},r_{k}\right\rangle^{2}}{\|a_{i}\|^{4}}\|A^{\ell}a_{i}\|^{2}\right)\\
&= \|A^{\ell}r_{k}\|^{2}-\frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle A^{\ell}r_{k},A^{\ell}a_{i}\right\rangle+\frac{1}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}\frac{\|A^{\ell}a_{i}\|^{2}}{\|a_{i}\|^{2}}.
\end{align*}

The first term is easy and the third term can, as before, be bounded by

\[
\frac{1}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}\frac{\|A^{\ell}a_{i}\|^{2}}{\|a_{i}\|^{2}}\leq\frac{\alpha_{\ell}}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle^{2}=\frac{\alpha_{\ell}}{\|A\|_{F}^{2}}\|Ar_{k}\|^{2}.
\]

It remains to understand the second term: here, we can use the symmetry of the matrix to write

\begin{align*}
\frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle A^{\ell}r_{k},A^{\ell}a_{i}\right\rangle &= \frac{2}{\|A\|_{F}^{2}}\sum_{i=1}^{n}\left\langle a_{i},r_{k}\right\rangle\left\langle A^{2\ell}r_{k},a_{i}\right\rangle\\
&= \frac{2}{\|A\|_{F}^{2}}\left\langle Ar_{k},A^{2\ell+1}r_{k}\right\rangle\\
&= \frac{2}{\|A\|_{F}^{2}}\left\langle A^{\ell+1}r_{k},A^{\ell+1}r_{k}\right\rangle\\
&= \frac{2}{\|A\|_{F}^{2}}\|A^{\ell+1}r_{k}\|^{2}.
\end{align*}

References

  • [1] A. Ali, E. Dobriban and R. Tibshirani, The Implicit Regularization of Stochastic Gradient Flow for Least Squares, arXiv:2003.07802
  • [2] Z.-Z. Bai and W.-T. Wu, On convergence rate of the randomized Kaczmarz method, Linear Algebra and its Applications 553 (2018), p. 252–269
  • [3] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier, and T. Strohmer, New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. Proc. SPIE: Visual Communications and Image Processing, p. 299–310, 1992.
  • [4] A. Defossez and F. Bach, Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions, AISTATS, 2015
  • [5] F. Deutsch. Rate of convergence of the method of alternating projections. Internat. Schriftenreihe Numer. Math., 72 (1985), p.96–107.
  • [6] F. Deutsch and H. Hundal. The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl., 205 (1997), p. 381–405.
  • [7] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algorithms, 58 (2011):p. 163–177.
  • [8] T. Elfving, P.-C. Hansen and T. Nikazad, Semi-convergence properties of Kaczmarz’s method, Inverse Problems 30 (2014), 055007
  • [9] A. Galántai. On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl., 310 (2005), p. 30–44.
  • [10] R. Gordon, R. Bender and G. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography, Journal of Theoretical Biology 29 (1970): p. 471–481
  • [11] D. Gordon, A derandomization approach to recovering bandlimited signals across a wide range of random sampling rates, Numer. Algor. 77 (2018): p. 1141–1157
  • [12] R. M. Gower, D. Molitor, J. Moorman and D. Needell, Adaptive Sketch-and-Project Methods for Solving Linear Systems, arXiv:1909.03604
  • [13] R. M. Gower and P. Richtarik. Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl., 36 (2015):1660–1690.
  • [14] G. T. Herman. Image Reconstruction from Projections: The Fundamentals of Computerized Tomography. Computer Science and Applied Mathematics, Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York, 1980.
  • [15] G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Medical Imaging, 12 (1993): p. 600–609.
  • [16] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, V. Pillutla and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares). In 37th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (2018).
  • [17] Y. Jiao, B. Jin and X. Lu, Preasymptotic Convergence of Randomized Kaczmarz Method, Inverse Problems 33 (2017), article: 125012
  • [18] B. Jin, and X. Lu. On the regularizing property of stochastic gradient descent. Inverse Problems 35 (2018): 015004.
  • [19] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l’Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35 (1937), pp. 355–357
  • [20] Y.-T. Lee and A. Sidford, Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, FOCS 2013
  • [21] D. Leventhal and A. S. Lewis, Randomized Methods for Linear Constraints: Convergence Rates and Conditioning, Mathematics of Operation Research, 35 (2010), p. 641–654
  • [22] J. Liu and S. Wright, An accelerated randomized Kaczmarz algorithm, Math. Comp. 85 (2016), p. 153-178
  • [23] A. Ma and D. Needell, Stochastic gradient descent for linear systems with missing data. Numer. Math. Theory Methods Appl. 12 (2019), p. 1–20.
  • [24] A. Ma, D. Needell and A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. 36 (2015), p. 1590–1604
  • [25] J. Moorman, T. Tu, D. Molitor and D. Needell, Randomized Kaczmarz with Averaging, arXiv:2002.04126
  • [26] F. Natterer. The Mathematics of Computerized Tomography. Wiley, New York, 1986.
  • [27] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50 (2010): p. 395–403.
  • [28] D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications 441 (2014), p. 199–221
  • [29] D. Needell and R. Ward, Two-Subspace Projection Method for Coherent Overdetermined Systems, J. Fourier Anal Appl 19 (2013), p. 256–269.
  • [30] D. Needell, R. Ward and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems, p. 1017–1025
  • [31] D. Needell, R. Zhao and A. Zouzias, Randomized block Kaczmarz method with projection for solving least squares, Linear Algebra and its Applications 484 (2015), p. 322–343
  • [32] G. Neu and L. Rosasco, Iterate Averaging as Regularization for Stochastic Gradient Descent, Proceedings of Machine Learning Research 75 (2018), p. 1–21.
  • [33] J. Nutini, B. Sepehry, I. Laradji, M. Schmidt, H. Koepke, A. Virani, Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, The 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
  • [34] S. Oymak and M. Soltanolkotabi, Overparameterized nonlinear learning: Gradient descent takes the shortest path? In International Conference on Machine Learning (2019), pp. 4951–4960.
  • [35] E. L. Piccolomini, and F. Zama. The conjugate gradient regularization method in computed tomography problems. Applied Mathematics and Computation 102 (1999): 87–99.
  • [36] C. Popa, Convergence rates for Kaczmarz-type algorithms, Numer. Algor. 79 (2018): p. 1–17
  • [37] M. Schmidt, N. Le Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., Ser. A 162 (2017), p. 83–112
  • [38] K.M. Sezan and H. Stark. Applications of convex projection theory to image recovery in tomography and related areas. In H. Stark, editor, Image Recovery: Theory and application, pages 415–462. Acad. Press, 1987
  • [39] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978
  • [40] S. Steinerberger, A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910
  • [41] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm for linear systems with exponential convergence, Journal of Fourier Analysis and Applications 15 (2009): p. 262–278
  • [42] Y. Tan and R. Vershynin, Phase retrieval via randomized Kaczmarz: theoretical guarantees, Information and Inference: A Journal of the IMA 8 (2019), p. 97–123
  • [43] J.-J. Zhang, A new greedy Kaczmarz algorithm for the solution of very large linear systems, Applied Mathematics Letters 91 (2019), p. 207–212
  • [44] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34: p. 773–793, 2013.