On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares
Abstract.
We study the behavior of stochastic gradient descent applied to the least-squares problem $Ax = b$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $\alpha$ depending (mildly) on $A$ such that
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\|A^T(Ax_k - b)\|^2. $$
This is a curious inequality: the last term has one more matrix applied to the residual than the remaining terms. If $x_k - x$ is mainly comprised of singular vectors associated to large singular values, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values acts as a smoothing mechanism.
Key words and phrases:
Stochastic Gradient Descent, Kaczmarz method, Least Squares
2010 Mathematics Subject Classification:
65F10, 65K10, 65K15, 90C06, 93E24
1. Introduction
1.1. Stochastic Gradient Descent.
In this paper, we consider the finite-dimensional linear inverse problem
$$ Ax = b, $$
where $A \in \mathbb{R}^{n \times n}$ is an invertible matrix, $x \in \mathbb{R}^n$ is the (unknown) signal of interest and $b \in \mathbb{R}^n$ is a given right-hand side. Throughout this paper, we will use $a_1, \dots, a_n \in \mathbb{R}^n$ to denote the rows of $A$. Equivalently, we will try to solve the problem
$$ \min_{x \in \mathbb{R}^n}\ \|Ax - b\|^2. $$
Following Needell, Srebro & Ward [30], we can interpret this problem as
$$ \min_{x \in \mathbb{R}^n}\ \sum_{i=1}^{n} f_i(x), \qquad \text{where} \quad f_i(x) = \frac{1}{2}\left(\langle a_i, x\rangle - b_i\right)^2. $$
The Lipschitz constant of $\nabla f_i$ is $\|a_i\|^2$, which motivates the following basic form of stochastic gradient descent: pick one of the functions $f_i$ with likelihood proportional to its Lipschitz constant and then do a gradient descent step on this much simpler function, resulting in the update rule
$$ x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\|a_i\|^2}\, a_i. $$
This is also known as the Algebraic Reconstruction Technique (ART) in computerized tomography [10, 14, 15, 26], the Projection onto Convex Sets method [3, 5, 6, 9, 38] and the Randomized Kaczmarz method [2, 7, 8, 11, 13, 12, 20, 21, 22, 24, 25, 27, 28, 29, 30, 31, 33, 36, 39, 40, 41, 42, 43, 44]. Strohmer & Vershynin [41] showed that
$$ \mathbb{E}\,\|x_k - x\|^2 \le \left(1 - \frac{1}{\|A\|_F^2\,\|A^{-1}\|^2}\right)^k \|x_0 - x\|^2, $$
where $\|A\|_F$ is the Frobenius norm and $\|A^{-1}\|$ the operator norm of the inverse. In practice, the algorithm often converges a lot faster initially; this was studied in [17, 18, 39]. In particular, [39] obtains an identity describing the behavior with respect to the singular values, showing that singular vectors associated to large singular values are expected to undergo a more rapid decay. Motivated by this insight, we provide rigorous bounds that quantify this energy cascade from large singular values to small singular values by identifying an interesting inequality for SGD when applied to Least Squares.
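For concreteness, a minimal sketch of the sampling and update rule above might look as follows; the function name and the use of NumPy are illustrative choices and not part of the paper.

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, steps, rng=None):
    """SGD / randomized Kaczmarz iteration for Ax = b.

    At each step a row a_i is sampled with probability ||a_i||^2 / ||A||_F^2
    and x is projected onto the hyperplane <a_i, x> = b_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()      # ||a_i||^2 / ||A||_F^2
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        i = rng.choice(len(b), p=probs)
        a_i = A[i]
        # gradient step on f_i with step size 1/||a_i||^2 = projection onto <a_i, x> = b_i
        x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i
    return x
```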
1.2. A Motivating Example.
We discuss a simple example that exemplifies the phenomenon that we are interested in. Let us take a matrix $A$ by picking each entry independently at random and then normalizing the rows to $\|a_i\| = 1$. The right-hand side is a fixed vector $b$ and we initialize with a vector $x_0$.
The picture tells a very interesting story: the error $\|Ax_k - b\|$ decays quite rapidly initially before stabilizing in a certain regime. Moreover, for the example shown in Figure 1, $x_k$ is not even close to the true solution $x$ – nonetheless, the approximation of $Ax_k$ to $b$ is quite good. This leaves us with a curious conundrum: we have a good approximation of the true solution in the sense that $Ax_k \sim b$ even though $x_k$ is not very close to $x$. One way this can be achieved is if $x_k - x$ is mainly a linear combination of small singular vectors of $A$. This is related to the following result recently obtained by the author.
Theorem ([39]).
Let $v_\ell$ be a (right) singular vector of $A$ associated to the singular value $\sigma_\ell$. Then, for the sequence $(x_k)_{k=0}^{\infty}$ obtained in this randomized manner,
$$ \mathbb{E}\,\langle x_k - x, v_\ell\rangle = \left(1 - \frac{\sigma_\ell^2}{\|A\|_F^2}\right)^k \langle x_0 - x, v_\ell\rangle. $$
Here, $\|A\|_F$ denotes the Frobenius norm. This shows that we expect $x_k - x$ to be indeed mainly a linear combination of singular vectors associated to small singular values, since those are the ones undergoing the slowest decay. It also mirrors the bound obtained by Strohmer & Vershynin [41] since $\sigma_\ell \ge \sigma_{\min} = \|A^{-1}\|^{-1}$ and thus
$$ 1 - \frac{\sigma_\ell^2}{\|A\|_F^2} \le 1 - \frac{1}{\|A\|_F^2\,\|A^{-1}\|^2}. $$
While interesting in itself, this identity does not fully explain the behavior shown above: it only holds in expectation and comes with no control of the variance. Moreover, the inner product $\langle x_k - x, v_\ell\rangle$ does initially undergo some fluctuations. Taking the same type of matrix as above, we see an example of such fluctuations in Fig. 2.
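A small experiment in the spirit of Figures 1 and 2 can be set up along the following lines; the dimension, right-hand side and initialization are illustrative assumptions and not necessarily the ones used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # normalize rows to unit length
x_true = rng.standard_normal(n)                   # illustrative choice
b = A @ x_true
x = np.zeros(n)                                   # illustrative initialization

_, _, Vt = np.linalg.svd(A)                       # rows of Vt are right singular vectors
probs = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2

for k in range(20001):
    if k % 2000 == 0:
        print(k,
              np.linalg.norm(A @ x - b),          # residual: decays quickly at first
              np.linalg.norm(x - x_true),         # error: can stay large much longer
              abs(Vt[0] @ (x - x_true)),          # component along a large singular vector
              abs(Vt[-1] @ (x - x_true)))         # component along a small singular vector
    i = rng.choice(n, p=probs)
    x += (b[i] - A[i] @ x) / np.sum(A[i]**2) * A[i]
```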
1.3. Related results.
This type of question is well studied; we refer to Ali, Dobriban & Tibshirani [1], Defossez & Bach [4], Jain, Kakade, Kidambi, Netrapalli, Pillutla & Sidford [16], Neu & Rosasco [32], Oymak & Soltanolkotabi [34], Schmidt, Le Roux & Bach [37] and references therein. The connection between Stochastic Gradient Descent applied to Least Squares problems and the Randomized Kaczmarz method was pointed out by Needell, Srebro & Ward [30]. We also mention the papers by Jiao, Jin & Lu [17] and Jin & Lu [18], who studied a similar question and noted that there is energy transfer from large singular values to small singular values.
2. Results
2.1. Main Result.
The main goal of this note is to provide a simple explanation for the rapid initial regularization: the expected size of $\|Ax_{k+1} - b\|^2$ under one step of SGD can be bounded from above by a term involving $\|A^T(Ax_k - b)\|^2$: this is the same type of term except that the matrix has been applied to the residual one more time. This increases the norm of the underlying vector except when $x_k - x$ is mainly a linear combination of singular vectors with small singular values. So, as long as this is not the case, we actually inherit strong decay properties, and this leads to the rapid initial regularization. We now make this precise.
Theorem 1.
Let $A \in \mathbb{R}^{n \times n}$ be invertible and consider solving $Ax = b$ via the stochastic gradient descent method introduced above. Abbreviating
$$ \alpha = \max_{1 \le i \le n} \frac{\|A a_i\|^2}{\|a_i\|^2}, $$
we have
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\|A^T(Ax_k - b)\|^2. $$
The inequality also holds for non-square $A \in \mathbb{R}^{m \times n}$ with $m \ge n$ as long as $Ax = b$ has a unique solution. We note that $\alpha$ is usually quite small compared to $\|A\|_F^2$: for random matrices, we would expect $\alpha$ to be of the order of the average squared row norm $\|A\|_F^2/n$. This is also true for matrices discretizing PDEs via finite elements. The main point of the inequality is that the last term has an additional factor of $A^T$: we can rewrite it as
$$ \|A^T(Ax_k - b)\|^2 = \|A^T A (x_k - x)\|^2 = \sum_{\ell=1}^{n} \sigma_\ell^4\, \langle x_k - x, v_\ell\rangle^2, $$
where $\sigma_\ell$ are the singular values of $A$ and $v_\ell$ the associated (right) singular vectors.
This shows that the presence of large singular vectors in $x_k - x$ forces large decay on $\|Ax_k - b\|^2$. Conversely, once the algorithm has reached the plateau phase (see Fig. 1), the terms
$$ \frac{\alpha}{\|A\|_F^2}\,\|A(x_k - x)\|^2 \qquad \text{and} \qquad \frac{2}{\|A\|_F^2}\,\|A^T A(x_k - x)\|^2 $$
are nearly comparable. Thus, this forces $x_k - x$ to be mostly orthogonal to most singular vectors corresponding to large singular values: that, however, shows that it is mainly comprised of small singular vectors and thus explains why $Ax_k \sim b$ is possible in cases where $x_k$ is far away from $x$. In particular, this suggests why the method could be effective for the problem of finding a vector $y$ so that $Ay \sim b$. One way is to initialize stochastic gradient descent with an arbitrary vector and run it for a while – due to the difference in scales, second-order norms regularizing first-order norms, we observe that $Ax_k$ converges quite rapidly to $b$; whether $x_k$ converges to something sufficiently close to $x$ for the purpose at hand is a different question.
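If one wants to monitor when the plateau phase sets in, a small helper computing the quantity that drives the expected decay in Theorem 1 (in the notation used above; this is a sketch, not code from the paper) could look like this.

```python
import numpy as np

def decay_ratio(A, x, b):
    """Return 2 ||A^T (Ax - b)||^2 / (||A||_F^2 ||Ax - b||^2).

    By the inequality above, the expected relative decay of ||Ax_k - b||^2 in
    one step is at least this ratio minus alpha/||A||_F^2; once the ratio has
    dropped to that order of magnitude, the iteration has plateaued.
    """
    r = A @ x - b
    return 2 * np.linalg.norm(A.T @ r)**2 / (np.linalg.norm(A, 'fro')**2 * np.linalg.norm(r)**2)
```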
2.2. A Sobolev Space Explanation.
An interesting way to illustrate the result is in terms of partial differential equations. Suppose we try to solve $-\Delta u = f$ on some domain $\Omega$. After a suitable discretization, this results in a discrete linear system $Ax = b$, where $A$ is a discretization of the Laplacian $-\Delta$. By an abuse of notation, $x$ denotes a decent approximation of the continuous solution $u$ and $b$ a discretization of the continuous right-hand side $f$. However, we also have more information: since $A$ discretizes the Laplacian, we expect that
$$ \|A(x_k - x)\| \sim \|x_k - x\|_{H^2} \qquad \text{and} \qquad \|A^2(x_k - x)\| \sim \|x_k - x\|_{H^4}. $$
Here, the first term corresponds to the size of $x_k - x$ in the Sobolev space $H^2$ while the second term is the size of $x_k - x$ in the Sobolev space $H^4$. In fact, this is a common way to define discretized approximations of Sobolev spaces, also known as the spectral definition, since they are defined in terms of the spectrum of $A$. Suppose now we compute a sequence of approximations $x_k$ via the method outlined above. Then Theorem 1 can be rephrased as
$$ \mathbb{E}\,\|x_{k+1} - x\|_{H^2}^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|x_k - x\|_{H^2}^2 - \frac{2}{\|A\|_F^2}\,\|x_k - x\|_{H^4}^2. $$
What is of great interest here is that the decay of the error in $H^2$ is driven by the size of the error in $H^4$ (which is usually larger).
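As a concrete illustration of this point of view, one can run the iteration on the standard second-difference discretization of the one-dimensional Laplacian and track the spectral $H^2$- and $H^4$-type norms of the error; grid size, right-hand side and run length below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
h = 1.0 / (n + 1)
# second-difference discretization of -u'' on (0,1) with zero boundary values
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

x_true = rng.standard_normal(n)
b = A @ x_true
x = np.zeros(n)
probs = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2

for k in range(50001):
    if k % 5000 == 0:
        err = x - x_true
        # discrete (spectral) H^2- and H^4-type norms of the error
        print(k, np.linalg.norm(A @ err), np.linalg.norm(A @ (A @ err)))
    i = rng.choice(n, p=probs)
    x += (b[i] - A[i] @ x) / np.sum(A[i]**2) * A[i]
```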
2.3. An Example.
We illustrate this with an example. Choosing $A$ at random (and then, for convenience, normalizing the rows to $\|a_i\| = 1$), we solve $Ax = b$ starting with a random initial vector $x_0$ where each entry is chosen independently from a standardized distribution. We consider both the evolution of $\|Ax_k - b\|$ across multiple runs as well as the size of
$$ \frac{2}{\|A\|_F^2}\,\frac{\|A^T(Ax_k - b)\|^2}{\|Ax_k - b\|^2}, $$
which is the term from our Theorem quantifying the expected decay at each step.
We see in Fig. 3 that, over 3000 iterations, the approximation decays roughly by a fixed factor (with little variation across multiple runs). The bound in the Theorem quantifies the expected decay per time-step which, compounded across 3000 time steps, accounts for the total decay that is observed.
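To connect a per-step bound to the total decay over many iterations, one compounds the factors: if, as a simplifying assumption, the expected decay per step were a fixed factor $1 - \delta$, then after $k$ steps
$$ \mathbb{E}\,\|Ax_k - b\|^2 \lesssim (1 - \delta)^k\,\|Ax_0 - b\|^2 \approx e^{-k\delta}\,\|Ax_0 - b\|^2, $$
so that, with $k = 3000$, a per-step rate $\delta$ corresponds to an overall factor of roughly $e^{-3000\delta}$.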
2.4. Higher Powers.
If the matrix $A$ is symmetric, we can extend the result to higher powers of the matrix.
Theorem 2.
Let $A \in \mathbb{R}^{n \times n}$ be symmetric and invertible. When solving $Ax = b$ via the stochastic gradient descent method outlined above, we have, for any $m \in \mathbb{N}$, with
$$ \alpha_m = \max_{1 \le i \le n} \frac{\|A^m a_i\|^2}{\|a_i\|^2}, $$
the estimate
$$ \mathbb{E}\,\|A^m(x_{k+1} - x)\|^2 \le \|A^m(x_k - x)\|^2 + \frac{\alpha_m}{\|A\|_F^2}\,\|A(x_k - x)\|^2 - \frac{2}{\|A\|_F^2}\,\|A^{m+1}(x_k - x)\|^2. $$
This shows that the same phenomenon does indeed happen at all scales of 'smoothness'. The applicability of the result naturally depends on the growth of $\alpha_m$ in $m$, though, generically, one would not expect this to be badly behaved: a priori, there is no good reason to expect that a row of a matrix happens to be a linear combination of singular vectors associated to large singular values – though, naturally, this can happen (for example, if $A$ has one very large entry on the diagonal).
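Since the usefulness of the estimate hinges on the size of $\alpha_m$, one may wish to compute it for a matrix at hand; a short sketch (using the form of the constant written above) follows.

```python
import numpy as np

def alpha_m(A, m):
    """Return max_i ||A^m a_i||^2 / ||a_i||^2 over the rows a_i of A."""
    Am = np.linalg.matrix_power(A, m)
    cols = Am @ A.T                     # i-th column is A^m a_i
    return np.max(np.sum(cols**2, axis=0) / np.sum(A**2, axis=1))
```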
3. Proofs
3.1. Proof of Theorem 1
Proof.
To simplify the exposition, we introduce the residual
$$ r_k = x_k - x. $$
Plugging in, we obtain that if the $i$-th equation is chosen, then
$$ x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\|a_i\|^2}\, a_i. $$
Since $x$ is the exact solution, we have $b_i = \langle a_i, x\rangle$ and therefore
$$ A r_{k+1} = A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i. $$
Recalling that the $i$-th row is chosen with probability proportional to $\|a_i\|^2$,
$$ \mathbb{E}\left(\|A r_{k+1}\|^2 \,\middle|\, x_k\right) = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2} \left\| A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i \right\|^2. $$
This norm can be explicitly squared out as
$$ \left\| A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i \right\|^2 = \|A r_k\|^2 - 2\,\frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\,\langle A r_k, A a_i\rangle + \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^4}\,\|A a_i\|^2. $$
This allows us to rewrite the summation as
$$ \mathbb{E}\left(\|A r_{k+1}\|^2 \,\middle|\, x_k\right) = \|A r_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A r_k, A a_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A a_i\|^2. $$
We have, using $\sum_{i=1}^{n} a_i a_i^T = A^T A$,
$$ \sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A r_k, A a_i\rangle = \left\langle A r_k,\, A A^T A\, r_k \right\rangle = \|A^T A\, r_k\|^2. $$
The last sum we bound from above via
$$ \sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A a_i\|^2 \le \left(\max_{1 \le i \le n} \frac{\|A a_i\|^2}{\|a_i\|^2}\right) \sum_{i=1}^{n} \langle a_i, r_k\rangle^2 = \alpha\,\|A r_k\|^2. $$
This results in the desired estimate.
∎
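The exact identity obtained before the final upper bound can also be checked numerically; the following sketch compares the expectation over the random choice of row with the closed form derived above (the small dimension and the random data are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)
r = x - np.linalg.solve(A, b)                 # residual r_k = x_k - x

fro2 = np.linalg.norm(A, 'fro')**2
row2 = np.sum(A**2, axis=1)

# expectation of ||A r_{k+1}||^2 over the random choice of row i
expectation = sum(row2[i] / fro2 *
                  np.linalg.norm(A @ r - (A[i] @ r) / row2[i] * (A @ A[i]))**2
                  for i in range(n))

# closed form from the proof (before the final upper bound)
closed_form = (np.linalg.norm(A @ r)**2
               - 2 / fro2 * np.linalg.norm(A.T @ (A @ r))**2
               + 1 / fro2 * sum((A[i] @ r)**2 / row2[i] * np.linalg.norm(A @ A[i])**2
                                for i in range(n)))

print(expectation, closed_form)               # the two values agree up to rounding
```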
3.2. Proof of Theorem 2
Proof.
We again reduce the problem to the study of the residual
$$ r_k = x_k - x. $$
When looking at integer powers, we observe that, by the same reasoning as above,
$$ \mathbb{E}\left(\|A^m r_{k+1}\|^2 \,\middle|\, x_k\right) = \|A^m r_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^m r_k, A^m a_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A^m a_i\|^2. $$
The first term is easy and the third term can, as before, be bounded by
$$ \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A^m a_i\|^2 \le \frac{\alpha_m}{\|A\|_F^2}\,\|A r_k\|^2. $$
It remains to understand the second term: here, we can use the symmetry of the matrix to write
$$ \sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^m r_k, A^m a_i\rangle = \left\langle A^m r_k,\, A^m A^T A\, r_k\right\rangle = \left\langle A^m r_k,\, A^{m+2} r_k\right\rangle = \|A^{m+1} r_k\|^2. $$
∎
References
- [1] A. Ali, E. Dobriban and R. Tibshirani, The Implicit Regularization of Stochastic Gradient Flow for Least Squares, arXiv:2003.07802
- [2] Z.-Z. Bai and W.-T. Wu, On convergence rate of the randomized Kaczmarz method, Linear Algebra and its Applications 553 (2018), p. 252–269
- [3] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier, and T. Strohmer, New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. Proc. SPIE: Visual Communications and Image Processing, p. 299–310, 1992.
- [4] A. Defossez and F. Bach, Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions, AISTATS, 2015
- [5] F. Deutsch. Rate of convergence of the method of alternating projections. Internat. Schriftenreihe Numer. Math., 72 (1985), p.96–107.
- [6] F. Deutsch and H. Hundal. The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl., 205 (1997), p. 381–405.
- [7] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algorithms, 58 (2011):p. 163–177.
- [8] T. Elfving, P.-C. Hansen and T. Nikazad, Semi-convergence properties of Kaczmarz’s method, Inverse Problems 30 (2014), 055007
- [9] A. Galantai. On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl., 310 (2005), p. 30–44.
- [10] R. Gordon, R. Bender and G. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography, Journal of Theoretical Biology 29 (1970): p. 471–481
- [11] D. Gordon, A derandomization approach to recovering bandlimited signals across a wide range of random sampling rates, Numer. Algor. 77 (2018): p. 1141–1157
- [12] R. M. Gower, D. Molitor, J. Moorman and D. Needell, Adaptive Sketch-and-Project Methods for Solving Linear Systems, arXiv:1909.03604
- [13] R. M. Gower and P. Richtarik. Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl., 36 (2015):1660–1690.
- [14] G.T. Herman. Image reconstruction from projections. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York, 1980. The fundamentals of computerized tomography, Computer Science and Applied Mathematics.
- [15] G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Medical Imaging, 12 (1993): p. 600–609.
- [16] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, V. Pillutla and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares). In 37th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (2018).
- [17] Y. Jiao, B. Jin and X. Lu, Preasymptotic Convergence of Randomized Kaczmarz Method, Inverse Problems 33 (2017), article: 125012
- [18] B. Jin, and X. Lu. On the regularizing property of stochastic gradient descent. Inverse Problems 35 (2018): 015004.
- [19] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l'Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35 (1937), p. 355–357
- [20] Y.-T. Lee and A. Sidford, Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, FOCS 2013
- [21] D. Leventhal and A. S. Lewis, Randomized Methods for Linear Constraints: Convergence Rates and Conditioning, Mathematics of Operation Research, 35 (2010), p. 641–654
- [22] J. Liu and S. Wright, An accelerated randomized Kaczmarz algorithm, Math. Comp. 85 (2016), p. 153-178
- [23] A. Ma and D. Needell, Stochastic gradient descent for linear systems with missing data. Numer. Math. Theory Methods Appl. 12 (2019), p. 1–20.
- [24] A. Ma, D. Needell and A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. 36 (2015), p. 1590–1604
- [25] J. Moorman, T. Tu, D. Molitor and D. Needell, Randomized Kaczmarz with Averaging, arXiv:2002.04126
- [26] F. Natterer. The Mathematics of Computerized Tomography. Wiley, New York, 1986.
- [27] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50 (2010): p. 395–403.
- [28] D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications 441 (2014), p. 199–221
- [29] D. Needell and R. Ward, Two-Subspace Projection Method for Coherent Overdetermined Systems, J. Fourier Anal Appl 19 (2013), p. 256–269.
- [30] D. Needell, R. Ward and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems (2014), p. 1017–1025
- [31] D. Needell, R. Zhao and A. Zouzias, Randomized block Kaczmarz method with projection for solving least squares, Linear Algebra and its Applications 484 (2015), p. 322–343
- [32] G. Neu and L. Rosasco, Iterate Averaging as Regularization for Stochastic Gradient Descent, Proceedings of Machine Learning Research 75 (2018), p. 1–21.
- [33] J. Nutini, B. Sepehry, I. Laradji, M. Schmidt, H. Koepke, A. Virani, Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, The 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
- [34] S. Oymak and M. Soltanolkotabi, Overparameterized nonlinear learning: Gradient descent takes the shortest path? In International Conference on Machine Learning (2019), pp. 4951–4960.
- [35] E. L. Piccolomini, and F. Zama. The conjugate gradient regularization method in computed tomography problems. Applied Mathematics and Computation 102 (1999): 87–99.
- [36] C. Popa, Convergence rates for Kaczmarz-type algorithms, Numer. Algor. 79 (2018): p. 1–17
- [37] M. Schmidt, N. Le Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., Ser. A 162 (2017), p. 83–112
- [38] K.M. Sezan and H. Stark. Applications of convex projection theory to image recovery in tomography and related areas. In H. Stark, editor, Image Recovery: Theory and application, pages 415–462. Acad. Press, 1987
- [39] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978
- [40] S. Steinerberger, A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910
- [41] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm for linear systems with exponential convergence, Journal of Fourier Analysis and Applications 15 (2009): p. 262–278
- [42] Y. Tan and R. Vershynin, Phase retrieval via randomized Kaczmarz: theoretical guarantees, Information and Inference: A Journal of the IMA 8 (2019), p. 97–123
- [43] J.-J. Zhang, A new greedy Kaczmarz algorithm for the solution of very large linear systems, Applied Mathematics Letters 91 (2019), p. 207–212
- [44] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34: p. 773–793, 2013.