On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares
Abstract.
We study the behavior of stochastic gradient descent applied to the least-squares problem $Ax = b$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $\alpha$ depending (mildly) on $A$ such that
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\|A^T(Ax_k - b)\|^2. $$
This is a curious inequality: the last term has one more matrix applied to the residual than the remaining terms. If $x_k - x$ is mainly comprised of singular vectors associated to large singular values, stochastic gradient descent leads to a quick regularization. For symmetric matrices, this inequality has an extension to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values acts as a smoothing mechanism.
Key words and phrases:
Stochastic Gradient Descent, Kaczmarz method, Least Squares
2010 Mathematics Subject Classification:
65F10, 65K10, 65K15, 90C06, 93E24
1. Introduction
1.1. Stochastic Gradient Descent.
In this paper, we consider the finite-dimensional linear inverse problem
$$ Ax = b, $$
where $A \in \mathbb{R}^{n \times n}$ is an invertible matrix, $x \in \mathbb{R}^n$ is the (unknown) signal of interest and $b \in \mathbb{R}^n$ is a given right-hand side. Throughout this paper, we will use $a_1, \dots, a_n \in \mathbb{R}^n$ to denote the rows of $A$. Equivalently, we will try to solve the problem
$$ \min_{x \in \mathbb{R}^n}\ \|Ax - b\|^2. $$
Following Needell, Srebro & Ward [30], we can interpret this problem as
$$ \min_{x \in \mathbb{R}^n}\ \sum_{i=1}^{n} f_i(x), \qquad \text{where} \quad f_i(x) = \frac{1}{2}\left(\langle a_i, x\rangle - b_i\right)^2. $$
The Lipschitz constant of $\nabla f_i$ is $\|a_i\|^2$, which motivates the following basic form of stochastic gradient descent: pick one of the functions $f_i$ with likelihood proportional to its Lipschitz constant and then do a gradient descent step on this much simpler function, resulting in the update rule
$$ x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\|a_i\|^2}\, a_i. $$
This is also known as the Algebraic Reconstruction Technique (ART) in computerized tomography [10, 14, 15, 26], the Projection onto Convex Sets method [3, 5, 6, 9, 38] and the Randomized Kaczmarz method [2, 7, 8, 11, 13, 12, 20, 21, 22, 24, 25, 27, 28, 29, 30, 31, 33, 36, 39, 40, 41, 42, 43, 44]. Strohmer & Vershynin [41] showed that
$$ \mathbb{E}\,\|x_k - x\|^2 \le \left(1 - \frac{1}{\|A\|_F^2\,\|A^{-1}\|^2}\right)^k \|x_0 - x\|^2, $$
where $\|A\|_F$ is the Frobenius norm and $\|A^{-1}\|$ the operator norm of the inverse. In practice, the algorithm often converges a lot faster initially; this was studied in [17, 18, 39]. In particular, [39] obtains an identity describing the behavior with respect to the singular values, showing that singular vectors associated to large singular values are expected to undergo a more rapid decay. Motivated by this insight, we provide rigorous bounds that quantify this energy cascade from large singular values to small singular values by identifying an interesting inequality for SGD when applied to Least Squares.
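For concreteness, a minimal sketch of the sampling and update rule above might look as follows; the function name and the use of NumPy are illustrative choices and not part of the paper.

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, steps, rng=None):
    """SGD / randomized Kaczmarz iteration for Ax = b.

    At each step a row a_i is sampled with probability ||a_i||^2 / ||A||_F^2
    and x is projected onto the hyperplane <a_i, x> = b_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    row_norms_sq = np.sum(A**2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()      # ||a_i||^2 / ||A||_F^2
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        i = rng.choice(len(b), p=probs)
        a_i = A[i]
        # gradient step on f_i with step size 1/||a_i||^2 = projection onto <a_i, x> = b_i
        x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i
    return x
```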
1.2. A Motivating Example.
We discuss a simple example that exemplifies the phenomenon that we are interested in. Let us take a matrix $A$ by picking each entry independently at random and then normalizing the rows to $\|a_i\| = 1$. The right-hand side is a fixed vector $b$ and we initialize with a vector $x_0$.
The picture tells a very interesting story: the error $\|Ax_k - b\|$ decays quite rapidly initially before stabilizing in a certain regime. Moreover, for the example shown in Figure 1, $x_k$ is not even close to the true solution $x$ – nonetheless, the approximation of $Ax_k$ to $b$ is quite good. This leaves us with a curious conundrum: we have a good approximation of the true solution in the sense that $Ax_k \sim b$ even though $x_k$ is not very close to $x$. One way this can be achieved is if $x_k - x$ is mainly a linear combination of small singular vectors of $A$. This is related to the following result recently obtained by the author.
Theorem ([39]).
Let $v_\ell$ be a (right) singular vector of $A$ associated to the singular value $\sigma_\ell$. Then, for the sequence $(x_k)_{k=0}^{\infty}$ obtained in this randomized manner,
$$ \mathbb{E}\,\langle x_k - x, v_\ell\rangle = \left(1 - \frac{\sigma_\ell^2}{\|A\|_F^2}\right)^k \langle x_0 - x, v_\ell\rangle. $$
Here, $\|A\|_F$ denotes the Frobenius norm. This shows that we expect $x_k - x$ to be indeed mainly a linear combination of singular vectors associated to small singular values, since those are the ones undergoing the slowest decay. It also mirrors the bound obtained by Strohmer & Vershynin [41] since $\sigma_\ell \ge \sigma_{\min} = \|A^{-1}\|^{-1}$ and thus
$$ 1 - \frac{\sigma_\ell^2}{\|A\|_F^2} \le 1 - \frac{1}{\|A\|_F^2\,\|A^{-1}\|^2}. $$
While interesting in itself, this identity does not fully explain the behavior shown above: it only holds in expectation and comes with no control of the variance. Moreover, the inner product $\langle x_k - x, v_\ell\rangle$ does initially undergo some fluctuations. Taking the same type of matrix as above, we see an example of such fluctuations in Fig. 2.
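A small experiment in the spirit of Figures 1 and 2 can be set up along the following lines; the dimension, right-hand side and initialization are illustrative assumptions and not necessarily the ones used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)     # normalize rows to unit length
x_true = rng.standard_normal(n)                   # illustrative choice
b = A @ x_true
x = np.zeros(n)                                   # illustrative initialization

_, _, Vt = np.linalg.svd(A)                       # rows of Vt are right singular vectors
probs = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2

for k in range(20001):
    if k % 2000 == 0:
        print(k,
              np.linalg.norm(A @ x - b),          # residual: decays quickly at first
              np.linalg.norm(x - x_true),         # error: can stay large much longer
              abs(Vt[0] @ (x - x_true)),          # component along a large singular vector
              abs(Vt[-1] @ (x - x_true)))         # component along a small singular vector
    i = rng.choice(n, p=probs)
    x += (b[i] - A[i] @ x) / np.sum(A[i]**2) * A[i]
```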
1.3. Related results.
This type of question is well studied; we refer to Ali, Dobriban & Tibshirani [1], Defossez & Bach [4], Jain, Kakade, Kidambi, Netrapalli, Pillutla & Sidford [16], Neu & Rosasco [32], Oymak & Soltanolkotabi [34], Schmidt, Le Roux & Bach [37] and references therein. The connection between Stochastic Gradient Descent applied to Least Squares problems and the Randomized Kaczmarz method was pointed out by Needell, Srebro & Ward [30]. We also mention the papers by Jiao, Jin & Lu [17] and Jin & Lu [18], who studied a similar question and noted that there is energy transfer from large singular values to small singular values.
2. Results
2.1. Main Result.
The main goal of this note is to provide a simple explanation for the rapid initial regularization: the expected size of $\|Ax_{k+1} - b\|^2$ under one step of SGD can be bounded from above by a term involving $\|A^T(Ax_k - b)\|^2$: this is the same type of term except that the matrix has been applied to the residual one more time. This increases the norm of the underlying vector except when $x_k - x$ is mainly a linear combination of singular vectors with small singular values. So, as long as this is not the case, we actually inherit strong decay properties, and this leads to the rapid initial regularization. We now make this precise.
Theorem 1.
Let $A \in \mathbb{R}^{n \times n}$ be invertible and consider solving $Ax = b$ via the stochastic gradient descent method introduced above. Abbreviating
$$ \alpha = \max_{1 \le i \le n} \frac{\|A a_i\|^2}{\|a_i\|^2}, $$
we have
$$ \mathbb{E}\,\|Ax_{k+1} - b\|^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|Ax_k - b\|^2 - \frac{2}{\|A\|_F^2}\,\|A^T(Ax_k - b)\|^2. $$
The inequality also holds for non-square $A \in \mathbb{R}^{m \times n}$ with $m \ge n$ as long as $Ax = b$ has a unique solution. We note that $\alpha$ is usually quite small compared to $\|A\|_F^2$: for random matrices, we would expect $\alpha$ to be of the order of the average squared row norm $\|A\|_F^2/n$. This is also true for matrices discretizing PDEs via finite elements. The main point of the inequality is that the last term has an additional factor of $A^T$: we can rewrite it as
$$ \|A^T(Ax_k - b)\|^2 = \|A^T A (x_k - x)\|^2 = \sum_{\ell=1}^{n} \sigma_\ell^4\, \langle x_k - x, v_\ell\rangle^2, $$
where $\sigma_\ell$ are the singular values of $A$ and $v_\ell$ the associated (right) singular vectors.
This shows that the presence of large singular vectors in $x_k - x$ forces large decay on $\|Ax_k - b\|^2$. Conversely, once the algorithm has reached the plateau phase (see Fig. 1), the terms
$$ \frac{\alpha}{\|A\|_F^2}\,\|A(x_k - x)\|^2 \qquad \text{and} \qquad \frac{2}{\|A\|_F^2}\,\|A^T A(x_k - x)\|^2 $$
are nearly comparable. Thus, this forces $x_k - x$ to be mostly orthogonal to most singular vectors corresponding to large singular values: that, however, shows that it is mainly comprised of small singular vectors and thus explains why $Ax_k \sim b$ is possible in cases where $x_k$ is far away from $x$. In particular, this suggests why the method could be effective for the problem of finding a vector $y$ so that $Ay \sim b$. One way is to initialize stochastic gradient descent with an arbitrary vector and run it for a while – due to the difference in scales, second-order norms regularizing first-order norms, we observe that $Ax_k$ converges quite rapidly to $b$; whether $x_k$ converges to something sufficiently close to $x$ for the purpose at hand is a different question.
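If one wants to monitor when the plateau phase sets in, a small helper computing the quantity that drives the expected decay in Theorem 1 (in the notation used above; this is a sketch, not code from the paper) could look like this.

```python
import numpy as np

def decay_ratio(A, x, b):
    """Return 2 ||A^T (Ax - b)||^2 / (||A||_F^2 ||Ax - b||^2).

    By the inequality above, the expected relative decay of ||Ax_k - b||^2 in
    one step is at least this ratio minus alpha/||A||_F^2; once the ratio has
    dropped to that order of magnitude, the iteration has plateaued.
    """
    r = A @ x - b
    return 2 * np.linalg.norm(A.T @ r)**2 / (np.linalg.norm(A, 'fro')**2 * np.linalg.norm(r)**2)
```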
2.2. A Sobolev Space Explanation.
An interesting way to illustrate the result is in terms of partial differential equations. Suppose we try to solve $-\Delta u = f$ on some domain $\Omega$. After a suitable discretization, this results in a discrete linear system $Ax = b$, where $A$ is a discretization of the Laplacian $-\Delta$. By an abuse of notation, $x$ denotes a decent approximation of the continuous solution $u$ and $b$ a discretization of the continuous right-hand side $f$. However, we also have more information: since $A$ discretizes the Laplacian, we expect that
$$ \|A(x_k - x)\| \sim \|x_k - x\|_{H^2} \qquad \text{and} \qquad \|A^2(x_k - x)\| \sim \|x_k - x\|_{H^4}. $$
Here, the first term corresponds to the size of $x_k - x$ in the Sobolev space $H^2$ while the second term is the size of $x_k - x$ in the Sobolev space $H^4$. In fact, this is a common way to define discretized approximations of Sobolev spaces, also known as the spectral definition, since they are defined in terms of the spectrum of $A$. Suppose now we compute a sequence of approximations $x_k$ via the method outlined above. Then Theorem 1 can be rephrased as
$$ \mathbb{E}\,\|x_{k+1} - x\|_{H^2}^2 \le \left(1 + \frac{\alpha}{\|A\|_F^2}\right)\|x_k - x\|_{H^2}^2 - \frac{2}{\|A\|_F^2}\,\|x_k - x\|_{H^4}^2. $$
What is of great interest here is that the decay of the error in $H^2$ is driven by the size of the error in $H^4$ (which is usually larger).
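As a concrete illustration of this point of view, one can run the iteration on the standard second-difference discretization of the one-dimensional Laplacian and track the spectral $H^2$- and $H^4$-type norms of the error; grid size, right-hand side and run length below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
h = 1.0 / (n + 1)
# second-difference discretization of -u'' on (0,1) with zero boundary values
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

x_true = rng.standard_normal(n)
b = A @ x_true
x = np.zeros(n)
probs = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2

for k in range(50001):
    if k % 5000 == 0:
        err = x - x_true
        # discrete (spectral) H^2- and H^4-type norms of the error
        print(k, np.linalg.norm(A @ err), np.linalg.norm(A @ (A @ err)))
    i = rng.choice(n, p=probs)
    x += (b[i] - A[i] @ x) / np.sum(A[i]**2) * A[i]
```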
2.3. An Example.
We illustrate this with an example. Choosing $A$ at random (and then, for convenience, normalizing the rows to $\|a_i\| = 1$), we solve $Ax = b$ starting with a random initial vector $x_0$ where each entry is chosen independently from a standardized distribution. We consider both the evolution of $\|Ax_k - b\|$ across multiple runs as well as the size of
$$ \frac{2}{\|A\|_F^2}\,\frac{\|A^T(Ax_k - b)\|^2}{\|Ax_k - b\|^2}, $$
which is the term from our Theorem quantifying the expected decay at each step.
We see in Fig. 3 that, over 3000 iterations, the approximation decays roughly by a fixed factor (with little variation across multiple runs). The bound in the Theorem quantifies the expected decay per time-step which, compounded across 3000 time steps, accounts for the total decay that is observed.
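To connect a per-step bound to the total decay over many iterations, one compounds the factors: if, as a simplifying assumption, the expected decay per step were a fixed factor $1 - \delta$, then after $k$ steps
$$ \mathbb{E}\,\|Ax_k - b\|^2 \lesssim (1 - \delta)^k\,\|Ax_0 - b\|^2 \approx e^{-k\delta}\,\|Ax_0 - b\|^2, $$
so that, with $k = 3000$, a per-step rate $\delta$ corresponds to an overall factor of roughly $e^{-3000\delta}$.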
2.4. Higher Powers.
If the matrix $A$ is symmetric, we can extend the result to higher powers of the matrix.
Theorem 2.
Let $A \in \mathbb{R}^{n \times n}$ be symmetric and invertible. When solving $Ax = b$ via the stochastic gradient descent method outlined above, we have, for any $m \in \mathbb{N}$, with
$$ \alpha_m = \max_{1 \le i \le n} \frac{\|A^m a_i\|^2}{\|a_i\|^2}, $$
the estimate
$$ \mathbb{E}\,\|A^m(x_{k+1} - x)\|^2 \le \|A^m(x_k - x)\|^2 + \frac{\alpha_m}{\|A\|_F^2}\,\|A(x_k - x)\|^2 - \frac{2}{\|A\|_F^2}\,\|A^{m+1}(x_k - x)\|^2. $$
This shows that the same phenomenon does indeed happen at all scales of 'smoothness'. The applicability of the result naturally depends on the growth of $\alpha_m$ in $m$, though, generically, one would not expect this to be badly behaved: a priori, there is no good reason to expect that a row of a matrix happens to be a linear combination of singular vectors associated to large singular values – though, naturally, this can happen (for example, if $A$ has one very large entry on the diagonal).
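Since the usefulness of the estimate hinges on the size of $\alpha_m$, one may wish to compute it for a matrix at hand; a short sketch (using the form of the constant written above) follows.

```python
import numpy as np

def alpha_m(A, m):
    """Return max_i ||A^m a_i||^2 / ||a_i||^2 over the rows a_i of A."""
    Am = np.linalg.matrix_power(A, m)
    cols = Am @ A.T                     # i-th column is A^m a_i
    return np.max(np.sum(cols**2, axis=0) / np.sum(A**2, axis=1))
```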
3. Proofs
3.1. Proof of Theorem 1
Proof.
To simplify the exposition, we introduce the residual
$$ r_k = x_k - x. $$
Plugging in, we obtain that if the $i$-th equation is chosen, then
$$ x_{k+1} = x_k + \frac{b_i - \langle a_i, x_k\rangle}{\|a_i\|^2}\, a_i. $$
Since $x$ is the exact solution, we have $b_i = \langle a_i, x\rangle$ and therefore
$$ A r_{k+1} = A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i. $$
Recalling that the $i$-th row is chosen with probability proportional to $\|a_i\|^2$,
$$ \mathbb{E}\left(\|A r_{k+1}\|^2 \,\middle|\, x_k\right) = \sum_{i=1}^{n} \frac{\|a_i\|^2}{\|A\|_F^2} \left\| A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i \right\|^2. $$
This norm can be explicitly squared out as
$$ \left\| A r_k - \frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\, A a_i \right\|^2 = \|A r_k\|^2 - 2\,\frac{\langle a_i, r_k\rangle}{\|a_i\|^2}\,\langle A r_k, A a_i\rangle + \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^4}\,\|A a_i\|^2. $$
This allows us to rewrite the summation as
$$ \mathbb{E}\left(\|A r_{k+1}\|^2 \,\middle|\, x_k\right) = \|A r_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A r_k, A a_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A a_i\|^2. $$
We have, using $\sum_{i=1}^{n} a_i a_i^T = A^T A$,
$$ \sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A r_k, A a_i\rangle = \left\langle A r_k,\, A A^T A\, r_k \right\rangle = \|A^T A\, r_k\|^2. $$
The last sum we bound from above via
$$ \sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A a_i\|^2 \le \left(\max_{1 \le i \le n} \frac{\|A a_i\|^2}{\|a_i\|^2}\right) \sum_{i=1}^{n} \langle a_i, r_k\rangle^2 = \alpha\,\|A r_k\|^2. $$
This results in the desired estimate.
∎
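The exact identity obtained before the final upper bound can also be checked numerically; the following sketch compares the expectation over the random choice of row with the closed form derived above (the small dimension and the random data are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)
r = x - np.linalg.solve(A, b)                 # residual r_k = x_k - x

fro2 = np.linalg.norm(A, 'fro')**2
row2 = np.sum(A**2, axis=1)

# expectation of ||A r_{k+1}||^2 over the random choice of row i
expectation = sum(row2[i] / fro2 *
                  np.linalg.norm(A @ r - (A[i] @ r) / row2[i] * (A @ A[i]))**2
                  for i in range(n))

# closed form from the proof (before the final upper bound)
closed_form = (np.linalg.norm(A @ r)**2
               - 2 / fro2 * np.linalg.norm(A.T @ (A @ r))**2
               + 1 / fro2 * sum((A[i] @ r)**2 / row2[i] * np.linalg.norm(A @ A[i])**2
                                for i in range(n)))

print(expectation, closed_form)               # the two values agree up to rounding
```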
3.2. Proof of Theorem 2
Proof.
We again reduce the problem to the study of the residual
$$ r_k = x_k - x. $$
When looking at integer powers, we observe that, by the same reasoning as above,
$$ \mathbb{E}\left(\|A^m r_{k+1}\|^2 \,\middle|\, x_k\right) = \|A^m r_k\|^2 - \frac{2}{\|A\|_F^2}\sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^m r_k, A^m a_i\rangle + \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A^m a_i\|^2. $$
The first term is easy and the third term can, as before, be bounded by
$$ \frac{1}{\|A\|_F^2}\sum_{i=1}^{n} \frac{\langle a_i, r_k\rangle^2}{\|a_i\|^2}\,\|A^m a_i\|^2 \le \frac{\alpha_m}{\|A\|_F^2}\,\|A r_k\|^2. $$
It remains to understand the second term: here, we can use the symmetry of the matrix to write
$$ \sum_{i=1}^{n} \langle a_i, r_k\rangle\,\langle A^m r_k, A^m a_i\rangle = \left\langle A^m r_k,\, A^m A^T A\, r_k\right\rangle = \left\langle A^m r_k,\, A^{m+2} r_k\right\rangle = \|A^{m+1} r_k\|^2. $$
∎
References
- [1] A. Ali, E. Dobriban and R. Tibshirani, The Implicit Regularization of Stochastic Gradient Flow for Least Squares, arXiv:2003.07802
- [2] Z.-Z. Bai and W.-T. Wu, On convergence rate of the randomized Kaczmarz method, Linear Algebra and its Applications 553 (2018), p. 252–269
- [3] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier, and T. Strohmer, New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. Proc. SPIE: Visual Communications and Image Processing, p. 299–310, 1992.
- [4] A. Defossez and F. Bach, Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions, AISTATS, 2015
- [5] F. Deutsch. Rate of convergence of the method of alternating projections. Internat. Schriftenreihe Numer. Math., 72 (1985), p.96–107.
- [6] F. Deutsch and H. Hundal. The rate of convergence for the method of alternating projections, II. J. Math. Anal. Appl., 205 (1997), p. 381–405.
- [7] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer. Algorithms, 58 (2011):p. 163–177.
- [8] T. Elfving, P.-C. Hansen and T. Nikazad, Semi-convergence properties of Kaczmarz’s method, Inverse Problems 30 (2014), 055007
- [9] A. Galantai. On the rate of convergence of the alternating projection method in finite dimensional spaces. J. Math. Anal. Appl., 310 (2005), p. 30–44.
- [10] R. Gordon, R. Bender and G. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and x-ray photography, Journal of Theoretical Biology 29 (1970): p. 471–481
- [11] D. Gordon, A derandomization approach to recovering bandlimited signals across a wide range of random sampling rates, Numer. Algor. 77 (2018): p. 1141–1157
- [12] R. M. Gower, D. Molitor, J. Moorman and D. Needell, Adaptive Sketch-and-Project Methods for Solving Linear Systems, arXiv:1909.03604
- [13] R. M. Gower and P. Richtarik. Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl., 36 (2015):1660–1690.
- [14] G.T. Herman. Image reconstruction from projections. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York, 1980. The fundamentals of computerized tomography, Computer Science and Applied Mathematics.
- [15] G. T. Herman and L. B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Medical Imaging, 12 (1993): p. 600–609.
- [16] P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, V. Pillutla and A. Sidford, A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares). In 37th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (2018).
- [17] Y. Jiao, B. Jin and X. Lu, Preasymptotic Convergence of Randomized Kaczmarz Method, Inverse Problems 33 (2017), article: 125012
- [18] B. Jin, and X. Lu. On the regularizing property of stochastic gradient descent. Inverse Problems 35 (2018): 015004.
- [19] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l'Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35 (1937), p. 355–357
- [20] Y.-T. Lee and A. Sidford, Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, FOCS 2013
- [21] D. Leventhal and A. S. Lewis, Randomized Methods for Linear Constraints: Convergence Rates and Conditioning, Mathematics of Operation Research, 35 (2010), p. 641–654
- [22] J. Liu and S. Wright, An accelerated randomized Kaczmarz algorithm, Math. Comp. 85 (2016), p. 153-178
- [23] A. Ma and D. Needell, Stochastic gradient descent for linear systems with missing data. Numer. Math. Theory Methods Appl. 12 (2019), p. 1–20.
- [24] A. Ma, D. Needell and A. Ramdas, Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods, SIAM J. Matrix Anal. Appl. 36 (2015), p. 1590–1604
- [25] J. Moorman, T. Tu, D. Molitor and D. Needell, Randomized Kaczmarz with Averaging, arXiv:2002.04126
- [26] F. Natterer. The Mathematics of Computerized Tomography. Wiley, New York, 1986.
- [27] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT Numerical Mathematics, 50 (2010): p. 395–403.
- [28] D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications 441 (2014), p. 199–221
- [29] D. Needell and R. Ward, Two-Subspace Projection Method for Coherent Overdetermined Systems, J. Fourier Anal Appl 19 (2013), p. 256–269.
- [30] D. Needell, R. Ward and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, Advances in Neural Information Processing Systems (2014), p. 1017–1025
- [31] D. Needell, R. Zhao and A. Zouzias, Randomized block Kaczmarz method with projection for solving least squares, Linear Algebra and its Applications 484 (2015), p. 322–343
- [32] G. Neu and L. Rosasco, Iterate Averaging as Regularization for Stochastic Gradient Descent, Proceedings of Machine Learning Research 75 (2018), p. 1–21.
- [33] J. Nutini, B. Sepehry, I. Laradji, M. Schmidt, H. Koepke, A. Virani, Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph, The 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
- [34] S. Oymak and M. Soltanolkotabi, Overparameterized nonlinear learning: Gradient descent takes the shortest path? In International Conference on Machine Learning (2019), pp. 4951–4960.
- [35] E. L. Piccolomini, and F. Zama. The conjugate gradient regularization method in computed tomography problems. Applied Mathematics and Computation 102 (1999): 87–99.
- [36] C. Popa, Convergence rates for Kaczmarz-type algorithms, Numer. Algor. 79 (2018): p. 1–17
- [37] M. Schmidt, N. Le Roux and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., Ser. A 162 (2017), p. 83–112
- [38] K.M. Sezan and H. Stark. Applications of convex projection theory to image recovery in tomography and related areas. In H. Stark, editor, Image Recovery: Theory and application, pages 415–462. Acad. Press, 1987
- [39] S. Steinerberger, Randomized Kaczmarz converges along small singular vectors, arXiv:2006.16978
- [40] S. Steinerberger, A Weighted Randomized Kaczmarz Method for Solving Linear Systems, arXiv:2007.02910
- [41] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm for linear systems with exponential convergence, Journal of Fourier Analysis and Applications 15 (2009): p. 262–278
- [42] Y. Tan and R. Vershynin, Phase retrieval via randomized Kaczmarz: theoretical guarantees, Information and Inference: A Journal of the IMA 8 (2019), p. 97–123
- [43] J.-J. Zhang, A new greedy Kaczmarz algorithm for the solution of very large linear systems, Applied Mathematics Letters 91 (2019), p. 207–212
- [44] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34: p. 773–793, 2013.