
A Simple Convergence Proof of Adam and Adagrad

Alexandre Défossez, Meta AI (defossez@meta.com)
Léon Bottou, Meta AI
Francis Bach, INRIA / PSL
Nicolas Usunier, Meta AI
Abstract

We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters, Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't converge, however, and just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy ball momentum decay rate $\beta_{1}$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_{1})^{-3})$ to $O((1-\beta_{1})^{-1})$.

1 Introduction

First-order methods with adaptive step sizes have proved useful in many fields of machine learning, be it for sparse optimization (Duchi et al., 2013), tensor factorization (Lacroix et al., 2018) or deep learning (Goodfellow et al., 2016). Duchi et al. (2011) introduced Adagrad, which rescales each coordinate by a sum of squared past gradient values. While Adagrad proved effective for sparse optimization (Duchi et al., 2013), experiments showed that it under-performed when applied to deep learning (Wilson et al., 2017). RMSProp (Tieleman & Hinton, 2012) proposed an exponential moving average instead of a cumulative sum to address this. Kingma & Ba (2015) developed Adam, one of the most popular adaptive methods in deep learning, which builds upon RMSProp and adds corrective terms at the beginning of training, together with heavy-ball style momentum.

In the online convex optimization setting, Duchi et al. (2011) showed that Adagrad achieves optimal regret. Kingma & Ba (2015) provided a similar proof for Adam when using a decreasing overall step size, although this proof was later shown to be incorrect by Reddi et al. (2018), who introduced AMSGrad as a convergent alternative. Ward et al. (2019) proved that Adagrad also converges to a critical point for non-convex objectives with a rate $O(\ln(N)/\sqrt{N})$ when using a scalar adaptive step size, instead of a diagonal one. Zou et al. (2019b) extended this proof to the vector case, while Zou et al. (2019a) displayed a bound for Adam, showing convergence when the decay of the exponential moving average scales as $1-1/N$ and the learning rate as $1/\sqrt{N}$.

In this paper, we present a simplified and unified proof of convergence to a critical point for Adagrad and Adam for stochastic non-convex smooth optimization. We assume that the objective function is lower bounded and smooth, and that the stochastic gradients are almost surely bounded. We recover the standard $O(\ln(N)/\sqrt{N})$ convergence rate for Adagrad for all step sizes, and the same rate for Adam with an appropriate choice of the step sizes and decay parameters; in particular, Adam can converge without using the AMSGrad variant. Compared to previous work, our bound significantly improves the dependency on the momentum parameter $\beta_{1}$. The best known bounds for Adagrad and Adam are respectively in $O((1-\beta_{1})^{-3})$ and $O((1-\beta_{1})^{-5})$ (see Section 3), while our result is in $O((1-\beta_{1})^{-1})$ for both algorithms. This improvement is a step toward understanding the practical efficiency of heavy-ball momentum.

Outline.

The precise setting and assumptions are stated in the next section, and previous work is then described in Section 3. The main theorems are presented in Section 4, followed by a full proof for the case without momentum in Section 5. The proof of convergence with momentum is deferred to the supplementary material, Section A. Finally, we compare our bounds with experimental results, on both toy and real-life problems, in Section 6.

2 Setup

2.1 Notation

Let $d\in\mathbb{N}$ be the dimension of the problem (i.e. the number of parameters of the function to optimize) and take $[d]=\{1,2,\ldots,d\}$. Given a function $h:\mathbb{R}^{d}\rightarrow\mathbb{R}$, we denote by $\nabla h$ its gradient and by $\nabla_{i}h$ the $i$-th component of the gradient. We use a small constant $\epsilon$, e.g. $10^{-8}$, for numerical stability. Given a sequence $(u_{n})_{n\in\mathbb{N}}$ with $u_{n}\in\mathbb{R}^{d}$ for all $n\in\mathbb{N}$, we denote by $u_{n,i}$, for $n\in\mathbb{N}$ and $i\in[d]$, the $i$-th component of the $n$-th element of the sequence.

We want to optimize a function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}$. We assume there exists a random function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ such that $\mathbb{E}\left[\nabla f(x)\right]=\nabla F(x)$ for all $x\in\mathbb{R}^{d}$, and that we have access to an oracle providing i.i.d. samples $(f_{n})_{n\in\mathbb{N}^{*}}$. We denote by $\mathbb{E}_{n-1}\left[\cdot\right]$ the conditional expectation knowing $f_{1},\ldots,f_{n-1}$. In machine learning, $x$ typically represents the weights of a linear or deep model, $f$ represents the loss from individual training examples or minibatches, and $F$ is the full training objective function. The goal is to find a critical point of $F$.

2.2 Adaptive methods

We study both Adagrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2015) using a unified formulation. We assume we have $0<\beta_{2}\leq 1$, $0\leq\beta_{1}<\beta_{2}$, and a non-negative sequence $(\alpha_{n})_{n\in\mathbb{N}^{*}}$. We define three vectors $m_{n},v_{n},x_{n}\in\mathbb{R}^{d}$ iteratively. Given $x_{0}\in\mathbb{R}^{d}$ our starting point, $m_{0}=0$, and $v_{0}=0$, we define for all iterations $n\in\mathbb{N}^{*}$,

$m_{n,i}=\beta_{1}m_{n-1,i}+\nabla_{i}f_{n}(x_{n-1})$ (1)
$v_{n,i}=\beta_{2}v_{n-1,i}+\left(\nabla_{i}f_{n}(x_{n-1})\right)^{2}$ (2)
$x_{n,i}=x_{n-1,i}-\alpha_{n}\frac{m_{n,i}}{\sqrt{\epsilon+v_{n,i}}}.$ (3)

The parameter $\beta_{1}$ is a heavy-ball style momentum parameter (Polyak, 1964), while $\beta_{2}$ controls the decay rate of the per-coordinate exponential moving average of the squared gradients. Taking $\beta_{1}=0$, $\beta_{2}=1$ and $\alpha_{n}=\alpha$ gives Adagrad. While the original Adagrad algorithm did not include a heavy-ball-like momentum, our analysis also applies to the case $\beta_{1}>0$.
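For concreteness, the recursion (1)-(3) can be written in a few lines of code. The following is a minimal NumPy sketch of the unified update (the function name and argument layout are ours), not the full Adam implementation of Kingma & Ba (2015), which additionally includes the corrective terms discussed next.

```python
import numpy as np

def adaptive_step(x, m, v, grad, alpha_n, beta1, beta2, eps=1e-8):
    """One iteration of the unified recursion (1)-(3).

    Adagrad corresponds to beta1=0, beta2=1 and a constant alpha_n,
    while Adam corresponds to 0 < beta2 < 1 with the step size (5).
    """
    m = beta1 * m + grad                    # heavy-ball momentum, eq. (1)
    v = beta2 * v + grad ** 2               # squared-gradient accumulator, eq. (2)
    x = x - alpha_n * m / np.sqrt(eps + v)  # parameter update, eq. (3)
    return x, m, v
```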

Adam and its corrective terms

The original Adam algorithm (Kingma & Ba, 2015) uses a weighted average, rather than a weighted sum, for (1) and (2), i.e. it uses

$\tilde{m}_{n,i}=(1-\beta_{1})\sum_{k=1}^{n}\beta_{1}^{n-k}\nabla_{i}f_{k}(x_{k-1})=(1-\beta_{1})m_{n,i},$ and similarly $\tilde{v}_{n,i}=(1-\beta_{2})v_{n,i}$.

We can achieve the same definition by taking $\alpha_{\textrm{adam}}=\alpha\cdot\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}$. The original Adam algorithm further includes two corrective terms to account for the fact that $m_{n}$ and $v_{n}$ are biased towards 0 for the first few iterations. Those corrective terms are equivalent to taking a step size $\alpha_{n}$ of the form

$\alpha_{n,\text{adam}}=\alpha\cdot\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}\cdot\underbrace{\frac{1}{1-\beta_{1}^{n}}}_{\text{corrective term for }m_{n}}\cdot\underbrace{\sqrt{1-\beta_{2}^{n}}}_{\text{corrective term for }v_{n}}.$ (4)

Those corrective terms can be seen as the normalization factors for the weighted sums given by (1) and (2), which explains the $(1-\beta_{1})$ term in (4). Note that each term goes to its limit value within a few times $1/(1-\beta)$ updates (with $\beta\in\{\beta_{1},\beta_{2}\}$). In the present work, we propose to drop the corrective term for $m_{n}$, and to keep only the one for $v_{n}$, thus using the alternative step size

$\alpha_{n}=\alpha(1-\beta_{1})\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}.$ (5)

This simplification is motivated by several observations:

  • By dropping either corrective term, $\alpha_{n}$ becomes monotonic, which simplifies the proof.

  • For typical values of $\beta_{1}$ and $\beta_{2}$ (e.g. 0.9 and 0.999), the corrective term for $m_{n}$ converges to its limit value much faster than the one for $v_{n}$.

  • Removing the corrective term for $m_{n}$ is equivalent to a learning-rate warmup, which is popular in deep learning, while removing the one for $v_{n}$ would lead to an increased step size during early training. For values of $\beta_{2}$ close to 1, this can lead to divergence in practice.

We experimentally verify in Section 6.3 that dropping the corrective term for $m_{n}$ has no observable effect on the training process, while dropping the corrective term for $v_{n}$ leads to observable perturbations. In the following, we thus consider the variation of Adam obtained by taking $\alpha_{n}$ as provided by (5).
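To illustrate the second observation above, the sketch below (hyper-parameter values are only illustrative) evaluates the original schedule (4) and the simplified schedule (5); the two coincide up to the corrective term for $m_{n}$, which vanishes after a few times $1/(1-\beta_{1})$ iterations.

```python
import numpy as np

def alpha_adam(n, alpha, beta1, beta2):
    """Step size (4): both corrective terms of the original Adam."""
    return alpha * (1 - beta1) / (1 - beta1 ** n) \
        * np.sqrt((1 - beta2 ** n) / (1 - beta2))

def alpha_simplified(n, alpha, beta1, beta2):
    """Step size (5): corrective term for m_n dropped, the one for v_n kept."""
    return alpha * (1 - beta1) * np.sqrt((1 - beta2 ** n) / (1 - beta2))

alpha, beta1, beta2 = 1e-3, 0.9, 0.999
for n in [1, 10, 50, 100, 1000]:
    print(n, alpha_adam(n, alpha, beta1, beta2),
          alpha_simplified(n, alpha, beta1, beta2))
```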

2.3 Assumptions

We make three assumptions. We first assume $F$ is bounded below by $F_{*}$, that is,

$\forall x\in\mathbb{R}^{d},\ F(x)\geq F_{*}.$ (6)

We then assume the $\ell_{\infty}$ norm of the stochastic gradients is uniformly almost surely bounded, i.e. there is $R\geq\sqrt{\epsilon}$ (the $\sqrt{\epsilon}$ is used here to simplify the final bounds) so that

$\forall x\in\mathbb{R}^{d},\quad\left\|\nabla f(x)\right\|_{\infty}\leq R-\sqrt{\epsilon}\quad\text{a.s.},$ (7)

and finally, the smoothness of the objective function, i.e., its gradient is $L$-Lipschitz-continuous with respect to the $\ell_{2}$-norm:

$\forall x,y\in\mathbb{R}^{d},\quad\left\|\nabla F(x)-\nabla F(y)\right\|_{2}\leq L\left\|x-y\right\|_{2}.$ (8)

We discuss the use of assumption (7) in Section 4.2.

3 Related work

Early work on adaptive methods (McMahan & Streeter, 2010; Duchi et al., 2011) showed that Adagrad achieves an optimal rate of convergence of $O(1/\sqrt{N})$ for convex optimization (Agarwal et al., 2009). Later, RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) were developed for training deep neural networks, using an exponential moving average of the past squared gradients.

Kingma & Ba (2015) offered a proof that Adam with a decreasing step size converges for convex objectives. However, the proof contained a mistake spotted by Reddi et al. (2018), who also gave examples of convex problems where Adam does not converge to an optimal solution. They proposed AMSGrad as a convergent variant, which consists in retaining the maximum value of the exponential moving average. When $\alpha$ goes to zero, AMSGrad is shown to converge in the convex and non-convex settings (Fang & Klabjan, 2019; Zhou et al., 2018). Despite this apparent flaw in the Adam algorithm, it remains a widely popular optimizer, raising the question as to whether it converges. When $\beta_{2}$ goes to $1$ and $\alpha$ to 0, our results and previous work (Zou et al., 2019a) show that Adam does converge with the same rate as Adagrad. This is coherent with the counter-examples of Reddi et al. (2018), because they use a small exponential decay parameter $\beta_{2}<1/5$.

The convergence of Adagrad for non-convex objectives was first tackled by Li & Orabona (2019), who proved its convergence, but under restrictive conditions (e.g., $\alpha\leq\sqrt{\epsilon}/L$). The proof technique was improved by Ward et al. (2019), who showed the convergence of "scalar" Adagrad, i.e., with a single learning rate, for any value of $\alpha$ with a rate of $O(\ln(N)/\sqrt{N})$. Our approach builds on this work, but we extend it to both Adagrad and Adam in their coordinate-wise versions, as used in practice, while also supporting heavy-ball momentum.

The coordinate-wise version of Adagrad was also tackled by Zou et al. (2019b), offering a convergence result for Adagrad with either heavy-ball or Nesterov style momentum. We obtain the same rate for heavy-ball momentum with respect to $N$ (i.e., $O(\ln(N)/\sqrt{N})$), but we improve the dependence on the momentum parameter $\beta_{1}$ from $O((1-\beta_{1})^{-3})$ to $O((1-\beta_{1})^{-1})$. Chen et al. (2019) also provided a bound for Adagrad and Adam, but without convergence guarantees for Adam for any hyper-parameter choice, and with a worse dependency on $\beta_{1}$. Zhou et al. (2018) also cover Adagrad in the stochastic setting; however, their proof technique leads to a $\sqrt{1/\epsilon}$ term in their bound, typically with $\epsilon=10^{-8}$. Finally, a convergence bound for Adam was introduced by Zou et al. (2019a). We recover the same scaling of the bound with respect to $\alpha$ and $\beta_{2}$. However, their bound has a dependency of $O((1-\beta_{1})^{-5})$ with respect to $\beta_{1}$, while we get $O((1-\beta_{1})^{-1})$, a significant improvement. Shi et al. (2020) obtain similar convergence results for RMSProp and Adam when considering the random shuffling setup. They use an affine growth condition (i.e. the norm of the stochastic gradient is bounded by an affine function of the norm of the deterministic gradient) instead of the boundedness of the gradient, but their bound decays with the total number of epochs, not of stochastic updates, leading to an extra $\sqrt{s}$ term with $s$ the size of the dataset. Finally, Faw et al. (2022) use the same affine growth assumption to derive high-probability bounds for scalar Adagrad.

Non-adaptive methods like SGD are also well studied in the non-convex setting (Ghadimi & Lan, 2013), with a convergence rate of $O(1/\sqrt{N})$ for a smooth objective with bounded variance of the gradients. Unlike adaptive methods, SGD requires knowing the smoothness constant. When adding heavy-ball momentum, Yang et al. (2016) showed that the convergence bound degrades as $O((1-\beta_{1})^{-2})$, assuming that the gradients are bounded. We apply our proof technique for momentum to SGD in the Appendix, Section B, and improve this dependency to $O((1-\beta_{1})^{-1})$. Recent work by Liu et al. (2020) achieves the same dependency with weaker assumptions. Defazio (2020) provided an in-depth analysis of SGD-M with a tight Lyapunov analysis.

4 Main results

For a number of iterations $N\in\mathbb{N}^{*}$, we denote by $\tau_{N}$ a random index with values in $\{0,\ldots,N-1\}$, such that

$\forall j\in\mathbb{N},\ j<N,\quad\mathbb{P}\left[\tau=j\right]\propto 1-\beta_{1}^{N-j}.$ (9)

If $\beta_{1}=0$, this is equivalent to sampling $\tau$ uniformly in $\{0,\ldots,N-1\}$. If $\beta_{1}>0$, the last few $\frac{1}{1-\beta_{1}}$ iterations are sampled rarely, and iterations older than a few times that number are sampled almost uniformly. Our results bound the expected squared norm of the gradient at iteration $\tau$, which is standard for non-convex stochastic optimization (Ghadimi & Lan, 2013).
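For reference, sampling $\tau$ according to (9) is straightforward; the following is a minimal NumPy sketch (the function name is ours).

```python
import numpy as np

def sample_tau(N, beta1, rng=None):
    """Sample tau in {0, ..., N-1} with P[tau = j] proportional to 1 - beta1**(N - j), as in (9)."""
    if rng is None:
        rng = np.random.default_rng()
    j = np.arange(N)
    weights = 1.0 - beta1 ** (N - j)   # uniform weights when beta1 = 0
    return rng.choice(N, p=weights / weights.sum())
```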

4.1 Convergence bounds

For simplicity, we first give convergence results for $\beta_{1}=0$, along with a complete proof in Section 5. We then provide the results with momentum, with their proofs in the Appendix, Section A.6. We also provide a bound on the convergence of SGD with a $O(1/(1-\beta_{1}))$ dependency in the Appendix, Section B.2, along with its proof in Section B.4.

No heavy-ball momentum
Theorem 1 (Convergence of Adagrad without momentum).

Given the assumptions from Section 2.3, the iterates $x_{n}$ defined in Section 2.2 with hyper-parameters verifying $\beta_{2}=1$, $\alpha_{n}=\alpha$ with $\alpha>0$ and $\beta_{1}=0$, and $\tau$ defined by (9), we have for any $N\in\mathbb{N}^{*}$,

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\frac{F(x_{0})-F_{*}}{\alpha\sqrt{N}}+\frac{1}{\sqrt{N}}\left(4dR^{2}+\alpha dRL\right)\ln\left(1+\frac{NR^{2}}{\epsilon}\right).$ (10)
Theorem 2 (Convergence of Adam without momentum).

Given the assumptions from Section 2.3, the iterates $x_{n}$ defined in Section 2.2 with hyper-parameters verifying $0<\beta_{2}<1$, $\alpha_{n}=\alpha\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}$ with $\alpha>0$ and $\beta_{1}=0$, and $\tau$ defined by (9), we have for any $N\in\mathbb{N}^{*}$,

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\frac{F(x_{0})-F_{*}}{\alpha N}+E\left(\frac{1}{N}\ln\left(1+\frac{R^{2}}{(1-\beta_{2})\epsilon}\right)-\ln(\beta_{2})\right),$ (11)

with

$E=\frac{4dR^{2}}{\sqrt{1-\beta_{2}}}+\frac{\alpha dRL}{1-\beta_{2}}.$
With heavy-ball momentum
Theorem 3 (Convergence of Adagrad with momentum).

Given the assumptions from Section 2.3, the iterates $x_{n}$ defined in Section 2.2 with hyper-parameters verifying $\beta_{2}=1$, $\alpha_{n}=\alpha$ with $\alpha>0$ and $0\leq\beta_{1}<1$, and $\tau$ defined by (9), we have for any $N\in\mathbb{N}^{*}$ such that $N>\frac{\beta_{1}}{1-\beta_{1}}$,

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\sqrt{N}\frac{F(x_{0})-F_{*}}{\alpha\tilde{N}}+\frac{\sqrt{N}}{\tilde{N}}E\ln\left(1+\frac{NR^{2}}{\epsilon}\right),$ (12)

with $\tilde{N}=N-\frac{\beta_{1}}{1-\beta_{1}}$, and

$E=\alpha dRL+\frac{12dR^{2}}{1-\beta_{1}}+\frac{2\alpha^{2}dL^{2}\beta_{1}}{1-\beta_{1}}.$
Theorem 4 (Convergence of Adam with momentum).

Given the assumptions from Section 2.3, the iterates $x_{n}$ defined in Section 2.2 with hyper-parameters verifying $0<\beta_{2}<1$, $0\leq\beta_{1}<\beta_{2}$, and $\alpha_{n}=\alpha(1-\beta_{1})\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}$ with $\alpha>0$, and $\tau$ defined by (9), we have for any $N\in\mathbb{N}^{*}$ such that $N>\frac{\beta_{1}}{1-\beta_{1}}$,

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\frac{F(x_{0})-F_{*}}{\alpha\tilde{N}}+E\left(\frac{1}{\tilde{N}}\ln\left(1+\frac{R^{2}}{(1-\beta_{2})\epsilon}\right)-\frac{N}{\tilde{N}}\ln(\beta_{2})\right),$ (13)

with $\tilde{N}=N-\frac{\beta_{1}}{1-\beta_{1}}$, and

$E=\frac{\alpha dRL(1-\beta_{1})}{(1-\beta_{1}/\beta_{2})(1-\beta_{2})}+\frac{12dR^{2}\sqrt{1-\beta_{1}}}{(1-\beta_{1}/\beta_{2})^{3/2}\sqrt{1-\beta_{2}}}+\frac{2\alpha^{2}dL^{2}\beta_{1}}{(1-\beta_{1}/\beta_{2})(1-\beta_{2})^{3/2}}.$

4.2 Analysis of the bounds

Dependency on $d$.

The dependency on $d$ is present in previous works on coordinate-wise adaptive methods (Zou et al., 2019a; b). Note however that $R$ is defined as the $\ell_{\infty}$ bound on the stochastic gradient, so that in the case where the gradient has a similar scale along all dimensions, $dR^{2}$ would be a reasonable bound for $\left\|\nabla f(x)\right\|_{2}^{2}$. However, if many dimensions contribute little to the norm of the gradient, this would still lead to a worse dependency on $d$ than e.g. scalar Adagrad (Ward et al., 2019) or SGD.

Diving into the technicalities of the proof to come, we will see in Section 5 that we apply Lemma 5.2 once per dimension. The contribution from each coordinate is mostly independent of the actual scale of its gradients (as it only appears in the log), so that the right-hand side of the convergence bound grows as $d$. In contrast, the scalar version of Adagrad (Ward et al., 2019) has a single learning rate, so that Lemma 5.2 is only applied once, removing the dependency on $d$. However, this variant is rarely used in practice.

Almost sure bound on the gradient.

We chose to assume the existence of an almost sure uniform $\ell_{\infty}$-bound on the gradients, given by (7). This is a strong assumption, although it is weaker than the one used by Duchi et al. (2011) for Adagrad in the convex case, where the iterates were assumed to be almost surely bounded. There exist a few real-life problems that verify this assumption, for instance logistic regression without weight penalty and with bounded inputs. It is possible instead to assume only a uniform bound on the expected gradient $\nabla F(x)$, as done by Ward et al. (2019) and Zou et al. (2019b). This however leads to a bound on $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{4/3}\right]^{2/3}$ instead of a bound on $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]$, all the other terms staying the same. We provide a sketch of the proof using Hölder's inequality in the Appendix, Section A.7.

It is also possible to replace the bound on the gradient with an affine growth condition, i.e. the norm of the stochastic gradient is bounded by an affine function of the norm of the expected gradient. A proof for scalar Adagrad is provided by Faw et al. (2022). Shi et al. (2020) do the same for RMSProp; however, their convergence bound decays as $O(\log(T)/\sqrt{T})$ with $T$ the number of epochs, not the number of updates, leading to a significantly less tight bound for large datasets.

Impact of heavy-ball momentum.

Looking at Theorems 3 and 4, we see that increasing $\beta_{1}$ always deteriorates the bounds. Taking $\beta_{1}=0$ in those theorems gives us almost exactly the bound without heavy-ball momentum from Theorems 1 and 2, up to a factor 3 in the terms of the form $dR^{2}$.

As discussed in Section 3, previous bounds for Adagrad in the non-convex setting deteriorate as $O((1-\beta_{1})^{-3})$ (Zou et al., 2019b), while bounds for Adam deteriorate as $O((1-\beta_{1})^{-5})$ (Zou et al., 2019a). Our unified proof for Adam and Adagrad achieves a dependency of $O((1-\beta_{1})^{-1})$, a significant improvement. We refer the reader to the Appendix, Section A.3, for a detailed analysis. While our dependency still contradicts the benefits of using momentum observed in practice (see Section 6), our tighter analysis is a step in the right direction.

On the sampling of $\tau$

Note that in (9), we sample the latest iterations with a lower probability. This can be explained by the fact that the proof technique for stochastic optimization in the non-convex case is based on the idea that for every iteration $n$, either $\nabla F(x_{n})$ is small, or $F(x_{n+1})$ will decrease by some amount. However, when introducing momentum, and especially when taking the limit $\beta_{1}\rightarrow 1$, the latest gradient $\nabla F(x_{n})$ has almost no influence over $x_{n+1}$, as the momentum term updates slowly. Momentum spreads the influence of the gradients over time, and thus it will take a few updates for a gradient to have fully influenced the iterate $x_{n}$ and thus the value of the function $F(x_{n})$. From a formal point of view, the sampling weights given by (9) naturally appear as part of the proof, which is presented in Section A.6.

4.3 Optimal finite horizon Adam is Adagrad

Let us take a closer look at the result of Theorem 2. It could seem like some quantities in the bound can explode, but this is not the case for any reasonable values of $\alpha$, $\beta_{2}$ and $N$. Let us try to find the best possible rate of convergence for Adam for a finite horizon $N$, i.e. $q\in\mathbb{R}_{+}$ such that $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]=O(\ln(N)N^{-q})$ for some choice of the hyper-parameters $\alpha(N)$ and $\beta_{2}(N)$. Given that the upper bound in (11) is a sum of non-negative terms, we need each term to be of the order of $\ln(N)N^{-q}$ or negligible. Let us assume that this rate is achieved for $\alpha(N)$ and $\beta_{2}(N)$. The bound tells us that convergence can only be achieved if $\lim\alpha(N)=0$ and $\lim\beta_{2}(N)=1$, with the limits taken for $N\rightarrow\infty$. This motivates us to assume that there exist asymptotic developments $\alpha(N)\propto N^{-a}+o(N^{-a})$ and $1-\beta_{2}(N)\propto N^{-b}+o(N^{-b})$ for $a$ and $b$ positive. Thus, let us consider only the leading term in those developments, ignoring the leading constant (which is assumed to be non-zero). Let us further assume that $\epsilon\ll R^{2}$; we have

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\frac{F(x_{0})-F_{*}}{N^{1-a}}+E\left(\frac{1}{N}\ln\left(\frac{R^{2}N^{b}}{\epsilon}\right)+\frac{N^{-b}}{1-N^{-b}}\right),$ (14)

with $E=4dR^{2}N^{b/2}+dRLN^{b-a}$. Let us ignore the log terms for now, and use $\frac{N^{-b}}{1-N^{-b}}\sim N^{-b}$ for $N\rightarrow\infty$, to get

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\lessapprox 2R\frac{F(x_{0})-F_{*}}{N^{1-a}}+4dR^{2}N^{b/2-1}+4dR^{2}N^{-b/2}+dRLN^{b-a-1}+dRLN^{-a}.$

Adding back the logarithmic term, the best rate we can obtain is $O(\ln(N)/\sqrt{N})$, and it is only achieved for $a=1/2$ and $b=1$, i.e., $\alpha=\alpha_{1}/\sqrt{N}$ and $\beta_{2}=1-1/N$. We can see the resemblance between Adagrad on one side and Adam with a finite horizon and such parameters on the other. Indeed, an exponential moving average with a parameter $\beta_{2}=1-1/N$ has a typical averaging window of length $N$, while Adagrad would be an exact average of the past $N$ terms. In particular, the bound for Adam now becomes

$\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\leq 2R\frac{F(x_{0})-F_{*}}{\alpha_{1}\sqrt{N}}+\frac{1}{\sqrt{N}}\left(4dR^{2}+\alpha_{1}dRL\right)\left(\ln\left(1+\frac{NR^{2}}{\epsilon}\right)+\frac{N}{N-1}\right),$ (15)

which differs from (10) only by a $+N/(N-1)$ next to the log term.

Adam and Adagrad are twins.

Our analysis highlights an important fact: Adam is to Adagrad what constant step-size SGD is to decaying step-size SGD. While Adagrad is asymptotically optimal, it also leads to a slower decrease of the term proportional to $F(x_{0})-F_{*}$, as $1/\sqrt{N}$ instead of $1/N$ for Adam. During the initial phase of training, it is likely that this term dominates the loss, which could explain the popularity of Adam for training deep neural networks rather than Adagrad. With its default parameters, Adam will not converge. It is however possible to choose $\alpha$ and $\beta_{2}$ to achieve an $\epsilon$-critical point for $\epsilon$ arbitrarily small and, for a known time horizon, they can be chosen to obtain the exact same bound as Adagrad.
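As a numerical illustration of this correspondence, the sketch below plugs the finite-horizon choice $\alpha=\alpha_{1}/\sqrt{N}$ and $\beta_{2}=1-1/N$ into the Adam bound (11) and compares it with the Adagrad bound (10) with step size $\alpha_{1}$; the constants ($d$, $R$, $L$, $\epsilon$, $F(x_{0})-F_{*}$) are placeholders chosen only to show that the two upper bounds nearly coincide.

```python
import math

def adagrad_bound(N, alpha, d, R, L, eps, delta_F):
    """Right-hand side of the Adagrad bound (10)."""
    return (2 * R * delta_F / (alpha * math.sqrt(N))
            + (4 * d * R ** 2 + alpha * d * R * L)
            * math.log(1 + N * R ** 2 / eps) / math.sqrt(N))

def adam_bound(N, alpha, beta2, d, R, L, eps, delta_F):
    """Right-hand side of the Adam bound (11)."""
    E = 4 * d * R ** 2 / math.sqrt(1 - beta2) + alpha * d * R * L / (1 - beta2)
    return (2 * R * delta_F / (alpha * N)
            + E * (math.log(1 + R ** 2 / ((1 - beta2) * eps)) / N
                   - math.log(beta2)))

d, R, L, eps, delta_F, alpha1, N = 10, 1.0, 1.0, 1e-8, 1.0, 0.1, 10 ** 6
print("Adagrad:", adagrad_bound(N, alpha1, d, R, L, eps, delta_F))
print("Adam   :", adam_bound(N, alpha1 / math.sqrt(N), 1 - 1 / N, d, R, L, eps, delta_F))
```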

5 Proofs for $\beta_{1}=0$ (no momentum)

We assume here for simplicity that $\beta_{1}=0$, i.e., there is no heavy-ball style momentum. Taking $n\in\mathbb{N}^{*}$, the recursions introduced in Section 2.2 can be simplified into

$\begin{cases}v_{n,i}&=\beta_{2}v_{n-1,i}+\left(\nabla_{i}f_{n}(x_{n-1})\right)^{2},\\ x_{n,i}&=x_{n-1,i}-\alpha_{n}\frac{\nabla_{i}f_{n}(x_{n-1})}{\sqrt{\epsilon+v_{n,i}}}.\end{cases}$ (16)

Remember that we recover Adagrad when $\alpha_{n}=\alpha$ for $\alpha>0$ and $\beta_{2}=1$, while Adam can be obtained taking $0<\beta_{2}<1$, $\alpha>0$, and

$\alpha_{n}=\alpha\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}.$ (17)

Throughout the proof we denote by $\mathbb{E}_{n-1}\left[\cdot\right]$ the conditional expectation with respect to $f_{1},\ldots,f_{n-1}$. In particular, $x_{n-1}$ and $v_{n-1}$ are deterministic knowing $f_{1},\ldots,f_{n-1}$. For all $n\in\mathbb{N}^{*}$, we also define $\tilde{v}_{n}\in\mathbb{R}^{d}$ so that for all $i\in[d]$,

$\tilde{v}_{n,i}=\beta_{2}v_{n-1,i}+\mathbb{E}_{n-1}\left[\left(\nabla_{i}f_{n}(x_{n-1})\right)^{2}\right],$ (18)

i.e., we replace the last gradient contribution by its expected value conditioned on $f_{1},\ldots,f_{n-1}$.

5.1 Technical lemmas

A problem posed by the update (16) is the correlation between the numerator and the denominator. This prevents us from easily computing the conditional expectation, and, as noted by Reddi et al. (2018), the expected direction of update can have a positive dot product with the objective gradient. It is however possible to control the deviation from the descent direction, following Ward et al. (2019), with this first lemma.

Lemma 5.1 (the adaptive update approximately follows a descent direction).

For all $n\in\mathbb{N}^{*}$ and $i\in[d]$, we have:

$\mathbb{E}_{n-1}\left[\nabla_{i}F(x_{n-1})\frac{\nabla_{i}f_{n}(x_{n-1})}{\sqrt{\epsilon+v_{n,i}}}\right]\geq\frac{\left(\nabla_{i}F(x_{n-1})\right)^{2}}{2\sqrt{\epsilon+\tilde{v}_{n,i}}}-2R\,\mathbb{E}_{n-1}\left[\frac{\left(\nabla_{i}f_{n}(x_{n-1})\right)^{2}}{\epsilon+v_{n,i}}\right].$ (19)
Proof.

We take $i\in[d]$ and write $G=\nabla_{i}F(x_{n-1})$, $g=\nabla_{i}f_{n}(x_{n-1})$, $v=v_{n,i}$ and $\tilde{v}=\tilde{v}_{n,i}$.

$\mathbb{E}_{n-1}\left[\frac{Gg}{\sqrt{\epsilon+v}}\right]=\mathbb{E}_{n-1}\left[\frac{Gg}{\sqrt{\epsilon+\tilde{v}}}\right]+\mathbb{E}_{n-1}\bigg[\underbrace{Gg\left(\frac{1}{\sqrt{\epsilon+v}}-\frac{1}{\sqrt{\epsilon+\tilde{v}}}\right)}_{A}\bigg].$ (20)

Given that $g$ and $\tilde{v}$ are independent knowing $f_{1},\ldots,f_{n-1}$, we immediately have

$\mathbb{E}_{n-1}\left[\frac{Gg}{\sqrt{\epsilon+\tilde{v}}}\right]=\frac{G^{2}}{\sqrt{\epsilon+\tilde{v}}}.$ (21)

Now we need to control the size of the second term $A$:

$A=Gg\frac{\tilde{v}-v}{\sqrt{\epsilon+v}\sqrt{\epsilon+\tilde{v}}(\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}})}=Gg\frac{\mathbb{E}_{n-1}\left[g^{2}\right]-g^{2}}{\sqrt{\epsilon+v}\sqrt{\epsilon+\tilde{v}}(\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}})},$
$\left|A\right|\leq\underbrace{\left|Gg\right|\frac{\mathbb{E}_{n-1}\left[g^{2}\right]}{\sqrt{\epsilon+v}(\epsilon+\tilde{v})}}_{\kappa}+\underbrace{\left|Gg\right|\frac{g^{2}}{(\epsilon+v)\sqrt{\epsilon+\tilde{v}}}}_{\rho}.$

The last inequality comes from the fact that $\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}}\geq\max(\sqrt{\epsilon+v},\sqrt{\epsilon+\tilde{v}})$ and $\left|\mathbb{E}_{n-1}\left[g^{2}\right]-g^{2}\right|\leq\mathbb{E}_{n-1}\left[g^{2}\right]+g^{2}$. Following Ward et al. (2019), we can use the following inequality to bound $\kappa$ and $\rho$:

$\forall\lambda>0,\ \forall x,y\in\mathbb{R},\quad xy\leq\frac{\lambda}{2}x^{2}+\frac{y^{2}}{2\lambda}.$ (22)

First applying (22) to $\kappa$ with
$\lambda=\frac{\sqrt{\epsilon+\tilde{v}}}{2},\quad x=\frac{\left|G\right|}{\sqrt{\epsilon+\tilde{v}}},\quad y=\frac{\left|g\right|\mathbb{E}_{n-1}\left[g^{2}\right]}{\sqrt{\epsilon+\tilde{v}}\sqrt{\epsilon+v}},$

we obtain

$\kappa\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{g^{2}\mathbb{E}_{n-1}\left[g^{2}\right]^{2}}{(\epsilon+\tilde{v})^{3/2}(\epsilon+v)}.$

Given that $\epsilon+\tilde{v}\geq\mathbb{E}_{n-1}\left[g^{2}\right]$ and taking the conditional expectation, we can simplify this as

$\mathbb{E}_{n-1}\left[\kappa\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{\mathbb{E}_{n-1}\left[g^{2}\right]}{\sqrt{\epsilon+\tilde{v}}}\mathbb{E}_{n-1}\left[\frac{g^{2}}{\epsilon+v}\right].$ (23)

Given that $\sqrt{\mathbb{E}_{n-1}\left[g^{2}\right]}\leq\sqrt{\epsilon+\tilde{v}}$ and $\sqrt{\mathbb{E}_{n-1}\left[g^{2}\right]}\leq R$, we can simplify (23) as

$\mathbb{E}_{n-1}\left[\kappa\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+R\,\mathbb{E}_{n-1}\left[\frac{g^{2}}{\epsilon+v}\right].$ (24)

Now turning to $\rho$, we use (22) with
$\lambda=\frac{\sqrt{\epsilon+\tilde{v}}}{2\mathbb{E}_{n-1}\left[g^{2}\right]},\quad x=\frac{\left|Gg\right|}{\sqrt{\epsilon+\tilde{v}}},\quad y=\frac{g^{2}}{\epsilon+v},$

we obtain

$\rho\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}\frac{g^{2}}{\mathbb{E}_{n-1}\left[g^{2}\right]}+\frac{\mathbb{E}_{n-1}\left[g^{2}\right]}{\sqrt{\epsilon+\tilde{v}}}\frac{g^{4}}{(\epsilon+v)^{2}}.$ (25)

Given that $\epsilon+v\geq g^{2}$ and taking the conditional expectation, we obtain

$\mathbb{E}_{n-1}\left[\rho\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{\mathbb{E}_{n-1}\left[g^{2}\right]}{\sqrt{\epsilon+\tilde{v}}}\mathbb{E}_{n-1}\left[\frac{g^{2}}{\epsilon+v}\right],$ (26)

which we simplify using the same argument as for (24) into

$\mathbb{E}_{n-1}\left[\rho\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+R\,\mathbb{E}_{n-1}\left[\frac{g^{2}}{\epsilon+v}\right].$ (27)

Notice that in (25), we possibly divide by zero. It suffices to notice that if $\mathbb{E}_{n-1}\left[g^{2}\right]=0$, then $g^{2}=0$ a.s., so that $\rho=0$ and (27) is still verified. Summing (24) and (27), we can bound

$\mathbb{E}_{n-1}\left[\left|A\right|\right]\leq\frac{G^{2}}{2\sqrt{\epsilon+\tilde{v}}}+2R\,\mathbb{E}_{n-1}\left[\frac{g^{2}}{\epsilon+v}\right].$ (28)

Injecting (28) and (21) into (20) finishes the proof. ∎

Anticipating Section 5.2, the previous lemma gives us a bound on the deviation from a descent direction. While for a specific iteration this deviation can take us away from a descent direction, the next lemma tells us that the sum of those deviations cannot grow larger than a logarithmic term. This key insight, introduced by Ward et al. (2019), is what makes the proof work.

Lemma 5.2 (sum of ratios with the denominator being the sum of past numerators).

We assume we have $0<\beta_{2}\leq 1$ and a non-negative sequence $(a_{n})_{n\in\mathbb{N}^{*}}$. We define for all $n\in\mathbb{N}^{*}$, $b_{n}=\sum_{j=1}^{n}\beta_{2}^{n-j}a_{j}$. We have

$\sum_{j=1}^{N}\frac{a_{j}}{\epsilon+b_{j}}\leq\ln\left(1+\frac{b_{N}}{\epsilon}\right)-N\ln(\beta_{2}).$ (29)
Proof.

Given that $\ln$ is increasing and that $b_{j}\geq a_{j}\geq 0$, we have for all $j\in\mathbb{N}^{*}$,

$\frac{a_{j}}{\epsilon+b_{j}}\leq\ln(\epsilon+b_{j})-\ln(\epsilon+b_{j}-a_{j})=\ln(\epsilon+b_{j})-\ln(\epsilon+\beta_{2}b_{j-1})=\ln\left(\frac{\epsilon+b_{j}}{\epsilon+b_{j-1}}\right)+\ln\left(\frac{\epsilon+b_{j-1}}{\epsilon+\beta_{2}b_{j-1}}\right).$

The first term forms a telescoping series, while the second one is bounded by $-\ln(\beta_{2})$. Summing over all $j\in[N]$ gives the desired result. ∎
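As a quick numerical sanity check of Lemma 5.2 (with an arbitrary random non-negative sequence and an arbitrary choice of $\beta_{2}$), the sketch below verifies the inequality (29).

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta2, eps = 1000, 0.99, 1e-8
a = rng.random(N) ** 2          # an arbitrary non-negative sequence (a_j)

b, lhs = 0.0, 0.0
for a_j in a:
    b = beta2 * b + a_j         # b_j = sum_{k <= j} beta2^(j - k) a_k
    lhs += a_j / (eps + b)      # accumulate a_j / (eps + b_j)

rhs = np.log(1 + b / eps) - N * np.log(beta2)
print(lhs, "<=", rhs, lhs <= rhs)
```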

5.2 Proof of Adam and Adagrad without momentum

Let us take an iteration $n\in\mathbb{N}^{*}$, and define the update $u_{n}\in\mathbb{R}^{d}$:

$\forall i\in[d],\quad u_{n,i}=\frac{\nabla_{i}f_{n}(x_{n-1})}{\sqrt{\epsilon+v_{n,i}}}.$ (30)
Adagrad.

As explained in Section 2.2, we have $\alpha_{n}=\alpha$ for $\alpha>0$. Using the smoothness of $F$ (8), we have

$F(x_{n})\leq F(x_{n-1})-\alpha\nabla F(x_{n-1})^{T}u_{n}+\frac{\alpha^{2}L}{2}\left\|u_{n}\right\|^{2}_{2}.$ (31)

Taking the conditional expectation with respect to $f_{1},\ldots,f_{n-1}$, we can apply the descent Lemma 5.1. Notice that due to the a.s. $\ell_{\infty}$ bound on the gradients (7), we have for any $i\in[d]$, $\sqrt{\epsilon+\tilde{v}_{n,i}}\leq R\sqrt{n}$, so that

$\frac{\alpha\left(\nabla_{i}F(x_{n-1})\right)^{2}}{2\sqrt{\epsilon+\tilde{v}_{n,i}}}\geq\frac{\alpha\left(\nabla_{i}F(x_{n-1})\right)^{2}}{2R\sqrt{n}}.$ (32)

This gives us

$\mathbb{E}_{n-1}\left[F(x_{n})\right]\leq F(x_{n-1})-\frac{\alpha}{2R\sqrt{n}}\left\|\nabla F(x_{n-1})\right\|_{2}^{2}+\left(2\alpha R+\frac{\alpha^{2}L}{2}\right)\mathbb{E}_{n-1}\left[\left\|u_{n}\right\|^{2}_{2}\right].$

Summing the previous inequality for all $n\in[N]$, taking the complete expectation, and using that $\sqrt{n}\leq\sqrt{N}$, gives us

$\mathbb{E}\left[F(x_{N})\right]\leq F(x_{0})-\frac{\alpha}{2R\sqrt{N}}\sum_{n=0}^{N-1}\mathbb{E}\left[\left\|\nabla F(x_{n})\right\|_{2}^{2}\right]+\left(2\alpha R+\frac{\alpha^{2}L}{2}\right)\sum_{n=1}^{N}\mathbb{E}\left[\left\|u_{n}\right\|^{2}_{2}\right].$

From there, we can bound the last sum on the right hand side using Lemma 5.2 once for each dimension. Rearranging the terms, we obtain the result of Theorem 1.

Adam.

As given by (5) in Section 2.2, we have $\alpha_{n}=\alpha\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}$ for $\alpha>0$. Using the smoothness of $F$ defined in (8), we have

$F(x_{n})\leq F(x_{n-1})-\alpha_{n}\nabla F(x_{n-1})^{T}u_{n}+\frac{\alpha_{n}^{2}L}{2}\left\|u_{n}\right\|^{2}_{2}.$ (33)

We have for any $i\in[d]$, $\sqrt{\epsilon+\tilde{v}_{n,i}}\leq R\sqrt{\sum_{j=0}^{n-1}\beta_{2}^{j}}=R\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}$, thanks to the a.s. $\ell_{\infty}$ bound on the gradients (7), so that

$\alpha_{n}\frac{\left(\nabla_{i}F(x_{n-1})\right)^{2}}{2\sqrt{\epsilon+\tilde{v}_{n,i}}}\geq\frac{\alpha\left(\nabla_{i}F(x_{n-1})\right)^{2}}{2R}.$ (34)

Taking the conditional expectation with respect to $f_{1},\ldots,f_{n-1}$, we can apply the descent Lemma 5.1 and use (34) to obtain from (33),

$\mathbb{E}_{n-1}\left[F(x_{n})\right]\leq F(x_{n-1})-\frac{\alpha}{2R}\left\|\nabla F(x_{n-1})\right\|_{2}^{2}+\left(2\alpha_{n}R+\frac{\alpha_{n}^{2}L}{2}\right)\mathbb{E}_{n-1}\left[\left\|u_{n}\right\|^{2}_{2}\right].$

Given that $\beta_{2}<1$, we have $\alpha_{n}\leq\frac{\alpha}{\sqrt{1-\beta_{2}}}$. Summing the previous inequality for all $n\in[N]$ and taking the complete expectation yields

$\mathbb{E}\left[F(x_{N})\right]\leq F(x_{0})-\frac{\alpha}{2R}\sum_{n=0}^{N-1}\mathbb{E}\left[\left\|\nabla F(x_{n})\right\|_{2}^{2}\right]+\left(\frac{2\alpha R}{\sqrt{1-\beta_{2}}}+\frac{\alpha^{2}L}{2(1-\beta_{2})}\right)\sum_{n=1}^{N}\mathbb{E}\left[\left\|u_{n}\right\|^{2}_{2}\right].$

Applying Lemma 5.2 for each dimension and rearranging the terms finishes the proof of Theorem 2.

(a) Average squared norm of the gradient on a toy task; see Section 6 for more details. For the $\alpha$ and $1-\beta_{2}$ curves, we initialize close to the optimum to make the $F_{0}-F_{*}$ term negligible.
(b) Average squared norm of the gradient of a small convolutional model (Gitman & Ginsburg, 2017) trained on CIFAR-10, with a random initialization. The full gradient is evaluated every epoch.
Figure 1: Observed average squared norm of the objective gradients after a fixed number of iterations when varying a single parameter out of $\alpha$, $1-\beta_{1}$ and $1-\beta_{2}$, on a toy task (left, $10^{6}$ iterations) and on CIFAR-10 (right, 600 epochs with a batch size of 128). Both panels plot $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]$ against the varied parameter on a log-log scale (plots omitted). All curves are averaged over 3 runs; error bars are negligible except for small values of $\alpha$ on CIFAR-10. See Section 6 for details.
Figure 2: Training trajectories for varying values of $\alpha\in\{10^{-4},10^{-3}\}$, $\beta_{1}\in\{0,0.5,0.8,0.9,0.99\}$ and $\beta_{2}\in\{0.9,0.99,0.999,0.9999\}$. The top row (resp. bottom row) gives the training cross-entropy loss (resp. the squared norm of the expected gradient $\mathbb{E}\left[\left\|\nabla F(x_{n})\right\|_{2}^{2}\right]$) as a function of the epoch. The left column uses all corrective terms of the original Adam algorithm, the middle column drops the corrective term on $m_{n}$ (equivalent to our proof setup), and the right column drops the corrective term on $v_{n}$. We notice a limited impact when dropping the corrective term on $m_{n}$, but dropping the corrective term on $v_{n}$ has a much stronger impact. (Plots omitted.)

6 Experiments

In Figure 1, we compare the effective dependency of the average squared norm of the gradient on the parameters $\alpha$, $\beta_{1}$ and $\beta_{2}$ for Adam, when used on a toy task and on CIFAR-10.

6.1 Setup

Toy problem.

In order to support the bounds presented in Section 4, in particular the dependency on $\beta_{2}$, we test Adam on a specifically crafted toy problem. We take $x\in\mathbb{R}^{6}$ and define for all $i\in[6]$, $p_{i}=10^{-i}$. We take $(Q_{i})_{i\in[6]}$ Bernoulli variables with $\mathbb{P}\left[Q_{i}=1\right]=p_{i}$. We then define $f$ for all $x\in\mathbb{R}^{6}$ as

$f(x)=\sum_{i\in[6]}(1-Q_{i})\operatorname{Huber}(x_{i}-1)+\frac{Q_{i}}{\sqrt{p_{i}}}\operatorname{Huber}(x_{i}+1),$ (35)

with, for all $y\in\mathbb{R}$,

$\operatorname{Huber}(y)=\begin{cases}\frac{y^{2}}{2}&\text{when }\left|y\right|\leq 1,\\ \left|y\right|-\frac{1}{2}&\text{otherwise}.\end{cases}$

Intuitively, each coordinate points most of the time towards 1, but exceptionally towards -1 with a weight of $1/\sqrt{p_{i}}$. Those rare events happen less and less often as $i$ increases, but with an increasing weight. The weights are chosen so that all the coordinates of the gradient have the same variance. (We deviate from the a.s. bounded gradient assumption for this experiment; see Section 4.2 for a discussion of a.s. bounds vs. bounds in expectation.) It is necessary to take different probabilities for each coordinate: if we use the same $p$ for all, we observe a phase transition when $1-\beta_{2}\approx p$, but not the continuous improvement we obtain in Figure 1(a).
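For reference, a minimal NumPy sketch of the stochastic gradient of the toy objective (35) is given below (the function names are ours); it can be fed to any implementation of the recursion of Section 2.2.

```python
import numpy as np

p = 10.0 ** -np.arange(1, 7)     # p_i = 10^{-i} for i = 1, ..., 6

def huber_grad(y):
    """Derivative of the Huber function: y for |y| <= 1, sign(y) otherwise."""
    return np.clip(y, -1.0, 1.0)

def stochastic_grad(x, rng):
    """Gradient of one random sample of f in (35), for x in R^6."""
    q = rng.random(6) < p        # Bernoulli variables Q_i with P[Q_i = 1] = p_i
    return np.where(q, huber_grad(x + 1) / np.sqrt(p), huber_grad(x - 1))

rng = np.random.default_rng(0)
g = stochastic_grad(np.zeros(6), rng)
```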

We plot the variation of $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]$ after $10^{6}$ iterations with batch size 1, when varying either $\alpha$, $1-\beta_{1}$ or $1-\beta_{2}$ through a range of 13 values uniformly spaced in log-scale between $10^{-6}$ and $1$. When varying $\alpha$, we take $\beta_{1}=0$ and $\beta_{2}=1-10^{-6}$. When varying $\beta_{1}$, we take $\alpha=10^{-5}$ and $\beta_{2}=1-10^{-6}$ (i.e. $\beta_{2}$ is such that we are in the Adagrad-like regime). Finally, when varying $\beta_{2}$, we take $\beta_{1}=0$ and $\alpha=10^{-6}$. We start from an $x_{0}$ close to the optimum by first running $10^{6}$ iterations with $\alpha=10^{-4}$, then $10^{6}$ iterations with $\alpha=10^{-5}$, always with $\beta_{2}=1-10^{-6}$. This allows us to have $F(x_{0})-F_{*}\approx 0$ in (11) and (13) and to focus on the second part of both bounds. All curves are averaged over three runs. Error bars are plotted but not visible in log-log scale.

CIFAR-10.

We train a simple convolutional network (Gitman & Ginsburg, 2017) on the CIFAR-10 image classification dataset (https://www.cs.toronto.edu/~kriz/cifar.html). Starting from a random initialization, we train the model on a single V100 for 600 epochs with a batch size of 128, evaluating the full training gradient after each epoch. This is a proxy for $\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]$, which would be too costly to evaluate exactly. All runs use the default configuration $\alpha=10^{-3}$, $\beta_{2}=0.999$ and $\beta_{1}=0.9$, and we then change one of the parameters.

We take $\alpha$ from a uniform range in log-space between $10^{-6}$ and $10^{-2}$ with 9 values; for $1-\beta_{1}$ the range is from $10^{-5}$ to $0.3$ with 9 values, and for $1-\beta_{2}$, from $10^{-6}$ to $10^{-1}$ with 11 values. Unlike for the toy problem, we do not initialize close to the optimum, as even after 600 epochs, the norm of the gradients indicates that we are not at a critical point. All curves are averaged over three runs. Error bars are plotted but not visible in log-log scale, except for large values of $\alpha$.

6.2 Analysis

Toy problem.

Looking at Figure 1(a), we observe a continual improvement as $\beta_{2}$ increases. Fitting a linear regression in log-log scale of $\mathbb{E}[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}]$ with respect to $1-\beta_{2}$ gives a slope of 0.56, which is compatible with our bound (11), in particular the dependency in $O(1/\sqrt{1-\beta_{2}})$. As we initialize close to the optimum, a small step size $\alpha$ yields, as expected, the best performance. Doing the same regression in log-log scale, we find a slope of 0.87, which is again compatible with the $O(\alpha)$ dependency of the second term in (11). Finally, we observe a limited impact of $\beta_{1}$, except when $1-\beta_{1}$ is small. The regression in log-log scale gives a slope of -0.16, while our bound predicts a slope of -1.
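The slopes reported above come from an ordinary least-squares fit in log-log space; a minimal sketch with placeholder measurements (to be replaced by the actual experimental values) is shown below.

```python
import numpy as np

# Placeholder data: the swept values of 1 - beta2 and hypothetical measurements
# of E[||grad F(x_tau)||^2]; replace with the actual experimental values.
one_minus_beta2 = np.logspace(-6, 0, 13)
grad_norm_sq = 1e-3 * one_minus_beta2 ** 0.5   # hypothetical, slope 0.5 by construction

slope, intercept = np.polyfit(np.log(one_minus_beta2), np.log(grad_norm_sq), 1)
print("log-log slope:", slope)                 # ~0.5 for this synthetic data
```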

CIFAR 10.

Let us now turn to Figure 1(b). As we start from random weights for this problem, we observe that a large step size gives the best performance, although we observe a high variance for the largest $\alpha$. This indicates that training becomes unstable for large $\alpha$, which is not predicted by the theory. This is likely a consequence of the bounded gradient assumption (7) not being verified for deep neural networks. We observe a small improvement as $1-\beta_{2}$ decreases, although nowhere near what we observed on our toy problem. Finally, we observe a sweet spot for the momentum $\beta_{1}$, not predicted by our theory. We conjecture that this is due to the variance reduction effect of momentum (averaging of the gradients over multiple mini-batches, while the weights have not moved so much as to invalidate past information).

6.3 Impact of the Adam corrective terms

Using the same experimental setup on CIFAR-10, we compare the impact of removing either of the corrective terms of the original Adam algorithm (Kingma & Ba, 2015), as discussed in Section 2.2. We ran a Cartesian product of trainings for 100 epochs, with $\beta_{1}\in\{0,0.5,0.8,0.9,0.99\}$, $\beta_{2}\in\{0.9,0.99,0.999,0.9999\}$, and $\alpha\in\{10^{-4},10^{-3}\}$. We report both the training loss and the norm of the expected gradient in Figure 2. We notice a limited difference when dropping the corrective term on $m_{n}$, but dropping the term for $v_{n}$ has an important impact on the training trajectories. This confirms our motivation for simplifying the proof by removing the corrective term on the momentum.

7 Conclusion

We provide a simple proof of the convergence of Adam and Adagrad without heavy-ball style momentum. Our analysis highlights a link between the two algorithms: with the right hyper-parameters, Adam converges like Adagrad. The extension to heavy-ball momentum is more complex, but we significantly improve the dependence on the momentum parameter for Adam, Adagrad, as well as SGD. We exhibit a toy problem where the dependency on $\alpha$ and $\beta_{2}$ experimentally matches our prediction. However, we do not predict the practical interest of momentum, so improvements to the proof are left for future work.

Broader Impact Statement

The present theoretical results on the optimization of non-convex losses in a stochastic setting impact our understanding of the training of deep neural networks. They might allow a deeper understanding of neural network training dynamics and thus reinforce existing deep learning applications. However, there would be no direct negative impact on society.

References

  • Agarwal et al. (2009) Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, 2009.
  • Chen et al. (2019) Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2019.
  • Defazio (2020) Aaron Defazio. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2020.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2011.
  • Duchi et al. (2013) John Duchi, Michael I Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems 26, 2013.
  • Fang & Klabjan (2019) Biyi Fang and Diego Klabjan. Convergence analyses of online adam algorithm in convex setting and two-layer relu neural network. arXiv preprint arXiv:1905.09356, 2019.
  • Faw et al. (2022) Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, and Rachel Ward. The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, Proceedings of Machine Learning Research. PMLR, 2022.
  • Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2013.
  • Gitman & Ginsburg (2017) Igor Gitman and Boris Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. arXiv preprint arXiv:1709.08145, 2017.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of the International Conference on Learning Representations (ICLR), 2015.
  • Lacroix et al. (2018) Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297, 2018.
  • Li & Orabona (2019) Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In AI Stats, 2019.
  • Liu et al. (2020) Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  18261–18271. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/d3f5d4de09ea19461dab00590df91e4f-Paper.pdf.
  • McMahan & Streeter (2010) H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
  • Polyak (1964) Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1964.
  • Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In Proc. of the International Conference on Learning Representations (ICLR), 2018.
  • Shi et al. (2020) Naichen Shi, Dawei Li, Mingyi Hong, and Ruoyu Sun. Rmsprop converges with proper hyper-parameter. In International Conference on Learning Representations, 2020.
  • Tieleman & Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5 — rmsprop. COURSERA: Neural Networks for Machine Learning, 2012.
  • Ward et al. (2019) Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, 2019.
  • Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.
  • Yang et al. (2016) Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.
  • Zhou et al. (2018) Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
  • Zou et al. (2019a) Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019a.
  • Zou et al. (2019b) Fangyu Zou, Li Shen, Zequn Jie, Ju Sun, and Wei Liu. Weighted Adagrad with unified momentum. arXiv preprint arXiv:1808.03408, 2019b.

Supplementary material for A Simple Convergence Proof of Adam and Adagrad

Overview

In Section A, we detail the results for the convergence of Adam and Adagrad with heavy-ball momentum. For an overview of the contributions of our proof technique, see Section A.4.

Then in Section B, we show how our technique also applies to SGD and improves its dependency on $\beta_{1}$ compared with previous work by Yang et al. (2016), from $O((1-\beta_{1})^{-2})$ to $O((1-\beta_{1})^{-1})$. The proof is simpler than for Adam/Adagrad, and shows the generality of our technique.

Appendix A Convergence of adaptive methods with heavy-ball momentum

A.1 Setup and notations

We recall the dynamical system introduced in Section 2.2. In the rest of this section, we take an iteration $n\in\mathbb{N}^{*}$, and when needed, $i\in[d]$ refers to a specific coordinate. Given $x_{0}\in\mathbb{R}^{d}$ our starting point, $m_{0}=0$, and $v_{0}=0$, we define

$\begin{cases}m_{n,i}&=\beta_{1}m_{n-1,i}+\nabla_{i}f_{n}(x_{n-1}),\\ v_{n,i}&=\beta_{2}v_{n-1,i}+\left(\nabla_{i}f_{n}(x_{n-1})\right)^{2},\\ x_{n,i}&=x_{n-1,i}-\alpha_{n}\frac{m_{n,i}}{\sqrt{\epsilon+v_{n,i}}}.\end{cases}$ (A.1)

For Adam, the step size is given by

$\alpha_{n}=\alpha(1-\beta_{1})\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}.$ (A.2)

For Adagrad (potentially extended with heavy-ball momentum), we have $\beta_{2}=1$ and

$\alpha_{n}=\alpha(1-\beta_{1}).$ (A.3)

Notice we include the factor $1-\beta_{1}$ in the step size rather than in (A.1), as this allows for a more elegant proof. The original Adam algorithm included compensation factors for both $\beta_{1}$ and $\beta_{2}$ (Kingma & Ba, 2015) to correct the initial scale of $m$ and $v$, which are initialized at 0. Adam would be exactly recovered by replacing (A.2) with

$\alpha_{n}=\alpha\frac{1-\beta_{1}}{1-\beta_{1}^{n}}\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}.$ (A.4)

However, the denominator $1-\beta_{1}^{n}$ potentially makes $(\alpha_{n})_{n\in\mathbb{N}^{*}}$ non-monotonic, which complicates the proof. Thus, we instead replace the denominator by its limit value for $n\rightarrow\infty$. This has little practical impact as (i) early iterates are noisy because $v$ is averaged over a small number of gradients, so making smaller steps can be more stable, and (ii) for $\beta_{1}=0.9$ (Kingma & Ba, 2015), (A.2) differs from (A.4) only for the first 50 iterations.

Throughout the proof, we denote by 𝔼n1[]\mathbb{E}_{n-1}\left[\cdot\right] the conditional expectation with respect to f1,,fn1f_{1},\ldots,f_{n-1}. In particular, xn1x_{n-1} and vn1v_{n-1} are deterministic given f1,,fn1f_{1},\ldots,f_{n-1}. We introduce

Gn=F(xn1)andgn=fn(xn1).G_{n}=\nabla F(x_{n-1})\quad\text{and}\quad g_{n}=\nabla f_{n}(x_{n-1}). (A.5)

Like in Section 5.2, we introduce the update undu_{n}\in\mathbb{R}^{d}, as well as the update without heavy-ball momentum UndU_{n}\in\mathbb{R}^{d}:

un,i=mn,iϵ+vn,iandUn,i=gn,iϵ+vn,i.u_{n,i}=\frac{m_{n,i}}{\sqrt{\epsilon+v_{n,i}}}\quad\text{and}\quad U_{n,i}=\frac{g_{n,i}}{\sqrt{\epsilon+v_{n,i}}}. (A.6)

For any kk\in\mathbb{N} with k<nk<n, we define v~n,kd\tilde{v}_{n,k}\in\mathbb{R}^{d} by

v~n,k,i=β2kvnk,i+𝔼nk1[j=nk+1nβ2njgj,i2],\tilde{v}_{n,k,i}=\beta_{2}^{k}v_{n-k,i}+\mathbb{E}_{n-k-1}\left[\sum_{j=n-k+1}^{n}\beta_{2}^{n-j}g_{j,i}^{2}\right], (A.7)

i.e., the contributions from the kk last gradients are replaced by their expected value given f1,,fnk1f_{1},\ldots,f_{n-k-1}. For k=1k=1, we recover the same definition as in (18).

A.2 Results

For any total number of iterations NN\in\mathbb{N}^{*}, we define a random index τN\tau_{N} taking values in {0,,N1}\{0,\ldots,N-1\} and verifying

j,j<N,[τ=j]1β1Nj.\displaystyle\forall j\in\mathbb{N},j<N,\mathbb{P}\left[\tau=j\right]\propto 1-\beta_{1}^{N-j}. (A.8)

If β1=0\beta_{1}=0, this is equivalent to sampling τ\tau uniformly in {0,,N1}\{0,\ldots,N-1\}. If β1>0\beta_{1}>0, roughly the last 11β1\frac{1}{1-\beta_{1}} iterations are sampled rarely, and all iterations older than a few times that number are sampled almost uniformly. We bound the expected squared norm of the total gradient at iteration τ\tau, which is standard for non-convex stochastic optimization (Ghadimi & Lan, 2013).
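As an illustration, the sampling rule (A.8) can be implemented in a few lines of NumPy; the function name and interface below are ours, not the paper's.

```python
import numpy as np

def sample_tau(N, beta1, rng=None):
    """Sample tau in {0, ..., N - 1} with P[tau = j] proportional to 1 - beta1**(N - j), as in (A.8)."""
    rng = np.random.default_rng() if rng is None else rng
    j = np.arange(N)
    weights = 1.0 - beta1 ** (N - j)  # reduces to uniform sampling when beta1 = 0
    return int(rng.choice(N, p=weights / weights.sum()))
```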

Note that, as in previous works, the bound worsens as β1\beta_{1} increases, with a dependency of the form O((1β1)1)O((1-\beta_{1})^{-1}). This is a significant improvement over the existing bound for Adagrad with heavy-ball momentum, which scales as (1β1)3(1-\beta_{1})^{-3} (Zou et al., 2019b), and over the best known bound for Adam, which scales as (1β1)5(1-\beta_{1})^{-5} (Zou et al., 2019a).

Technical lemmas to prove the following theorems are introduced in Section A.5, while the proof of Theorems 3 and 4 are provided in Section A.6.

See 3

See 4

A.3 Analysis of the results with momentum

First notice that taking β10\beta_{1}\rightarrow 0 in Theorems 3 and 4, we almost recover the results stated in Theorems 2 and 1, only losing on the term 4dR24dR^{2}, which becomes 12dR212dR^{2}.

Simplified expressions with momentum

Assuming Nβ11β1N\gg\frac{\beta_{1}}{1-\beta_{1}} and β1/β2β1\beta_{1}/\beta_{2}\approx\beta_{1}, which is verified for typical values of β1\beta_{1} and β2\beta_{2} (Kingma & Ba, 2015), it is possible to simplify the bound for Adam (13) as

𝔼[F(xτ)2]2RF(x0)FαN\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\lessapprox 2R\frac{F(x_{0})-F_{*}}{\alpha N}
+(αdRL1β2+12dR2(1β1)1β2+2α2dL2β1(1β1)(1β2)3/2)(1Nln(1+R2ϵ(1β2))ln(β2)).\displaystyle\qquad+\left(\frac{\alpha dRL}{1-\beta_{2}}+\frac{12dR^{2}}{(1-\beta_{1})\sqrt{1-\beta_{2}}}+\frac{2\alpha^{2}dL^{2}\beta_{1}}{(1-\beta_{1})(1-\beta_{2})^{3/2}}\right)\left(\frac{1}{N}\ln\left(1+\frac{R^{2}}{\epsilon(1-\beta_{2})}\right)-\ln(\beta_{2})\right). (A.9)

Similarly, if we assume Nβ11β1N\gg\frac{\beta_{1}}{1-\beta_{1}}, we can simplify the bound for Adagrad with momentum (Theorem 3) as

𝔼[F(xτ)2]2RF(x0)FαN+1N(αdRL+12dR21β1+2α2dL2β11β1)ln(1+NR2ϵ),\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\lessapprox 2R\frac{F(x_{0})-F_{*}}{\alpha\sqrt{N}}+\frac{1}{\sqrt{N}}\left(\alpha dRL+\frac{12dR^{2}}{1-\beta_{1}}+\frac{2\alpha^{2}dL^{2}\beta_{1}}{1-\beta_{1}}\right)\ln\left(1+\frac{NR^{2}}{\epsilon}\right), (A.10)
Optimal finite horizon Adam is still Adagrad

We can perform the same finite horizon analysis as in Section 4.3. If we take α=α~N\alpha=\frac{\tilde{\alpha}}{\sqrt{N}} and β2=11/N\beta_{2}=1-1/N, then (A.9) simplifies to

𝔼[F(xτ)2]2RF(x0)Fα~N+1N(α~dRL+12dR21β1+2α~2dL2β11β1)(ln(1+NR2ϵ)+1).\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|^{2}\right]\lessapprox 2R\frac{F(x_{0})-F_{*}}{\tilde{\alpha}\sqrt{N}}+\frac{1}{\sqrt{N}}\left(\tilde{\alpha}dRL+\frac{12dR^{2}}{1-\beta_{1}}+\frac{2\tilde{\alpha}^{2}dL^{2}\beta_{1}}{1-\beta_{1}}\right)\left(\ln\left(1+\frac{NR^{2}}{\epsilon}\right)+1\right). (A.11)

The term (1β2)3/2(1-\beta_{2})^{3/2} in the denominator in (A.9) is indeed compensated by the α2\alpha^{2} in the numerator and we again recover the proper ln(N)/N\ln(N)/\sqrt{N} convergence rate, which matches (A.10) up to a +1+1 term next to the log.
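As an illustration only, the right-hand side of the simplified bound (A.10) is easy to evaluate numerically; the helper below is a sketch, and all constants passed to it (R, L, F(x_0) − F_*, etc.) would have to be assumed or estimated for a given problem.

```python
import math

def adagrad_momentum_bound(N, d, R, L, alpha, beta1, eps, f0_minus_fstar):
    """Right-hand side of the simplified Adagrad bound (A.10); illustrative helper only."""
    first = 2.0 * R * f0_minus_fstar / (alpha * math.sqrt(N))
    factor = (alpha * d * R * L
              + 12.0 * d * R ** 2 / (1.0 - beta1)
              + 2.0 * alpha ** 2 * d * L ** 2 * beta1 / (1.0 - beta1))
    return first + factor * math.log(1.0 + N * R ** 2 / eps) / math.sqrt(N)
```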

A.4 Overview of the proof, contributions and limitations

There are a number of steps to the proof. First, we derive a lemma similar in spirit to the descent Lemma 5.1. There are two differences: first, when computing the dot product between the current expected gradient and each past gradient contained in the momentum, we have to re-center the expected gradient to its values in the past, using the smoothness assumption. Second, we now have to decorrelate more terms between the numerator and denominator, as the numerator contains not only the latest gradient but a decaying sum of the past ones. We similarly extend Lemma 5.2 to support momentum-specific terms. The rest of the proof follows mostly as in Section 5, except with a few more manipulations to regroup the gradient terms coming from different iterations.

Compared with previous work (Zou et al., 2019b; a), the re-centering of past gradients in (A.14) is a key aspect to improve the dependency in β1\beta_{1}, with a small price to pay using the smoothness of FF, which is compensated by the introduction of the extra Gnk,i2G_{n-k,i}^{2} terms in (A.12). Then, a tight handling of the different summations, as well as the introduction of a non-uniform sampling of the iterates (A.8), which naturally arises when grouping the different terms in (A.49), allows us to obtain the overall improved dependency in O((1β1)1)O((1-\beta_{1})^{-1}).

The same technique can be applied to SGD; the proof becomes simpler as there is no correlation between the step size and the gradient estimate, see Section B. Readers who want to understand the handling of momentum without the added complexity of adaptive methods may wish to start with this proof.

A limitation of the proof technique is that we do not show that heavy-ball momentum can lead to a variance reduction of the update. Either more powerful probabilistic results or extra regularity assumptions could allow us to further improve our worst-case bounds on the variance of the update, which in turn might lead to a bound that improves when using heavy-ball momentum.

A.5 Technical lemmas

We first need an updated version of Lemma 5.1 that includes momentum.

Lemma A.1 (Adaptive update with momentum approximately follows a descent direction).

Given x0dx_{0}\in\mathbb{R}^{d}, the iterates defined by the system (A.1) for (αj)j(\alpha_{j})_{j\in\mathbb{N}^{*}} that is non-decreasing, and under the conditions (6), (7), and (8), as well as 0β1<β210\leq\beta_{1}<\beta_{2}\leq 1, we have for all iterations nn\in\mathbb{N}^{*},

𝔼[i[d]Gn,imn,iϵ+vn,i]12(i[d]k=0n1β1k𝔼[Gnk,i2ϵ+v~n,k+1,i])\displaystyle\mathbb{E}\left[\sum_{i\in[d]}G_{n,i}\frac{m_{n,i}}{\sqrt{\epsilon+v_{n,i}}}\right]\geq\frac{1}{2}\left(\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]\right)
αn2L24R1β1(l=1n1unl22k=ln1β1kk)3R1β1(k=0n1(β1β2)kk+1Unk22).\displaystyle\qquad-\frac{\alpha_{n}^{2}L^{2}}{4R}\sqrt{1-\beta_{1}}\left(\sum_{l=1}^{n-1}\left\|u_{n-l}\right\|_{2}^{2}\sum_{k=l}^{n-1}\beta_{1}^{k}\sqrt{k}\right)-\frac{3R}{\sqrt{1-\beta_{1}}}\left(\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\left\|U_{n-k}\right\|_{2}^{2}\right). (A.12)
Proof.

We use multiple times (22) in this proof, which we repeat here for convenience,

λ>0,x,y,xyλ2x2+y22λ.\forall\lambda>0,\,x,y\in\mathbb{R},xy\leq\frac{\lambda}{2}x^{2}+\frac{y^{2}}{2\lambda}. (A.13)
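For completeness, (A.13) follows from expanding a square:

```latex
0 \leq \left(\sqrt{\lambda}\,x - \frac{y}{\sqrt{\lambda}}\right)^{2}
  = \lambda x^{2} - 2xy + \frac{y^{2}}{\lambda}
\quad\Longrightarrow\quad
xy \leq \frac{\lambda}{2}x^{2} + \frac{y^{2}}{2\lambda}.
```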

Let us take an iteration nn\in\mathbb{N}^{*} for the duration of the proof. We have

i[d]Gn,imn,iϵ+vn,i\displaystyle\sum_{i\in[d]}G_{n,i}\frac{m_{n,i}}{\sqrt{\epsilon+v_{n,i}}} =i[d]k=0n1β1kGn,ignk,iϵ+vn,i\displaystyle=\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}G_{n,i}\frac{g_{n-k,i}}{\sqrt{\epsilon+v_{n,i}}}
=i[d]k=0n1β1kGnk,ignk,iϵ+vn,iA+i[d]k=0n1β1k(Gn,iGnk,i)gnk,iϵ+vn,iB,\displaystyle=\underbrace{\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}G_{n-k,i}\frac{g_{n-k,i}}{\sqrt{\epsilon+v_{n,i}}}}_{A}+\underbrace{\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\left(G_{n,i}-G_{n-k,i}\right)\frac{g_{n-k,i}}{\sqrt{\epsilon+v_{n,i}}}}_{B}, (A.14)

Let us now take an index 0kn10\leq k\leq n-1. We show that the contribution of past gradients GnkG_{n-k} and gnkg_{n-k} due to the heavy-ball momentum can be controlled thanks to the decay term β1k\beta_{1}^{k}. Let us first have a look at BB. Using (A.13) with

λ=1β12Rk+1,x=|Gn,iGnk,i|,y=|gnk,i|ϵ+vn,i,\lambda=\frac{\sqrt{1-\beta_{1}}}{2R\sqrt{k+1}},\;x=\left|G_{n,i}-G_{n-k,i}\right|,\;y=\frac{\left|g_{n-k,i}\right|}{\sqrt{\epsilon+v_{n,i}}},

we have

|B|\displaystyle\left|B\right| i[d]k=0n1β1k(1β14Rk+1(Gn,iGnk,i)2+Rk+11β1gnk,i2ϵ+vn,i).\displaystyle\leq\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\left(\frac{\sqrt{1-\beta_{1}}}{4R\sqrt{k+1}}\left(G_{n,i}-G_{n-k,i}\right)^{2}+\frac{R\sqrt{k+1}}{\sqrt{1-\beta_{1}}}\frac{g_{n-k,i}^{2}}{\epsilon+v_{n,i}}\right). (A.15)

Notice first that for any dimension i[d]i\in[d], ϵ+vn,iϵ+β2kvnk,iβ2k(ϵ+vnk,i)\epsilon+v_{n,i}\geq\epsilon+\beta_{2}^{k}v_{n-k,i}\geq\beta_{2}^{k}(\epsilon+v_{n-k,i}), so that

gnk,i2ϵ+vn,i1β2kUnk,i2\frac{g^{2}_{n-k,i}}{\epsilon+v_{n,i}}\leq\frac{1}{\beta_{2}^{k}}U_{n-k,i}^{2} (A.16)

Besides, using the L-smoothness of FF given by (8), we have

GnGnk22\displaystyle\left\|G_{n}-G_{n-k}\right\|_{2}^{2} L2xn1xnk122\displaystyle\leq L^{2}\left\|x_{n-1}-x_{n-k-1}\right\|_{2}^{2}
=L2l=1kαnlunl22\displaystyle=L^{2}\left\|\sum_{l=1}^{k}\alpha_{n-l}u_{n-l}\right\|_{2}^{2}
αn2L2kl=1kunl22,\displaystyle\leq\alpha_{n}^{2}L^{2}k\sum_{l=1}^{k}\left\|u_{n-l}\right\|^{2}_{2}, (A.17)

using Jensen's inequality and the fact that αn\alpha_{n} is non-decreasing. Injecting (A.16) and (A.17) into (A.15), we obtain

|B|\displaystyle\left|B\right| (k=0n1αn2L24R1β1β1kkl=1kunl22)+(k=0n1R1β1(β1β2)kk+1Unk22)\displaystyle\leq\left(\sum_{k=0}^{n-1}\frac{\alpha_{n}^{2}L^{2}}{4R}\sqrt{1-\beta_{1}}\beta_{1}^{k}\sqrt{k}\sum_{l=1}^{k}\left\|u_{n-l}\right\|^{2}_{2}\right)+\left(\sum_{k=0}^{n-1}\frac{R}{\sqrt{1-\beta_{1}}}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\left\|U_{n-k}\right\|_{2}^{2}\right)
=1β1αn2L24R(l=1n1unl22k=ln1β1kk)+R1β1(k=0n1(β1β2)kk+1Unk22).\displaystyle=\sqrt{1-\beta_{1}}\frac{\alpha_{n}^{2}L^{2}}{4R}\left(\sum_{l=1}^{n-1}\left\|u_{n-l}\right\|_{2}^{2}\sum_{k=l}^{n-1}\beta_{1}^{k}\sqrt{k}\right)+\frac{R}{\sqrt{1-\beta_{1}}}\left(\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\left\|U_{n-k}\right\|_{2}^{2}\right). (A.18)

Now going back to the AA term in (A.14), we will study the main term of the summation, i.e. for i[d]i\in[d] and k<nk<n

𝔼[Gnk,ignk,iϵ+vn,i]=𝔼[iF(xnk1)ifnk(xnk1)ϵ+vn,i].\displaystyle\mathbb{E}\left[G_{n-k,i}\frac{g_{n-k,i}}{\sqrt{\epsilon+v_{n,i}}}\right]=\mathbb{E}\left[\nabla_{i}F(x_{n-k-1})\frac{\nabla_{i}f_{n-k}(x_{n-k-1})}{\sqrt{\epsilon+v_{n,i}}}\right]. (A.19)

Notice that we could almost apply Lemma 5.1 to it, except that we have vn,iv_{n,i} in the denominator instead of vnk,iv_{n-k,i}. Thus we will need to extend the proof to decorrelate more terms. We will further drop indices in the rest of the proof, noting G=Gnk,iG=G_{n-k,i}, g=gnk,ig=g_{n-k,i}, v~=v~n,k+1,i\tilde{v}=\tilde{v}_{n,k+1,i} and v=vn,iv=v_{n,i}. Finally, let us note

δ2=j=nknβ2njgj,i2andr2=𝔼nk1[δ2].\delta^{2}=\sum_{j=n-k}^{n}\beta_{2}^{n-j}g_{j,i}^{2}\qquad\text{and}\qquad r^{2}=\mathbb{E}_{n-k-1}\left[\delta^{2}\right]. (A.20)

In particular we have v~v=r2δ2\tilde{v}-v=r^{2}-\delta^{2}. With our new notations, we can rewrite (A.19) as

𝔼[Ggϵ+v]\displaystyle\mathbb{E}\left[G\frac{g}{\sqrt{\epsilon+v}}\right] =𝔼[Ggϵ+v~+Gg(1ϵ+v1ϵ+v~)]\displaystyle=\mathbb{E}\left[G\frac{g}{\sqrt{\epsilon+\tilde{v}}}+Gg\left(\frac{1}{\sqrt{\epsilon+v}}-\frac{1}{\sqrt{\epsilon+\tilde{v}}}\right)\right]
=𝔼[𝔼nk1[Ggϵ+v~]+Ggr2δ2ϵ+vϵ+v~(ϵ+v+ϵ+v~)]\displaystyle=\mathbb{E}\left[\mathbb{E}_{n-k-1}\left[G\frac{g}{\sqrt{\epsilon+\tilde{v}}}\right]+Gg\frac{r^{2}-\delta^{2}}{\sqrt{\epsilon+v}\sqrt{\epsilon+\tilde{v}}(\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}})}\right]
=𝔼[G2ϵ+v~]+𝔼[Ggr2δ2ϵ+vϵ+v~(ϵ+v+ϵ+v~)C].\displaystyle=\mathbb{E}\left[\frac{G^{2}}{\sqrt{\epsilon+\tilde{v}}}\right]+\mathbb{E}\left[\underbrace{Gg\frac{r^{2}-\delta^{2}}{\sqrt{\epsilon+v}\sqrt{\epsilon+\tilde{v}}(\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}})}}_{C}\right]. (A.21)

We first focus on CC:

|C|\displaystyle\left|C\right| |Gg|r2ϵ+v(ϵ+v~)κ+|Gg|δ2(ϵ+v)ϵ+v~ρ,\displaystyle\leq\underbrace{\left|Gg\right|\frac{r^{2}}{\sqrt{\epsilon+v}(\epsilon+\tilde{v})}}_{\kappa}+\underbrace{\left|Gg\right|\frac{\delta^{2}}{(\epsilon+v)\sqrt{\epsilon+\tilde{v}}}}_{\rho},

due to the fact that ϵ+v+ϵ+v~max(ϵ+v,ϵ+v~)\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde{v}}\geq\max(\sqrt{\epsilon+v},\sqrt{\epsilon+\tilde{v}}) and |r2δ2|r2+δ2\left|r^{2}-\delta^{2}\right|\leq r^{2}+\delta^{2}.

Applying (A.13) to κ\kappa with

λ=1β1ϵ+v~2,x=|G|ϵ+v~,y=|g|r2ϵ+v~ϵ+v,\lambda=\frac{\sqrt{1-\beta_{1}}\sqrt{\epsilon+\tilde{v}}}{2},\;x=\frac{\left|G\right|}{\sqrt{\epsilon+\tilde{v}}},\;y=\frac{\left|g\right|r^{2}}{\sqrt{\epsilon+\tilde{v}}\sqrt{\epsilon+v}},

we obtain

κG24ϵ+v~+11β1g2r4(ϵ+v~)3/2(ϵ+v).\displaystyle\kappa\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{1}{\sqrt{1-\beta_{1}}}\frac{g^{2}r^{4}}{(\epsilon+\tilde{v})^{3/2}(\epsilon+v)}.

Given that ϵ+v~r2\epsilon+\tilde{v}\geq r^{2} and taking the conditional expectation, we can simplify as

𝔼nk1[κ]G24ϵ+v~+11β1r2ϵ+v~𝔼nk1[g2ϵ+v].\displaystyle\mathbb{E}_{n-k-1}\left[\kappa\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{1}{\sqrt{1-\beta_{1}}}\frac{r^{2}}{\sqrt{\epsilon+\tilde{v}}}\mathbb{E}_{n-k-1}\left[\frac{g^{2}}{\epsilon+v}\right]. (A.22)

Now turning to ρ\rho, we use (A.13) with

λ=1β1ϵ+v~2r2,x=|Gδ|ϵ+v~,y=|δg|ϵ+v,\lambda=\frac{\sqrt{1-\beta_{1}}\sqrt{\epsilon+\tilde{v}}}{2r^{2}},\;x=\frac{\left|G\delta\right|}{\sqrt{\epsilon+\tilde{v}}},\;y=\frac{\left|\delta g\right|}{\epsilon+v},

we obtain

ρG24ϵ+v~δ2r2+11β1r2ϵ+v~g2δ2(ϵ+v)2.\displaystyle\rho\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}\frac{\delta^{2}}{r^{2}}+\frac{1}{\sqrt{1-\beta_{1}}}\frac{r^{2}}{\sqrt{\epsilon+\tilde{v}}}\frac{g^{2}\delta^{2}}{(\epsilon+v)^{2}}. (A.23)

Given that ϵ+vδ2\epsilon+v\geq\delta^{2}, and 𝔼nk1[δ2r2]=1\mathbb{E}_{n-k-1}\left[\frac{\delta^{2}}{r^{2}}\right]=1, we obtain after taking the conditional expectation,

𝔼nk1[ρ]G24ϵ+v~+11β1r2ϵ+v~𝔼nk1[g2ϵ+v].\displaystyle\mathbb{E}_{n-k-1}\left[\rho\right]\leq\frac{G^{2}}{4\sqrt{\epsilon+\tilde{v}}}+\frac{1}{\sqrt{1-\beta_{1}}}\frac{r^{2}}{\sqrt{\epsilon+\tilde{v}}}\mathbb{E}_{n-k-1}\left[\frac{g^{2}}{\epsilon+v}\right]. (A.24)

Notice that in (A.23), we possibly divide by zero. It suffices to notice that if r2=0r^{2}=0, then δ2=0\delta^{2}=0 a.s., so that ρ=0\rho=0 and (A.24) is still verified. Summing (A.22) and (A.24), we get

𝔼nk1[|C|]G22ϵ+v~+21β1r2ϵ+v~𝔼nk1[g2ϵ+v].\displaystyle\mathbb{E}_{n-k-1}\left[\left|C\right|\right]\leq\frac{G^{2}}{2\sqrt{\epsilon+\tilde{v}}}+\frac{2}{\sqrt{1-\beta_{1}}}\frac{r^{2}}{\sqrt{\epsilon+\tilde{v}}}\mathbb{E}_{n-k-1}\left[\frac{g^{2}}{\epsilon+v}\right]. (A.25)

Given that rϵ+v~r\leq\sqrt{\epsilon+\tilde{v}} by definition of v~\tilde{v}, and that using (7), rk+1Rr\leq\sqrt{k+1}R, we have (note that we do not need the almost sure bound on the gradient: a bound on 𝔼[f(x)2]\mathbb{E}\left[\left\|\nabla f(x)\right\|_{\infty}^{2}\right] would be sufficient), reintroducing the indices we had dropped,

𝔼nk1[|C|]Gnk,i22ϵ+v~n,k+1,i+2R1β1k+1𝔼nk1[gnk,i2ϵ+vn,i].\displaystyle\mathbb{E}_{n-k-1}\left[\left|C\right|\right]\leq\frac{G_{n-k,i}^{2}}{2\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}+\frac{2R}{\sqrt{1-\beta_{1}}}\sqrt{k+1}\mathbb{E}_{n-k-1}\left[\frac{g_{n-k,i}^{2}}{\epsilon+v_{n,i}}\right]. (A.26)

Taking the complete expectation and using that by definition ϵ+vn,iϵ+β2kvnk,iβ2k(ϵ+vnk,i)\epsilon+v_{n,i}\geq\epsilon+\beta_{2}^{k}v_{n-k,i}\geq\beta_{2}^{k}(\epsilon+v_{n-k,i}) we get

𝔼[|C|]12𝔼[Gnk,i2ϵ+v~n,k+1,i]+2R1β1β2kk+1𝔼[gnk,i2ϵ+vnk,i].\displaystyle\mathbb{E}\left[\left|C\right|\right]\leq\frac{1}{2}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]+\frac{2R}{\sqrt{1-\beta_{1}}\beta_{2}^{k}}\sqrt{k+1}\mathbb{E}\left[\frac{g_{n-k,i}^{2}}{\epsilon+v_{n-k,i}}\right]. (A.27)

Injecting (A.27) into (A.21) gives us

𝔼[A]\displaystyle\mathbb{E}\left[A\right] i[d]k=0n1β1k(𝔼[Gnk,i2ϵ+v~n,k+1,i](12𝔼[Gnk,i2ϵ+v~n,k+1,i]+2R1β1β2kk+1𝔼[gnk,i2ϵ+vnk,i]))\geq\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\left(\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]-\left(\frac{1}{2}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]+\frac{2R}{\sqrt{1-\beta_{1}}\beta_{2}^{k}}\sqrt{k+1}\mathbb{E}\left[\frac{g_{n-k,i}^{2}}{\epsilon+v_{n-k,i}}\right]\right)\right)
=12(i[d]k=0n1β1k𝔼[Gnk,i2ϵ+v~n,k+1,i])2R1β1(i[d]k=0n1(β1β2)kk+1𝔼[Unk22]).\displaystyle=\frac{1}{2}\left(\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]\right)-\frac{2R}{\sqrt{1-\beta_{1}}}\left(\sum_{i\in[d]}\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\mathbb{E}\left[\left\|U_{n-k}\right\|_{2}^{2}\right]\right). (A.28)

Injecting (A.28) and (A.18) into (A.14) finishes the proof.

Similarly, we will need an updated version of Lemma 5.2.

Lemma A.2 (sum of ratios of the square of a decayed sum and a decayed sum of squares).

We assume we have 0<β210<\beta_{2}\leq 1 and 0<β1<β20<\beta_{1}<\beta_{2}, and a sequence of real numbers (an)n(a_{n})_{n\in\mathbb{N}^{*}}. We define bn=j=1nβ2njaj2b_{n}=\sum_{j=1}^{n}\beta_{2}^{n-j}a_{j}^{2} and cn=j=1nβ1njajc_{n}=\sum_{j=1}^{n}\beta_{1}^{n-j}a_{j}. Then we have

j=1ncj2ϵ+bj1(1β1)(1β1/β2)(ln(1+bnϵ)nln(β2)).\sum_{j=1}^{n}\frac{c_{j}^{2}}{\epsilon+b_{j}}\leq\frac{1}{(1-\beta_{1})(1-\beta_{1}/\beta_{2})}\left(\ln\left(1+\frac{b_{n}}{\epsilon}\right)-n\ln(\beta_{2})\right). (A.29)
Proof.

Let us take jj\in\mathbb{N}^{*} with jnj\leq n. Using Jensen's inequality, we have

cj211β1l=1jβ1jlal2,\displaystyle c_{j}^{2}\leq\frac{1}{1-\beta_{1}}\sum_{l=1}^{j}\beta_{1}^{j-l}a_{l}^{2},

so that

cj2ϵ+bj\displaystyle\frac{c_{j}^{2}}{\epsilon+b_{j}} 11β1l=1jβ1jlal2ϵ+bj.\displaystyle\leq\frac{1}{1-\beta_{1}}\sum_{l=1}^{j}\beta_{1}^{j-l}\frac{a_{l}^{2}}{\epsilon+b_{j}}.

Given that for l[j]l\in[j], we have by definition ϵ+bjϵ+β2jlblβ2jl(ϵ+bl)\epsilon+b_{j}\geq\epsilon+\beta_{2}^{j-l}b_{l}\geq\beta_{2}^{j-l}(\epsilon+b_{l}), we get

cj2ϵ+bj\displaystyle\frac{c_{j}^{2}}{\epsilon+b_{j}} 11β1l=1j(β1β2)jlal2ϵ+bl.\displaystyle\leq\frac{1}{1-\beta_{1}}\sum_{l=1}^{j}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{j-l}\frac{a_{l}^{2}}{\epsilon+b_{l}}. (A.30)

Thus, when summing over all j[n]j\in[n], we get

j=1ncj2ϵ+bj\displaystyle\sum_{j=1}^{n}\frac{c_{j}^{2}}{\epsilon+b_{j}} 11β1j=1nl=1j(β1β2)jlal2ϵ+bl\displaystyle\leq\frac{1}{1-\beta_{1}}\sum_{j=1}^{n}\sum_{l=1}^{j}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{j-l}\frac{a_{l}^{2}}{\epsilon+b_{l}}
=11β1l=1nal2ϵ+blj=ln(β1β2)jl\displaystyle=\frac{1}{1-\beta_{1}}\sum_{l=1}^{n}\frac{a_{l}^{2}}{\epsilon+b_{l}}\sum_{j=l}^{n}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{j-l}
1(1β1)(1β1/β2)l=1nal2ϵ+bl.\displaystyle\leq\frac{1}{(1-\beta_{1})(1-\beta_{1}/\beta_{2})}\sum_{l=1}^{n}\frac{a_{l}^{2}}{\epsilon+b_{l}}. (A.31)

Applying Lemma 5.2, we obtain (A.29).
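As a quick sanity check (not part of the proof), the inequality (A.29) can be verified numerically on random sequences; the snippet below uses arbitrary parameter values.

```python
import numpy as np

# Numerical sanity check of Lemma A.2 (illustration only).
rng = np.random.default_rng(0)
beta1, beta2, eps, n = 0.9, 0.999, 1e-8, 2000
a = rng.normal(size=n)
b = c = lhs = 0.0
for j in range(n):
    b = beta2 * b + a[j] ** 2   # b_j, decayed sum of squares
    c = beta1 * c + a[j]        # c_j, decayed sum
    lhs += c ** 2 / (eps + b)
rhs = (np.log(1 + b / eps) - n * np.log(beta2)) / ((1 - beta1) * (1 - beta1 / beta2))
assert lhs <= rhs
```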

We also need two technical lemmas on the sum of series.

Lemma A.3 (sum of a geometric term times a square root).

Given 0<a<10<a<1 and QQ\in\mathbb{N}, we have,

q=0Q1aqq+111a(1+π2ln(a))2(1a)3/2.\sum_{q=0}^{Q-1}a^{q}\sqrt{q+1}\leq\frac{1}{1-a}\left(1+\frac{\sqrt{\pi}}{2\sqrt{-\ln(a)}}\right)\leq\frac{2}{(1-a)^{3/2}}. (A.32)
Proof.

We first need to study the following integral:

0ax2xdx\displaystyle\int_{0}^{\infty}\frac{a^{x}}{2\sqrt{x}}\mathrm{d}{}x =0eln(a)x2xdx, then introducing y=x,\displaystyle=\int_{0}^{\infty}\frac{\mathrm{e}^{\ln(a)x}}{2\sqrt{x}}\mathrm{d}{}x\quad,\text{ then introducing $y=\sqrt{x}$,}
=0eln(a)y2dy, then introducing u=2ln(a)y,\displaystyle=\int_{0}^{\infty}\mathrm{e}^{\ln(a)y^{2}}\mathrm{d}{}y\quad,\text{ then introducing $u=\sqrt{-2\ln(a)}y$,}
=12ln(a)0eu2/2du\displaystyle=\frac{1}{\sqrt{-2\ln(a)}}\int_{0}^{\infty}\mathrm{e}^{-u^{2}/2}\mathrm{d}{}u
0ax2xdx\displaystyle\int_{0}^{\infty}\frac{a^{x}}{2\sqrt{x}}\mathrm{d}{}x =π2ln(a),\displaystyle=\frac{\sqrt{\pi}}{2\sqrt{-\ln(a)}}, (A.33)

where we used the classical integral of the standard Gaussian density function.

Let us now introduce AQA_{Q}:

AQ=q=0Q1aqq+1,\displaystyle A_{Q}=\sum_{q=0}^{Q-1}a^{q}\sqrt{q+1},

then we have

AQaAQ\displaystyle A_{Q}-aA_{Q} =q=0Q1aqq+1q=1Qaqq, then using the concavity of ,\displaystyle=\sum_{q=0}^{Q-1}a^{q}\sqrt{q+1}-\sum_{q=1}^{Q}a^{q}\sqrt{q}\quad,\text{ then using the concavity of $\sqrt{\cdot}$,}
1aQQ+q=1Q1aq2q\displaystyle\leq 1-a^{Q}\sqrt{Q}+\sum_{q=1}^{Q-1}\frac{a^{q}}{2\sqrt{q}}
1+0ax2xdx\displaystyle\leq 1+\int_{0}^{\infty}\frac{a^{x}}{2\sqrt{x}}\mathrm{d}{}x
(1a)AQ\displaystyle(1-a)A_{Q} 1+π2ln(a),\displaystyle\leq 1+\frac{\sqrt{\pi}}{2\sqrt{-\ln(a)}},

where we used (A.33). Given that ln(a)1a\sqrt{-\ln(a)}\geq\sqrt{1-a} we obtain (A.32). ∎

Lemma A.4 (sum of a geometric term times roughly a power 3/2).

Given 0<a<10<a<1 and QQ\in\mathbb{N}, we have,

q=0Q1aqq(q+1)4a(1a)5/2.\sum_{q=0}^{Q-1}a^{q}\sqrt{q}(q+1)\leq\frac{4a}{(1-a)^{5/2}}. (A.34)
Proof.

Let us introduce AQA_{Q}:

AQ=q=0Q1aqq(q+1),\displaystyle A_{Q}=\sum_{q=0}^{Q-1}a^{q}\sqrt{q}(q+1),

then we have

AQaAQ\displaystyle A_{Q}-aA_{Q} =q=0Q1aqq(q+1)q=1Qaqq1q\displaystyle=\sum_{q=0}^{Q-1}a^{q}\sqrt{q}(q+1)-\sum_{q=1}^{Q}a^{q}\sqrt{q-1}q
q=1Q1aqq((q+1)qq1)\displaystyle\leq\sum_{q=1}^{Q-1}a^{q}\sqrt{q}\left((q+1)-\sqrt{q}\sqrt{q-1}\right)
q=1Q1aqq((q+1)(q1))\displaystyle\leq\sum_{q=1}^{Q-1}a^{q}\sqrt{q}\left((q+1)-(q-1)\right)
2q=1Q1aqq\displaystyle\leq 2\sum_{q=1}^{Q-1}a^{q}\sqrt{q}
=2aq=0Q2aqq+1,then using Lemma A.3,\displaystyle=2a\sum_{q=0}^{Q-2}a^{q}\sqrt{q+1}\quad,\text{then using Lemma~\ref{app:lemma:sum_geom_sqrt}},
(1a)AQ\displaystyle(1-a)A_{Q} 4a(1a)3/2.\displaystyle\leq\frac{4a}{(1-a)^{3/2}}.
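As a quick numerical sanity check of the bounds (A.32) and (A.34) (illustration only, for a few arbitrary values of aa):

```python
import numpy as np

# Numerical check of Lemmas A.3 and A.4 (illustration only).
for a in (0.5, 0.9, 0.999):
    q = np.arange(10000)
    assert np.sum(a ** q * np.sqrt(q + 1)) <= 2 / (1 - a) ** 1.5            # (A.32)
    assert np.sum(a ** q * np.sqrt(q) * (q + 1)) <= 4 * a / (1 - a) ** 2.5  # (A.34)
```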

A.6 Proof of Adam and Adagrad with momentum

Common part of the proof

Let us take an iteration nn\in\mathbb{N}^{*}. Using the smoothness of FF defined in (8), we have

F(xn)F(xn1)αnGnTun+αn2L2un22.\displaystyle F(x_{n})\leq F(x_{n-1})-\alpha_{n}G_{n}^{T}u_{n}+\frac{\alpha_{n}^{2}L}{2}\left\|u_{n}\right\|^{2}_{2}.

Taking the full expectation and using Lemma A.1,

𝔼[F(xn)]𝔼[F(xn1)]αn2(i[d]k=0n1β1k𝔼[Gnk,i2ϵ+v~n,k+1,i])+αn2L2𝔼[un22]\displaystyle\mathbb{E}\left[F(x_{n})\right]\leq\mathbb{E}\left[F(x_{n-1})\right]-\frac{\alpha_{n}}{2}\left(\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]\right)+\frac{\alpha_{n}^{2}L}{2}\mathbb{E}\left[\left\|u_{n}\right\|^{2}_{2}\right]
+αn3L24R1β1(l=1n1unl22k=ln1β1kk)+3αnR1β1(k=0n1(β1β2)kk+1Unk22).\displaystyle\quad+\frac{\alpha_{n}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\left(\sum_{l=1}^{n-1}\left\|u_{n-l}\right\|_{2}^{2}\sum_{k=l}^{n-1}\beta_{1}^{k}\sqrt{k}\right)+\frac{3\alpha_{n}R}{\sqrt{1-\beta_{1}}}\left(\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\left\|U_{n-k}\right\|_{2}^{2}\right). (A.35)

Notice that because of the bound on the \ell_{\infty} norm of the stochastic gradients at the iterates (7), we have for any kk\in\mathbb{N}, k<nk<n, and any coordinate i[d]i\in[d], ϵ+v~n,k+1,iRj=0n1β2j\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}\leq R\sqrt{\sum_{j=0}^{n-1}\beta_{2}^{j}}. Introducing Ωn=j=0n1β2j\Omega_{n}=\sqrt{\sum_{j=0}^{n-1}\beta_{2}^{j}}, we have

𝔼[F(xn)]𝔼[F(xn1)]αn2RΩnk=0n1β1k𝔼[Gnk22]+αn2L2𝔼[un22]\displaystyle\mathbb{E}\left[F(x_{n})\right]\leq\mathbb{E}\left[F(x_{n-1})\right]-\frac{\alpha_{n}}{2R\Omega_{n}}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]+\frac{\alpha_{n}^{2}L}{2}\mathbb{E}\left[\left\|u_{n}\right\|^{2}_{2}\right]
+αn3L24R1β1(l=1n1unl22k=ln1β1kk)+3αnR1β1(k=0n1(β1β2)kk+1Unk22).\displaystyle\quad+\frac{\alpha_{n}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\left(\sum_{l=1}^{n-1}\left\|u_{n-l}\right\|_{2}^{2}\sum_{k=l}^{n-1}\beta_{1}^{k}\sqrt{k}\right)+\frac{3\alpha_{n}R}{\sqrt{1-\beta_{1}}}\left(\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\left\|U_{n-k}\right\|_{2}^{2}\right). (A.36)

Now summing over all iterations n[N]n\in[N] for NN\in\mathbb{N}^{*}, and using that for both Adam (A.2) and Adagrad (A.3), αn\alpha_{n} is non-decreasing, as well as the fact that FF is bounded below by FF_{*} from (6), we get

12Rn=1NαnΩnk=0n1β1k𝔼[Gnk22]AF(x0)F+αN2L2n=1N𝔼[un22]B\displaystyle\underbrace{\frac{1}{2R}\sum_{n=1}^{N}\frac{\alpha_{n}}{\Omega_{n}}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]}_{A}\leq F(x_{0})-F_{*}+\underbrace{\frac{\alpha_{N}^{2}L}{2}\sum_{n=1}^{N}\mathbb{E}\left[\left\|u_{n}\right\|^{2}_{2}\right]}_{B}
+αN3L24R1β1n=1Nl=1n1𝔼[unl22]k=ln1β1kkC+3αNR1β1n=1Nk=0n1(β1β2)kk+1𝔼[Unk22]D.\displaystyle\quad+\underbrace{\frac{\alpha_{N}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\sum_{n=1}^{N}\sum_{l=1}^{n-1}\mathbb{E}\left[\left\|u_{n-l}\right\|^{2}_{2}\right]\sum_{k=l}^{n-1}\beta_{1}^{k}\sqrt{k}}_{C}+\underbrace{\frac{3\alpha_{N}R}{\sqrt{1-\beta_{1}}}\sum_{n=1}^{N}\sum_{k=0}^{n-1}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{k}\sqrt{k+1}\mathbb{E}\left[\left\|U_{n-k}\right\|^{2}_{2}\right]}_{D}. (A.37)

First looking at BB, we have using Lemma A.2,

BαN2L2(1β1)(1β1/β2)i[d](ln(1+vN,iϵ)Nlog(β2)).\displaystyle B\leq\frac{\alpha_{N}^{2}L}{2(1-\beta_{1})(1-\beta_{1}/\beta_{2})}\sum_{i\in[d]}\left(\ln\left(1+\frac{v_{N,i}}{\epsilon}\right)-N\log(\beta_{2})\right). (A.38)

Then looking at CC and introducing the change of index j=nlj=n-l,

C\displaystyle C =αN3L24R1β1n=1Nj=1n𝔼[uj22]k=njn1β1kk\displaystyle=\frac{\alpha_{N}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\sum_{n=1}^{N}\sum_{j=1}^{n}\mathbb{E}\left[\left\|u_{j}\right\|^{2}_{2}\right]\sum_{k=n-j}^{n-1}\beta_{1}^{k}\sqrt{k}
=αN3L24R1β1j=1N𝔼[uj22]n=jNk=njn1β1kk\displaystyle=\frac{\alpha_{N}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|u_{j}\right\|^{2}_{2}\right]\sum_{n=j}^{N}\sum_{k=n-j}^{n-1}\beta_{1}^{k}\sqrt{k}
=αN3L24R1β1j=1N𝔼[uj22]k=0N1β1kkn=jj+k1\displaystyle=\frac{\alpha_{N}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|u_{j}\right\|^{2}_{2}\right]\sum_{k=0}^{N-1}\beta_{1}^{k}\sqrt{k}\sum_{n=j}^{j+k}1
=αN3L24R1β1j=1N𝔼[uj22]k=0N1β1kk(k+1)\displaystyle=\frac{\alpha_{N}^{3}L^{2}}{4R}\sqrt{1-\beta_{1}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|u_{j}\right\|^{2}_{2}\right]\sum_{k=0}^{N-1}\beta_{1}^{k}\sqrt{k}(k+1)
αN3L2Rj=1N𝔼[uj22]β1(1β1)2,\displaystyle\leq\frac{\alpha_{N}^{3}L^{2}}{R}\sum_{j=1}^{N}\mathbb{E}\left[\left\|u_{j}\right\|^{2}_{2}\right]\frac{\beta_{1}}{(1-\beta_{1})^{2}}, (A.39)

using Lemma A.4. Finally, using Lemma A.2, we get

CαN3L2β1R(1β1)3(1β1/β2)i[d](ln(1+vN,iϵ)Nlog(β2)).C\leq\frac{\alpha_{N}^{3}L^{2}\beta_{1}}{R(1-\beta_{1})^{3}(1-\beta_{1}/\beta_{2})}\sum_{i\in[d]}\left(\ln\left(1+\frac{v_{N,i}}{\epsilon}\right)-N\log(\beta_{2})\right). (A.40)

Finally, introducing the same change of index j=nkj=n-k for DD, we get

D\displaystyle D =3αNR1β1n=1Nj=1n(β1β2)nj1+nj𝔼[Uj22]\displaystyle=\frac{3\alpha_{N}R}{\sqrt{1-\beta_{1}}}\sum_{n=1}^{N}\sum_{j=1}^{n}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{n-j}\sqrt{1+n-j}\mathbb{E}\left[\left\|U_{j}\right\|_{2}^{2}\right]
=3αNR1β1j=1N𝔼[Uj22]n=jN(β1β2)nj1+nj\displaystyle=\frac{3\alpha_{N}R}{\sqrt{1-\beta_{1}}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|U_{j}\right\|_{2}^{2}\right]\sum_{n=j}^{N}\left(\frac{\beta_{1}}{\beta_{2}}\right)^{n-j}\sqrt{1+n-j}
6αNR1β1j=1N𝔼[Uj22]1(1β1/β2)3/2,\displaystyle\leq\frac{6\alpha_{N}R}{\sqrt{1-\beta_{1}}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|U_{j}\right\|_{2}^{2}\right]\frac{1}{(1-\beta_{1}/\beta_{2})^{3/2}}, (A.41)

using Lemma A.3. Finally, using Lemma 5.2 or equivalently Lemma A.2 with β1=0\beta_{1}=0, we get

D6αNR1β1(1β1/β2)3/2i[d](ln(1+vN,iϵ)Nln(β2)).D\leq\frac{6\alpha_{N}R}{\sqrt{1-\beta_{1}}(1-\beta_{1}/\beta_{2})^{3/2}}\sum_{i\in[d]}\left(\ln\left(1+\frac{v_{N,i}}{\epsilon}\right)-N\ln(\beta_{2})\right). (A.42)

This is as far as we can get without having to use the specific form of αN\alpha_{N} given by either (A.2) for Adam or (A.3) for Adagrad. We now split the proof between the two algorithms.

Adam

For Adam, using (A.2), we have αn=(1β1)Ωnα\alpha_{n}=(1-\beta_{1})\Omega_{n}\alpha. Thus, we can simplify the AA term from (A.37), also using the usual change of index j=nkj=n-k, to get

A\displaystyle A =12Rn=1NαnΩnj=1nβ1nj𝔼[Gj22]\displaystyle=\frac{1}{2R}\sum_{n=1}^{N}\frac{\alpha_{n}}{\Omega_{n}}\sum_{j=1}^{n}\beta_{1}^{n-j}\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]
=α(1β1)2Rj=1N𝔼[Gj22]n=jNβ1nj\displaystyle=\frac{\alpha(1-\beta_{1})}{2R}\sum_{j=1}^{N}\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]\sum_{n=j}^{N}\beta_{1}^{n-j}
=α2Rj=1N(1β1Nj+1)𝔼[Gj22]\displaystyle=\frac{\alpha}{2R}\sum_{j=1}^{N}(1-\beta_{1}^{N-j+1})\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]
=α2Rj=1N(1β1Nj+1)𝔼[F(xj1)22]\displaystyle=\frac{\alpha}{2R}\sum_{j=1}^{N}(1-\beta_{1}^{N-j+1})\mathbb{E}\left[\left\|\nabla F(x_{j-1})\right\|_{2}^{2}\right]
=α2Rj=0N1(1β1Nj)𝔼[F(xj)22].\displaystyle=\frac{\alpha}{2R}\sum_{j=0}^{N-1}(1-\beta_{1}^{N-j})\mathbb{E}\left[\left\|\nabla F(x_{j})\right\|_{2}^{2}\right]. (A.43)

If we now introduce τ\tau as in (A.8), we can first notice that

j=0N1(1β1Nj)=Nβ11β1N1β1Nβ11β1.\displaystyle\sum_{j=0}^{N-1}(1-\beta_{1}^{N-j})=N-\beta_{1}\frac{1-\beta_{1}^{N}}{1-\beta_{1}}\geq N-\frac{\beta_{1}}{1-\beta_{1}}. (A.44)
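For completeness, (A.44) follows from the change of index k=Njk=N-j and the geometric series:

```latex
\sum_{j=0}^{N-1}\left(1-\beta_{1}^{N-j}\right)
  = N-\sum_{k=1}^{N}\beta_{1}^{k}
  = N-\beta_{1}\,\frac{1-\beta_{1}^{N}}{1-\beta_{1}}
  \;\geq\; N-\frac{\beta_{1}}{1-\beta_{1}}.
```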

Introducing

N~=Nβ11β1,\tilde{N}=N-\frac{\beta_{1}}{1-\beta_{1}}, (A.45)

we then have

AαN~2R𝔼[F(xτ)22].\displaystyle A\geq\frac{\alpha\tilde{N}}{2R}\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]. (A.46)

Further notice that for any coordinate i[d]i\in[d], we have vN,iR21β2v_{N,i}\leq\frac{R^{2}}{1-\beta_{2}}, and moreover αNα1β11β2\alpha_{N}\leq\alpha\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}, so that, putting together (A.37), (A.46), (A.38), (A.40) and (A.42), we get

𝔼[F(xτ)22]\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right] 2RF0FαN~+EN~(ln(1+R2ϵ(1β2))Nlog(β2)),\displaystyle\leq 2R\frac{F_{0}-F_{*}}{\alpha\tilde{N}}+\frac{E}{\tilde{N}}\left(\ln\left(1+\frac{R^{2}}{\epsilon(1-\beta_{2})}\right)-N\log(\beta_{2})\right), (A.47)

with

E=αdRL(1β1)(1β1/β2)(1β2)+2α2dL2β1(1β1/β2)(1β2)3/2+12dR21β1(1β1/β2)3/21β2.E=\frac{\alpha dRL(1-\beta_{1})}{(1-\beta_{1}/\beta_{2})(1-\beta_{2})}+\frac{2\alpha^{2}dL^{2}\beta_{1}}{(1-\beta_{1}/\beta_{2})(1-\beta_{2})^{3/2}}+\frac{12dR^{2}\sqrt{1-\beta_{1}}}{(1-\beta_{1}/\beta_{2})^{3/2}\sqrt{1-\beta_{2}}}. (A.48)

This concludes the proof of Theorem 4.

Adagrad

For Adagrad, we have αn=(1β1)α\alpha_{n}=(1-\beta_{1})\alpha, β2=1\beta_{2}=1 and ΩnN\Omega_{n}\leq\sqrt{N} so that,

A\displaystyle A =12Rn=1NαnΩnj=1nβ1nj𝔼[Gj22]\displaystyle=\frac{1}{2R}\sum_{n=1}^{N}\frac{\alpha_{n}}{\Omega_{n}}\sum_{j=1}^{n}\beta_{1}^{n-j}\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]
α(1β1)2RNj=1N𝔼[Gj22]n=jNβ1nj\displaystyle\geq\frac{\alpha(1-\beta_{1})}{2R\sqrt{N}}\sum_{j=1}^{N}\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]\sum_{n=j}^{N}\beta_{1}^{n-j}
=α2RNj=1N(1β1Nj+1)𝔼[Gj22]\displaystyle=\frac{\alpha}{2R\sqrt{N}}\sum_{j=1}^{N}(1-\beta_{1}^{N-j+1})\mathbb{E}\left[\left\|G_{j}\right\|_{2}^{2}\right]
=α2RNj=1N(1β1Nj+1)𝔼[F(xj1)22]\displaystyle=\frac{\alpha}{2R\sqrt{N}}\sum_{j=1}^{N}(1-\beta_{1}^{N-j+1})\mathbb{E}\left[\left\|\nabla F(x_{j-1})\right\|_{2}^{2}\right]
=α2RNj=0N1(1β1Nj)𝔼[F(xj)22].\displaystyle=\frac{\alpha}{2R\sqrt{N}}\sum_{j=0}^{N-1}(1-\beta_{1}^{N-j})\mathbb{E}\left[\left\|\nabla F(x_{j})\right\|_{2}^{2}\right]. (A.49)

Reusing (A.44) and (A.45) from the Adam proof, and introducing τ\tau as in (A.8), we immediately have

AαN~2RN𝔼[F(xτ)22].\displaystyle A\geq\frac{\alpha\tilde{N}}{2R\sqrt{N}}\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]. (A.50)

Further notice that for any coordinate i[d]i\in[d], we have vN,iNR2v_{N,i}\leq NR^{2}, and moreover αN=(1β1)α\alpha_{N}=(1-\beta_{1})\alpha, so that, putting together (A.37), (A.50), (A.38), (A.40) and (A.42) with β2=1\beta_{2}=1, we get

𝔼[F(xτ)22]\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right] 2RNF0FαN~+NN~Eln(1+NR2ϵ),\displaystyle\leq 2R\sqrt{N}\frac{F_{0}-F_{*}}{\alpha\tilde{N}}+\frac{\sqrt{N}}{\tilde{N}}E\ln\left(1+\frac{NR^{2}}{\epsilon}\right), (A.51)

with

E=αdRL+2α2dL2β11β1+12dR21β1.E=\alpha dRL+\frac{2\alpha^{2}dL^{2}\beta_{1}}{1-\beta_{1}}+\frac{12dR^{2}}{1-\beta_{1}}. (A.52)

This concludes the proof of Theorem 3.

A.7 Proof variant using Hölder inequality

Following (Ward et al., 2019; Zou et al., 2019b), it is possible to get rid of the almost sure bound on the gradient given by (7), and replace it with a bound in expectation, i.e.

xd,𝔼[f(x)22]R~2ϵ.\forall x\in\mathbb{R}^{d},\,\mathbb{E}\left[\left\|\nabla f(x)\right\|_{2}^{2}\right]\leq\tilde{R}^{2}-\epsilon. (A.53)

Note that we now need an 2\ell_{2} bound in order to properly apply the Hölder inequality hereafter.

We do not provide the full proof for the result, but point the reader to the few places where we have used (7). We first use it in Lemma A.1: we inject R into (A.15), where it can simply be replaced with R~\tilde{R}. Then we use (7) to bound rr and derive (A.26). Remember that rr is defined in (A.20), and is actually a weighted sum of the squared gradients in expectation. Thus, a bound in expectation is acceptable, and Lemma A.1 remains valid when replacing the assumption (7) with (A.53).

Looking at the actual proof, we use (7) in a single place: just after (A.35), in order to derive an upper bound on the denominator in the following term:

M=αn2(i[d]k=0n1β1k𝔼[Gnk,i2ϵ+v~n,k+1,i]).M=\frac{\alpha_{n}}{2}\left(\sum_{i\in[d]}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\frac{G_{n-k,i}^{2}}{\sqrt{\epsilon+\tilde{v}_{n,k+1,i}}}\right]\right). (A.54)

Let us introduce V~n,k+1=i[d]v~n,k+1,i\tilde{V}_{n,k+1}=\sum_{i\in[d]}\tilde{v}_{n,k+1,i}. We immediately have that

Mαn2(k=0n1β1k𝔼[Gnk22ϵ+V~n,k+1])M\geq\frac{\alpha_{n}}{2}\left(\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\frac{\left\|G_{n-k}\right\|_{2}^{2}}{\sqrt{\epsilon+\tilde{V}_{n,k+1}}}\right]\right) (A.55)

Taking X=(Gnk22ϵ+V~n,k+1)23X=\left(\frac{\left\|G_{n-k}\right\|_{2}^{2}}{\sqrt{\epsilon+\tilde{V}_{n,k+1}}}\right)^{\frac{2}{3}}, Y=(ϵ+V~n,k+1)23Y=\left(\sqrt{\epsilon+\tilde{V}_{n,k+1}}\right)^{\frac{2}{3}}, we can apply the Hölder inequality as

𝔼[|X|32](𝔼[|XY|]𝔼[|Y|3]13)32,\mathbb{E}\left[\left|X\right|^{\frac{3}{2}}\right]\geq\left(\frac{\mathbb{E}\left[\left|XY\right|\right]}{\mathbb{E}\left[\left|Y\right|^{3}\right]^{\frac{1}{3}}}\right)^{\frac{3}{2}}, (A.56)

which gives us

𝔼[Gnk22ϵ+V~n,k+1]𝔼[Gnk243]32𝔼[ϵ+V~n,k+1]𝔼[Gnk243]32ΩnR~,\mathbb{E}\left[\frac{\left\|G_{n-k}\right\|_{2}^{2}}{\sqrt{\epsilon+\tilde{V}_{n,k+1}}}\right]\geq\frac{\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{\frac{4}{3}}\right]^{\frac{3}{2}}}{\sqrt{\mathbb{E}\left[\epsilon+\tilde{V}_{n,k+1}\right]}}\geq\frac{\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{\frac{4}{3}}\right]^{\frac{3}{2}}}{\Omega_{n}\tilde{R}}, (A.57)

with Ωn=j=0n1β2j\Omega_{n}=\sqrt{\sum_{j=0}^{n-1}\beta_{2}^{j}}, and using the fact that 𝔼[ϵ+i[d]v~n,k+1,i]R~2Ωn2\mathbb{E}\left[\epsilon+\sum_{i\in[d]}\tilde{v}_{n,k+1,i}\right]\leq\tilde{R}^{2}\Omega_{n}^{2}, which follows from (A.53).

Thus we can recover almost exactly (A.36), except that we have to replace all terms of the form 𝔼[Gnk22]\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right] with 𝔼[Gnk243]32\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{\frac{4}{3}}\right]^{\frac{3}{2}}. The rest of the proof follows as before, with all the dependencies on α\alpha, β1\beta_{1}, β2\beta_{2} remaining the same.
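As a quick numerical illustration of the Hölder step (A.56), on arbitrary non-negative data (not quantities from the proof):

```python
import numpy as np

# Empirical check of the Hölder inequality in the form (A.56) (illustration only).
rng = np.random.default_rng(0)
X, Y = rng.random(10000), rng.random(10000)
lhs = np.mean(X ** 1.5)
rhs = (np.mean(X * Y) / np.mean(Y ** 3) ** (1 / 3)) ** 1.5
assert lhs >= rhs
```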

Appendix B Non convex SGD with heavy-ball momentum

We extend the existing proof of convergence for SGD in the non-convex setting to use heavy-ball momentum (Ghadimi & Lan, 2013). Compared with previous work on momentum for non-convex SGD by Yang et al. (2016), we improve the dependency in β1\beta_{1} from O((1β1)2)O((1-\beta_{1})^{-2}) to O((1β1)1)O((1-\beta_{1})^{-1}). A recent work by Liu et al. (2020) achieves a similar dependency of O(1/(1β1))O(1/(1-\beta_{1})), with weaker assumptions (without the bounded gradients assumption).

B.1 Assumptions

We reuse the notations from Section 2.1. Note however that we use different assumptions here than in Section 2.3. We first assume FF is bounded below by FF_{*}, that is,

xd,F(x)F.\forall x\in\mathbb{R}^{d},\ F(x)\geq F_{*}. (B.1)

We then assume that the stochastic gradients have bounded variance, and that the gradients of FF are uniformly bounded, i.e. there exist RR and σ\sigma so that

xd,F(x)22R2and𝔼[f(x)22]F(x)22σ2,\forall x\in\mathbb{R}^{d},\left\|\nabla F(x)\right\|_{2}^{2}\leq R^{2}\quad\text{and}\quad\mathbb{E}\left[\left\|\nabla f(x)\right\|_{2}^{2}\right]-\left\|\nabla F(x)\right\|_{2}^{2}\leq\sigma^{2}, (B.2)

and finally, the smoothness of the objective function, i.e., its gradient is LL-Lipschitz-continuous with respect to the 2\ell_{2}-norm:

x,yd,F(x)F(y)2Lxy2.\forall x,y\in\mathbb{R}^{d},\left\|\nabla F(x)-\nabla F(y)\right\|_{2}\leq L\left\|x-y\right\|_{2}. (B.3)

B.2 Result

Let us take a step size α>0\alpha>0 and a heavy-ball parameter 1>β101>\beta_{1}\geq 0. Given x0dx_{0}\in\mathbb{R}^{d}, taking m0=0m_{0}=0, we define for any iteration nn\in\mathbb{N}^{*} the iterates of SGD with momentum as,

{mn=β1mn1+fn(xn1)xn=xn1αmn.\begin{cases}m_{n}&=\beta_{1}m_{n-1}+\nabla f_{n}(x_{n-1})\\ x_{n}&=x_{n-1}-\alpha m_{n}.\end{cases} (B.4)

Note that in (B.4), the typical size of mnm_{n} increases with β1\beta_{1}. For any total number of iterations NN\in\mathbb{N}^{*}, we define a random index τN\tau_{N} taking values in {0,,N1}\{0,\ldots,N-1\} and verifying

j,j<N,[τ=j]1β1Nj.\displaystyle\forall j\in\mathbb{N},j<N,\mathbb{P}\left[\tau=j\right]\propto 1-\beta_{1}^{N-j}. (B.5)
Theorem B.1 (Convergence of SGD with momentum).

Given the assumptions from Section B.1, given τ\tau as defined in (B.5) for a total number of iterations N>11β1N>\frac{1}{1-\beta_{1}}, x0dx_{0}\in\mathbb{R}^{d}, α>0\alpha>0, 1>β101>\beta_{1}\geq 0, and (xn)n(x_{n})_{n\in\mathbb{N}^{*}} given by (B.4),

𝔼[F(xτ)22]1β1αN~(F(x0)F)+NN~αL(1+β1)(R2+σ2)2(1β1)2,\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]\leq\frac{1-\beta_{1}}{\alpha\tilde{N}}(F(x_{0})-F_{*})+\frac{N}{\tilde{N}}\frac{\alpha L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{2}}, (B.6)

with N~=Nβ11β1\tilde{N}=N-\frac{\beta_{1}}{1-\beta_{1}}.

B.3 Analysis

We can first simplify (B.6) if we assume N11β1N\gg\frac{1}{1-\beta_{1}}, which is always the case for practical values of NN and β1\beta_{1}, so that N~N\tilde{N}\approx N, and obtain

𝔼[F(xτ)22]1β1αN(F(x0)F)+αL(1+β1)(R2+σ2)2(1β1)2.\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]\leq\frac{1-\beta_{1}}{\alpha N}(F(x_{0})-F_{*})+\frac{\alpha L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{2}}. (B.7)

It is possible to achieve a rate of convergence of the form O(1/N)O(1/\sqrt{N}), by taking for any C>0C>0,

α=(1β1)CN,\alpha=(1-\beta_{1})\frac{C}{\sqrt{N}}, (B.8)

which gives us

𝔼[F(xτ)22]1CN(F(x0)F)+CNL(1+β1)(R2+σ2)2(1β1).\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]\leq\frac{1}{C\sqrt{N}}(F(x_{0})-F_{*})+\frac{C}{\sqrt{N}}\frac{L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})}. (B.9)

In comparison, Theorem 3 by Yang et al. (2016) would give us, assuming now that α=(1β1)min{1L,CN}\alpha=(1-\beta_{1})\min\left\{\frac{1}{L},\frac{C}{\sqrt{N}}\right\},

mink{0,N1}𝔼[F(xk)22]2N(F(x0)F)max{2L,NC}\displaystyle\min_{k\in\{0,\ldots N-1\}}\mathbb{E}\left[\left\|\nabla F(x_{k})\right\|_{2}^{2}\right]\leq\frac{2}{N}(F(x_{0})-F_{*})\max\left\{2L,\frac{\sqrt{N}}{C}\right\}
+CNL(1β1)2(β12(R2+σ2)+(1β1)2σ2).\displaystyle\qquad+\frac{C}{\sqrt{N}}\frac{L}{(1-\beta_{1})^{2}}\left(\beta_{1}^{2}(R^{2}+\sigma^{2})+(1-\beta_{1})^{2}\sigma^{2}\right). (B.10)

We observe an overall dependency in β1\beta_{1} of the form O((1β1)2)O((1-\beta_{1})^{-2}) for Theorem 3 by Yang et al. (2016), which we improve to O((1β1)1)O((1-\beta_{1})^{-1}) with our proof.

Liu et al. (2020) achieves a similar dependency in (1β1)(1-\beta_{1}) as here, but with weaker assumptions. Indeed, in their Theorem 1, their result contains a term in O(1/α)O(1/\alpha) with α(1β1)M\alpha\leq(1-\beta_{1})M for some problem dependent constant MM that does not depend on β1\beta_{1}.

Notice that as the typical size of the update mnm_{n} will increase with β1\beta_{1}, by a factor 1/(1β1)1/(1-\beta_{1}), it is convenient to scale down α\alpha by the same factor, as we did with (B.8) (without loss of generality, as CC can take any value). Taking α\alpha of this form has the advantage of keeping the first term on the right hand side in (B.6) independent of β1\beta_{1}, allowing us to focus only on the second term.
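For concreteness, below is a minimal sketch of SGD with heavy-ball momentum (B.4), using the scaled step size (B.8); the function name, the `grad` callback and the default values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sgd_heavy_ball(grad, x0, n_steps, C=0.1, beta1=0.9):
    """Sketch of the iterates (B.4) with alpha = (1 - beta1) * C / sqrt(N) as in (B.8).

    `grad(x)` is assumed to return a stochastic gradient of F at x (hypothetical interface).
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    alpha = (1.0 - beta1) * C / np.sqrt(n_steps)
    iterates = [x.copy()]
    for _ in range(n_steps):
        m = beta1 * m + grad(x)   # heavy-ball momentum
        x = x - alpha * m
        iterates.append(x.copy())
    return iterates  # x_tau would then be sampled among these as in (B.5)
```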

B.4 Proof

For all nn\in\mathbb{N}^{*}, we denote Gn=F(xn1)G_{n}=\nabla F(x_{n-1}) and gn=fn(xn1)g_{n}=\nabla f_{n}(x_{n-1}). 𝔼n1[]\mathbb{E}_{n-1}\left[\cdot\right] is the conditional expectation with respect to f1,,fn1f_{1},\ldots,f_{n-1}. In particular, xn1x_{n-1} and mn1m_{n-1} are deterministic given f1,,fn1f_{1},\ldots,f_{n-1}.

Lemma B.1 (Bound on mnm_{n}).

Given α>0\alpha>0, 1>β101>\beta_{1}\geq 0, and (xn)(x_{n}) and (mn)(m_{n}) defined as by B.4, under the assumptions from Section B.1, we have for all nn\in\mathbb{N}^{*},

𝔼[mn22]R2+σ2(1β1)2.\mathbb{E}\left[\left\|m_{n}\right\|_{2}^{2}\right]\leq\frac{R^{2}+\sigma^{2}}{(1-\beta_{1})^{2}}. (B.11)
Proof.

Let us take an iteration nn\in\mathbb{N}^{*},

𝔼[mn22]\displaystyle\mathbb{E}\left[\left\|m_{n}\right\|_{2}^{2}\right] =𝔼[k=0n1β1kgnk22]using Jensen we get,\displaystyle=\mathbb{E}\left[\left\|\sum_{k=0}^{n-1}\beta_{1}^{k}g_{n-k}\right\|_{2}^{2}\right]\quad\text{using Jensen we get,}
(k=0n1β1k)k=0n1β1k𝔼[gnk22]\displaystyle\leq\left(\sum_{k=0}^{n-1}\beta_{1}^{k}\right)\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|g_{n-k}\right\|^{2}_{2}\right]
11β1k=0n1β1k(R2+σ2)\displaystyle\leq\frac{1}{1-\beta_{1}}\sum_{k=0}^{n-1}\beta_{1}^{k}(R^{2}+\sigma^{2})
=R2+σ2(1β1)2.\displaystyle=\frac{R^{2}+\sigma^{2}}{(1-\beta_{1})^{2}}.

Lemma B.2 (sum of a geometric term times index).

Given 0<a<10<a<1, ii\in\mathbb{N} and QQ\in\mathbb{N} with QiQ\geq i,

q=iQaqq=ai1a(iaQi+1Q+aaQ+1i1a)a(1a)2.\sum_{q=i}^{Q}a^{q}q=\frac{a^{i}}{1-a}\left(i-a^{Q-i+1}Q+\frac{a-a^{Q+1-i}}{1-a}\right)\leq\frac{a}{(1-a)^{2}}. (B.12)
Proof.

Let Ai=q=iQaqqA_{i}=\sum_{q=i}^{Q}a^{q}q, we have

AiaAi\displaystyle A_{i}-aA_{i} =aiiaQ+1Q+q=i+1Qaq(q(q1))\displaystyle=a^{i}i-a^{Q+1}Q+\sum_{q=i+1}^{Q}a^{q}\left(q-(q-1)\right)
(1a)Ai\displaystyle(1-a)A_{i} =aiiaQ+1Q+ai+1aQ+11a.\displaystyle=a^{i}i-a^{Q+1}Q+\frac{a^{i+1}-a^{Q+1}}{1-a}.

Finally, taking i=0i=0 and QQ\rightarrow\infty gives us the upper bound. ∎
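As a quick sanity check of the closed form and bound in (B.12) (illustration only, with arbitrary values of aa, ii and QQ):

```python
import numpy as np

# Numerical check of Lemma B.2 (illustration only).
a, i, Q = 0.9, 3, 500
q = np.arange(i, Q + 1)
s = np.sum(a ** q * q)
closed = a ** i / (1 - a) * (i - a ** (Q - i + 1) * Q + (a - a ** (Q + 1 - i)) / (1 - a))
assert abs(s - closed) < 1e-8
assert s <= a / (1 - a) ** 2
```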

Lemma B.3 (Descent lemma).

Given α>0\alpha>0, 1>β101>\beta_{1}\geq 0, and (xn)(x_{n}) and (mn)(m_{n}) defined as by B.4, under the assumptions from Section B.1, we have for all nn\in\mathbb{N}^{*},

𝔼[F(xn1)Tmn]k=0n1β1k𝔼[F(xnk1)22]αLβ1(R2+σ2)(1β1)3\mathbb{E}\left[\nabla F(x_{n-1})^{T}m_{n}\right]\geq\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|\nabla F(x_{n-k-1})\right\|_{2}^{2}\right]-\frac{\alpha L\beta_{1}(R^{2}+\sigma^{2})}{(1-\beta_{1})^{3}} (B.13)
Proof.

For simplicity, we denote by Gn=F(xn1)G_{n}=\nabla F(x_{n-1}) the expected gradient and by gn=fn(xn1)g_{n}=\nabla f_{n}(x_{n-1}) the stochastic gradient at iteration nn.

GnTmn\displaystyle G_{n}^{T}m_{n} =k=0n1β1kGnTgnk\displaystyle=\sum_{k=0}^{n-1}\beta_{1}^{k}G_{n}^{T}g_{n-k}
=k=0n1β1kGnkTgnk+k=1n1β1k(GnGnk)Tgnk.\displaystyle=\sum_{k=0}^{n-1}\beta_{1}^{k}G_{n-k}^{T}g_{n-k}+\sum_{k=1}^{n-1}\beta_{1}^{k}(G_{n}-G_{n-k})^{T}g_{n-k}. (B.14)

This last step is the main difference with previous proofs with momentum (Yang et al., 2016): we replace the current gradient with an old gradient in order to obtain extra terms of the form Gnk22\left\|G_{n-k}\right\|_{2}^{2}. The price to pay is the second term on the right-hand side, but we will see that it is still beneficial to perform this step. Notice that, as FF is LL-smooth, we have for all kk\in\mathbb{N}^{*}

GnGnk22\displaystyle\left\|G_{n}-G_{n-k}\right\|_{2}^{2} L2l=1kαmnl2\displaystyle\leq L^{2}\left\|\sum_{l=1}^{k}\alpha m_{n-l}\right\|^{2}
α2L2kl=1kmnl22,\displaystyle\leq\alpha^{2}L^{2}k\sum_{l=1}^{k}\left\|m_{n-l}\right\|_{2}^{2}, (B.15)

using Jensen's inequality. We apply

λ>0,x,yd,|xTy|λ2x22+y222λ,\forall\lambda>0,\,x,y\in\mathbb{R}^{d},\left|x^{T}y\right|\leq\frac{\lambda}{2}\left\|x\right\|_{2}^{2}+\frac{\left\|y\right\|_{2}^{2}}{2\lambda}, (B.16)

with x=GnGnkx=G_{n}-G_{n-k}, y=gnky=g_{n-k} and λ=1β1kαL\displaystyle\lambda=\frac{1-\beta_{1}}{k\alpha L} to the second term in (B.14), and use (B.15) to get

GnTmn\displaystyle G_{n}^{T}m_{n} k=0n1β1kGnkTgnkk=1n1β1k2(((1β1)αLl=1kmnl22)+αLk1β1gnk22).\displaystyle\geq\sum_{k=0}^{n-1}\beta_{1}^{k}G_{n-k}^{T}g_{n-k}-\sum_{k=1}^{n-1}\frac{\beta_{1}^{k}}{2}\left(\left((1-\beta_{1})\alpha L\sum_{l=1}^{k}\left\|m_{n-l}\right\|_{2}^{2}\right)+\frac{\alpha Lk}{1-\beta_{1}}\left\|g_{n-k}\right\|_{2}^{2}\right).

Taking the full expectation we have

𝔼[GnTmn]\displaystyle\mathbb{E}\left[G_{n}^{T}m_{n}\right] k=0n1β1k𝔼[GnkTgnk]αLk=1n1β1k2(((1β1)l=1k𝔼[mnl22])+k1β1𝔼[gnk22]).\displaystyle\geq\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[G_{n-k}^{T}g_{n-k}\right]-\alpha L\sum_{k=1}^{n-1}\frac{\beta_{1}^{k}}{2}\left(\left((1-\beta_{1})\sum_{l=1}^{k}\mathbb{E}\left[\left\|m_{n-l}\right\|_{2}^{2}\right]\right)+\frac{k}{1-\beta_{1}}\mathbb{E}\left[\left\|g_{n-k}\right\|_{2}^{2}\right]\right). (B.17)

Now let us take k{0,,n1}k\in\{0,\ldots,n-1\}. First notice that

𝔼[GnkTgnk]\displaystyle\mathbb{E}\left[G_{n-k}^{T}g_{n-k}\right] =𝔼[𝔼nk1[F(xnk1)Tfnk(xnk1)]]\displaystyle=\mathbb{E}\left[\mathbb{E}_{n-k-1}\left[\nabla F(x_{n-k-1})^{T}\nabla f_{n-k}(x_{n-k-1})\right]\right]
=𝔼[F(xnk1)TF(xnk1)]\displaystyle=\mathbb{E}\left[\nabla F(x_{n-k-1})^{T}\nabla F(x_{n-k-1})\right]
=𝔼[Gnk22].\displaystyle=\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right].

Furthermore, we have 𝔼[gnk22]R2+σ2\mathbb{E}\left[\left\|g_{n-k}\right\|_{2}^{2}\right]\leq R^{2}+\sigma^{2} from (B.2), while 𝔼[mnk22]R2+σ2(1β1)2\mathbb{E}\left[\left\|m_{n-k}\right\|_{2}^{2}\right]\leq\frac{R^{2}+\sigma^{2}}{(1-\beta_{1})^{2}} using (B.11) from Lemma B.1. Injecting those three results in (B.17), we have

𝔼[GnTmn]\displaystyle\mathbb{E}\left[G_{n}^{T}m_{n}\right] k=0n1β1k𝔼[Gnk22]αL(R2+σ2)k=1n1β1k2((11β1l=1k1)+k1β1)\displaystyle\geq\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]-\alpha L(R^{2}+\sigma^{2})\sum_{k=1}^{n-1}\frac{\beta_{1}^{k}}{2}\left(\left(\frac{1}{1-\beta_{1}}\sum_{l=1}^{k}1\right)+\frac{k}{1-\beta_{1}}\right) (B.18)
=k=0n1β1k𝔼[Gnk22]αL1β1(R2+σ2)k=1n1β1kk.\displaystyle=\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]-\frac{\alpha L}{1-\beta_{1}}(R^{2}+\sigma^{2})\sum_{k=1}^{n-1}\beta_{1}^{k}k. (B.19)

Now, using (B.12) from Lemma B.2, we obtain

𝔼[GnTmn]\displaystyle\mathbb{E}\left[G_{n}^{T}m_{n}\right] k=0n1β1k𝔼[Gnk22]αLβ1(R2+σ2)(1β1)3,\displaystyle\geq\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]-\frac{\alpha L\beta_{1}(R^{2}+\sigma^{2})}{(1-\beta_{1})^{3}}, (B.20)

which concludes the proof.

Proof of Theorem B.1
Proof.

Let us take a specific iteration nn\in\mathbb{N}^{*}. Using the smoothness of FF given by (B.3), we have,

F(xn)F(xn1)αGnTmn+α2L2mn22.\displaystyle F(x_{n})\leq F(x_{n-1})-\alpha G_{n}^{T}m_{n}+\frac{\alpha^{2}L}{2}\left\|m_{n}\right\|^{2}_{2}. (B.21)

Taking the expectation, and using Lemma B.3 and Lemma B.1, we get

𝔼[F(xn)]\displaystyle\mathbb{E}\left[F(x_{n})\right] 𝔼[F(xn1)]α(k=0n1β1k𝔼[Gnk22])+α2Lβ1(R2+σ2)(1β1)3+α2L(R2+σ2)2(1β1)2\displaystyle\leq\mathbb{E}\left[F(x_{n-1})\right]-\alpha\left(\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]\right)+\frac{\alpha^{2}L\beta_{1}(R^{2}+\sigma^{2})}{(1-\beta_{1})^{3}}+\frac{\alpha^{2}L(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{2}}
𝔼[F(xn1)]α(k=0n1β1k𝔼[Gnk22])+α2L(1+β1)(R2+σ2)2(1β1)3\displaystyle\leq\mathbb{E}\left[F(x_{n-1})\right]-\alpha\left(\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]\right)+\frac{\alpha^{2}L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{3}} (B.22)

rearranging, and summing over n{1,,N}n\in\{1,\ldots,N\}, we get

αn=1Nk=0n1β1k𝔼[Gnk22]AF(x0)𝔼[F(xN)]+Nα2L(1+β1)(R2+σ2)2(1β1)3\displaystyle\underbrace{\alpha\sum_{n=1}^{N}\sum_{k=0}^{n-1}\beta_{1}^{k}\mathbb{E}\left[\left\|G_{n-k}\right\|_{2}^{2}\right]}_{A}\leq F(x_{0})-\mathbb{E}\left[F(x_{N})\right]+N\frac{\alpha^{2}L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{3}} (B.23)

Let us focus on the AA term on the left-hand side first. Introducing the change of index i=nki=n-k, we get

A\displaystyle A =αn=1Ni=1nβ1ni𝔼[Gi22]\displaystyle=\alpha\sum_{n=1}^{N}\sum_{i=1}^{n}\beta_{1}^{n-i}\mathbb{E}\left[\left\|G_{i}\right\|_{2}^{2}\right]
=αi=1N𝔼[Gi22]n=iNβ1ni\displaystyle=\alpha\sum_{i=1}^{N}\mathbb{E}\left[\left\|G_{i}\right\|_{2}^{2}\right]\sum_{n=i}^{N}\beta_{1}^{n-i}
=α1β1i=1N𝔼[F(xi1)22](1β1Ni+1)\displaystyle=\frac{\alpha}{1-\beta_{1}}\sum_{i=1}^{N}\mathbb{E}\left[\left\|\nabla F(x_{i-1})\right\|_{2}^{2}\right](1-\beta_{1}^{N-i+1})
=α1β1i=0N1𝔼[F(xi)22](1β1Ni).\displaystyle=\frac{\alpha}{1-\beta_{1}}\sum_{i=0}^{N-1}\mathbb{E}\left[\left\|\nabla F(x_{i})\right\|_{2}^{2}\right](1-\beta_{1}^{N-i}). (B.24)

We recognize the unnormalized probability distribution of the random iterate τ\tau as defined by (B.5). The normalization constant is

i=0N1(1β1Ni)=Nβ11β1N1β1Nβ11β1=N~,\displaystyle\sum_{i=0}^{N-1}\left(1-\beta_{1}^{N-i}\right)=N-\beta_{1}\frac{1-\beta_{1}^{N}}{1-\beta_{1}}\geq N-\frac{\beta_{1}}{1-\beta_{1}}=\tilde{N},

which we can inject into (B.24) to obtain

A\displaystyle A αN~1β1𝔼[F(xτ)22].\displaystyle\geq\frac{\alpha\tilde{N}}{1-\beta_{1}}\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right]. (B.25)

Injecting (B.25) into (B.23), and using the fact that FF is bounded below by FF_{*} (B.1), we have

𝔼[F(xτ)22]\displaystyle\mathbb{E}\left[\left\|\nabla F(x_{\tau})\right\|_{2}^{2}\right] 1β1αN~(F(x0)F)+NN~αL(1+β1)(R2+σ2)2(1β1)2\displaystyle\leq\frac{1-\beta_{1}}{\alpha\tilde{N}}(F(x_{0})-F_{*})+\frac{N}{\tilde{N}}\frac{\alpha L(1+\beta_{1})(R^{2}+\sigma^{2})}{2(1-\beta_{1})^{2}} (B.26)

which concludes the proof of Theorem B.1.