
High Probability Convergence for Accelerated Stochastic Mirror Descent

Alina Ene, Department of Computer Science, Boston University, aene@bu.edu
Huy L. Nguyen, Khoury College of Computer and Information Science, Northeastern University, hu.nguyen@northeastern.edu
Abstract

In this work, we describe a generic approach to show convergence with high probability for stochastic convex optimization. In previous works, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution as opposed to the domain diameter. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations.

1 Introduction

Stochastic convex optimization is a well-studied area with numerous applications in algorithms, machine learning, and beyond. Various algorithms have been shown to converge for many classes of functions including Lipschitz functions, smooth functions, and their linear combinations. However, one curious gap remains in the understanding of their convergence with high probability compared with convergence in expectation. Classical results show that in expectation, the function value gap of the final solution is proportional to the distance between the original solution and the optimal solution. On the other hand, classical results for convergence with high probability could only show that the function value gap of the final solution is proportional to the diameter of the domain, which could be much larger or even unbounded. In this work, we bridge this gap and establish a generic approach to show convergence with high probability where the final function value gap is proportional to the distance between the original solution and the optimal solution. We instantiate our approach in two settings, stochastic mirror descent and stochastic accelerated gradient descent. The results are analogous to known results for convergence in expectation but now with high probability. The algorithms are universal for both Lipschitz functions and smooth functions.

The proof technique is inspired by classical works on concentration inequalities, specifically a type of martingale inequality where the variance of the martingale difference is bounded by a linear function of the previous value. This technique was first applied to showing high probability convergence by Harvey et al. [2]. Our proof is inspired by the proof of Theorem 7.3 by Chung and Lu [1]. In each time step with iterate $x_t$, let $\xi_t := \widehat{\nabla} f(x_t) - \nabla f(x_t)$ be the error in our gradient estimate. Classical proofs of convergence revolve around analyzing the sum of $\langle \xi_t, x^* - x_t \rangle$, which can be viewed as a martingale sequence. Assuming a bounded domain, the concentration of the sum can be shown via classical martingale inequalities. The key new insight is that instead of analyzing this sum, we analyze a related sum where the coefficients decrease over time to account for the fact that we have a looser grip on the distance to the optimal solution as time increases. Nonetheless, the coefficients are kept within a constant factor of each other and the same asymptotic convergence is attained with high probability.

Related work

Lan [5] establishes high probability bounds for the general setting of stochastic mirror descent and accelerated stochastic mirror descent under the assumption that the stochastic noise is subgaussian. The rates shown in [5] match the best rates known in expectation, but they depend on the Bregman diameter $\max_{x,y\in\mathcal{X}} \mathbf{D}_{\psi}(x,y)$ of the domain, which can be unbounded. Our work complements the analysis of [5] with a novel concentration argument that allows us to establish convergence with respect to the distance $\mathbf{D}_{\psi}(x^*, x_1)$ from the initial point. Our analysis applies to the general setting considered in [5] and we use the same subgaussian assumption on the stochastic noise.

The algorithms and step sizes we consider capture the stochastic gradient descent algorithms with the standard setting of the step sizes for both smooth and non-smooth problems. The high-probability convergence of SGD is studied in the works [4, 6, 3, 2]. These works either assume that the function is strongly convex or the domain is compact. In contrast, our work applies to non-strongly convex optimization with a general domain.

2 Preliminaries

We consider the problem $\min_{x\in\mathcal{X}} f(x)$ where $f\colon\mathbb{R}^{d}\to\mathbb{R}$ is a convex function and $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex domain. We consider the general setting where $f$ is potentially not strongly convex and the domain $\mathcal{X}$ is not necessarily compact.

We assume we have access to a stochastic gradient oracle that returns a stochastic gradient $\widehat{\nabla} f(x)$ that satisfies the following two assumptions for any prior history:

  1. Unbiased estimator: $\mathbb{E}\left[\widehat{\nabla} f(x) \mid x\right] = \nabla f(x)$.

  2. Sub-Gaussian noise: $\left\|\widehat{\nabla} f(x) - \nabla f(x)\right\|$ is a $\sigma$-subgaussian random variable (Definition 2.1).

There are several equivalent definitions of subgaussian random variables up to an absolute constant scaling (see, e.g., Proposition 2.5.2 in [7]). For convenience, we use the following property as the definition.

Definition 2.1.

A random variable $X$ is $\sigma$-subgaussian if

$$\mathbb{E}\left[\exp\left(\lambda^{2}X^{2}\right)\right] \leq \exp\left(\lambda^{2}\sigma^{2}\right) \text{ for all } \lambda \text{ such that } \left|\lambda\right| \leq \frac{1}{\sigma}$$

The above definition is equivalent to the following property, see Proposition 2.5.2 in [7].

Lemma 2.2.

(Proposition 2.5.2 in [7]) Let $X$ be a $\sigma$-subgaussian random variable. Then

$$\mathbb{E}\left[\exp\left(\frac{X^{2}}{\sigma^{2}}\right)\right] \leq \exp\left(1\right)$$

We will also use the following helper lemma whose proof we defer to the Appendix.

Lemma 2.3.

For any $a \geq 0$, $0 \leq b \leq \frac{1}{2\sigma}$ and a nonnegative $\sigma$-subgaussian random variable $X$,

$$\mathbb{E}\left[1 + b^{2}X^{2} + \sum_{i=2}^{\infty}\frac{1}{i!}\left(aX + b^{2}X^{2}\right)^{i}\right] \leq \exp\left(3\left(a^{2}+b^{2}\right)\sigma^{2}\right)$$

3 Analysis of Stochastic Mirror Descent

Algorithm 1 Stochastic Mirror Descent Algorithm. $\psi\colon\mathbb{R}^{d}\to\mathbb{R}$ is a strongly convex mirror map. $\mathbf{D}_{\psi}(x,y) = \psi(x) - \psi(y) - \left\langle\nabla\psi(y), x-y\right\rangle$ is the Bregman divergence of $\psi$.

Parameters: initial point $x_{1}$, step sizes $\left\{\eta_{t}\right\}$

for $t=1$ to $T$:

   $x_{t+1} = \arg\min_{x\in\mathcal{X}}\left\{\eta_{t}\left\langle\widehat{\nabla} f(x_{t}), x\right\rangle + \mathbf{D}_{\psi}(x, x_{t})\right\}$

return $\frac{1}{T}\sum_{t=1}^{T} x_{t}$

In this section, we analyze the Stochastic Mirror Descent algorithm (Algorithm 1). For simplicity, here we consider the non-smooth setting, and assume that $f$ is $G$-Lipschitz continuous, i.e., we have $\left\|\nabla f(x)\right\| \leq G$ for all $x\in\mathcal{X}$. The analysis for the smooth setting follows via a simple modification to the analysis presented here as well as the analysis for the accelerated setting given in the next section.
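As a rough illustration of Algorithm 1, the sketch below specializes it to the Euclidean mirror map $\psi(x) = \frac{1}{2}\|x\|_2^2$. This choice is an assumption made here for concreteness; under it, the arg-min step reduces to a projected stochastic gradient step. The names `stochastic_mirror_descent`, `grad_oracle`, `project`, and `noisy_l1_oracle`, as well as the Gaussian noise model and the sample parameters, are hypothetical and not taken from the paper.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def stochastic_mirror_descent(
    grad_oracle: Callable[[Vector], Vector],
    x1: Sequence[float],
    step_sizes: Sequence[float],
    project: Callable[[Vector], Vector] = lambda x: x,
) -> Vector:
    """Euclidean instance of Algorithm 1 (assumes psi(x) = 0.5 * ||x||_2^2).

    With this mirror map the update is x_{t+1} = project(x_t - eta_t * g_t).
    Returns the average of the iterates x_1, ..., x_T.
    """
    x = list(x1)
    running_sum = [0.0] * len(x)
    for eta in step_sizes:
        # Accumulate x_t before it is overwritten by x_{t+1}.
        running_sum = [s + xi for s, xi in zip(running_sum, x)]
        g = grad_oracle(x)
        x = project([xi - eta * gi for xi, gi in zip(x, g)])
    T = len(step_sizes)
    return [s / T for s in running_sum]


def noisy_l1_oracle(x: Vector, rng: random.Random, noise_std: float = 0.1) -> Vector:
    """Subgradient of f(x) = ||x||_1 plus Gaussian noise (illustrative noise model)."""
    return [((v > 0) - (v < 0)) + rng.gauss(0.0, noise_std) for v in x]


# Example run with a fixed step size proportional to 1 / sqrt(T).
rng = random.Random(0)
T = 1000
steps = [0.5 / math.sqrt(T)] * T
x_bar = stochastic_mirror_descent(lambda x: noisy_l1_oracle(x, rng), [1.0, -2.0], steps)
```

The default `project` is the identity, which corresponds to the unconstrained case $\mathcal{X} = \mathbb{R}^d$; a constrained domain would supply its own Euclidean projection.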

We define

$$\xi_{t} := \widehat{\nabla} f(x_{t}) - \nabla f(x_{t})$$

We let $\mathcal{F}_{t} = \sigma\left(\xi_{1},\dots,\xi_{t-1}\right)$ denote the natural filtration. Note that $x_{t}$ is $\mathcal{F}_{t}$-measurable.

The starting point of our analysis is the following inequality that follows from the standard stochastic mirror descent analysis (see, e.g., [5]). We include the proof in the Appendix for completeness.

Lemma 3.1.

([5]) For every iteration $t$, we have

$$\eta_{t}\left(f(x_{t}) - f(x^{*})\right) - \eta_{t}^{2}G^{2} + \mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t}) \leq \eta_{t}\left\langle\xi_{t}, x^{*}-x_{t}\right\rangle + \eta_{t}^{2}\left\|\xi_{t}\right\|^{2}$$

We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function for a random variable that is closely related to the left-hand side of the inequality above. We let $w_{1} \geq w_{2} \geq \dots \geq w_{T} \geq w_{T+1} \geq 0$ be a non-increasing sequence where $w_{t}\in\mathbb{R}$ for all $t$. We define

$$\begin{aligned}
Z_{t} &= w_{t+1}\left(\eta_{t}\left(f(x_{t}) - f(x^{*})\right) - \eta_{t}^{2}G^{2}\right) + w_{T+1}\left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) && \forall\, 1 \leq t \leq T \\
S_{t} &= \sum_{i=t}^{T} Z_{i} && \forall\, 1 \leq t \leq T+1
\end{aligned}$$

Before proceeding with the analysis, we provide intuition for our approach. If we consider $S_{1}$, we see that it combines the gains in function value gaps with weights given by the non-increasing sequence $\left\{w_{t}\right\}$. The intuition here is that we want to leverage the progress in function value to absorb the error from the stochastic error terms on the RHS of Lemma 3.1. For the divergence terms, we use the same coefficient to allow for the terms to telescope. In Theorem 3.2, we upper bound the moment generating function of $S_{1}$ and derive a set of conditions for the weights $\left\{w_{t}\right\}$ that allow us to absorb the stochastic errors. In Corollary 3.3, we show how to choose the weights $\left\{w_{t}\right\}$ and obtain a convergence rate that matches the standard rates that hold in expectation.

We now give our main concentration argument that bounds the moment generating function of $S_{t}$. The proof of the following theorem is inspired by the proof of Theorem 7.3 in [1].

Theorem 3.2.

Suppose that $w_{t} \geq w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$ and $w_{t+1}\eta_{t}^{2} \leq \frac{1}{4\sigma^{2}}$ for every $1 \leq t \leq T$. For every $1 \leq t \leq T+1$, we have

$$\mathbb{E}\left[\exp\left(S_{t}\right) \mid \mathcal{F}_{t}\right] \leq \exp\left(\left(w_{t} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) + 3\sigma^{2}\sum_{i=t}^{T} w_{i+1}\eta_{i}^{2}\right)$$
Proof.

We proceed by induction on $t$. Consider the base case $t = T+1$. We have $S_{t} = 0$ and $\left(w_{t} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) = 0$, and the inequality follows. Next, we consider $1 \leq t \leq T$. We have

$$\mathbb{E}\left[\exp\left(S_{t}\right) \mid \mathcal{F}_{t}\right] = \mathbb{E}\left[\exp\left(Z_{t} + S_{t+1}\right) \mid \mathcal{F}_{t}\right] = \mathbb{E}\left[\mathbb{E}\left[\exp\left(Z_{t} + S_{t+1}\right) \mid \mathcal{F}_{t+1}\right] \mid \mathcal{F}_{t}\right] \qquad (1)$$

We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t+1}$, $Z_{t}$ is fixed. Using the inductive hypothesis, we obtain

$$\mathbb{E}\left[\exp\left(Z_{t} + S_{t+1}\right) \mid \mathcal{F}_{t+1}\right] \leq \exp\left(Z_{t}\right)\exp\left(\left(w_{t+1} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t+1}) + 3\sigma^{2}\sum_{i=t+1}^{T} w_{i+1}\eta_{i}^{2}\right) \qquad (2)$$

Let $X_{t} = \eta_{t}\left\langle\xi_{t}, x^{*}-x_{t}\right\rangle$. By Lemma 3.1, we have

$$\eta_{t}\left(f(x_{t}) - f(x^{*})\right) - \eta_{t}^{2}G^{2} \leq X_{t} - \left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) + \eta_{t}^{2}\left\|\xi_{t}\right\|^{2}$$

and thus

$$\begin{aligned}
Z_{t} &= w_{t+1}\left(\eta_{t}\left(f(x_{t}) - f(x^{*})\right) - \eta_{t}^{2}G^{2}\right) + w_{T+1}\left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) \\
&\leq w_{t+1}\left(X_{t} - \left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) + \eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right) + w_{T+1}\left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) \\
&= w_{t+1}X_{t} - \left(w_{t+1} - w_{T+1}\right)\left(\mathbf{D}_{\psi}(x^{*}, x_{t+1}) - \mathbf{D}_{\psi}(x^{*}, x_{t})\right) + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}
\end{aligned}$$

Plugging into (2), we obtain

$$\mathbb{E}\left[\exp\left(Z_{t} + S_{t+1}\right) \mid \mathcal{F}_{t+1}\right] \leq \exp\left(w_{t+1}X_{t} + \left(w_{t+1} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2} + 3\sigma^{2}\sum_{i=t+1}^{T} w_{i+1}\eta_{i}^{2}\right)$$

Plugging into (1), we obtain

$$\mathbb{E}\left[\exp\left(S_{t}\right) \mid \mathcal{F}_{t}\right] \leq \exp\left(\left(w_{t+1} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) + 3\sigma^{2}\sum_{i=t+1}^{T} w_{i+1}\eta_{i}^{2}\right)\mathbb{E}\left[\exp\left(w_{t+1}X_{t} + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right) \mid \mathcal{F}_{t}\right] \qquad (3)$$

Next, we analyze the expectation on the RHS of the above inequality. We have

$$\begin{aligned}
&\mathbb{E}\left[\exp\left(w_{t+1}X_{t} + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right) \mid \mathcal{F}_{t}\right] \\
&= \mathbb{E}\left[\sum_{i=0}^{\infty}\frac{1}{i!}\left(w_{t+1}X_{t} + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i} \mid \mathcal{F}_{t}\right] \\
&= \mathbb{E}\left[1 + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2} + \sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t+1}X_{t} + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i} \mid \mathcal{F}_{t}\right] \\
&\leq \mathbb{E}\left[1 + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2} + \sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t+1}\eta_{t}\left\|x^{*}-x_{t}\right\|\left\|\xi_{t}\right\| + w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i} \mid \mathcal{F}_{t}\right] \\
&\leq \exp\left(3\sigma^{2}\left(w_{t+1}^{2}\eta_{t}^{2}\left\|x^{*}-x_{t}\right\|^{2} + w_{t+1}\eta_{t}^{2}\right)\right) \\
&\leq \exp\left(3\sigma^{2}\left(2w_{t+1}^{2}\eta_{t}^{2}\mathbf{D}_{\psi}(x^{*}, x_{t}) + w_{t+1}\eta_{t}^{2}\right)\right) \qquad (4)
\end{aligned}$$

On the first line we used the Taylor expansion of $e^{x}$, and on the second line we used that $\mathbb{E}\left[X_{t} \mid \mathcal{F}_{t}\right] = 0$. On the third line, we used Cauchy-Schwarz and obtained

$$X_{t} = \eta_{t}\left\langle\xi_{t}, x^{*}-x_{t}\right\rangle \leq \eta_{t}\left\|\xi_{t}\right\|\left\|x^{*}-x_{t}\right\|$$

On the fourth line, we applied Lemma 2.3 with $X = \left\|\xi_{t}\right\|$, $a = w_{t+1}\eta_{t}\left\|x^{*}-x_{t}\right\|$, and $b^{2} = w_{t+1}\eta_{t}^{2} \leq \frac{1}{4\sigma^{2}}$. On the fifth line, we used that $\mathbf{D}_{\psi}(x^{*}, x_{t}) \geq \frac{1}{2}\left\|x^{*}-x_{t}\right\|^{2}$, which follows from the strong convexity of $\psi$.

Plugging (4) into (3) and using that $w_{t} \geq w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$, we obtain

$$\begin{aligned}
\mathbb{E}\left[\exp\left(S_{t}\right) \mid \mathcal{F}_{t}\right] &\leq \exp\left(\left(w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) + 3\sigma^{2}\sum_{i=t}^{T} w_{i+1}\eta_{i}^{2}\right) \\
&\leq \exp\left(\left(w_{t} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{t}) + 3\sigma^{2}\sum_{i=t}^{T} w_{i+1}\eta_{i}^{2}\right)
\end{aligned}$$

as needed. ∎

Theorem 3.2 and Markov's inequality give us the following convergence guarantee.

Corollary 3.3.

Suppose the sequence $\left\{w_{t}\right\}$ satisfies the conditions of Theorem 3.2. For any $\delta > 0$, the following event holds with probability at least $1-\delta$:

$$\begin{aligned}
&\sum_{t=1}^{T} w_{t+1}\eta_{t}\left(f(x_{t}) - f(x^{*})\right) + w_{T+1}\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \\
&\leq w_{1}\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + 3\sigma^{2}\right)\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)
\end{aligned}$$
Proof.

Let

$$K = \left(w_{1} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{1}) + 3\sigma^{2}\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)$$

By Theorem 3.2 and Markov's inequality, we have

$$\begin{aligned}
\Pr\left[S_{1} \geq K\right] &\leq \Pr\left[\exp\left(S_{1}\right) \geq \exp\left(K\right)\right] \\
&\leq \exp\left(-K\right)\mathbb{E}\left[\exp\left(S_{1}\right)\right] \\
&\leq \exp\left(-K\right)\exp\left(\left(w_{1} - w_{T+1}\right)\mathbf{D}_{\psi}(x^{*}, x_{1}) + 3\sigma^{2}\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2}\right) \\
&= \delta
\end{aligned}$$

Note that the divergence terms telescope, so

$$S_{1} = \sum_{t=1}^{T} Z_{t} = \sum_{t=1}^{T} w_{t+1}\eta_{t}\left(f(x_{t}) - f(x^{*})\right) - G^{2}\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + w_{T+1}\left(\mathbf{D}_{\psi}(x^{*}, x_{T+1}) - \mathbf{D}_{\psi}(x^{*}, x_{1})\right)$$

Therefore, with probability at least $1-\delta$, we have $S_{1} \leq K$, which rearranges to

$$\sum_{t=1}^{T} w_{t+1}\eta_{t}\left(f(x_{t}) - f(x^{*})\right) + w_{T+1}\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq w_{1}\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + 3\sigma^{2}\right)\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)$$

∎

With the above result in hand, we complete the convergence analysis by showing how to define the sequence $\left\{w_{t}\right\}$ with the desired properties.

Corollary 3.4.

Suppose we run the Stochastic Mirror Descent algorithm with fixed step sizes $\eta_{t} = \eta$. Let $w_{T+1} = \frac{1}{12\sigma^{2}\eta^{2}\left(T+1\right)}$ and $w_{t} = w_{t+1} + 6\sigma^{2}\eta^{2}w_{t+1}^{2}$ for all $1 \leq t \leq T$. The sequence $\left\{w_{t}\right\}$ satisfies the conditions required by Corollary 3.3. By Corollary 3.3, for any $\delta > 0$, the following events hold with probability at least $1-\delta$:

$$\frac{1}{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) \leq O\left(\frac{\mathbf{D}_{\psi}(x^{*}, x_{1})}{\eta T} + \left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)\eta\right)$$

and

$$\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq O\left(\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T\right)$$

Setting $\eta = \sqrt{\frac{\mathbf{D}_{\psi}(x^{*}, x_{1})}{\left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)T}}$ to balance the two terms in the first inequality gives

$$\frac{1}{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) \leq O\left(\sqrt{\frac{\mathbf{D}_{\psi}(x^{*}, x_{1})\left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)}{T}}\right)$$

and

$$\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq O\left(\mathbf{D}_{\psi}(x^{*}, x_{1})\right)$$
Proof.

Recall from Corollary 3.3 that the sequence $\left\{w_{t}\right\}$ needs to satisfy the following conditions for all $1 \leq t \leq T$:

$$\begin{aligned}
w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2} &\leq w_{t} \\
w_{t+1}\eta_{t}^{2} &\leq \frac{1}{4\sigma^{2}}
\end{aligned}$$

Let $C = 6\sigma^{2}\eta^{2}\left(T+1\right)$. We set $w_{T+1} = \frac{1}{C + 6\sigma^{2}\eta^{2}\left(T+1\right)} = \frac{1}{2C}$. For $1 \leq t \leq T$, we set $w_{t}$ so that the first condition holds with equality

$$w_{t} = w_{t+1} + 6\sigma^{2}w_{t+1}^{2}\eta_{t}^{2} = w_{t+1} + 6\sigma^{2}\eta^{2}w_{t+1}^{2}$$

We can show by induction that, for every $1 \leq t \leq T+1$, we have

$$w_{t} \leq \frac{1}{C + 6\sigma^{2}\eta^{2}t}$$

The base case $t = T+1$ follows from the definition of $w_{T+1}$. Consider $1 \leq t \leq T$. Using the definition of $w_{t}$ and the inductive hypothesis, we obtain

$$\begin{aligned}
w_{t} &= w_{t+1} + 6\sigma^{2}\eta^{2}w_{t+1}^{2} \\
&\leq \frac{1}{C + 6\sigma^{2}\eta^{2}\left(t+1\right)} + \frac{6\sigma^{2}\eta^{2}}{\left(C + 6\sigma^{2}\eta^{2}\left(t+1\right)\right)^{2}} \\
&\leq \frac{1}{C + 6\sigma^{2}\eta^{2}\left(t+1\right)} + \frac{\left(C + 6\sigma^{2}\eta^{2}\left(t+1\right)\right) - \left(C + 6\sigma^{2}\eta^{2}t\right)}{\left(C + 6\sigma^{2}\eta^{2}\left(t+1\right)\right)\left(C + 6\sigma^{2}\eta^{2}t\right)} \\
&= \frac{1}{C + 6\sigma^{2}\eta^{2}t}
\end{aligned}$$

as needed.

Using this fact, we now show that $\left\{w_{t}\right\}$ satisfies the second condition. For every $1 \leq t \leq T$, we have

$$w_{t+1}\eta_{t}^{2} = w_{t+1}\eta^{2} \leq \frac{\eta^{2}}{C} = \frac{1}{6\sigma^{2}\left(T+1\right)} \leq \frac{1}{6\sigma^{2}} \leq \frac{1}{4\sigma^{2}}$$

as needed.

Thus, by Corollary 3.3, with probability $\geq 1-\delta$, we have

$$\sum_{t=1}^{T} w_{t+1}\eta_{t}\left(f(x_{t}) - f(x^{*})\right) + w_{T+1}\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq w_{1}\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + 3\sigma^{2}\right)\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)$$

Note that $w_{T+1} = \frac{1}{2C}$ and $\frac{1}{2C} \leq w_{t} \leq \frac{1}{C}$ for all $1 \leq t \leq T+1$. Thus we obtain

$$\begin{aligned}
\eta\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) + \mathbf{D}_{\psi}(x^{*}, x_{T+1}) &\leq 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + 2\left(G^{2} + 3\sigma^{2}\right)\eta^{2}T + 2C\ln\left(\frac{1}{\delta}\right) \\
&= 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + 2\left(G^{2} + 3\sigma^{2}\right)\eta^{2}T + 12\sigma^{2}\ln\left(\frac{1}{\delta}\right)\eta^{2}\left(T+1\right) \\
&\leq 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(2G^{2} + 6\sigma^{2}\left(1 + 4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T
\end{aligned}$$

Thus we have

$$\frac{1}{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) \leq \frac{2\mathbf{D}_{\psi}(x^{*}, x_{1})}{\eta T} + \left(2G^{2} + 6\sigma^{2}\left(1 + 4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta$$

and

$$\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(2G^{2} + 6\sigma^{2}\left(1 + 4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T$$

∎
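The backward recursion for the weights in Corollary 3.4 can be sketched numerically as follows. This is an illustrative check rather than part of the analysis; the function names `fixed_step_weights` and `check_fixed_step_weights` and the sample values of `sigma`, `eta`, and `T` are assumptions introduced here.

```python
from typing import List


def fixed_step_weights(sigma: float, eta: float, T: int) -> List[float]:
    """Return [w_1, ..., w_{T+1}] for fixed step size eta (w[k] stores w_{k+1})."""
    c = 6.0 * sigma**2 * eta**2           # per-step constant 6 sigma^2 eta^2
    w = [0.0] * (T + 1)
    w[T] = 1.0 / (2.0 * c * (T + 1))      # w_{T+1} = 1 / (12 sigma^2 eta^2 (T+1))
    for k in range(T - 1, -1, -1):
        # First condition of Theorem 3.2, taken with equality.
        w[k] = w[k + 1] + c * w[k + 1] ** 2
    return w


def check_fixed_step_weights(w: List[float], sigma: float, eta: float) -> bool:
    """Check the second condition and the sandwich 1/(2C) <= w_t <= 1/C."""
    T = len(w) - 1
    C = 6.0 * sigma**2 * eta**2 * (T + 1)
    tol = 1e-12
    second = all(wk * eta**2 <= 1.0 / (4.0 * sigma**2) for wk in w[1:])
    sandwich = all(
        1.0 / (2.0 * C) * (1 - tol) <= wk <= 1.0 / C * (1 + tol) for wk in w
    )
    return second and sandwich


weights = fixed_step_weights(sigma=2.0, eta=0.1, T=100)
ok = check_fixed_step_weights(weights, sigma=2.0, eta=0.1)
```

Because all weights lie within a factor of two of each other, multiplying the bound of Corollary 3.3 by $2C$ loses only constant factors.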

The analysis readily extends to the setting where the time horizon $T$ is not known and we set time-varying step sizes. We include below the analysis for the well-studied step sizes $\eta_{t} = \frac{\eta}{\sqrt{t}}$.

Corollary 3.5.

Suppose we run the Stochastic Mirror Descent algorithm with time-varying step sizes $\eta_{t} = \frac{\eta}{\sqrt{t}}$. Let $w_{T+1} = \frac{1}{12\sigma^{2}\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)}$ and $w_{t} = w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$ for all $1 \leq t \leq T$. The sequence $\left\{w_{t}\right\}$ satisfies the conditions required by Corollary 3.3. By Corollary 3.3, for any $\delta > 0$, the following events hold with probability at least $1-\delta$:

$$\frac{1}{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) \leq O\left(\frac{1}{\sqrt{T}}\left(\frac{\mathbf{D}_{\psi}(x^{*}, x_{1})}{\eta} + \eta\left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)\ln T\right)\right)$$

and

$$\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq O\left(\mathbf{D}_{\psi}(x^{*}, x_{1}) + \eta^{2}\left(G^{2} + \sigma^{2}\left(1 + \ln\left(\frac{1}{\delta}\right)\right)\right)\ln T\right)$$
Proof.

Recall from Corollary 3.3 that the sequence $\left\{w_{t}\right\}$ needs to satisfy the following conditions for all $1 \leq t \leq T$:

$$\begin{aligned}
w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2} &\leq w_{t} \\
w_{t+1}\eta_{t}^{2} &\leq \frac{1}{4\sigma^{2}}
\end{aligned}$$

Let $B_{t} = 6\sigma^{2}\sum_{i=1}^{t-1}\eta_{i}^{2}$ and $C = B_{T+1} = 6\sigma^{2}\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)$. We set $w_{T+1} = \frac{1}{C + B_{T+1}}$. For $1 \leq t \leq T$, we set $w_{t}$ so that the first condition holds with equality

$$w_{t} = w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$$

We can show by induction that, for every $1 \leq t \leq T+1$, we have

$$w_{t} \leq \frac{1}{C + B_{t}}$$

The base case $t = T+1$ follows from the definition of $w_{T+1}$. Consider $1 \leq t \leq T$. Using the definition of $w_{t}$, the inductive hypothesis, and $B_{t+1} - B_{t} = 6\sigma^{2}\eta_{t}^{2}$, we obtain

$$\begin{aligned}
w_{t} &= w_{t+1} + 6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2} \\
&\leq \frac{1}{C + B_{t+1}} + \frac{6\sigma^{2}\eta_{t}^{2}}{\left(C + B_{t+1}\right)^{2}} \\
&\leq \frac{1}{C + B_{t+1}} + \frac{\left(C + B_{t+1}\right) - \left(C + B_{t}\right)}{\left(C + B_{t+1}\right)\left(C + B_{t}\right)} \\
&= \frac{1}{C + B_{t}}
\end{aligned}$$

as needed.

Using this fact, we now show that $\left\{w_{t}\right\}$ satisfies the second condition. For every $1 \leq t \leq T$, we have

$$w_{t+1}\eta_{t}^{2} \leq \frac{\eta_{t}^{2}}{C} = \frac{1}{t\left(6\sigma^{2}\sum_{i=1}^{T}\frac{1}{i}\right)} \leq \frac{1}{6\sigma^{2}} \leq \frac{1}{4\sigma^{2}}$$

as needed.

Thus, by Corollary 3.3, with probability $\geq 1-\delta$, we have

$$\sum_{t=1}^{T} w_{t+1}\eta_{t}\left(f(x_{t}) - f(x^{*})\right) + w_{T+1}\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq w_{1}\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + 3\sigma^{2}\right)\sum_{t=1}^{T} w_{t+1}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)$$

Note that $w_{T+1} = \frac{1}{2C}$ and $\frac{1}{2C} \leq w_{t} \leq \frac{1}{C}$ for all $1 \leq t \leq T+1$. Since $\eta_{t} \geq \eta_{T}$ and $f(x_{t}) \geq f(x^{*})$ for all $t$, we obtain

$$\frac{1}{2C}\eta_{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) + \frac{1}{2C}\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq \frac{1}{C}\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(G^{2} + 3\sigma^{2}\right)\frac{1}{C}\sum_{t=1}^{T}\eta_{t}^{2} + \ln\left(\frac{1}{\delta}\right)$$

Plugging in $\eta_{t} = \frac{\eta}{\sqrt{t}}$ and simplifying, we obtain

$$\begin{aligned}
\frac{\eta}{\sqrt{T}}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) + \mathbf{D}_{\psi}(x^{*}, x_{T+1}) &\leq 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(2G^{2} + 6\sigma^{2}\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right) + 2C\ln\left(\frac{1}{\delta}\right) \\
&= 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(2G^{2} + 6\sigma^{2}\left(1 + 2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)
\end{aligned}$$

Thus we have

$$\frac{1}{T}\sum_{t=1}^{T}\left(f(x_{t}) - f(x^{*})\right) \leq \frac{1}{\sqrt{T}}\left(\frac{2\mathbf{D}_{\psi}(x^{*}, x_{1})}{\eta} + \left(2G^{2} + 6\sigma^{2}\left(1 + 2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta\left(\sum_{t=1}^{T}\frac{1}{t}\right)\right)$$

and

$$\mathbf{D}_{\psi}(x^{*}, x_{T+1}) \leq 2\mathbf{D}_{\psi}(x^{*}, x_{1}) + \left(2G^{2} + 6\sigma^{2}\left(1 + 2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)$$

∎

4 Analysis of Accelerated Stochastic Mirror Descent

Algorithm 2 Accelerated Stochastic Mirror Descent Algorithm [5]. $\psi\colon\mathbb{R}^d\to\mathbb{R}$ is a strongly convex mirror map. $\mathbf{D}_{\psi}(x,y)=\psi(x)-\psi(y)-\left\langle\nabla\psi(y),x-y\right\rangle$ is the Bregman divergence of $\psi$.

Parameters: initial point $x_0=y_0=z_0$, step size $\eta$

Set $\alpha_t=\frac{2}{t+1}$, $\eta_t=t\eta$

for $t=1$ to $T$:

   $x_t=(1-\alpha_t)y_{t-1}+\alpha_tz_{t-1}$

   $z_t=\arg\min_{x\in\mathcal{X}}\left(\eta_t\left\langle\widehat{\nabla}f(x_t),x\right\rangle+\mathbf{D}_{\psi}(x,z_{t-1})\right)$

   $y_t=(1-\alpha_t)y_{t-1}+\alpha_tz_t$

return $y_T$
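To make the update rules of Algorithm 2 concrete, the following Python sketch specializes them to the Euclidean mirror map $\psi(x)=\frac{1}{2}\|x\|^2$ on the unconstrained domain $\mathcal{X}=\mathbb{R}^d$, where the $z_t$ update has the closed form $z_t=z_{t-1}-\eta_t\widehat{\nabla}f(x_t)$. The mirror map, the domain, the test objective, the noise level, the step size, and the names `accelerated_smd` and `noisy_grad` are all assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np

def accelerated_smd(grad_oracle, x0, eta, T):
    """Sketch of Algorithm 2, assuming psi(x) = 0.5*||x||^2 and X = R^d."""
    y = np.array(x0, dtype=float)
    z = y.copy()
    for t in range(1, T + 1):
        alpha_t = 2.0 / (t + 1)
        eta_t = t * eta
        x = (1.0 - alpha_t) * y + alpha_t * z
        g = grad_oracle(x)  # stochastic gradient at x_t
        # Euclidean, unconstrained case: the argmin step is a plain gradient step on z.
        z_new = z - eta_t * g
        y = (1.0 - alpha_t) * y + alpha_t * z_new
        z = z_new
    return y

# Hypothetical test problem: f(x) = 0.5*||x||^2 (so beta = 1) with Gaussian gradient noise.
rng = np.random.default_rng(0)
sigma = 0.1

def noisy_grad(x):
    return x + sigma * rng.standard_normal(x.shape)

def f(x):
    return 0.5 * float(x @ x)

x0 = np.ones(5)
y_T = accelerated_smd(noisy_grad, x0, eta=0.002, T=200)
print(f(x0), f(y_T))
```

For other mirror maps or constrained domains, the line computing `z_new` would have to be replaced by the corresponding Bregman projection step.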

In this section, we analyze the Accelerated Stochastic Mirror Descent Algorithm (Algorithm 2). We assume that $f$ satisfies the following condition:

\[
f(y)\leq f(x)+\left\langle\nabla f(x),y-x\right\rangle+G\|y-x\|+\frac{\beta}{2}\|y-x\|^2\quad\forall x,y\in\mathcal{X}
\]

$\beta$-smooth functions, $G$-Lipschitz functions, and their sums all satisfy the above condition.

As before, we define

\[
\xi_t:=\widehat{\nabla}f(x_t)-\nabla f(x_t)
\]

We let $\mathcal{F}_t=\sigma(\xi_1,\dots,\xi_{t-1})$ denote the natural filtration. Note that $x_t$ is $\mathcal{F}_t$-measurable and $z_t$ and $y_t$ are $\mathcal{F}_{t+1}$-measurable.

We follow a similar analysis to the previous section. As before, we start with the inequalities shown in the standard analysis of the algorithm, and we combine them using coefficients $\{w_t\}_{1\leq t\leq T}$. The following lemma follows from the analysis given in [5] and we include the proof in the Appendix for completeness.

Lemma 4.1.

([5]) For every iteration $t$, we have

\[
\begin{aligned}
&\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t}{\alpha_t}(1-\alpha_t)\left(f(y_{t-1})-f(x^*)\right)-\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}G^2+\mathbf{D}_{\psi}(x^*,z_t)-\mathbf{D}_{\psi}(x^*,z_{t-1})\\
&\quad\leq\eta_t\left\langle\xi_t,x^*-z_{t-1}\right\rangle+\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2
\end{aligned}
\]

We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function for a random variable that is closely related to the left-hand side of the inequality above. We let $w_0\geq w_1\geq w_2\geq\dots\geq w_T\geq 0$ be a non-increasing sequence where $w_t\in\mathbb{R}$ for all $t$. We define

\[
\begin{aligned}
Z_t &=w_t\left(\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t(1-\alpha_t)}{\alpha_t}\left(f(y_{t-1})-f(x^*)\right)-\frac{\eta_t^2G^2}{1-\beta\alpha_t\eta_t}\right)\\
&\quad+w_T\left(\mathbf{D}_{\psi}(x^*,z_t)-\mathbf{D}_{\psi}(x^*,z_{t-1})\right)\qquad\forall\,1\leq t\leq T\\
S_t &=\sum_{i=t}^{T}Z_i\qquad\forall\,1\leq t\leq T+1
\end{aligned}
\]
Theorem 4.2.

Suppose that $w_{t-1}\geq w_t+6\sigma^2\eta_t^2w_t^2$ for every $1\leq t\leq T$ and $\frac{w_t\eta_t^2}{1-\beta\alpha_t\eta_t}\leq\frac{1}{4\sigma^2}$ for every $0\leq t\leq T$. For every $1\leq t\leq T+1$, we have

\[
\mathbb{E}\left[\exp(S_t)\mid\mathcal{F}_t\right]\leq\exp\left((w_{t-1}-w_T)\mathbf{D}_{\psi}(x^*,z_{t-1})+3\sigma^2\sum_{i=t}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)
\]
Proof.

We proceed by induction on $t$. Consider the base case $t=T+1$. We have $S_t=0$ and $w_{t-1}-w_T=0$, and the inequality follows. Next, we consider $t\leq T$. We have

\[
\mathbb{E}\left[\exp(S_t)\mid\mathcal{F}_t\right]=\mathbb{E}\left[\exp(Z_t+S_{t+1})\mid\mathcal{F}_t\right]=\mathbb{E}\left[\mathbb{E}\left[\exp(Z_t+S_{t+1})\mid\mathcal{F}_{t+1}\right]\mid\mathcal{F}_t\right]\tag{5}
\]

We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t+1}$, $Z_t$ is fixed. Using the inductive hypothesis, we obtain

\[
\mathbb{E}\left[\exp(Z_t+S_{t+1})\mid\mathcal{F}_{t+1}\right]\leq\exp(Z_t)\exp\left((w_t-w_T)\mathbf{D}_{\psi}(x^*,z_t)+3\sigma^2\sum_{i=t+1}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)\tag{6}
\]

Let $X_t=\eta_t\left\langle\xi_t,x^*-z_{t-1}\right\rangle$. By Lemma 4.1, we have

\[
\begin{aligned}
&\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t}{\alpha_t}(1-\alpha_t)\left(f(y_{t-1})-f(x^*)\right)-\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}G^2\\
&\quad\leq X_t+\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2-\left(\mathbf{D}_{\psi}(x^*,z_t)-\mathbf{D}_{\psi}(x^*,z_{t-1})\right)
\end{aligned}
\]

and thus

\[
Z_t\leq w_tX_t-(w_t-w_T)\left(\mathbf{D}_{\psi}(x^*,z_t)-\mathbf{D}_{\psi}(x^*,z_{t-1})\right)+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2
\]

Plugging into (6), we obtain

\[
\begin{aligned}
&\mathbb{E}\left[\exp(Z_t+S_{t+1})\mid\mathcal{F}_{t+1}\right]\\
&\quad\leq\exp\left(w_tX_t+(w_t-w_T)\mathbf{D}_{\psi}(x^*,z_{t-1})+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2+3\sigma^2\sum_{i=t+1}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)
\end{aligned}
\]

Plugging into (5), we obtain

\[
\begin{aligned}
&\mathbb{E}\left[\exp(S_t)\mid\mathcal{F}_t\right]\\
&\quad\leq\exp\left((w_t-w_T)\mathbf{D}_{\psi}(x^*,z_{t-1})+3\sigma^2\sum_{i=t+1}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)\mathbb{E}\left[\exp\left(w_tX_t+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2\right)\mid\mathcal{F}_t\right]
\end{aligned}\tag{7}
\]

Next, we analyze the expectation on the RHS of the above inequality. We have

\[
\begin{aligned}
&\mathbb{E}\left[\exp\left(w_tX_t+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2\right)\mid\mathcal{F}_t\right]\\
&\quad=\mathbb{E}\left[\sum_{i=0}^{\infty}\frac{1}{i!}\left(w_tX_t+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2\right)^i\mid\mathcal{F}_t\right]\\
&\quad=\mathbb{E}\left[1+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_tX_t+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2\right)^i\mid\mathcal{F}_t\right]\\
&\quad\leq\mathbb{E}\left[1+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_t\eta_t\|x^*-z_{t-1}\|\|\xi_t\|+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\|\xi_t\|^2\right)^i\mid\mathcal{F}_t\right]\\
&\quad\leq\exp\left(3\left(w_t^2\eta_t^2\|x^*-z_{t-1}\|^2+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\right)\sigma^2\right)\\
&\quad\leq\exp\left(3\left(2w_t^2\eta_t^2\mathbf{D}_{\psi}(x^*,z_{t-1})+w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\right)\sigma^2\right)
\end{aligned}\tag{8}
\]

On the first line, we used the Taylor expansion of $e^x$, and on the second line, we used that $\mathbb{E}\left[X_t\mid\mathcal{F}_t\right]=0$. On the third line, we used Cauchy-Schwarz and obtained

\[
X_t=\eta_t\left\langle\xi_t,x^*-z_{t-1}\right\rangle\leq\eta_t\|\xi_t\|\|x^*-z_{t-1}\|
\]

On the fourth line, we applied Lemma 2.3 with $X=\|\xi_t\|$, $a=w_t\eta_t\|x^*-z_{t-1}\|$, and $b^2=w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\leq\frac{1}{4\sigma^2}$. On the fifth line, we used that $\mathbf{D}_{\psi}(x^*,z_{t-1})\geq\frac{1}{2}\|x^*-z_{t-1}\|^2$, which follows from the strong convexity of $\psi$.

Plugging (8) into (7) and using that $w_{t-1}\geq w_t+6\sigma^2w_t^2\eta_t^2$, we obtain

\[
\begin{aligned}
\mathbb{E}\left[\exp(S_t)\mid\mathcal{F}_t\right] &\leq\exp\left(\left(w_t+6\sigma^2w_t^2\eta_t^2-w_T\right)\mathbf{D}_{\psi}(x^*,z_{t-1})+3\sigma^2\sum_{i=t}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)\\
&\leq\exp\left((w_{t-1}-w_T)\mathbf{D}_{\psi}(x^*,z_{t-1})+3\sigma^2\sum_{i=t}^{T}w_i\frac{\eta_i^2}{1-\beta\alpha_i\eta_i}\right)
\end{aligned}
\]

as needed. ∎

Theorem 4.2 and Markov's inequality give us the following convergence guarantee.

Corollary 4.3.

Suppose the sequence $\{w_t\}$ satisfies the conditions of Theorem 4.2. For any $\delta>0$, the following event holds with probability at least $1-\delta$:

\[
\begin{aligned}
&\sum_{t=1}^{T}w_t\left(\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t(1-\alpha_t)}{\alpha_t}\left(f(y_{t-1})-f(x^*)\right)\right)+w_T\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq w_0\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\end{aligned}
\]
Proof.

Let

\[
K=(w_0-w_T)\mathbf{D}_{\psi}(x^*,z_0)+3\sigma^2\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\]

By Theorem 4.2 and Markov's inequality, we have

\[
\begin{aligned}
\Pr\left[S_1\geq K\right] &\leq\Pr\left[\exp(S_1)\geq\exp(K)\right]\\
&\leq\exp(-K)\,\mathbb{E}\left[\exp(S_1)\right]\\
&\leq\exp(-K)\exp\left((w_0-w_T)\mathbf{D}_{\psi}(x^*,z_0)+3\sigma^2\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}\right)\\
&=\delta
\end{aligned}
\]

Note that

\[
\begin{aligned}
S_1 &=\sum_{t=1}^{T}Z_t\\
&=\sum_{t=1}^{T}w_t\left(\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t(1-\alpha_t)}{\alpha_t}\left(f(y_{t-1})-f(x^*)\right)\right)\\
&\quad-G^2\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+w_T\left(\mathbf{D}_{\psi}(x^*,z_T)-\mathbf{D}_{\psi}(x^*,z_0)\right)
\end{aligned}
\]

Therefore, with probability at least $1-\delta$, we have

\[
\begin{aligned}
&\sum_{t=1}^{T}w_t\left(\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t(1-\alpha_t)}{\alpha_t}\left(f(y_{t-1})-f(x^*)\right)\right)+w_T\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq w_0\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\end{aligned}
\]

∎

With the above result in hand, we complete the convergence analysis by showing how to define the sequence $\{w_t\}$ with the desired properties.

Corollary 4.4.

Suppose we run the Accelerated Stochastic Mirror Descent algorithm with the standard choices $\alpha_t=\frac{2}{t+1}$ and $\eta_t=\eta t$ with $\eta\leq\frac{1}{4\beta}$. Let $w_T=\frac{1}{2\sigma^2\eta^2T(T+1)(2T+1)}$ and $w_{t-1}=w_t+6\sigma^2\eta_t^2w_t^2$ for all $1\leq t\leq T$. The sequence $\{w_t\}_{0\leq t\leq T}$ satisfies the conditions required by Corollary 4.3. By Corollary 4.3, for any $\delta>0$, the following events hold with probability at least $1-\delta$:

\[
f(y_T)-f(x^*)\leq O\left(\frac{\mathbf{D}_{\psi}(x^*,z_0)}{\eta T^2}+\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta T\right)
\]

and

\[
\mathbf{D}_{\psi}(x^*,z_T)\leq O\left(\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta^2T^3\right)
\]

Setting $\eta=\min\left\{\frac{1}{4\beta},\frac{\sqrt{\mathbf{D}_{\psi}(x^*,z_0)}}{\sqrt{G^2+\sigma^2\left(1+\ln\left(\frac{1}{\delta}\right)\right)}\,T^{3/2}}\right\}$ to balance the two terms in the first inequality gives

\[
f(y_T)-f(x^*)\leq O\left(\frac{\beta\mathbf{D}_{\psi}(x^*,z_0)}{T^2}+\frac{\sqrt{\mathbf{D}_{\psi}(x^*,z_0)\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)}}{\sqrt{T}}\right)
\]

and

\[
\mathbf{D}_{\psi}(x^*,z_T)\leq O\left(\mathbf{D}_{\psi}(x^*,z_0)\right)
\]
Proof.

Recall from Corollary 4.3 that the sequence $\{w_t\}$ needs to satisfy the following conditions:

\[
w_t+6\sigma^2\eta_t^2w_t^2\leq w_{t-1}\quad\forall\,1\leq t\leq T\tag{9}
\]

\[
\frac{w_t\eta_t^2}{1-\beta\alpha_t\eta_t}\leq\frac{1}{4\sigma^2}\quad\forall\,0\leq t\leq T\tag{10}
\]

We will set $\{w_t\}$ so that it satisfies the following additional condition, which will allow us to telescope the sum on the LHS of Corollary 4.3:

\[
w_{t-1}\frac{\eta_{t-1}}{\alpha_{t-1}}\geq w_t\frac{\eta_t(1-\alpha_t)}{\alpha_t}\quad\forall\,1\leq t\leq T-1\tag{11}
\]

Given $w_T$, we set $w_{t-1}$ for every $1\leq t\leq T$ so that the first condition (9) holds with equality:

\[
w_{t-1}=w_t+6\sigma^2\eta_t^2w_t^2=w_t+6\sigma^2\eta^2t^2w_t^2
\]

Let $C=\sigma^2\eta^2T(T+1)(2T+1)$. We set

\[
w_T=\frac{1}{C+6\sigma^2\eta^2\sum_{i=1}^{T}i^2}=\frac{1}{C+\sigma^2\eta^2T(T+1)(2T+1)}=\frac{1}{2\sigma^2\eta^2T(T+1)(2T+1)}
\]

Given this choice for $w_T$, we now verify that, for all $0\leq t\leq T$, we have

\[
w_t\leq\frac{1}{C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2}=\frac{1}{C+\sigma^2\eta^2t(t+1)(2t+1)}
\]

We proceed by backward induction on $t$. The base case $t=T$ follows from the definition of $w_T$. For the inductive step, suppose the bound holds for $w_t$ with $1\leq t\leq T$; we show that it holds for $w_{t-1}$. Using the definition of $w_{t-1}$ and the inductive hypothesis, we obtain

\[
\begin{aligned}
w_{t-1} &=w_t+6\sigma^2\eta^2t^2w_t^2\\
&\leq\frac{1}{C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2}+\frac{6\sigma^2\eta^2t^2}{\left(C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2\right)^2}\\
&\leq\frac{1}{C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2}+\frac{\left(C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2\right)-\left(C+6\sigma^2\eta^2\sum_{i=1}^{t-1}i^2\right)}{\left(C+6\sigma^2\eta^2\sum_{i=1}^{t}i^2\right)\left(C+6\sigma^2\eta^2\sum_{i=1}^{t-1}i^2\right)}\\
&=\frac{1}{C+6\sigma^2\eta^2\sum_{i=1}^{t-1}i^2}
\end{aligned}
\]

as needed.

Let us now verify that the second condition (10) also holds. Using that $\frac{2t}{t+1}\leq 2$, $\beta\eta\leq\frac{1}{4}$, and $T\geq 2$, we obtain

\[
\frac{w_t\eta_t^2}{1-\beta\alpha_t\eta_t}=\frac{w_t\eta^2t^2}{1-\beta\eta\frac{2t}{t+1}}\leq 2w_t\eta^2t^2\leq\frac{2\eta^2t^2}{C}=\frac{t^2}{\sigma^2T(T+1)(2T+1)}\leq\frac{1}{\sigma^2(2T+1)}\leq\frac{1}{4\sigma^2}
\]

as needed.

Let us now verify that the third condition (11) also holds. Since $\eta_t=\eta t$ and $\alpha_t=\frac{2}{t+1}$, we have $\frac{\eta_{t-1}}{\alpha_{t-1}}=\frac{\eta_t(1-\alpha_t)}{\alpha_t}=\frac{\eta t(t-1)}{2}$. Since $w_t\leq w_{t-1}$, it follows that condition (11) holds.
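As a numerical sanity check of the three conditions, the following Python sketch builds the weights backward from $w_T=\frac{1}{2C}$ via the recursion $w_{t-1}=w_t+6\sigma^2\eta_t^2w_t^2$ and tests (9), (10), (11), and the sandwich $\frac{1}{2C}\leq w_t\leq\frac{1}{C}$ in floating point. The function names, the tolerance, and the parameter values are assumptions chosen for illustration; passing this check on a few inputs is not a substitute for the proof.

```python
def build_weights(sigma, eta, T):
    """Return ([w_0, ..., w_T], C) with w_T = 1/(2C) and condition (9) held with equality."""
    C = sigma**2 * eta**2 * T * (T + 1) * (2 * T + 1)
    w = [0.0] * (T + 1)
    w[T] = 1.0 / (2.0 * C)
    for t in range(T, 0, -1):
        eta_t = eta * t
        w[t - 1] = w[t] + 6.0 * sigma**2 * eta_t**2 * w[t] ** 2
    return w, C

def check_conditions(sigma, beta, eta, T, tol=1e-12):
    """Check (9), (10), (11) and 1/(2C) <= w_t <= 1/C, up to a relative tolerance."""
    w, C = build_weights(sigma, eta, T)
    alpha = lambda t: 2.0 / (t + 1)
    step = lambda t: eta * t
    cond9 = all(
        w[t] + 6.0 * sigma**2 * step(t) ** 2 * w[t] ** 2 <= w[t - 1] * (1 + tol)
        for t in range(1, T + 1)
    )
    cond10 = all(
        w[t] * step(t) ** 2 / (1.0 - beta * alpha(t) * step(t)) <= 1.0 / (4.0 * sigma**2)
        for t in range(0, T + 1)
    )
    cond11 = all(
        w[t - 1] * step(t - 1) / alpha(t - 1) * (1 + tol)
        >= w[t] * step(t) * (1.0 - alpha(t)) / alpha(t)
        for t in range(1, T)
    )
    sandwich = all(
        1.0 / (2.0 * C) <= w[t] * (1 + tol) and w[t] <= (1.0 / C) * (1 + tol)
        for t in range(T + 1)
    )
    return cond9, cond10, cond11, sandwich

print(check_conditions(sigma=1.0, beta=1.0, eta=0.25, T=50))
```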

We now turn our attention to the convergence. By Corollary 4.3, with probability at least $1-\delta$, we have

\[
\begin{aligned}
&\sum_{t=1}^{T}w_t\left(\frac{\eta_t}{\alpha_t}\left(f(y_t)-f(x^*)\right)-\frac{\eta_t(1-\alpha_t)}{\alpha_t}\left(f(y_{t-1})-f(x^*)\right)\right)+w_T\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq w_0\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\end{aligned}
\]

Grouping terms on the LHS and using that $\alpha_1=1$, we obtain

\[
\begin{aligned}
&\sum_{t=1}^{T-1}\left(w_t\frac{\eta_t}{\alpha_t}-w_{t+1}\frac{\eta_{t+1}(1-\alpha_{t+1})}{\alpha_{t+1}}\right)\left(f(y_t)-f(x^*)\right)+w_T\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+w_T\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq w_0\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\end{aligned}
\]

Since $\{w_t\}$ satisfies condition (11), the coefficient of $f(y_t)-f(x^*)$ is non-negative. Since $f(y_t)\geq f(x^*)$ as well, every term of the above sum is non-negative and we can drop the sum. We obtain

\[
w_T\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+w_T\mathbf{D}_{\psi}(x^*,z_T)\leq w_0\mathbf{D}_{\psi}(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}w_t\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\]

Using that $w_T=\frac{1}{2C}$ and $w_t\leq\frac{1}{C}$ for all $0\leq t\leq T-1$, we obtain

\[
\begin{aligned}
&\frac{1}{2C}\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+\frac{1}{2C}\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq\frac{1}{C}\mathbf{D}_{\psi}(x^*,z_0)+\frac{1}{C}\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+\ln\left(\frac{1}{\delta}\right)
\end{aligned}
\]

Thus

\[
\begin{aligned}
&\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq 2\mathbf{D}_{\psi}(x^*,z_0)+2\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+2C\ln\left(\frac{1}{\delta}\right)\\
&\quad=2\mathbf{D}_{\psi}(x^*,z_0)+2\left(G^2+3\sigma^2\right)\sum_{t=1}^{T}\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}+2\sigma^2\ln\left(\frac{1}{\delta}\right)\eta^2T(T+1)(2T+1)
\end{aligned}
\]

Using that $\beta\eta\leq\frac{1}{4}$ and $\frac{2t}{t+1}\leq 2$, we obtain

\[
\sum_{t=1}^{T}\frac{\eta_t^2}{1-\beta\alpha_t\eta_t}=\sum_{t=1}^{T}\frac{\eta^2t^2}{1-\beta\eta\frac{2t}{t+1}}\leq\sum_{t=1}^{T}2\eta^2t^2=\frac{1}{3}\eta^2T(T+1)(2T+1)
\]
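The bound on this sum can be checked with exact rational arithmetic. The short Python sketch below factors out $\eta^2$ and takes the extreme case $\beta\eta=\frac{1}{4}$; the function names and the range of $T$ tested are assumptions made for this illustration.

```python
from fractions import Fraction

def weighted_square_sum(beta_eta, T):
    """sum_{t=1}^T t^2 / (1 - beta*eta*2t/(t+1)), i.e. the sum with eta^2 factored out."""
    return sum(
        Fraction(t * t) / (1 - beta_eta * Fraction(2 * t, t + 1))
        for t in range(1, T + 1)
    )

def closed_form(T):
    """(1/3) T (T+1) (2T+1), which equals sum_{t=1}^T 2 t^2."""
    return Fraction(T * (T + 1) * (2 * T + 1), 3)

for T in (1, 2, 10, 50):
    print(T, float(weighted_square_sum(Fraction(1, 4), T)), float(closed_form(T)))
```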

Plugging in and using that $\eta_T=\eta T$ and $\alpha_T=\frac{2}{T+1}$, we obtain

\[
\begin{aligned}
&\eta\frac{T(T+1)}{2}\left(f(y_T)-f(x^*)\right)+\mathbf{D}_{\psi}(x^*,z_T)\\
&\quad\leq 2\mathbf{D}_{\psi}(x^*,z_0)+\left(\frac{2}{3}G^2+2\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta^2T(T+1)(2T+1)\\
&\quad\leq 2\mathbf{D}_{\psi}(x^*,z_0)+2\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta^2T(T+1)(2T+1)
\end{aligned}
\]

We can further simplify the bound by lower bounding $T(T+1)\geq T^2$ and upper bounding $T(T+1)(2T+1)\leq 6T^3$. We obtain

\[
\eta T^2\left(f(y_T)-f(x^*)\right)+\mathbf{D}_{\psi}(x^*,z_T)\leq 4\mathbf{D}_{\psi}(x^*,z_0)+24\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta^2T^3
\]

Thus we obtain

\[
f(y_T)-f(x^*)\leq\frac{4\mathbf{D}_{\psi}(x^*,z_0)}{\eta T^2}+24\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta T
\]

and

\[
\mathbf{D}_{\psi}(x^*,z_T)\leq 2\mathbf{D}_{\psi}(x^*,z_0)+12\left(G^2+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^2\right)\eta^2T^3
\]

∎

References

  • [1] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
  • [2] Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
  • [3] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
  • [4] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. Advances in Neural Information Processing Systems, 21, 2008.
  • [5] Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020.
  • [6] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
  • [7] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Appendix A Omitted Proofs

Proof.

(Lemma 2.3) Consider two cases: either $a\geq 1/(2\sigma)$ or $a\leq 1/(2\sigma)$. First suppose $a\geq 1/(2\sigma)$. Using the inequality $uv\leq\frac{u^2}{4}+v^2$, we obtain

\[
\begin{aligned}
\mathbb{E}\left[1+b^2X^2+\sum_{i=2}^{\infty}\frac{1}{i!}\left(aX+b^2X^2\right)^i\right] &\leq\mathbb{E}\left[1+b^2X^2+\sum_{i=2}^{\infty}\frac{1}{i!}\left(\frac{1}{4\sigma^2}X^2+a^2\sigma^2+b^2X^2\right)^i\right]\\
&=\mathbb{E}\left[b^2X^2+\exp\left(\left(\frac{1}{4\sigma^2}+b^2\right)X^2+a^2\sigma^2\right)-\left(\frac{1}{4\sigma^2}+b^2\right)X^2-a^2\sigma^2\right]\\
&=\mathbb{E}\left[\exp\left(\left(\frac{1}{4\sigma^2}+b^2\right)X^2+a^2\sigma^2\right)-\frac{1}{4\sigma^2}X^2-a^2\sigma^2\right]\\
&\leq\exp\left(\left(\frac{1}{4\sigma^2}+b^2\right)\sigma^2+a^2\sigma^2\right)\\
&\leq\exp\left(b^2\sigma^2+2a^2\sigma^2\right)
\end{aligned}
\]

Next, suppose $a\leq 1/(2\sigma)$ and let $c=\max(a,b)\leq 1/(2\sigma)$. We have

\[
\begin{aligned}
\mathbb{E}\left[1+b^2X^2+\sum_{i=2}^{\infty}\frac{1}{i!}\left(aX+b^2X^2\right)^i\right] &=\mathbb{E}\left[\exp\left(aX+b^2X^2\right)-aX\right]\\
&\leq\mathbb{E}\left[\left(aX+\exp\left(a^2X^2\right)\right)\exp\left(b^2X^2\right)-aX\right]\\
&=\mathbb{E}\left[\exp\left(\left(a^2+b^2\right)X^2\right)+aX\left(\exp\left(b^2X^2\right)-1\right)\right]\\
&\leq\mathbb{E}\left[\exp\left(\left(a^2+b^2\right)X^2\right)+cX\left(\exp\left(c^2X^2\right)-1\right)\right]\\
&\leq\mathbb{E}\left[\exp\left(\left(a^2+b^2\right)X^2\right)+\exp\left(2c^2X^2\right)-1\right]\\
&\leq\mathbb{E}\left[\exp\left(\left(a^2+b^2+2c^2\right)X^2\right)\right]\\
&\leq\exp\left(\left(a^2+b^2+2c^2\right)\sigma^2\right)
\end{aligned}
\]

In the first inequality, we use the inequality $e^x-x\leq e^{x^2}$ for all $x$. In the third inequality, we use $x\left(e^{x^2}-1\right)\leq e^{2x^2}-1$ for all $x$. This inequality can be proved with the Taylor expansion:

\[
\begin{aligned}
x\left(e^{x^2}-1\right) &=\sum_{i=1}^{\infty}\frac{1}{i!}x^{2i+1}\\
&\leq\sum_{i=1}^{\infty}\frac{1}{i!}\frac{x^{2i}+x^{2i+2}}{2}\\
&=\frac{x^2}{2}+\sum_{i=2}^{\infty}\left(\frac{1+i}{2i!}\right)x^{2i}\\
&\leq\frac{x^2}{2}+\sum_{i=2}^{\infty}\left(\frac{2^i}{i!}\right)x^{2i}\\
&\leq e^{2x^2}-1
\end{aligned}
\]

∎

Proof.

(Lemma (3.1)) By the optimality condition, we have

⟨ηtβ€‹βˆ‡^​f​(xt)+βˆ‡xπƒΟˆβ€‹(xt+1,xt),xβˆ—βˆ’xt+1⟩β‰₯0\left\langle\eta_{t}\widehat{\nabla}f(x_{t})+\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle\geq 0

and thus

⟨ηtβ€‹βˆ‡^​f​(xt),xt+1βˆ’xβˆ—βŸ©β‰€βŸ¨βˆ‡xπƒΟˆβ€‹(xt+1,xt),xβˆ—βˆ’xt+1⟩\left\langle\eta_{t}\widehat{\nabla}f(x_{t}),x_{t+1}-x^{*}\right\rangle\leq\left\langle\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle

Note that

βŸ¨βˆ‡xπƒΟˆβ€‹(xt+1,xt),xβˆ—βˆ’xt+1⟩\displaystyle\left\langle\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle =βŸ¨βˆ‡Οˆβ€‹(xt+1)βˆ’βˆ‡Οˆβ€‹(xt),xβˆ—βˆ’xt+1⟩\displaystyle=\left\langle\nabla\psi\left(x_{t+1}\right)-\nabla\psi\left(x_{t}\right),x^{*}-x_{t+1}\right\rangle
=πƒΟˆβ€‹(xβˆ—,xt)βˆ’πƒΟˆβ€‹(xt+1,xt)βˆ’πƒΟˆβ€‹(xβˆ—,xt+1)\displaystyle=\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)

and thus

Ξ·tβ€‹βŸ¨βˆ‡^​f​(xt),xt+1βˆ’xβˆ—βŸ©\displaystyle\eta_{t}\left\langle\widehat{\nabla}f(x_{t}),x_{t+1}-x^{*}\right\rangle β‰€πƒΟˆβ€‹(xβˆ—,xt)βˆ’πƒΟˆβ€‹(xβˆ—,xt+1)βˆ’πƒΟˆβ€‹(xt+1,xt)\displaystyle\leq\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)
β‰€πƒΟˆβ€‹(xβˆ—,xt)βˆ’πƒΟˆβ€‹(xβˆ—,xt+1)βˆ’12​‖xt+1βˆ’xtβ€–2\displaystyle\leq\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}

where we have used that $\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)\geq\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}$ by the strong convexity of $\psi$.
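The three-point identity used in the "Note that" step holds for any differentiable $\psi$, which can be checked numerically. The sketch below does so for the negative entropy $\psi(x)=\sum_{i}x_{i}\log x_{i}$ on random positive vectors. The choice of $\psi$, the dimension, the sampling range, and all function names are assumptions made for illustration; the sketch checks only the identity, not the strong convexity bound.

```python
import math
import random

def psi(x):
    # Negative entropy (illustrative choice of mirror map).
    return sum(v * math.log(v) for v in x)

def grad_psi(x):
    return [math.log(v) + 1.0 for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def bregman(x, y):
    # D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>
    return psi(x) - psi(y) - dot(grad_psi(y), sub(x, y))

def three_point_gap(x_star, x_next, x_cur):
    # | <grad psi(x_next) - grad psi(x_cur), x_star - x_next>
    #   - (D(x_star, x_cur) - D(x_next, x_cur) - D(x_star, x_next)) |
    left = dot(sub(grad_psi(x_next), grad_psi(x_cur)), sub(x_star, x_next))
    right = (bregman(x_star, x_cur) - bregman(x_next, x_cur)
             - bregman(x_star, x_next))
    return abs(left - right)

rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    # Three random points with strictly positive coordinates.
    pts = [[rng.uniform(0.1, 2.0) for _ in range(5)] for _ in range(3)]
    max_gap = max(max_gap, three_point_gap(*pts))

print("largest deviation from the identity:", max_gap)
```

The deviation should be at the level of floating-point rounding, since the identity is exact.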

By convexity,

\[
f\left(x_{t}\right)-f\left(x^{*}\right) \leq \left\langle \nabla f\left(x_{t}\right),\, x_{t}-x^{*}\right\rangle = \left\langle \xi_{t},\, x^{*}-x_{t}\right\rangle + \left\langle \widehat{\nabla}f\left(x_{t}\right),\, x_{t}-x^{*}\right\rangle
\]

Combining the two inequalities, we obtain

\begin{align*}
&\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right) \\
&\quad\leq \eta_{t}\left\langle \xi_{t},\, x^{*}-x_{t}\right\rangle + \eta_{t}\left\langle \widehat{\nabla}f(x_{t}),\, x_{t}-x_{t+1}\right\rangle - \frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2} \\
&\quad\leq \eta_{t}\left\langle \xi_{t},\, x^{*}-x_{t}\right\rangle + \frac{\eta_{t}^{2}}{2}\left\|\widehat{\nabla}f(x_{t})\right\|^{2}
\end{align*}

Using the triangle inequality and the bounded gradient assumption $\left\|\nabla f(x)\right\|\leq G$, we obtain

\[
\left\|\widehat{\nabla}f(x_{t})\right\|^{2} = \left\|\xi_{t}+\nabla f(x_{t})\right\|^{2} \leq 2\left\|\xi_{t}\right\|^{2}+2\left\|\nabla f(x_{t})\right\|^{2} \leq 2\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)
\]

Thus

\[
\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right) \leq \eta_{t}\left\langle \xi_{t},\, x^{*}-x_{t}\right\rangle + \eta_{t}^{2}\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)
\]

as needed. ∎
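The one-step inequality just proved can be spot-checked in the simplest setting. The sketch below assumes the Euclidean mirror map $\psi(x)=\frac{1}{2}\left\|x\right\|_{2}^{2}$ on an unconstrained domain, so that the update is $x_{t+1}=x_{t}-\eta_{t}\widehat{\nabla}f(x_{t})$, and the test function $f(x)=G\left\|x-c\right\|_{2}$ with minimizer $x^{*}=c$. The function, the Gaussian noise model, the step-size range, and all names are illustrative assumptions, not the paper's experimental setup.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def norm(u):
    return math.sqrt(dot(u, u))

G = 2.0                  # assumed gradient bound
c = [1.0, -2.0, 0.5]     # assumed minimizer x* of the test function

def f(x):
    # G-Lipschitz convex test function with f(c) = 0.
    return G * norm(sub(x, c))

def grad_f(x):
    # Gradient of f away from the minimizer (norm exactly G).
    d = sub(x, c)
    n = norm(d)
    return [G * v / n for v in d]

def bregman(x, y):
    # Bregman divergence of psi = 0.5 * ||.||_2^2.
    return 0.5 * dot(sub(x, y), sub(x, y))

def lemma_slack(x_t, xi, eta):
    # Right-hand side minus left-hand side of the one-step inequality.
    g_hat = [g + e for g, e in zip(grad_f(x_t), xi)]
    x_next = [a - eta * g for a, g in zip(x_t, g_hat)]
    left = eta * (f(x_t) - f(c)) + bregman(c, x_next) - bregman(c, x_t)
    right = eta * dot(xi, sub(c, x_t)) + eta ** 2 * (dot(xi, xi) + G ** 2)
    return right - left

rng = random.Random(1)
min_slack = float("inf")
for _ in range(2000):
    x_t = [rng.uniform(-5.0, 5.0) for _ in c]
    xi = [rng.gauss(0.0, 1.0) for _ in c]
    eta = rng.uniform(0.01, 1.0)
    min_slack = min(min_slack, lemma_slack(x_t, xi, eta))

print("smallest slack over random trials:", min_slack)
```

A nonnegative slack in every trial is consistent with the lemma; in this particular setting the slack works out to $\frac{\eta_{t}^{2}}{2}\left\|\xi_{t}-\nabla f(x_{t})\right\|^{2}$.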

Proof.

(Lemma 4.1) Starting with smoothness, we obtain

\begin{align*}
f\left(y_{t}\right) &\leq f\left(x_{t}\right)+\left\langle \nabla f\left(x_{t}\right),\, y_{t}-x_{t}\right\rangle + G\left\|y_{t}-x_{t}\right\| + \frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2} \\
&= f\left(x_{t}\right)+\left\langle \nabla f\left(x_{t}\right),\, y_{t-1}-x_{t}\right\rangle + \left\langle \nabla f\left(x_{t}\right),\, y_{t}-y_{t-1}\right\rangle + G\left\|y_{t}-x_{t}\right\| + \frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2} \\
&= \left(1-\alpha_{t}\right)\underbrace{\left(f\left(x_{t}\right)+\left\langle \nabla f\left(x_{t}\right),\, y_{t-1}-x_{t}\right\rangle\right)}_{\text{convexity}} + \alpha_{t}\left(f\left(x_{t}\right)+\left\langle \nabla f\left(x_{t}\right),\, y_{t-1}-x_{t}\right\rangle\right) \\
&\quad + \alpha_{t}\left\langle \nabla f\left(x_{t}\right),\, z_{t}-y_{t-1}\right\rangle + G\left\|y_{t}-x_{t}\right\| + \frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2} \\
&\leq \left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x_{t}\right)+\alpha_{t}\left\langle \nabla f\left(x_{t}\right),\, z_{t}-x_{t}\right\rangle + G\underbrace{\left\|y_{t}-x_{t}\right\|}_{=\alpha_{t}\left\|z_{t}-z_{t-1}\right\|} + \frac{\beta}{2}\underbrace{\left\|y_{t}-x_{t}\right\|^{2}}_{=\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}} \\
&= \left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x_{t}\right)+\alpha_{t}\left\langle \nabla f\left(x_{t}\right),\, z_{t}-x_{t}\right\rangle + G\alpha_{t}\left\|z_{t}-z_{t-1}\right\| + \frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}
\end{align*}

By the optimality condition for $z_{t}$,

\[
\eta_{t}\left\langle \widehat{\nabla}f(x_{t}),\, z_{t}-x^{*}\right\rangle \leq \left\langle \nabla_{x}\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right),\, x^{*}-z_{t}\right\rangle = \mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)-\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)
\]

Rearranging, we obtain

\[
\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right) \leq \eta_{t}\left\langle \widehat{\nabla}f\left(x_{t}\right),\, x^{*}-z_{t}\right\rangle = \eta_{t}\left\langle \nabla f\left(x_{t}\right)+\xi_{t},\, x^{*}-z_{t}\right\rangle
\]

By combining the two inequalities, we obtain

\begin{align*}
&f\left(y_{t}\right)+\frac{\alpha_{t}}{\eta_{t}}\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)\right) \\
&\quad\leq \left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}\underbrace{\left(f\left(x_{t}\right)+\left\langle \nabla f\left(x_{t}\right),\, x^{*}-x_{t}\right\rangle\right)}_{\text{convexity}} \\
&\qquad + G\alpha_{t}\left\|z_{t}-z_{t-1}\right\| + \frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2} + \alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t}\right\rangle \\
&\quad\leq \left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x^{*}\right) + G\alpha_{t}\left\|z_{t}-z_{t-1}\right\| + \frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2} + \alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t}\right\rangle
\end{align*}

Subtracting $f\left(x^{*}\right)$ from both sides, rearranging, and using that $\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)\geq\frac{1}{2}\left\|z_{t}-z_{t-1}\right\|^{2}$, we obtain

\begin{align*}
&f\left(y_{t}\right)-f\left(x^{*}\right)+\frac{\alpha_{t}}{\eta_{t}}\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\right) \\
&\quad\leq \left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t}\right\rangle + G\alpha_{t}\left\|z_{t}-z_{t-1}\right\| - \alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2} \\
&\quad= \left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t-1}\right\rangle + \alpha_{t}\left\langle \xi_{t},\, z_{t-1}-z_{t}\right\rangle \\
&\qquad + G\alpha_{t}\left\|z_{t}-z_{t-1}\right\| - \alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2} \\
&\quad\leq \left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t-1}\right\rangle + \alpha_{t}\left\|z_{t}-z_{t-1}\right\|\left(\left\|\xi_{t}\right\|+G\right) - \alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2} \\
&\quad\leq \left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle \xi_{t},\, x^{*}-z_{t-1}\right\rangle + \frac{\alpha_{t}\eta_{t}}{2\left(1-\beta\alpha_{t}\eta_{t}\right)}\left(\left\|\xi_{t}\right\|+G\right)^{2}
\end{align*}

Finally, we divide by $\frac{\alpha_{t}}{\eta_{t}}$ and obtain

\begin{align*}
&\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right) \\
&\quad\leq \frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\eta_{t}\left\langle \xi_{t},\, x^{*}-z_{t-1}\right\rangle + \frac{\eta_{t}^{2}}{2\left(1-\beta\alpha_{t}\eta_{t}\right)}\left(\left\|\xi_{t}\right\|+G\right)^{2} \\
&\quad\leq \frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\eta_{t}\left\langle \xi_{t},\, x^{*}-z_{t-1}\right\rangle + \frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)
\end{align*}

∎
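The last inequality before the division step in the proof above bounds a concave quadratic in $s=\left\|z_{t}-z_{t-1}\right\|$ by its maximum, which implicitly requires $1-\beta\alpha_{t}\eta_{t}>0$. The Python sketch below checks that scalar bound on random parameters. The sampling ranges and the names `quad_step_slack`, `noise` (standing in for $\left\|\xi_{t}\right\|$), and `c` are assumptions made for illustration only.

```python
import random

def quad_step_slack(s, noise, G, alpha, eta, beta):
    # c = 1 - beta * alpha * eta must be positive for the bound to hold.
    c = 1.0 - beta * alpha * eta
    # alpha * s * (||xi|| + G) - alpha * c / (2 eta) * s^2
    left = alpha * s * (noise + G) - alpha * c / (2.0 * eta) * s * s
    # alpha * eta / (2 c) * (||xi|| + G)^2
    right = alpha * eta / (2.0 * c) * (noise + G) ** 2
    return right - left

rng = random.Random(2)
min_slack = float("inf")
for _ in range(5000):
    beta = rng.uniform(0.1, 10.0)
    alpha = rng.uniform(0.01, 1.0)
    # Choose eta so that beta * alpha * eta lies strictly inside (0, 1).
    eta = rng.uniform(0.01, 0.99) / (beta * alpha)
    s = rng.uniform(0.0, 10.0)
    noise = rng.uniform(0.0, 5.0)
    G = rng.uniform(0.0, 5.0)
    min_slack = min(min_slack, quad_step_slack(s, noise, G, alpha, eta, beta))

print("smallest slack over random trials:", min_slack)
```

The slack equals $\frac{\alpha_{t}\left(1-\beta\alpha_{t}\eta_{t}\right)}{2\eta_{t}}\left(s-\frac{\eta_{t}\left(\left\|\xi_{t}\right\|+G\right)}{1-\beta\alpha_{t}\eta_{t}}\right)^{2}$, so it is nonnegative exactly when $1-\beta\alpha_{t}\eta_{t}>0$.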