High Probability Convergence for Accelerated Stochastic Mirror Descent

Alina Ene Department of Computer Science, Boston University.

aene@bu.edu

Huy L. Nguyen Khoury College of Computer and Information Science, Northeastern University.

hu.nguyen@northeastern.edu

Abstract

In this work, we describe a generic approach to show convergence with high probability for stochastic convex optimization. In previous works, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution as opposed to the domain diameter. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations.

1 Introduction

Stochastic convex optimization is a well-studied area with numerous applications in algorithms, machine learning, and beyond. Various algorithms have been shown to converge for many classes of functions including Lipschitz functions, smooth functions, and their linear combinations. However, one curious gap remains in the understanding of their convergence with high probability compared with convergence in expectation. Classical results show that in expectation, the function value gap of the final solution is proportional to the distance between the original solution and the optimal solution. On the other hand, classical results for convergence with high probability could only show that the function value gap of the final solution is proportional to the diameter of the domain, which could be much larger or even unbounded. In this work, we bridge this gap and establish a generic approach to show convergence with high probability where the final function value gap is proportional to the distance between the original solution and the optimal solution. We instantiate our approach in two settings, stochastic mirror descent and stochastic accelerated gradient descent. The results are analogous to known results for convergence in expectation but now with high probability. The algorithms are universal for both Lipschitz functions and smooth functions.

The proof technique is inspired by classical works in concentration inequalities, specifically a type of martingale inequality where the variance of the martingale difference is bounded by a linear function of the previous value. This technique was first applied to showing high probability convergence by Harvey et al. [2]. Our proof is inspired by the proof of Theorem 7.3 by Chung and Lu [1]. In each time step with iterate $x_{t}$ , let $\xi_{t}:=\widehat{\nabla}f\left(x_{t}\right)-\nabla f\left(x_{t}\right)$ be the error in our gradient estimate. Classical proofs of convergence revolve around analyzing the sum of $\left\langle\xi_{t},x^{*}-x_{t}\right\rangle$ , which can be viewed as a martingale sequence. Assuming a bounded domain, the concentration of the sum can be shown via classical martingale inequalities. The key new insight is that instead of analyzing this sum, we analyze a related sum where the coefficients decrease over time to account for the fact that we have a looser grip on the distance to the optimal solution as time increases. Nonetheless, the coefficients are kept within a constant factor of each other and the same asymptotic convergence is attained with high probability.

Related work

Lan [5] establishes high probability bounds for the general setting of stochastic mirror descent and accelerated stochastic mirror descent under the assumption that the stochastic noise is subgaussian. The rates shown in [5] match the best rates known in expectation, but they depend on the Bregman diameter $\max_{x,y\in\mathcal{X}}\mathbf{D}_{\psi}\left(x,y\right)$ of the domain, which can be unbounded. Our work complements the analysis of [5] with a novel concentration argument that allows us to establish convergence with respect to the distance $\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)$ from the initial point. Our analysis applies to the general setting considered in [5] and we use the same subgaussian assumption on the stochastic noise.

The algorithms and step sizes we consider capture the stochastic gradient descent algorithms with the standard setting of the step sizes for both smooth and non-smooth problems. The high-probability convergence of SGD is studied in the works [4, 6, 3, 2]. These works either assume that the function is strongly convex or the domain is compact. In contrast, our work applies to non-strongly convex optimization with a general domain.

2 Preliminaries

We consider the problem $\min_{x\in\mathcal{X}}f(x)$ where $f\colon\mathbb{R}^{d}\to\mathbb{R}$ is a convex function and $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex domain. We consider the general setting where $f$ is potentially not strongly convex and the domain $\mathcal{X}$ is not necessarily compact.

We assume we have access to a stochastic gradient oracle that returns a stochastic gradient $\widehat{\nabla}f(x)$ that satisfies the following two assumptions for any prior history:

1. Unbiased estimator: $\mathbb{E}\left[\widehat{\nabla}f\left(x\right)|x\right]=\nabla f\left(x\right)$ .

2. Sub-Gaussian noise: $\left\|\widehat{\nabla}f\left(x\right)-\nabla f\left(x\right)\right\|$ is a $\sigma$ -subgaussian random variable (Definition 2.1).

There are several equivalent definitions of subgaussian random variables up to an absolute constant scaling (see, e.g., Proposition 2.5.2 in [7]). For convenience, we use the following property as the definition.

Definition 2.1.

A random variable $X$ is $\sigma$ -subgaussian if

\mathbb{E}\left[\exp\left(\lambda^{2}X^{2}\right)\right]\leq\exp\left(\lambda^{2}\sigma^{2}\right)\text{ for all }\lambda\text{ such that }\left|\lambda\right|\leq\frac{1}{\sigma}

The above definition is equivalent to the following property, see Proposition 2.5.2 in [7].

Lemma 2.2.

(Proposition 2.5.2 in [7]) Let $X$ be a $\sigma$ -subgaussian random variable. Then

\mathbb{E}\left[\exp\left(\frac{X^{2}}{\sigma^{2}}\right)\right]\leq\exp\left(1\right)
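
As a sanity check of Definition 2.1 (an illustration that is not part of the original analysis), consider the half-normal variable $X=\left|g\right|$ with $g\sim N\left(0,s^{2}\right)$ , for which $\mathbb{E}\left[\exp\left(\lambda^{2}X^{2}\right)\right]=\left(1-2\lambda^{2}s^{2}\right)^{-1/2}$ whenever $2\lambda^{2}s^{2}<1$ . The Python sketch below checks on a grid of $\lambda$ that this $X$ is $\sigma$ -subgaussian when $\sigma^{2}=\frac{8}{3}s^{2}$ . The function names and the choice $\sigma^{2}=\frac{8}{3}s^{2}$ are assumptions made only for this example.

```python
import math

def mgf_sq_halfnormal(lam: float, s: float) -> float:
    """E[exp(lam^2 X^2)] for X = |N(0, s^2)|; finite only if 2 lam^2 s^2 < 1."""
    u = 2.0 * lam**2 * s**2
    if u >= 1.0:
        raise ValueError("the expectation is infinite for this lambda")
    return 1.0 / math.sqrt(1.0 - u)

def is_subgaussian(s: float, sigma: float, grid: int = 1000) -> bool:
    """Check the inequality of Definition 2.1 on a grid of lambda in [0, 1/sigma]."""
    for k in range(grid + 1):
        lam = (k / grid) / sigma
        if 2.0 * lam**2 * s**2 >= 1.0:
            return False  # the left-hand side is infinite
        lhs = mgf_sq_halfnormal(lam, s)
        rhs = math.exp(lam**2 * sigma**2)
        if lhs > rhs * (1.0 + 1e-12):
            return False
    return True

s = 1.0
sigma = math.sqrt(8.0 / 3.0) * s  # assumed choice of sigma for this example
ok = is_subgaussian(s, sigma)
# At lambda = 1/sigma this is the quantity bounded by exp(1) in Lemma 2.2.
value_at_endpoint = mgf_sq_halfnormal(1.0 / sigma, s)
```

With these values, the expectation at $\lambda=\frac{1}{\sigma}$ equals $2\leq\exp\left(1\right)$ , in agreement with Lemma 2.2, while taking $\sigma=s$ fails the check because the expectation is infinite at $\lambda=\frac{1}{\sigma}$ .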

We will also use the following helper lemma whose proof we defer to the Appendix.

Lemma 2.3.

For any $a\geq 0$ , $0\leq b\leq\frac{1}{2\sigma}$ and a nonnegative $\sigma$ -subgaussian random variable $X$ ,

\mathbb{E}\left[1+b^{2}X^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(aX+b^{2}X^{2}\right)^{i}\right]\leq\exp\left(3\left(a^{2}+b^{2}\right)\sigma^{2}\right)

3 Analysis of Stochastic Mirror Descent

Algorithm 1 Stochastic Mirror Descent Algorithm.

$\psi\colon\mathbb{R}^{d}\to\mathbb{R}$ is a strongly convex mirror map. $\mathbf{D}_{\psi}\left(x,y\right)=\psi\left(x\right)-\psi\left(y\right)-\left\langle\nabla\psi\left(y\right),x-y\right\rangle$ is the Bregman divergence of $\psi$ .

Parameters: initial point $x_{1}$ , step sizes $\left\{\eta_{t}\right\}$

for $t=1$ to $T$ :

$x_{t+1}=\arg\min_{x\in\mathcal{X}}\left\{\eta_{t}\left\langle\widehat{\nabla}f\left(x_{t}\right),x\right\rangle+\mathbf{D}_{\psi}\left(x,x_{t}\right)\right\}$

return $\frac{1}{T}\sum_{t=1}^{T}x_{t}$

In this section, we analyze the Stochastic Mirror Descent algorithm (Algorithm 1). For simplicity, here we consider the non-smooth setting, and assume that $f$ is $G$ -Lipschitz continuous, i.e., we have $\left\|\nabla f(x)\right\|\leq G$ for all $x\in\mathcal{X}$ . The analysis for the smooth setting follows via a simple modification to the analysis presented here as well as the analysis for the accelerated setting given in the next section.
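
To make the update concrete, the following Python sketch instantiates Algorithm 1 under assumptions that are ours and not the paper's: the mirror map is $\psi\left(x\right)=\frac{1}{2}\left\|x\right\|^{2}$ , so the arg min step reduces to a projected gradient step, the domain is all of $\mathbb{R}^{d}$ by default, and the oracle adds isotropic Gaussian noise. The names `smd_euclidean`, `project`, and `noise_std` are hypothetical.

```python
import numpy as np

def smd_euclidean(grad, x1, etas, noise_std, rng, project=lambda x: x):
    """Sketch of Algorithm 1 with psi(x) = 0.5 * ||x||^2 (assumed mirror map).

    grad      : function returning a (sub)gradient of f at x
    etas      : sequence of step sizes eta_1, ..., eta_T
    noise_std : scale of the Gaussian noise added by the simulated oracle
    project   : Euclidean projection onto the domain (identity by default)
    Returns the average of the iterates x_1, ..., x_T.
    """
    x = np.asarray(x1, dtype=float).copy()
    iterate_sum = np.zeros_like(x)
    for eta in etas:
        iterate_sum += x  # accumulate x_t before it is updated
        g_hat = grad(x) + noise_std * rng.standard_normal(x.shape)
        # For the Euclidean mirror map the arg min is a projected gradient step.
        x = project(x - eta * g_hat)
    return iterate_sum / len(etas)

def f(x):
    """Example objective f(x) = ||x||_1, minimized at x* = 0."""
    return float(np.abs(x).sum())

rng = np.random.default_rng(0)
T = 2000
x_start = np.ones(5)
# np.sign is a valid subgradient of the l1 norm.
x_avg = smd_euclidean(np.sign, x_start, [0.02] * T, 0.1, rng)
print(f(x_start), f(x_avg))
```

Other mirror maps, for instance the negative entropy on the simplex, would replace the projected step by the corresponding Bregman projection.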

We define

\xi_{t}:=\widehat{\nabla}f\left(x_{t}\right)-\nabla f\left(x_{t}\right)

We let $\mathcal{F}_{t}=\sigma\left(\xi_{1},\dots,\xi_{t-1}\right)$ denote the natural filtration. Note that $x_{t}$ is $\mathcal{F}_{t}$ -measurable.

The starting point of our analysis is the following inequality that follows from the standard stochastic mirror descent analysis (see, e.g., [5]). We include the proof in the Appendix for completeness.

Lemma 3.1.

([5])For every iteration $t$ , we have

\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)-\eta_{t}^{2}G^{2}+\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\leq\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle+\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}

We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function for a random variable that is closely related to the left-hand side of the inequality above. We let $w_{1}\geq w_{2}\geq\dots\geq w_{T}\geq w_{T+1}\geq 0$ be a non-increasing sequence where $w_{t}\in\mathbb{R}$ for all $t$ . We define

	$\displaystyle Z_{t}$	$\displaystyle=w_{t+1}\left(\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)-\eta_{t}^{2}G^{2}\right)+w_{T+1}\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)$	$\displaystyle\forall 1\leq t\leq T$
	$\displaystyle S_{t}$	$\displaystyle=\sum_{i=t}^{T}Z_{i}$	$\displaystyle\forall 1\leq t\leq T+1$

Before proceeding with the analysis, we provide intuition for our approach. If we consider $S_{1}$ , we see that it combines the gains in function value gaps with weights given by the non-increasing sequence $\left\{w_{t}\right\}$ . The intuition here is that we want to leverage the progress in function value to absorb the error from the stochastic error terms on the RHS of Lemma 3.1. For the divergence terms, we use the same coefficient to allow for the terms to telescope. In Theorem 3.2, we upper bound the moment generating function of $S_{1}$ and derive a set of conditions for the weights $\left\{w_{t}\right\}$ that allow us to absorb the stochastic errors. In Corollary 3.3, we show how to choose the weights $\left\{w_{t}\right\}$ and obtain a convergence rate that matches the standard rates that hold in expectation.

We now give our main concentration argument that bounds the moment generating function of $S_{t}$ . The proof of the following theorem is inspired by the proof of Theorem 7.3 in [1].

Theorem 3.2.

Suppose that $w_{t}\geq w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$ and $w_{t+1}\eta_{t}^{2}\leq\frac{1}{4\sigma^{2}}$ for every $1\leq t\leq T$ . For every $1\leq t\leq T+1$ , we have

\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]\leq\exp\left(\left(w_{t}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+3\sigma^{2}\sum_{i=t}^{T}w_{i+1}\eta_{i}^{2}\right)

Proof.

We proceed by induction on $t$ . Consider the base case $t=T+1$ . We have $S_{t}=0$ and $\left(w_{t}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)=0$ , and the inequality follows. Next, we consider $1\leq t\leq T$ . We have

\displaystyle\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]

\displaystyle=\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t}\right]=\mathbb{E}\left[\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]|\mathcal{F}_{t}\right]

(1)

We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t+1}$ , $Z_{t}$ is fixed. Using the inductive hypothesis, we obtain

\displaystyle\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]\leq\exp\left(Z_{t}\right)\exp\left(\left(w_{t+1}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)+3\sigma^{2}\sum_{i=t+1}^{T}w_{i+1}\eta_{i}^{2}\right)

(2)

Let $X_{t}=\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle$ . By Lemma 3.1, we have

\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)-\eta_{t}^{2}G^{2}\leq X_{t}-\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)+\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}

and thus

	$\displaystyle Z_{t}$	$\displaystyle=w_{t+1}\left(\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)-\eta_{t}^{2}G^{2}\right)+w_{T+1}\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)$
		$\displaystyle\leq w_{t+1}\left(X_{t}-\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)+\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)+w_{T+1}\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)$
		$\displaystyle=w_{t+1}X_{t}-\left(w_{t+1}-w_{T+1}\right)\left(\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\right)+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}$

Plugging into (2), we obtain

\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]\leq\exp\left(w_{t+1}X_{t}+\left(w_{t+1}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}+3\sigma^{2}\sum_{i=t+1}^{T}w_{i+1}\eta_{i}^{2}\right)

Plugging into (1), we obtain

\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]\leq\exp\left(\left(w_{t+1}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+3\sigma^{2}\sum_{i=t+1}^{T}w_{i+1}\eta_{i}^{2}\right)\mathbb{E}\left[\exp\left(w_{t+1}X_{t}+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)|\mathcal{F}_{t}\right]

(3)

Next, we analyze the expectation on the RHS of the above inequality. We have

	$\displaystyle\mathbb{E}\left[\exp\left(w_{t+1}X_{t}+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)|\mathcal{F}_{t}\right]$
	$\displaystyle=\mathbb{E}\left[\sum_{i=0}^{\infty}\frac{1}{i!}\left(w_{t+1}X_{t}+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle=\mathbb{E}\left[1+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t+1}X_{t}+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle\leq\mathbb{E}\left[1+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t+1}\eta_{t}\left\|x^{*}-x_{t}\right\|\left\|\xi_{t}\right\|+w_{t+1}\eta_{t}^{2}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle\leq\exp\left(3\sigma^{2}\left(w_{t+1}^{2}\eta_{t}^{2}\left\|x^{*}-x_{t}\right\|^{2}+w_{t+1}\eta_{t}^{2}\right)\right)$
	$\displaystyle\leq\exp\left(3\sigma^{2}\left(2w_{t+1}^{2}\eta_{t}^{2}\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+w_{t+1}\eta_{t}^{2}\right)\right)$		(4)

On the first line we used the Taylor expansion of $e^{x}$ , and on the second line we used that $\mathbb{E}\left[X_{t}|\mathcal{F}_{t}\right]=0$ . On the third line, we used Cauchy-Schwarz and obtained

X_{t}=\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle\leq\eta_{t}\left\|\xi_{t}\right\|\left\|x^{*}-x_{t}\right\|

On the fourth line, we applied Lemma 2.3 with $X=\left\|\xi_{t}\right\|$ , $a=w_{t+1}\eta_{t}\left\|x^{*}-x_{t}\right\|$ , and $b^{2}=w_{t+1}\eta_{t}^{2}\leq\frac{1}{4\sigma^{2}}$ . On the fifth line, we used that $\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\geq\frac{1}{2}\left\|x^{*}-x_{t}\right\|^{2}$ , which follows from the strong convexity of $\psi$ .

Plugging (4) into (3) and using that $w_{t}\geq w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$ , we obtain

	$\displaystyle\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]$	$\displaystyle\leq\exp\left(\left(w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+3\sigma^{2}\sum_{i=t}^{T}w_{i+1}\eta_{i}^{2}\right)$
		$\displaystyle\leq\exp\left(\left(w_{t}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)+3\sigma^{2}\sum_{i=t}^{T}w_{i+1}\eta_{i}^{2}\right)$

as needed. ∎

Theorem 3.2 and Markov’s inequality give us the following convergence guarantee.

Corollary 3.3.

Suppose the sequence $\left\{w_{t}\right\}$ satisfies the conditions of Theorem 3.2. For any $\delta>0$ , the following event holds with probability at least $1-\delta$ :

	$\displaystyle\sum_{t=1}^{T}w_{t+1}\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+w_{T+1}\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)$
	$\displaystyle\leq w_{1}\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)$

Proof.

Let

K=\left(w_{1}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+3\sigma^{2}\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)

By Theorem 3.2 and Markov’s inequality, we have

	$\displaystyle\Pr\left[S_{1}\geq K\right]$	$\displaystyle\leq\Pr\left[\exp\left(S_{1}\right)\geq\exp\left(K\right)\right]$
		$\displaystyle\leq\exp\left(-K\right)\mathbb{E}\left[\exp\left(S_{1}\right)\right]$
		$\displaystyle\leq\exp\left(-K\right)\exp\left(\left(w_{1}-w_{T+1}\right)\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+3\sigma^{2}\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}\right)$
		$\displaystyle=\delta$

Note that

\displaystyle S_{1}

\displaystyle=\sum_{t=1}^{T}Z_{t}=\sum_{t=1}^{T}w_{t+1}\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)-G^{2}\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+w_{T+1}\left(\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)\right)

Therefore, with probability at least $1-\delta$ , we have

\sum_{t=1}^{T}w_{t+1}\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+w_{T+1}\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq w_{1}\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)

∎

With the above result in hand, we complete the convergence analysis by showing how to define the sequence $\left\{w_{t}\right\}$ with the desired properties.

Corollary 3.4.

Suppose we run the Stochastic Mirror Descent algorithm with fixed step sizes $\eta_{t}=\eta$ . Let $w_{T+1}=\frac{1}{12\sigma^{2}\eta^{2}\left(T+1\right)}$ and $w_{t}=w_{t+1}+6\sigma^{2}\eta^{2}w_{t+1}^{2}$ for all $1\leq t\leq T$ . The sequence $\left\{w_{t}\right\}$ satisfies the conditions required by Corollary 3.3. By Corollary 3.3, for any $\delta>0$ , the following events hold with probability at least $1-\delta$ :

\frac{1}{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)\leq O\left(\frac{\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)}{\eta T}+\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)\eta\right)

and

\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq O\left(\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T\right)

Setting $\eta=\sqrt{\frac{\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)}{\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)T}}$ to balance the two terms in the first inequality gives

\frac{1}{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)\leq O\left(\sqrt{\frac{\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)}{T}}\right)

and

\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq O\left(\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)\right)

Proof.

Recall from Corollary 3.3 that the sequence $\left\{w_{t}\right\}$ needs to satisfy the following conditions for all $1\leq t\leq T$ :

	$\displaystyle w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$	$\displaystyle\leq w_{t}$
	$\displaystyle w_{t+1}\eta_{t}^{2}$	$\displaystyle\leq\frac{1}{4\sigma^{2}}$

Let $C=6\sigma^{2}\eta^{2}\left(T+1\right)$ . We set $w_{T+1}=\frac{1}{C+6\sigma^{2}\eta^{2}\left(T+1\right)}=\frac{1}{2C}$ . For $1\leq t\leq T$ , we set $w_{t}$ so that the first condition holds with equality

w_{t}=w_{t+1}+6\sigma^{2}w_{t+1}^{2}\eta_{t}^{2}=w_{t+1}+6\sigma^{2}\eta^{2}w_{t+1}^{2}

We can show by induction that, for every $1\leq t\leq T+1$ , we have

w_{t}\leq\frac{1}{C+6\sigma^{2}\eta^{2}t}

The base case $t=T+1$ follows from the definition of $w_{T+1}$ . Consider $1\leq t\leq T$ . Using the definition of $w_{t}$ and the inductive hypothesis, we obtain

	$\displaystyle w_{t}$	$\displaystyle=w_{t+1}+6\sigma^{2}\eta^{2}w_{t+1}^{2}$
		$\displaystyle\leq\frac{1}{C+6\sigma^{2}\eta^{2}\left(t+1\right)}+\frac{6\sigma^{2}\eta^{2}}{\left(C+6\sigma^{2}\eta^{2}\left(t+1\right)\right)^{2}}$
		$\displaystyle\leq\frac{1}{C+6\sigma^{2}\eta^{2}\left(t+1\right)}+\frac{\left(C+6\sigma^{2}\eta^{2}\left(t+1\right)\right)-\left(C+6\sigma^{2}\eta^{2}t\right)}{\left(C+6\sigma^{2}\eta^{2}\left(t+1\right)\right)\left(C+6\sigma^{2}\eta^{2}t\right)}$
		$\displaystyle=\frac{1}{C+6\sigma^{2}\eta^{2}t}$

as needed.

Using this fact, we now show that $\left\{w_{t}\right\}$ satisfies the second condition. For every $1\leq t\leq T$ , we have

w_{t+1}\eta_{t}^{2}=w_{t+1}\eta^{2}\leq\frac{\eta^{2}}{C}=\frac{1}{6\sigma^{2}\left(T+1\right)}\leq\frac{1}{6\sigma^{2}}

as needed.

Thus, by Corollary 3.3, with probability $\geq 1-\delta$ , we have

\displaystyle\sum_{t=1}^{T}w_{t+1}\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+w_{T+1}\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)

\displaystyle\leq w_{1}\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)

Note that $w_{T+1}=\frac{1}{2C}$ and $\frac{1}{2C}\leq w_{t}\leq\frac{1}{C}$ for all $1\leq t\leq T+1$ . Thus we obtain

	$\displaystyle\eta\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)$	$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+2\left(G^{2}+3\sigma^{2}\right)\eta^{2}T+2C\ln\left(\frac{1}{\delta}\right)$
		$\displaystyle=2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+2\left(G^{2}+3\sigma^{2}\right)\eta^{2}T+12\sigma^{2}\ln\left(\frac{1}{\delta}\right)\eta^{2}\left(T+1\right)$
		$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(2G^{2}+6\sigma^{2}\left(1+4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T$

Thus we have

\frac{1}{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)\leq\frac{2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)}{\eta T}+\left(2G^{2}+6\sigma^{2}\left(1+4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta

and

\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq 2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(2G^{2}+6\sigma^{2}\left(1+4\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}T

∎
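
The construction in the proof above can be checked numerically. The Python sketch below builds the weights by the backward recurrence from Corollary 3.4 and verifies, for one assumed choice of $\sigma$ , $\eta$ , and $T$ , the bound $w_{t}\leq\frac{1}{C+6\sigma^{2}\eta^{2}t}$ , the sandwich $\frac{1}{2C}\leq w_{t}\leq\frac{1}{C}$ , and the second condition of Theorem 3.2. The function names and the small floating-point tolerance are ours.

```python
def fixed_step_weights(sigma, eta, T):
    """Weights from Corollary 3.4; returns (C, w) with w[t] for t = 1..T+1."""
    c_step = 6.0 * sigma**2 * eta**2
    C = c_step * (T + 1)
    w = [0.0] * (T + 2)  # index 0 is unused so that w[t] matches w_t
    w[T + 1] = 1.0 / (2.0 * C)
    for t in range(T, 0, -1):
        # First condition of Theorem 3.2, imposed with equality.
        w[t] = w[t + 1] + c_step * w[t + 1] ** 2
    return C, w

def conditions_hold(sigma, eta, T, tol=1e-9):
    """Check the inductive bound, the sandwich, and the second condition."""
    C, w = fixed_step_weights(sigma, eta, T)
    c_step = 6.0 * sigma**2 * eta**2
    for t in range(1, T + 2):
        if w[t] > (1.0 + tol) / (C + c_step * t):
            return False
        if w[t] < (1.0 - tol) / (2.0 * C) or w[t] > (1.0 + tol) / C:
            return False
    for t in range(1, T + 1):
        if w[t + 1] * eta**2 > 1.0 / (4.0 * sigma**2):
            return False
    return True

# Assumed example values, chosen only for illustration.
all_ok = conditions_hold(sigma=2.0, eta=0.1, T=1000)
print(all_ok)
```

Because the recurrence runs backward from $w_{T+1}$ , the weights are non-increasing in $t$ by construction, as required of the sequence $\left\{w_{t}\right\}$ .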

The analysis readily extends to the setting where the time horizon $T$ is not known and we set time-varying step sizes. We include below the analysis for the well-studied step sizes $\eta_{t}=\frac{\eta}{\sqrt{t}}$ .

Corollary 3.5.

Suppose we run the Stochastic Mirror Descent algorithm with time-varying step sizes $\eta_{t}=\frac{\eta}{\sqrt{t}}$ . Let $w_{T+1}=\frac{1}{12\sigma^{2}\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)}$ and $w_{t}=w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$ for all $1\leq t\leq T$ . The sequence $\left\{w_{t}\right\}$ satisfies the conditions required by Corollary 3.3. By Corollary 3.3, for any $\delta>0$ , the following events hold with probability at least $1-\delta$ :

\frac{1}{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)\leq O\left(\frac{1}{\sqrt{T}}\left(\frac{\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)}{\eta}+\eta\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)\ln T\right)\right)

and

\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq O\left(\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\eta^{2}\left(G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)\right)\ln T\right)

Proof.

Recall from Corollary 3.3 that the sequence $\left\{w_{t}\right\}$ needs to satisfy the following conditions for all $1\leq t\leq T$ :

	$\displaystyle w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$	$\displaystyle\leq w_{t}$
	$\displaystyle w_{t+1}\eta_{t}^{2}$	$\displaystyle\leq\frac{1}{4\sigma^{2}}$

Let $B_{t}=6\sigma^{2}\sum_{i=1}^{t-1}\eta_{i}^{2}$ and $C=B_{T+1}=6\sigma^{2}\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)$ . We set $w_{T+1}=\frac{1}{C+B_{T+1}}$ . For $1\leq t\leq T$ , we set $w_{t}$ so that the first condition holds with equality

w_{t}=w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}

We can show by induction that, for every $1\leq t\leq T+1$ , we have

w_{t}\leq\frac{1}{C+B_{t}}

The base case $t=T+1$ follows from the definition of $w_{T+1}$ . Consider $1\leq t\leq T$ . Using the definition of $w_{t}$ and the inductive hypothesis, we obtain

	$\displaystyle w_{t}$	$\displaystyle=w_{t+1}+6\sigma^{2}\eta_{t}^{2}w_{t+1}^{2}$
		$\displaystyle\leq\frac{1}{C+B_{t+1}}+\frac{6\sigma^{2}\eta_{t}^{2}}{\left(C+B_{t+1}\right)^{2}}$
		$\displaystyle\leq\frac{1}{C+B_{t+1}}+\frac{\left(C+B_{t+1}\right)-\left(C+B_{t}\right)}{\left(C+B_{t+1}\right)\left(C+B_{t}\right)}$
		$\displaystyle=\frac{1}{C+B_{t}}$

as needed.

Using this fact, we now show that $\left\{w_{t}\right\}$ satisfies the second condition. For every $1\leq t\leq T$ , we have

w_{t+1}\eta_{t}^{2}\leq\frac{\eta_{t}^{2}}{C}=\frac{1}{t\left(6\sigma^{2}\sum_{i=1}^{T}\frac{1}{i}\right)}\leq\frac{1}{6\sigma^{2}}

as needed.

Thus, by Corollary 3.3, with probability $\geq 1-\delta$ , we have

\displaystyle\sum_{t=1}^{T}w_{t+1}\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+w_{T+1}\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)

\displaystyle\leq w_{1}\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t+1}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)

Note that $w_{T+1}=\frac{1}{2C}$ and $\frac{1}{2C}\leq w_{t}\leq\frac{1}{C}$ for all $1\leq t\leq T+1$ . Thus we obtain

\frac{1}{2C}\eta_{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\frac{1}{2C}\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq\frac{1}{C}\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(G^{2}+3\sigma^{2}\right)\frac{1}{C}\sum_{t=1}^{T}\eta_{t}^{2}+\ln\left(\frac{1}{\delta}\right)

Plugging in $\eta_{t}=\frac{\eta}{\sqrt{t}}$ and simplifying, we obtain

	$\displaystyle\frac{\eta}{\sqrt{T}}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)$	$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(2G^{2}+6\sigma^{2}\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)+2C\ln\left(\frac{1}{\delta}\right)$
		$\displaystyle=2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(2G^{2}+6\sigma^{2}\left(1+2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)$

Thus we have

\frac{1}{T}\sum_{t=1}^{T}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)\leq\frac{1}{\sqrt{T}}\left(\frac{2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)}{\eta}+\left(2G^{2}+6\sigma^{2}\left(1+2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta\left(\sum_{t=1}^{T}\frac{1}{t}\right)\right)

and

\mathbf{D}_{\psi}\left(x^{*},x_{T+1}\right)\leq 2\mathbf{D}_{\psi}\left(x^{*},x_{1}\right)+\left(2G^{2}+6\sigma^{2}\left(1+2\ln\left(\frac{1}{\delta}\right)\right)\right)\eta^{2}\left(\sum_{t=1}^{T}\frac{1}{t}\right)

∎
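
The time-varying construction can be checked in the same spirit. The Python sketch below computes $B_{t}$ , $C=B_{T+1}$ , and the weights for $\eta_{t}=\frac{\eta}{\sqrt{t}}$ , then verifies the inductive bound $w_{t}\leq\frac{1}{C+B_{t}}$ , the lower bound $w_{t}\geq\frac{1}{2C}$ , and the second condition of Theorem 3.2. The parameter values, function names, and floating-point tolerance are assumptions made for illustration.

```python
def varying_step_weights(sigma, eta, T):
    """Weights from Corollary 3.5 for eta_t = eta / sqrt(t).

    Returns (C, B, w, eta_sq) where B[t], w[t], eta_sq[t] are indexed from 1.
    """
    eta_sq = [0.0] + [eta**2 / t for t in range(1, T + 1)]  # eta_sq[t] = eta_t^2
    B = [0.0] * (T + 2)  # B[t] = 6 sigma^2 * sum_{i < t} eta_i^2, so B[1] = 0
    for t in range(1, T + 1):
        B[t + 1] = B[t] + 6.0 * sigma**2 * eta_sq[t]
    C = B[T + 1]
    w = [0.0] * (T + 2)
    w[T + 1] = 1.0 / (C + B[T + 1])  # equals 1 / (2C)
    for t in range(T, 0, -1):
        # First condition of Theorem 3.2, imposed with equality.
        w[t] = w[t + 1] + 6.0 * sigma**2 * eta_sq[t] * w[t + 1] ** 2
    return C, B, w, eta_sq

def varying_conditions_hold(sigma, eta, T, tol=1e-9):
    C, B, w, eta_sq = varying_step_weights(sigma, eta, T)
    for t in range(1, T + 2):
        if w[t] > (1.0 + tol) / (C + B[t]):
            return False
        if w[t] < (1.0 - tol) / (2.0 * C):
            return False
    for t in range(1, T + 1):
        if w[t + 1] * eta_sq[t] > 1.0 / (4.0 * sigma**2):
            return False
    return True

# Assumed example values, chosen only for illustration.
varying_ok = varying_conditions_hold(sigma=1.5, eta=0.2, T=500)
print(varying_ok)
```

Since $B_{1}=0$ , the inductive bound at $t=1$ gives $w_{1}\leq\frac{1}{C}$ , which is the upper end of the sandwich used in the proof.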

4 Analysis of Accelerated Stochastic Mirror Descent

Algorithm 2 Accelerated Stochastic Mirror Descent Algorithm [5].

$\psi\colon\mathbb{R}^{d}\to\mathbb{R}$ is a strongly convex mirror map. $\mathbf{D}_{\psi}\left(x,y\right)=\psi\left(x\right)-\psi\left(y\right)-\left\langle\nabla\psi\left(y\right),x-y\right\rangle$ is the Bregman divergence of $\psi$ .

Parameters: initial point $x_{0}=y_{0}=z_{0}$ , step size $\eta$

Set $\alpha_{t}=\frac{2}{t+1}$ , $\eta_{t}=t\eta$

for $t=1$ to $T$ :

$x_{t}=\left(1-\alpha_{t}\right)y_{t-1}+\alpha_{t}z_{t-1}$

$z_{t}=\arg\min_{x\in\mathcal{X}}\left(\eta_{t}\left\langle\widehat{\nabla}f(x_{t}),x\right\rangle+\mathbf{D}_{\psi}\left(x,z_{t-1}\right)\right)$

$y_{t}=\left(1-\alpha_{t}\right)y_{t-1}+\alpha_{t}z_{t}$

return $y_{T}$

In this section, we analyze the Accelerated Stochastic Mirror Descent Algorithm (Algorithm 2). We assume that $f$ satisfies the following condition:

f(y)\leq f(x)+\left\langle\nabla f\left(x\right),y-x\right\rangle+G\left\|y-x\right\|+\frac{\beta}{2}\left\|y-x\right\|^{2}\ \forall x,y\in\mathcal{X}

$\beta$ -smooth functions, $G$ -Lipschitz functions, and their sums all satisfy the above condition.
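
For concreteness, here is a minimal Python sketch of Algorithm 2 under assumptions that are ours: the mirror map is $\psi\left(x\right)=\frac{1}{2}\left\|x\right\|^{2}$ , the domain is all of $\mathbb{R}^{d}$ by default, and the oracle adds isotropic Gaussian noise. The names `accelerated_smd_euclidean`, `project`, and `noise_std` are hypothetical, and the example uses the $1$ -smooth objective $f\left(x\right)=\frac{1}{2}\left\|x\right\|^{2}$ with a noiseless oracle so that the run is deterministic.

```python
import numpy as np

def accelerated_smd_euclidean(grad, x0, eta, T, noise_std, rng,
                              project=lambda x: x):
    """Sketch of Algorithm 2 with psi(x) = 0.5 * ||x||^2 (assumed mirror map)."""
    y = np.asarray(x0, dtype=float).copy()  # y_0
    z = y.copy()                            # z_0 = y_0
    for t in range(1, T + 1):
        alpha = 2.0 / (t + 1)               # alpha_t
        eta_t = t * eta                     # eta_t
        x = (1.0 - alpha) * y + alpha * z   # x_t
        g_hat = grad(x) + noise_std * rng.standard_normal(x.shape)
        # For the Euclidean mirror map the arg min is a projected gradient step.
        z_new = project(z - eta_t * g_hat)  # z_t
        y = (1.0 - alpha) * y + alpha * z_new  # y_t uses y_{t-1} and z_t
        z = z_new
    return y

def quad(x):
    """Example objective f(x) = 0.5 * ||x||^2, which is 1-smooth with x* = 0."""
    return 0.5 * float(np.dot(x, x))

rng = np.random.default_rng(0)
x_init = np.array([1.0])
# eta = 0.25 keeps 1 - beta * alpha_t * eta_t positive when beta = 1.
y_T = accelerated_smd_euclidean(lambda x: x, x_init, eta=0.25, T=200,
                                noise_std=0.0, rng=rng)
print(quad(x_init), quad(y_T))
```

Note that the step sizes $\eta_{t}=t\eta$ grow with $t$ , so with a noisy oracle the base step $\eta$ typically has to be chosen smaller than in this noiseless example.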

As before, we define

\xi_{t}:=\widehat{\nabla}f\left(x_{t}\right)-\nabla f\left(x_{t}\right)

We let $\mathcal{F}_{t}=\sigma\left(\xi_{1},\dots,\xi_{t-1}\right)$ denote the natural filtration. Note that $x_{t}$ is $\mathcal{F}_{t}$ -measurable and $z_{t}$ and $y_{t}$ are $\mathcal{F}_{t+1}$ -measurable.

We follow a similar analysis to the previous section. As before, we start with the inequalities shown in the standard analysis of the algorithm, and we combine them using coefficients $\left\{w_{t}\right\}_{1\leq t\leq T}$ . The following lemma follows from the analysis given in [5] and we include the proof in the Appendix for completeness.

Lemma 4.1.

([5]) For every iteration $t$ , we have

	$\displaystyle\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}G^{2}+\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)$
	$\displaystyle\leq\eta_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}$

We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function for a random variable that is closely related to the left-hand side of the inequality above. We let $w_{0}\geq w_{1}\geq w_{2}\geq\dots\geq w_{T}\geq 0$ be a non-increasing sequence where $w_{t}\in\mathbb{R}$ for all $t$ . We define

	$\displaystyle Z_{t}$	$\displaystyle=w_{t}\left(\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}^{2}G^{2}}{1-\beta\alpha_{t}\eta_{t}}\right)$
		$\displaystyle\quad+w_{T}\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\right)$	$\displaystyle\forall\,1\leq t\leq T$
	$\displaystyle S_{t}$	$\displaystyle=\sum_{i=t}^{T}Z_{i}$	$\displaystyle\forall\,1\leq t\leq T+1$

Theorem 4.2.

Suppose that $w_{t-1}\geq w_{t}+6\sigma^{2}\eta_{t}^{2}w_{t}^{2}$ for every $1\leq t\leq T$ and $\frac{w_{t}\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\leq\frac{1}{4\sigma^{2}}$ for every $0\leq t\leq T$ . For every $1\leq t\leq T+1$ , we have

\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]\leq\exp\left(\left(w_{t-1}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+3\sigma^{2}\sum_{i=t}^{T}w_{i}\frac{\eta_{i}^{2}}{1-\beta\alpha_{i}\eta_{i}}\right)

Proof.

We proceed by induction on $t$ . Consider the base case $t=T+1$ . We have $S_{t}=0$ and $w_{t-1}-w_{T}=0$ , and the inequality follows. Next, we consider $t\leq T$ . We have

\displaystyle\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]

\displaystyle=\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t}\right]=\mathbb{E}\left[\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]|\mathcal{F}_{t}\right]

(5)

We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t+1}$ , $Z_{t}$ is fixed. Using the inductive hypothesis, we obtain

\displaystyle\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]\leq\exp\left(Z_{t}\right)\exp\left(\left(w_{t}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)+3\sigma^{2}\sum_{i=t+1}^{T}w_{i}\frac{\eta_{i}^{2}}{1-\beta\alpha_{i}\eta_{i}}\right)

(6)

Let $X_{t}=\eta_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle$ . By Lemma 4.1, we have

	$\displaystyle\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}G^{2}$
	$\displaystyle\leq X_{t}+\frac{\eta_{t}^{2}}{\left(1-\beta\alpha_{t}\eta_{t}\right)}\left\|\xi_{t}\right\|^{2}-\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\right)$

and thus

\displaystyle Z_{t}

\displaystyle\leq w_{t}X_{t}-\left(w_{t}-w_{T}\right)\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\right)+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}

Plugging into (6), we obtain

	$\displaystyle\mathbb{E}\left[\exp\left(Z_{t}+S_{t+1}\right)|\mathcal{F}_{t+1}\right]$
	$\displaystyle\leq\exp\left(w_{t}X_{t}+\left(w_{t}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}+3\sigma^{2}\sum_{i=t+1}^{T}w_{i}\frac{\eta_{i}^{2}}{1-\beta\alpha_{i}\eta_{i}}\right)$

Plugging into (5), we obtain

	$\displaystyle\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]$
	$\displaystyle\leq\exp\left(\left(w_{t}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+3\sigma^{2}\sum_{i=t+1}^{T}w_{i}\frac{\eta_{i}^{2}}{1-\beta\alpha_{i}\eta_{i}}\right)\mathbb{E}\left[\exp\left(w_{t}X_{t}+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}\right)|\mathcal{F}_{t}\right]$		(7)

Next, we analyze the expectation on the RHS of the above inequality. We have

	$\displaystyle\mathbb{E}\left[\exp\left(w_{t}X_{t}+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}\right)|\mathcal{F}_{t}\right]$
	$\displaystyle=\mathbb{E}\left[\sum_{i=0}^{\infty}\frac{1}{i!}\left(w_{t}X_{t}+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle=\mathbb{E}\left[1+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t}X_{t}+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle\leq\mathbb{E}\left[1+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(w_{t}\eta_{t}\left\|x^{*}-z_{t-1}\right\|\left\|\xi_{t}\right\|+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left\|\xi_{t}\right\|^{2}\right)^{i}|\mathcal{F}_{t}\right]$
	$\displaystyle\leq\exp\left(3\left(w_{t}^{2}\eta_{t}^{2}\left\|x^{*}-z_{t-1}\right\|^{2}+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\right)\sigma^{2}\right)$
	$\displaystyle\leq\exp\left(3\left(2w_{t}^{2}\eta_{t}^{2}\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\right)\sigma^{2}\right)$		(8)

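On the first line, we used the Taylor expansion of $e^{x}$. On the second line, we used that $\mathbb{E}\left[X_{t}|\mathcal{F}_{t}\right]=0$. On the third line, we used the Cauchy-Schwarz inequality:

X_{t}=\eta_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle\leq\eta_{t}\left\|\xi_{t}\right\|\left\|x^{*}-z_{t-1}\right\|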

On the fourth line, we applied Lemma 2.3 with $X=\left\|\xi_{t}\right\|$ , $a=w_{t}\eta_{t}\left\|x^{*}-z_{t-1}\right\|$ , and $b^{2}=w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\leq\frac{1}{4\sigma^{2}}$ . On the fifth line, we used that $\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\geq\frac{1}{2}\left\|x^{*}-z_{t-1}\right\|^{2}$ , which follows from the strong convexity of $\psi$ .

Plugging (8) into (7), we obtain

\mathbb{E}\left[\exp\left(S_{t}\right)|\mathcal{F}_{t}\right]\leq\exp\left(\left(w_{t}+6\sigma^{2}w_{t}^{2}\eta_{t}^{2}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+3\sigma^{2}\sum_{i=t}^{T}w_{i}\frac{\eta_{i}^{2}}{1-\beta\alpha_{i}\eta_{i}}\right)

Since $w_{t-1}\geq w_{t}+6\sigma^{2}w_{t}^{2}\eta_{t}^{2}$ and $\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\geq 0$, the right-hand side is at most the bound claimed in the theorem, as needed. ∎

Theorem 4.2 and Markov’s inequality give us the following convergence guarantee.

Corollary 4.3.

Suppose the sequence $\left\{w_{t}\right\}$ satisfies the conditions of Theorem 4.2. For any $\delta>0$ , the following event holds with probability at least $1-\delta$ :

	$\displaystyle\sum_{t=1}^{T}w_{t}\left(\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)\right)+w_{T}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq w_{0}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)$

Proof.

Let

K=\left(w_{0}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+3\sigma^{2}\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)

By Theorem 4.2 and Markov’s inequality, we have

	$\displaystyle\Pr\left[S_{1}\geq K\right]$	$\displaystyle\leq\Pr\left[\exp\left(S_{1}\right)\geq\exp\left(K\right)\right]$
		$\displaystyle\leq\exp\left(-K\right)\mathbb{E}\left[\exp\left(S_{1}\right)\right]$
		$\displaystyle\leq\exp\left(-K\right)\exp\left(\left(w_{0}-w_{T}\right)\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+3\sigma^{2}\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\right)$
		$\displaystyle=\delta$

Note that

	$\displaystyle S_{1}$	$\displaystyle=\sum_{t=1}^{T}Z_{t}$
		$\displaystyle=\sum_{t=1}^{T}w_{t}\left(\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)\right)$
		$\displaystyle\quad-G^{2}\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+w_{T}\left(\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)\right)$

Therefore, with probability at least $1-\delta$ , we have

	$\displaystyle\sum_{t=1}^{T}w_{t}\left(\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)\right)+w_{T}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq w_{0}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)$

∎
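
The step from a moment generating function bound to a tail bound can be checked exactly on a toy example. The following is a minimal Python sketch, assuming a hypothetical discrete random variable $S$ (the values, probabilities, and the helper name `tail_and_threshold` are illustrative and not part of the analysis). It sets $K=\ln\mathbb{E}\left[\exp\left(S\right)\right]+\ln\left(\frac{1}{\delta}\right)$ and confirms that $\Pr\left[S\geq K\right]\leq\delta$, which is the same use of Markov's inequality as in the proof above.

```python
import math

def tail_and_threshold(values, probs, delta):
    """Return (Pr[S >= K], K) for K = ln E[exp(S)] + ln(1/delta).

    `values` and `probs` describe a hypothetical discrete random variable S.
    """
    mgf = sum(p * math.exp(v) for v, p in zip(values, probs))  # E[exp(S)]
    K = math.log(mgf) + math.log(1.0 / delta)
    tail = sum(p for v, p in zip(values, probs) if v >= K)     # Pr[S >= K]
    return tail, K

# Illustrative distribution (an assumption, chosen only for the check).
values = [-1.0, 0.0, 0.5, 2.0, 4.0]
probs = [0.3, 0.4, 0.2, 0.08, 0.02]

for delta in (0.5, 0.1, 0.01):
    tail, K = tail_and_threshold(values, probs, delta)
    # Markov's inequality applied to exp(S) guarantees tail <= delta.
    assert tail <= delta
```

The additive $\ln\left(\frac{1}{\delta}\right)$ in the threshold is what becomes the $\ln\left(\frac{1}{\delta}\right)$ term in the corollary.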

With the above result in hand, we complete the convergence analysis by showing how to define the sequence $\left\{w_{t}\right\}$ with the desired properties.

Corollary 4.4.

Suppose we run the Accelerated Stochastic Mirror Descent algorithm with the standard choices $\alpha_{t}=\frac{2}{t+1}$ and $\eta_{t}=\eta t$ with $\eta\leq\frac{1}{4\beta}$ . Let $w_{T}=\frac{1}{2\sigma^{2}\eta^{2}T\left(T+1\right)\left(2T+1\right)}$ and $w_{t-1}=w_{t}+6\sigma^{2}\eta_{t}^{2}w_{t}^{2}$ for all $1\leq t\leq T$ . The sequence $\left\{w_{t}\right\}_{0\leq t\leq T}$ satisfies the conditions required by Corollary 4.3. By Corollary 4.3, for any $\delta>0$ , the following events hold with probability at least $1-\delta$ :

f\left(y_{T}\right)-f\left(x^{*}\right)\leq O\left(\frac{\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)}{\eta T^{2}}+\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta T\right)

and

\displaystyle\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)

\displaystyle\leq O\left(\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta^{2}T^{3}\right)

Setting $\eta=\min\left\{\frac{1}{4\beta},\frac{\sqrt{\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)}}{\sqrt{G^{2}+\sigma^{2}\left(1+\ln\left(\frac{1}{\delta}\right)\right)}T^{3/2}}\right\}$ to balance the two terms in the first inequality gives

f\left(y_{T}\right)-f\left(x^{*}\right)\leq O\left(\frac{\beta\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)}{T^{2}}+\frac{\sqrt{\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)}}{\sqrt{T}}\right)

and

\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)\leq O\left(\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)\right)
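
To make the step-size choice concrete, the following Python sketch evaluates $\eta$ and the resulting function value bound. It is an illustration under stated assumptions: the function names `step_size` and `gap_bound` are hypothetical, and the explicit constants 4 and 24 are the ones that appear at the end of the proof of this corollary (the statement itself only uses $O(\cdot)$). In practice $\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)$ is unknown, so `D0` stands for an assumed value or estimate.

```python
import math

def noise_level(G, sigma, delta):
    """G^2 + (1 + ln(1/delta)) * sigma^2, the factor multiplying eta*T in the bound."""
    return G**2 + (1.0 + math.log(1.0 / delta)) * sigma**2

def step_size(beta, D0, G, sigma, delta, T):
    """eta = min{ 1/(4 beta), sqrt(D0) / (sqrt(noise_level) * T^{3/2}) }."""
    balanced = math.sqrt(D0) / (math.sqrt(noise_level(G, sigma, delta)) * T**1.5)
    return min(1.0 / (4.0 * beta), balanced)

def gap_bound(eta, D0, G, sigma, delta, T):
    """4*D0/(eta*T^2) + 24*noise_level*eta*T, with constants taken from the proof."""
    return 4.0 * D0 / (eta * T**2) + 24.0 * noise_level(G, sigma, delta) * eta * T

# Illustrative parameters (assumptions, not values from the paper).
beta, D0, G, sigma, delta, T = 1.0, 1.0, 1.0, 1.0, 0.1, 100
eta = step_size(beta, D0, G, sigma, delta, T)
bound = gap_bound(eta, D0, G, sigma, delta, T)
```

When the second argument of the minimum is active, both terms of the bound are proportional to $\sqrt{\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)/T}$, which is the balancing referred to above.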

Proof.

Recall from Corollary 4.3 that the sequence $\left\{w_{t}\right\}$ needs to satisfy the following two conditions:

	$\displaystyle w_{t}+6\sigma^{2}\eta_{t}^{2}w_{t}^{2}$	$\displaystyle\leq w_{t-1}\quad\forall 1\leq t\leq T$		(9)
	$\displaystyle\frac{w_{t}\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}$	$\displaystyle\leq\frac{1}{4\sigma^{2}}\quad\forall 0\leq t\leq T$		(10)

We will set $\left\{w_{t}\right\}$ so that it satisfies the following additional condition, which will allow us to telescope the sum on the LHS of Corollary 4.3:

w_{t-1}\frac{\eta_{t-1}}{\alpha_{t-1}}\geq w_{t}\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\quad\forall 1\leq t\leq T

(11)

Given $w_{T}$ , we set $w_{t-1}$ for every $1\leq t\leq T$ so that the first condition (9) holds with equality:

w_{t-1}=w_{t}+6\sigma^{2}\eta_{t}^{2}w_{t}^{2}=w_{t}+6\sigma^{2}\eta^{2}t^{2}w_{t}^{2}

Let $C=\sigma^{2}\eta^{2}T\left(T+1\right)\left(2T+1\right)$ . We set

w_{T}=\frac{1}{C+6\sigma^{2}\eta^{2}\sum_{i=1}^{T}i^{2}}=\frac{1}{C+\sigma^{2}\eta^{2}T\left(T+1\right)\left(2T+1\right)}=\frac{1}{2\sigma^{2}\eta^{2}T\left(T+1\right)\left(2T+1\right)}

Given this choice for $w_{T}$ , we now verify that, for all $0\leq t\leq T$ , we have

w_{t}\leq\frac{1}{C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}}=\frac{1}{C+\sigma^{2}\eta^{2}t\left(t+1\right)\left(2t+1\right)}

We proceed by backward induction on $t$ . The base case $t=T$ follows from the definition of $w_{T}$ . For the inductive step, suppose the bound holds for $w_{t}$ with $1\leq t\leq T$ . Using the definition of $w_{t-1}$ and the inductive hypothesis, we obtain

	$\displaystyle w_{t-1}$	$\displaystyle=w_{t}+6\sigma^{2}\eta^{2}t^{2}w_{t}^{2}$
		$\displaystyle\leq\frac{1}{C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}}+\frac{6\sigma^{2}\eta^{2}t^{2}}{\left(C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}\right)^{2}}$
		$\displaystyle\leq\frac{1}{C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}}+\frac{\left(C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}\right)-\left(C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t-1}i^{2}\right)}{\left(C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t}i^{2}\right)\left(C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t-1}i^{2}\right)}$
		$\displaystyle=\frac{1}{C+6\sigma^{2}\eta^{2}\sum_{i=1}^{t-1}i^{2}}$

as needed.

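Let us now verify that the second condition (10) also holds. Using that $\frac{2t}{t+1}\leq 2$ , $\beta\eta\leq\frac{1}{4}$ , the upper bound on $w_{t}$ shown above, and $T\geq 2$ , we obtain

\frac{w_{t}\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}=\frac{w_{t}\eta^{2}t^{2}}{1-\beta\eta\frac{2t}{t+1}}\leq 2w_{t}\eta^{2}t^{2}\leq\frac{2\eta^{2}t^{2}}{C+\sigma^{2}\eta^{2}t\left(t+1\right)\left(2t+1\right)}\leq\frac{2\eta^{2}T^{2}}{2C}=\frac{T}{\sigma^{2}\left(T+1\right)\left(2T+1\right)}\leq\frac{1}{\sigma^{2}\left(2T+1\right)}\leq\frac{1}{4\sigma^{2}}

The third inequality holds because $\frac{t^{2}}{C+\sigma^{2}\eta^{2}t\left(t+1\right)\left(2t+1\right)}$ is non-decreasing in $t$ on $0\leq t\leq T$ (its derivative has the sign of $2C-2\sigma^{2}\eta^{2}t^{3}+\sigma^{2}\eta^{2}t$ , which is non-negative since $C\geq 2\sigma^{2}\eta^{2}T^{3}$ ), so it is maximized at $t=T$ , where the denominator equals $2C$ .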

as needed.
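
The backward recursion and the two conditions can also be checked numerically. The sketch below is a plain-Python illustration, assuming the value $w_{T}=\frac{1}{2C}$ used in this proof; the function names `build_weights` and `conditions_hold` and the parameter values are hypothetical. It builds $w_{T},w_{T-1},\dots,w_{0}$ so that (9) holds with equality, then checks the upper bound $w_{t}\leq\frac{1}{C+\sigma^{2}\eta^{2}t\left(t+1\right)\left(2t+1\right)}$ and condition (10).

```python
def build_weights(sigma, eta, T):
    """Return ([w_0, ..., w_T], C) with w_T = 1/(2C) and
    w_{t-1} = w_t + 6 sigma^2 (eta t)^2 w_t^2 (condition (9) with equality)."""
    C = sigma**2 * eta**2 * T * (T + 1) * (2 * T + 1)
    w = [0.0] * (T + 1)
    w[T] = 1.0 / (2.0 * C)
    for t in range(T, 0, -1):
        w[t - 1] = w[t] + 6.0 * sigma**2 * (eta * t) ** 2 * w[t] ** 2
    return w, C

def conditions_hold(sigma, eta, beta, T, tol=1e-12):
    """Check the upper bound on w_t and condition (10) for all 0 <= t <= T."""
    w, C = build_weights(sigma, eta, T)
    for t in range(T + 1):
        cap = 1.0 / (C + sigma**2 * eta**2 * t * (t + 1) * (2 * t + 1))
        if w[t] > cap * (1.0 + tol):
            return False
        alpha_t, eta_t = 2.0 / (t + 1), eta * t
        lhs = w[t] * eta_t**2 / (1.0 - beta * alpha_t * eta_t)
        if lhs > (1.0 + tol) / (4.0 * sigma**2):
            return False
    return True

# Illustrative parameters with eta <= 1/(4 beta) and T >= 2.
w, C = build_weights(sigma=1.0, eta=0.1, T=10)
ok = conditions_hold(sigma=1.0, eta=0.1, beta=1.0, T=10)
```

Because the recursion only adds non-negative terms, the weights are non-increasing in $t$, which is also what condition (11) relies on.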

Let us now verify that the third condition (11) also holds. Since $\eta_{t}=\eta t$ and $\alpha_{t}=\frac{2}{t+1}$ , we have $\frac{\eta_{t-1}}{\alpha_{t-1}}=\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}=\frac{\eta t\left(t-1\right)}{2}$ . Since $w_{t}\leq w_{t-1}$ , it follows that condition (11) holds.
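
The identity $\frac{\eta_{t-1}}{\alpha_{t-1}}=\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}=\frac{\eta t\left(t-1\right)}{2}$ is easy to confirm with exact rational arithmetic. The following Python sketch does so for a range of $t$; the helper names and the value of `eta` are illustrative assumptions.

```python
from fractions import Fraction

def prev_coeff(t, eta):
    """eta_{t-1} / alpha_{t-1}, with eta_s = eta * s and alpha_s = 2 / (s + 1)."""
    return (eta * (t - 1)) / Fraction(2, t)

def curr_coeff(t, eta):
    """eta_t * (1 - alpha_t) / alpha_t with the same schedules."""
    alpha_t = Fraction(2, t + 1)
    return (eta * t) * (1 - alpha_t) / alpha_t

eta = Fraction(1, 8)  # arbitrary illustrative value
for t in range(1, 50):
    # Both coefficients coincide with eta * t * (t - 1) / 2.
    assert prev_coeff(t, eta) == curr_coeff(t, eta) == eta * t * (t - 1) / 2
```

With equal coefficients on both sides, condition (11) reduces to $w_{t-1}\geq w_{t}$.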

We now turn our attention to the convergence. By Corollary 4.3, with probability $\geq 1-\delta$ , we have

	$\displaystyle\sum_{t=1}^{T}w_{t}\left(\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)-\frac{\eta_{t}\left(1-\alpha_{t}\right)}{\alpha_{t}}\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)\right)+w_{T}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq w_{0}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)$

Grouping terms on the LHS and using that $\alpha_{1}=1$ , we obtain

	$\displaystyle\sum_{t=1}^{T-1}\left(w_{t}\frac{\eta_{t}}{\alpha_{t}}-w_{t+1}\frac{\eta_{t+1}\left(1-\alpha_{t+1}\right)}{\alpha_{t+1}}\right)\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)+w_{T}\frac{\eta_{T}}{\alpha_{T}}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+w_{T}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq w_{0}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)$

Since $\left\{w_{t}\right\}$ satisfies condition (11), the coefficient of $f\left(y_{t}\right)-f\left(x^{*}\right)$ is non-negative and thus we can drop the above sum. We obtain

\displaystyle w_{T}\frac{\eta_{T}}{\alpha_{T}}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+w_{T}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)

\displaystyle\leq w_{0}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}w_{t}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)

Using that $w_{T}=\frac{1}{2C}$ and $w_{t}\leq\frac{1}{C}$ for all $0\leq t\leq T-1$ , we obtain

	$\displaystyle\frac{1}{2C}\frac{\eta_{T}}{\alpha_{T}}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+\frac{1}{2C}\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq\frac{1}{C}\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\frac{1}{C}\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+\ln\left(\frac{1}{\delta}\right)$

Thus

	$\displaystyle\frac{\eta_{T}}{\alpha_{T}}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+2\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+2C\ln\left(\frac{1}{\delta}\right)$
	$\displaystyle=2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+2\left(G^{2}+3\sigma^{2}\right)\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}+2\sigma^{2}\ln\left(\frac{1}{\delta}\right)\eta^{2}T\left(T+1\right)\left(2T+1\right)$

Using that $\beta\eta\leq\frac{1}{4}$ and $\frac{2t}{t+1}\leq 2$ , we obtain

\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}=\sum_{t=1}^{T}\frac{\eta^{2}t^{2}}{1-\beta\eta\frac{2t}{t+1}}\leq\sum_{t=1}^{T}2\eta^{2}t^{2}=\frac{1}{3}\eta^{2}T\left(T+1\right)\left(2T+1\right)

Plugging in and using that $\eta_{T}=\eta T$ and $\alpha_{T}=\frac{2}{T+1}$ , we obtain

	$\displaystyle\eta\frac{T\left(T+1\right)}{2}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)$
	$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+\left(\frac{2}{3}G^{2}+2\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta^{2}T\left(T+1\right)\left(2T+1\right)$
	$\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+2\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta^{2}T\left(T+1\right)\left(2T+1\right)$

We can further simplify the bound by lower bounding $T\left(T+1\right)\geq T^{2}$ and upper bounding $T\left(T+1\right)\left(2T+1\right)\leq 6T^{3}$ . We obtain

\displaystyle\eta T^{2}\left(f\left(y_{T}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)

\displaystyle\leq 4\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+24\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta^{2}T^{3}

Thus we obtain

f\left(y_{T}\right)-f\left(x^{*}\right)\leq\frac{4\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)}{\eta T^{2}}+24\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta T

and

\displaystyle\mathbf{D}_{\psi}\left(x^{*},z_{T}\right)

\displaystyle\leq 2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)+12\left(G^{2}+\left(1+\ln\left(\frac{1}{\delta}\right)\right)\sigma^{2}\right)\eta^{2}T^{3}

∎

References

[1] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
[2] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
[3] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[4] Sham M Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. Advances in Neural Information Processing Systems, 21, 2008.
[5] Guanghui Lan. First-order and stochastic optimization methods for machine learning. Springer, 2020.
[6] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[7] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Appendix A Omitted Proofs

Proof.

(Lemma 2.3) We consider two cases: either $a\geq 1/(2\sigma)$ or $a\leq 1/(2\sigma)$ . First, suppose $a\geq 1/(2\sigma)$ . Using the inequality $uv\leq\frac{u^{2}}{4}+v^{2}$ with $u=X/\sigma$ and $v=a\sigma$ , we obtain

	$\displaystyle\mathbb{E}\left[1+b^{2}X^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(aX+b^{2}X^{2}\right)^{i}\right]$	$\displaystyle\leq\mathbb{E}\left[1+b^{2}X^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(\frac{1}{4\sigma^{2}}X^{2}+a^{2}\sigma^{2}+b^{2}X^{2}\right)^{i}\right]$
		$\displaystyle=\mathbb{E}\left[b^{2}X^{2}+\exp\left(\left(\frac{1}{4\sigma^{2}}+b^{2}\right)X^{2}+a^{2}\sigma^{2}\right)-\left(\frac{1}{4\sigma^{2}}+b^{2}\right)X^{2}-a^{2}\sigma^{2}\right]$
		$\displaystyle=\mathbb{E}\left[\exp\left(\left(\frac{1}{4\sigma^{2}}+b^{2}\right)X^{2}+a^{2}\sigma^{2}\right)-\frac{1}{4\sigma^{2}}X^{2}-a^{2}\sigma^{2}\right]$
		$\displaystyle\leq\exp\left(\left(\frac{1}{4\sigma^{2}}+b^{2}\right)\sigma^{2}+a^{2}\sigma^{2}\right)$
		$\displaystyle\leq\exp\left(b^{2}\sigma^{2}+2a^{2}\sigma^{2}\right)$

Next, suppose $a\leq 1/(2\sigma)$ and let $c=\max(a,b)\leq 1/(2\sigma)$ . We have

	$\displaystyle\mathbb{E}\left[1+b^{2}X^{2}+\sum_{i=2}^{\infty}\frac{1}{i!}\left(aX+b^{2}X^{2}\right)^{i}\right]$	$\displaystyle=\mathbb{E}\left[\exp\left(aX+b^{2}X^{2}\right)-aX\right]$
		$\displaystyle\leq\mathbb{E}\left[\left(aX+\exp\left(a^{2}X^{2}\right)\right)\exp\left(b^{2}X^{2}\right)-aX\right]$
		$\displaystyle=\mathbb{E}\left[\exp\left(\left(a^{2}+b^{2}\right)X^{2}\right)+aX\left(\exp\left(b^{2}X^{2}\right)-1\right)\right]$
		$\displaystyle\leq\mathbb{E}\left[\exp\left(\left(a^{2}+b^{2}\right)X^{2}\right)+cX\left(\exp\left(c^{2}X^{2}\right)-1\right)\right]$
		$\displaystyle\leq\mathbb{E}\left[\exp\left(\left(a^{2}+b^{2}\right)X^{2}\right)+\exp\left(2c^{2}X^{2}\right)-1\right]$
		$\displaystyle\leq\mathbb{E}\left[\exp\left(\left(a^{2}+b^{2}+2c^{2}\right)X^{2}\right)\right]$
		$\displaystyle\leq\exp\left(\left(a^{2}+b^{2}+2c^{2}\right)\sigma^{2}\right)$

In the first inequality, we use that $e^{x}-x\leq e^{x^{2}}\ \forall x$ . In the third inequality, we use that $x\left(e^{x^{2}}-1\right)\leq e^{2x^{2}}-1\ \forall x$ . The latter inequality can be proved using the Taylor expansion:

	$\displaystyle x\left(e^{x^{2}}-1\right)$	$\displaystyle=\sum_{i=1}^{\infty}\frac{1}{i!}x^{2i+1}$
		$\displaystyle\leq\sum_{i=1}^{\infty}\frac{1}{i!}\frac{x^{2i}+x^{2i+2}}{2}$
		$\displaystyle=\frac{x^{2}}{2}+\sum_{i=2}^{\infty}\left(\frac{1+i}{2i!}\right)x^{2i}$
		$\displaystyle\leq\frac{x^{2}}{2}+\sum_{i=2}^{\infty}\left(\frac{2^{i}}{i!}\right)x^{2i}$
		$\displaystyle\leq e^{2x^{2}}-1$

∎
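
As a quick numerical sanity check (not a substitute for the proof), the two elementary inequalities used above, $e^{x}-x\leq e^{x^{2}}$ and $x\left(e^{x^{2}}-1\right)\leq e^{2x^{2}}-1$, can be evaluated on a grid. The Python sketch below does this; the grid range $[-3,3]$ and the helper names are arbitrary choices made for illustration.

```python
import math

def first_gap(x):
    """exp(x^2) - (exp(x) - x); non-negative when e^x - x <= e^{x^2}."""
    return math.exp(x * x) - (math.exp(x) - x)

def second_gap(x):
    """(exp(2x^2) - 1) - x (exp(x^2) - 1); non-negative when the second inequality holds."""
    return (math.exp(2.0 * x * x) - 1.0) - x * (math.exp(x * x) - 1.0)

# Evaluate both gaps on a uniform grid over [-3, 3] with step 0.01.
grid = [k / 100.0 for k in range(-300, 301)]
assert all(first_gap(x) >= -1e-12 for x in grid)
assert all(second_gap(x) >= -1e-12 for x in grid)
```

Both gaps vanish at $x=0$, so the inequalities are tight there; the small tolerance only guards against floating-point rounding.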

Proof.

(Lemma 3.1) By the optimality condition, we have

\left\langle\eta_{t}\widehat{\nabla}f(x_{t})+\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle\geq 0

and thus

\left\langle\eta_{t}\widehat{\nabla}f(x_{t}),x_{t+1}-x^{*}\right\rangle\leq\left\langle\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle

Note that

	$\displaystyle\left\langle\nabla_{x}\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right),x^{*}-x_{t+1}\right\rangle$	$\displaystyle=\left\langle\nabla\psi\left(x_{t+1}\right)-\nabla\psi\left(x_{t}\right),x^{*}-x_{t+1}\right\rangle$
		$\displaystyle=\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)$

and thus

	$\displaystyle\eta_{t}\left\langle\widehat{\nabla}f(x_{t}),x_{t+1}-x^{*}\right\rangle$	$\displaystyle\leq\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)$
		$\displaystyle\leq\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}$

where we have used that $\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)\geq\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}$ by the strong convexity of $\psi$ .

By convexity,

f\left(x_{t}\right)-f\left(x^{*}\right)\leq\left\langle\nabla f\left(x_{t}\right),x_{t}-x^{*}\right\rangle=\left\langle\xi_{t},x^{*}-x_{t}\right\rangle+\left\langle\widehat{\nabla}f\left(x_{t}\right),x_{t}-x^{*}\right\rangle

Combining the two inequalities, we obtain

	$\displaystyle\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)$
	$\displaystyle\leq\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle+\eta_{t}\left\langle\widehat{\nabla}f(x_{t}),x_{t}-x_{t+1}\right\rangle-\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}$
	$\displaystyle\leq\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle+\frac{\eta_{t}^{2}}{2}\left\|\widehat{\nabla}f(x_{t})\right\|^{2}$

Using the triangle inequality and the bounded gradient assumption $\left\|\nabla f(x)\right\|\leq G$ , we obtain

\left\|\widehat{\nabla}f(x_{t})\right\|^{2}=\left\|\xi_{t}+\nabla f(x_{t})\right\|^{2}\leq 2\left\|\xi_{t}\right\|^{2}+2\left\|\nabla f(x_{t})\right\|^{2}\leq 2\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)

Thus

\eta_{t}\left(f\left(x_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},x_{t+1}\right)-\mathbf{D}_{\psi}\left(x^{*},x_{t}\right)\leq\eta_{t}\left\langle\xi_{t},x^{*}-x_{t}\right\rangle+\eta_{t}^{2}\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)

as needed. ∎
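
The three-point identity for Bregman divergences used in this proof, $\left\langle\nabla\psi\left(b\right)-\nabla\psi\left(a\right),c-b\right\rangle=\mathbf{D}_{\psi}\left(c,a\right)-\mathbf{D}_{\psi}\left(b,a\right)-\mathbf{D}_{\psi}\left(c,b\right)$ with $a=x_{t}$, $b=x_{t+1}$, $c=x^{*}$, can be verified numerically. The Python sketch below does so for one concrete mirror map; choosing the negative entropy $\psi\left(x\right)=\sum_{i}x_{i}\ln x_{i}$ and the three test points is an assumption made for illustration, since the paper allows any strongly convex $\psi$.

```python
import math

def psi(x):
    """Negative entropy, an illustrative choice of mirror map (assumption)."""
    return sum(v * math.log(v) for v in x)

def grad_psi(x):
    return [1.0 + math.log(v) for v in x]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def bregman(x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - dot(grad_psi(y), [p - q for p, q in zip(x, y)])

def three_point_gap(a, b, c):
    """LHS minus RHS of the three-point identity; should be ~0."""
    lhs = dot([p - q for p, q in zip(grad_psi(b), grad_psi(a))],
              [p - q for p, q in zip(c, b)])
    rhs = bregman(c, a) - bregman(b, a) - bregman(c, b)
    return lhs - rhs

# Three arbitrary points with positive coordinates (illustrative values).
a = [0.2, 0.3, 0.5]   # plays the role of x_t
b = [0.5, 0.3, 0.2]   # plays the role of x_{t+1}
c = [0.1, 0.6, 0.3]   # plays the role of x^*
gap = three_point_gap(a, b, c)
```

The identity is purely algebraic, so the gap is zero up to rounding for any differentiable $\psi$; strong convexity is only needed for the later bound $\mathbf{D}_{\psi}\left(x_{t+1},x_{t}\right)\geq\frac{1}{2}\left\|x_{t+1}-x_{t}\right\|^{2}$.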

Proof.

(Lemma 4.1) Starting with smoothness, we obtain

	$\displaystyle f\left(y_{t}\right)$	$\displaystyle\leq f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right),y_{t}-x_{t}\right\rangle+G\left\|y_{t}-x_{t}\right\|+\frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2}\ \forall x\in\mathcal{X}$
		$\displaystyle=f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right),y_{t-1}-x_{t}\right\rangle+\left\langle\nabla f\left(x_{t}\right),y_{t}-y_{t-1}\right\rangle+G\left\|y_{t}-x_{t}\right\|+\frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2}$
		$\displaystyle=\left(1-\alpha_{t}\right)\underbrace{\left(f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right),y_{t-1}-x_{t}\right\rangle\right)}_{\text{convexity}}+\alpha_{t}\underbrace{\left(f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right),y_{t-1}-x_{t}\right\rangle\right)}_{\text{convexity}}$
		$\displaystyle+\alpha_{t}\left\langle\nabla f\left(x_{t}\right),z_{t}-y_{t-1}\right\rangle+G\left\|y_{t}-x_{t}\right\|+\frac{\beta}{2}\left\|y_{t}-x_{t}\right\|^{2}$
		$\displaystyle\leq\left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x_{t}\right)+\alpha_{t}\left\langle\nabla f\left(x_{t}\right),z_{t}-x_{t}\right\rangle+G\underbrace{\left\|y_{t}-x_{t}\right\|}_{=\alpha_{t}\left\|z_{t}-z_{t-1}\right\|}+\frac{\beta}{2}\underbrace{\left\|y_{t}-x_{t}\right\|^{2}}_{=\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}}$
		$\displaystyle=\left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x_{t}\right)+\alpha_{t}\left\langle\nabla f\left(x_{t}\right),z_{t}-x_{t}\right\rangle+G\alpha_{t}\left\|z_{t}-z_{t-1}\right\|+\frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}$

By the optimality condition for $z_{t}$ ,

\eta_{t}\left\langle\widehat{\nabla}f(x_{t}),z_{t}-x^{*}\right\rangle\leq\left\langle\nabla_{x}\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right),x^{*}-z_{t}\right\rangle=\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)-\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)

Rearranging, we obtain

\displaystyle\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)

\displaystyle\leq\eta_{t}\left\langle\widehat{\nabla}f\left(x_{t}\right),x^{*}-z_{t}\right\rangle=\eta_{t}\left\langle\nabla f\left(x_{t}\right)+\xi_{t},x^{*}-z_{t}\right\rangle

By combining the two inequalities, we obtain

	$\displaystyle f\left(y_{t}\right)+\frac{\alpha_{t}}{\eta_{t}}\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)+\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)\right)$
	$\displaystyle\leq\left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}\underbrace{\left(f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right),x^{*}-x_{t}\right\rangle\right)}_{\text{convexity}}$
	$\displaystyle+G\alpha_{t}\left\|z_{t}-z_{t-1}\right\|+\frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t}\right\rangle$
	$\displaystyle\leq\left(1-\alpha_{t}\right)f\left(y_{t-1}\right)+\alpha_{t}f\left(x^{*}\right)+G\alpha_{t}\left\|z_{t}-z_{t-1}\right\|+\frac{\beta}{2}\alpha_{t}^{2}\left\|z_{t}-z_{t-1}\right\|^{2}+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t}\right\rangle$

Subtracting $f\left(x^{*}\right)$ from both sides, rearranging, and using that $\mathbf{D}_{\psi}\left(z_{t},z_{t-1}\right)\geq\frac{1}{2}\left\|z_{t}-z_{t-1}\right\|^{2}$ , we obtain

	$\displaystyle f\left(y_{t}\right)-f\left(x^{*}\right)+\frac{\alpha_{t}}{\eta_{t}}\left(\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)\right)$
	$\displaystyle\leq\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t}\right\rangle+G\alpha_{t}\left\|z_{t}-z_{t-1}\right\|-\alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2}$
	$\displaystyle=\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\alpha_{t}\left\langle\xi_{t},z_{t-1}-z_{t}\right\rangle+G\alpha_{t}\left\|z_{t}-z_{t-1}\right\|-\alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2}$
	$\displaystyle\leq\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\alpha_{t}\left\|z_{t}-z_{t-1}\right\|\left(\left\|\xi_{t}\right\|+G\right)-\alpha_{t}\frac{1-\beta\alpha_{t}\eta_{t}}{2\eta_{t}}\left\|z_{t}-z_{t-1}\right\|^{2}$
	$\displaystyle\leq\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\alpha_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\frac{\alpha_{t}\eta_{t}}{2\left(1-\beta\alpha_{t}\eta_{t}\right)}\left(\left\|\xi_{t}\right\|+G\right)^{2}$

Finally, we divide by $\frac{\alpha_{t}}{\eta_{t}}$ , and obtain

	$\displaystyle\frac{\eta_{t}}{\alpha_{t}}\left(f\left(y_{t}\right)-f\left(x^{*}\right)\right)+\mathbf{D}_{\psi}\left(x^{*},z_{t}\right)-\mathbf{D}_{\psi}\left(x^{*},z_{t-1}\right)$
	$\displaystyle\leq\frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\eta_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\frac{\eta_{t}^{2}}{2\left(1-\beta\alpha_{t}\eta_{t}\right)}\left(\left\|\xi_{t}\right\|+G\right)^{2}$
	$\displaystyle\leq\frac{\eta_{t}}{\alpha_{t}}\left(1-\alpha_{t}\right)\left(f\left(y_{t-1}\right)-f\left(x^{*}\right)\right)+\eta_{t}\left\langle\xi_{t},x^{*}-z_{t-1}\right\rangle+\frac{\eta_{t}^{2}}{1-\beta\alpha_{t}\eta_{t}}\left(\left\|\xi_{t}\right\|^{2}+G^{2}\right)$

∎
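
For readers who want to see the iterates behind this lemma, the following Python sketch runs the three-sequence scheme implied by the identities used in the proof ($y_{t}-y_{t-1}=\alpha_{t}\left(z_{t}-y_{t-1}\right)$ and $y_{t}-x_{t}=\alpha_{t}\left(z_{t}-z_{t-1}\right)$). It is a sketch under explicit assumptions: the mirror map is Euclidean, $\psi=\frac{1}{2}\left\|\cdot\right\|_{2}^{2}$, and the domain is unconstrained, so the mirror step reduces to $z_{t}=z_{t-1}-\eta_{t}\widehat{\nabla}f\left(x_{t}\right)$. The function name `accelerated_smd`, the initialization $y_{0}=z_{0}$, and the test problem are hypothetical.

```python
def accelerated_smd(grad_oracle, z0, eta, T):
    """Euclidean, unconstrained sketch with alpha_t = 2/(t+1) and eta_t = eta * t.

    grad_oracle(x) returns a (possibly stochastic) gradient estimate at x.
    Returns the final iterates (y_T, z_T).
    """
    y, z = list(z0), list(z0)  # y_0 = z_0 is an assumption; alpha_1 = 1 makes y_0 irrelevant
    for t in range(1, T + 1):
        alpha_t, eta_t = 2.0 / (t + 1), eta * t
        # x_t = (1 - alpha_t) y_{t-1} + alpha_t z_{t-1}
        x = [(1 - alpha_t) * yi + alpha_t * zi for yi, zi in zip(y, z)]
        g = grad_oracle(x)
        # Mirror step; for psi = 0.5 ||.||^2 this is a plain gradient step on z.
        z_next = [zi - eta_t * gi for zi, gi in zip(z, g)]
        # y_t = (1 - alpha_t) y_{t-1} + alpha_t z_t
        y = [(1 - alpha_t) * yi + alpha_t * zi for yi, zi in zip(y, z_next)]
        z = z_next
    return y, z

# Usage on a noiseless quadratic f(x) = 0.5 * sum(c_i x_i^2), minimized at x* = 0.
curv = [1.0, 0.1]                       # smoothness beta = 1, so eta = 1/(4 beta) = 0.25
f = lambda x: 0.5 * sum(c * v * v for c, v in zip(curv, x))
grad = lambda x: [c * v for c, v in zip(curv, x)]
y_T, z_T = accelerated_smd(grad, [1.0, 1.0], eta=0.25, T=50)
```

In this noiseless case ($\xi_{t}=0$, $G=0$), summing the inequality of Lemma 4.1 over $t$ telescopes to $f\left(y_{T}\right)-f\left(x^{*}\right)\leq\frac{2\mathbf{D}_{\psi}\left(x^{*},z_{0}\right)}{\eta T\left(T+1\right)}$, which gives a simple check on the output of the sketch.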
