
New First-Order Algorithms for Stochastic Variational Inequalities

Kevin Huang (Department of Industrial and System Engineering, University of Minnesota, huan1741@umn.edu)
Shuzhong Zhang (Department of Industrial and System Engineering, University of Minnesota, zhangs@umn.edu)

Abstract

In this paper, we propose two new solution schemes to solve the stochastic strongly monotone variational inequality problems: the stochastic extra-point solution scheme and the stochastic extra-momentum solution scheme. The first one is a general scheme based on updating the iterative sequence and an auxiliary extra-point sequence. In the case of the deterministic VI model, this approach includes several state-of-the-art first-order methods as its special cases. The second scheme combines two momentum-based directions: the so-called heavy-ball direction and the optimism direction, where only one projection per iteration is required in its updating process. We show that, if the variance of the stochastic oracle is appropriately controlled, then both schemes can be made to achieve the optimal iteration complexity of $\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ to reach an $\epsilon$-solution for a strongly monotone VI problem with condition number $\kappa$. We show that these methods can be readily incorporated in a zeroth-order approach to solve stochastic minimax saddle-point problems, where only noisy and biased samples of the objective can be obtained, with a total sample complexity of $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$.

Keywords: variational inequality, minimax saddle-point, stochastic first-order method, zeroth-order method.

1 Introduction

Given a constraint set $\mathcal{Z}\subseteq\mathbb{R}^{n}$ and a mapping $F:\mathcal{Z}\rightarrow\mathbb{R}^{n}$, the classical variational inequality (VI) problem is to find $z^{*}\in\mathcal{Z}$ such that

$$F(z^{*})^{\top}(z-z^{*})\geq 0,\quad\forall z\in\mathcal{Z}. \tag{1}$$

For an introduction to VI and its applications, we refer the readers to Facchinei and Pang [4] and the references therein.

In this paper, we consider a stochastic version of problem (1), where the exact evaluation of the mapping $F(\cdot)$ is inaccessible. Instead, only a stochastic oracle is available. The stochasticity in question may stem from, e.g., the non-deterministic nature of mixed strategies of the players in a game setting, or simply from the difficulty in evaluating the mapping itself. The latter has become more pronounced in the literature, due to its recently found application as a training/learning subproblem in machine learning and/or statistical learning. The so-called stochastic oracle is a noisy estimate of the mapping $F(\cdot)$, and an iterative scheme that incorporates such an oracle is known as stochastic approximation (SA). As far as we know, the first proposal to use such an approach for stochastic optimization can be traced back to the seminal work of Robbins and Monro [31]. In 2008, Jiang and Xu [10] followed the SA approach to solve VI models. Since then, efforts have been made to extend existing deterministic methods to the stochastic VI models; see e.g. [11, 41, 12, 15, 9, 7, 8].

Let us start our discussion by introducing the assumptions made throughout the paper. We consider VI model (1) where $\mathcal{Z}$ is a closed convex set. Moreover, the following two conditions are assumed:

$$\left(F(z)-F(z^{\prime})\right)^{\top}(z-z^{\prime})\geq\mu\|z-z^{\prime}\|^{2},\quad\forall z,z^{\prime}\in\mathcal{Z}, \tag{2}$$

for some $\mu>0$, and

$$\|F(z)-F(z^{\prime})\|\leq L\|z-z^{\prime}\|,\quad\forall z,z^{\prime}\in\mathcal{Z}, \tag{3}$$

for some $L\geq\mu>0$. Condition (2) is known as the strong monotonicity of $F$, while Condition (3) is known as the Lipschitz continuity of $F$. If Condition (2) is met with $\mu=0$, then $F$ is known as monotone. VI problems satisfying (2) with positive $\mu$ can be easily shown to have a unique solution $z^{*}$. Let us denote $\kappa:=\frac{L}{\mu}\geq 1$. The parameter $\kappa$ is usually known as the condition number of the VI model (1). We also assume

$$\max\limits_{z,z^{\prime}\in\mathcal{Z}}\|z-z^{\prime}\|\leq D, \tag{4}$$

namely the constraint set is bounded. We remark that this assumption can actually be removed without affecting the results, but then the analysis becomes lengthier and more tedious without conceivable conceptual benefit, and so we shall not pursue that generality in this paper.
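
To make conditions (2) and (3) concrete, the following sketch evaluates both quotients numerically for a toy two-dimensional operator. The operator, the constants `MU` and `A`, and the helper names are assumptions introduced only for this illustration; they do not come from the paper. For this particular operator the skew-symmetric part cancels in (2), so one would expect $\mu$ = `MU`, $L=\sqrt{\texttt{MU}^2+\texttt{A}^2}$, and hence $\kappa=L/\mu$.

```python
import math
import random

MU, A = 0.5, 2.0  # hypothetical constants chosen only for this illustration


def F(z):
    """Operator (df/dx, -df/dy) of f(x, y) = MU/2*x^2 + A*x*y - MU/2*y^2."""
    x, y = z
    return (MU * x + A * y, -A * x + MU * y)


def ratios(z, w):
    """Return the two quotients that conditions (2) and (3) bound for the pair (z, w)."""
    d = (z[0] - w[0], z[1] - w[1])
    Fz, Fw = F(z), F(w)
    g = (Fz[0] - Fw[0], Fz[1] - Fw[1])
    dist2 = d[0] ** 2 + d[1] ** 2
    monotone = (g[0] * d[0] + g[1] * d[1]) / dist2          # bounded below by mu in (2)
    lipschitz = math.hypot(g[0], g[1]) / math.sqrt(dist2)   # bounded above by L in (3)
    return monotone, lipschitz


def sample_ratios(trials=1000, seed=0):
    """Evaluate both quotients on random pairs drawn from the box [-1, 1]^2."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        z = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        w = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        out.append(ratios(z, w))
    return out


if __name__ == "__main__":
    results = sample_ratios()
    mu_est = min(m for m, _ in results)
    L_est = max(l for _, l in results)
    print("mu ~", mu_est, "L ~", L_est, "kappa ~", L_est / mu_est)
```

Sampling only gives estimates of $\mu$ and $L$ in general; here the quotients happen to be constant because the operator is linear with a scaled-rotation structure.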

The stochastic oracle of the mapping, denoted by $\hat{F}(z,\xi)$, takes a random sample $\xi\in\Xi$ from some sample space $\Xi$. The oracle is required to satisfy:

$$\mathbb{E}\left[\|\hat{F}(z,\xi)-F(z)\|\right]\leq\delta, \tag{5}$$

$$\mathbb{E}\left[\|\hat{F}(z,\xi)-F(z)\|^{2}\right]\leq\sigma^{2}, \tag{6}$$

for all $z\in\mathcal{Z}$, where $\delta,\sigma\geq 0$ are some constants. In other words, we assume both the bias and the deviation are uniformly upper-bounded.

In this paper, we propose two stochastic first-order schemes: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme maintains two sequences of iterates featuring several well-known first-order search directions such as: the extra-gradient [13, 36], the heavy-ball [29], Nesterov’s extrapolation [23, 25], and the optimism direction [30, 20]. The second scheme, on the other hand, specifically combines the heavy-ball momentum and the optimism momentum in its updating formula, and maintains only one sequence throughout the iterations, therefore requiring only one projection per iteration. These two approaches require different types of analysis. Both schemes render a wider range of search directions than the existing first-order methods, and the parameters associated with each search direction could and should be tuned differently from problem class to problem class in order to secure good practical performance. The deterministic counterpart of these methods can be found in our previous work [6]. In the stochastic context, we show that as long as the variance can be reduced throughout the iterations, they yield the optimal iteration complexity $\mathcal{O}(\kappa\ln(1/\epsilon))$ (cf. [42]) to reach an $\epsilon$-solution: $\|z^{k}-z^{*}\|^{2}\leq\epsilon$, with an additional bias term depending on $\delta$. In a later section, we demonstrate an application to the stochastic black-box minimax saddle-point problem where only noisy function values $f(x,y)$ are accessible. This application is particularly relevant, given its use in machine learning, where the training data set may be very large and evaluating the exact gradient/function value is usually impractical. Through a smoothing technique, we propose a stochastic zeroth-order gradient as our update direction in either the stochastic extra-point scheme or the stochastic extra-momentum scheme. We show that both approaches yield an iteration complexity of $\mathcal{O}(\kappa\ln(1/\epsilon))$ and a sample complexity of $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$.

The rest of the paper is organized as follows. In Section 2, we survey the relevant literature with a focus on stochastic VI. In Section 3, we present the main results in this paper, i.e. the convergence results of the two proposed methods, while the technical proofs are relegated to the appendices. In Section 4, we introduce a stochastic black-box saddle-point problem and present the sample complexity results of our methods. We present some promising preliminary numerical results in Section 5 and conclude the paper in Section 6.

2 Literature Review

The first-order algorithms for deterministic VI (1) serve as a basis for the developments of their stochastic counterparts. These algorithms include the projection method [4], the proximal method [18, 32, 36], the extra-gradient method [13, 36], the optimistic gradient descent ascent (OGDA) method [30, 20, 21], the mirror-prox method [22], the extrapolation method [26, 24], and the extra-point method [6].

In this section, we shall focus on the developments of algorithms for stochastic VI, starting with a paper of Jiang and Xu [10], where the authors propose a stochastic projection method for solving strongly monotone and Lipschitz continuous VI problems and present an almost-sure convergence result. Koshal et al. [14] propose an iterative Tikhonov regularization method and an iterative proximal point method and show almost-sure convergence for monotone and Lipschitz continuous VI problems. Both methods solve a strongly monotone VI subproblem at each iteration. Yousefian et al. [39] further introduce a local smoothing technique to the above-mentioned regularized methods to account for non-Lipschitz mappings and show almost-sure convergence. A survey on these methods, as well as applications and the theory behind stochastic VI, can be found in Shanbhag [35].

Juditsky et al. [11] are among the first to show an iteration complexity bound for stochastic VI algorithms. They extend the mirror-prox method [22] to stochastic settings and prove an optimal iteration complexity bound for monotone VI: $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$, or $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ when the variance can be kept small enough. Yousefian et al. [41] further extend the stochastic mirror-prox method with a more general step size choice and show an $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$ iteration complexity, where they also show an $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ complexity for the stochastic extra-gradient method for solving strongly monotone VI problems. Yousefian et al. [40] use a randomized smoothing technique for non-Lipschitz mappings and show an $\mathcal{O}\left(\frac{1}{\epsilon^{6}}\right)$ iteration complexity. Chen et al. [3] consider a specific class of VI model: a mapping that consists of a Lipschitz continuous and monotone operator, a Lipschitz continuous gradient mapping of a convex function, and a subgradient mapping of a simple convex function. They propose a method that combines Nesterov’s acceleration [25] with the stochastic mirror-prox method to exploit this special structure, resulting in an optimal iteration complexity for such a class of problems: $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$, or $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ when the variance can be kept small enough, or $\mathcal{O}\left(\sqrt{\frac{1}{\epsilon}}\right)$ when the operator consists only of gradient/subgradient mappings from some convex function. Kannan and Shanbhag [12] analyze a general variant of the extra-gradient method (which uses general distance-generating functions) and show that under slightly weaker assumptions than strong monotonicity, the optimal $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ iteration bound still holds. Kotsalis et al. [15] extend the OGDA method to strongly monotone stochastic VI with iteration complexity $\mathcal{O}\left(\frac{1}{\epsilon}\right)$, or $\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ when the variance can be kept small enough.

We shall note that the detailed implementation of variance reduction is in general not considered in the above-mentioned methods (although some do present an additional complexity term when the variance is small, such as in [11, 3]). Therefore, the optimal iteration complexity bound is $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$ for monotone VI and $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ for strongly monotone VI, as compared to $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ and $\mathcal{O}(\kappa\ln(1/\epsilon))$ for their deterministic counterparts. By increasing the sample size (aka mini-batch) in each iteration, the variance can be reduced as the algorithm progresses, therefore attaining the same optimal iteration complexity bound as the deterministic problems.

There have been developments for variance-reduction-based methods in recent years. Jalilzadeh and Shanbhag [9] extend the method [26] for deterministic strongly monotone VI to stochastic VI and show that with variance reduction the optimal iteration complexity $\mathcal{O}(\kappa\ln(1/\epsilon))$ can be achieved, together with a total sample complexity of $\mathcal{O}\left(\frac{1}{\epsilon^{\beta}}\right)$ for some constant $\beta>1$. With this method as a subroutine, they also propose a variance-reduced proximal point method with iteration complexity $\mathcal{O}\left(\frac{1}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$ and sample complexity $\mathcal{O}\left(\frac{1}{\epsilon^{1+2\alpha\beta}}\right)$ for some constants $\alpha,\beta>1$. Iusem et al. [7] propose a variance-reduced extra-gradient-based method for monotone VI and show $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ iteration complexity and $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ sample complexity. They further extend the method [8] by incorporating line search for an unknown Lipschitz constant, while preserving similar bounds. Palaniappan and Bach [28] propose variance-reduced stochastic forward-backward methods based on (accelerated) stochastic gradient descent methods in optimization and show $\mathcal{O}(\kappa\ln(1/\epsilon))$ iteration complexity. For another line of work, which includes the concept of differential privacy in the stochastic VI, we refer the readers to a recent paper [2] and the references therein. The stochastic oracle may be man-made. For instance, the technique of randomized smoothing has been applied in the so-called zeroth-order methods (i.e. derivative-free methods); see [27, 34] or the survey [16] in the context of optimization and [37, 38, 17, 33, 19] in the context of minimax saddle-point problems.

3 The Stochastic First-Order Methods for Strongly Monotone VI

Let us start this section by introducing the notations to facilitate our analysis. We shall denote the stochastic oracle as $\hat{F}(\cdot)$, suppressing the notation $\xi$ whenever it is clear from the context. For example, $\hat{F}(z^{k})$ is associated with the random sample $\xi^{k}\in\Xi$. In addition, we denote $P_{\mathcal{Z}}(\cdot)$ as the projection operator onto the feasible set $\mathcal{Z}$.

3.1 The stochastic extra-point scheme

We first present the iterative updating rule for the stochastic extra-point scheme:

$$\left\{\begin{array}{lcl}z^{k+0.5}&:=&P_{\mathcal{Z}}\left(z^{k}+\beta(z^{k}-z^{k-1})-\eta\hat{F}(z^{k})\right),\\ z^{k+1}&:=&P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right),\end{array}\right. \tag{9}$$

for $k=0,1,\ldots$, where $\{z^{k}\mid k=0,1,\ldots\}$ is the sequence of iterates, and $\{z^{k+0.5}\mid k=0,1,\ldots\}$ is the sequence of extra points, which helps to produce the sequence of iterates.
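
The update (9) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the text does not fix: the feasible set is taken to be a box (so the projection is a coordinate-wise clip), the initialization is $z^{-1}:=z^{0}$, and the oracle value $\hat{F}(z^{k-1})$ is reused from the previous iteration instead of being resampled. The names `project_box` and `extra_point` are hypothetical.

```python
import math
import random


def project_box(z, lo=-10.0, hi=10.0):
    """Euclidean projection onto the box [lo, hi]^n (an assumed feasible set)."""
    return [min(max(v, lo), hi) for v in z]


def extra_point(F_hat, z0, alpha, beta, gamma, eta, tau, iters, project=project_box):
    """Run `iters` steps of the extra-point update (9) with oracle F_hat."""
    z_prev = list(z0)            # assumption: z^{-1} := z^0
    z = list(z0)
    g_prev = F_hat(z_prev)       # stored value of F_hat(z^{k-1})
    for _ in range(iters):
        g = F_hat(z)             # F_hat(z^k)
        mom = [a - b for a, b in zip(z, z_prev)]   # heavy-ball direction z^k - z^{k-1}
        # Extra point z^{k+0.5}.
        z_half = project([zi + beta * mi - eta * gi
                          for zi, mi, gi in zip(z, mom, g)])
        g_half = F_hat(z_half)   # F_hat(z^{k+0.5})
        # New iterate z^{k+1}.
        z_next = project([zi - alpha * gh + gamma * mi - tau * (gi - gp)
                          for zi, gh, mi, gi, gp in zip(z, g_half, mom, g, g_prev)])
        z_prev, z, g_prev = z, z_next, g
    return z


if __name__ == "__main__":
    # Toy strongly monotone operator with solution z* = 0 and additive Gaussian noise.
    MU, A, NOISE = 0.5, 2.0, 0.05
    L = math.hypot(MU, A)
    kappa = L / MU
    rng = random.Random(0)

    def noisy_F(z):
        return [MU * z[0] + A * z[1] + rng.gauss(0.0, NOISE),
                -A * z[0] + MU * z[1] + rng.gauss(0.0, NOISE)]

    z = extra_point(noisy_F, [1.0, 1.0], alpha=1 / (4 * L), beta=1 / (64 * kappa),
                    gamma=1 / (64 * kappa), eta=1 / (4 * L), tau=1 / (64 * L * kappa),
                    iters=1000)
    print("distance to solution:", math.hypot(*z))
```

Each iteration calls the oracle twice (at $z^{k}$ and $z^{k+0.5}$) and projects twice, which is the cost the extra-momentum scheme of Section 3.2 reduces.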

In the case of deterministic strongly monotone VI, we introduced in our previous work [6] a unifying extra-point updating scheme, which includes specific first-order search directions such as the extra-gradient, the heavy-ball method, the optimistic method, and Nesterov’s extrapolation; these are incorporated with the parameters $\alpha,\beta,\gamma,\eta,\tau\geq 0$. As any specific configuration of these parameters should be tailored to the problem structure at hand, our goal is to provide conditions on the parameters under which an optimal iteration complexity can be guaranteed. This line of analysis will now be extended to solve stochastic VI as given in (9). We shall first establish the relational inequalities between subsequent iterates in terms of the expected squared distance to the unique solution $z^{*}$, denoted by $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$.

Lemma 3.1.

For the sequences $\{z^{k}\mid k=0,1,\ldots\}$ and $\{z^{k+0.5}\mid k=0,1,\ldots\}$ generated from the stochastic extra-point scheme (9), the following inequality holds:

$$\begin{aligned}
&(1-4|\gamma-\beta|-\tau L)d_{k+1}\\
\leq{}&\left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\
&+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].
\end{aligned} \tag{10}$$
Proof.

See Appendix A.1. ∎

Lemma 3.1 forms a basis for the desired linear convergence, and it is possible to identify the conditions on the parameters $\alpha,\beta,\gamma,\eta,\tau$ in order to achieve linear convergence. Consider parameters satisfying

$$\left\{\begin{array}{l}\eta=\alpha,\\ 2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\leq 0,\end{array}\right. \tag{13}$$

and denote

$$\left\{\begin{array}{l}t_{1}=\alpha\mu-4\gamma-6|\gamma-\beta|-4\tau L,\\ t_{2}=2|\gamma-\beta|+2\gamma+4\tau L,\\ t_{3}=4|\gamma-\beta|+\tau L.\end{array}\right. \tag{17}$$

Then we obtain from (10) that

$$(1-t_{3})d_{k+1}\leq\left(1-t_{1}\right)d_{k}+t_{2}d_{k-1}+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D. \tag{18}$$

With additional constraints on $t_{1},t_{2},t_{3}$, the variance-reduced convergence result is summarized in the next theorem.

Theorem 3.2.

For non-negative parameters $\alpha,\beta,\gamma,\eta,\tau$ satisfying (13) and (17), suppose that

$$\left\{\begin{array}{l}0\leq t_{3}<t_{1}<1,\\ t_{2}<t_{1}-t_{3}.\end{array}\right. \tag{21}$$

Let $q=\frac{2(1-t_{3})}{t_{1}-t_{2}-t_{3}}>1$. For a fixed precision $\epsilon>0$, denote $K=\mathcal{O}\left(q\cdot\ln\left(\frac{1}{\epsilon}\right)\right)$. Then, we have

$$d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\mathcal{O}(\epsilon)+\mathcal{O}(\sigma^{2})+\mathcal{O}(\delta D). \tag{22}$$
Proof.

See Appendix A.2. ∎

Regarding Theorem 3.2, a few remarks are in order. First, as we remarked earlier, the boundedness condition (4) can be removed. However, the analysis will become much longer and more cumbersome; we keep it here for simplicity. Second, a common way to achieve variance reduction is through increasing the mini-batch sample sizes. In fact, we may fix the sample size at the beginning with order $\left(1-\frac{1}{q}\right)^{-K}$, or it increases linearly at a rate $\left(1+\frac{1}{q}\right)$ as $k$ increases. We shall discuss more on this strategy in Section 4. Finally, we note that without variance reduction, it is possible to adopt diminishing step sizes $(\alpha_{k},\beta_{k},\gamma_{k},\eta_{k},\tau_{k})$ instead of fixing step sizes as we have assumed so far. The optimal uniform sublinear convergence rate $\frac{1}{k}$ can be established through a separate analysis continued from Lemma 3.1. The details can be found in Appendix B.
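
As a rough illustration of the mini-batch remark, the sketch below builds a batch-size schedule and an averaged oracle. It rests on assumptions not stated in the text: the growth rule is read as multiplying the batch size by $\left(1+\frac{1}{q}\right)$ at every iteration, starting from a hypothetical base size `n0`, and the samples are taken to be independent so that averaging $N$ of them divides the variance bound $\sigma^{2}$ by $N$. The function names are hypothetical.

```python
import math


def batch_sizes(q, K, n0=1):
    """Batch sizes N_k = ceil(n0 * (1 + 1/q)^k) for k = 0, ..., K-1 (assumed growth rule)."""
    return [math.ceil(n0 * (1.0 + 1.0 / q) ** k) for k in range(K)]


def mini_batch_oracle(sample_oracle, z, n):
    """Average n oracle samples at z; sample_oracle(z) returns one noisy vector."""
    total = [0.0] * len(z)
    for _ in range(n):
        total = [t + g for t, g in zip(total, sample_oracle(z))]
    return [t / n for t in total]


if __name__ == "__main__":
    print(batch_sizes(q=10.0, K=30))
```

The total number of samples is the sum of the schedule, which is how an iteration complexity translates into a sample complexity.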

The next proposition concludes this subsection with a specific choice of the parameters.

Proposition 3.3.

If one chooses the following parameters

$$(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right)$$

in (9) (thus $(t_{1},t_{2},t_{3})=\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)$), then it holds that

$$d_{k}\leq\left(1-\frac{1}{256\kappa}\right)^{k}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot 256\kappa.$$
Proof.

See Appendix A.3. ∎
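
The values of $(t_{1},t_{2},t_{3})$ quoted in Proposition 3.3, and the conditions (13) and (21), can be checked with exact rational arithmetic. The sketch below does this for a given $\kappa$; the function name is hypothetical, and $L$ defaults to $1$ only as a normalization chosen for the illustration.

```python
from fractions import Fraction


def extra_point_constants(kappa, L=1):
    """Evaluate t1, t2, t3 from (17), the left-hand side of (13), and q for Proposition 3.3."""
    kappa, L = Fraction(kappa), Fraction(L)
    mu = L / kappa
    alpha = eta = 1 / (4 * L)
    beta = gamma = 1 / (64 * kappa)
    tau = 1 / (64 * L * kappa)
    gap = abs(gamma - beta)
    t1 = alpha * mu - 4 * gamma - 6 * gap - 4 * tau * L
    t2 = 2 * gap + 2 * gamma + 4 * tau * L
    t3 = 4 * gap + tau * L
    lhs13 = 2 * alpha ** 2 * L ** 2 + 2 * gap + 2 * gamma + 2 * alpha * mu - 1
    q = 2 * (1 - t3) / (t1 - t2 - t3)   # the constant q of Theorem 3.2
    return t1, t2, t3, lhs13, q


if __name__ == "__main__":
    t1, t2, t3, lhs13, q = extra_point_constants(5)
    print(t1, t2, t3, lhs13, q)
```

With this choice $t_{1}-t_{2}-t_{3}=\frac{1}{64\kappa}$, so $q=128\kappa-2$, which is of order $\kappa$ as Theorem 3.2 requires for the $\mathcal{O}(\kappa\ln(1/\epsilon))$ iteration count.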

3.2 The stochastic extra-momentum scheme

In this subsection, we present an alternative stochastic first-order method that achieves the optimal iteration complexity as well, the stochastic extra-momentum scheme:

$$z^{k+1}:=P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right), \tag{23}$$

for $k=0,1,\ldots$.

Compared with the stochastic extra-point scheme (9), the above update (23) manipulates only the momentum terms alongside the stochastic gradient direction (the notion “gradient” here refers to the mapping $F(\cdot)$ in the VI model), namely the heavy-ball direction $z^{k}-z^{k-1}$ and the optimism direction $\hat{F}(z^{k})-\hat{F}(z^{k-1})$. Since it maintains a single sequence $\{z^{k}\}$ throughout the iterations, this scheme requires one projection per iteration, as compared to two projections in the case of the stochastic extra-point scheme. We shall remark that the method proposed in Kotsalis et al. [15] only considers the optimism term. Therefore, the stochastic extra-momentum scheme introduced above may be viewed as a generalization.
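
A minimal Python sketch of update (23) follows. As with any such illustration, several details are assumptions rather than part of the paper: the feasible set is a box, the initialization is $z^{-1}:=z^{0}$, and the stored oracle value $\hat{F}(z^{k-1})$ from the previous iteration is reused. The names `project_box` and `extra_momentum` are hypothetical.

```python
import math


def project_box(z, lo=-10.0, hi=10.0):
    """Euclidean projection onto the box [lo, hi]^n (an assumed feasible set)."""
    return [min(max(v, lo), hi) for v in z]


def extra_momentum(F_hat, z0, alpha, gamma, tau, iters, project=project_box):
    """Run `iters` steps of the extra-momentum update (23) with oracle F_hat."""
    z_prev = list(z0)          # assumption: z^{-1} := z^0
    z = list(z0)
    g_prev = F_hat(z_prev)     # stored value of F_hat(z^{k-1})
    for _ in range(iters):
        g = F_hat(z)           # the only oracle call in this iteration
        z_next = project([zi - alpha * gi + gamma * (zi - zp) - tau * (gi - gp)
                          for zi, zp, gi, gp in zip(z, z_prev, g, g_prev)])
        z_prev, z, g_prev = z, z_next, g
    return z


if __name__ == "__main__":
    # Toy noiseless operator with solution z* = 0, used only to exercise the update.
    MU, A = 0.5, 2.0
    L = math.hypot(MU, A)
    kappa, theta = L / MU, 1 / 8
    alpha = 1 / (4 * L)
    z = extra_momentum(lambda z: [MU * z[0] + A * z[1], -A * z[0] + MU * z[1]],
                       [1.0, 1.0], alpha=alpha, gamma=1 / (8 * (kappa + theta)),
                       tau=alpha / (1 + theta / kappa), iters=2000)
    print("distance to solution:", math.hypot(*z))
```

The loop body makes one oracle call and one projection, in line with the per-iteration cost described above.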

As in the previous subsection, we shall first establish a relational inequality between the iterates. As we can see from the lemma below, the structure of this relational inequality is in fact quite different from the previous case. The detailed proof can be found in the appendix.

Lemma 3.4.

For the sequence $\{z^{k}\mid k=0,1,\ldots\}$ generated from the stochastic extra-momentum scheme (23), the following inequality holds:

$$\begin{aligned}
&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\
\leq{}&\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\
&+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}.
\end{aligned} \tag{24}$$
Proof.

See Appendix A.4. ∎

Observe that each of the terms on the LHS of (24) differs in the iteration index from the RHS exactly by one. This property enables us to design a possible potential function that measures the convergence of the iterative process. We shall specify additional conditions on the non-negative parameters $\alpha,\gamma,\tau$ in order to further simplify (24):

$$1+\alpha\mu-\gamma\geq 1+\frac{\theta}{\kappa},\quad\frac{\alpha}{\tau}=1+\frac{\theta}{\kappa},\quad\frac{1}{8\tau^{2}L^{2}+2\gamma}\geq 1+\frac{\theta}{\kappa}, \tag{25}$$

where $\theta\in(0,1]$ is some constant independent of $\kappa$. Note that the LHS of each inequality in (25) is the ratio between the coefficients on the LHS and RHS of (24) for each corresponding term. Therefore, the relation (24) can be rearranged as:

$$\begin{aligned}
&\left(1+\frac{\theta}{\kappa}\right)\mathbb{E}\left[\frac{1}{2}\|z^{k+1}-z^{*}\|^{2}+\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k+1}-z^{k}\|^{2}\right]\\
\leq{}&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\
\leq{}&\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\
&+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}.
\end{aligned} \tag{26}$$

Now, by defining the potential function $V_{k}$ as

$$V_{k}=\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right],$$

inequality (26) can be rewritten as

$$\left(1+\frac{\theta}{\kappa}\right)V_{k+1}\leq V_{k}+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}. \tag{27}$$
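
To see how (27) produces a linear rate, one can unroll the recursion. The following is only a sketch of that step, with the shorthand $r$ and $c$ introduced here for brevity; the precise constants in the final bound come from the proof in Appendix A.5, which also has to relate $V_{k}$ back to $\|z^{k}-z^{*}\|^{2}$.

```latex
% Shorthand (not in the paper): r := (1 + \theta/\kappa)^{-1},
% c := 8\tau^2\sigma^2 + \alpha\delta^2/(2\mu), so that (27) reads V_{k+1} \le r V_k + r c.
\begin{aligned}
V_{k} &\leq r V_{k-1} + r c \leq \cdots \leq r^{k} V_{0} + c\sum_{i=1}^{k} r^{i}
       \leq r^{k} V_{0} + \frac{r}{1-r}\, c \\
      &= \left(1+\frac{\theta}{\kappa}\right)^{-k} V_{0}
       + \frac{\kappa}{\theta}\left(8\tau^{2}\sigma^{2} + \frac{\alpha\delta^{2}}{2\mu}\right),
\end{aligned}
% using r/(1-r) = \kappa/\theta.
```

The noise and bias terms are thus amplified by a factor of order $\kappa/\theta$, which matches the way $\kappa/\theta$ enters the error terms of the theorem that follows.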

This leads to our final results, as summarized in the next theorem:

Theorem 3.5.

Suppose that the non-negative parameters $\alpha,\gamma,\tau$ satisfy (25) for some constant $\theta\in(0,1]$. For the sequence $\{z^{k}\mid k=0,1,\ldots\}$ generated from the stochastic extra-momentum scheme (23), the expected squared distance of the iterate $z^{k}$ to the solution is bounded by:

$$\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]\leq 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\left(\frac{\kappa}{\theta}+1\right)\cdot 32\tau^{2}\sigma^{2}+\frac{2\kappa\alpha\delta^{2}}{\theta\mu}. \tag{28}$$
Proof.

See Appendix A.5. ∎

A simple choice of parameters leads to:

Proposition 3.6.

If we choose the parameters as

$$(\alpha,\tau,\gamma)=\left(\frac{1}{4L},\frac{\alpha}{1+\frac{\theta}{\kappa}},\frac{1}{8(\kappa+\theta)}\right),\quad\theta=\frac{1}{8},$$

then scheme (23) assures that

$$d_{k}\leq 2\left(1-\frac{1}{8\kappa+1}\right)^{k}\|z^{0}-z^{*}\|^{2}+\frac{128\sigma^{2}}{\mu(8L+\mu)}+\frac{4\delta^{2}}{\mu^{2}}. \tag{29}$$
Proof.

It follows by substituting the parameter choice into (28). ∎
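
The substitution behind Proposition 3.6 can be verified with exact rational arithmetic: the sketch below checks that the stated parameters satisfy (25) and that the three terms of (28) reduce to those of (29). The function names are hypothetical, and $L$ defaults to $1$ only as a normalization for the illustration.

```python
from fractions import Fraction


def extra_momentum_parameters(kappa, L=1):
    """Return (mu, L, theta, alpha, tau, gamma) for the choice in Proposition 3.6."""
    kappa, L = Fraction(kappa), Fraction(L)
    mu = L / kappa
    theta = Fraction(1, 8)
    alpha = 1 / (4 * L)
    tau = alpha / (1 + theta / kappa)
    gamma = 1 / (8 * (kappa + theta))
    return mu, L, theta, alpha, tau, gamma


def satisfies_25(kappa, L=1):
    """Check the three conditions in (25) exactly."""
    mu, L, theta, alpha, tau, gamma = extra_momentum_parameters(kappa, L)
    rho = 1 + theta / Fraction(kappa)
    return (1 + alpha * mu - gamma >= rho
            and alpha / tau == rho
            and 1 / (8 * tau ** 2 * L ** 2 + 2 * gamma) >= rho)


def terms_of_28(kappa, L=1):
    """Contraction factor, sigma^2 coefficient and delta^2 coefficient appearing in (28)."""
    mu, L, theta, alpha, tau, gamma = extra_momentum_parameters(kappa, L)
    kappa = Fraction(kappa)
    rate = 1 / (1 + theta / kappa)
    sigma_coeff = (kappa / theta + 1) * 32 * tau ** 2
    delta_coeff = 2 * kappa * alpha / (theta * mu)
    return rate, sigma_coeff, delta_coeff


if __name__ == "__main__":
    print(satisfies_25(10), terms_of_28(10))
```

In particular $\left(1+\frac{\theta}{\kappa}\right)^{-1}=1-\frac{1}{8\kappa+1}$ for $\theta=\frac{1}{8}$, which is the contraction factor in (29).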

This shows that if we run the stochastic extra-momentum scheme (23) with the above parameter choice, then in $K=\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ iterations we will reach a solution satisfying

$$\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\mathcal{O}(\epsilon)+\mathcal{O}\left(\frac{\sigma^{2}}{\mu L}\right)+\mathcal{O}\left(\frac{\delta^{2}}{\mu^{2}}\right).$$

4 A Stochastic Zeroth-Order Approach to Saddle-Point Problems

In this section, we shall apply the proposed stochastic extra-point/extra-momentum scheme to solve the following saddle-point problem without needing to compute the gradients of $f$:

$$\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}f(x,y), \tag{30}$$

where $\mathcal{X}\subseteq\mathbb{R}^{n}$, $\mathcal{Y}\subseteq\mathbb{R}^{m}$ are convex sets, $f(x,y)$ is strongly convex (with fixed $y$) and strongly concave (with fixed $x$) with modulus $\mu$, and the partial gradients $\nabla_{x}f(x,y)$/$\nabla_{y}f(x,y)$ are Lipschitz continuous with constant $L_{x}/L_{y}$ for fixed $y/x$, and with constant $L_{xy}$ for fixed $x/y$. We let $L=2\cdot\max(L_{x},L_{y},L_{xy})$. We further assume that the function $f(x,y)$ is Lipschitz continuous for either fixed $x$ or $y$ with constant $M$. This implies that the norms of the partial gradients are bounded by $M$:

$$\|\nabla_{x}f(x,y)\|\leq M,\quad\|\nabla_{y}f(x,y)\|\leq M.$$

In particular, we consider the setting where the partial gradients $\nabla_{x}f(x,y)$ and $\nabla_{y}f(x,y)$ (and any higher-order information) are not available. Furthermore, the exact evaluation of the function value itself is also not available; instead, we can only access a stochastic oracle $\hat{f}(x,y,\xi)$, which satisfies the following assumption:

$$\left\{\begin{array}{l}\mathbb{E}\left[\hat{f}(x,y,\xi)\right]=f(x,y),\\ \mathbb{E}\left[\nabla_{x}\hat{f}(x,y,\xi)\right]=\nabla_{x}f(x,y),\\ \mathbb{E}\left[\nabla_{y}\hat{f}(x,y,\xi)\right]=\nabla_{y}f(x,y),\\ \mathbb{E}\left[\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\leq\sigma^{2},\\ \mathbb{E}\left[\|\nabla_{y}\hat{f}(x,y,\xi)-\nabla_{y}f(x,y)\|^{2}\right]\leq\sigma^{2}.\end{array}\right. \tag{40}$$

Now, we shall use the so-called smoothing technique to approximate the first-order information, which then enables us to apply the proposed stochastic methods for VI, which includes the saddle-point model as a special case. In particular, we use a randomized smoothing scheme using uniform distributions $U_{b}$/$V_{b}$ over the unit Euclidean ball $B$ in the $\mathbb{R}^{n}$/$\mathbb{R}^{m}$ space, respectively. The smoothing functions with parameters $\rho_{x},\rho_{y}>0$ are defined as follows:

$$\begin{aligned}
f_{\rho_{x}}(x,y)&:=\mathbb{E}_{u\sim U_{b}}\left[f(x+\rho_{x}u,y)\right]=\frac{1}{\alpha(n)}\int_{B}f(x+\rho_{x}u,y)\,du,\\
f_{\rho_{y}}(x,y)&:=\mathbb{E}_{v\sim V_{b}}\left[f(x,y+\rho_{y}v)\right]=\frac{1}{\alpha(m)}\int_{B}f(x,y+\rho_{y}v)\,dv,
\end{aligned}$$

where $\alpha(n)$/$\alpha(m)$ is the volume of the unit ball in $\mathbb{R}^{n}$/$\mathbb{R}^{m}$.

Let us summarize the main properties of the smoothing functions $f_{\rho_{x}},\,f_{\rho_{y}}$ below:

Lemma 4.1.

Let $U_{S_{p}}/V_{S_{p}}$ be the uniform distribution on the unit sphere $S_{p}$ in $\mathbb{R}^{n}/\mathbb{R}^{m}$. Then,

  1. The smoothing functions $f_{\rho_{x}},f_{\rho_{y}}$ are continuously differentiable, and their partial gradients $\nabla_{x}f_{\rho_{x}},\nabla_{y}f_{\rho_{y}}$ can be expressed as:

    $$\begin{aligned}
    \nabla_{x}f_{\rho_{x}}(x,y)&:=\mathbb{E}_{u\sim U_{S_{p}}}\left[\frac{n}{\rho_{x}}f(x+\rho_{x}u,y)u\right]=\frac{1}{\beta(n)}\int_{u\in S_{p}}\frac{n}{\rho_{x}}\left(f(x+\rho_{x}u,y)-f(x,y)\right)u\,du,\\
    \nabla_{y}f_{\rho_{y}}(x,y)&:=\mathbb{E}_{v\sim V_{S_{p}}}\left[\frac{m}{\rho_{y}}f(x,y+\rho_{y}v)v\right]=\frac{1}{\beta(m)}\int_{v\in S_{p}}\frac{m}{\rho_{y}}\left(f(x,y+\rho_{y}v)-f(x,y)\right)v\,dv,
    \end{aligned}$$

    where $\beta(n)/\beta(m)$ is the surface area of the unit sphere in $\mathbb{R}^{n}/\mathbb{R}^{m}$.

  2. For any $x\in\mathbb{R}^{n}$ and any $y\in\mathbb{R}^{m}$, we have:

    $$\left\{\begin{array}{l}\|\nabla_{x}f_{\rho_{x}}(x,y)-\nabla_{x}f(x,y)\|\leq\frac{\rho_{x}nL}{2},\\ \|\nabla_{y}f_{\rho_{y}}(x,y)-\nabla_{y}f(x,y)\|\leq\frac{\rho_{y}mL}{2},\end{array}\right. \tag{44}$$

    $$\left\{\begin{array}{l}\mathbb{E}_{u\sim U_{S_{p}}}\left[\left\|\frac{n}{\rho_{x}}\left(f(x+\rho_{x}u,y)-f(x,y)\right)u\right\|^{2}\right]\leq 2n\|\nabla_{x}f(x,y)\|^{2}+\frac{\rho_{x}^{2}L^{2}n^{2}}{2},\\ \mathbb{E}_{v\sim V_{S_{p}}}\left[\left\|\frac{m}{\rho_{y}}\left(f(x,y+\rho_{y}v)-f(x,y)\right)v\right\|^{2}\right]\leq 2m\|\nabla_{y}f(x,y)\|^{2}+\frac{\rho_{y}^{2}L^{2}m^{2}}{2}.\end{array}\right. \tag{48}$$
Proof.

For Statement 1, cf. [34] (Lemma 4.4), and for Statement 2, cf. [5] (proofs for Propositions 2.7.5 and 2.7.6). Note that the proofs for the minimax function follow simply by fixing one of the two variables. ∎

We are now ready to define the stochastic zeroth-order gradient as follows:

\[
\left\{\begin{array}{l}F_{\rho_{x}}(x,y,\xi,u):=\frac{n}{\rho_{x}}\left(\hat{f}(x+\rho_{x}u,y,\xi)-\hat{f}(x,y,\xi)\right)u,\\ F_{\rho_{y}}(x,y,\xi,v):=\frac{m}{\rho_{y}}\left(\hat{f}(x,y+\rho_{y}v,\xi)-\hat{f}(x,y,\xi)\right)v,\end{array}\right. \tag{51}
\]

where $u$ and $v$ are uniformly distributed random vectors over the unit spheres in $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$ respectively.
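
As a rough illustration of (51), the following sketch evaluates the $x$-block estimator for a single sample in plain Python. The helper names `sample_unit_sphere` and `zo_grad_x`, and the convention of passing the noise realization as an argument `xi`, are assumptions made for this example and do not come from the paper.

```python
import math
import random

def sample_unit_sphere(dim, rng):
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def zo_grad_x(f_hat, x, y, xi, u, rho_x):
    """One-sample estimator F_{rho_x}(x, y, xi, u) of (51).

    Both function evaluations use the same noise realization xi,
    exactly as the two f_hat terms in (51) share the same xi.
    """
    n = len(x)
    x_shift = [xj + rho_x * uj for xj, uj in zip(x, u)]
    diff = f_hat(x_shift, y, xi) - f_hat(x, y, xi)
    return [(n / rho_x) * diff * uj for uj in u]
```

The $y$-block estimator is analogous, with the perturbation applied to $y$ and the factor $m/\rho_{y}$. Since both evaluations share the same $\xi$, any additive noise cancels in the difference.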

The next lemma shows that such stochastic zeroth-order gradients are unbiased with respect to the gradients of the smoothing functions and have uniformly bounded variance.

Lemma 4.2.

The stochastic zeroth-order gradients defined in (51) are unbiased and have bounded variance for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$:

\[
\left\{\begin{array}{l}\mathbb{E}_{\xi,u}\left[F_{\rho_{x}}(x,y,\xi,u)\right]=\nabla_{x}f_{\rho_{x}}(x,y),\\ \\ \mathbb{E}_{\xi,v}\left[F_{\rho_{y}}(x,y,\xi,v)\right]=\nabla_{y}f_{\rho_{y}}(x,y),\end{array}\right. \tag{55}
\]

and

\[
\left\{\begin{array}{l}\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)-\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]\leq\tilde{\sigma}^{2},\\ \\ \mathbb{E}_{\xi,v}\left[\|F_{\rho_{y}}(x,y,\xi,v)-\nabla_{y}f_{\rho_{y}}(x,y)\|^{2}\right]\leq\tilde{\sigma}^{2},\end{array}\right. \tag{59}
\]

where

\[
\tilde{\sigma}^{2}=2\cdot\max\left\{nM^{2}+n\sigma^{2}+n^{2}\rho_{x}^{2}L^{2},\,mM^{2}+m\sigma^{2}+m^{2}\rho_{y}^{2}L^{2}\right\}.
\]
Proof.

See Appendix A.6. ∎

Before applying the stochastic extra-point/extra-momentum scheme to solve (30), let us first introduce the connections between these two models. As we regard the saddle-point model as a special case of VI, we shall treat the variables $x,y$ in the saddle-point problem as one variable and denote $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, $z=(x,y)$. Additionally, we define:

\[
F(z):=\begin{pmatrix}\nabla_{x}f(x,y)\\ -\nabla_{y}f(x,y)\end{pmatrix},\quad F_{\rho}(z):=\begin{pmatrix}\nabla_{x}f_{\rho_{x}}(x,y)\\ -\nabla_{y}f_{\rho_{y}}(x,y)\end{pmatrix},\quad\hat{F}_{\rho}(z,u,v,\xi):=\begin{pmatrix}F_{\rho_{x}}(x,y,\xi,u)\\ -F_{\rho_{y}}(x,y,\xi,v)\end{pmatrix}.
\]

These terms correspond to the gradient of $f(x,y)$, the gradients of the smoothing functions $f_{\rho_{x}}(x,y)$ and $f_{\rho_{y}}(x,y)$, and the stochastic zeroth-order gradient, respectively. Note that we have flipped the sign of the partial gradient corresponding to $y$ to account for the concavity of $f$ with respect to $y$.

Finally, as we shall use a sample size of $t_{k}\in\mathbb{N}$ (a natural number) at iteration $k$, we reserve the subscripts for the random vectors $\xi,u,v$ for the sample index $(i)$, $i=1,...,t_{k}$, and denote:

\[
\hat{F}^{k}_{\rho}(z^{k})=\frac{1}{t_{k}}\sum\limits_{i=1}^{t_{k}}\hat{F}_{\rho}(z^{k},u^{k}_{(i)},v^{k}_{(i)},\xi^{k}_{(i)}).
\]

In the above definition we suppress the notation of the random vectors $u,v,\xi$ on the LHS for cleaner presentation. Note that by the law of large numbers, together with (55)-(59), we have

\[
\mathbb{E}\left[\hat{F}^{k}_{\rho}(z^{k})\right]=F_{\rho}(z^{k}), \tag{60}
\]
\[
\mathbb{E}\left[\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})\|^{2}\right]\leq\frac{2\tilde{\sigma}^{2}}{t_{k}}. \tag{61}
\]
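
The effect of averaging in (60)-(61) can be checked numerically on a toy problem. The sketch below is only an illustration under simplifying assumptions: it uses a noiseless linear function $f(x)=a^{\top}x$ of a single block (for which the smoothed gradient equals $a$ exactly), and the helper names `unit_sphere`, `averaged_zo_grad`, and `sq_error` are hypothetical.

```python
import math
import random

def unit_sphere(dim, rng):
    """Uniform sample on the unit sphere (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    s = math.sqrt(sum(c * c for c in g))
    return [c / s for c in g]

def averaged_zo_grad(f, x, rho, t, rng):
    """Average of t independent difference-based estimators (mini-batch of size t)."""
    n = len(x)
    fx = f(x)  # noiseless toy function, so the base value can be reused
    avg = [0.0] * n
    for _ in range(t):
        u = unit_sphere(n, rng)
        diff = f([xj + rho * uj for xj, uj in zip(x, u)]) - fx
        for j in range(n):
            avg[j] += (n / rho) * diff * u[j] / t
    return avg

def sq_error(g, target):
    """Squared Euclidean distance between an estimate and the target gradient."""
    return sum((gi - ti) ** 2 for gi, ti in zip(g, target))

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    f = lambda x: sum(ai * xj for ai, xj in zip(a, x))
    rng = random.Random(0)
    for t in (10, 100, 1000, 10000):
        print(t, sq_error(averaged_zo_grad(f, [0.0, 0.0, 0.0], 0.1, t, rng), a))
```

In such a run the squared error should shrink roughly in proportion to $1/t$, which is the behavior that the bound (61) encodes.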

4.1 Sample complexity analysis: stochastic zeroth-order extra-point method

Recall that our objective is

\[
\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}f(x,y).
\]

With only the noisy function value $\hat{f}(x,y,\xi)$ accessible, we propose the stochastic zeroth-order extra-point method that updates $z:=(x,y)$ simultaneously with the following update rule:

\[
\left\{\begin{array}{lcl}z^{k+0.5}&:=&P_{\mathcal{Z}}\left(z^{k}+\beta(z^{k}-z^{k-1})-\eta\hat{F}^{k}_{\rho}(z^{k})\right),\\ z^{k+1}&:=&P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)\right).\end{array}\right. \tag{64}
\]

Compared with its original variant (9) for solving stochastic VI, the update direction $\hat{F}(z^{k})$ is replaced by the averaged stochastic zeroth-order gradient $\hat{F}^{k}_{\rho}(z^{k})$ with sample size $t_{k}$ (similarly, $\hat{F}(z^{k+0.5})$ is replaced by $\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})$ with sample size $t_{k+0.5}$). This circumvents the inaccessible first-order information and equips us with appropriate tools to reduce the variance and achieve overall linear convergence.
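
A minimal sketch of the update (64) is given below, assuming a generic callable `oracle` that stands in for the averaged zeroth-order gradient and a callable `project` that stands in for $P_{\mathcal{Z}}$. The function names, the box projection, and the initialization $z^{-1}=z^{0}$ are assumptions made for illustration only.

```python
def project_box(z, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (a stand-in for P_Z)."""
    return [min(max(zi, lo), hi) for zi in z]

def extra_point_step(z, z_prev, g, g_prev, oracle, params, project):
    """One iteration of (64); g and g_prev are oracle values at z^k and z^{k-1}."""
    alpha, beta, gamma, eta, tau = params
    mom = [a - b for a, b in zip(z, z_prev)]  # heavy-ball direction z^k - z^{k-1}
    z_half = project([zi + beta * mi - eta * gi
                      for zi, mi, gi in zip(z, mom, g)])
    g_half = oracle(z_half)  # oracle value at the extra point z^{k+0.5}
    return project([zi - alpha * gh + gamma * mi - tau * (gi - gp)
                    for zi, gh, mi, gi, gp in zip(z, g_half, mom, g, g_prev)])

def run_extra_point(oracle, z0, params, project, iters):
    """Run the scheme from z^{-1} = z^0 = z0 (an assumed initialization)."""
    z_prev, z = list(z0), list(z0)
    g_prev = oracle(z_prev)
    g = g_prev
    for _ in range(iters):
        z_next = extra_point_step(z, z_prev, g, g_prev, oracle, params, project)
        z_prev, g_prev = z, g
        z = z_next
        g = oracle(z)
    return z
```

Each iteration calls the oracle twice (at $z^{k}$ and at $z^{k+0.5}$) and the projection twice, while the value at $z^{k-1}$ is reused from the previous iteration.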

The next lemma establishes the relational inequality between the subsequent iterates in terms of the expected distance to the solution $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$, similar to what we did in Section 3.1. The differences lie in the corresponding stochastic error terms shown below. Note that in each iteration we take two batches of samples, of sizes $t_{k}$ and $t_{k+0.5}$. The batch size $t_{k-1}$ also appears because the iterate $z^{k-1}$ is used in each iteration.

Lemma 4.3.

For the sequences $\{z^{k}\mid k=0,1,...\}$ and $\{z^{k+0.5}\mid k=0,1,...\}$ generated from the stochastic zeroth-order extra-point method (64), the following inequality holds:

\[
\begin{aligned}
(1-4|\gamma-\beta|-\tau L)d_{k+1} \leq{}& \left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}\\
&+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\
&+\left(\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&+16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right)+\alpha LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}\\
&+4(\tau L+\alpha^{2}L^{2})(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})\\
&+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k})\right].
\end{aligned}
\tag{65}
\]
Proof.

See Appendix A.7. ∎

With the relational inequality in Lemma 4.3, we shall adopt the same conditions: (13), (17), and (21) for the parameters $\alpha,\beta,\gamma,\eta,\tau$. Therefore, the results in Theorem 3.2 are directly applicable. In addition, we are now equipped with the variable sample sizes $t_{k}$ and $t_{k+0.5}$ to control the variance terms, as well as the smoothing parameters $\rho_{x},\rho_{y}$ to control the bias terms:

\[
\alpha LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+4(\tau L+\alpha^{2}L^{2})(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}).
\]

We shall utilize the example in Proposition 3.3 to analyze the sample complexity of the proposed method. The result is provided in the next proposition:

Proposition 4.4 (Sample complexity result 1).

The stochastic zeroth-order extra-point method (64) with the following parameter choice:

\[
(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right)
\]

and

\[
t_{k+0.5}=t_{k}=\left\lceil K\left(1-\frac{1}{256\kappa}\right)^{-(k+1)}\right\rceil,\quad\rho_{x}=\frac{1}{\sqrt{2}n\kappa}\left(1-\frac{1}{256\kappa}\right)^{K},\quad\rho_{y}=\frac{1}{\sqrt{2}m\kappa}\left(1-\frac{1}{256\kappa}\right)^{K},
\]

where $K$ is the iteration count decided in advance, outputs $z^{K}$ such that $d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\epsilon$, with $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$, and the total sample complexity of the procedure is

\[
\sum\limits_{k=0}^{K-1}(t_{k}+t_{k+0.5})=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right).
\]
Proof.

See Appendix A.8. ∎
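
The batch sizes and smoothing parameters of Proposition 4.4 can be tabulated directly from the stated formulas. The sketch below does this for given $\kappa$, $K$, $n$, $m$; the function name `extra_point_schedule` and its return layout are assumptions made for illustration.

```python
import math

def extra_point_schedule(kappa, K, n, m):
    """Batch sizes t_k (= t_{k+0.5}), smoothing parameters, and total sample count."""
    q = 1.0 - 1.0 / (256.0 * kappa)  # contraction factor appearing in Proposition 4.4
    batch = [math.ceil(K * q ** (-(k + 1))) for k in range(K)]
    rho_x = q ** K / (math.sqrt(2.0) * n * kappa)
    rho_y = q ** K / (math.sqrt(2.0) * m * kappa)
    total = 2 * sum(batch)  # t_k + t_{k+0.5}, summed over k = 0, ..., K-1
    return batch, rho_x, rho_y, total

if __name__ == "__main__":
    batch, rho_x, rho_y, total = extra_point_schedule(kappa=10.0, K=500, n=10, m=20)
    print(batch[0], batch[-1], rho_x, rho_y, total)
```

The batch sizes grow geometrically in $k$ at the rate $\left(1-\frac{1}{256\kappa}\right)^{-1}$, so the last few iterations account for most of the samples.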

4.2 Sample complexity: stochastic zeroth-order extra-momentum method

Next we consider the stochastic zeroth-order extra-momentum method, with one projection per iteration:

\[
z^{k+1}:=P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}^{k}_{\rho}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)\right). \tag{66}
\]
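
The single-projection structure of (66) can be sketched as follows. As before, `oracle` is a generic callable standing in for the averaged zeroth-order gradient and `project` stands in for $P_{\mathcal{Z}}$; these names and the initialization $z^{-1}=z^{0}$ are assumptions made for illustration.

```python
def extra_momentum_step(z, z_prev, g, g_prev, alpha, gamma, tau, project):
    """One iteration of (66): gradient step, heavy-ball term, and optimism term."""
    return project([
        zi - alpha * gi + gamma * (zi - zp) - tau * (gi - gp)
        for zi, zp, gi, gp in zip(z, z_prev, g, g_prev)
    ])

def run_extra_momentum(oracle, z0, alpha, gamma, tau, project, iters):
    """Run the scheme from z^{-1} = z^0 = z0 (an assumed initialization)."""
    z_prev, z = list(z0), list(z0)
    g_prev = oracle(z_prev)
    g = g_prev
    for _ in range(iters):
        z_next = extra_momentum_step(z, z_prev, g, g_prev, alpha, gamma, tau, project)
        z_prev, g_prev = z, g
        z = z_next
        g = oracle(z)  # the only new oracle call in this iteration
    return z
```

Only one sequence is maintained, so each iteration needs a single projection and a single new oracle value; the value at $z^{k-1}$ is carried over from the previous iteration.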

The relational inequality, similar to Lemma 3.4, is established in the next lemma:

Lemma 4.5.

For the sequence $\{z^{k}\mid k=0,1,...\}$ generated from the stochastic zeroth-order extra-momentum method (66), the following inequality holds:

\[
\begin{aligned}
&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+1}_{\rho}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\
\leq{}& \mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\
&+16\tau^{2}\sigma^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}).
\end{aligned}
\tag{67}
\]
Proof.

See Appendix A.9. ∎

With the same condition as in (25) for the parameters $\alpha,\gamma,\tau$, we can derive a bound similar to (26) (with $\hat{F}(z^{k})$ replaced by $\hat{F}^{k}_{\rho}(z^{k})$ and with the new stochastic error expression) and define the potential function:

\[
V_{k}=\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right].
\]

Therefore, the following inequality holds:

\[
\left(1+\frac{\theta}{\kappa}\right)V_{k+1}\leq V_{k}+16\tau^{2}\sigma^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}), \tag{68}
\]

and we can apply the results directly from Theorem 3.5. In addition, with increasing sample sizes $t_{k}$ and suitably small smoothing parameters $\rho_{x},\rho_{y}$, we are able to control the variance and the bias terms in the above inequality, respectively. We give the sample complexity result in the next proposition.

Proposition 4.6 (Sample complexity result 2).

The stochastic zeroth-order extra-momentum method (66) with the following parameter choice:

\[
(\alpha,\tau,\gamma)=\left(\frac{1}{4L},\frac{\alpha}{1+\frac{\theta}{\kappa}},\frac{1}{8(\kappa+\theta)}\right),\quad\theta=\frac{1}{8},
\]

and

\[
t_{k}=\left\lceil K\left(1-\frac{1}{8\kappa+1}\right)^{-k}\right\rceil,\quad\rho_{x}=\frac{1}{\sqrt{2}n\kappa}\left(1-\frac{1}{8\kappa+1}\right)^{\frac{K}{2}},\quad\rho_{y}=\frac{1}{\sqrt{2}m\kappa}\left(1-\frac{1}{8\kappa+1}\right)^{\frac{K}{2}},
\]

where $K$ is the iteration count decided in advance, outputs $z^{K}$ such that $d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\epsilon$, with $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$, and the total sample complexity of the procedure is

\[
\sum\limits_{k=0}^{K-1}t_{k}=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right).
\]
Proof.

See Appendix A.10. ∎

We remark that the sample complexity of VS-Ave proposed in [9] for solving stochastic strongly monotone VI is given by $\mathcal{O}\left(\left(\frac{\kappa^{2}}{\epsilon}\right)^{\beta}\right)$ for some $\beta>1$, and its limiting case ($\beta\rightarrow 1$) differs from our results in Propositions 4.4 and 4.6 only by a factor of $\mathcal{O}\left(\ln\left(\frac{1}{\epsilon}\right)\right)$.

5 Numerical Experiments

In this section, we conduct an experiment that models a regularized two-player zero-sum matrix game with an uncertain payoff matrix. In particular, the payoff matrix $A_{\xi}$ is randomly distributed and can only be sampled for each (mixed) strategy. The problem can be formulated as follows:

\[
\begin{aligned}
\min\limits_{x\in\mathbb{R}^{n}}\max\limits_{y\in\mathbb{R}^{m}}\quad & f(x,y)=\mathbb{E}\left[\frac{\lambda_{x}}{2}\|x\|^{2}+x^{\top}A_{\xi}y-\frac{\lambda_{y}}{2}\|y\|^{2}\right]\\
\text{s.t.}\quad & \sum\limits_{i=1}^{n}x_{i}=1,\quad\sum\limits_{j=1}^{m}y_{j}=1,\\
& x,y\geq 0.
\end{aligned}
\tag{69}
\]
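
The constraint set in (69) is a product of two probability simplices, so the projection $P_{\mathcal{Z}}$ splits into one simplex projection for $x$ and one for $y$. The paper does not state which projection routine was used in the experiments; the sketch below shows one common sort-based routine, with the hypothetical name `project_simplex`.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:  # the j largest entries stay positive after shifting by t
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def project_product_of_simplices(x, y):
    """Projection onto the feasible set of (69): x and y are projected separately."""
    return project_simplex(x), project_simplex(y)
```

Because the feasible set is a Cartesian product, projecting the two blocks independently gives the projection of the stacked variable $z=(x,y)$.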

The experiment consists of two parts. In the first part, the random matrix $A_{\xi}$ is sampled element-wise from an i.i.d. normal distribution, $A_{\xi}\sim N\left(A_{0},\sigma^{2}I_{(n+m)}\right)$, where $A_{0}$ is pre-determined and randomly generated as follows. We partition $A_{0}=\begin{pmatrix}A_{11}&A_{12}\\ A_{21}&A_{22}\end{pmatrix}$, with each block matrix $A_{ij}\in\mathbb{R}^{\frac{n}{2}\times\frac{m}{2}}$. Each entry in $A_{ij}$ is generated randomly from $\mathrm{Unif}(a_{ij}-b_{ij},a_{ij}+b_{ij})$, where $a_{ij}$ and $b_{ij}$ are randomly generated from $\mathrm{Unif}(-30,30)$ and $\mathrm{Unif}(0,30)$ respectively. The problem parameters are set as: $n=10$, $m=20$, $\sigma^{2}=0.5$, $\lambda_{x}=\lambda_{y}=1$, $\kappa=\frac{L}{\mu}\approx 161$, where $L$ and $\mu$ are the largest and the smallest singular values of the Jacobian matrix $\begin{pmatrix}\lambda I_{n}&A_{0}\\ -A_{0}^{\top}&\lambda I_{m}\end{pmatrix}$ respectively. The smoothing parameters $\rho_{x},\rho_{y}$ are set to be on the order of $10^{-8}$, the total number of iterations $K$ is set to 1485, and the sample size is $t_{k}=k$ at iteration $k$.

In the second part, the random matrix $A_{\xi}$ is sampled element-wise from an i.i.d. log-normal distribution, which is known to have a fat tail. It is used to model multiplicative random variables that take positive values. We reuse the parameters $A_{0}$ and $\sigma^{2}$ from the first part. In particular, the samples are generated by $A_{\xi}=e^{\frac{A_{0}}{10}+\sigma Z}$, where $Z$ is sampled element-wise from the i.i.d. standard normal distribution $N(0,1)$. Therefore, the mean of this distribution is given by $A_{0}^{\prime}=\mathbb{E}\left[A_{\xi}\right]=e^{\frac{A_{0}}{10}+\frac{\sigma^{2}}{2}}$. We have $\kappa\approx 146$ and set $K=1345$ in this part, and the other parameters remain the same as in the first part ($\rho_{x},\rho_{y}$ are of the same order).

In both parts of the experiment, we first solve the deterministic problem with the mean payoff matrix $A_{0}$ ($A_{0}^{\prime}$ for the second part) and denote the solution as $(x^{*},y^{*})$. We then implement the two proposed methods: the stochastic zeroth-order extra-point method and the stochastic zeroth-order extra-momentum method. In addition, we compare these two methods with other first-order methods: the extra-gradient method, the OGDA method, and the VS-Ave method (proposed in [9], which is a variance-reduced stochastic extension of Nesterov’s method [26]), all equipped with the same stochastic zeroth-order oracle. The results are shown in the following two figures, where the left plot shows the result from one experiment and the right plot shows the average of ten experiments. The parameters for each algorithm are manually tuned except for VS-Ave, where we adopt the recursive rule proposed in its original paper. The results show that the two newly proposed methods exhibit comparable (or slightly improved) performance relative to the stochastic extra-gradient/OGDA methods in this particular application.

Figure 1: Normally distributed payoff matrix $A$

Figure 2: Log-normally distributed payoff matrix $A$

6 Conclusions

This paper proposes two new schemes of stochastic first-order methods to solve strongly monotone VI problems: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme features high flexibility in the configuration of parameter choices, which can be tailored to different problem classes. The second scheme is less general in the choice of search directions. However, it has the advantage of maintaining a single sequence throughout the iterations. Therefore, it requires only one projection per iteration, as opposed to most other first-order methods, which maintain an extra iterative sequence. Both methods achieve the optimal iteration complexity bound, provided that the stochastic gradient oracle allows the variance to be controlled. The application of these two schemes to stochastic black-box saddle-point problems is also presented. Through a randomized smoothing scheme, the stochastic oracles required in these two schemes can be constructed via stochastic zeroth-order gradient approximation. The variance is thus controllable by mini-batch sampling with linearly increasing sample sizes per iteration, and the sample complexity results are derived. Preliminary numerical experiments show improved (or at least comparable) performance of the proposed schemes compared with other existing methods.

References

  • [1] H. Bauschke and P. Combettes “Convex analysis and monotone operator theory in Hilbert spaces” Springer, 2011
  • [2] D. Boob and C. Guzmán “Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems” In arXiv preprint arXiv:2104.02988, 2021
  • [3] Y. Chen, G. Lan and Y. Ouyang “Accelerated schemes for a class of variational inequalities” In Mathematical Programming 165.1 Springer, 2017, pp. 113–149
  • [4] F. Facchinei and J.-S. Pang “Finite-dimensional variational inequalities and complementarity problems” Springer Science & Business Media, 2007
  • [5] X. Gao “Low-order optimization algorithms: iteration complexity and applications” Ph.D. Thesis, University of Minnesota, 2018
  • [6] K. Huang and S. Zhang “A unifying framework of accelerated first-order approach to strongly monotone variational inequalities” In arXiv preprint arXiv:2103.15270, 2021
  • [7] A. Iusem, A. Jofré, R. Oliveira and P. Thompson “Extragradient method with variance reduction for stochastic variational inequalities” In SIAM Journal on Optimization 27.2 SIAM, 2017, pp. 686–724
  • [8] A. Iusem, A. Jofré, R. Oliveira and P. Thompson “Variance-based extragradient methods with line search for stochastic variational inequalities” In SIAM Journal on Optimization 29.1 SIAM, 2019, pp. 175–206
  • [9] A. Jalilzadeh and U. Shanbhag “A proximal-point algorithm with variable sample-sizes (PPAWSS) for monotone stochastic variational inequality problems” In 2019 Winter Simulation Conference (WSC), 2019, pp. 3551–3562 IEEE
  • [10] H. Jiang and H. Xu “Stochastic approximation approaches to the stochastic variational inequality problem” In IEEE Transactions on Automatic Control 53.6 IEEE, 2008, pp. 1462–1475
  • [11] A. Juditsky, A. Nemirovski and C. Tauvel “Solving variational inequalities with stochastic mirror-prox algorithm” In Stochastic Systems 1.1 INFORMS, 2011, pp. 17–58
  • [12] A. Kannan and U. Shanbhag “Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants” In Computational Optimization and Applications 74.3 Springer, 2019, pp. 779–820
  • [13] G.M. Korpelevich “The extragradient method for finding saddle points and other problems” In Matecon 12, 1976, pp. 747–756
  • [14] J. Koshal, A. Nedic and U. Shanbhag “Regularized iterative stochastic approximation methods for stochastic variational inequality problems” In IEEE Transactions on Automatic Control 58.3 IEEE, 2012, pp. 594–609
  • [15] G. Kotsalis, G. Lan and T. Li “Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation” In arXiv preprint arXiv:2011.02987, 2020
  • [16] J. Larson, M. Menickelly and S. Wild “Derivative-free optimization methods” In arXiv preprint arXiv:1904.11585, 2019
  • [17] S. Liu, S. Lu, X. Chen, Y. Feng, K. Xu, A. Al-Dujaili, M. Hong and U.-M. O’Reilly “Min-max optimization without gradients: convergence and applications to adversarial ML” In arXiv preprint arXiv:1909.13806, 2019
  • [18] B. Martinet “Brève communication. Régularisation d’inéquations variationnelles par approximations successives” In Revue française d’informatique et de recherche opérationnelle. Série rouge 4.R3 EDP Sciences, 1970, pp. 154–158
  • [19] M. Menickelly and S. Wild “Derivative-free robust optimization by outer approximations” In Mathematical Programming 179.1 Springer, 2020, pp. 157–193
  • [20] A. Mokhtari, A. Ozdaglar and S. Pattathil “A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach” In arXiv preprint arXiv:1901.08511, 2019
  • [21] A. Mokhtari, A. Ozdaglar and S. Pattathil “Convergence rate of $O(1/k)$ for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems” In SIAM Journal on Optimization 30.4 SIAM, 2020, pp. 3230–3251
  • [22] A. Nemirovski “Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems” In SIAM Journal on Optimization 15.1 SIAM, 2004, pp. 229–251
  • [23] Y. Nesterov “A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^{2})$” In Doklady SSSR 269, 1983, pp. 543–547
  • [24] Y. Nesterov “Dual extrapolation and its applications to solving variational inequalities and related problems” In Mathematical Programming 109.2-3 Springer, 2007, pp. 319–344
  • [25] Y. Nesterov “Introductory lectures on convex optimization: A basic course” Springer Science & Business Media, 2003
  • [26] Y. Nesterov and L. Scrimali “Solving strongly monotone variational and quasi-variational inequalities” In Available at SSRN 970903, 2006
  • [27] Y. Nesterov and V. Spokoiny “Random gradient-free minimization of convex functions” In Foundations of Computational Mathematics 17.2 Springer, 2017, pp. 527–566
  • [28] B. Palaniappan and F. Bach “Stochastic variance reduction methods for saddle-point problems” In Advances in Neural Information Processing Systems, 2016, pp. 1416–1424
  • [29] B. Polyak “Some methods of speeding up the convergence of iteration methods” In USSR Computational Mathematics and Mathematical Physics 4.5 Elsevier, 1964, pp. 1–17
  • [30] L. Popov “A modification of the Arrow-Hurwicz method for search of saddle points” In Mathematical Notes of the Academy of Sciences of the USSR 28.5 Springer, 1980, pp. 845–848
  • [31] H. Robbins and S. Monro “A stochastic approximation method” In The Annals of Mathematical Statistics JSTOR, 1951, pp. 400–407
  • [32] R. Rockafellar “Monotone operators and the proximal point algorithm” In SIAM Journal on Control and Optimization 14.5 SIAM, 1976, pp. 877–898
  • [33] A. Roy, Y. Chen, K. Balasubramanian and P. Mohapatra “Online and bandit algorithms for nonstationary stochastic saddle-point optimization” In arXiv preprint arXiv:1912.01698, 2019
  • [34] S. Shalev-Shwartz “Online learning and online convex optimization” In Foundations and trends in Machine Learning 4.2, 2011, pp. 107–194
  • [35] U. Shanbhag “Stochastic variational inequality problems: Applications, analysis, and algorithms” In Theory Driven by Influential Applications INFORMS, 2013, pp. 71–107
  • [36] P. Tseng “On linear convergence of iterative methods for the variational inequality problem” In Journal of Computational and Applied Mathematics 60.1-2 Elsevier, 1995, pp. 237–252
  • [37] Z. Wang, K. Balasubramanian, S. Ma and M. Razaviyayn “Zeroth-order algorithms for nonconvex minimax problems with improved complexities” In arXiv preprint arXiv:2001.07819, 2020
  • [38] T. Xu, Z. Wang, Y. Liang and H. Poor “Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence” In arXiv preprint arXiv:2006.09361, 2020
  • [39] F. Yousefian, A. Nedić and U. Shanbhag “A regularized smoothing stochastic approximation (RSSA) algorithm for stochastic variational inequality problems” In 2013 Winter Simulations Conference (WSC), 2013, pp. 933–944 IEEE
  • [40] F. Yousefian, A. Nedić and U. Shanbhag “On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems” In Mathematical Programming 165.1 Springer, 2017, pp. 391–431
  • [41] F. Yousefian, A. Nedić and U. Shanbhag “Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems” In 53rd IEEE Conference on Decision and Control, 2014, pp. 5831–5836 IEEE
  • [42] J. Zhang, M. Hong and S. Zhang “On lower iteration complexity bounds for the saddle point problems” In arXiv preprint arXiv:1912.07481, 2018

Appendix A Proof of technical lemmas and theorems

A.1 Proof of Lemma 3.1

First of all, by the 1-co-coerciveness (cf. e.g. Proposition 4.4 in [1]) of the projection operator $P_{\mathcal{Z}}$, we have

\[
\begin{aligned}
\|z^{k+1}-z^{*}\|^{2} ={}& \left\|P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right)-P_{\mathcal{Z}}\left(z^{*}\right)\right\|^{2}\\
\leq{}& (z^{k+1}-z^{*})^{\top}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)-z^{*}\right)\\
={}& \frac{1}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{1}{2}\|z^{k}-z^{*}\|^{2}-\frac{1}{2}\|z^{k+1}-z^{k}\|^{2}-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\\
&+(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right).
\end{aligned}
\tag{70}
\]

We shall first decompose the last term in the above inequality as

\[
\begin{aligned}
&(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)\\
={}& (z^{k+1}-z^{k+0.5}+z^{k+0.5}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)\\
={}& \underbrace{(z^{k+1}-z^{k+0.5})^{\top}\left(-\eta\hat{F}(z^{k})+\beta(z^{k}-z^{k-1})\right)}_{(a)}\\
&+\underbrace{(z^{k+1}-z^{k+0.5})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\eta\hat{F}(z^{k})+(\gamma-\beta)(z^{k}-z^{k-1})\right)}_{(b)}\\
&+\underbrace{(z^{k+0.5}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)}_{(c)}.
\end{aligned}
\tag{71}
\]

Let us first use the optimality condition of $z^{k+0.5}$ to bound term $(a)$:

\[
\langle z^{k+0.5}-z^{k}-\beta(z^{k}-z^{k-1})+\eta\hat{F}(z^{k}),z-z^{k+0.5}\rangle\geq 0,\quad\forall z\in\mathcal{Z}.
\]

Taking $z=z^{k+1}$, we get

\[
(a) \leq \frac{1}{2}\|z^{k+1}-z^{k}\|^{2}-\frac{1}{2}\|z^{k+0.5}-z^{k}\|^{2}-\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}. \tag{72}
\]

We can also establish the bound for $(b)$:

\[
\begin{aligned}
(b) ={}& (z^{k+1}-z^{k+0.5})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\alpha\hat{F}(z^{k})-\alpha\hat{F}(z^{k})+\eta\hat{F}(z^{k})+(\gamma-\beta)(z^{k}-z^{k-1})\right)\\
\leq{}& \alpha\|z^{k+1}-z^{k+0.5}\|\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\\
&+(\gamma-\beta)(z^{k+1}-z^{k+0.5})^{\top}(z^{k}-z^{k-1}).
\end{aligned}
\]

Note the following bound from the Lipschitz continuity:

\[
\begin{aligned}
\|\hat{F}(z)-\hat{F}(z^{\prime})\| ={}& \|\hat{F}(z)-F(z)+F(z^{\prime})-\hat{F}(z^{\prime})+F(z)-F(z^{\prime})\|\\
\leq{}& \varepsilon_{z}+\varepsilon_{z^{\prime}}+L\|z-z^{\prime}\|,
\end{aligned}
\tag{73}
\]

for any $z,z^{\prime}\in\mathcal{Z}$, where we used the definition of the stochastic error term

\[
\varepsilon_{z}=\left\|F(z)-\hat{F}(z)\right\|. \tag{74}
\]

Therefore,

\[
\begin{aligned}
&\alpha\|z^{k+1}-z^{k+0.5}\|\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|\\
\leq{}& \frac{1}{2}\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|^{2}\right)\\
\leq{}& \frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\alpha^{2}L^{2}\|z^{k}-z^{k+0.5}\|^{2}.
\end{aligned}
\tag{75}
\]

Furthermore,

\[
\begin{aligned}
&(\gamma-\beta)(z^{k+1}-z^{k+0.5})^{\top}(z^{k}-z^{k-1})\\
\leq{}& \frac{1}{2}|\gamma-\beta|\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right)\\
\leq{}& |\gamma-\beta|\left(\|z^{k+1}-z^{k}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
\leq{}& |\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+2\|z^{k}-z^{*}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
={}& |\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+3\|z^{k}-z^{*}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}
\]

The resulting bound for $(b)$ becomes:

\[
\begin{aligned}
(b) \leq{}& \frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\left(\alpha^{2}L^{2}+|\gamma-\beta|\right)\|z^{k}-z^{k+0.5}\|^{2}\\
&+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\\
&+|\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+3\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}
\tag{76}
\]

Next let us bound $(c)$ in (71). We have

\[
\begin{aligned}
(c) ={}& -\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma(z^{k+0.5}-z^{*})^{\top}(z^{k}-z^{k-1})\\
\leq{}& -\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\frac{1}{2}\gamma\left(\|z^{k+0.5}-z^{*}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right)\\
\leq{}& -\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma\left(\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
={}& -\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma\left(\|z^{k+0.5}-z^{k}\|^{2}+2\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}
\tag{77}
\]

Combining the bounds for $(a)$, $(b)$, and $(c)$ from (72), (76), and (77), it follows from (71) that

(zk+1z)(αF^(zk+0.5)+γ(zkzk1))\displaystyle(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right) (78)
\displaystyle\leq α(zk+0.5z)F^(zk+0.5)+(ηα)(zk+1zk+0.5)F^(zk)+α2(εzk+εzk+0.5)2\displaystyle-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}
+2(|γβ|)zk+1z2+(2γ+3|γβ|)zkz2+(|γβ|+γ)zk1z2\displaystyle+2\left(|\gamma-\beta|\right)\|z^{k+1}-z^{*}\|^{2}+\left(2\gamma+3|\gamma-\beta|\right)\|z^{k}-z^{*}\|^{2}+\left(|\gamma-\beta|+\gamma\right)\|z^{k-1}-z^{*}\|^{2}
+12zk+1zk2+(α2L2+|γβ|+γ12)zk+0.5zk2.\displaystyle+\frac{1}{2}\|z^{k+1}-z^{k}\|^{2}+\left(\alpha^{2}L^{2}+|\gamma-\beta|+\gamma-\frac{1}{2}\right)\|z^{k+0.5}-z^{k}\|^{2}.

We also need to bound the following term in (70):

τ(zk+1z)(F^(zk)F^(zk1))\displaystyle-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right) (79)
\displaystyle\leq τzk+1zF^(zk)F^(zk1)\displaystyle\tau\|z^{k+1}-z^{*}\|\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|
(73)\displaystyle\overset{\eqref{sto-Lip-bd}}{\leq} τLzk+1z(1L(εzk+εzk1)+zkzk1)\displaystyle\tau L\|z^{k+1}-z^{*}\|\left(\frac{1}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})+\|z^{k}-z^{k-1}\|\right)
\displaystyle\leq τL2zk+1z2+τL(εzk+εzk1)2+τLzkzk12\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+\tau L\|z^{k}-z^{k-1}\|^{2}
\displaystyle\leq τL2zk+1z2+τL(εzk+εzk1)2+2τLzkz2+2τLzk1z2.\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L\|z^{k}-z^{*}\|^{2}+2\tau L\|z^{k-1}-z^{*}\|^{2}.

Combining the bounds in (78) and (79) with (70) and multiplying both sides by 2, we have

$$\begin{aligned}
& (1-4|\gamma-\beta|-\tau L)\|z^{k+1}-z^{*}\|^{2} \\
\leq{} & \left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)\|z^{k}-z^{*}\|^{2} \\
& +\left(2|\gamma-\beta|+2\gamma+4\tau L\right)\|z^{k-1}-z^{*}\|^{2} \\
& +\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\|z^{k+0.5}-z^{k}\|^{2} \\
& +2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2} \\
& -2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k}).
\end{aligned}$$

Let us now take expectations on both sides. Writing $d_{k+1}=\mathbb{E}\left[\|z^{k+1}-z^{*}\|^{2}\right]$, $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$, and $d_{k-1}=\mathbb{E}\left[\|z^{k-1}-z^{*}\|^{2}\right]$, and noting that $\mathbb{E}[\varepsilon_{z}^{2}]\leq\sigma^{2}$ for all $z\in\mathcal{Z}$ by Assumption (6), so that $\mathbb{E}[(\varepsilon_{z}+\varepsilon_{z'})^{2}]\leq 2\mathbb{E}[\varepsilon_{z}^{2}]+2\mathbb{E}[\varepsilon_{z'}^{2}]\leq 4\sigma^{2}$, we obtain

(14|γβ|τL)dk+1\displaystyle(1-4|\gamma-\beta|-\tau L)d_{k+1} (80)
\displaystyle\leq (1+4γ+6|γβ|+4τL)dk+(2|γβ|+2γ+4τL)dk1\displaystyle\left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)d_{k}+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}
+(2α2L2+2|γβ|+2γ1)𝔼[zk+0.5zk2]+8(α2+τL)σ2\displaystyle+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}
2α𝔼[(zk+0.5z)F^(zk+0.5)]+2(ηα)𝔼[(zk+1zk+0.5)F^(zk)].\displaystyle-2\alpha\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})\right]+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].\quad

Notice that

𝔼[(zk+0.5z)F^(zk+0.5)]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})\right]
=\displaystyle= 𝔼[(zk+0.5z)F(zk+0.5)]+𝔼[(zk+0.5z)(F^(zk+0.5)F(zk+0.5))],\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]+\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}(z^{k+0.5})-F(z^{k+0.5})\right)\right],

where

𝔼[(zk+0.5z)F(zk+0.5)]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]
=\displaystyle= 𝔼[(zk+0.5z)(F(zk+0.5)F(z))+(zk+0.5z)F(z)]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(F(z^{k+0.5})-F(z^{*})\right)+(z^{k+0.5}-z^{*})^{\top}F(z^{*})\right]
\displaystyle\geq 𝔼[μzk+0.5z2]\displaystyle\mathbb{E}\left[\mu\|z^{k+0.5}-z^{*}\|^{2}\right]
\displaystyle\geq μ2dkμ𝔼[zk+0.5zk2],\displaystyle\frac{\mu}{2}d_{k}-\mu\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right],

and

𝔼[(zk+0.5z)(F^(zk+0.5)F(zk+0.5))]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}(z^{k+0.5})-F(z^{k+0.5})\right)\right]
\displaystyle\geq 𝔼[zk+0.5zF^(zk+0.5)F(zk+0.5)]\displaystyle-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\|\right]
=\displaystyle= 𝔼[𝔼[zk+0.5zF^(zk+0.5)F(zk+0.5)|ξ[k]]]\displaystyle-\mathbb{E}\left[\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\||\xi^{[k]}\right]\right]
=\displaystyle= 𝔼[zk+0.5z𝔼[F^(zk+0.5)F(zk+0.5)|ξ[k]]]\displaystyle-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\mathbb{E}\left[\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\||\xi^{[k]}\right]\right]
\displaystyle\geq 𝔼[zk+0.5zδ]\displaystyle-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\cdot\delta\right]
\displaystyle\geq δD.\displaystyle-\delta D.
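The last step of the strong-monotonicity chain above (the passage from $\|z^{k+0.5}-z^{*}\|^{2}$ to $d_{k}$) is not spelled out. One way to see it, using only the elementary inequality $\|u+v\|^{2}\leq 2\|u\|^{2}+2\|v\|^{2}$ with $u=z^{k}-z^{k+0.5}$ and $v=z^{k+0.5}-z^{*}$, is the following short derivation:

```latex
\|z^{k}-z^{*}\|^{2}
\leq 2\|z^{k}-z^{k+0.5}\|^{2}+2\|z^{k+0.5}-z^{*}\|^{2}
\quad\Longrightarrow\quad
\mu\|z^{k+0.5}-z^{*}\|^{2}
\geq \frac{\mu}{2}\|z^{k}-z^{*}\|^{2}-\mu\|z^{k+0.5}-z^{k}\|^{2}.
```

Taking expectations on both sides of the right-hand inequality gives the stated lower bound in terms of $d_{k}$.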

Here $\xi^{[k]}=(\xi^{0},\xi^{0.5},\xi^{1},\xi^{1.5},\ldots,\xi^{k-0.5},\xi^{k})$ denotes the collection of random vectors sampled up to the iterate $z^{k}$. Therefore, $z^{k+0.5}$ is a known vector given $\xi^{[k]}$.

Putting the above two bounds back into (80), we arrive at the desired bound:

(14|γβ|τL)dk+1\displaystyle(1-4|\gamma-\beta|-\tau L)d_{k+1}
\displaystyle\leq (1+4γ+6|γβ|+4τLαμ)dk\displaystyle\left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}
+(2|γβ|+2γ+4τL)dk1\displaystyle+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}
+(2α2L2+2|γβ|+2γ+2αμ1)𝔼[zk+0.5zk2]\displaystyle+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]
+8(α2+τL)σ2+2αδD\displaystyle+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D
+2(ηα)𝔼[(zk+1zk+0.5)F^(zk)].\displaystyle+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].\quad

A.2 Proof of Theorem 3.2

By condition (21), we have $t_{3}<1$. Let us start by dividing both sides of (18) by $1-t_{3}$:

$$\begin{aligned}
d_{k+1} \leq{} & \left(1-\frac{t_{1}-t_{3}}{1-t_{3}}\right)d_{k}+\frac{t_{2}}{1-t_{3}}d_{k-1}+\frac{8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}}{1-t_{3}}+\frac{2\alpha\delta D}{1-t_{3}} \\
={} & (1-a)d_{k}+b\cdot d_{k-1}+c\cdot\sigma^{2}+d\cdot\delta D,
\end{aligned}\tag{81}$$

where $a=\frac{t_{1}-t_{3}}{1-t_{3}}$, $b=\frac{t_{2}}{1-t_{3}}$, $c=\frac{8\left(\alpha^{2}+\frac{\tau}{L}\right)}{1-t_{3}}$, and $d=\frac{2\alpha}{1-t_{3}}$. Note that we have $1>a>b$ by condition (21). It is elementary to verify that

$$b\leq\left(1-\frac{a-b}{2}\right)\cdot\frac{a+b}{2},$$

and by rearranging terms in (81), we have the following:

$$\begin{aligned}
d_{k+1}+\frac{a+b}{2}d_{k} \leq{} & \left(1-\frac{a-b}{2}\right)d_{k}+b\cdot d_{k-1}+c\cdot\sigma^{2}+d\cdot\delta D \\
\leq{} & \left(1-\frac{a-b}{2}\right)\left(d_{k}+\frac{a+b}{2}d_{k-1}\right)+c\cdot\sigma^{2}+d\cdot\delta D.
\end{aligned}$$
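For completeness, the elementary inequality on $b$ used in the second step can be checked directly. The following short derivation is a sketch that uses only $1>a>b$:

```latex
\begin{aligned}
\left(1-\frac{a-b}{2}\right)\frac{a+b}{2}-b
&=\frac{a+b}{2}-b-\frac{(a-b)(a+b)}{4} \\
&=\frac{a-b}{2}-\frac{(a-b)(a+b)}{4} \\
&=\frac{a-b}{2}\left(1-\frac{a+b}{2}\right)\;\geq\;0 .
\end{aligned}
```

Both factors in the last line are nonnegative because $a>b$ and $a+b<2$.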

A recursive argument yields the following result:

$$\begin{aligned}
& d_{k+1}+\frac{a+b}{2}d_{k} \\
\leq{} & \left(1-\frac{a-b}{2}\right)^{k+1}\left(d_{0}+\frac{a+b}{2}d_{-1}\right)+\left(c\cdot\sigma^{2}+d\cdot\delta D\right)\cdot\sum_{i=0}^{k}\left(1-\frac{a-b}{2}\right)^{i} \\
\leq{} & \left(1-\frac{a-b}{2}\right)^{k+1}\cdot\frac{2+a+b}{2}\|z^{0}-z^{*}\|^{2}+\left(c\cdot\sigma^{2}+d\cdot\delta D\right)\cdot\frac{2}{a-b}.
\end{aligned}$$

Note that $d_{0}=d_{-1}=\|z^{0}-z^{*}\|^{2}$. The statement in Theorem 3.2 follows by letting $q=\frac{2}{a-b}=\frac{2(1-t_{3})}{t_{1}-t_{2}-t_{3}}$.
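As an informal numerical sanity check of the recursive argument (not part of the original proof), the sketch below iterates (81) with equality, which is the worst case, and compares the result with the final bound. The values of `a`, `b`, `d0` and the aggregated term `noise` (standing in for $c\sigma^{2}+d\,\delta D$) are arbitrary assumptions chosen only so that $1>a>b$ holds.

```python
def simulate(a, b, noise, d0, steps):
    """Iterate d_{k+1} = (1 - a) d_k + b d_{k-1} + noise with d_{-1} = d_0."""
    prev, cur = d0, d0
    history = [cur]  # history[k] holds d_k
    for _ in range(steps):
        prev, cur = cur, (1 - a) * cur + b * prev + noise
        history.append(cur)
    return history


def theorem_bound(a, b, noise, d0, k):
    """Right-hand side of the final bound on d_{k+1} + (a + b) / 2 * d_k."""
    rate = 1 - (a - b) / 2
    return rate ** (k + 1) * (2 + a + b) / 2 * d0 + noise * 2 / (a - b)


if __name__ == "__main__":
    a, b, noise, d0 = 0.1, 0.05, 0.01, 1.0  # hypothetical values
    hist = simulate(a, b, noise, d0, 200)
    worst = max(
        hist[k + 1] + (a + b) / 2 * hist[k] - theorem_bound(a, b, noise, d0, k)
        for k in range(200)
    )
    print("largest violation of the bound:", worst)
```

A non-positive printed value indicates that the simulated sequence stays below the bound for every index tried.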

A.3 Proof of Proposition 3.3

For the choice of parameters

$$(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right),$$

we have

$$(t_{1},t_{2},t_{3})=\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)$$

by the relation (17). Additionally, since $\kappa\geq 1$,

$$2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1=\frac{1}{8}+\frac{1}{32\kappa}+\frac{1}{2\kappa}-1<0.$$

Therefore, both conditions (13) and (21) are satisfied.
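The arithmetic behind these constants can be reproduced with exact rational numbers. The sketch below is illustrative only: the expressions for `t1`, `t2`, `t3` are an assumption, read off from the coefficients of the bound at the end of Appendix A.1 rather than copied from relation (17), and $\kappa=L/\mu$ is taken as the definition of the condition number.

```python
from fractions import Fraction as Fr


def prop33_constants(kappa, L=1):
    """Return (t1, t2, t3, slack) for the parameter choice of Proposition 3.3."""
    kappa, L = Fr(kappa), Fr(L)
    mu = L / kappa                 # assumes kappa = L / mu
    alpha = 1 / (4 * L)            # eta takes the same value as alpha
    beta = gamma = 1 / (64 * kappa)
    tau = 1 / (64 * L * kappa)
    gap = abs(gamma - beta)        # |gamma - beta|, zero for this choice
    # Assumed forms, inferred from the coefficients of d_k, d_{k-1}, d_{k+1}.
    t1 = alpha * mu - 4 * gamma - 6 * gap - 4 * tau * L
    t2 = 2 * gap + 2 * gamma + 4 * tau * L
    t3 = 4 * gap + tau * L
    # Coefficient of E||z^{k+0.5} - z^k||^2, which should be negative.
    slack = 2 * alpha**2 * L**2 + 2 * gap + 2 * gamma + 2 * alpha * mu - 1
    return t1, t2, t3, slack


if __name__ == "__main__":
    for kappa in (1, 4, 100):
        print(kappa, prop33_constants(kappa))
```

Using `Fraction` avoids floating-point rounding, so the computed values can be compared with $\frac{1}{8\kappa}$, $\frac{3}{32\kappa}$, $\frac{1}{64\kappa}$ by exact equality.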

Now, from (10), we have

$$\begin{aligned}
\left(1-\frac{1}{64\kappa}\right)d_{k+1} \leq{} & \left(1-\frac{1}{8\kappa}\right)d_{k}+\frac{3}{32\kappa}d_{k-1}+8\left(\frac{1}{16L^{2}}+\frac{1}{64L^{2}\kappa}\right)\sigma^{2}+\frac{\delta D}{2L} \\
\leq{} & \left(1-\frac{1}{8\kappa}\right)d_{k}+\frac{3}{32\kappa}d_{k-1}+\frac{5\sigma^{2}}{8L^{2}}+\frac{\delta D}{2L}.
\end{aligned}$$

Dividing both sides by $1-\frac{1}{64\kappa}$ and noting that $\left(1-\frac{1}{64\kappa}\right)^{-1}\leq\frac{64}{63}$, we have:

$$\begin{aligned}
d_{k+1} \leq{} & \frac{1-\frac{1}{8\kappa}}{1-\frac{1}{64\kappa}}d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L} \\
={} & \left(1-\frac{\frac{7}{64\kappa}}{1-\frac{1}{64\kappa}}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L} \\
\leq{} & \left(1-\frac{7}{64\kappa}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}.
\end{aligned}$$

We can move a part of $d_{k}$ to the LHS and form the following:

$$\begin{aligned}
& d_{k+1}+\frac{27}{256\kappa}d_{k} \\
\leq{} & \left(1-\frac{1}{256\kappa}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L} \\
\leq{} & \left(1-\frac{1}{256\kappa}\right)\left(d_{k}+\frac{27}{256\kappa}d_{k-1}\right)+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L} \\
\leq{} & \left(1-\frac{1}{256\kappa}\right)^{k+1}\left(d_{0}+\frac{27}{256\kappa}d_{-1}\right)+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot\sum_{i=0}^{k}\left(1-\frac{1}{256\kappa}\right)^{i} \\
\leq{} & \left(1-\frac{1}{256\kappa}\right)^{k+1}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot\frac{1-\left(1-\frac{1}{256\kappa}\right)^{k+1}}{\frac{1}{256\kappa}} \\
\leq{} & \left(1-\frac{1}{256\kappa}\right)^{k+1}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot 256\kappa.
\end{aligned}$$

The second inequality uses $\frac{2}{21\kappa}\leq\left(1-\frac{1}{256\kappa}\right)\frac{27}{256\kappa}$, which holds for all $\kappa\geq 1$, and the fourth uses $d_{0}=d_{-1}=\|z^{0}-z^{*}\|^{2}$ together with $\kappa\geq 1$. Finally, the LHS of the above inequality can be lower bounded by $d_{k+1}$, thus completing the proof.
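The numerical constants in this proof are easy to mistype, so a quick exact check may be helpful. The sketch below is an illustration under stated assumptions: it takes the weight on $d_{k-1}$ in the Lyapunov-type combination to be $\frac{27}{256\kappa}$ (matching the weight on $d_{k}$ on the left-hand side) and assumes $\kappa\geq 1$. The function name and dictionary keys are hypothetical.

```python
from fractions import Fraction as Fr


def prop33_steps(kappa):
    """Check each constant manipulation in the proof for one value of kappa."""
    k = Fr(kappa)
    scale = 1 / (1 - 1 / (64 * k))   # (1 - 1/(64 kappa))^{-1}
    weight = Fr(27, 256) / k          # assumed weight 27/(256 kappa)
    rate = 1 - 1 / (256 * k)          # contraction factor
    return {
        "scale <= 64/63": scale <= Fr(64, 63),
        "d_{k-1} coefficient": Fr(3, 32) / k * Fr(64, 63) == Fr(2, 21) / k,
        "sigma^2 coefficient": Fr(5, 8) * Fr(64, 63) == Fr(40, 63),
        "delta*D coefficient": Fr(1, 2) * Fr(64, 63) == Fr(32, 63),
        "split of 7/(64 kappa)": Fr(7, 64) / k - weight == 1 / (256 * k),
        "absorb d_{k-1} term": Fr(2, 21) / k <= rate * weight,
        "initial weight": 1 + weight <= Fr(283, 256),
    }


if __name__ == "__main__":
    for kappa in (1, 2, 10, 1000):
        print(kappa, all(prop33_steps(kappa).values()))
```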

A.4 Proof of Lemma 3.4

We start by using the 1-co-coercivity of the projection operator $P_{\mathcal{Z}}(\cdot)$:

zk+1z2\displaystyle\|z^{k+1}-z^{*}\|^{2}
=\displaystyle= P𝒵(zkαF^(zk)+γ(zkzk1)τ(F^(zk)F^(zk1)))P𝒵(zαF(z))2\displaystyle\|P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right)-P_{\mathcal{Z}}\left(z^{*}-\alpha F(z^{*})\right)\|^{2}
\displaystyle\leq (zk+1z)(zkαF^(zk)+γ(zkzk1)τ(F^(zk)F^(zk1))(zαF(z)))\displaystyle(z^{k+1}-z^{*})^{\top}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)-\left(z^{*}-\alpha F(z^{*})\right)\right)
=\displaystyle= (zk+1z)((zkz)α(F^(zk)F(z))+γ(zkzk1)τ(F^(zk)F^(zk1))).\displaystyle(z^{k+1}-z^{*})^{\top}\left((z^{k}-z^{*})-\alpha\left(\hat{F}(z^{k})-F(z^{*})\right)+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right).

Next, let us bound the above four terms separately:

(zk+1z)(zkz)=12(zk+1z2+zkz2zk+1zk2),\displaystyle(z^{k+1}-z^{*})^{\top}(z^{k}-z^{*})=\frac{1}{2}\left(\|z^{k+1}-z^{*}\|^{2}+\|z^{k}-z^{*}\|^{2}-\|z^{k+1}-z^{k}\|^{2}\right), (83)

and

(zk+1z)(F^(zk)F(z))\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-F(z^{*})\right) =\displaystyle= (zk+1z)(F^(zk)F^(zk+1)+F^(zk+1)F(z))\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})+\hat{F}(z^{k+1})-F(z^{*})\right) (84)

where

(zk+1z)(F^(zk+1)F(z))\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{*})\right) (85)
=\displaystyle= (zk+1z)(F^(zk+1)F(zk+1)+F(zk+1)F(z))\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})+F(z^{k+1})-F(z^{*})\right)
\displaystyle\geq (zk+1z)(F^(zk+1)F(zk+1))+μzk+1z2,\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)+\mu\|z^{k+1}-z^{*}\|^{2},

and

$$\gamma(z^{k+1}-z^{*})^{\top}(z^{k}-z^{k-1})\leq\frac{\gamma}{2}\left(\|z^{k+1}-z^{*}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right),\tag{86}$$

and

τ(zk+1z)(F^(zk)F^(zk1))\displaystyle-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right) (87)
=\displaystyle= τ(zk+1zk+zkz)(F^(zk)F^(zk1))\displaystyle-\tau(z^{k+1}-z^{k}+z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)
\displaystyle\leq τzk+1zkF^(zk)F^(zk1)τ(zkz)(F^(zk)F^(zk1))\displaystyle\tau\|z^{k+1}-z^{k}\|\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|-\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)
\displaystyle\leq 14zk+1zk2+τ2F^(zk)F^(zk1)2τ(zkz)(F^(zk)F^(zk1)),\displaystyle\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}+\tau^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|^{2}-\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right),

where

τ2F^(zk)F^(zk1)2\displaystyle\tau^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|^{2} (73)\displaystyle\overset{\eqref{sto-Lip-bd}}{\leq} 2τ2(εzk+εzk1)2+2τ2L2zkzk12.\displaystyle 2\tau^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau^{2}L^{2}\|z^{k}-z^{k-1}\|^{2}. (88)

Putting the bounds (83)-(88) back into the first inequality of this proof, we get:

(12+αμγ2)zk+1z2+α(zk+1z)(F^(zk)F^(zk+1))+14zk+1zk2\displaystyle\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}
\displaystyle\leq 12zkz2+τ(zkz)(F^(zk1)F^(zk))+(2τ2L2+γ2)zkzk12\displaystyle\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}
+2τ2(εzk+εzk1)2α(zk+1z)(F^(zk+1)F(zk+1)).\displaystyle+2\tau^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}-\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right).

Taking expectation on both sides gives us

𝔼[(12+αμγ2)zk+1z2+α(zk+1z)(F^(zk)F^(zk+1))+14zk+1zk2]\displaystyle\mathbb{E}\left[\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right] (89)
\displaystyle\leq 𝔼[12zkz2+τ(zkz)(F^(zk1)F^(zk))+(2τ2L2+γ2)zkzk12]\displaystyle\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]
+8τ2σ2𝔼[α(zk+1z)(F^(zk+1)F(zk+1))].\displaystyle+8\tau^{2}\sigma^{2}-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)\right].

Note that

𝔼[α(zk+1z)(F^(zk+1)F(zk+1))]\displaystyle-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)\right]
\displaystyle\leq α𝔼[zk+1zF^(zk+1)F(zk+1)]\displaystyle\alpha\mathbb{E}\left[\|z^{k+1}-z^{*}\|\|\hat{F}(z^{k+1})-F(z^{k+1})\|\right]
=\displaystyle= α𝔼[𝔼[(zk+1zF^(zk+1)F(zk+1))|ξ[k]]]\displaystyle\alpha\mathbb{E}\left[\mathbb{E}\left[\left(\|z^{k+1}-z^{*}\|\|\hat{F}(z^{k+1})-F(z^{k+1})\|\right)|\xi^{[k]}\right]\right]
=\displaystyle= α𝔼[zk+1z𝔼[(F^(zk+1)F(zk+1))|ξ[k]]]\displaystyle\alpha\mathbb{E}\left[\|z^{k+1}-z^{*}\|\cdot\mathbb{E}\left[\left(\|\hat{F}(z^{k+1})-F(z^{k+1})\|\right)|\xi^{[k]}\right]\right]
\displaystyle\leq α𝔼[δzk+1z]\displaystyle\alpha\mathbb{E}\left[\delta\|z^{k+1}-z^{*}\|\right]
\displaystyle\leq α𝔼[δ22μ+μ2zk+1z2].\displaystyle\alpha\mathbb{E}\left[\frac{\delta^{2}}{2\mu}+\frac{\mu}{2}\|z^{k+1}-z^{*}\|^{2}\right].

Here we define $\xi^{[k]}=(\xi^{0},\xi^{1},\ldots,\xi^{k})$ to be the collection of random vectors sampled up to the iterate $z^{k}$, so that $z^{k+1}$ is known given $\xi^{[k]}$.

Therefore, (89) becomes

𝔼[(12+αμ2γ2)zk+1z2+α(zk+1z)(F^(zk)F^(zk+1))+14zk+1zk2]\displaystyle\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]
\displaystyle\leq 𝔼[12zkz2+τ(zkz)(F^(zk1)F^(zk))+(2τ2L2+γ2)zkzk12]\displaystyle\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]
+8τ2σ2+αδ22μ,\displaystyle+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu},

completing the proof.
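The first step of this proof relies on the co-coercivity (firm nonexpansiveness) of the projection, $\|P_{\mathcal{Z}}(u)-P_{\mathcal{Z}}(v)\|^{2}\leq(P_{\mathcal{Z}}(u)-P_{\mathcal{Z}}(v))^{\top}(u-v)$. As a small illustration, the sketch below checks this numerically for one particular constraint set. The box $[-1,1]^{n}$ is an assumption made here for simplicity, since its projection is a coordinatewise clip; the paper's $\mathcal{Z}$ is a general set.

```python
import random


def project_box(z, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n (coordinatewise clip)."""
    return [min(max(v, lo), hi) for v in z]


def cocoercivity_gap(u, v):
    """Return (P(u)-P(v))^T (u-v) - ||P(u)-P(v)||^2, which should be >= 0."""
    pu, pv = project_box(u), project_box(v)
    diff_p = [a - b for a, b in zip(pu, pv)]
    diff = [a - b for a, b in zip(u, v)]
    inner = sum(p * d for p, d in zip(diff_p, diff))
    sq_norm = sum(p * p for p in diff_p)
    return inner - sq_norm


if __name__ == "__main__":
    random.seed(0)
    gaps = []
    for _ in range(1000):
        u = [random.uniform(-3, 3) for _ in range(5)]
        v = [random.uniform(-3, 3) for _ in range(5)]
        gaps.append(cocoercivity_gap(u, v))
    print("smallest gap:", min(gaps))
```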

A.5 Proof of Theorem 3.5

Continuing from (27), we have:

Vk\displaystyle V_{k} \displaystyle\leq (1+θκ)kV0+i=1k(1+θκ)i(8τ2σ2+αδ22μ)\displaystyle\left(1+\frac{\theta}{\kappa}\right)^{-k}V_{0}+\sum\limits_{i=1}^{k}\left(1+\frac{\theta}{\kappa}\right)^{-i}\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right)
=\displaystyle= 12(1+θκ)kz0z2+1(1+θκ)kθκ(8τ2σ2+αδ22μ),\displaystyle\frac{1}{2}\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\frac{1-\left(1+\frac{\theta}{\kappa}\right)^{-k}}{\frac{\theta}{\kappa}}\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right),

where we use $z^{-1}=z^{0}$ for $V_{0}$.

Finally, with the following bound:

τ(zkz)(F^(zk1)F^(zk))\displaystyle\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)
\displaystyle\geq τzkzF^(zk1)F^(zk)\displaystyle-\tau\|z^{k}-z^{*}\|\|\hat{F}(z^{k-1})-\hat{F}(z^{k})\|
\displaystyle\geq 14zkz2τ2F^(zk1)F^(zk)2\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\|\hat{F}(z^{k-1})-\hat{F}(z^{k})\|^{2}
(73)\displaystyle\overset{\eqref{sto-Lip-bd}}{\geq} 14zkz2τ2(2L2zk1zk2+2(εzk1+εzk)2),\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\left(2L^{2}\|z^{k-1}-z^{k}\|^{2}+2(\varepsilon_{z^{k-1}}+\varepsilon_{z^{k}})^{2}\right),

we can lower bound $V_{k}$ as

$$\frac{1}{4}\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]-8\tau^{2}\sigma^{2}\leq V_{k}.$$

Therefore

𝔼[zkz2]\displaystyle\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right] (90)
\displaystyle\leq 2(1+θκ)kz0z2+4κθ(1(1+θκ)k)(8τ2σ2+αδ22μ)+32τ2σ2\displaystyle 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\frac{4\kappa}{\theta}\cdot\left(1-\left(1+\frac{\theta}{\kappa}\right)^{-k}\right)\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right)+32\tau^{2}\sigma^{2}
\displaystyle\leq 2(1+θκ)kz0z2+(κθ+1)32τ2σ2+2καδ2θμ.\displaystyle 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\left(\frac{\kappa}{\theta}+1\right)\cdot 32\tau^{2}\sigma^{2}+\frac{2\kappa\alpha\delta^{2}}{\theta\mu}.

The statement in Theorem 3.5 follows by noting $\left(1+\frac{\theta}{\kappa}\right)^{-1}=1-\frac{\theta}{\kappa+\theta}$.
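Two small identities are used in this proof: the closed form of the geometric sum $\sum_{i=1}^{k}\left(1+\frac{\theta}{\kappa}\right)^{-i}$ and the rewriting of the contraction factor. The sketch below verifies both with exact rational arithmetic; the specific values of `theta`, `kappa` and `k` are arbitrary assumptions.

```python
from fractions import Fraction as Fr


def geometric_sum(theta, kappa, k):
    """Return (r, direct sum, closed form) for r = (1 + theta/kappa)^{-1}."""
    ratio = Fr(theta) / Fr(kappa)
    r = 1 / (1 + ratio)
    direct = sum(r**i for i in range(1, k + 1))
    closed = (1 - r**k) / ratio
    return r, direct, closed


if __name__ == "__main__":
    theta, kappa, k = Fr(1, 4), 10, 7  # hypothetical values
    r, direct, closed = geometric_sum(theta, kappa, k)
    print(direct == closed)
    print(r == 1 - Fr(theta) / (Fr(kappa) + Fr(theta)))
```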

A.6 Proof of Lemma 4.2

We will derive the first bound in (59); the second bound is similar and will be omitted.

Notice that

𝔼ξ,u[Fρx(x,y,ξ,u)2]\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]
=\displaystyle= 𝔼ξ[𝔼u[Fρx(x,y,ξ,u)2]]\displaystyle\mathbb{E}_{\xi}\left[\mathbb{E}_{u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]\right]
(48)\displaystyle\overset{\eqref{smooth-x-sm}}{\leq} 𝔼ξ[2nxf^(x,y,ξ)2]+ρx2L2n22\displaystyle\mathbb{E}_{\xi}\left[2n\|\nabla_{x}\hat{f}(x,y,\xi)\|^{2}\right]+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}
=\displaystyle= 2n(𝔼ξ[xf(x,y)2+2xf(x,y)(xf^(x,y,ξ)xf(x,y))+xf^(x,y,ξ)xf(x,y)2])\displaystyle 2n\left(\mathbb{E}_{\xi}\left[\|\nabla_{x}f(x,y)\|^{2}+2\nabla_{x}f(x,y)^{\top}\left(\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\right)+\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\right)
+ρx2L2n22\displaystyle+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}
=(40)\displaystyle\overset{\eqref{so-x-mean}}{=} 2n(xf(x,y)2+𝔼ξ[xf^(x,y,ξ)xf(x,y)2])+ρx2L2n22\displaystyle 2n\left(\|\nabla_{x}f(x,y)\|^{2}+\mathbb{E}_{\xi}\left[\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\right)+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}
\displaystyle\leq 2n(M2+σ2)+ρx2L2n22.\displaystyle 2n\left(M^{2}+\sigma^{2}\right)+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}.

Further note that

𝔼ξ,u[Fρx(x,y,ξ,u)xfρx(x,y)2]\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)-\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]
=\displaystyle= 𝔼ξ,u[Fρx(x,y,ξ,u)22Fρx(x,y,ξ,u)xfρx(x,y)+xfρx(x,y)2]\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}-2F_{\rho_{x}}(x,y,\xi,u)^{\top}\nabla_{x}f_{\rho_{x}}(x,y)+\|\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]
=(55)\displaystyle\overset{\eqref{SZG-mean}}{=} 𝔼ξ,u[Fρx(x,y,ξ,u)2xfρx(x,y)2]\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}-\|\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]
\displaystyle\leq 𝔼ξ,u[Fρx(x,y,ξ,u)2]\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]
\displaystyle\leq 2n(M2+σ2)+ρx2L2n22,\displaystyle 2n(M^{2}+\sigma^{2})+\frac{\rho_{x}^{2}L^{2}n^{2}}{2},

completing the proof for (59).

A.7 Proof of Lemma 4.3

The line of reasoning in the proof of this lemma is very similar to that in Appendix A.1, with the stochastic mapping $\hat{F}(z^{k})$ replaced by the stochastic zeroth-order gradient $\hat{F}^{k}_{\rho}(z^{k})$. Therefore, we refrain from repeating the analysis and highlight the main differences instead. First, for $F(z)=\begin{pmatrix}\nabla_{x}f(x,y)\\ -\nabla_{y}f(x,y)\end{pmatrix}$, we have

$$\|F(z)-F(z^{\prime})\|\leq L\|z-z^{\prime}\|,\quad\forall z,z^{\prime}\in\mathcal{Z},\tag{91}$$

where $L=2\cdot\max(L_{x},L_{y},L_{xy})$, because

F(z)F(z)2\displaystyle\|F(z)-F(z^{\prime})\|^{2}
=\displaystyle= xf(x,y)xf(x,y)2+yf(x,y)yf(x,y)2\displaystyle\|\nabla_{x}f(x,y)-\nabla_{x}f(x^{\prime},y^{\prime})\|^{2}+\|\nabla_{y}f(x,y)-\nabla_{y}f(x^{\prime},y^{\prime})\|^{2}
\displaystyle\leq 2Lx2xx2+2Lxy2yy2+2Ly2yy2+2Lxy2xx2\displaystyle 2L_{x}^{2}\|x-x^{\prime}\|^{2}+2L_{xy}^{2}\|y-y^{\prime}\|^{2}+2L_{y}^{2}\|y-y^{\prime}\|^{2}+2L_{xy}^{2}\|x-x^{\prime}\|^{2}
\displaystyle\leq L2(xx2+yy2)=L2zz2.\displaystyle L^{2}(\|x-x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2})=L^{2}\|z-z^{\prime}\|^{2}.
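To illustrate the constant $L=2\cdot\max(L_{x},L_{y},L_{xy})$, consider a toy quadratic saddle function $f(x,y)=\frac{a}{2}x^{2}+bxy-\frac{c}{2}y^{2}$ with scalar $x,y$. This example, and the identification $L_{x}=|a|$, $L_{y}=|c|$, $L_{xy}=|b|$, are assumptions made here for illustration and do not come from the paper. The sketch samples random pairs of points and reports the largest observed ratio $\|F(z)-F(z')\|/\|z-z'\|$.

```python
import math
import random


def F(z, a, b, c):
    """F(z) = (grad_x f, -grad_y f) for f = a/2 x^2 + b x y - c/2 y^2."""
    x, y = z
    return (a * x + b * y, -(b * x - c * y))


def lipschitz_ratio(z, zp, a, b, c):
    """Return ||F(z) - F(z')|| / ||z - z'|| for two distinct points."""
    Fz, Fzp = F(z, a, b, c), F(zp, a, b, c)
    num = math.hypot(Fz[0] - Fzp[0], Fz[1] - Fzp[1])
    den = math.hypot(z[0] - zp[0], z[1] - zp[1])
    return num / den


if __name__ == "__main__":
    random.seed(0)
    a, b, c = 2.0, 3.0, 1.0  # hypothetical coefficients
    L = 2 * max(abs(a), abs(b), abs(c))
    worst = max(
        lipschitz_ratio(
            (random.uniform(-1, 1), random.uniform(-1, 1)),
            (random.uniform(-1, 1), random.uniform(-1, 1)),
            a, b, c,
        )
        for _ in range(1000)
    )
    print("largest ratio:", worst, "bound L:", L)
```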

Next, by denoting $\varepsilon_{z^{k}}=\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})\|$, we have

F^ρk(zk)F^ρk+0.5(zk+0.5)\displaystyle\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\| (92)
=\displaystyle= F^ρk(zk)Fρ(zk)+Fρ(zk+0.5)F^ρ(zk+0.5)+Fρ(zk)Fρ(zk+0.5)\displaystyle\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})+F_{\rho}(z^{k+0.5})-\hat{F}_{\rho}(z^{k+0.5})+F_{\rho}(z^{k})-F_{\rho}(z^{k+0.5})\|
\displaystyle\leq εzk+εzk+0.5+Fρ(zk)Fρ(zk+0.5)\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+\|F_{\rho}(z^{k})-F_{\rho}(z^{k+0.5})\|
=\displaystyle= εzk+εzk+0.5+Fρ(zk)F(zk)+F(zk+0.5)Fρ(zk+0.5)+F(zk)F(zk+0.5)\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+\|F_{\rho}(z^{k})-F(z^{k})+F(z^{k+0.5})-F_{\rho}(z^{k+0.5})+F(z^{k})-F(z^{k+0.5})\|
(44),(91)\displaystyle\overset{\eqref{smooth-grad-bd},\eqref{lip-minmax}}{\leq} εzk+εzk+0.5+Lρx2n2+ρy2m2+Lzkzk+0.5.\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+L\|z^{k}-z^{k+0.5}\|.

Therefore, analogously to the bound in (75), we have

αzk+1zk+0.5F^ρk(zk)F^ρk+0.5(zk+0.5)\displaystyle\alpha\|z^{k+1}-z^{k+0.5}\|\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\|
\displaystyle\leq 12(zk+1zk+0.52+α2F^ρk(zk)F^ρk+0.5(zk+0.5)2)\displaystyle\frac{1}{2}\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\|^{2}\right)
\displaystyle\leq 12zk+1zk+0.52+2α2(εzk+εzk+0.5)2+2α2L2(ρx2n2+ρy2m2)+α2L2zkzk+0.52.\displaystyle\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+2\alpha^{2}L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\alpha^{2}L^{2}\|z^{k}-z^{k+0.5}\|^{2}.

Similarly, analogously to the bound in (79), we have

τ(zk+1z)(F^ρk(zk)F^ρk1(zk1))\displaystyle-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)
\displaystyle\leq τzk+1zF^ρk(zk)F^ρk1(zk1)\displaystyle\tau\|z^{k+1}-z^{*}\|\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\|
\displaystyle\leq τLzk+1z(1L(εzk+εzk1)+ρx2n2+ρy2m2+zkzk1)\displaystyle\tau L\|z^{k+1}-z^{*}\|\left(\frac{1}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})+\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+\|z^{k}-z^{k-1}\|\right)
\displaystyle\leq τL2zk+1z2+2τL(εzk+εzk1)2+2τL(ρx2n2+ρy2m2)+τLzkzk12\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\tau L\|z^{k}-z^{k-1}\|^{2}
\displaystyle\leq τL2zk+1z2+2τL(εzk+εzk1)2+2τL(ρx2n2+ρy2m2)\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})
+2τLzkz2+2τLzk1z2.\displaystyle+2\tau L\|z^{k}-z^{*}\|^{2}+2\tau L\|z^{k-1}-z^{*}\|^{2}.

Therefore, we reach the bound

$$\begin{aligned}
& (1-4|\gamma-\beta|-\tau L)\|z^{k+1}-z^{*}\|^{2} \\
\leq{} & \left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)\|z^{k}-z^{*}\|^{2} \\
& +\left(2|\gamma-\beta|+2\gamma+4\tau L\right)\|z^{k-1}-z^{*}\|^{2} \\
& +\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\|z^{k+0.5}-z^{k}\|^{2} \\
& +4\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\frac{4\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2} \\
& +4(\alpha^{2}L^{2}+\tau L)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}) \\
& -2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k}).
\end{aligned}$$

By (61), we have $\mathbb{E}[\varepsilon^{2}_{z^{k}}]\leq\frac{2\tilde{\sigma}^{2}}{t_{k}}$. Taking expectations on both sides of the above inequality, we have

$$\begin{aligned}
& (1-4|\gamma-\beta|-\tau L)d_{k+1} \\
\leq{} & \left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)d_{k} \\
& +\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1} \\
& +\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right] \\
& +16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right)+4(\alpha^{2}L^{2}+\tau L)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}) \\
& +\mathbb{E}\left[-2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k})\right].
\end{aligned}\tag{93}$$

Note that

𝔼[(zk+0.5z)F^ρk+0.5(zk+0.5)]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\right]
=\displaystyle= 𝔼[(zk+0.5z)F(zk+0.5)]+𝔼[(zk+0.5z)(F^ρk+0.5(zk+0.5)F(zk+0.5))],\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]+\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right],

where

$$\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]\geq\mathbb{E}\left[\mu\|z^{k+0.5}-z^{*}\|^{2}\right]\geq\frac{\mu d_{k}}{2}-\mu\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right].$$

Let us denote

$$\begin{aligned}
\xi^{k}_{[t_{k}]}&:=(\xi^{k}_{1},\ldots,\xi^{k}_{t_{k}}), & w^{k}_{[t_{k}]}&:=(w^{k}_{1},\ldots,w^{k}_{t_{k}}), \\
\xi^{[k]}&:=(\xi^{1}_{[t_{1}]},\ldots,\xi^{k}_{[t_{k}]}), & w^{[k]}&:=(w^{1}_{[t_{1}]},\ldots,w^{k}_{[t_{k}]})
\end{aligned}$$

as the collection of all random vectors at iteration $k$ and the collection of all such random vectors from iteration $1$ to $k$, respectively. Notice that, given $(\xi^{[k]},w^{[k]})$, $z^{k+0.5}$ is a deterministic vector. We then have the following bound:

𝔼[(zk+0.5z)(F^ρk+0.5(zk+0.5)F(zk+0.5))]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right]
=\displaystyle= 𝔼[𝔼[(zk+0.5z)(F^ρk+0.5(zk+0.5)F(zk+0.5))|ξ[k],w[k]]]\displaystyle\mathbb{E}\left[\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\left|\xi^{[k]},w^{[k]}\right.\right]\right]
=(60)\displaystyle\overset{\eqref{batch-mean}}{=} 𝔼[(zk+0.5z)(Fρ(zk+0.5)F(zk+0.5))]\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(F_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right]
\displaystyle\geq 𝔼[zk+0.5zFρ(zk+0.5)F(zk+0.5)]\displaystyle-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\left\|F_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right\|\right]
(44)\displaystyle\overset{\eqref{smooth-grad-bd}}{\geq} Lρx2n2+ρy2m22𝔼[zk+0.5z]\displaystyle-\frac{L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2}\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\right]
\displaystyle\geq LDρx2n2+ρy2m22,\displaystyle-\frac{LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2},

where in the last inequality we utilize the boundedness assumption on $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and denote $D=\max\limits_{z,z^{\prime}\in\mathcal{Z}}\|z-z^{\prime}\|$. Combining the above bounds into (93), the desired result follows.

A.8 Proof of Proposition 4.4

Note that the variance term is upper bounded by

$$16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right).$$

Since we take $t_{k}=t_{k+0.5}$ and $\frac{1}{t_{k-1}}=\frac{1}{t_{k}}\left(1-\frac{1}{256\kappa}\right)^{-1}\leq\frac{2}{t_{k}}$, the above expression is in turn bounded by

$$\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{48\tilde{\sigma}^{2}}{t_{k}}.$$
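The factor 48 can be checked with exact arithmetic. The sketch below is illustrative and rests on two assumptions: that $\alpha=\frac{1}{4L}$ and $\tau=\frac{1}{64L\kappa}$ as in the parameter choice of Proposition 3.3, and that the sample sizes satisfy $t_{k}/t_{k-1}=\left(1-\frac{1}{256\kappa}\right)^{-1}$ exactly. The common factor $\tilde{\sigma}^{2}$ is dropped from both sides.

```python
from fractions import Fraction as Fr


def variance_terms(kappa, L, t_k):
    """Return (exact variance coefficient, simplified upper bound)."""
    kappa, L, t_k = Fr(kappa), Fr(L), Fr(t_k)
    alpha = 1 / (4 * L)                    # assumed step size
    tau = 1 / (64 * L * kappa)             # assumed optimism parameter
    growth = 1 / (1 - 1 / (256 * kappa))   # assumed ratio t_k / t_{k-1}
    t_half = t_k                           # t_{k+0.5} = t_k
    t_prev = t_k / growth                  # so that 1/t_{k-1} = growth / t_k
    exact = 16 * (
        (alpha**2 + tau / L) / t_k + alpha**2 / t_half + tau / (L * t_prev)
    )
    bound = 48 * (alpha**2 + tau / L) / t_k
    return exact, bound


if __name__ == "__main__":
    for kappa in (1, 5, 100):
        exact, bound = variance_terms(kappa, L=2, t_k=10)
        print(kappa, exact <= bound)
```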

By substituting the specific parameter choice into (65), starting from the last iteration $K$, we have

(1164κ)dK\displaystyle\left(1-\frac{1}{64\kappa}\right)d_{K}
\displaystyle\leq (118κ)dK1+332κdK2+σ~2tK1(3L2+34L2κ)\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{\tilde{\sigma}^{2}}{t_{K-1}}\cdot\left(\frac{3}{L^{2}}+\frac{3}{4L^{2}\kappa}\right)
+(116κ+14)(ρx2n2+ρy2m2)+Dρx2n2+ρy2m24\displaystyle+\left(\frac{1}{16\kappa}+\frac{1}{4}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\frac{D\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{4}
\displaystyle\leq (118κ)dK1+332κdK2+15σ~24tK1L2+5(ρx2n2+ρy2m2)16+Dρx2n2+ρy2m24.\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{15\tilde{\sigma}^{2}}{4t_{K-1}L^{2}}+\frac{5(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{16}+\frac{D\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{4}.
\displaystyle\leq (118κ)dK1+332κdK2+15σ~24tK1L2+(516+D4)C1Kκ.\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{15\tilde{\sigma}^{2}}{4t_{K-1}L^{2}}+\left(\frac{5}{16}+\frac{D}{4}\right)\frac{C_{1}^{K}}{\kappa}.

In the last inequality, we denote $C_{1}=1-\frac{1}{256\kappa}$ and take $\rho_{x}=\frac{C_{1}^{K}}{\sqrt{2}n\kappa}$, $\rho_{y}=\frac{C_{1}^{K}}{\sqrt{2}m\kappa}$, so that $\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}=\frac{C_{1}^{2K}}{\kappa^{2}}$, and use the facts that $C_{1}^{2K}\leq C_{1}^{K}$ and $\kappa\geq 1$.

Dividing both sides by $1-\frac{1}{64\kappa}$ and noting that $\left(1-\frac{1}{64\kappa}\right)^{-1}\leq\frac{64}{63}$, we obtain

\[
\begin{aligned}
d_{K} &\leq \frac{1-\frac{1}{8\kappa}}{1-\frac{1}{64\kappa}}d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}\\
&= \left(1-\frac{\frac{7}{64\kappa}}{1-\frac{1}{64\kappa}}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}\\
&\leq \left(1-\frac{7}{64\kappa}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}\\
&= \left(1-\frac{7}{64\kappa}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa},
\end{aligned}
\]

where the last line collects the constants as $c_{2}=\frac{80\tilde{\sigma}^{2}}{21L^{2}}$ and $c_{3}=\frac{20+16D}{63}$.

By moving a part of $d_{K-1}$ to the LHS, that is, by writing $1-\frac{7}{64\kappa}=C_{1}-\frac{27}{256\kappa}$, we have:

\[
\begin{aligned}
d_{K}+\frac{27}{256\kappa}d_{K-1}
&\leq C_{1}\cdot d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa}\\
&\leq C_{1}\left(d_{K-1}+\frac{27}{256\kappa}d_{K-2}\right)+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa}\\
&\leq C_{1}^{K}\left(d_{0}+\frac{27}{256}d_{-1}\right)+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+c_{3}\frac{C_{1}^{K}}{\kappa}\cdot\sum\limits_{i=1}^{K}C_{1}^{i-1}\\
&= C_{1}^{K}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+c_{3}\frac{C_{1}^{K}}{\kappa}\cdot\frac{1-C_{1}^{K}}{\frac{1}{256\kappa}}\\
&\leq C_{1}^{K}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+256c_{3}C_{1}^{K}.
\end{aligned}
\]

Here the second inequality uses $\frac{2}{21}\leq\frac{27}{256}C_{1}$, which holds since $C_{1}\geq\frac{255}{256}$, and the third inequality applies the same relation recursively down to iteration $0$ and then uses $\kappa\geq 1$.

With the sample size $t_{k}=K\left(1-\frac{1}{256\kappa}\right)^{-(k+1)}=K\cdot C_{1}^{-(k+1)}$, each summand $t_{K-i}^{-1}C_{1}^{i-1}$ equals $\frac{C_{1}^{K}}{K}$, and therefore we have:

\[
d_{K}\leq\left(1-\frac{1}{256\kappa}\right)^{K}\left(\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}+256c_{3}\right).
\]
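As a numerical sanity check of the two steps above, the following minimal sketch evaluates the noise sum under the geometric batch schedule and iterates the one-step bound with equality (the worst case). The function names, the test values of $\kappa$, $K$, $c_{2}$, $c_{3}$, and the initialization $d_{-1}=d_{0}$ are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def noise_sum(kappa, K):
    """Sum_{i=1}^{K} C1^{i-1} / t_{K-i} with t_k = K * C1^{-(k+1)}."""
    C1 = 1.0 - 1.0 / (256.0 * kappa)
    total = 0.0
    for i in range(1, K + 1):
        t = K * C1 ** (-(K - i + 1))  # batch size t_{K-i}
        total += C1 ** (i - 1) / t
    return total


def simulate_recursion(kappa, K, d0, c2, c3):
    """Iterate the one-step bound with equality and return (d_K, final bound)."""
    C1 = 1.0 - 1.0 / (256.0 * kappa)
    d_prev, d_curr = d0, d0  # assumption: d_{-1} = d_0
    for k in range(1, K + 1):
        t_km1 = K * C1 ** (-k)  # t_{k-1} = K * C1^{-k}
        d_next = ((1.0 - 7.0 / (64.0 * kappa)) * d_curr
                  + 2.0 / (21.0 * kappa) * d_prev
                  + c2 / t_km1
                  + c3 * C1 ** K / kappa)
        d_prev, d_curr = d_curr, d_next
    bound = C1 ** K * (283.0 / 256.0 * d0 + c2 + 256.0 * c3)
    return d_curr, bound


if __name__ == "__main__":
    kappa, K = 3.0, 300
    C1 = 1.0 - 1.0 / (256.0 * kappa)
    print(noise_sum(kappa, K), C1 ** K)
    print(simulate_recursion(kappa, K, d0=1.0, c2=1.0, c3=1.0))
```

The first printed pair should agree up to rounding, and the simulated $d_{K}$ should sit below the closed-form bound; this only illustrates the inequality on one instance and is not a substitute for the proof.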

It is then straightforward to see that for $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ we have $d_{K}\leq\epsilon$, with the sample complexity given by

\[
\sum\limits_{k=0}^{K-1}(t_{k}+t_{k+0.5})=2\sum\limits_{k=0}^{K-1}t_{k}=2K\cdot\frac{1-C_{1}^{-K}}{C_{1}(1-C_{1}^{-1})}=\frac{2K(C_{1}^{-K}-1)}{1-C_{1}}.
\]

Using the more precise expression $K=\log_{C_{1}^{-1}}\left(\frac{1}{\epsilon}\right)$, we have $C_{1}^{-K}=\mathcal{O}\left(\frac{1}{\epsilon}\right)$ and $1-C_{1}=\frac{1}{256\kappa}$, so the combined sample complexity is $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$.
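The sample count can also be evaluated directly. The sketch below, with hypothetical names and test values that are not from the paper, sums the batch sizes $t_{k}=K\cdot C_{1}^{-(k+1)}$ term by term and compares the result with the closed form $\frac{2K(C_{1}^{-K}-1)}{1-C_{1}}$; rounding $K$ up to an integer is an assumption of this illustration.

```python
import math


def extra_point_sample_count(kappa, eps):
    """Return (K, direct sum, closed form) for the total number of samples."""
    C1 = 1.0 - 1.0 / (256.0 * kappa)
    # K = log_{1/C1}(1/eps), rounded up to an integer (illustrative choice).
    K = math.ceil(math.log(1.0 / eps) / math.log(1.0 / C1))
    # Two batches of equal size per iteration: t_k and t_{k+0.5}.
    direct = 2.0 * sum(K * C1 ** (-(k + 1)) for k in range(K))
    closed = 2.0 * K * (C1 ** (-K) - 1.0) / (1.0 - C1)
    return K, direct, closed


if __name__ == "__main__":
    print(extra_point_sample_count(kappa=2.0, eps=1e-2))
```

Since $K$ scales like $\kappa\ln(1/\epsilon)$, $C_{1}^{-K}$ like $1/\epsilon$, and $\frac{1}{1-C_{1}}$ like $\kappa$, the closed form makes the $\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)$ scaling visible.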

A.9 Proof of Lemma 4.5

The logic is similar to that of the proof in Appendix A.4, so we shall focus on the main differences between the two proofs.

Firstly, with a derivation similar to (92), we have the following bound:

\[
\begin{aligned}
\tau^{2}\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\|^{2}
&\leq \tau^{2}\left(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}}+L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+L\|z^{k}-z^{k-1}\|\right)^{2}\\
&\leq \tau^{2}\left(4(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+4L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+2L^{2}\|z^{k}-z^{k-1}\|^{2}\right).
\end{aligned}
\]

By using the variance bound in (61), we reach the next inequality, which is similar to the step in (89):

\[
\begin{aligned}
&\mathbb{E}\left[\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+1}_{\rho}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\
&\leq \mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\
&\quad+16\tau^{2}\sigma^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+4\tau^{2}L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})\\
&\quad-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\right].
\end{aligned}
\tag{94}
\]

Denote by

\[
\begin{aligned}
\xi^{k}_{[t_{k}]}&=(\xi^{k}_{1},...,\xi^{k}_{t_{k}}), & w^{k}_{[t_{k}]}&=(w^{k}_{1},...,w^{k}_{t_{k}}),\\
\xi^{[k]}&=(\xi^{1}_{[t_{1}]},...,\xi^{k}_{[t_{k}]}), & w^{[k]}&=(w^{1}_{[t_{1}]},...,w^{k}_{[t_{k}]})
\end{aligned}
\]

the collection of all random vectors at iteration $k$ and the collection of all such random vectors from iteration $1$ to $k$, respectively. Noting that $z^{k+1}$ is a deterministic vector given $(\xi^{[k]},w^{[k]})$, we have the following bound:

\[
\begin{aligned}
&\mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\right]\\
&= \mathbb{E}\left[\mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\left|\xi^{[k]},w^{[k]}\right.\right]\right]\\
&\overset{(60)}{=} \mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(F_{\rho}(z^{k+1})-F(z^{k+1})\right)\right]\\
&\geq -\mathbb{E}\left[\|z^{k+1}-z^{*}\|\left\|F_{\rho}(z^{k+1})-F(z^{k+1})\right\|\right]\\
&\overset{(44)}{\geq} -\mathbb{E}\left[\|z^{k+1}-z^{*}\|\cdot\frac{L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2}\right]\\
&\geq -\mathbb{E}\left[\frac{\mu}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{8\mu}\right]\\
&= -\frac{\mu}{2}d_{k+1}-\frac{L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{8\mu}.
\end{aligned}
\]

Substituting the above bound into (94), the desired result follows.

A.10 Proof of Proposition 4.6

Let us start from the potential function inequality (68) at iteration $K$. With $\theta=\frac{1}{8}$, let us also denote $C_{1}=\left(1+\frac{1}{8\kappa}\right)^{-1}=1-\frac{1}{8\kappa+1}$, and take $\rho_{x}=\frac{\sqrt{C_{1}^{K}}}{\sqrt{2}n\kappa}$, $\rho_{y}=\frac{\sqrt{C_{1}^{K}}}{\sqrt{2}m\kappa}$. Note that the $C_{1}$ defined here is only for this proof, not to be confused with that used in the proof in Appendix A.8. Then we have

\[
\begin{aligned}
V_{K} &\leq C_{1}V_{K-1}+16C_{1}\tau^{2}\sigma^{2}\left(\frac{1}{t_{K-1}}+\frac{1}{t_{K-2}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)\cdot C_{1}\cdot\frac{C_{1}^{K}}{\kappa^{2}}\\
&\leq C_{1}^{K}V_{0}+48\tau^{2}\tilde{\sigma}^{2}\sum\limits_{k=0}^{K-1}\frac{C_{1}^{K-k}}{t_{k}}+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)\cdot\sum\limits_{k=1}^{K}C_{1}^{k}\cdot\frac{C_{1}^{K}}{\kappa^{2}}.
\end{aligned}
\]

In the second inequality, we take $t_{k}=K\cdot C_{1}^{-k}$ and note that $\frac{1}{t_{k-1}}=\frac{1}{C_{1}t_{k}}\leq\frac{2}{t_{k}}$. Then we have $\sum\limits_{k=0}^{K-1}\frac{C_{1}^{K-k}}{t_{k}}=C_{1}^{K}$. In addition,

\[
\sum\limits_{k=1}^{K}C_{1}^{k}=\frac{C_{1}(1-C_{1}^{K})}{1-C_{1}}\leq\frac{C_{1}}{1-C_{1}}=8\kappa.
\]

Therefore, we have

\[
V_{K}\leq C_{1}^{K}V_{0}+48\tau^{2}\tilde{\sigma}^{2}\cdot C_{1}^{K}+L^{2}\left(32\tau^{2}+\frac{\alpha}{\mu}\right)\frac{C_{1}^{K}}{\kappa}.
\tag{95}
\]
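The two sums used to arrive at (95) are easy to confirm numerically. The following sketch is only an illustration under assumed test values of $\kappa$ and $K$ (the function name is hypothetical); it evaluates both sums for the batch schedule $t_{k}=K\cdot C_{1}^{-k}$.

```python
import math


def momentum_sums(kappa, K):
    """Return (C1, noise sum, geometric sum) for the extra-momentum proof."""
    C1 = 1.0 / (1.0 + 1.0 / (8.0 * kappa))
    # Sum_{k=0}^{K-1} C1^{K-k} / t_k with t_k = K * C1^{-k}; expected value C1^K.
    noise = sum(C1 ** (K - k) / (K * C1 ** (-k)) for k in range(K))
    # Sum_{k=1}^{K} C1^k; expected to be at most 8 * kappa.
    geometric = sum(C1 ** k for k in range(1, K + 1))
    return C1, noise, geometric


if __name__ == "__main__":
    kappa, K = 5.0, 400
    C1, noise, geometric = momentum_sums(kappa, K)
    print(noise, C1 ** K)
    print(geometric, 8.0 * kappa)
```

Each summand of the first sum equals $\frac{C_{1}^{K}}{K}$, which is why the geometric growth of the batch size exactly cancels the accumulated noise weights.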

Now, let us lower bound $V_{k}$ by observing:

\[
\begin{aligned}
&\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)\\
&\geq -\tau\|z^{k}-z^{*}\|\|\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\|\\
&\geq -\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\|\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\|^{2}\\
&\overset{(92)}{\geq} -\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\left(2L^{2}\|z^{k-1}-z^{k}\|^{2}+4(\varepsilon_{z^{k-1}}+\varepsilon_{z^{k}})^{2}+4L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})\right).
\end{aligned}
\]

Then we have

\[
\begin{aligned}
V_{K} &\geq \frac{1}{4}d_{K}-16\tau^{2}\tilde{\sigma}^{2}\left(\frac{1}{t_{K}}+\frac{1}{t_{K-1}}\right)-\frac{4\tau^{2}L^{2}C_{1}^{K}}{\kappa^{2}}\\
&\geq \frac{1}{4}d_{K}-48\tau^{2}\tilde{\sigma}^{2}C_{1}^{K}-\frac{4\tau^{2}L^{2}C_{1}^{K}}{\kappa^{2}}.
\end{aligned}
\]

Combining with (95), we have

\[
d_{K}\leq C_{1}^{K}\cdot\left(4V_{0}+384\tau^{2}\tilde{\sigma}^{2}+4L^{2}\left(\frac{32\tau^{2}}{\kappa}+\frac{4\tau^{2}}{\kappa^{2}}+\frac{\alpha}{\mu\kappa}\right)\right)=C_{1}^{K}\mathcal{O}(1).
\]

It follows immediately that for $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ we have $d_{K}\leq\epsilon$. With the more precise expression $K=\log_{C_{1}^{-1}}\left(\frac{1}{\epsilon}\right)$, so that $C_{1}^{-K}=\frac{1}{\epsilon}$, the sample complexity can be estimated as

\[
\sum\limits_{k=0}^{K-1}t_{k}=K\cdot\frac{C_{1}^{-K}-1}{C_{1}^{-1}-1}=\frac{KC_{1}(C_{1}^{-K}-1)}{1-C_{1}}\leq\frac{K}{\epsilon(1-C_{1})}=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right),
\]

where the last step uses $\frac{1}{1-C_{1}}=8\kappa+1$.

Appendix B Proof of the uniform sublinear convergence of the stochastic extra-point method

In order to establish a uniform sublinear convergence, we have to consider parameters that diminish with the iteration number $k$. Let us return to the one-iteration relation (10) and consider the following choice of parameters:

\[
(\alpha^{(k)},\beta^{(k)},\gamma^{(k)},\eta^{(k)},\tau^{(k)})=\left(\frac{2}{(k+2)\mu},\frac{\alpha^{2}\mu^{2}}{128},\frac{\alpha^{2}\mu^{2}}{128},\frac{2}{(k+2)\mu},\frac{\alpha^{2}\mu}{128\kappa}\right),
\]

where we omit the superscript $(k)$ of $\alpha$ on the RHS for notational simplicity. Note that here $\alpha=\alpha^{(k)}$ depends on the iteration $k$; we follow the same simplification for the other parameters throughout the rest of the proof in this appendix unless noted otherwise.
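For concreteness, the diminishing schedule can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and it assumes the condition number is $\kappa=L/\mu$.

```python
def extra_point_parameters(k, mu, L):
    """Return (alpha, beta, gamma, eta, tau) for iteration k of the schedule."""
    kappa = L / mu  # assumption: condition number kappa = L / mu
    alpha = 2.0 / ((k + 2) * mu)
    beta = alpha ** 2 * mu ** 2 / 128.0
    gamma = alpha ** 2 * mu ** 2 / 128.0
    eta = 2.0 / ((k + 2) * mu)
    tau = alpha ** 2 * mu / (128.0 * kappa)
    return alpha, beta, gamma, eta, tau


if __name__ == "__main__":
    mu, L = 0.5, 4.0
    for k in range(3):
        alpha, beta, gamma, eta, tau = extra_point_parameters(k, mu, L)
        # Under this schedule, tau * L and 4 * gamma depend on k only.
        print(k, alpha * mu, tau * L, 4.0 * gamma)
```

Under the stated assumption, $\alpha\mu=\frac{2}{k+2}$, $\tau L=\frac{1}{32(k+2)^{2}}$ and $4\gamma=\frac{1}{8(k+2)^{2}}$, which are the quantities substituted into the one-iteration relation later in this appendix.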

By using the fact $2\alpha\leq\frac{1}{\mu}+\alpha^{2}\mu$ (equivalently, $2\alpha\mu-1\leq\alpha^{2}\mu^{2}$) together with $\beta=\gamma$, we have:

\[
\begin{aligned}
&\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&\leq \left(2\alpha^{2}L^{2}+2\gamma+\alpha^{2}\mu^{2}\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&= \alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&\leq \alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)D^{2},
\end{aligned}
\]

where in the last inequality we use the boundedness of the feasible set.

Therefore, we can rewrite (10) as:

\[
\begin{aligned}
(1-\tau L)d_{k+1}
&\leq \left(1+4\gamma+4\tau L-\alpha\mu\right)d_{k}+\left(2\gamma+4\tau L\right)d_{k-1}\\
&\quad+\alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)D^{2}+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D\\
&= \left(1+4\gamma+4\tau L-\alpha\mu\right)d_{k}+\left(2\gamma+4\tau L\right)d_{k-1}\\
&\quad+\frac{4}{(k+2)^{2}}\cdot\underbrace{\left(2\kappa^{2}D^{2}+\frac{D^{2}}{64}+D^{2}+\frac{8\sigma^{2}}{\mu^{2}}+\frac{\sigma^{2}}{128L^{2}}\right)}_{G}+\frac{4\delta D}{(k+2)\mu}.
\end{aligned}
\]

Substituting the parameters with their respective values in the rest of the terms, we obtain:

\[
\begin{aligned}
\left(1-\frac{1}{32(k+2)}\right)d_{k+1}
&\leq \left(1-\frac{1}{32(k+2)^{2}}\right)d_{k+1}\\
&\leq \left(1+\frac{1}{8(k+2)^{2}}+\frac{1}{8(k+2)^{2}}-\frac{2}{k+2}\right)d_{k}\\
&\quad+\left(\frac{1}{16(k+2)^{2}}+\frac{1}{8(k+2)^{2}}\right)d_{k-1}+\frac{4G}{(k+2)^{2}}+\frac{4\delta D}{(k+2)\mu}\\
&\leq \left(1-\frac{7}{4(k+2)}\right)d_{k}+\frac{3}{16(k+2)}d_{k-1}+\frac{4G}{(k+2)^{2}}+\frac{4\delta D}{(k+2)\mu}.
\end{aligned}
\]

Dividing both sides by $1-\frac{1}{32(k+2)}$, and noting that $\left(1-\frac{1}{32(k+2)}\right)^{-1}\leq\frac{32}{31}$, it follows that

\[
\begin{aligned}
d_{k+1} &\leq \frac{1-\frac{7}{4(k+2)}}{1-\frac{1}{32(k+2)}}d_{k}+\frac{6}{31(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}\\
&= \left(1-\frac{\frac{55}{32(k+2)}}{1-\frac{1}{32(k+2)}}\right)d_{k}+\frac{6}{31(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}\\
&\leq \left(1-\frac{55}{32(k+2)}\right)d_{k}+\frac{7}{32(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}.
\end{aligned}
\]

From the above one-iteration inequality, we claim that

\[
d_{k}\leq\frac{Q}{k+2}+\frac{256\delta D}{93\mu},\quad\forall k\geq 0,
\]

where

\[
Q=\max\left\{\frac{133G}{9},2\|z^{0}-z^{*}\|^{2}\right\},
\]

and we shall prove the inequality by induction. For $k=0$, the inequality holds trivially:

\[
d_{0}=\|z^{0}-z^{*}\|^{2}\leq\frac{Q}{2}.
\]

Assuming the inequality holds for all indices $0,...,k$, we then have

\[
\begin{aligned}
d_{k+1} &\leq \left(1-\frac{55}{32(k+2)}\right)d_{k}+\frac{7}{32(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}\\
&\leq \left(1-\frac{55}{32(k+2)}\right)\left(\frac{Q}{k+2}+\frac{256\delta D}{93\mu}\right)+\frac{7}{32(k+2)}\left(\frac{Q}{k+1}+\frac{256\delta D}{93\mu}\right)\\
&\quad+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}\\
&= \left(1-\frac{55}{32(k+2)}\right)\cdot\frac{Q}{k+2}+\frac{7}{32(k+2)}\cdot\frac{Q}{k+1}+\frac{128G}{31(k+2)^{2}}\\
&\quad+\frac{256\delta D}{93\mu}-\frac{440\delta D}{93(k+2)\mu}+\frac{56\delta D}{93(k+2)\mu}+\frac{128\delta D}{31(k+2)\mu}\\
&= \frac{Q}{k+2}-\frac{55Q}{32(k+2)^{2}}+\frac{7}{32(k+2)}\cdot\frac{Q}{k+1}+\frac{128G}{31(k+2)^{2}}+\frac{256\delta D}{93\mu}\\
&\leq \frac{Q}{k+3}+\frac{Q}{(k+2)^{2}}-\frac{55Q}{32(k+2)^{2}}+\frac{7}{32(k+2)}\cdot\frac{2Q}{k+2}+\frac{133G}{32(k+2)^{2}}+\frac{256\delta D}{93\mu}\\
&= \frac{Q}{k+3}-\frac{9Q}{32(k+2)^{2}}+\frac{133G}{32(k+2)^{2}}+\frac{256\delta D}{93\mu}\\
&\leq \frac{Q}{k+3}+\frac{256\delta D}{93\mu}.
\end{aligned}
\]

Note that in the second-to-last inequality we used the inequalities $\frac{1}{k+2}\leq\frac{1}{k+3}+\frac{1}{(k+2)^{2}}$, $\frac{1}{k+1}\leq\frac{2}{k+2}$ and $\frac{128}{31}\leq\frac{133}{32}$, and in the last inequality we used $Q\geq\frac{133G}{9}$. This completes the proof for the uniform $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate.
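The induction can be illustrated numerically by iterating the one-iteration inequality with equality, which is the worst case it allows, and checking the claimed envelope at every step. This is a minimal sketch: the function name, the test values, and the initialization $d_{-1}=d_{0}$ are assumptions for illustration and do not come from the paper.

```python
def check_sublinear_bound(G, d0, delta, D, mu, num_iters):
    """Return True if d_k <= Q/(k+2) + 256*delta*D/(93*mu) for all k < num_iters."""
    Q = max(133.0 * G / 9.0, 2.0 * d0)
    bias = 256.0 * delta * D / (93.0 * mu)
    d_prev, d_curr = d0, d0  # assumption: d_{-1} = d_0
    for k in range(num_iters):
        envelope = Q / (k + 2) + bias
        # Small relative slack to absorb floating-point rounding.
        if d_curr > envelope * (1.0 + 1e-12):
            return False
        s = k + 2
        d_next = ((1.0 - 55.0 / (32.0 * s)) * d_curr
                  + 7.0 / (32.0 * s) * d_prev
                  + 128.0 * G / (31.0 * s ** 2)
                  + 128.0 * delta * D / (31.0 * s * mu))
        d_prev, d_curr = d_curr, d_next
    return True


if __name__ == "__main__":
    print(check_sublinear_bound(G=1.0, d0=1.0, delta=0.1, D=2.0, mu=1.0, num_iters=1000))
    print(check_sublinear_bound(G=0.01, d0=100.0, delta=0.0, D=1.0, mu=1.0, num_iters=1000))
```

The two calls cover the cases where $Q$ is determined by $G$ and by the initial distance, respectively. The non-vanishing term $\frac{256\delta D}{93\mu}$ reflects the bias $\delta$ of the oracle, which the $\mathcal{O}\left(\frac{1}{k}\right)$ rate cannot remove.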