New First-Order Algorithms for Stochastic Variational Inequalities

Kevin Huang (Department of Industrial and System Engineering, University of Minnesota, huan1741@umn.edu) and Shuzhong Zhang (Department of Industrial and System Engineering, University of Minnesota, zhangs@umn.edu)

Abstract

In this paper, we propose two new solution schemes to solve the stochastic strongly monotone variational inequality problems: the stochastic extra-point solution scheme and the stochastic extra-momentum solution scheme. The first one is a general scheme based on updating the iterative sequence and an auxiliary extra-point sequence. In the case of deterministic VI model, this approach includes several state-of-the-art first-order methods as its special cases. The second scheme combines two momentum-based directions: the so-called heavy-ball direction and the optimism direction, where only one projection per iteration is required in its updating process. We show that, if the variance of the stochastic oracle is appropriately controlled, then both schemes can be made to achieve optimal iteration complexity of $\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ to reach an $\epsilon$ -solution for a strongly monotone VI problem with condition number $\kappa$ . We show that these methods can be readily incorporated in a zeroth-order approach to solve stochastic minimax saddle-point problems, where only noisy and biased samples of the objective can be obtained, with a total sample complexity of $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$ .

Keywords: variational inequality, minimax saddle-point, stochastic first-order method, zeroth-order method.

1 Introduction

Given a constraint set $\mathcal{Z}\subseteq\mathbb{R}^{n}$ and a mapping $F:\mathcal{Z}\rightarrow\mathbb{R}^{n}$ , the classical variational inequality (VI) problem is to find $z^{*}\in\mathcal{Z}$ such that

\displaystyle F(z^{*})^{\top}(z-z^{*})\geq 0,\quad\forall z\in\mathcal{Z}.

(1)

For an introduction to VI and its applications, we refer the readers to Facchinei and Pang [4] and the references therein.

In this paper, we consider a stochastic version of problem (1), where the exact evaluation of the mapping $F(\cdot)$ is inaccessible. Instead, only a stochastic oracle is available. The stochasticity in question may stem from, e.g., the non-deterministic nature of mixed strategies of the players in a game setting, or simply from the difficulty of evaluating the mapping itself. The latter has become more prominent in the literature, owing to the recently found applications of VI as a training/learning subproblem in machine learning and statistical learning. The so-called stochastic oracle is a noisy estimate of the mapping $F(\cdot)$ , and an iterative scheme that incorporates such an oracle is known as stochastic approximation (SA). As far as we know, the first proposal to use such an approach for stochastic optimization can be traced back to the seminal work of Robbins and Monro [31]. In 2008, Jiang and Xu [10] followed the SA approach to solve VI models. Since then, efforts have been made to extend existing deterministic methods to stochastic VI models; see e.g. [11, 41, 12, 15, 9, 7, 8].

Let us start our discussion by introducing the assumptions made throughout the paper. We consider VI model (1) where $\mathcal{Z}$ is a closed convex set. Moreover, the following two conditions are assumed:

\left(F(z)-F(z^{\prime})\right)^{\top}(z-z^{\prime})\geq\mu\|z-z^{\prime}\|^{2},\quad\forall z,z^{\prime}\in\mathcal{Z},

(2)

for some $\mu>0$ , and

\|F(z)-F(z^{\prime})\|\leq L\|z-z^{\prime}\|,\quad\forall z,z^{\prime}\in\mathcal{Z},

(3)

for some $L\geq\mu>0$ . Condition (2) is known as the strong monotonicity of $F$ , while Condition (3) is known as the Lipschitz continuity of $F$ . If Condition (2) is met with $\mu=0$ , then $F$ is known as monotone. VI problems satisfying (2) with positive $\mu$ can be easily shown to have a unique solution $z^{*}$ . Let us denote $\kappa:=\frac{L}{\mu}\geq 1$ . The parameter $\kappa$ is usually known as the condition number of the VI model (1). We also assume

\displaystyle\max\limits_{z,z^{\prime}\in\mathcal{Z}}\|z-z^{\prime}\|\leq D,

(4)

namely the constraint set is bounded. We remark that this assumption can actually be removed without affecting the results, but the analysis then becomes lengthier and more tedious without any conceptual benefit, and so we shall not pursue that generality in this paper.
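As a concrete illustration of Conditions (2) and (3), the following minimal sketch estimates $\mu$ , $L$ and $\kappa$ empirically for a toy operator. The operator (the one associated with the hypothetical saddle function $f(x,y)=\frac{\mu}{2}x^{2}+cxy-\frac{\mu}{2}y^{2}$ ), the constants `MU` and `C`, and the helper `check_conditions` are assumptions made for illustration only; they do not come from the paper.

```python
import math
import random

MU, C = 1.0, 3.0  # hypothetical constants chosen for illustration


def F(z):
    """Operator of the toy saddle function f(x, y) = MU/2*x^2 + C*x*y - MU/2*y^2."""
    x, y = z
    return (MU * x + C * y, -C * x + MU * y)


def check_conditions(trials=1000, seed=0):
    """Empirically estimate the constants in (2) and (3) on random pairs of points."""
    rng = random.Random(seed)
    mu_est, L_est = math.inf, 0.0
    for _ in range(trials):
        z = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        w = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        dz = (z[0] - w[0], z[1] - w[1])
        dF = tuple(a - b for a, b in zip(F(z), F(w)))
        sq = dz[0] ** 2 + dz[1] ** 2
        if sq == 0.0:
            continue
        # Smallest observed ratio in (2) and largest observed ratio in (3).
        mu_est = min(mu_est, (dF[0] * dz[0] + dF[1] * dz[1]) / sq)
        L_est = max(L_est, math.hypot(*dF) / math.sqrt(sq))
    return mu_est, L_est


mu_est, L_est = check_conditions()
print("mu ~", mu_est, " L ~", L_est, " kappa ~", L_est / mu_est)
```

For this particular linear operator the skew-symmetric part cancels in (2), so the estimates should agree with $\mu=$ `MU` and $L=\sqrt{\texttt{MU}^{2}+\texttt{C}^{2}}$ up to rounding.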

The stochastic oracle of the mapping, denoted by $\hat{F}(z,\xi)$ , takes a random sample $\xi\in\Xi$ from some sample space $\Xi$ . The oracle is required to satisfy:

\displaystyle\mathbb{E}\left[\|\hat{F}(z,\xi)-F(z)\|\right]\leq\delta,

(5)

\displaystyle\mathbb{E}\left[\|\hat{F}(z,\xi)-F(z)\|^{2}\right]\leq\sigma^{2},

(6)

for all $z\in\mathcal{Z}$ , where $\delta,\sigma\geq 0$ are some constants. In other words, we assume both the bias and the deviation are uniformly upper-bounded.
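To make the oracle assumptions tangible, here is a minimal sketch of a noisy oracle and of the effect of averaging independent samples on the second-moment bound (6). The toy operator, the Gaussian noise model, and the helper names (`oracle`, `minibatch_oracle`, `mean_sq_error`) are hypothetical choices for illustration, not part of the paper.

```python
import random


def F(z):
    """Toy strongly monotone operator (hypothetical): F(z) = A z, A = [[1, 3], [-3, 1]]."""
    x, y = z
    return [x + 3.0 * y, -3.0 * x + y]


def oracle(z, rng, noise_std=0.5, bias=(0.0, 0.0)):
    """Hypothetical oracle: F(z) plus a fixed bias and zero-mean Gaussian noise."""
    return [f + b + rng.gauss(0.0, noise_std) for f, b in zip(F(z), bias)]


def minibatch_oracle(z, rng, batch, **kw):
    """Average `batch` independent oracle calls at the same point z."""
    total = [0.0] * len(z)
    for _ in range(batch):
        total = [t + g for t, g in zip(total, oracle(z, rng, **kw))]
    return [t / batch for t in total]


def mean_sq_error(z, batch, trials=2000, seed=0):
    """Monte Carlo estimate of E||F_hat(z) - F(z)||^2 for a given batch size."""
    rng = random.Random(seed)
    exact = F(z)
    acc = 0.0
    for _ in range(trials):
        est = minibatch_oracle(z, rng, batch)
        acc += sum((e - f) ** 2 for e, f in zip(est, exact))
    return acc / trials


print(mean_sq_error([0.3, -0.2], batch=1), mean_sq_error([0.3, -0.2], batch=25))
```

With unbiased noise of per-coordinate standard deviation `noise_std` in dimension 2, the single-sample second moment is $2\cdot\texttt{noise\_std}^{2}$ , and averaging $t$ independent samples divides it by $t$ . This is the mechanism by which mini-batching controls $\sigma^{2}$ ; it does not reduce a systematic bias.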

In this paper, we propose two stochastic first-order schemes: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme maintains two sequences of iterates featuring several well-known first-order search directions such as: the extra-gradient [13, 36], the heavy-ball [29], Nesterov’s extrapolation [23, 25], and the optimism direction [30, 20]. The second scheme, on the other hand, specifically combines the heavy-ball momentum and the optimism momentum in its updating formula, and maintains only one sequence throughout the iterations, therefore requiring only one projection per iteration. These two approaches require different types of analysis. Both schemes render a wider range of search directions than the existing first-order methods, and the parameters associated with each search direction could and should be tuned differently from problem class to problem class in order to secure good practical performance. The deterministic counterpart of these methods can be found in our previous work [6]. In the stochastic context, we show that as long as the variance can be reduced throughout the iterations, they yield the optimal iteration complexity $\mathcal{O}(\kappa\ln(1/\epsilon))$ (cf. [42]) to reach an $\epsilon$ -solution: $\|z^{k}-z^{*}\|^{2}\leq\epsilon$ , with an additional bias term depending on $\delta$ . In a later section, we demonstrate an application to the stochastic black-box minimax saddle-point problem where only noisy function values $f(x,y)$ are accessible. This application is particularly relevant to machine learning, where the training data set may be very large and evaluating the exact gradient or function value is usually impractical. Through a smoothing technique, we propose a stochastic zeroth-order gradient as the update direction in either the stochastic extra-point scheme or the stochastic extra-momentum scheme. We show that both approaches yield an iteration complexity of $\mathcal{O}(\kappa\ln(1/\epsilon))$ and a sample complexity of $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$ .

The rest of the paper is organized as follows. In Section 2, we survey the relevant literature with a focus on stochastic VI. In Section 3, we present the main results in this paper, i.e. the convergence results of the two proposed methods, while the technical proofs are relegated to the appendices. In Section 4, we introduce a stochastic black-box saddle-point problem and present the sample complexity results of our methods. We present some promising preliminary numerical results in Section 5 and conclude the paper in Section 6.

2 Literature Review

The first-order algorithms for deterministic VI (1) serve as a basis for the developments of their stochastic counterparts. These algorithms include the projection method [4], the proximal method [18, 32, 36], the extra-gradient method [13, 36], the optimistic gradient descent ascent (OGDA) method [30, 20, 21], the mirror-prox method [22], the extrapolation method [26, 24], and the extra-point method [6].

In this section, we shall focus on the developments of algorithms for stochastic VI, starting with a paper of Jiang and Xu [10], where the authors propose a stochastic projection method for solving strongly monotone and Lipschitz continuous VI problems and present an almost-sure convergence result. Koshal et al. [14] propose an iterative Tikhonov regularization method and an iterative proximal point method and show almost-sure convergence for monotone and Lipschitz continuous VI problems. Both methods solve a strongly monotone VI subproblem at each iteration. Yousefian et al. [39] further introduce a local smoothing technique to the above-mentioned regularized methods to account for non-Lipschitz mappings and show almost-sure convergence. A survey on these methods, as well as applications and the theory behind stochastic VI, can be found in Shanbhag [35].

Juditsky et al. [11] are among the first to show an iteration complexity bound for stochastic VI algorithms. They extend the mirror-prox method [22] to stochastic settings and prove an optimal iteration complexity bound for monotone VI: $\mathcal{O}(\frac{1}{\epsilon^{2}})$ , or $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ when the variance can be kept small enough. Yousefian et al. [41] further extend the stochastic mirror-prox method with a more general step size choice and show an $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$ iteration complexity; they also show an $\mathcal{O}(\frac{1}{\epsilon})$ complexity for the stochastic extra-gradient method for solving strongly monotone VI problems. Yousefian et al. [40] use a randomized smoothing technique for non-Lipschitz mappings and show an $\mathcal{O}\left(\frac{1}{\epsilon^{6}}\right)$ iteration complexity. Chen et al. [3] consider a specific class of VI model: a mapping that consists of a Lipschitz continuous and monotone operator, a Lipschitz continuous gradient mapping of a convex function, and a subgradient mapping of a simple convex function. They propose a method that combines Nesterov’s acceleration [25] with the stochastic mirror-prox method to exploit this special structure, resulting in an optimal iteration complexity for this class of problems: $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$ , or $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ when the variance can be kept small enough, or $\mathcal{O}\left(\sqrt{\frac{1}{\epsilon}}\right)$ when the operator consists only of gradient/subgradient mappings from some convex function. Kannan and Shanbhag [12] analyze a general variant of the extra-gradient method (which uses general distance-generating functions) and show that, under assumptions slightly weaker than strong monotonicity, the optimal $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ iteration bound still holds. Kotsalis et al. [15] extend the OGDA method to strongly monotone stochastic VI with iteration complexity $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ , or $\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ when the variance can be kept small enough.

We note that the detailed implementation of variance reduction is in general not considered in the above-mentioned methods (although some do present an additional complexity term when the variance is small, such as in [11, 3]). Therefore, the optimal iteration complexity bound is $\mathcal{O}\left(\frac{1}{\epsilon^{2}}\right)$ for monotone VI and $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ for strongly monotone VI, as compared to $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ and $\mathcal{O}(\kappa\ln(1/\epsilon))$ for their deterministic counterparts. By increasing the sample size (i.e., the mini-batch size) in each iteration, the variance can be reduced as the algorithm progresses, therefore attaining the same optimal iteration complexity bound as for the deterministic problems.

There have been developments for variance-reduction-based methods in recent years. Jalilzadeh and Shanbhag [9] extend the method [26] for deterministic strongly monotone VI to stochastic VI and show that with variance reduction the optimal iteration complexity $\mathcal{O}(\kappa\ln(1/\epsilon))$ can be achieved, together with a total sample complexity of $\mathcal{O}\left(\frac{1}{\epsilon^{\beta}}\right)$ for some constant $\beta>1$ . With this method as a subroutine, they also propose a variance-reduced proximal point method with iteration complexity $\mathcal{O}\left(\frac{1}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$ and sample complexity $\mathcal{O}(\frac{1}{\epsilon^{1+2\alpha\beta}})$ for some constants $\alpha,\beta>1$ . Iusem et al. [7] propose a variance-reduced extra-gradient-based method for monotone VI and show $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ iteration complexity and $\mathcal{O}\left(\frac{1}{\epsilon}\right)$ sample complexity. They further extend the method in [8] by incorporating a line search for an unknown Lipschitz constant, while preserving similar bounds. Palaniappan and Bach [28] propose variance-reduced stochastic forward-backward methods based on (accelerated) stochastic gradient descent methods in optimization and show $\mathcal{O}(\kappa\ln(1/\epsilon))$ iteration complexity. For another line of work, which includes the concept of differential privacy in stochastic VI, we refer the readers to a recent paper [2] and the references therein. The stochastic oracle may also be man-made. For instance, the technique of randomized smoothing has been applied in the so-called zeroth-order methods (i.e. derivative-free methods); see [27, 34] or the survey [16] in the context of optimization and [37, 38, 17, 33, 19] in the context of minimax saddle-point problems.

3 The Stochastic First-Order Methods for Strongly Monotone VI

Let us start this section by introducing the notations to facilitate our analysis. We shall denote the stochastic oracle as $\hat{F}(\cdot)$ , suppressing the notation $\xi$ whenever it is clear from the context. For example, $\hat{F}(z^{k})$ is associated with the random sample $\xi^{k}\in\Xi$ . In addition, we denote $P_{\mathcal{Z}}(\cdot)$ as the projection operator onto the feasible set $\mathcal{Z}$ .

3.1 The stochastic extra-point scheme

We first present the iterative updating rule for the stochastic extra-point scheme:

\displaystyle\left\{\begin{array}[]{lcl}z^{k+0.5}&:=&P_{\mathcal{Z}}\left(z^{k}+\beta(z^{k}-z^{k-1})-\eta\hat{F}(z^{k})\right),\\ z^{k+1}&:=&P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau(\hat{F}(z^{k})-\hat{F}(z^{k-1}))\right),\\ \end{array}\right.

(9)

for $k=0,1,...$ , where the sequence $\{z^{k}\mid k=0,1,...\}$ is the sequence of iterates, and $\{z^{k+0.5}\mid k=0,1,...\}$ is the sequence of extra points, which helps to produce the sequence of iterates.
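The update (9) can be sketched in a few lines of code. The block below is a minimal illustration under several assumptions that are not taken from the paper: a toy linear operator with solution $z^{*}=0$ , a box constraint set, a Gaussian noise oracle, the initialization $z^{-1}=z^{0}$ , reuse of the stored evaluation $\hat{F}(z^{k})$ in both lines of (9), and the parameter values that appear in Proposition 3.3.

```python
import random

MU, C = 1.0, 3.0                 # hypothetical toy constants
L = (MU ** 2 + C ** 2) ** 0.5    # Lipschitz constant of the toy operator below
KAPPA = L / MU


def F(z):
    """Toy strongly monotone operator with unique solution z* = 0."""
    x, y = z
    return [MU * x + C * y, -C * x + MU * y]


def project_box(z, r=1.0):
    """Euclidean projection onto the box [-r, r]^n (assumed constraint set)."""
    return [min(r, max(-r, c)) for c in z]


def oracle(z, rng, noise_std, batch=1):
    """Assumed oracle: F(z) plus Gaussian noise. Averaging `batch` i.i.d. Gaussian
    samples has the same distribution as one sample with std / sqrt(batch)."""
    scale = noise_std / batch ** 0.5
    return [f + rng.gauss(0.0, scale) for f in F(z)]


def extra_point(z0, iters, noise_std=0.0, batch=1, seed=0):
    rng = random.Random(seed)
    alpha = eta = 1.0 / (4.0 * L)
    beta = gamma = 1.0 / (64.0 * KAPPA)
    tau = 1.0 / (64.0 * L * KAPPA)
    z_prev, z = list(z0), list(z0)            # assumption: z^{-1} = z^0
    g_prev = oracle(z_prev, rng, noise_std, batch)
    for _ in range(iters):
        g = oracle(z, rng, noise_std, batch)
        mom = [a - b for a, b in zip(z, z_prev)]
        # First line of (9): the extra point z^{k+0.5}.
        z_half = project_box([zi + beta * mi - eta * gi
                              for zi, mi, gi in zip(z, mom, g)])
        g_half = oracle(z_half, rng, noise_std, batch)
        # Second line of (9): the new iterate z^{k+1}.
        z_next = project_box([zi - alpha * gh + gamma * mi - tau * (gi - gp)
                              for zi, gh, mi, gi, gp in zip(z, g_half, mom, g, g_prev)])
        z_prev, z, g_prev = z, z_next, g
    return z


print(extra_point([1.0, -1.0], iters=500))
print(extra_point([1.0, -1.0], iters=500, noise_std=0.1, batch=100))
```

Each iteration uses two projections and two fresh oracle calls (at $z^{k}$ and at $z^{k+0.5}$ ); the evaluation at $z^{k-1}$ is carried over from the previous iteration rather than recomputed.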

In the case of deterministic strongly monotone VI, we introduced in our previous work [6] a unifying extra-point updating scheme, which includes specific first-order search directions such as the extra-gradient, the heavy-ball method, the optimistic method, and Nesterov’s extrapolation; these are incorporated with the parameters $\alpha,\beta,\gamma,\eta,\tau\geq 0$ . As any specific configuration of these parameters should be tailored to the problem structure at hand, our goal is to provide conditions on the parameters under which an optimal iteration complexity can be guaranteed. This line of analysis will now be extended to solve stochastic VI as given in (9). We shall first establish the relational inequalities between subsequent iterates in terms of the expected distance to the unique solution $z^{*}$ , denoted by $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$ .

Lemma 3.1.

For the sequences $\{z^{k}\mid k=0,1,...\}$ and $\{z^{k+0.5}\mid k=0,1,...\}$ generated from the stochastic extra-point scheme (9), the following inequality holds:

\displaystyle\begin{array}[]{rl}(1-4|\gamma-\beta|-\tau L)d_{k+1}\leq&\left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\ &+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\ &+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].\end{array}

(10)

Proof.

See Appendix A.1. ∎

Lemma 3.1 forms the basis for the desired linear convergence, and it is possible to identify conditions on the parameters $\alpha,\beta,\gamma,\eta,\tau$ under which linear convergence is achieved. Consider parameters satisfying

\displaystyle\left\{\begin{array}[]{ll}\eta=\alpha,\\ 2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\leq 0,\end{array}\right.

(13)

and denote

\displaystyle\left\{\begin{array}[]{ll}t_{1}=\alpha\mu-4\gamma-6|\gamma-\beta|-4\tau L,\\ t_{2}=2|\gamma-\beta|+2\gamma+4\tau L,\\ t_{3}=4|\gamma-\beta|+\tau L.\end{array}\right.

(17)

Then we obtain from (10) that

\displaystyle(1-t_{3})d_{k+1}\leq\left(1-t_{1}\right)d_{k}+t_{2}d_{k-1}+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D.

(18)

With additional constraints on $t_{1},t_{2},t_{3}$ , the variance-reduced convergence result is summarized in the next theorem.

Theorem 3.2.

For non-negative parameters $\alpha,\beta,\gamma,\eta,\tau$ satisfying (13) and (17), suppose that

\displaystyle\left\{\begin{array}[]{ll}0\leq t_{3}<t_{1}<1,\\ t_{2}<t_{1}-t_{3}.\end{array}\right.

(21)

Let $q=\frac{2(1-t_{3})}{t_{1}-t_{2}-t_{3}}>1$ . For a fixed precision $\epsilon>0$ , denote $K=\mathcal{O}\left(q\cdot\ln\left(\frac{1}{\epsilon}\right)\right)$ . Then, we have

\displaystyle d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\mathcal{O}(\epsilon)+\mathcal{O}(\sigma^{2})+\mathcal{O}(\delta D).

(22)

Proof.

See Appendix A.2. ∎

Regarding Theorem 3.2, a few remarks are in order. First, as we remarked earlier, the boundedness condition (4) can be removed. However, the analysis would become much longer and more cumbersome; we keep the condition here for simplicity. Second, a common way to achieve variance reduction is to increase the mini-batch sample size. In fact, we may fix the sample size at the beginning to be of order $\left(1-\frac{1}{q}\right)^{-K}$ , or let it increase linearly at a rate $\left(1+\frac{1}{q}\right)$ as $k$ increases. We discuss this strategy further in Section 4. Finally, we note that without variance reduction, it is possible to adopt diminishing step sizes $(\alpha_{k},\beta_{k},\gamma_{k},\eta_{k},\tau_{k})$ instead of the fixed step sizes we have assumed so far. The optimal uniform sublinear convergence rate $\frac{1}{k}$ can be established through a separate analysis continued from Lemma 3.1. The details can be found in Appendix B.

The next proposition concludes this subsection with a specific choice of the parameters.

Proposition 3.3.

If one chooses the following parameters

\displaystyle(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right)

in (9) (thus $(t_{1},t_{2},t_{3})=\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)$ ) then it holds that

d_{k}\leq\left(1-\frac{1}{256\kappa}\right)^{k}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot 256\kappa.

Proof.

See Appendix A.3. ∎
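The parameter choice in Proposition 3.3 can be checked against (13), (17) and (21) with exact rational arithmetic. The sketch below does this for arbitrary positive $L\geq\mu$ ; the function name `extra_point_constants` and the sample values $L=5$ , $\mu=2$ are assumptions introduced for illustration.

```python
from fractions import Fraction


def extra_point_constants(L, mu):
    """Evaluate (13), (17) and q of Theorem 3.2 for the parameters of Proposition 3.3."""
    kappa = L / mu
    alpha = eta = 1 / (4 * L)
    beta = gamma = 1 / (64 * kappa)
    tau = 1 / (64 * L * kappa)
    gap = abs(gamma - beta)
    # Left-hand side of the second condition in (13); it should be <= 0.
    cond13 = 2 * alpha**2 * L**2 + 2 * gap + 2 * gamma + 2 * alpha * mu - 1
    # Definitions (17).
    t1 = alpha * mu - 4 * gamma - 6 * gap - 4 * tau * L
    t2 = 2 * gap + 2 * gamma + 4 * tau * L
    t3 = 4 * gap + tau * L
    q = 2 * (1 - t3) / (t1 - t2 - t3)
    return kappa, cond13, (t1, t2, t3), q


kappa, cond13, (t1, t2, t3), q = extra_point_constants(Fraction(5), Fraction(2))
print("t =", t1, t2, t3, " cond13 =", cond13, " q =", q)
# Conditions (13) and (21):
print(cond13 <= 0, 0 <= t3 < t1 < 1, t2 < t1 - t3)
```

For these parameters the quantities reduce to $(t_{1},t_{2},t_{3})=\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)$ , in agreement with the proposition, and $q=128\kappa-2$ .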

3.2 The stochastic extra-momentum scheme

In this subsection, we present an alternative stochastic first-order method that achieves the optimal iteration complexity as well, the stochastic extra-momentum scheme:

\displaystyle z^{k+1}:=P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right),

(23)

for $k=0,1,...$ .

Compared with the stochastic extra-point scheme (9), the above update (23) manipulates only the momentum terms alongside the stochastic gradient direction (the notion “gradient” here refers to the mapping $F(\cdot)$ in the VI model), namely the heavy-ball direction $z^{k}-z^{k-1}$ and the optimism direction $\hat{F}(z^{k})-\hat{F}(z^{k-1})$ . Since it maintains a single sequence $\{z^{k}\}$ throughout the iterations, this scheme requires one projection per iteration, as compared to two projections in the case of the stochastic extra-point scheme. We shall remark that the method proposed in Kotsalis et al. [15] only considers the optimism term. Therefore, the stochastic extra-momentum scheme introduced above may be viewed as a generalization.
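A minimal sketch of the update (23) follows. As before, the toy operator, the box constraint, the Gaussian noise oracle and the initialization $z^{-1}=z^{0}$ are assumptions made for illustration; the parameter values are those of Proposition 3.6 with $\theta=\frac{1}{8}$ .

```python
import random

MU, C = 1.0, 3.0                 # hypothetical toy constants
L = (MU ** 2 + C ** 2) ** 0.5    # Lipschitz constant of the toy operator below
KAPPA = L / MU
THETA = 1.0 / 8.0


def F(z):
    """Toy strongly monotone operator with unique solution z* = 0."""
    x, y = z
    return [MU * x + C * y, -C * x + MU * y]


def project_box(z, r=1.0):
    """Euclidean projection onto the box [-r, r]^n (assumed constraint set)."""
    return [min(r, max(-r, c)) for c in z]


def oracle(z, rng, noise_std):
    """Assumed oracle: F(z) plus zero-mean Gaussian noise."""
    return [f + rng.gauss(0.0, noise_std) for f in F(z)]


def extra_momentum(z0, iters, noise_std=0.0, seed=0):
    rng = random.Random(seed)
    alpha = 1.0 / (4.0 * L)
    tau = alpha / (1.0 + THETA / KAPPA)
    gamma = 1.0 / (8.0 * (KAPPA + THETA))
    z_prev, z = list(z0), list(z0)        # assumption: z^{-1} = z^0
    g_prev = oracle(z_prev, rng, noise_std)
    for _ in range(iters):
        g = oracle(z, rng, noise_std)     # the only oracle call in this iteration
        # Gradient step + heavy-ball term + optimism term, then one projection.
        step = [zi - alpha * gi + gamma * (zi - zp) - tau * (gi - gp)
                for zi, zp, gi, gp in zip(z, z_prev, g, g_prev)]
        z_prev, z, g_prev = z, project_box(step), g
    return z


print(extra_momentum([1.0, -1.0], iters=300))
print(extra_momentum([1.0, -1.0], iters=300, noise_std=0.05))
```

In contrast to the extra-point sketch, each iteration here needs one oracle call and one projection, since $\hat{F}(z^{k-1})$ is stored from the previous iteration.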

As in the previous subsection, we shall first establish a relational inequality between the iterates. As we can see from the lemma below, the structure of this relational inequality is in fact quite different from the previous case. The detailed proof can be found in the appendix.

Lemma 3.4.

For the sequence $\{z^{k}\mid k=0,1,...\}$ generated from the stochastic extra-momentum scheme (23), the following inequality holds:

\displaystyle\begin{array}[]{rl}&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\ \leq&\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\ &+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}.\end{array}

(24)

Proof.

See Appendix A.4. ∎

Observe that each of the terms on the LHS of (24) differs in the iteration index from the RHS exactly by one. This property enables us to design a possible potential function that measures the convergence of the iterative process. We shall specify additional conditions on the non-negative parameters $\alpha,\gamma,\tau$ in order to further simplify (24):

\displaystyle 1+\alpha\mu-\gamma\geq 1+\frac{\theta}{\kappa},\quad\frac{\alpha}{\tau}=1+\frac{\theta}{\kappa},\quad\frac{1}{8\tau^{2}L^{2}+2\gamma}\geq 1+\frac{\theta}{\kappa},

(25)

where $\theta\in(0,1]$ is some constant independent of $\kappa$ . Note that the LHS of each inequality in (25) is the ratio between the coefficients on the LHS and RHS of (24) for each corresponding term. Therefore, the relation (24) can be rearranged as:

\displaystyle\begin{array}[]{rl}&\left(1+\frac{\theta}{\kappa}\right)\mathbb{E}\left[\frac{1}{2}\|z^{k+1}-z^{*}\|^{2}+\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k+1}-z^{k}\|^{2}\right]\\ \leq&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\ \leq&\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\ &+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}.\end{array}

(26)

Now, by defining the potential function $V_{k}$ as

V_{k}=\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right],

inequality (26) can be rewritten as

\displaystyle\left(1+\frac{\theta}{\kappa}\right)V_{k+1}\leq V_{k}+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}.

(27)
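To see how (27) produces a linear rate plus a noise floor, one can unroll the recursion. The following is only a sketch of the geometric-sum step; the shorthand $r$ and $c$ is introduced here for illustration and is not used elsewhere in the paper.

```latex
% Shorthand (for this sketch only):
%   r := (1 + \theta/\kappa)^{-1} \in (0,1),
%   c := 8\tau^{2}\sigma^{2} + \frac{\alpha\delta^{2}}{2\mu} \geq 0.
\begin{aligned}
V_{k+1} &\leq r\,V_{k} + r\,c \\
\Longrightarrow\quad
V_{k} &\leq r^{k}V_{0} + c\sum_{j=1}^{k} r^{j}
  \leq r^{k}V_{0} + \frac{r}{1-r}\,c \\
 &= \left(1+\frac{\theta}{\kappa}\right)^{-k}V_{0}
  + \frac{\kappa}{\theta}\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right),
\end{aligned}
```

where the last equality uses $\frac{r}{1-r}=\frac{\kappa}{\theta}$ . Passing from $V_{k}$ to $\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$ requires additional bounds on the cross term in $V_{k}$ , which this sketch does not attempt.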

This leads to our final results, as summarized in the next theorem:

Theorem 3.5.

Suppose that the non-negative parameters $\alpha,\gamma,\tau$ satisfy (25) for some constant $\theta\in(0,1]$ . For the sequence $\{z^{k}\mid k=0,1,...\}$ generated from the stochastic extra-momentum scheme (23), the expected distance to the solution of iterate $z^{k}$ is bounded by:

\displaystyle\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]\leq 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\left(\frac{\kappa}{\theta}+1\right)\cdot 32\tau^{2}\sigma^{2}+\frac{2\kappa\alpha\delta^{2}}{\theta\mu}.

(28)

Proof.

See Appendix A.5. ∎

A simple choice of parameters leads to:

Proposition 3.6.

If we choose the parameters as

(\alpha,\tau,\gamma)=\left(\frac{1}{4L},\frac{\alpha}{1+\frac{\theta}{\kappa}},\frac{1}{8(\kappa+\theta)}\right),\quad\theta=\frac{1}{8},

then scheme (23) assures that

\displaystyle d_{k}\leq 2\left(1-\frac{1}{8\kappa+1}\right)^{k}\|z^{0}-z^{*}\|^{2}+\frac{128\sigma^{2}}{\mu(8L+\mu)}+\frac{4\delta^{2}}{\mu^{2}}.

(29)

Proof.

It follows by substituting the parameter choice into (28). ∎

This shows that if we run the stochastic extra-momentum scheme (23) with the above parameter choice, then in $K=\mathcal{O}(\kappa\ln\frac{1}{\epsilon})$ iterations we will reach a solution satisfying

\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\mathcal{O}(\epsilon)+\mathcal{O}\left(\frac{\sigma^{2}}{\mu L}\right)+\mathcal{O}\left(\frac{\delta^{2}}{\mu^{2}}\right).
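The substitution behind Proposition 3.6 can be verified with exact rational arithmetic. The sketch below checks the three conditions in (25) and evaluates the rate and the noise coefficients of (28) for the stated parameters; the function name `extra_momentum_constants` and the sample values $L=5$ , $\mu=2$ are assumptions introduced for illustration.

```python
from fractions import Fraction


def extra_momentum_constants(L, mu, theta=Fraction(1, 8)):
    """Check (25) and evaluate the constants of (28) for the choice in Proposition 3.6."""
    kappa = L / mu
    alpha = 1 / (4 * L)
    tau = alpha / (1 + theta / kappa)
    gamma = 1 / (8 * (kappa + theta))
    target = 1 + theta / kappa
    conditions = (
        1 + alpha * mu - gamma >= target,
        alpha / tau == target,
        1 / (8 * tau**2 * L**2 + 2 * gamma) >= target,
    )
    rate = 1 / target                              # contraction factor in (28)
    var_coeff = (kappa / theta + 1) * 32 * tau**2  # multiplies sigma^2 in (28)
    bias_coeff = 2 * kappa * alpha / (theta * mu)  # multiplies delta^2 in (28)
    return conditions, rate, var_coeff, bias_coeff


L, mu = Fraction(5), Fraction(2)
conditions, rate, var_coeff, bias_coeff = extra_momentum_constants(L, mu)
print(conditions)
print(rate == 1 - 1 / (8 * L / mu + 1))
print(var_coeff == 128 / (mu * (8 * L + mu)), bias_coeff == 4 / mu**2)
```

The three comparisons printed at the end correspond to the contraction factor and the two noise terms in (29).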

4 A Stochastic Zeroth-Order Approach to Saddle-Point Problems

In this section, we shall apply the proposed stochastic extra-point/extra-momentum scheme to solve the following saddle-point problem without needing to compute the gradients of $f$ :

\displaystyle\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}f(x,y),

(30)

where $\mathcal{X}\subseteq\mathbb{R}^{n}$ , $\mathcal{Y}\subseteq\mathbb{R}^{m}$ are convex sets, $f(x,y)$ is strongly convex (with fixed $y$ ) and strongly concave (with fixed $x$ ) with modulus $\mu$ , and the partial gradients $\nabla_{x}f(x,y)$ / $\nabla_{y}f(x,y)$ are Lipschitz continuous with constant $L_{x}/L_{y}$ for fixed $y/x$ , and with constant $L_{xy}$ with fixed $x/y$ . We let $L=2\cdot\max(L_{x},L_{y},L_{xy})$ . We further assume that the function $f(x,y)$ is Lipschitz continuous for either fixed $x$ or $y$ with constant $M$ . This implies that the norms of the partial gradients are bounded by $M$ :

\|\nabla_{x}f(x,y)\|\leq M,\quad\|\nabla_{y}f(x,y)\|\leq M.

In particular, we consider the settings when the partial gradients $\nabla_{x}f(x,y)$ and $\nabla_{y}f(x,y)$ (and any higher-order information) are not available. Furthermore, the exact evaluation of the function value itself is also not available; instead, we can only access a stochastic oracle $\hat{f}(x,y,\xi)$ , which satisfies the following assumption:

\displaystyle\left\{\begin{array}[]{l}\mathbb{E}\left[\hat{f}(x,y,\xi)\right]=f(x,y),\\ \\ \mathbb{E}\left[\nabla_{x}\hat{f}(x,y,\xi)\right]=\nabla_{x}f(x,y),\\ \\ \mathbb{E}\left[\nabla_{y}\hat{f}(x,y,\xi)\right]=\nabla_{y}f(x,y),\\ \\ \mathbb{E}\left[\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\leq\sigma^{2},\\ \\ \mathbb{E}\left[\|\nabla_{y}\hat{f}(x,y,\xi)-\nabla_{y}f(x,y)\|^{2}\right]\leq\sigma^{2}.\end{array}\right.

(40)

Now, we shall use the so-called smoothing technique to approximate the first-order information, which then enables us to apply the proposed stochastic methods for VI, which includes the saddle-point model as a special case. In particular, we use a randomized smoothing scheme using uniform distributions $U_{b}$ / $V_{b}$ over the unit Euclidean ball $B$ in the $\mathbb{R}^{n}$ / $\mathbb{R}^{m}$ space, respectively. The smoothing functions with parameters $\rho_{x},\rho_{y}>0$ are defined as follows:

\displaystyle\begin{array}[]{rcl}f_{\rho_{x}}(x,y)&:=&\mathbb{E}_{u\sim U_{b}}\left[f(x+\rho_{x}u,y)\right]=\frac{1}{\alpha(n)}\int_{B}f(x+\rho_{x}u,y)du,\\ f_{\rho_{y}}(x,y)&:=&\mathbb{E}_{v\sim V_{b}}\left[f(x,y+\rho_{y}v)\right]=\frac{1}{\alpha(m)}\int_{B}f(x,y+\rho_{y}v)dv,\end{array}

where $\alpha(n)$ / $\alpha(m)$ is the volume of the unit ball in $\mathbb{R}^{n}$ / $\mathbb{R}^{m}$ .

Let us summarize main properties of the smoothing functions $f_{\rho_{x}},\,f_{\rho_{y}}$ below:

Lemma 4.1.

Let $U_{S_{p}}/V_{S_{p}}$ be the uniform distribution on the unit sphere $S_{p}$ in $\mathbb{R}^{n}/\mathbb{R}^{m}$ . Then,

1. The smoothing functions $f_{\rho_{x}},f_{\rho_{y}}$ are continuously differentiable, and their partial gradients $\nabla_{x}f_{\rho_{x}},\nabla_{y}f_{\rho_{y}}$ can be expressed as:

\displaystyle\begin{array}[]{rcl}\nabla_{x}f_{\rho_{x}}(x,y)&:=&\mathbb{E}_{u\sim U_{S_{p}}}\left[\frac{n}{\rho_{x}}f(x+\rho_{x}u,y)u\right]=\frac{1}{\beta(n)}\int_{u\in S_{p}}\frac{n}{\rho_{x}}\left(f(x+\rho_{x}u,y)-f(x,y)\right)udu,\\ \nabla_{y}f_{\rho_{y}}(x,y)&:=&\mathbb{E}_{v\sim V_{S_{p}}}\left[\frac{m}{\rho_{y}}f(x,y+\rho_{y}v)v\right]=\frac{1}{\beta(m)}\int_{v\in S_{p}}\frac{m}{\rho_{y}}\left(f(x,y+\rho_{y}v)-f(x,y)\right)vdv,\end{array}

where $\beta(n)/\beta(m)$ is the surface area of the unit sphere in $\mathbb{R}^{n}/\mathbb{R}^{m}$ .

2. For any $x\in\mathbb{R}^{n}$ and any $y\in\mathbb{R}^{m}$ , we have:

\displaystyle\left\{\begin{array}[]{l}\|\nabla_{x}f_{\rho_{x}}(x,y)-\nabla_{x}f(x,y)\|\leq\frac{\rho_{x}nL}{2},\\ \\ \|\nabla_{y}f_{\rho_{y}}(x,y)-\nabla_{y}f(x,y)\|\leq\frac{\rho_{y}mL}{2},\end{array}\right.

(44)

\displaystyle\left\{\begin{array}[]{l}\mathbb{E}_{u\sim U_{S_{p}}}\left[\left\|\frac{n}{\rho_{x}}\left(f(x+\rho_{x}u,y)-f(x,y)\right)u\right\|^{2}\right]\leq 2n\|\nabla_{x}f(x,y)\|^{2}+\frac{\rho_{x}^{2}L^{2}n^{2}}{2},\\ \\ \mathbb{E}_{v\sim V_{S_{p}}}\left[\left\|\frac{m}{\rho_{y}}\left(f(x,y+\rho_{y}v)-f(x,y)\right)v\right\|^{2}\right]\leq 2m\|\nabla_{y}f(x,y)\|^{2}+\frac{\rho_{y}^{2}L^{2}m^{2}}{2}.\end{array}\right.

(48)

Proof.

For Statement 1, cf. [34] (Lemma 4.4), and for Statement 2, cf. [5] (proofs for Propositions 2.7.5 and 2.7.6). Note that the proofs for the minimax function follow simply by fixing one of the two variables. ∎

We are now ready to define the stochastic zeroth-order gradient as follows:

\displaystyle\left\{\begin{array}[]{l}F_{\rho_{x}}(x,y,\xi,u):=\frac{n}{\rho_{x}}\left(\hat{f}(x+\rho_{x}u,y,\xi)-\hat{f}(x,y,\xi)\right)u,\\ F_{\rho_{y}}(x,y,\xi,v):=\frac{m}{\rho_{y}}\left(\hat{f}(x,y+\rho_{y}v,\xi)-\hat{f}(x,y,\xi)\right)v,\end{array}\right.

(51)

where $u$ and $v$ are the uniformly distributed random vectors over the unit spheres in $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$ respectively.
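The estimators in (51) only need two evaluations of the noisy function value per block of variables. The sketch below implements them for a hypothetical oracle; the quadratic test function, the linear noise model inside `f_hat`, and the helper names are assumptions for illustration. Uniform samples on the unit sphere are drawn by normalizing a standard Gaussian vector.

```python
import math
import random


def sample_sphere(dim, rng):
    """Uniform draw from the unit sphere, obtained by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:
            return [c / norm for c in g]


def f_hat(x, y, xi):
    """Hypothetical noisy oracle for f(x, y) = |x|^2/2 + x.y - |y|^2/2 (needs n == m).

    The noise enters as a random linear term xi.(x, y), so f_hat is unbiased
    and its partial gradients are unbiased with bounded variance.
    """
    f = (0.5 * sum(a * a for a in x) + sum(a * b for a, b in zip(x, y))
         - 0.5 * sum(b * b for b in y))
    return f + sum(e * c for e, c in zip(xi, x + y))


def zo_grad_x(x, y, xi, rho_x, rng):
    """F_{rho_x}(x, y, xi, u) from (51), with u drawn inside the function."""
    n = len(x)
    u = sample_sphere(n, rng)
    shifted = [a + rho_x * b for a, b in zip(x, u)]
    diff = f_hat(shifted, y, xi) - f_hat(x, y, xi)   # same xi in both evaluations
    return [n / rho_x * diff * b for b in u]


def zo_grad_y(x, y, xi, rho_y, rng):
    """F_{rho_y}(x, y, xi, v) from (51), with v drawn inside the function."""
    m = len(y)
    v = sample_sphere(m, rng)
    shifted = [a + rho_y * b for a, b in zip(y, v)]
    diff = f_hat(x, shifted, xi) - f_hat(x, y, xi)
    return [m / rho_y * diff * b for b in v]


rng = random.Random(0)
x, y, rho, samples = [1.0, 2.0], [0.5, -1.0], 1e-3, 20000
avg = [0.0, 0.0]
for _ in range(samples):
    xi = [rng.gauss(0.0, 0.1) for _ in range(4)]
    g = zo_grad_x(x, y, xi, rho, rng)
    avg = [a + gi / samples for a, gi in zip(avg, g)]
print(avg)   # should be close to the exact partial gradient x + y of the toy f
```

A single estimate is very noisy even when the oracle itself is exact, because of the random direction; only its average approximates the gradient of the smoothed function.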

The next lemma shows that such stochastic zeroth-order gradients are unbiased with respect to the gradients of the smoothing functions and have uniformly bounded variance.

Lemma 4.2.

The stochastic zeroth-order gradients defined in (51) are unbiased and have bounded variance for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$ :

\displaystyle\left\{\begin{array}[]{l}\mathbb{E}_{\xi,u}\left[F_{\rho_{x}}(x,y,\xi,u)\right]=\nabla_{x}f_{\rho_{x}}(x,y),\\ \\ \mathbb{E}_{\xi,v}\left[F_{\rho_{y}}(x,y,\xi,v)\right]=\nabla_{y}f_{\rho_{y}}(x,y),\end{array}\right.

(55)

and

\displaystyle\left\{\begin{array}[]{l}\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)-\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]\leq\tilde{\sigma}^{2},\\ \\ \mathbb{E}_{\xi,v}\left[\|F_{\rho_{y}}(x,y,\xi,v)-\nabla_{y}f_{\rho_{y}}(x,y)\|^{2}\right]\leq\tilde{\sigma}^{2},\end{array}\right.

(59)

where

\tilde{\sigma}^{2}=2\cdot\max\left\{nM^{2}+n\sigma^{2}+n^{2}\rho_{x}^{2}L^{2},\,mM^{2}+m\sigma^{2}+m^{2}\rho_{y}^{2}L^{2}\right\}.

Proof.

See Appendix A.6. ∎

Before applying the stochastic extra-point/extra-momentum scheme to solve (30), let us first introduce the connections between these two models. As we regard the saddle-point model as a special case of VI, we shall treat the variables $x,y$ in the saddle-point problem as one variable and denote $\mathcal{Z}=\mathcal{X}\times\mathcal{Y},z=(x,y)$ . Additionally, we define:

\displaystyle F(z):=\begin{pmatrix}\nabla_{x}f(x,y)\\ -\nabla_{y}f(x,y)\end{pmatrix},\quad F_{\rho}(z):=\begin{pmatrix}\nabla_{x}f_{\rho_{x}}(x,y)\\ -\nabla_{y}f_{\rho_{y}}(x,y)\end{pmatrix},\quad\hat{F}_{\rho}(z,u,v,\xi):=\begin{pmatrix}F_{\rho_{x}}(x,y,\xi,u)\\ -F_{\rho_{y}}(x,y,\xi,v)\end{pmatrix}.

These terms correspond to the gradient of $f(x,y)$ , the gradient of the smoothing functions $f_{\rho_{x}}(x,y)$ and $f_{\rho_{y}}(x,y)$ , and the stochastic zeroth-order gradient, respectively. Note that we have flipped the sign of the partial gradient corresponding to $y$ to account for the concavity of $f$ with respect to $y$ .

Finally, as we shall use a sample size of $t_{k}\in\mathbb{N}$ (a natural number) at iteration $k$ , we reserve the subscripts for the random vectors $\xi,u,v$ for the sample index $(i),\,i=1,...,t_{k}$ , and denote:

\hat{F}^{k}_{\rho}(z^{k})=\frac{1}{t_{k}}\sum\limits_{i=1}^{t_{k}}\hat{F}_{\rho}(z^{k},u^{k}_{(i)},v^{k}_{(i)},\xi^{k}_{(i)}).

In the above definition we suppress the notation of the random vectors $u,v,\xi$ on the LHS for cleaner presentation. Note that by the law of large numbers, together with (55)-(59), we have

\mathbb{E}\left[\hat{F}^{k}_{\rho}(z^{k})\right]=F_{\rho}(z^{k}),

(60)

\mathbb{E}\left[\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})\|^{2}\right]\leq\frac{2\tilde{\sigma}^{2}}{t_{k}}.

(61)
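To see where the factor $1/t_{k}$ in (61) comes from, the following short derivation may help. It assumes (our reading, not spelled out in this section) that, conditional on $z^{k}$, the samples $(\xi^{k}_{(i)},u^{k}_{(i)},v^{k}_{(i)})$ are independent and identically distributed, so that the cross terms vanish by (55). The symbol $e_{i}$ is introduced here for illustration only, and all expectations are conditional on $z^{k}$.

```latex
e_{i}:=\hat{F}_{\rho}(z^{k},u^{k}_{(i)},v^{k}_{(i)},\xi^{k}_{(i)})-F_{\rho}(z^{k}),
\qquad
\mathbb{E}\left[\Big\|\frac{1}{t_{k}}\sum_{i=1}^{t_{k}}e_{i}\Big\|^{2}\right]
=\frac{1}{t_{k}^{2}}\sum_{i=1}^{t_{k}}\mathbb{E}\left[\|e_{i}\|^{2}\right]
\leq\frac{1}{t_{k}^{2}}\cdot t_{k}\cdot\left(\tilde{\sigma}^{2}+\tilde{\sigma}^{2}\right)
=\frac{2\tilde{\sigma}^{2}}{t_{k}}.
```

The two copies of $\tilde{\sigma}^{2}$ account for the $x$-block and the $y$-block bounds in (59).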

4.1 Sample complexity analysis: stochastic zeroth-order extra-point method

Recall that our objective is

\min\limits_{x\in\mathcal{X}}\max\limits_{y\in\mathcal{Y}}f(x,y).

With only noisy function value $\hat{f}(x,y,\xi)$ accessible, we propose the stochastic zeroth-order extra-point method that updates $(x,y):=z$ simultaneously with the following update rule:

\displaystyle\left\{\begin{array}[]{lcl}z^{k+0.5}&:=&P_{\mathcal{Z}}\left(z^{k}+\beta(z^{k}-z^{k-1})-\eta\hat{F}^{k}_{\rho}(z^{k})\right),\\ z^{k+1}&:=&P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)\right).\\ \end{array}\right.

(64)

Compared with its original variant in (9) for solving stochastic VI, the update direction $\hat{F}(z^{k})$ is replaced by the averaged stochastic zeroth-order gradient $\hat{F}^{k}_{\rho}(z^{k})$ with sample size $t_{k}$ (similarly, $\hat{F}(z^{k+0.5})$ is replaced by $\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})$ with sample size $t_{k+0.5}$ ). This circumvents the inaccessible first-order information and equips us with appropriate tools to reduce the variance and achieve overall linear convergence.
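A minimal Python sketch of one iteration of (64) is given below. The callables `oracle` (returning a fresh averaged estimate at a given point) and `project` (the Euclidean projection onto $\mathcal{Z}$) are hypothetical placeholders that the caller must supply; the function name and argument order are likewise assumptions of this sketch.

```python
import numpy as np

def extra_point_step(z, z_prev, g, g_prev, oracle, project,
                     alpha, beta, gamma, eta, tau):
    """One iteration of the extra-point update (64).

    g       : averaged estimate at z^k
    g_prev  : averaged estimate at z^{k-1}
    oracle  : hypothetical callable, fresh averaged estimate at a point
    project : hypothetical callable, Euclidean projection onto Z
    """
    # Extra point: heavy-ball term plus a step along the current estimate.
    z_half = project(z + beta * (z - z_prev) - eta * g)
    # Second batch of samples, drawn at the extra point.
    g_half = oracle(z_half)
    # Main update: extra-point direction, momentum, and optimism correction.
    z_next = project(z - alpha * g_half + gamma * (z - z_prev)
                     - tau * (g - g_prev))
    return z_next, z_half
```

The caller keeps `z`, `z_prev`, `g`, and `g_prev` between iterations, so each iteration needs exactly two new batches of samples (at $z^{k}$ and at $z^{k+0.5}$).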

The next lemma establishes the relational inequality between the subsequent iterates in terms of the expected distance to the solution $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$ , similar to what we did in Section 3.1. The differences lie in the corresponding stochastic error terms shown below. Note that in each iteration we take two batches of samples $t_{k}$ and $t_{k+0.5}$ . The batch size $t_{k-1}$ also appears because the iterate $z^{k-1}$ is used in each iteration.

Lemma 4.3.

For the sequences $\{z^{k}\mid k=0,1,...\}$ and $\{z^{k+0.5}\mid k=0,1,...\}$ generated from the stochastic zeroth-order extra-point method (64), the following inequality holds:

\begin{aligned}
&(1-4|\gamma-\beta|-\tau L)d_{k+1}\\
\leq{}&\left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}\\
&+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\
&+\left(\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&+16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right)+\alpha LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}\\
&+4(\tau L+\alpha^{2}L^{2})(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})\\
&+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k})\right].
\end{aligned}

(65)

Proof.

See Appendix A.7. ∎

With the relational inequality in Lemma 4.3, we shall adopt the same conditions: (13), (17), and (21) for the parameters $\alpha,\beta,\gamma,\eta,\tau$ . Therefore, the results in Theorem 3.2 are directly applicable. In addition, we are now equipped with the variable sample size $t_{k}$ / $t_{k+0.5}$ to control the variance terms, as well as the smoothing parameters $\rho_{x},\rho_{y}$ to control the bias terms

\alpha LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+4(\tau L+\alpha^{2}L^{2})(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}).

We shall utilize the example in Proposition 3.3 to analyze the sample complexity of the proposed method. The result is provided in the next proposition:

Proposition 4.4 (Sample complexity result 1).

The stochastic zeroth-order extra-point method (64) with the following parameter choice:

\displaystyle(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right)

and

t_{k+0.5}=t_{k}=\Bigg{\lceil}K\left(1-\frac{1}{256\kappa}\right)^{-(k+1)}\Bigg{\rceil},\quad\rho_{x}=\frac{1}{\sqrt{2}n\kappa}\left(1-\frac{1}{256\kappa}\right)^{K},\quad\rho_{y}=\frac{1}{\sqrt{2}m\kappa}\left(1-\frac{1}{256\kappa}\right)^{K},

where $K$ is the iteration count decided in advance, outputs $z^{K}$ such that $d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\epsilon$ , with $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ , and the total sample complexity of the procedure is

\sum\limits_{k=0}^{K-1}(t_{k}+t_{k+0.5})=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right).

Proof.

See Appendix A.8. ∎
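The parameter choice in Proposition 4.4 can be tabulated with a few lines of Python. This is only a sketch: the function names are ours, and $\kappa$, $K$, $n$, $m$ are taken as given inputs.

```python
import math

def extra_point_schedule(kappa, K, n, m):
    """Batch sizes t_k (= t_{k+0.5}) and smoothing radii from Proposition 4.4."""
    q = 1.0 - 1.0 / (256.0 * kappa)
    # t_k = ceil(K * q^{-(k+1)}) for k = 0, ..., K-1 (geometric growth).
    t = [math.ceil(K * q ** (-(k + 1))) for k in range(K)]
    rho_x = q ** K / (math.sqrt(2.0) * n * kappa)
    rho_y = q ** K / (math.sqrt(2.0) * m * kappa)
    return t, rho_x, rho_y

def total_samples(t):
    """Total sample count: two batches of equal size per iteration."""
    return 2 * sum(t)
```

For example, `extra_point_schedule(kappa=161.0, K=1485, n=10, m=20)` returns the batch sizes this rule would prescribe for the dimensions used in Section 5; note that the experiments there use the simpler choice $t_{k}=k$ instead.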

4.2 Sample complexity: stochastic zeroth-order extra-momentum method

Next we consider the stochastic zeroth-order extra-momentum method, which requires only one projection per iteration:

\displaystyle z^{k+1}:=P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}^{k}_{\rho}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)\right).

(66)
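The single-sequence structure of (66) can be sketched in Python as follows. As before, `oracle` and `project` are hypothetical callables supplied by the user, and initializing the previous estimate at $z^{-1}=z^{0}$ is an assumption of this sketch rather than a prescription from the paper.

```python
import numpy as np

def extra_momentum(z0, oracle, project, alpha, gamma, tau, num_iters):
    """Run num_iters iterations of the extra-momentum update (66).

    oracle  : hypothetical callable, averaged estimate at a point
    project : hypothetical callable, Euclidean projection onto Z
    """
    z_prev = np.array(z0, dtype=float)
    z = z_prev.copy()
    g_prev = oracle(z_prev)  # assumption: estimate at z^{-1} = z^0
    for _ in range(num_iters):
        g = oracle(z)  # the only new batch in this iteration
        # One projection: gradient step, heavy-ball term, optimism term.
        z_next = project(z - alpha * g + gamma * (z - z_prev)
                         - tau * (g - g_prev))
        z_prev, z, g_prev = z, z_next, g
    return z
```

Only one oracle call and one projection occur per iteration, in contrast to the two of each needed by (64).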

The relational inequality, similar to Lemma 3.4, is established in the next lemma:

Lemma 4.5.

For the sequence $\{z^{k}\mid k=0,1,...\}$ generated from the stochastic zeroth-order extra-momentum method (66), the following inequality holds:

\begin{aligned}
&\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+1}_{\rho}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]\\
\leq{}&\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]\\
&+16\tau^{2}\sigma^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}).
\end{aligned}

(67)

Proof.

See Appendix A.9. ∎

With the same condition as in (25) for the parameters $\alpha,\gamma,\tau$ , we can derive a bound similar to (26) (with $\hat{F}(z^{k})$ replaced by $\hat{F}^{k}_{\rho}(z^{k})$ and with the new stochastic error expression) and define the potential function:

V_{k}=\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right].

Therefore, the following inequality holds:

\displaystyle\left(1+\frac{\theta}{\kappa}\right)V_{k+1}\leq V_{k}+16\tau^{2}\sigma^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}),

(68)

and we can apply the results directly from Theorem 3.5. In addition, by increasing the sample sizes $t_{k}$ and choosing the smoothing parameters $\rho_{x},\rho_{y}$ appropriately, we are able to control the variance and the bias terms in the above inequality. We give the sample complexity results in the next proposition.

Proposition 4.6 (Sample complexity result 2).

The stochastic zeroth-order extra-momentum method (66) with the following parameter choice:

(\alpha,\tau,\gamma)=\left(\frac{1}{4L},\frac{\alpha}{1+\frac{\theta}{\kappa}},\frac{1}{8(\kappa+\theta)}\right),\quad\theta=\frac{1}{8},

and

t_{k}=\Bigg{\lceil}K\left(1-\frac{1}{8\kappa+1}\right)^{-k}\Bigg{\rceil},\quad\rho_{x}=\frac{1}{\sqrt{2}n\kappa}\left(1-\frac{1}{8\kappa+1}\right)^{\frac{K}{2}},\quad\rho_{y}=\frac{1}{\sqrt{2}m\kappa}\left(1-\frac{1}{8\kappa+1}\right)^{\frac{K}{2}},

where $K$ is the iteration count decided in advance, outputs $z^{K}$ such that $d_{K}=\mathbb{E}\left[\|z^{K}-z^{*}\|^{2}\right]\leq\epsilon$ , with $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ and the total sample complexity of the procedure is

\sum\limits_{k=0}^{K-1}t_{k}=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right).

Proof.

See Appendix A.10. ∎
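The order of the total sample count can be motivated by a geometric-sum estimate. The following is a heuristic sketch only (the precise argument is the one in Appendix A.10); the shorthand $q$ is introduced here for illustration, and we use $\lceil x\rceil\leq x+1$.

```latex
q:=1-\frac{1}{8\kappa+1},\qquad \frac{1}{q}-1=\frac{1}{8\kappa},\qquad
\sum_{k=0}^{K-1}t_{k}\leq K+K\sum_{k=0}^{K-1}q^{-k}
=K+K\cdot\frac{q^{-K}-1}{q^{-1}-1}
\leq K+8\kappa K q^{-K}.
```

If $K$ is chosen so that $q^{K}$ is of order $\epsilon$, which is consistent with $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$, the right-hand side is of order $\frac{\kappa K}{\epsilon}$, that is, $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$.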

We remark that the sample complexity of VS-Ave proposed in [9] for solving stochastic strongly monotone VI is given by $\mathcal{O}\left(\left(\frac{\kappa^{2}}{\epsilon}\right)^{\beta}\right)$ for some $\beta>1$ , and its limiting case ( $\beta\rightarrow 1$ ) differs from our results given in Propositions 4.4 and 4.6 only by a factor of $\mathcal{O}\left(\ln\left(\frac{1}{\epsilon}\right)\right)$ .

5 Numerical Experiments

In this section, we conduct an experiment that models a regularized two-player zero-sum matrix game with some uncertain payoff matrix. In particular, the payoff matrix $A_{\xi}$ is randomly distributed and can only be sampled for each (mixed) strategy. The problem can be formulated as follows:

\begin{aligned}
\min\limits_{x\in\mathbb{R}^{n}}\max\limits_{y\in\mathbb{R}^{m}}\quad&f(x,y)=\mathbb{E}\left[\frac{\lambda_{x}}{2}\|x\|^{2}+x^{\top}A_{\xi}y-\frac{\lambda_{y}}{2}\|y\|^{2}\right]\\
\text{s.t.}\quad&\sum\limits_{i=1}^{n}x_{i}=1,\quad\sum\limits_{j=1}^{m}y_{j}=1,\\
&x,y\geq 0.
\end{aligned}

(69)
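The feasible set in (69) is a product of two probability simplices, so the projection $P_{\mathcal{Z}}$ splits into two independent simplex projections. The paper does not state how the projection is computed; the sketch below uses the standard sort-based method as one possible choice, and the function names are ours.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection of w onto {p : p >= 0, sum(p) = 1}."""
    s = np.sort(w)[::-1]                # entries in decreasing order
    css = np.cumsum(s) - 1.0            # shifted cumulative sums
    idx = np.arange(1, w.size + 1)
    k = idx[s - css / idx > 0][-1]      # largest index with a positive gap
    theta = css[k - 1] / k              # shift that enforces sum(p) = 1
    return np.maximum(w - theta, 0.0)

def project_Z(z, n):
    """Project z = (x, y) onto the product of two simplices; x has length n."""
    return np.concatenate([project_simplex(z[:n]), project_simplex(z[n:])])
```

The cost is dominated by the sort, so each projection takes $\mathcal{O}(n\log n)$ or $\mathcal{O}(m\log m)$ operations.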

The experiment consists of two parts. In the first part, the random matrix $A_{\xi}$ is sampled element-wise from an i.i.d. normal distribution, $A_{\xi}\sim N\left(A_{0},\sigma^{2}I_{(n+m)}\right)$ , where $A_{0}$ is pre-determined and randomly generated as follows. We partition $A_{0}=\begin{pmatrix}A_{11}&A_{12}\\ A_{21}&A_{22}\end{pmatrix}$ , with each block matrix $A_{ij}\in\mathbb{R}^{\frac{n}{2}\times\frac{m}{2}}$ . Each entry in $A_{ij}$ is generated randomly from Unif $(a_{ij}-b_{ij},a_{ij}+b_{ij})$ , where $a_{ij}$ and $b_{ij}$ are randomly generated from Unif $(-30,30)$ and Unif $(0,30)$ respectively. The problem parameters are set as: $n=10$ , $m=20$ , $\sigma^{2}=0.5$ , $\lambda_{x}=\lambda_{y}=1$ , $\kappa=\frac{L}{\mu}\approx 161$ , where $L$ and $\mu$ are the largest and the smallest singular values of the Jacobian matrix $\begin{pmatrix}\lambda I_{n}&A_{0}\\ -A_{0}^{\top}&\lambda I_{m}\end{pmatrix}$ respectively, with $\lambda=\lambda_{x}=\lambda_{y}$ . The smoothing parameters $\rho_{x},\rho_{y}$ are set to be of the order $10^{-8}$ , the total number of iterations $K$ is set to 1485, and the sample size is $t_{k}=k$ at iteration $k$ .

In the second part, the random matrix $A_{\xi}$ is sampled element-wise from an i.i.d. log-normal distribution, which is known to have a fat tail. It is used to model multiplicative random variables that take positive values. We reuse the parameters $A_{0}$ and $\sigma^{2}$ from the first part. In particular, the samples are generated by $A_{\xi}=e^{\frac{A_{0}}{10}+\sigma Z}$ , where $Z$ is sampled element-wise from the i.i.d. standard normal distribution $N(0,1)$ . Therefore, the mean of such a distribution is given by $A_{0}^{\prime}=\mathbb{E}\left[A_{\xi}\right]=e^{\frac{A_{0}}{10}+\frac{\sigma^{2}}{2}}$ . We have $\kappa\approx 146$ and set $K=1345$ in this part, and the other parameters remain the same as in the first part ( $\rho_{x},\rho_{y}$ are of the same order).
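The data-generating process described above can be sketched in Python as follows. This is our reading of the setup, not the authors' script: the function names are hypothetical, $n$ and $m$ are assumed even, and the exponential is applied element-wise.

```python
import numpy as np

def make_A0(n, m, rng):
    """Mean payoff matrix with four blocks, each drawn from its own uniform range."""
    A0 = np.empty((n, m))
    hn, hm = n // 2, m // 2  # assumes n and m are even
    for i in range(2):
        for j in range(2):
            a = rng.uniform(-30.0, 30.0)   # block center a_ij
            b = rng.uniform(0.0, 30.0)     # block half-width b_ij
            A0[i * hn:(i + 1) * hn, j * hm:(j + 1) * hm] = rng.uniform(
                a - b, a + b, size=(hn, hm))
    return A0

def sample_normal(A0, sigma2, rng):
    """First part: element-wise normal noise around A0."""
    return A0 + np.sqrt(sigma2) * rng.standard_normal(A0.shape)

def sample_lognormal(A0, sigma2, rng):
    """Second part: element-wise exp(A0/10 + sigma * Z)."""
    return np.exp(A0 / 10.0 + np.sqrt(sigma2) * rng.standard_normal(A0.shape))

def lognormal_mean(A0, sigma2):
    """Closed-form mean of the log-normal samples."""
    return np.exp(A0 / 10.0 + sigma2 / 2.0)

def condition_number(A, lam):
    """Ratio of largest to smallest singular value of the Jacobian."""
    n, m = A.shape
    J = np.block([[lam * np.eye(n), A], [-A.T, lam * np.eye(m)]])
    s = np.linalg.svd(J, compute_uv=False)  # returned in decreasing order
    return s[0] / s[-1]
```

Calling `condition_number(make_A0(10, 20, rng), 1.0)` gives the value of $\kappa$ for one random draw of $A_{0}$; the specific values $161$ and $146$ reported in the text depend on the authors' draw and cannot be reproduced from this sketch.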

In both parts of the experiment, we first solve the deterministic problem with the mean payoff matrix $A_{0}$ ( $A_{0}^{\prime}$ for the second part) and denote the solution as $(x^{*},y^{*})$ . We then implement the two proposed methods: the stochastic zeroth-order extra-point method and the stochastic zeroth-order extra-momentum method. In addition, we compare these two methods with other first-order methods: the extra-gradient method, the OGDA method, and the VS-Ave method (proposed in [9], which is a variance-reduced stochastic extension of Nesterov’s method [26]), all equipped with the same stochastic zeroth-order oracle. The results are shown in the following two figures, where the left plot shows the result from one experiment and the right plot shows the average over ten experiments. The parameters for each algorithm are manually tuned except for VS-Ave, where we adopt the recursive rule proposed in its original paper. The results show that the two newly proposed methods exhibit comparable (or slightly improved) performance relative to the stochastic extra-gradient/OGDA method in this particular application.

Figure 1: Normal distributed payoff matrix $A$

6 Conclusions

This paper proposes two new schemes of stochastic first-order methods to solve strongly monotone VI problems: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme features high flexibility in the configuration of parameter choices, which can be tailored to different problem classes. The second scheme is less general in the choice of search directions. However, it has the advantage of maintaining a single sequence throughout the iterations. Therefore, it requires only one projection per iteration, as opposed to most other first-order methods, which maintain an extra iterative sequence. Both methods achieve the optimal iteration complexity bound, provided that the stochastic gradient oracle allows the variance to be controlled. The application of these two schemes to stochastic black-box saddle-point problems is also presented. Through a randomized smoothing scheme, the stochastic oracles required in these two schemes can be constructed via stochastic zeroth-order gradient approximation. The variance is thus controllable by mini-batch sampling with linearly increasing sample sizes per iteration, and the sample complexity results are derived. Preliminary numerical experiments show an improved (or at least comparable) performance of the proposed schemes relative to other existing methods.

References

[1] H. Bauschke and P. Combettes “Convex analysis and monotone operator theory in Hilbert spaces” Springer, 2011
[2] D. Boob and C. Guzmán “Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems” In arXiv preprint arXiv:2104.02988, 2021
[3] Y. Chen, G. Lan and Y. Ouyang “Accelerated schemes for a class of variational inequalities” In Mathematical Programming 165.1 Springer, 2017, pp. 113–149
[4] F. Facchinei and J.-S. Pang “Finite-dimensional variational inequalities and complementarity problems” Springer Science & Business Media, 2007
[5] X. Gao “Low-order optimization algorithms: iteration complexity and applications” Ph.D. Thesis, University of Minnesota, 2018
[6] K. Huang and S. Zhang “A unifying framework of accelerated first-order approach to strongly monotone variational inequalities” In arXiv preprint arXiv:2103.15270, 2021
[7] A. Iusem, A. Jofré, R. Oliveira and P. Thompson “Extragradient method with variance reduction for stochastic variational inequalities” In SIAM Journal on Optimization 27.2 SIAM, 2017, pp. 686–724
[8] A. Iusem, A. Jofré, R. Oliveira and P. Thompson “Variance-based extragradient methods with line search for stochastic variational inequalities” In SIAM Journal on Optimization 29.1 SIAM, 2019, pp. 175–206
[9] A. Jalilzadeh and U. Shanbhag “A proximal-point algorithm with variable sample-sizes (PPAWSS) for monotone stochastic variational inequality problems” In 2019 Winter Simulation Conference (WSC), 2019, pp. 3551–3562 IEEE
[10] H. Jiang and H. Xu “Stochastic approximation approaches to the stochastic variational inequality problem” In IEEE Transactions on Automatic Control 53.6 IEEE, 2008, pp. 1462–1475
[11] A. Juditsky, A. Nemirovski and C. Tauvel “Solving variational inequalities with stochastic mirror-prox algorithm” In Stochastic Systems 1.1 INFORMS, 2011, pp. 17–58
[12] A. Kannan and U. Shanbhag “Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants” In Computational Optimization and Applications 74.3 Springer, 2019, pp. 779–820
[13] G.M. Korpelevich “The extragradient method for finding saddle points and other problems” In Matecon 12, 1976, pp. 747–756
[14] J. Koshal, A. Nedic and U. Shanbhag “Regularized iterative stochastic approximation methods for stochastic variational inequality problems” In IEEE Transactions on Automatic Control 58.3 IEEE, 2012, pp. 594–609
[15] G. Kotsalis, G. Lan and T. Li “Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation” In arXiv preprint arXiv:2011.02987, 2020
[16] J. Larson, M. Menickelly and S. Wild “Derivative-free optimization methods” In arXiv preprint arXiv:1904.11585, 2019
[17] S. Liu, S. Lu, X. Chen, Y. Feng, K. Xu, A. Al-Dujaili, M. Hong and U.-M. O’Reilly “Min-max optimization without gradients: convergence and applications to adversarial ML” In arXiv preprint arXiv:1909.13806, 2019
[18] B. Martinet “Brève communication. Régularisation d’inéquations variationnelles par approximations successives” In Revue française d’informatique et de recherche opérationnelle. Série rouge 4.R3 EDP Sciences, 1970, pp. 154–158
[19] M. Menickelly and S. Wild “Derivative-free robust optimization by outer approximations” In Mathematical Programming 179.1 Springer, 2020, pp. 157–193
[20] A. Mokhtari, A. Ozdaglar and S. Pattathil “A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach” In arXiv preprint arXiv:1901.08511, 2019
[21] A. Mokhtari, A. Ozdaglar and S. Pattathil “Convergence rate of $O(1/k)$ for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems” In SIAM Journal on Optimization 30.4 SIAM, 2020, pp. 3230–3251
[22] A. Nemirovski “Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems” In SIAM Journal on Optimization 15.1 SIAM, 2004, pp. 229–251
[23] Y. Nesterov “A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^{2})$ ” In Doklady SSSR 269, 1983, pp. 543–547
[24] Y. Nesterov “Dual extrapolation and its applications to solving variational inequalities and related problems” In Mathematical Programming 109.2-3 Springer, 2007, pp. 319–344
[25] Y. Nesterov “Introductory lectures on convex optimization: A basic course” Springer Science & Business Media, 2003
[26] Y. Nesterov and L. Scrimali “Solving strongly monotone variational and quasi-variational inequalities” In Available at SSRN 970903, 2006
[27] Y. Nesterov and V. Spokoiny “Random gradient-free minimization of convex functions” In Foundations of Computational Mathematics 17.2 Springer, 2017, pp. 527–566
[28] B. Palaniappan and F. Bach “Stochastic variance reduction methods for saddle-point problems” In Advances in Neural Information Processing Systems, 2016, pp. 1416–1424
[29] B. Polyak “Some methods of speeding up the convergence of iteration methods” In USSR Computational Mathematics and Mathematical Physics 4.5 Elsevier, 1964, pp. 1–17
[30] L. Popov “A modification of the Arrow-Hurwicz method for search of saddle points” In Mathematical Notes of the Academy of Sciences of the USSR 28.5 Springer, 1980, pp. 845–848
[31] H. Robbins and S. Monro “A stochastic approximation method” In The Annals of Mathematical Statistics JSTOR, 1951, pp. 400–407
[32] R. Rockafellar “Monotone operators and the proximal point algorithm” In SIAM Journal on Control and Optimization 14.5 SIAM, 1976, pp. 877–898
[33] A. Roy, Y. Chen, K. Balasubramanian and P. Mohapatra “Online and bandit algorithms for nonstationary stochastic saddle-point optimization” In arXiv preprint arXiv:1912.01698, 2019
[34] S. Shalev-Shwartz “Online learning and online convex optimization” In Foundations and trends in Machine Learning 4.2, 2011, pp. 107–194
[35] U. Shanbhag “Stochastic variational inequality problems: Applications, analysis, and algorithms” In Theory Driven by Influential Applications INFORMS, 2013, pp. 71–107
[36] P. Tseng “On linear convergence of iterative methods for the variational inequality problem” In Journal of Computational and Applied Mathematics 60.1-2 Elsevier, 1995, pp. 237–252
[37] Z. Wang, K. Balasubramanian, S. Ma and M. Razaviyayn “Zeroth-order algorithms for nonconvex minimax problems with improved complexities” In arXiv preprint arXiv:2001.07819, 2020
[38] T. Xu, Z. Wang, Y. Liang and H. Poor “Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence” In arXiv preprint arXiv:2006.09361, 2020
[39] F. Yousefian, A. Nedić and U. Shanbhag “A regularized smoothing stochastic approximation (RSSA) algorithm for stochastic variational inequality problems” In 2013 Winter Simulations Conference (WSC), 2013, pp. 933–944 IEEE
[40] F. Yousefian, A. Nedić and U. Shanbhag “On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems” In Mathematical Programming 165.1 Springer, 2017, pp. 391–431
[41] F. Yousefian, A. Nedić and U. Shanbhag “Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems” In 53rd IEEE Conference on Decision and Control, 2014, pp. 5831–5836 IEEE
[42] J. Zhang, M. Hong and S. Zhang “On lower iteration complexity bounds for the saddle point problems” In arXiv preprint arXiv:1912.07481, 2018

Appendix A Proof of technical lemmas and theorems

A.1 Proof of Lemma 3.1

First of all, by the 1-co-coerciveness (cf. e.g. Proposition 4.4 in [1]) of the projection operator $P_{\mathcal{Z}}$ , we have

\begin{aligned}
\|z^{k+1}-z^{*}\|^{2}={}&\|P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right)-P_{\mathcal{Z}}\left(z^{*}\right)\|^{2}\\
\leq{}&(z^{k+1}-z^{*})^{\top}\left(z^{k}-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)-z^{*}\right)\\
={}&\frac{1}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{1}{2}\|z^{k}-z^{*}\|^{2}-\frac{1}{2}\|z^{k+1}-z^{k}\|^{2}-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\\
&+(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right).
\end{aligned}

(70)

We shall first decompose the last term in the above inequality as

\begin{aligned}
&(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)\\
={}&(z^{k+1}-z^{k+0.5}+z^{k+0.5}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)\\
={}&\underbrace{(z^{k+1}-z^{k+0.5})^{\top}\left(-\eta\hat{F}(z^{k})+\beta(z^{k}-z^{k-1})\right)}_{(a)}\\
&+\underbrace{(z^{k+1}-z^{k+0.5})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\eta\hat{F}(z^{k})+(\gamma-\beta)(z^{k}-z^{k-1})\right)}_{(b)}\\
&+\underbrace{(z^{k+0.5}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)}_{(c)}.
\end{aligned}

(71)

Let us first use the optimality condition of $z^{k+0.5}$ to bound term $(a)$ :

\langle z^{k+0.5}-z^{k}-\beta(z^{k}-z^{k-1})+\eta\hat{F}(z^{k}),z-z^{k+0.5}\rangle\geq 0,\quad\forall z\in\mathcal{Z}.

Taking $z=z^{k+1}$ , we get

\displaystyle(a)\leq\frac{1}{2}\|z^{k+1}-z^{k}\|^{2}-\frac{1}{2}\|z^{k+0.5}-z^{k}\|^{2}-\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}.

(72)

We can also establish the bound for $(b)$ :

\begin{aligned}
(b)={}&(z^{k+1}-z^{k+0.5})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\alpha\hat{F}(z^{k})-\alpha\hat{F}(z^{k})+\eta\hat{F}(z^{k})+(\gamma-\beta)(z^{k}-z^{k-1})\right)\\
\leq{}&\alpha\|z^{k+1}-z^{k+0.5}\|\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\\
&+(\gamma-\beta)(z^{k+1}-z^{k+0.5})^{\top}(z^{k}-z^{k-1}).
\end{aligned}

Note the following bound from the Lipschitz continuity:

\begin{aligned}
\|\hat{F}(z)-\hat{F}(z^{\prime})\|={}&\|\hat{F}(z)-F(z)+F(z^{\prime})-\hat{F}(z^{\prime})+F(z)-F(z^{\prime})\|\\
\leq{}&\varepsilon_{z}+\varepsilon_{z^{\prime}}+L\|z-z^{\prime}\|,
\end{aligned}

(73)

for any $z,z^{\prime}\in\mathcal{Z}$ , where we used the definition of the stochastic error term

\displaystyle\varepsilon_{z}=\left\|F(z)-\hat{F}(z)\right\|.

(74)

Therefore,

\begin{aligned}
&\alpha\|z^{k+1}-z^{k+0.5}\|\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|\\
\leq{}&\frac{1}{2}\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k+0.5})\|^{2}\right)\\
\leq{}&\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\alpha^{2}L^{2}\|z^{k}-z^{k+0.5}\|^{2}.
\end{aligned}

(75)

Furthermore,

\begin{aligned}
&(\gamma-\beta)(z^{k+1}-z^{k+0.5})^{\top}(z^{k}-z^{k-1})\\
\leq{}&\frac{1}{2}|\gamma-\beta|\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right)\\
\leq{}&|\gamma-\beta|\left(\|z^{k+1}-z^{k}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
\leq{}&|\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+2\|z^{k}-z^{*}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
={}&|\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+3\|z^{k}-z^{*}\|^{2}+\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}

The resulting bound for $(b)$ becomes:

\begin{aligned}
(b)\leq{}&\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\left(\alpha^{2}L^{2}+|\gamma-\beta|\right)\|z^{k}-z^{k+0.5}\|^{2}\\
&+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\\
&+|\gamma-\beta|\left(2\|z^{k+1}-z^{*}\|^{2}+3\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}

(76)

Next let us bound $(c)$ in (71). We have,

\begin{aligned}
(c)={}&-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma(z^{k+0.5}-z^{*})^{\top}(z^{k}-z^{k-1})\\
\leq{}&-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\frac{1}{2}\gamma\left(\|z^{k+0.5}-z^{*}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right)\\
\leq{}&-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma\left(\|z^{k+0.5}-z^{k}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right)\\
={}&-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+\gamma\left(\|z^{k+0.5}-z^{k}\|^{2}+2\|z^{k}-z^{*}\|^{2}+\|z^{k-1}-z^{*}\|^{2}\right).
\end{aligned}

(77)

Combining the bounds for $(a),(b),(c)$ from (72), (76), and (77), it follows from (71) that

\begin{aligned}
&(z^{k+1}-z^{*})^{\top}\left(-\alpha\hat{F}(z^{k+0.5})+\gamma(z^{k}-z^{k-1})\right)\\
\leq{}&-\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})+\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}\\
&+2|\gamma-\beta|\,\|z^{k+1}-z^{*}\|^{2}+\left(2\gamma+3|\gamma-\beta|\right)\|z^{k}-z^{*}\|^{2}+\left(|\gamma-\beta|+\gamma\right)\|z^{k-1}-z^{*}\|^{2}\\
&+\frac{1}{2}\|z^{k+1}-z^{k}\|^{2}+\left(\alpha^{2}L^{2}+|\gamma-\beta|+\gamma-\frac{1}{2}\right)\|z^{k+0.5}-z^{k}\|^{2}.
\end{aligned}

(78)

We also need to bound the following term in (70):

\begin{aligned}
&-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\\
\leq{}&\tau\|z^{k+1}-z^{*}\|\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|\\
\overset{(73)}{\leq}{}&\tau L\|z^{k+1}-z^{*}\|\left(\frac{1}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})+\|z^{k}-z^{k-1}\|\right)\\
\leq{}&\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+\tau L\|z^{k}-z^{k-1}\|^{2}\\
\leq{}&\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L\|z^{k}-z^{*}\|^{2}+2\tau L\|z^{k-1}-z^{*}\|^{2}.
\end{aligned}

(79)

Combining the bounds in (78) and (79) with (70) and multiplying both sides by 2, we have

\begin{aligned}
&(1-4|\gamma-\beta|-\tau L)\|z^{k+1}-z^{*}\|^{2}\\
\leq{}&\left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)\|z^{k}-z^{*}\|^{2}\\
&+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)\|z^{k-1}-z^{*}\|^{2}\\
&+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\|z^{k+0.5}-z^{k}\|^{2}\\
&+2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}\\
&-2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k}).
\end{aligned}

Let us now take expectation on both sides. Noting $d_{k+1}=\mathbb{E}\left[\|z^{k+1}-z^{*}\|^{2}\right]$ , $d_{k}=\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$ , and $d_{k-1}=\mathbb{E}\left[\|z^{k-1}-z^{*}\|^{2}\right]$ , and noting that $\mathbb{E}[\varepsilon_{z}^{2}]\leq\sigma^{2}$ for all $z\in\mathcal{Z}$ by Assumption (6), we obtain

\begin{aligned}
&(1-4|\gamma-\beta|-\tau L)d_{k+1}\\
\leq{}&\left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)d_{k}+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\
&+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}\\
&-2\alpha\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})\right]+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].
\end{aligned}

(80)

Notice that

\begin{aligned}
&\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})\right]\\
={}&\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]+\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}(z^{k+0.5})-F(z^{k+0.5})\right)\right],
\end{aligned}

where

\begin{aligned}
&\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]\\
={}&\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(F(z^{k+0.5})-F(z^{*})\right)+(z^{k+0.5}-z^{*})^{\top}F(z^{*})\right]\\
\geq{}&\mathbb{E}\left[\mu\|z^{k+0.5}-z^{*}\|^{2}\right]\\
\geq{}&\frac{\mu}{2}d_{k}-\mu\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right],
\end{aligned}

and

\begin{aligned}
&\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}(z^{k+0.5})-F(z^{k+0.5})\right)\right]\\
\geq{}&-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\|\right]\\
={}&-\mathbb{E}\left[\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\|\,\big|\,\xi^{[k]}\right]\right]\\
={}&-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\,\mathbb{E}\left[\|\hat{F}(z^{k+0.5})-F(z^{k+0.5})\|\,\big|\,\xi^{[k]}\right]\right]\\
\geq{}&-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\cdot\delta\right]\\
\geq{}&-\delta D.
\end{aligned}

Further note that we have denoted $\xi^{[k]}=(\xi^{0},\xi^{0.5},\xi^{1},\xi^{1.5},...,\xi^{k-0.5},\xi^{k})$ to be the collection of random vectors sampled up until the iterate $z^{k}$ . Therefore, $z^{k+0.5}$ is a known vector given $\xi^{[k]}$ .

Putting the above two bounds back into (80), we arrive at the desired bound:

\begin{aligned}
&(1-4|\gamma-\beta|-\tau L)d_{k+1}\\
\leq{}&\left(1+4\gamma+6|\gamma-\beta|+4\tau L-\alpha\mu\right)d_{k}\\
&+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}\\
&+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]\\
&+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D\\
&+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].
\end{aligned}

A.2 Proof of Theorem 3.2

By condition (21), we have $t_{3}<1$ . We start by dividing both sides of (18) by $1-t_{3}$ :

	$\displaystyle d_{k+1}$	$\displaystyle\leq$	$\displaystyle\left(1-\frac{t_{1}-t_{3}}{1-t_{3}}\right)d_{k}+\frac{t_{2}}{1-t_{3}}d_{k-1}+\frac{8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}}{1-t_{3}}+\frac{2\alpha\delta D}{1-t_{3}}$
		$\displaystyle=$	$\displaystyle(1-a)d_{k}+b\cdot d_{k-1}+c\cdot\sigma^{2}+d\cdot\delta D.$		(81)

Here $a=\frac{t_{1}-t_{3}}{1-t_{3}}$ , $b=\frac{t_{2}}{1-t_{3}}$ , $c=\frac{8\left(\alpha^{2}+\frac{\tau}{L}\right)}{1-t_{3}}$ , and $d=\frac{2\alpha}{1-t_{3}}$ . Note that we have $1>a>b$ by condition (21). It is elementary to verify that

b\leq\left(1-\frac{a-b}{2}\right)\cdot\frac{a+b}{2},

and by rearranging terms in (81), we have the following

	$\displaystyle d_{k+1}+\frac{a+b}{2}d_{k}$	$\displaystyle\leq$	$\displaystyle\left(1-\frac{a-b}{2}\right)d_{k}+b\cdot d_{k-1}+c\cdot\sigma^{2}+d\cdot\delta D$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{a-b}{2}\right)\left(d_{k}+\frac{a+b}{2}d_{k-1}\right)+c\cdot\sigma^{2}+d\cdot\delta D$

A recursive argument yields the following result:

			$\displaystyle d_{k+1}+\frac{a+b}{2}d_{k}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{a-b}{2}\right)^{k+1}\left(d_{0}+\frac{a+b}{2}d_{-1}\right)+\left(c\cdot\sigma^{2}+d\cdot\delta D\right)\cdot\sum\limits_{i=0}^{k}\left(1-\frac{a-b}{2}\right)^{i}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{a-b}{2}\right)^{k+1}\cdot\frac{2+a+b}{2}\|z^{0}-z^{*}\|^{2}+\left(c\cdot\sigma^{2}+d\cdot\delta D\right)\cdot\frac{2}{a-b}.$

Note that $d_{0}=d_{-1}=\|z^{0}-z^{*}\|^{2}$ . The statement in Theorem 3.2 follows by letting $q=\frac{2}{a-b}=\frac{2(1-t_{3})}{t_{1}-t_{2}-t_{3}}$ .
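As a numerical sanity check of the recursive argument, the following minimal sketch iterates recursion (81) with equality (the worst case it allows) and compares $d_{k}+\frac{a+b}{2}d_{k-1}$ against the bound in the last display. The function names and the numerical values of $a$, $b$, the lumped noise term (standing for $c\cdot\sigma^{2}+d\cdot\delta D$) and $d_{0}$ are illustrative assumptions, not quantities taken from the paper.

```python
def run_recursion(a, b, noise, d0, steps):
    """Iterate d_{k+1} = (1-a) d_k + b d_{k-1} + noise, starting from d_{-1} = d_0."""
    prev, cur = d0, d0
    history = [cur]
    for _ in range(steps):
        prev, cur = cur, (1 - a) * cur + b * prev + noise
        history.append(cur)
    return history  # history[k] holds d_k


def lyapunov_bound(a, b, noise, d0, k):
    """Bound on d_k + (a+b)/2 * d_{k-1} from the final display, valid for k >= 1."""
    rate = 1 - (a - b) / 2
    return rate**k * (2 + a + b) / 2 * d0 + noise * 2 / (a - b)


# Illustrative values only; the argument needs 1 > a > b >= 0.
a, b, noise, d0 = 0.125, 0.09375, 1e-3, 1.0
history = run_recursion(a, b, noise, d0, steps=200)
bound_holds = all(
    history[k] + (a + b) / 2 * history[k - 1]
    <= lyapunov_bound(a, b, noise, d0, k) + 1e-12
    for k in range(1, len(history))
)
print(bound_holds)
```

The check also relies on the elementary inequality $b\leq\left(1-\frac{a-b}{2}\right)\cdot\frac{a+b}{2}$, which holds because the difference of the two sides equals $\frac{a-b}{2}\left(1-\frac{a+b}{2}\right)\geq 0$ when $1>a>b\geq 0$.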

A.3 Proof of Proposition 3.3

For the choice of parameters

(\alpha,\beta,\gamma,\eta,\tau)=\left(\frac{1}{4L},\frac{1}{64\kappa},\frac{1}{64\kappa},\frac{1}{4L},\frac{1}{64L\kappa}\right),

we have

(t_{1},t_{2},t_{3})=\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)

by the relation (17). Additionally,

2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma+2\alpha\mu-1=\frac{1}{8}+\frac{1}{32\kappa}+\frac{1}{2\kappa}-1<0.

Therefore, both conditions (13) and (21) are satisfied.
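The arithmetic behind this parameter choice can be replayed exactly with rational numbers. The sketch below is only a check under a stated assumption: relation (17) is not reproduced in this appendix, so the formulas for $(t_{1},t_{2},t_{3})$ are read off from the coefficients of $d_{k}$, $d_{k-1}$ and $d_{k+1}$ in the bound at the end of Appendix A.1. The helper names are hypothetical.

```python
from fractions import Fraction


def t_constants(alpha, beta, gamma, tau, L, mu):
    """(t1, t2, t3) as read off from the A.1 bound (assumed form of relation (17))."""
    gap = abs(gamma - beta)
    t1 = alpha * mu - 4 * gamma - 6 * gap - 4 * tau * L
    t2 = 2 * gap + 2 * gamma + 4 * tau * L
    t3 = 4 * gap + tau * L
    return t1, t2, t3


def check_parameters(kappa, L=1):
    """Return (t1, t2, t3, slack) for the parameter choice of Proposition 3.3."""
    kappa, L = Fraction(kappa), Fraction(L)
    mu = L / kappa  # condition number kappa = L / mu
    alpha = 1 / (4 * L)
    beta = gamma = 1 / (64 * kappa)
    tau = 1 / (64 * L * kappa)
    t1, t2, t3 = t_constants(alpha, beta, gamma, tau, L, mu)
    # Coefficient of E||z^{k+0.5} - z^k||^2, which must be negative.
    slack = 2 * alpha**2 * L**2 + 2 * abs(gamma - beta) + 2 * gamma + 2 * alpha * mu - 1
    return t1, t2, t3, slack


for kappa in (1, 5, 1000):
    print(kappa, check_parameters(kappa))
```

For every $\kappa\geq 1$ tried, the values agree with $\left(\frac{1}{8\kappa},\frac{3}{32\kappa},\frac{1}{64\kappa}\right)$, the slack is negative, and $t_{1}>t_{2}+t_{3}$, which is the property $a>b$ used in Appendix A.2.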

Now, from (10), we have

	$\displaystyle\left(1-\frac{1}{64\kappa}\right)d_{k+1}$	$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{k}+\frac{3}{32\kappa}d_{k-1}+8\left(\frac{1}{16L^{2}}+\frac{1}{64L^{2}\kappa}\right)\sigma^{2}+\frac{\delta D}{2L}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{k}+\frac{3}{32\kappa}d_{k-1}+\frac{5\sigma^{2}}{8L^{2}}+\frac{\delta D}{2L}.$

Dividing both sides by $1-\frac{1}{64\kappa}$ and noting that $\left(1-\frac{1}{64\kappa}\right)^{-1}\leq\frac{64}{63}$ , we have:

$\displaystyle d_{k+1}$	$\displaystyle\leq$	$\displaystyle\frac{1-\frac{1}{8\kappa}}{1-\frac{1}{64\kappa}}d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}$
	$\displaystyle=$	$\displaystyle\left(1-\frac{\frac{7}{64\kappa}}{1-\frac{1}{64\kappa}}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}$
	$\displaystyle\leq$	$\displaystyle\left(1-\frac{7}{64\kappa}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}.$

We can move a part of $d_{k}$ to the LHS and form the following:

			$\displaystyle d_{k+1}+\frac{27}{256\kappa}d_{k}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{256\kappa}\right)d_{k}+\frac{2}{21\kappa}d_{k-1}+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{256\kappa}\right)\left(d_{k}+\frac{27}{256\kappa}d_{k-1}\right)+\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{256\kappa}\right)^{k+1}\left(d_{0}+\frac{27}{256\kappa}d_{-1}\right)+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot\sum\limits_{i=0}^{k}\left(1-\frac{1}{256\kappa}\right)^{i}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{256\kappa}\right)^{k+1}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot\frac{1-\left(1-\frac{1}{256\kappa}\right)^{k+1}}{\frac{1}{256\kappa}}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{256\kappa}\right)^{k+1}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+\left(\frac{40\sigma^{2}}{63L^{2}}+\frac{32\delta D}{63L}\right)\cdot 256\kappa.$

Note that $d_{0}=d_{-1}=\|z^{0}-z^{*}\|^{2}$ . Finally, the LHS of the above inequality can be lower bounded by $d_{k+1}$ , thus completing the proof.
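The constants in this proof can be double-checked with exact fractions. The following sketch (helper names are hypothetical) verifies that moving $\frac{27}{256\kappa}d_{k}$ to the LHS leaves the rate $1-\frac{1}{256\kappa}$, that $\frac{2}{21\kappa}\leq\left(1-\frac{1}{256\kappa}\right)\frac{27}{256\kappa}$ for $\kappa\geq 1$, and that rescaling by $\frac{64}{63}$ gives the stated noise constants.

```python
from fractions import Fraction


def contraction_check(kappa):
    """Exact arithmetic behind moving 27/(256*kappa) * d_k to the left-hand side."""
    kappa = Fraction(kappa)
    shift = Fraction(27, 256) / kappa            # weight moved to the LHS
    rate = 1 - Fraction(7, 64) / kappa + shift   # coefficient left on d_k
    b = Fraction(2, 21) / kappa                  # coefficient of d_{k-1}
    return rate, shift, b


def rescaled_noise_constants():
    """Constants of sigma^2 / L^2 and delta * D / L after multiplying by 64/63."""
    scale = Fraction(64, 63)
    return scale * Fraction(5, 8), scale * Fraction(1, 2)


for kappa in (1, 2, 100):
    rate, shift, b = contraction_check(kappa)
    print(kappa, rate, b <= rate * shift)
print(rescaled_noise_constants())
```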

A.4 Proof of Lemma 3.4

We start by using the 1-co-coerciveness of the projection operator $P_{\mathcal{Z}}(\cdot)$ :

			$\displaystyle\|z^{k+1}-z^{*}\|^{2}$
		$\displaystyle=$	$\displaystyle\left\|P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right)-P_{\mathcal{Z}}\left(z^{*}-\alpha F(z^{*})\right)\right\|^{2}$
		$\displaystyle\leq$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)-\left(z^{*}-\alpha F(z^{*})\right)\right)$
		$\displaystyle=$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left((z^{k}-z^{*})-\alpha\left(\hat{F}(z^{k})-F(z^{*})\right)+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right).$

Next, let us bound the above four terms separately:

\displaystyle(z^{k+1}-z^{*})^{\top}(z^{k}-z^{*})=\frac{1}{2}\left(\|z^{k+1}-z^{*}\|^{2}+\|z^{k}-z^{*}\|^{2}-\|z^{k+1}-z^{k}\|^{2}\right),

(83)

and

\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-F(z^{*})\right)=(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})+\hat{F}(z^{k+1})-F(z^{*})\right),

(84)

where

	$\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{*})\right)$	(85)
$\displaystyle=$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})+F(z^{k+1})-F(z^{*})\right)$
$\displaystyle\geq$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)+\mu\|z^{k+1}-z^{*}\|^{2},$

and

\displaystyle\gamma(z^{k+1}-z^{*})^{\top}(z^{k}-z^{k-1})\leq\frac{\gamma}{2}\left(\|z^{k+1}-z^{*}\|^{2}+\|z^{k}-z^{k-1}\|^{2}\right),

(86)

and

	$\displaystyle-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)$	(87)
$\displaystyle=$	$\displaystyle-\tau(z^{k+1}-z^{k}+z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)$
$\displaystyle\leq$	$\displaystyle\tau\|z^{k+1}-z^{k}\|\,\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|-\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)$
$\displaystyle\leq$	$\displaystyle\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}+\tau^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|^{2}-\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right),$

where

\displaystyle\tau^{2}\|\hat{F}(z^{k})-\hat{F}(z^{k-1})\|^{2}\overset{\eqref{sto-Lip-bd}}{\leq}2\tau^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau^{2}L^{2}\|z^{k}-z^{k-1}\|^{2}.

(88)

Substituting the bounds (83)-(88) into the first inequality of this proof, we get:

			$\displaystyle\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}$
		$\displaystyle\leq$	$\displaystyle\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}$
			$\displaystyle+2\tau^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}-\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right).$

Taking expectation on both sides gives us

	$\displaystyle\mathbb{E}\left[\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]$	(89)
$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]$
	$\displaystyle+8\tau^{2}\sigma^{2}-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)\right].$

Note that

			$\displaystyle-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k+1})-F(z^{k+1})\right)\right]$
		$\displaystyle\leq$	$\displaystyle\alpha\mathbb{E}\left[\|z^{k+1}-z^{*}\|\,\|\hat{F}(z^{k+1})-F(z^{k+1})\|\right]$
		$\displaystyle=$	$\displaystyle\alpha\mathbb{E}\left[\mathbb{E}\left[\|z^{k+1}-z^{*}\|\,\|\hat{F}(z^{k+1})-F(z^{k+1})\|\,\Big|\,\xi^{[k]}\right]\right]$
		$\displaystyle=$	$\displaystyle\alpha\mathbb{E}\left[\|z^{k+1}-z^{*}\|\cdot\mathbb{E}\left[\|\hat{F}(z^{k+1})-F(z^{k+1})\|\,\Big|\,\xi^{[k]}\right]\right]$
		$\displaystyle\leq$	$\displaystyle\alpha\mathbb{E}\left[\delta\|z^{k+1}-z^{*}\|\right]$
		$\displaystyle\leq$	$\displaystyle\alpha\mathbb{E}\left[\frac{\delta^{2}}{2\mu}+\frac{\mu}{2}\|z^{k+1}-z^{*}\|^{2}\right].$

Here we define $\xi^{[k]}=(\xi^{0},\xi^{1},...,\xi^{k})$ to be the collection of random vectors sampled up until the iterate $z^{k}$ , and $z^{k+1}$ is known given $\xi^{[k]}$ .

Therefore, (89) becomes

			$\displaystyle\mathbb{E}\left[\left(\frac{1}{2}+\frac{\alpha\mu}{2}-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}(z^{k})-\hat{F}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]$
			$\displaystyle+8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu},$

completing the proof.
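To make the update analyzed in this proof concrete, here is a minimal sketch of the single-projection iteration $z^{k+1}=P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau(\hat{F}(z^{k})-\hat{F}(z^{k-1}))\right)$ , as it appears in the first display above. Everything else is an assumption made for illustration: the test operator (from the saddle function $f(x,y)=\frac{\mu}{2}x^{2}+bxy-\frac{\mu}{2}y^{2}$ with solution $z^{*}=0$ ), the box constraint, the Gaussian noise model, and the step sizes $\alpha=\tau=\frac{1}{4L}$ , $\gamma=0.02$ , which are not the parameter choice of Theorem 3.5.

```python
import math
import random

MU, B = 1.0, 2.0  # assumed test problem: strong monotonicity MU, coupling B


def operator(z):
    """F(z) = (df/dx, -df/dy) for f(x, y) = MU/2 x^2 + B x y - MU/2 y^2."""
    x, y = z
    return [MU * x + B * y, -B * x + MU * y]


def noisy_operator(z, sigma, rng):
    """Stochastic oracle: exact operator plus Gaussian noise (illustrative model)."""
    return [fi + rng.gauss(0.0, sigma) for fi in operator(z)]


def project_box(z, radius):
    """Euclidean projection onto the box [-radius, radius]^2."""
    return [max(-radius, min(radius, zi)) for zi in z]


def extra_momentum(z0, alpha, gamma, tau, steps, sigma=0.0, radius=10.0, seed=0):
    """Heavy-ball plus optimism update with one projection per iteration."""
    rng = random.Random(seed)
    z_prev, z = list(z0), list(z0)          # convention z^{-1} = z^0
    f_prev = noisy_operator(z_prev, sigma, rng)
    for _ in range(steps):
        f_cur = noisy_operator(z, sigma, rng)
        step = [
            zi - alpha * fc + gamma * (zi - zp) - tau * (fc - fp)
            for zi, zp, fc, fp in zip(z, z_prev, f_cur, f_prev)
        ]
        z_prev, z, f_prev = z, project_box(step, radius), f_cur
    return z


L = math.sqrt(MU**2 + B**2)  # Lipschitz constant of this linear operator
z_final = extra_momentum([1.0, 1.0], alpha=1 / (4 * L), gamma=0.02, tau=1 / (4 * L), steps=200)
print(math.hypot(*z_final))
```

With `sigma=0.0` the iterates contract linearly toward $z^{*}=0$ on this test problem; with `sigma > 0` they are expected to settle in a noise-dominated neighborhood, in line with the variance term $8\tau^{2}\sigma^{2}$ in the lemma.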

A.5 Proof of Theorem 3.5

Continuing from (27), we have:

	$\displaystyle V_{k}$	$\displaystyle\leq$	$\displaystyle\left(1+\frac{\theta}{\kappa}\right)^{-k}V_{0}+\sum\limits_{i=1}^{k}\left(1+\frac{\theta}{\kappa}\right)^{-i}\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right)$
		$\displaystyle=$	$\displaystyle\frac{1}{2}\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\frac{1-\left(1+\frac{\theta}{\kappa}\right)^{-k}}{\frac{\theta}{\kappa}}\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right),$

where we use $z^{-1}=z^{0}$ for $V_{0}$ .

Finally, with the following bound:

			$\displaystyle\tau(z^{k}-z^{*})^{\top}\left(\hat{F}(z^{k-1})-\hat{F}(z^{k})\right)$
		$\displaystyle\geq$	$\displaystyle-\tau\|z^{k}-z^{*}\|\,\|\hat{F}(z^{k-1})-\hat{F}(z^{k})\|$
		$\displaystyle\geq$	$\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\|\hat{F}(z^{k-1})-\hat{F}(z^{k})\|^{2}$
		$\displaystyle\overset{\eqref{sto-Lip-bd}}{\geq}$	$\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\left(2L^{2}\|z^{k-1}-z^{k}\|^{2}+2(\varepsilon_{z^{k-1}}+\varepsilon_{z^{k}})^{2}\right),$

we can lower bound $V_{k}$ as

\frac{1}{4}\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]-8\tau^{2}\sigma^{2}\leq V_{k}.

Therefore

	$\displaystyle\mathbb{E}\left[\|z^{k}-z^{*}\|^{2}\right]$	(90)
$\displaystyle\leq$	$\displaystyle 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\frac{4\kappa}{\theta}\cdot\left(1-\left(1+\frac{\theta}{\kappa}\right)^{-k}\right)\cdot\left(8\tau^{2}\sigma^{2}+\frac{\alpha\delta^{2}}{2\mu}\right)+32\tau^{2}\sigma^{2}$
$\displaystyle\leq$	$\displaystyle 2\left(1+\frac{\theta}{\kappa}\right)^{-k}\|z^{0}-z^{*}\|^{2}+\left(\frac{\kappa}{\theta}+1\right)\cdot 32\tau^{2}\sigma^{2}+\frac{2\kappa\alpha\delta^{2}}{\theta\mu}.$

The statement in Theorem 3.5 follows by noting $\left(1+\frac{\theta}{\kappa}\right)^{-1}=1-\frac{\theta}{\kappa+\theta}$ .

A.6 Proof of Lemma 4.2

We will derive the first bound in (59); the second bound is similar and will be omitted.

Notice that

			$\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}_{\xi}\left[\mathbb{E}_{u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]\right]$
		$\displaystyle\overset{\eqref{smooth-x-sm}}{\leq}$	$\displaystyle\mathbb{E}_{\xi}\left[2n\|\nabla_{x}\hat{f}(x,y,\xi)\|^{2}\right]+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}$
		$\displaystyle=$	$\displaystyle 2n\left(\mathbb{E}_{\xi}\left[\|\nabla_{x}f(x,y)\|^{2}+2\nabla_{x}f(x,y)^{\top}\left(\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\right)+\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\right)$
			$\displaystyle+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}$
		$\displaystyle\overset{\eqref{so-x-mean}}{=}$	$\displaystyle 2n\left(\|\nabla_{x}f(x,y)\|^{2}+\mathbb{E}_{\xi}\left[\|\nabla_{x}\hat{f}(x,y,\xi)-\nabla_{x}f(x,y)\|^{2}\right]\right)+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}$
		$\displaystyle\leq$	$\displaystyle 2n\left(M^{2}+\sigma^{2}\right)+\frac{\rho_{x}^{2}L^{2}n^{2}}{2}.$

Further note that

			$\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)-\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}-2F_{\rho_{x}}(x,y,\xi,u)^{\top}\nabla_{x}f_{\rho_{x}}(x,y)+\|\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]$
		$\displaystyle\overset{\eqref{SZG-mean}}{=}$	$\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}-\|\nabla_{x}f_{\rho_{x}}(x,y)\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle\mathbb{E}_{\xi,u}\left[\|F_{\rho_{x}}(x,y,\xi,u)\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle 2n(M^{2}+\sigma^{2})+\frac{\rho_{x}^{2}L^{2}n^{2}}{2},$

completing the proof for (59).

A.7 Proof of Lemma 4.3

The proof of this lemma follows the same line of argument as the proof in Appendix A.1, with the stochastic mapping $\hat{F}(z^{k})$ replaced by the stochastic zeroth-order gradient $\hat{F}^{k}_{\rho}(z^{k})$ . We therefore do not repeat the analysis and only highlight the main differences. First, for $F(z)=\begin{pmatrix}\nabla_{x}f(x,y)\\ -\nabla_{y}f(x,y)\end{pmatrix}$ , we have

\displaystyle\|F(z)-F(z^{\prime})\|\leq L\|z-z^{\prime}\|,\quad\forall z,z^{\prime}\in\mathcal{Z}

(91)

where $L=2\cdot\max(L_{x},L_{y},L_{xy})$ , because

			$\displaystyle\|F(z)-F(z^{\prime})\|^{2}$
		$\displaystyle=$	$\displaystyle\|\nabla_{x}f(x,y)-\nabla_{x}f(x^{\prime},y^{\prime})\|^{2}+\|\nabla_{y}f(x,y)-\nabla_{y}f(x^{\prime},y^{\prime})\|^{2}$
		$\displaystyle\leq$	$\displaystyle 2L_{x}^{2}\|x-x^{\prime}\|^{2}+2L_{xy}^{2}\|y-y^{\prime}\|^{2}+2L_{y}^{2}\|y-y^{\prime}\|^{2}+2L_{xy}^{2}\|x-x^{\prime}\|^{2}$
		$\displaystyle\leq$	$\displaystyle L^{2}(\|x-x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2})=L^{2}\|z-z^{\prime}\|^{2}.$
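The Lipschitz constant $L=2\cdot\max(L_{x},L_{y},L_{xy})$ can be illustrated on a small example. The sketch below uses a hypothetical scalar quadratic saddle function $f(x,y)=\frac{a}{2}x^{2}+bxy-\frac{c}{2}y^{2}$ , for which $L_{x}=|a|$ , $L_{y}=|c|$ and $L_{xy}=|b|$ , and estimates the largest ratio $\|F(z)-F(z^{\prime})\|/\|z-z^{\prime}\|$ over random pairs. The function, the coefficients and the sampling range are assumptions chosen only for illustration.

```python
import math
import random


def saddle_operator(x, y, a, b, c):
    """F(z) = (df/dx, -df/dy) for f(x, y) = a/2 x^2 + b x y - c/2 y^2."""
    return (a * x + b * y, c * y - b * x)


def worst_lipschitz_ratio(a, b, c, trials=10000, seed=0):
    """Largest observed ||F(z) - F(z')|| / ||z - z'|| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x, y, xp, yp = (rng.uniform(-5.0, 5.0) for _ in range(4))
        dist = math.hypot(x - xp, y - yp)
        if dist == 0.0:
            continue  # skip degenerate pairs
        fx, fy = saddle_operator(x, y, a, b, c)
        gx, gy = saddle_operator(xp, yp, a, b, c)
        worst = max(worst, math.hypot(fx - gx, fy - gy) / dist)
    return worst


a, b, c = 3.0, -2.0, 0.5
L_bound = 2 * max(abs(a), abs(b), abs(c))
print(worst_lipschitz_ratio(a, b, c), "<=", L_bound)
```

The observed ratio stays below the bound, which is not tight in general because of the factor 2 introduced by the inequality $\|u+v\|^{2}\leq 2\|u\|^{2}+2\|v\|^{2}$ .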

Next, by denoting $\varepsilon_{z^{k}}=\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})\|$ , we have

	$\displaystyle\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\|$	(92)
$\displaystyle=$	$\displaystyle\|\hat{F}^{k}_{\rho}(z^{k})-F_{\rho}(z^{k})+F_{\rho}(z^{k+0.5})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+F_{\rho}(z^{k})-F_{\rho}(z^{k+0.5})\|$
$\displaystyle\leq$	$\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+\|F_{\rho}(z^{k})-F_{\rho}(z^{k+0.5})\|$
$\displaystyle=$	$\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+\|F_{\rho}(z^{k})-F(z^{k})+F(z^{k+0.5})-F_{\rho}(z^{k+0.5})+F(z^{k})-F(z^{k+0.5})\|$
$\displaystyle\overset{\eqref{smooth-grad-bd},\eqref{lip-minmax}}{\leq}$	$\displaystyle\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}}+L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+L\|z^{k}-z^{k+0.5}\|.$

Therefore, for a similar bound as in (75), we have

			$\displaystyle\alpha\|z^{k+1}-z^{k+0.5}\|\,\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{2}\left(\|z^{k+1}-z^{k+0.5}\|^{2}+\alpha^{2}\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\|^{2}\right)$
		$\displaystyle\leq$	$\displaystyle\frac{1}{2}\|z^{k+1}-z^{k+0.5}\|^{2}+2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+2\alpha^{2}L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\alpha^{2}L^{2}\|z^{k}-z^{k+0.5}\|^{2}.$

For another similar bound as in (79), we have

			$\displaystyle-\tau(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\right)$
		$\displaystyle\leq$	$\displaystyle\tau\|z^{k+1}-z^{*}\|\,\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\|$
		$\displaystyle\leq$	$\displaystyle\tau L\|z^{k+1}-z^{*}\|\left(\frac{1}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})+\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+\|z^{k}-z^{k-1}\|\right)$
		$\displaystyle\leq$	$\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\tau L\|z^{k}-z^{k-1}\|^{2}$
		$\displaystyle\leq$	$\displaystyle\frac{\tau L}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+2\tau L(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})$
			$\displaystyle+2\tau L\|z^{k}-z^{*}\|^{2}+2\tau L\|z^{k-1}-z^{*}\|^{2}.$

Therefore, we reach the bound that

			$\displaystyle(1-4|\gamma-\beta|-\tau L)\|z^{k+1}-z^{*}\|^{2}$
		$\displaystyle\leq$	$\displaystyle\left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)\|z^{k}-z^{*}\|^{2}$
			$\displaystyle+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)\|z^{k-1}-z^{*}\|^{2}$
			$\displaystyle+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\|z^{k+0.5}-z^{k}\|^{2}$
			$\displaystyle+4\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\frac{4\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}$
			$\displaystyle+4(\alpha^{2}L^{2}+\tau L)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})$
			$\displaystyle-2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k}).$

By (61), we have $\mathbb{E}[\varepsilon^{2}_{z^{k}}]\leq\frac{2\tilde{\sigma}^{2}}{t_{k}}$ . Taking expectation on both sides for the above inequality, we have

	$\displaystyle(1-4|\gamma-\beta|-\tau L)d_{k+1}$	(93)
$\displaystyle\leq$	$\displaystyle\left(1+4\gamma+6|\gamma-\beta|+4\tau L\right)d_{k}$
	$\displaystyle+\left(2|\gamma-\beta|+2\gamma+4\tau L\right)d_{k-1}$
	$\displaystyle+\left(2\alpha^{2}L^{2}+2|\gamma-\beta|+2\gamma-1\right)\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right]$
	$\displaystyle+16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right)+4(\alpha^{2}L^{2}+\tau L)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})$
	$\displaystyle+\mathbb{E}\left[-2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}^{k}_{\rho}(z^{k})\right].$

Note that

			$\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]+\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right],$

where

\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}F(z^{k+0.5})\right]\geq\mathbb{E}\left[\mu\|z^{k+0.5}-z^{*}\|^{2}\right]\geq\frac{\mu d_{k}}{2}-\mu\mathbb{E}\left[\|z^{k+0.5}-z^{k}\|^{2}\right].

Let us denote

		$\displaystyle\xi^{k}_{[t_{k}]}:=(\xi^{k}_{1},...,\xi^{k}_{t_{k}}),$	$\displaystyle w^{k}_{[t_{k}]}:=(w^{k}_{1},...,w^{k}_{t_{k}})$
		$\displaystyle\xi^{[k]}:=(\xi^{1}_{[t_{1}]},...,\xi^{k}_{[t_{k}]}),$	$\displaystyle w^{[k]}:=(w^{1}_{[t_{1}]},...,w^{k}_{[t_{k}]})$

as the collection of all random vectors at iteration $k$ and the collection of all such random vectors from iteration $1$ to $k$ , respectively. Since $z^{k+0.5}$ is a deterministic vector given $(\xi^{[k]},w^{[k]})$ , we have the following bound:

			$\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\,\Big|\,\xi^{[k]},w^{[k]}\right]\right]$
		$\displaystyle\overset{\eqref{batch-mean}}{=}$	$\displaystyle\mathbb{E}\left[(z^{k+0.5}-z^{*})^{\top}\left(F_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right)\right]$
		$\displaystyle\geq$	$\displaystyle-\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\left\|F_{\rho}(z^{k+0.5})-F(z^{k+0.5})\right\|\right]$
		$\displaystyle\overset{\eqref{smooth-grad-bd}}{\geq}$	$\displaystyle-\frac{L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2}\mathbb{E}\left[\|z^{k+0.5}-z^{*}\|\right]$
		$\displaystyle\geq$	$\displaystyle-\frac{LD\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2},$

where in the last inequality we utilize the boundedness assumption of $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and denote $D=\max\limits_{z,z^{\prime}\in\mathcal{Z}}\|z-z^{\prime}\|$ . Combining the above bounds into (93), the desired result follows.

A.8 Proof of Proposition 4.4

Note the variance is upper bounded by

16\tilde{\sigma}^{2}\left(\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{1}{t_{k}}+\frac{\alpha^{2}}{t_{k+0.5}}+\frac{\tau}{Lt_{k-1}}\right).

Since we take $t_{k}=t_{k+0.5}$ and $\frac{1}{t_{k-1}}=\frac{1}{t_{k}}\left(1-\frac{1}{256\kappa}\right)^{-1}\leq\frac{2}{t_{k}}$ , the above upper bound can be written as

\left(\alpha^{2}+\frac{\tau}{L}\right)\frac{48\tilde{\sigma}^{2}}{t_{k}}.

By substituting the specific parameter choice into (65), starting from the last iteration $K$ , we have

			$\displaystyle\left(1-\frac{1}{64\kappa}\right)d_{K}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{\tilde{\sigma}^{2}}{t_{K-1}}\cdot\left(\frac{3}{L^{2}}+\frac{3}{4L^{2}\kappa}\right)$
			$\displaystyle+\left(\frac{1}{16\kappa}+\frac{1}{4}\right)(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\frac{D\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{4}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{15\tilde{\sigma}^{2}}{4t_{K-1}L^{2}}+\frac{5(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{16}+\frac{D\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{4}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{8\kappa}\right)d_{K-1}+\frac{3}{32\kappa}d_{K-2}+\frac{15\tilde{\sigma}^{2}}{4t_{K-1}L^{2}}+\left(\frac{5}{16}+\frac{D}{4}\right)\frac{C_{1}^{K}}{\kappa}.$

In the last inequality, we denote $C_{1}=1-\frac{1}{256\kappa}$ and $\rho_{x}=\frac{C_{1}^{K}}{\sqrt{2}n\kappa}$ , $\rho_{y}=\frac{C_{1}^{K}}{\sqrt{2}m\kappa}$ , and use the fact that $C_{1}^{2}\leq C_{1}$ .

Dividing both sides by $1-\frac{1}{64\kappa}$ and noting that $\left(1-\frac{1}{64\kappa}\right)^{-1}\leq\frac{64}{63}$ , we obtain

$\displaystyle d_{K}$	$\displaystyle\leq$	$\displaystyle\frac{1-\frac{1}{8\kappa}}{1-\frac{1}{64\kappa}}d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}$
	$\displaystyle=$	$\displaystyle\left(1-\frac{\frac{7}{64\kappa}}{1-\frac{1}{64\kappa}}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}$
	$\displaystyle\leq$	$\displaystyle\left(1-\frac{7}{64\kappa}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{80\tilde{\sigma}^{2}}{21t_{K-1}L^{2}}+\frac{(20+16D)C_{1}^{K}}{63\kappa}$
	$\displaystyle=$	$\displaystyle\left(1-\frac{7}{64\kappa}\right)d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa}.$

By moving a part of $d_{K-1}$ to the LHS, we have:

			$\displaystyle d_{K}+\frac{27}{256\kappa}d_{K-1}$
		$\displaystyle\leq$	$\displaystyle C_{1}\cdot d_{K-1}+\frac{2}{21\kappa}d_{K-2}+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa}$
		$\displaystyle\leq$	$\displaystyle C_{1}\left(d_{K-1}+\frac{27}{256\kappa}d_{K-2}\right)+\frac{c_{2}}{t_{K-1}}+c_{3}\cdot\frac{C_{1}^{K}}{\kappa}$
		$\displaystyle\leq$	$\displaystyle C_{1}^{K}\left(d_{0}+\frac{27}{256\kappa}d_{-1}\right)+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+c_{3}\frac{C_{1}^{K}}{\kappa}\cdot\sum\limits_{i=1}^{K}C_{1}^{i-1}$
		$\displaystyle\leq$	$\displaystyle C_{1}^{K}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+c_{3}\frac{C_{1}^{K}}{\kappa}\cdot\frac{1-C_{1}^{K}}{\frac{1}{256\kappa}}$
		$\displaystyle\leq$	$\displaystyle C_{1}^{K}\cdot\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}\cdot\sum\limits_{i=1}^{K}t_{K-i}^{-1}C_{1}^{i-1}+256c_{3}C_{1}^{K}.$

With the sample size $t_{k}=K\left(1-\frac{1}{256\kappa}\right)^{-(k+1)}=K\cdot C_{1}^{-(k+1)}$ , we have:

d_{K}\leq\left(1-\frac{1}{256\kappa}\right)^{K}\left(\frac{283}{256}\|z^{0}-z^{*}\|^{2}+c_{2}+256c_{3}\right).

It is then straightforward to see that for $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ we have $d_{K}\leq\epsilon$ , with the sample complexity given by

\displaystyle\sum\limits_{k=0}^{K-1}(t_{k}+t_{k+0.5})=2\sum\limits_{k=0}^{K-1}t_{k}=2K\cdot\frac{1-C_{1}^{-K}}{C_{1}(1-C_{1}^{-1})}=\frac{2K(C_{1}^{-K}-1)}{1-C_{1}}.

Using the more precise expression $K=\ln_{C_{1}^{-1}}\left(\frac{1}{\epsilon}\right)$ , we have $C_{1}^{-K}=\mathcal{O}\left(\frac{1}{\epsilon}\right)$ and $1-C_{1}=\frac{1}{256\kappa}$ , so the combined sample complexity is $\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right)$ .
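The two sums used in this proof can be checked numerically for the geometric batch-size schedule $t_{k}=K\cdot C_{1}^{-(k+1)}$ . The sketch below is a hedged illustration: the helper names are hypothetical, and the batch sizes are kept as real numbers, whereas an implementation would presumably round them up to integers.

```python
import math


def batch_schedule(K, kappa):
    """Return C_1 and the batch sizes t_k = K * C_1^{-(k+1)} for k = 0, ..., K-1."""
    c1 = 1 - 1 / (256 * kappa)
    return c1, [K * c1 ** (-(k + 1)) for k in range(K)]


def variance_weight(K, kappa):
    """sum_{i=1}^{K} t_{K-i}^{-1} C_1^{i-1}, which should collapse to C_1^K."""
    c1, t = batch_schedule(K, kappa)
    return sum(c1 ** (i - 1) / t[K - i] for i in range(1, K + 1))


def total_samples(K, kappa):
    """Direct sum of t_k + t_{k+0.5} = 2 t_k next to the closed form 2K(C_1^{-K}-1)/(1-C_1)."""
    c1, t = batch_schedule(K, kappa)
    return 2 * sum(t), 2 * K * (c1 ** (-K) - 1) / (1 - c1)


K, kappa = 40, 2
c1, _ = batch_schedule(K, kappa)
print(variance_weight(K, kappa), c1**K)
print(total_samples(K, kappa))
```

Because every term $t_{K-i}^{-1}C_{1}^{i-1}$ equals $C_{1}^{K}/K$ under this schedule, the accumulated variance decays at the same linear rate as the deterministic part of the bound.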

A.9 Proof of Lemma 4.5

The argument is similar to the proof in Appendix A.4, so we focus on the main differences between the two proofs.

First, by a derivation similar to (92), we have the following bound:

			$\displaystyle\tau^{2}\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k-1}_{\rho}(z^{k-1})\|^{2}$
		$\displaystyle\leq$	$\displaystyle\tau^{2}\left(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}}+L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}+L\|z^{k}-z^{k-1}\|\right)^{2}$
		$\displaystyle\leq$	$\displaystyle\tau^{2}\left(4(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}+4L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+2L^{2}\|z^{k}-z^{k-1}\|^{2}\right).$

By using the variance bound in (61), we will reach the next inequality that is similar to the step in (89):

	$\displaystyle\mathbb{E}\left[\left(\frac{1}{2}+\alpha\mu-\frac{\gamma}{2}\right)\|z^{k+1}-z^{*}\|^{2}+\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+1}_{\rho}(z^{k+1})\right)+\frac{1}{4}\|z^{k+1}-z^{k}\|^{2}\right]$	(94)
$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[\frac{1}{2}\|z^{k}-z^{*}\|^{2}+\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)+\left(2\tau^{2}L^{2}+\frac{\gamma}{2}\right)\|z^{k}-z^{k-1}\|^{2}\right]$
	$\displaystyle+16\tau^{2}\tilde{\sigma}^{2}\left(\frac{1}{t_{k}}+\frac{1}{t_{k-1}}\right)+4\tau^{2}L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})$
	$\displaystyle-\mathbb{E}\left[\alpha(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\right].$

Denote by

		$\displaystyle\xi^{k}_{[t_{k}]}=(\xi^{k}_{1},...,\xi^{k}_{t_{k}}),$	$\displaystyle w^{k}_{[t_{k}]}=(w^{k}_{1},...,w^{k}_{t_{k}})$
		$\displaystyle\xi^{[k]}=(\xi^{1}_{[t_{1}]},...,\xi^{k}_{[t_{k}]}),$	$\displaystyle w^{[k]}=(w^{1}_{[t_{1}]},...,w^{k}_{[t_{k}]})$

the collection of all random vectors at iteration $k$ and the collection of all such random vectors from iteration $1$ to $k$ , respectively. Since $z^{k+1}$ is a deterministic vector given $(\xi^{[k]},w^{[k]})$ , we have the following bound:

			$\displaystyle\mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\right]$
		$\displaystyle=$	$\displaystyle\mathbb{E}\left[\mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(\hat{F}^{k+1}_{\rho}(z^{k+1})-F(z^{k+1})\right)\,\Big|\,\xi^{[k]},w^{[k]}\right]\right]$
		$\displaystyle\overset{\eqref{batch-mean}}{=}$	$\displaystyle\mathbb{E}\left[(z^{k+1}-z^{*})^{\top}\left(F_{\rho}(z^{k+1})-F(z^{k+1})\right)\right]$
		$\displaystyle\geq$	$\displaystyle-\mathbb{E}\left[\|z^{k+1}-z^{*}\|\left\|F_{\rho}(z^{k+1})-F(z^{k+1})\right\|\right]$
		$\displaystyle\overset{\eqref{smooth-grad-bd}}{\geq}$	$\displaystyle-\mathbb{E}\left[\|z^{k+1}-z^{*}\|\cdot\frac{L\sqrt{\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2}}}{2}\right]$
		$\displaystyle\geq$	$\displaystyle-\mathbb{E}\left[\frac{\mu}{2}\|z^{k+1}-z^{*}\|^{2}+\frac{L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{8\mu}\right]$
		$\displaystyle=$	$\displaystyle-\frac{\mu}{2}d_{k+1}-\frac{L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})}{8\mu}.$

Substituting the above bound into (94), the desired result follows.

A.10 Proof of Proposition 4.6

Let us start from the potential function inequality (68) from iteration $K$ . With $\theta=\frac{1}{8}$ , let us also denote $C_{1}=\left(1+\frac{1}{8\kappa}\right)^{-1}=\left(1-\frac{1}{8\kappa+1}\right)$ , then $\rho_{x}=\frac{\sqrt{C_{1}^{K}}}{\sqrt{2}n\kappa},\rho_{y}=\frac{\sqrt{C_{1}^{K}}}{\sqrt{2}m\kappa}$ . Note that the $C_{1}$ defined here is only for this proof, not to be confused with that used in the proof in Appendix A.8. Then we have

	$\displaystyle V_{K}$	$\displaystyle\leq$	$\displaystyle C_{1}V_{K-1}+16C_{1}\tau^{2}\tilde{\sigma}^{2}\left(\frac{1}{t_{K-1}}+\frac{1}{t_{K-2}}\right)+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)\cdot C_{1}\cdot\frac{C_{1}^{K}}{\kappa^{2}}$
		$\displaystyle\leq$	$\displaystyle C_{1}^{K}V_{0}+48\tau^{2}\tilde{\sigma}^{2}\sum\limits_{k=0}^{K-1}\frac{C_{1}^{K-k}}{t_{k}}+L^{2}\left(4\tau^{2}+\frac{\alpha}{8\mu}\right)\cdot\sum\limits_{k=1}^{K}C_{1}^{k}\cdot\frac{C_{1}^{K}}{\kappa^{2}}.$

In the second inequality, we take $t_{k}=K\cdot C_{1}^{-k}$ and note that $\frac{1}{t_{k-1}}=\frac{1}{C_{1}t_{k}}\leq\frac{2}{t_{k}}$ . Then we have $\sum\limits_{k=0}^{K-1}\frac{C_{1}^{K-k}}{t_{k}}=C_{1}^{K}$ . In addition,

\sum\limits_{k=1}^{K}C_{1}^{k}=\frac{C_{1}(1-C_{1}^{K})}{1-C_{1}}\leq\frac{C_{1}}{1-C_{1}}=8\kappa.

Therefore, we have

\displaystyle V_{K}\leq C_{1}^{K}V_{0}+48\tau^{2}\tilde{\sigma}^{2}\cdot C_{1}^{K}+L^{2}\left(32\tau^{2}+\frac{\alpha}{\mu}\right)\frac{C_{1}^{K}}{\kappa}.

(95)

Now, let us lower bound $V_{k}$ by observing:

			$\displaystyle\tau(z^{k}-z^{*})^{\top}\left(\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\right)$
		$\displaystyle\geq$	$\displaystyle-\tau\|z^{k}-z^{*}\|\,\|\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\|$
		$\displaystyle\geq$	$\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\|\hat{F}^{k-1}_{\rho}(z^{k-1})-\hat{F}^{k}_{\rho}(z^{k})\|^{2}$
		$\displaystyle\overset{\eqref{Lip-szo-grad}}{\geq}$	$\displaystyle-\frac{1}{4}\|z^{k}-z^{*}\|^{2}-\tau^{2}\left(2L^{2}\|z^{k-1}-z^{k}\|^{2}+4(\varepsilon_{z^{k-1}}+\varepsilon_{z^{k}})^{2}+4L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})\right).$

Then we have

	$\displaystyle V_{K}$	$\displaystyle\geq$	$\displaystyle\frac{1}{4}d_{K}-16\tau^{2}\tilde{\sigma}^{2}\left(\frac{1}{t_{K}}+\frac{1}{t_{K-1}}\right)-\frac{4\tau^{2}L^{2}C_{1}^{K}}{\kappa^{2}}$
		$\displaystyle\geq$	$\displaystyle\frac{1}{4}d_{K}-48\tau^{2}\tilde{\sigma}^{2}C_{1}^{K}-\frac{4\tau^{2}L^{2}C_{1}^{K}}{\kappa^{2}}.$

Combining with (95), we have

d_{K}\leq C_{1}^{K}\cdot\left(4V_{0}+384\tau^{2}\tilde{\sigma}^{2}+L^{2}\left(\frac{128\tau^{2}}{\kappa}+\frac{16\tau^{2}}{\kappa^{2}}+\frac{4\alpha}{\mu\kappa}\right)\right)=C_{1}^{K}\mathcal{O}(1).

It follows immediately that for $K=\mathcal{O}\left(\kappa\ln\left(\frac{1}{\epsilon}\right)\right)$ we have $d_{K}\leq\epsilon$ . With a more precise expression $K=\ln_{C_{1}^{-1}}\left(\frac{1}{\epsilon}\right)$ , the sample complexity can be estimated:

\displaystyle\sum\limits_{k=0}^{K-1}t_{k}=K\cdot\frac{1-C_{1}^{-K}}{C_{1}(1-C_{1}^{-1})}=\frac{K(C_{1}^{-K}-1)}{1-C_{1}}\leq\frac{\kappa}{\epsilon(1-C_{1})}\ln\left(\frac{1}{\epsilon}\right)=\mathcal{O}\left(\frac{\kappa^{2}}{\epsilon}\ln\left(\frac{1}{\epsilon}\right)\right).
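The identities used for the schedule $t_{k}=K\cdot C_{1}^{-k}$ with $C_{1}=\left(1+\frac{1}{8\kappa}\right)^{-1}$ can be verified in exact rational arithmetic. The sketch below is only a check under stated assumptions: the helper names are hypothetical, $\kappa$ is taken to be an integer for convenience, and the batch sizes are left unrounded.

```python
from fractions import Fraction


def momentum_schedule(K, kappa):
    """Return C_1 = 8*kappa / (8*kappa + 1) and t_k = K * C_1^{-k} for k = 0, ..., K-1."""
    c1 = Fraction(8 * kappa, 8 * kappa + 1)
    return c1, [K * c1 ** (-k) for k in range(K)]


def schedule_identities(K, kappa):
    """Return the three sums appearing in the proof, computed exactly."""
    c1, t = momentum_schedule(K, kappa)
    variance_weight = sum(c1 ** (K - k) / t[k] for k in range(K))  # equals C_1^K
    geometric = sum(c1**k for k in range(1, K + 1))                # at most 8*kappa
    total = sum(t)                                                 # sample count
    return c1, variance_weight, geometric, total


K, kappa = 25, 3
c1, variance_weight, geometric, total = schedule_identities(K, kappa)
closed_form = K * (c1 ** (-K) - 1) / (1 - c1)
print(variance_weight == c1**K, geometric <= 8 * kappa, total <= closed_form)
```

In this check the exact total $\sum_{k=0}^{K-1}t_{k}$ is no larger than the closed-form expression $\frac{K(C_{1}^{-K}-1)}{1-C_{1}}$ , so that expression can safely be read as an upper estimate of the sample count.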

Appendix B Proof of the uniform sublinear convergence of the stochastic extra-point method

In order to establish a uniform sublinear convergence, we have to consider parameters that are diminishing with iteration number $k$ . Let us return to the one-iteration relation (10) and consider the following choice of parameters:

(\alpha^{(k)},\beta^{(k)},\gamma^{(k)},\eta^{(k)},\tau^{(k)})=\left(\frac{2}{(k+2)\mu},\frac{\alpha^{2}\mu^{2}}{128},\frac{\alpha^{2}\mu^{2}}{128},\frac{2}{(k+2)\mu},\frac{\alpha^{2}\mu}{128\kappa}\right),

where we omit the superscript $(k)$ of $\alpha$ on the RHS for notational simplicity. Note that here $\alpha=\alpha^{(k)}$ depends on the iteration $k$ ; we apply the same simplification to the other parameters throughout the rest of this appendix unless noted otherwise.

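By using the fact $2\alpha\leq\frac{1}{\mu}+\alpha^{2}\mu$ (equivalently, $2\alpha\mu-1\leq\alpha^{2}\mu^{2}$ , which follows from $(\alpha\mu-1)^{2}\geq 0$ ) and $\gamma=\beta$ , we have: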

			$\displaystyle\left(2\alpha^{2}L^{2}+2\|\gamma-\beta\|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\\|z^{k+0.5}-z^{k}\\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle\left(2\alpha^{2}L^{2}+2\gamma+\alpha^{2}\mu^{2}\right)\mathbb{E}\left[\\|z^{k+0.5}-z^{k}\\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle\alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)\mathbb{E}\left[\\|z^{k+0.5}-z^{k}\\|^{2}\right]$
		$\displaystyle\leq$	$\displaystyle\alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)D^{2},$

where in the last inequality we use the boundedness of the feasible set.

Therefore, we can rewrite (10) as:

			$\displaystyle(1-\tau L)d_{k+1}$
		$\displaystyle\leq$	$\displaystyle\left(1+4\gamma+4\tau L-\alpha\mu\right)d_{k}+\left(2\gamma+4\tau L\right)d_{k-1}$
			$\displaystyle+\alpha^{2}\left(2L^{2}+\frac{\mu^{2}}{64}+\mu^{2}\right)D^{2}+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D$
		$\displaystyle=$	$\displaystyle\left(1+4\gamma+4\tau L-\alpha\mu\right)d_{k}+\left(2\gamma+4\tau L\right)d_{k-1}$
			$\displaystyle+\frac{4}{(k+2)^{2}}\cdot\underbrace{\left(2\kappa^{2}D^{2}+\frac{D^{2}}{64}+D^{2}+\frac{8\sigma^{2}}{\mu^{2}}+\frac{\sigma^{2}}{16L^{2}}\right)}_{G}+\frac{4\delta D}{(k+2)\mu}.$

Substituting the parameters with their respective values in the rest of the terms:

			$\displaystyle\left(1-\frac{1}{32(k+2)}\right)d_{k+1}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{1}{32(k+2)^{2}}\right)d_{k+1}$
		$\displaystyle\leq$	$\displaystyle\left(1+\frac{1}{8(k+2)^{2}}+\frac{1}{8(k+2)^{2}}-\frac{2}{k+2}\right)d_{k}$
			$\displaystyle+\left(\frac{1}{16(k+2)^{2}}+\frac{1}{8(k+2)^{2}}\right)d_{k-1}+\frac{4G}{(k+2)^{2}}+\frac{4\delta D}{(k+2)\mu}$
		$\displaystyle\leq$	$\displaystyle\left(1-\frac{7}{4(k+2)}\right)d_{k}+\frac{3}{16(k+2)}d_{k-1}+\frac{4G}{(k+2)^{2}}+\frac{4\delta D}{(k+2)\mu}.$

Dividing both sides by $1-\frac{1}{32(k+2)}$ , and noting that $\left(1-\frac{1}{32(k+2)}\right)^{-1}\leq\frac{32}{31}$ , it follows that

$\displaystyle d_{k+1}$	$\displaystyle\leq$	$\displaystyle\frac{1-\frac{7}{4(k+2)}}{1-\frac{1}{32(k+2)}}d_{k}+\frac{6}{31(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}$
	$\displaystyle=$	$\displaystyle\left(1-\frac{\frac{55}{32(k+2)}}{1-\frac{1}{32(k+2)}}\right)d_{k}+\frac{6}{31(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}$
	$\displaystyle\leq$	$\displaystyle\left(1-\frac{55}{32(k+2)}\right)d_{k}+\frac{7}{32(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}.$
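The coefficient bookkeeping in the last two displays can be checked in exact rational arithmetic. The sketch below is only an illustration: the function name and the sample values $\mu=2$, $L=10$ are assumptions made for the example. It plugs the parameter choice of this appendix into the coefficients of $d_{k+1}$, $d_{k}$ and $d_{k-1}$ and confirms the simplified bounds $1-\frac{55}{32(k+2)}$ and $\frac{7}{32(k+2)}$.

```python
from fractions import Fraction


def one_step_coefficients(k, mu, L):
    """Exact coefficients of d_k and d_{k-1} after dividing by (1 - tau*L)."""
    mu, L = Fraction(mu), Fraction(L)
    kappa = L / mu
    # Parameter choice of this appendix (beta = gamma and eta = alpha).
    alpha = Fraction(2) / ((k + 2) * mu)
    gamma = alpha**2 * mu**2 / 128
    tau = alpha**2 * mu / (128 * kappa)

    lhs = 1 - tau * L                                # multiplies d_{k+1}
    c_k = 1 + 4 * gamma + 4 * tau * L - alpha * mu   # multiplies d_k
    c_km1 = 2 * gamma + 4 * tau * L                  # multiplies d_{k-1}
    return lhs, c_k / lhs, c_km1 / lhs


if __name__ == "__main__":
    for k in range(200):
        x = k + 2
        lhs, a_k, a_km1 = one_step_coefficients(k, mu=2, L=10)
        assert lhs == 1 - Fraction(1, 32 * x * x)
        assert a_k <= 1 - Fraction(55, 32 * x)
        assert a_km1 <= Fraction(7, 32 * x)
    print("coefficient bounds hold for k = 0, ..., 199")
```

Using `Fraction` avoids floating-point rounding, so the comparisons are exact for the values tested; this is a spot check for particular $\mu$ and $L$, not a substitute for the derivation.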

From the above one-iteration inequality, we shall claim the following inequality

d_{k}\leq\frac{Q}{k+2}+\frac{256\delta D}{93\mu},\quad\forall k\geq 0,

where

Q=\max\left\{\frac{133G}{9},2\|z^{0}-z^{*}\|^{2}\right\},

and we shall prove the inequality by induction. For $k=0$ , the inequality holds trivially

d_{0}=\|z^{0}-z^{*}\|^{2}\leq\frac{Q}{2}.

Assuming the inequality holds for all indices $0,\dots,k$ , we then have

$\displaystyle d_{k+1}$	$\displaystyle\leq$	$\displaystyle\left(1-\frac{55}{32(k+2)}\right)d_{k}+\frac{7}{32(k+2)}d_{k-1}+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}$
	$\displaystyle\leq$	$\displaystyle\left(1-\frac{55}{32(k+2)}\right)\left(\frac{Q}{k+2}+\frac{256\delta D}{93\mu}\right)+\frac{7}{32(k+2)}\left(\frac{Q}{k+1}+\frac{256\delta D}{93\mu}\right)$
		$\displaystyle+\frac{128G}{31(k+2)^{2}}+\frac{128\delta D}{31(k+2)\mu}$
	$\displaystyle\leq$	$\displaystyle\left(1-\frac{55}{32(k+2)}\right)\cdot\frac{Q}{k+2}+\frac{7}{32(k+2)}\cdot\frac{Q}{k+1}+\frac{128G}{31(k+2)^{2}}$
		$\displaystyle+\frac{256\delta D}{93\mu}-\frac{440\delta D}{93(k+2)\mu}+\frac{56\delta D}{93(k+2)\mu}+\frac{128\delta D}{31(k+2)\mu}$
	$\displaystyle=$	$\displaystyle\frac{Q}{k+2}-\frac{55Q}{32(k+2)^{2}}+\frac{7}{32(k+2)}\cdot\frac{Q}{k+1}+\frac{128G}{31(k+2)^{2}}+\frac{256\delta D}{93\mu}$
	$\displaystyle\leq$	$\displaystyle\frac{Q}{k+3}+\frac{Q}{(k+2)^{2}}-\frac{55Q}{32(k+2)^{2}}+\frac{7}{32(k+2)}\cdot\frac{2Q}{k+2}+\frac{133G}{32(k+2)^{2}}+\frac{256\delta D}{93\mu}$
	$\displaystyle=$	$\displaystyle\frac{Q}{k+3}-\frac{9Q}{32(k+2)^{2}}+\frac{133G}{32(k+2)^{2}}+\frac{256\delta D}{93\mu}$
	$\displaystyle\leq$	$\displaystyle\frac{Q}{k+3}+\frac{256\delta D}{93\mu}.$

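Note that the step introducing $\frac{Q}{k+3}$ uses the inequalities $\frac{1}{k+2}\leq\frac{1}{k+3}+\frac{1}{(k+2)^{2}}$ , $\frac{1}{k+1}\leq\frac{2}{k+2}$ and $\frac{128}{31}\leq\frac{133}{32}$ , and the final step uses $Q\geq\frac{133G}{9}$ . This completes the proof of the uniform $\mathcal{O}\left(\frac{1}{k}\right)$ convergence rate.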
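The induction can also be illustrated numerically. The sketch below iterates the one-iteration inequality with equality (the worst case it allows) and checks the claimed bound $d_{k}\leq\frac{Q}{k+2}+\frac{256\delta D}{93\mu}$ at every step. The function name, the test values, and the initialization $d_{-1}=d_{0}$ are assumptions made for illustration; they are not taken from the analysis above.

```python
def check_sublinear_bound(G, d0, delta, D, mu, num_iters=1000):
    """Iterate the worst-case recursion and test d_k <= Q/(k+2) + 256*delta*D/(93*mu)."""
    Q = max(133.0 * G / 9.0, 2.0 * d0)
    bias = 256.0 * delta * D / (93.0 * mu)
    # Assumption: d_{-1} = d_0 (for instance, z^{-1} = z^0).
    d_prev, d_curr = d0, d0
    for k in range(num_iters):
        bound_k = Q / (k + 2) + bias
        # Small relative tolerance to absorb floating-point rounding.
        if d_curr > bound_k * (1.0 + 1e-9):
            return False
        x = k + 2.0
        d_next = (
            (1.0 - 55.0 / (32.0 * x)) * d_curr
            + 7.0 / (32.0 * x) * d_prev
            + 128.0 * G / (31.0 * x * x)
            + 128.0 * delta * D / (31.0 * x * mu)
        )
        d_prev, d_curr = d_curr, d_next
    return True


if __name__ == "__main__":
    print(check_sublinear_bound(G=1.0, d0=1.0, delta=0.1, D=1.0, mu=1.0))
```

The constant $\frac{256\delta D}{93\mu}$ is exactly the fixed point of the $\delta$-dependent part of the recursion, since $\frac{55}{32}-\frac{7}{32}=\frac{3}{2}$ and $\frac{128}{31}\cdot\frac{2}{3}=\frac{256}{93}$, which is why the bias term does not decay with $k$.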

			$\displaystyle(1-4\|\gamma-\beta\|-\tau L)\\|z^{k+1}-z^{*}\\|^{2}$
		$\displaystyle\leq$	$\displaystyle\left(1+4\gamma+6\|\gamma-\beta\|+4\tau L\right)\\|z^{k}-z^{*}\\|^{2}$
			$\displaystyle+\left(2\|\gamma-\beta\|+2\gamma+4\tau L\right)\\|z^{k-1}-z^{*}\\|^{2}$
			$\displaystyle+\left(2\alpha^{2}L^{2}+2\|\gamma-\beta\|+2\gamma-1\right)\\|z^{k+0.5}-z^{k}\\|^{2}$
			$\displaystyle+2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+\frac{2\tau}{L}(\varepsilon_{z^{k}}+\varepsilon_{z^{k-1}})^{2}$
			$\displaystyle-2\alpha(z^{k+0.5}-z^{*})^{\top}\hat{F}(z^{k+0.5})+2(\eta-\alpha)(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k}).$

			$\displaystyle(1-4\|\gamma-\beta\|-\tau L)d_{k+1}$
		$\displaystyle\leq$	$\displaystyle\left(1+4\gamma+6\|\gamma-\beta\|+4\tau L-\alpha\mu\right)d_{k}$
			$\displaystyle+\left(2\|\gamma-\beta\|+2\gamma+4\tau L\right)d_{k-1}$
			$\displaystyle+\left(2\alpha^{2}L^{2}+2\|\gamma-\beta\|+2\gamma+2\alpha\mu-1\right)\mathbb{E}\left[\\|z^{k+0.5}-z^{k}\\|^{2}\right]$
			$\displaystyle+8\left(\alpha^{2}+\frac{\tau}{L}\right)\sigma^{2}+2\alpha\delta D$
			$\displaystyle+2(\eta-\alpha)\mathbb{E}\left[(z^{k+1}-z^{k+0.5})^{\top}\hat{F}(z^{k})\right].\quad$

			$\displaystyle\\|z^{k+1}-z^{*}\\|^{2}$
		$\displaystyle=$	$\displaystyle\\|P_{\mathcal{Z}}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right)-P_{\mathcal{Z}}\left(z^{*}-\alpha F(z^{*})\right)\\|^{2}$
		$\displaystyle\leq$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left(z^{k}-\alpha\hat{F}(z^{k})+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)-\left(z^{*}-\alpha F(z^{*})\right)\right)$
		$\displaystyle=$	$\displaystyle(z^{k+1}-z^{*})^{\top}\left((z^{k}-z^{*})-\alpha\left(\hat{F}(z^{k})-F(z^{*})\right)+\gamma(z^{k}-z^{k-1})-\tau\left(\hat{F}(z^{k})-\hat{F}(z^{k-1})\right)\right).$

			$\displaystyle\\|F(z)-F(z^{\prime})\\|^{2}$
		$\displaystyle=$	$\displaystyle\\|\nabla_{x}f(x,y)-\nabla_{x}f(x^{\prime},y^{\prime})\\|^{2}+\\|\nabla_{y}f(x,y)-\nabla_{y}f(x^{\prime},y^{\prime})\\|^{2}$
		$\displaystyle\leq$	$\displaystyle 2L_{x}^{2}\\|x-x^{\prime}\\|^{2}+2L_{xy}^{2}\\|y-y^{\prime}\\|^{2}+2L_{y}^{2}\\|y-y^{\prime}\\|^{2}+2L_{xy}^{2}\\|x-x^{\prime}\\|^{2}$
		$\displaystyle\leq$	$\displaystyle L^{2}(\\|x-x^{\prime}\\|^{2}+\\|y-y^{\prime}\\|^{2})=L^{2}\\|z-z^{\prime}\\|^{2}.$

			$\displaystyle\alpha\\|z^{k+1}-z^{k+0.5}\\|\\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{2}\left(\\|z^{k+1}-z^{k+0.5}\\|^{2}+\alpha^{2}\\|\hat{F}^{k}_{\rho}(z^{k})-\hat{F}^{k+0.5}_{\rho}(z^{k+0.5})\\|^{2}\right)$
		$\displaystyle\leq$	$\displaystyle\frac{1}{2}\\|z^{k+1}-z^{k+0.5}\\|^{2}+2\alpha^{2}(\varepsilon_{z^{k}}+\varepsilon_{z^{k+0.5}})^{2}+2\alpha^{2}L^{2}(\rho_{x}^{2}n^{2}+\rho_{y}^{2}m^{2})+\alpha^{2}L^{2}\\|z^{k}-z^{k+0.5}\\|^{2}.$