New First-Order Algorithms for Stochastic Variational Inequalities
Abstract
In this paper, we propose two new solution schemes for stochastic strongly monotone variational inequality (VI) problems: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first is a general scheme based on updating the iterative sequence and an auxiliary extra-point sequence; in the case of the deterministic VI model, this approach includes several state-of-the-art first-order methods as special cases. The second scheme combines two momentum-based directions, the so-called heavy-ball direction and the optimism direction, and requires only one projection per iteration in its updating process. We show that, if the variance of the stochastic oracle is appropriately controlled, then both schemes can be made to achieve the optimal iteration complexity of $\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ to reach an $\epsilon$-solution for a strongly monotone VI problem with condition number $\kappa$. We also show that these methods can be readily incorporated into a zeroth-order approach to solve stochastic minimax saddle-point problems, where only noisy and biased samples of the objective can be obtained, and we derive the corresponding total sample complexity.
Keywords: variational inequality, minimax saddle-point, stochastic first-order method, zeroth-order method.
1 Introduction
Given a constraint set $\mathcal{X} \subseteq \mathbb{R}^n$ and a mapping $F: \mathbb{R}^n \to \mathbb{R}^n$, the classical variational inequality (VI) problem is to find $x^* \in \mathcal{X}$ such that

$$F(x^*)^\top (x - x^*) \ge 0, \qquad \forall\, x \in \mathcal{X}. \tag{1}$$
For an introduction to VI and its applications, we refer the readers to Facchinei and Pang [4] and the references therein.
In this paper, we consider a stochastic version of problem (1), where the exact evaluation of the mapping $F$ is inaccessible; instead, only a stochastic oracle is available. The stochasticity in question may stem from, e.g., the non-deterministic nature of mixed strategies of the players in a game setting, or simply from the difficulty of evaluating the mapping itself. The latter has become more pronounced in the literature due to the recently found role of VI as a training/learning subproblem in machine learning and statistical learning. The so-called stochastic oracle is a noisy estimate of the mapping $F$, and an iterative scheme that incorporates such an oracle is known as stochastic approximation (SA). As far as we know, the first proposal to use such an approach for stochastic optimization can be traced back to the seminal work of Robbins and Monro [31]. In 2008, Jiang and Xu [10] followed the SA approach to solve VI models. Since then, efforts have been made to extend existing deterministic methods to stochastic VI models; see e.g. [11, 41, 12, 15, 9, 7, 8].
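To fix ideas, a minimal SA iteration in the spirit of the projection method is a projected step along a noisy sample of the mapping with diminishing step sizes. The sketch below is illustrative only: the toy mapping, the noise level, and the step-size rule are assumptions for the example, not taken from this paper.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def stochastic_projection_vi(oracle, x0, steps=2000, seed=0):
    """Basic SA iteration: x_{t+1} = P(x_t - gamma_t * F(x_t; xi_t)),
    with diminishing step sizes gamma_t = 1/(t+1)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(steps):
        x = project_ball(x - (1.0 / (t + 1)) * oracle(x, rng))
    return x

# Toy strongly monotone mapping F(x) = x - b (unique solution x* = b),
# observed through additive Gaussian noise.
b = np.array([0.3, -0.2])
oracle = lambda x, rng: (x - b) + 0.1 * rng.standard_normal(2)
x_hat = stochastic_projection_vi(oracle, np.zeros(2))
```

With strong monotonicity, the noise is averaged out by the diminishing steps and the iterates concentrate around the solution.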
Let us start our discussion by introducing the assumptions made throughout the paper. We consider the VI model (1) where $\mathcal{X}$ is a closed convex set. Moreover, the following two conditions are assumed:
$$(F(x) - F(y))^\top (x - y) \ge \mu \|x - y\|^2, \qquad \forall\, x, y \in \mathcal{X}, \tag{2}$$

for some $\mu \ge 0$, and

$$\|F(x) - F(y)\| \le L \|x - y\|, \qquad \forall\, x, y \in \mathcal{X}, \tag{3}$$

for some $L > 0$. Condition (2) with $\mu > 0$ is known as the strong monotonicity of $F$, while Condition (3) is known as the Lipschitz continuity of $F$. If Condition (2) is met with $\mu = 0$, then $F$ is known as monotone. VI problems satisfying (2) with positive $\mu$ can easily be shown to have a unique solution $x^*$. Let us denote $\kappa := L/\mu$. The parameter $\kappa$ is usually known as the condition number of the VI model (1). We also assume
$$D := \max_{x, y \in \mathcal{X}} \|x - y\| < +\infty, \tag{4}$$

namely, the constraint set $\mathcal{X}$ is bounded. We remark that this assumption can actually be removed without affecting the results, but the analysis then becomes lengthier and more tedious without conceivable conceptual benefit, so we shall not pursue that generality in this paper.
The stochastic oracle of the mapping, denoted by $F(x; \xi)$, takes a random sample $\xi$ from some sample space $\Xi$. The oracle is required to satisfy

$$\left\| \mathbb{E}_\xi\left[ F(x; \xi) \right] - F(x) \right\| \le \delta, \tag{5}$$

$$\mathbb{E}_\xi\left[ \left\| F(x; \xi) - \mathbb{E}_\xi\left[ F(x; \xi) \right] \right\|^2 \right] \le \sigma^2, \tag{6}$$

for all $x \in \mathcal{X}$, where $\delta, \sigma > 0$ are some constants. In other words, we assume both the bias and the deviation of the oracle are uniformly upper-bounded.
In this paper, we propose two stochastic first-order schemes: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme maintains two sequences of iterates and features several well-known first-order search directions, such as the extra-gradient [13, 36], the heavy-ball [29], Nesterov’s extrapolation [23, 25], and the optimism direction [30, 20]. The second scheme, on the other hand, specifically combines the heavy-ball momentum and the optimism momentum in its updating formula, and maintains only one sequence throughout the iterations, therefore requiring only one projection per iteration. The two approaches require different types of analysis. Both schemes render a wider range of search directions than the existing first-order methods, and the parameters associated with each search direction could and should be tuned differently from problem class to problem class in order to secure good practical performance. The deterministic counterparts of these methods can be found in our previous work [6]. In the stochastic context, we show that as long as the variance can be reduced throughout the iterations, they yield the optimal iteration complexity (cf. [42]) of $\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ to reach an $\epsilon$-solution, with an additional bias term depending on $\delta$. In a later section, we demonstrate an application to the stochastic black-box minimax saddle-point problem, where only noisy function values are accessible. This application is particularly relevant in machine learning, where the training data set may be very large and evaluating the exact gradient/function value is usually impractical. Through a smoothing technique, we propose a stochastic zeroth-order gradient as the update direction in either the stochastic extra-point scheme or the stochastic extra-momentum scheme. We show that both approaches retain the optimal iteration complexity and derive the corresponding total sample complexity.
The rest of the paper is organized as follows. In Section 2, we survey the relevant literature with a focus on stochastic VI. In Section 3, we present the main results in this paper, i.e. the convergence results of the two proposed methods, while the technical proofs are relegated to the appendices. In Section 4, we introduce a stochastic black-box saddle-point problem and present the sample complexity results of our methods. We present some promising preliminary numerical results in Section 5 and conclude the paper in Section 6.
2 Literature Review
The first-order algorithms for deterministic VI (1) serve as a basis for the developments of their stochastic counterparts. These algorithms include the projection method [4], the proximal method [18, 32, 36], the extra-gradient method [13, 36], the optimistic gradient descent ascent (OGDA) method [30, 20, 21], the mirror-prox method [22], the extrapolation method [26, 24], and the extra-point method [6].
In this section, we shall focus on the development of algorithms for stochastic VI, starting with the paper of Jiang and Xu [10], where the authors propose a stochastic projection method for solving strongly monotone and Lipschitz continuous VI problems and present an almost-sure convergence result. Koshal et al. [14] propose an iterative Tikhonov regularization method and an iterative proximal point method, and show almost-sure convergence for monotone and Lipschitz continuous VI problems; both methods solve a strongly monotone VI subproblem at each iteration. Yousefian et al. [39] further introduce a local smoothing technique into the above-mentioned regularized methods to account for non-Lipschitz mappings and show almost-sure convergence. A survey of these methods, as well as the applications and theory behind stochastic VI, can be found in Shanbhag [35].
Juditsky et al. [11] are among the first to establish an iteration complexity bound for stochastic VI algorithms. They extend the mirror-prox method [22] to the stochastic setting and prove an optimal iteration complexity bound for monotone VI: $\mathcal{O}\left(1/\epsilon^2\right)$, which improves to $\mathcal{O}\left(1/\epsilon\right)$ when the variance can be controlled to be small enough. Yousefian et al. [41] further extend the stochastic mirror-prox method with a more general step-size choice and establish a corresponding iteration complexity; they also provide a complexity bound for the stochastic extra-gradient method for solving strongly monotone VI problems. Yousefian et al. [40] use a randomized smoothing technique for non-Lipschitz mappings and establish an iteration complexity bound as well. Chen et al. [3] consider a specific class of VI models, where the mapping consists of a Lipschitz continuous and monotone operator, the Lipschitz continuous gradient mapping of a convex function, and the subgradient mapping of a simple convex function. They propose a method that combines Nesterov’s acceleration [25] with the stochastic mirror-prox method to exploit this special structure, resulting in an optimal iteration complexity for this class of problems, with improved bounds when the variance can be controlled to be small enough, or when the operator consists only of gradient/subgradient mappings of convex functions. Kannan and Shanbhag [12] analyze a general variant of the extra-gradient method (which uses general distance-generating functions) and show that, under slightly weaker assumptions than strong monotonicity, the optimal iteration bound still holds. Kotsalis et al. [15] extend the OGDA method to strongly monotone stochastic VI, with an iteration complexity that reduces to the optimal $\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ when the variance can be controlled to be small enough.
We shall note that the detailed implementation of variance reduction is in general not considered in the above-mentioned methods (although some do present an additional complexity term for when the variance is small, such as in [11, 3]). Therefore, the optimal iteration complexity bound is $\mathcal{O}\left(1/\epsilon^2\right)$ for monotone VI and $\mathcal{O}\left(1/\epsilon\right)$ for strongly monotone VI, as compared to $\mathcal{O}\left(1/\epsilon\right)$ and $\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ for their deterministic counterparts. By increasing the sample size (aka the mini-batch) in each iteration, the variance can be reduced as the algorithm progresses, thereby attaining the same optimal iteration complexity bounds as in the deterministic case.
There have been developments in variance-reduction-based methods in recent years. Jalilzadeh and Shanbhag [9] extend the method of [26] for deterministic strongly monotone VI to stochastic VI and show that, with variance reduction, the optimal iteration complexity can be achieved, together with a bound on the total sample complexity. With this method as a subroutine, they also propose a variance-reduced proximal point method with corresponding iteration and sample complexity bounds. Iusem et al. [7] propose a variance-reduced extra-gradient-based method for monotone VI and establish iteration and sample complexity bounds; they further extend the method in [8] by incorporating a line search for unknown Lipschitz constants, while preserving similar bounds. Palaniappan and Bach [28] propose variance-reduced stochastic forward-backward methods based on (accelerated) stochastic gradient descent methods in optimization and show linear convergence. For another line of work, which incorporates the concept of differential privacy into stochastic VI, we refer the readers to a recent paper [2] and the references therein. The stochastic oracle may also be man-made: for instance, the technique of randomized smoothing has been applied in so-called zeroth-order methods (i.e., derivative-free methods); we refer to [27, 34] or the survey [16] in the context of optimization, and to [37, 38, 17, 33, 19] in the context of minimax saddle-point problems.
3 The Stochastic First-Order Methods for Strongly Monotone VI
Let us start this section by introducing notations to facilitate our analysis. We shall denote the stochastic oracle as $F(x; \xi)$, suppressing the random sample $\xi$ in the notation whenever it is clear from the context; for example, $F(x_t; \xi_t)$ is associated with the random sample $\xi_t$. In addition, we denote by $P_{\mathcal{X}}(\cdot)$ the projection operator onto the feasible set $\mathcal{X}$.
3.1 The stochastic extra-point scheme
We first present the iterative updating rule for the stochastic extra-point scheme:
(9) |
for $t = 0, 1, 2, \ldots$, where one sequence consists of the iterates and the other consists of the auxiliary extra points, which help to produce the sequence of iterates.
In the case of deterministic strongly monotone VI, we introduced in our previous work [6] a unifying extra-point updating scheme, which includes specific first-order search directions such as the extra-gradient, the heavy-ball method, the optimistic method, and Nesterov’s extrapolation; these are incorporated through the scheme’s parameters. As any specific configuration of these parameters should be tailored to the problem structure at hand, our goal is to provide conditions on the parameters under which an optimal iteration complexity can be guaranteed. This line of analysis will now be extended to solve stochastic VI as given in (9). We shall first establish the relational inequalities between subsequent iterates in terms of the expected squared distance to the unique solution $x^*$, denoted by $\mathbb{E}\|x_t - x^*\|^2$.
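Among the special cases just listed, the extra-gradient direction admits a particularly simple stochastic instantiation. The sketch below implements a plain stochastic extra-gradient iteration, not the full scheme (9); the step size, box constraint, and toy mapping are illustrative assumptions.

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    """Projection onto the box [lo, hi]^n."""
    return np.clip(x, lo, hi)

def stochastic_extragradient(oracle, x0, eta, steps, seed=0):
    """Two projections per iteration:
       x_half  = P(x_t - eta * F(x_t; xi)),
       x_{t+1} = P(x_t - eta * F(x_half; xi'))."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x_half = project_box(x - eta * oracle(x, rng))
        x = project_box(x - eta * oracle(x_half, rng))
    return x

# Strongly monotone linear mapping F(z) = A z with a rotational part;
# the unique solution is z* = 0.
mu = 0.5
A = np.array([[mu, 1.0], [-1.0, mu]])
oracle = lambda z, rng: A @ z + 0.01 * rng.standard_normal(2)
z_hat = stochastic_extragradient(oracle, np.array([0.9, -0.8]),
                                 eta=0.2, steps=500, seed=1)
```

The rotational component of $A$ is what defeats plain gradient steps and makes the extra evaluation at the midpoint useful.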
Lemma 3.1.
For the sequences of iterates and extra points generated by the stochastic extra-point scheme (9), the following inequality holds:
(10) | |||||
Proof.
See Appendix A.1. ∎
Lemma 3.1 forms the basis for the desired linear convergence, and it is possible to identify conditions on the parameters in order to achieve linear convergence. Consider parameters satisfying
(13) |
and denote
(17) |
Then we obtain from (10) that
(18) |
With additional constraints on the parameters, the variance-reduced convergence result is summarized in the next theorem.
Theorem 3.2.
Proof.
See Appendix A.2. ∎
Regarding Theorem 3.2, a few remarks are in order. First, as we remarked earlier, the boundedness condition (4) can be removed; however, the analysis would become much longer and more cumbersome, so we keep the condition here for simplicity. Second, a common way to achieve variance reduction is by increasing the mini-batch sample sizes. In fact, we may either fix the sample size at a sufficiently large value from the start, or let it increase at a fixed rate as the iteration count increases. We shall discuss this strategy further in Section 4. Finally, we note that without variance reduction, it is possible to adopt diminishing step sizes instead of the fixed step sizes we have assumed so far. The optimal uniform sublinear convergence rate can be established through a separate analysis continued from Lemma 3.1; the details can be found in Appendix B.
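As a quick numerical illustration of the mini-batch strategy mentioned above, averaging $m$ oracle samples reduces the variance of the resulting direction by a factor of $m$. The noise model below is a hypothetical Gaussian oracle chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                      # assumed per-sample variance of the oracle noise

def batch_mean_var(m, trials=20000):
    """Empirical variance of a mini-batch average of m oracle samples."""
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
    return samples.mean(axis=1).var()

v1, v16 = batch_mean_var(1), batch_mean_var(16)   # expect about sigma2 and sigma2/16
```

Increasing the batch size geometrically with the iteration count therefore drives the stochastic error down at the same rate as the deterministic contraction, which is the mechanism behind the variance-reduced bounds.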
The next proposition concludes this subsection with a specific choice of the parameters.
Proposition 3.3.
Proof.
See Appendix A.3. ∎
3.2 The stochastic extra-momentum scheme
In this subsection, we present an alternative stochastic first-order method that achieves the optimal iteration complexity as well, the stochastic extra-momentum scheme:
(23) |
for $t = 1, 2, \ldots$.
Compared with the stochastic extra-point scheme (9), the above update (23) manipulates only the momentum terms alongside the stochastic gradient direction (the notion of “gradient” here refers to the mapping $F$ in the VI model), namely the heavy-ball direction $x_t - x_{t-1}$ and the optimism direction $F(x_t; \xi_t) - F(x_{t-1}; \xi_{t-1})$. Since it maintains a single sequence throughout the iterations, this scheme requires only one projection per iteration, as compared to two projections in the case of the stochastic extra-point scheme. We shall remark that the method proposed by Kotsalis et al. [15] only considers the optimism term; therefore, the stochastic extra-momentum scheme introduced above may be viewed as a generalization.
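To make the single-projection structure concrete, the following sketch shows one plausible update combining the two momentum terms. The coefficients and their ranges are illustrative only; the precise update and the admissible parameter conditions are those given in (23) and (25).

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

def stochastic_extra_momentum(oracle, x0, eta, alpha, lam, steps, seed=0):
    """Schematic single-projection update:
       x_{t+1} = P(x_t + alpha*(x_t - x_{t-1})        # heavy-ball direction
                       - eta*F(x_t) - lam*(F(x_t) - F(x_{t-1})))  # optimism direction
    using stochastic evaluations of F; one projection per iteration."""
    rng = np.random.default_rng(seed)
    x_prev = x0.copy()
    g_prev = oracle(x_prev, rng)
    x = project_box(x_prev - eta * g_prev)
    for _ in range(steps):
        g = oracle(x, rng)
        x_new = project_box(x + alpha * (x - x_prev)
                            - eta * g - lam * (g - g_prev))
        x_prev, g_prev, x = x, g, x_new
    return x

mu = 0.5
A = np.array([[mu, 1.0], [-1.0, mu]])   # strongly monotone linear mapping, z* = 0
oracle = lambda z, rng: A @ z + 0.01 * rng.standard_normal(2)
z_hat = stochastic_extra_momentum(oracle, np.array([0.8, 0.6]),
                                  eta=0.2, alpha=0.1, lam=0.2, steps=800, seed=2)
```

Note that only stored past information ($x_{t-1}$ and the previous oracle call) is reused; no second projection or second fresh evaluation per iteration is needed.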
As in the previous subsection, we shall first establish a relational inequality between the iterates. As we can see from the lemma below, the structure of this relational inequality is in fact quite different from the previous case. The detailed proof can be found in the appendix.
Lemma 3.4.
For the sequence generated from the stochastic extra-momentum scheme (23), the following inequality holds:
(24) | |||||
Proof.
See Appendix A.4. ∎
Observe that each of the terms on the LHS of (24) differs in the iteration index from the RHS exactly by one. This property enables us to design a possible potential function that measures the convergence of the iterative process. We shall specify additional conditions on the non-negative parameters in order to further simplify (24):
(25) |
where the common ratio is some constant independent of the iteration count $t$. Note that the LHS of each inequality in (25) is the ratio between the coefficients on the LHS and RHS of (24) for each corresponding term. Therefore, relation (24) can be rearranged as:
(26) | |||||
Now, by defining the potential function as
inequality (26) can be rewritten as
(27) |
This leads to our final results, as summarized in the next theorem:
Theorem 3.5.
Proof.
See Appendix A.5. ∎
A simple choice of parameters leads to:
Proposition 3.6.
Proof.
It follows by substituting the parameter choice into (28). ∎
This shows that if we run the stochastic extra-momentum scheme (23) with the above parameter choice, then within $\mathcal{O}\left(\kappa\ln\frac{1}{\epsilon}\right)$ iterations we will reach a solution satisfying the stated accuracy guarantee.
4 A Stochastic Zeroth-Order Approach to Saddle-Point Problems
In this section, we shall apply the proposed stochastic extra-point/extra-momentum schemes to solve the following saddle-point problem without needing to compute the gradients of $f$:
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y), \tag{30}$$
where $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ are convex sets, $f(x, y)$ is strongly convex in $x$ (for any fixed $y$) and strongly concave in $y$ (for any fixed $x$) with modulus $\mu$, and the partial gradients $\nabla_x f$/$\nabla_y f$ are Lipschitz continuous in $x$ for fixed $y$ and in $y$ for fixed $x$; we let $L$ denote a common Lipschitz constant for these partial gradients. We further assume that the function $f$ is Lipschitz continuous in either variable (with the other fixed) with constant $M$, which implies that the norms of the partial gradients are bounded by $M$.
In particular, we consider the setting where the partial gradients $\nabla_x f$ and $\nabla_y f$ (and any higher-order information) are not available. Furthermore, the exact evaluation of the function value itself is also not available; instead, we can only access a stochastic oracle $f(x, y; \xi)$, which satisfies the following assumption:
(40) |
Now, we shall use the so-called smoothing technique to approximate the first-order information, which then enables us to apply the proposed stochastic methods for VI, a class that includes the saddle-point model as a special case. In particular, we use a randomized smoothing scheme based on uniform distributions over the unit Euclidean balls in the $x$- and $y$-spaces, respectively. The smoothing functions, with parameters $\nu_1, \nu_2 > 0$, are defined as

$$f_{\nu_1}(x, y) := \frac{1}{V_{d_x}} \int_{\|u\| \le 1} f(x + \nu_1 u, y)\, \mathrm{d}u, \qquad f_{\nu_2}(x, y) := \frac{1}{V_{d_y}} \int_{\|v\| \le 1} f(x, y + \nu_2 v)\, \mathrm{d}v,$$

where $V_{d_x}$/$V_{d_y}$ is the volume of the unit ball in $\mathbb{R}^{d_x}$/$\mathbb{R}^{d_y}$.
Let us summarize the main properties of the smoothing functions below:
Lemma 4.1.
Let $v$ be a random vector uniformly distributed on the unit sphere (in the space of the relevant block of variables). Then,
1. The smoothing functions are continuously differentiable, and their partial gradients can be expressed as

$$\nabla_x f_{\nu_1}(x, y) = \frac{d_x}{\nu_1} \mathbb{E}_v\left[ f(x + \nu_1 v, y)\, v \right], \qquad \nabla_y f_{\nu_2}(x, y) = \frac{d_y}{\nu_2} \mathbb{E}_v\left[ f(x, y + \nu_2 v)\, v \right],$$

where each expectation can equivalently be written as a surface integral normalized by the surface area of the corresponding unit sphere.
2. For any $(x, y)$ and any $\nu_1, \nu_2 > 0$, we have:
(44) (48)
Proof.
We are now ready to define the stochastic zeroth-order gradient as follows:
(51) |
where $u$ and $v$ are uniformly distributed random vectors over the unit spheres in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$, respectively.
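As a sanity check on estimators of this type, the sketch below builds a sphere-sampling gradient estimator for a simple quadratic and verifies that its mini-batch average approximates the true gradient. It uses the difference form $f(x + \nu v) - f(x)$, which has the same mean as the one-point form (since $\mathbb{E}[v] = 0$) but a much smaller variance; the test function and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_grad(f, x, nu, m, rng):
    """Average of (d/nu) * (f(x + nu*v) - f(x)) * v over m uniform
    unit-sphere directions v (a zeroth-order gradient estimate)."""
    d = x.size
    V = rng.standard_normal((m, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # uniform on the unit sphere
    fx = f(x)
    diffs = np.array([f(x + nu * v) - fx for v in V])
    return (d / nu) * (diffs[:, None] * V).mean(axis=0)

f = lambda z: 0.5 * np.dot(z, z)        # true gradient at z is z itself
x0 = np.array([1.0, -2.0, 0.5])
g_hat = zo_grad(f, x0, nu=1e-3, m=50000, rng=rng)
```

For this quadratic, the smoothed gradient coincides with the exact gradient, so the only error left is the Monte Carlo noise, which shrinks as the batch size grows.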
The next lemma shows that such stochastic zeroth-order gradients are unbiased with respect to the gradients of the smoothing functions and have uniformly bounded variance.
Lemma 4.2.
The stochastic zeroth-order gradients defined in (51) are unbiased with respect to the gradients of the smoothing functions and have bounded variance for all $(x, y)$:
(55) |
and
(59) |
where
Proof.
See Appendix A.6. ∎
Before applying the stochastic extra-point/extra-momentum schemes to solve (30), let us first introduce the connection between the two models. As we regard the saddle-point model as a special case of VI, we shall treat the variables $(x, y)$ of the saddle-point problem as one variable $z := (x, y)$. Additionally, we define:
These terms correspond to the gradient of $f$, the gradients of the smoothing functions, and the stochastic zeroth-order gradient, respectively. Note that we have flipped the sign of the partial gradient corresponding to $y$ to account for the concavity of $f$ with respect to $y$.
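The sign flip is exactly what makes the stacked mapping monotone. A small numerical check on a strongly-convex-strongly-concave quadratic (the coupling matrix and dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, nx, ny = 0.5, 3, 2
B = rng.standard_normal((nx, ny))       # illustrative coupling matrix

def F(z):
    """VI mapping for f(x,y) = (mu/2)||x||^2 + x^T B y - (mu/2)||y||^2:
    stack grad_x f with minus grad_y f."""
    x, y = z[:nx], z[nx:]
    return np.concatenate([mu * x + B @ y, mu * y - B.T @ x])

z1, z2 = rng.standard_normal(nx + ny), rng.standard_normal(nx + ny)
gap = np.dot(F(z1) - F(z2), z1 - z2)    # the skew (bilinear) part cancels exactly
```

Here the monotonicity gap equals $\mu\|z_1 - z_2\|^2$ exactly, because the bilinear coupling contributes a skew-symmetric part that vanishes in the inner product.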
Finally, as we shall use a sample size of $m_t$ (a natural number) at iteration $t$, we reserve the subscript $i$ on the random vectors for the sample index, and denote:
In the above definition we suppress the notation of the random vectors on the LHS for cleaner presentation. Note that by the law of large numbers, together with (55)-(59), we have
(60) | |||
(61) |
4.1 Sample complexity analysis: stochastic zeroth-order extra-point method
Recall that our objective is
With only noisy function values accessible, we propose the stochastic zeroth-order extra-point method, which updates both blocks of variables simultaneously with the following update rule:
(64) |
Comparing the above update with its original variant (9) for solving stochastic VI, the update directions are replaced by averaged stochastic zeroth-order gradients, with one sample batch taken at the iterate and one at the extra point. This circumvents the inaccessible first-order information and equips us with the appropriate tools to reduce the variance and achieve overall linear convergence.
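Putting the pieces together on a toy one-dimensional saddle-point problem: the sketch below runs the extra-gradient special case with zeroth-order directions built from difference quotients along random unit directions (in one dimension, the unit sphere is $\{\pm 1\}$). The objective, batch size, and step size are illustrative assumptions, standing in for the full update (64).

```python
import numpy as np

rng = np.random.default_rng(0)
mu_f, r = 0.5, 1e-4          # modulus of convexity/concavity; smoothing radius
f = lambda x, y: 0.5 * mu_f * x**2 + x * y - 0.5 * mu_f * y**2   # saddle at (0, 0)

def zo_vi_grad(x, y, batch, rng):
    """Mini-batch zeroth-order estimate of (grad_x f, -grad_y f)."""
    gx = gy = 0.0
    for _ in range(batch):
        u = rng.choice([-1.0, 1.0])
        v = rng.choice([-1.0, 1.0])
        gx += (f(x + r * u, y) - f(x, y)) / r * u
        gy += (f(x, y + r * v) - f(x, y)) / r * v
    return gx / batch, -(gy / batch)     # sign flip on the y-block

def project(t, lo=-1.0, hi=1.0):
    return min(max(t, lo), hi)

x, y, eta = 0.8, -0.6, 0.3
for t in range(400):
    gx, gy = zo_vi_grad(x, y, batch=50, rng=rng)         # batch at the iterate
    xh, yh = project(x - eta * gx), project(y - eta * gy)
    gx, gy = zo_vi_grad(xh, yh, batch=50, rng=rng)       # batch at the extra point
    x, y = project(x - eta * gx), project(y - eta * gy)
```

The iterates approach the saddle point $(0, 0)$ up to an error of the order of the smoothing radius.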
The next lemma establishes the relational inequality between subsequent iterates in terms of the expected distance to the solution, similar to what we did in Section 3.1; the differences lie in the corresponding stochastic error terms shown below. Note that in each iteration we take two batches of samples, one at the iterate and one at the extra point. The batch size from the previous iteration also appears because the previous iterate is used in each iteration.
Lemma 4.3.
For the sequences of iterates and extra points generated by the stochastic zeroth-order extra-point method (64), the following inequality holds:
(65) | |||||
Proof.
See Appendix A.7. ∎
With the relational inequality in Lemma 4.3, we shall adopt the same conditions (13), (17), and (21) on the parameters. Therefore, the results in Theorem 3.2 are directly applicable. In addition, we are now equipped with the variable sample sizes to control the variance terms, as well as the smoothing parameters to control the bias terms.
We shall utilize the example in Proposition 3.3 to analyze the sample complexity of the proposed method. The result is provided in the next proposition:
Proposition 4.4 (Sample complexity result 1).
The stochastic zeroth-order extra-point method (64) with the following parameter choice:
and
where $T$ is the iteration count decided in advance, outputs an $\epsilon$-solution, and the total sample complexity of the procedure is
Proof.
See Appendix A.8. ∎
4.2 Sample complexity analysis: stochastic zeroth-order extra-momentum method
Next, we consider the stochastic zeroth-order extra-momentum method, with one projection per iteration:
(66) |
The relational inequality, similar to Lemma 3.4, is established in the next lemma:
Lemma 4.5.
For the sequence generated from the stochastic zeroth-order extra-momentum method (66), the following inequality holds:
(67) | |||||
Proof.
See Appendix A.9. ∎
With the same conditions as in (25) on the parameters, we can derive a bound similar to (26) (with the iterates and the stochastic error expression replaced accordingly) and define the potential function:
Therefore, the following inequality holds:
(68) |
and we can apply the results directly from Theorem 3.5. In addition, with increasing sample sizes and suitably chosen smoothing parameters, we are able to control the bias and the variance terms in the above inequality. We give the sample complexity results in the next proposition.
Proposition 4.6 (Sample complexity result 2).
The stochastic zeroth-order extra-momentum method (66) with the following parameter choice:
and
where $T$ is the iteration count decided in advance, outputs an $\epsilon$-solution, and the total sample complexity of the procedure is
Proof.
See Appendix A.10. ∎
5 Numerical Experiments
In this section, we conduct an experiment that models a regularized two-player zero-sum matrix game with an uncertain payoff matrix. In particular, the payoff matrix is random and can only be sampled for each (mixed) strategy. The problem can be formulated as follows:
(69) | |||||
s.t. | |||||
The experiment consists of two parts. In the first part, the random payoff matrix is sampled element-wise from an i.i.d. normal distribution centered at a pre-determined mean matrix, which is randomly generated as follows: the mean matrix is partitioned into blocks, and each entry of a given block is drawn from a uniform distribution whose endpoints are themselves randomly generated. The problem parameters (including the strong monotonicity modulus and the Lipschitz constant, computed from the smallest and largest singular values of the Jacobian matrix) are set accordingly. The smoothing parameters are set to be small, the total iteration count is set to 1485, and the sample size increases with the iteration count $t$.
In the second part, the random payoff matrix is sampled element-wise from an i.i.d. log-normal distribution, which is known to have a fat tail and is used to model multiplicative random variables that take positive values. We reuse the randomly generated mean structure from the first part. In particular, each sample is the exponential of a Gaussian perturbation; recall that a log-normal random variable $e^{a + \sigma Z}$ with $Z \sim \mathcal{N}(0, 1)$ has mean $e^{a + \sigma^2/2}$, which determines the mean payoff matrix in this part. The other parameters remain the same as in the first part (the smoothing parameters are of the same order).
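For instance, a log-normal payoff oracle matching a prescribed mean matrix can be generated as follows; the specific matrix and noise level here are illustrative, not the values used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[1.0, 2.0], [0.5, 1.5]])   # target mean payoff matrix (positive entries)
s = 0.3                                   # log-scale noise level

def sample_lognormal_payoff(M, s, rng):
    """A = exp(log(M) - s^2/2 + s*Z), Z ~ N(0,1) element-wise;
    the -s^2/2 shift makes E[A] = M, since a log-normal e^{a + s Z}
    has mean e^{a + s^2/2}."""
    return np.exp(np.log(M) - 0.5 * s**2 + s * rng.standard_normal(M.shape))

# Empirical mean over many samples should recover M.
A_bar = np.mean([sample_lognormal_payoff(M, s, rng) for _ in range(50000)], axis=0)
```

The mean-correction term matters: without the $-s^2/2$ shift, the sampled payoffs would be biased upward by the factor $e^{s^2/2}$.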
In both parts of the experiment, we first solve the deterministic problem with the mean payoff matrix and denote the solution by $z^*$. We then implement the two proposed methods: the stochastic zeroth-order extra-point method and the stochastic zeroth-order extra-momentum method. In addition, we compare these two methods with other first-order methods: the extra-gradient method, the OGDA method, and the VS-Ave method (proposed in [9], a variance-reduced stochastic extension of Nesterov’s method [26]), all equipped with the same stochastic zeroth-order oracle. The results are shown in the following two figures, where the left plot in each shows the result of one experiment and the right shows the average over ten experiments. The parameters of each algorithm are manually tuned, except for VS-Ave, for which we adopt the recursive rule proposed in its original paper. The results show that the two newly proposed methods exhibit comparable (or slightly improved) performance relative to the stochastic extra-gradient/OGDA methods in this particular application.
[Figures: convergence of the five zeroth-order methods under the normal (first pair) and log-normal (second pair) payoff distributions; left: a single run, right: average over ten runs.]
6 Conclusions
This paper proposes two new schemes of stochastic first-order methods to solve strongly monotone VI problems: the stochastic extra-point scheme and the stochastic extra-momentum scheme. The first scheme offers high flexibility in the configuration of parameter choices, which can be tailored to different problem classes. The second scheme is less general in the choice of search directions; however, it has the advantage of maintaining a single sequence throughout the iterations, and therefore requires only one projection per iteration, as opposed to most other first-order methods, which maintain an extra iterative sequence. Both methods achieve the optimal iteration complexity bound, provided that the stochastic gradient oracle allows the variance to be controlled. The application of these two schemes to solve stochastic black-box saddle-point problems is also presented. Through a randomized smoothing scheme, the stochastic oracles required in these two schemes can be constructed via stochastic zeroth-order gradient approximation. The variance is then controllable by mini-batch sampling with linearly increasing sample sizes per iteration, and the corresponding sample complexity results are derived. Preliminary numerical experiments show improved (or at least comparable) performance of the proposed schemes relative to other existing methods.
References
- [1] H. H. Bauschke and P. L. Combettes “Convex analysis and monotone operator theory in Hilbert spaces” Springer, 2011
- [2] D. Boob and C. Guzmán “Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems” In arXiv preprint arXiv: 2104.02988, 2021
- [3] Y. Chen, G. Lan and Y. Ouyang “Accelerated schemes for a class of variational inequalities” In Mathematical Programming 165.1 Springer, 2017, pp. 113–149
- [4] F. Facchinei and J.-S. Pang “Finite-dimensional variational inequalities and complementarity problems” Springer Science & Business Media, 2007
- [5] X. Gao “Low-order optimization algorithms: iteration complexity and applications” Ph.D. Thesis, University of Minnesota, 2018
- [6] K. Huang and S. Zhang “A unifying framework of accelerated first-order approach to strongly monotone variational inequalities” In arXiv preprint arXiv:2103.15270, 2021
- [7] A. N. Iusem, A. Jofré, R. I. Oliveira and P. Thompson “Extragradient method with variance reduction for stochastic variational inequalities” In SIAM Journal on Optimization 27.2 SIAM, 2017, pp. 686–724
- [8] A. N. Iusem, A. Jofré, R. I. Oliveira and P. Thompson “Variance-based extragradient methods with line search for stochastic variational inequalities” In SIAM Journal on Optimization 29.1 SIAM, 2019, pp. 175–206
- [9] A. Jalilzadeh and U. V. Shanbhag “A proximal-point algorithm with variable sample-sizes (PPAWSS) for monotone stochastic variational inequality problems” In 2019 Winter Simulation Conference (WSC), 2019, pp. 3551–3562 IEEE
- [10] H. Jiang and H. Xu “Stochastic approximation approaches to the stochastic variational inequality problem” In IEEE Transactions on Automatic Control 53.6 IEEE, 2008, pp. 1462–1475
- [11] A. Juditsky, A. Nemirovski and C. Tauvel “Solving variational inequalities with stochastic mirror-prox algorithm” In Stochastic Systems 1.1 INFORMS, 2011, pp. 17–58
- [12] A. Kannan and U. V. Shanbhag “Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants” In Computational Optimization and Applications 74.3 Springer, 2019, pp. 779–820
- [13] G.M. Korpelevich “The extragradient method for finding saddle points and other problems” In Matecon 12, 1976, pp. 747–756
- [14] J. Koshal, A. Nedić and U. V. Shanbhag “Regularized iterative stochastic approximation methods for stochastic variational inequality problems” In IEEE Transactions on Automatic Control 58.3 IEEE, 2012, pp. 594–609
- [15] G. Kotsalis, G. Lan and T. Li “Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation” In arXiv preprint arXiv:2011.02987, 2020
- [16] J. Larson, M. Menickelly and S.. Wild “Derivative-free optimization methods” In arXiv preprint arXiv:1904.11585, 2019
- [17] S. Liu, S. Lu, X. Chen, Y. Feng, K. Xu, A. Al-Dujaili, M. Hong and U.-M. O’Reilly “Min-max optimization without gradients: convergence and applications to adversarial ML” In arXiv preprint arXiv:1909.13806, 2019
- [18] B. Martinet “Brève communication. Régularisation d’inéquations variationnelles par approximations successives” In Revue française d’informatique et de recherche opérationnelle. Série rouge 4.R3 EDP Sciences, 1970, pp. 154–158
- [19] M. Menickelly and S. M. Wild “Derivative-free robust optimization by outer approximations” In Mathematical Programming 179.1 Springer, 2020, pp. 157–193
- [20] A. Mokhtari, A. Ozdaglar and S. Pattathil “A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach” In arXiv preprint arXiv:1901.08511, 2019
- [21] A. Mokhtari, A. Ozdaglar and S. Pattathil “Convergence rate of $\mathcal{O}(1/k)$ for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems” In SIAM Journal on Optimization 30.4 SIAM, 2020, pp. 3230–3251
- [22] A. Nemirovski “Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems” In SIAM Journal on Optimization 15.1 SIAM, 2004, pp. 229–251
- [23] Y. Nesterov “A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$” In Doklady Akademii Nauk SSSR 269, 1983, pp. 543–547
- [24] Y. Nesterov “Dual extrapolation and its applications to solving variational inequalities and related problems” In Mathematical Programming 109.2-3 Springer, 2007, pp. 319–344
- [25] Y. Nesterov “Introductory lectures on convex optimization: A basic course” Springer Science & Business Media, 2003
- [26] Y. Nesterov and L. Scrimali “Solving strongly monotone variational and quasi-variational inequalities” In Available at SSRN 970903, 2006
- [27] Y. Nesterov and V. Spokoiny “Random gradient-free minimization of convex functions” In Foundations of Computational Mathematics 17.2 Springer, 2017, pp. 527–566
- [28] B. Palaniappan and F. Bach “Stochastic variance reduction methods for saddle-point problems” In Advances in Neural Information Processing Systems, 2016, pp. 1416–1424
- [29] B. T. Polyak “Some methods of speeding up the convergence of iteration methods” In USSR Computational Mathematics and Mathematical Physics 4.5 Elsevier, 1964, pp. 1–17
- [30] L. D. Popov “A modification of the Arrow-Hurwicz method for search of saddle points” In Mathematical Notes of the Academy of Sciences of the USSR 28.5 Springer, 1980, pp. 845–848
- [31] H. Robbins and S. Monro “A stochastic approximation method” In The Annals of Mathematical Statistics JSTOR, 1951, pp. 400–407
- [32] R. T. Rockafellar “Monotone operators and the proximal point algorithm” In SIAM Journal on Control and Optimization 14.5 SIAM, 1976, pp. 877–898
- [33] A. Roy, Y. Chen, K. Balasubramanian and P. Mohapatra “Online and bandit algorithms for nonstationary stochastic saddle-point optimization” In arXiv preprint arXiv:1912.01698, 2019
- [34] S. Shalev-Shwartz “Online learning and online convex optimization” In Foundations and trends in Machine Learning 4.2, 2011, pp. 107–194
- [35] U. V. Shanbhag “Stochastic variational inequality problems: Applications, analysis, and algorithms” In Theory Driven by Influential Applications INFORMS, 2013, pp. 71–107
- [36] P. Tseng “On linear convergence of iterative methods for the variational inequality problem” In Journal of Computational and Applied Mathematics 60.1-2 Elsevier, 1995, pp. 237–252
- [37] Z. Wang, K. Balasubramanian, S. Ma and M. Razaviyayn “Zeroth-order algorithms for nonconvex minimax problems with improved complexities” In arXiv preprint arXiv:2001.07819, 2020
- [38] T. Xu, Z. Wang, Y. Liang and H. V. Poor “Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence” In arXiv preprint arXiv:2006.09361, 2020
- [39] F. Yousefian, A. Nedić and U. V. Shanbhag “A regularized smoothing stochastic approximation (RSSA) algorithm for stochastic variational inequality problems” In 2013 Winter Simulations Conference (WSC), 2013, pp. 933–944 IEEE
- [40] F. Yousefian, A. Nedić and U. V. Shanbhag “On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems” In Mathematical Programming 165.1 Springer, 2017, pp. 391–431
- [41] F. Yousefian, A. Nedić and U. V. Shanbhag “Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems” In 53rd IEEE Conference on Decision and Control, 2014, pp. 5831–5836 IEEE
- [42] J. Zhang, M. Hong and S. Zhang “On lower iteration complexity bounds for the saddle point problems” In arXiv preprint arXiv:1912.07481, 2018
Appendix A Proof of technical lemmas and theorems
A.1 Proof of Lemma 3.1
First of all, by the 1-co-coerciveness (cf. e.g. Proposition 4.4 in [1]) of the projection operator , we have
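Recall that this property states the following: for the Euclidean projection $\Pi_{\mathcal{X}}$ onto a closed convex set $\mathcal{X}$ (cf. Proposition 4.4 in [1]),

```latex
\langle \Pi_{\mathcal{X}}(u) - \Pi_{\mathcal{X}}(v),\; u - v \rangle
\;\ge\; \left\| \Pi_{\mathcal{X}}(u) - \Pi_{\mathcal{X}}(v) \right\|^2,
\qquad \forall\, u, v \in \mathbb{R}^n .
```

Inequality (70) below is an application of this property to the iterates.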
(70)
We shall first decompose the last term in the above inequality as
(71)
Let us first use the optimality condition of to bound term :
Taking , we get
(72)
We can also establish the bound for :
Note the following bound from the Lipschitz continuity:
(73)
for any , where we used the definition of the stochastic error term
(74)
Therefore,
(75)
Furthermore,
The resulting bound for becomes:
(76)
Next let us bound in (71). We have,
(77)
We also need to bound the following term in (70):
(79)
Let us now take expectation on both sides. Noting , , and , and noting that for all by Assumption (6), we obtain
(80)
Notice that
where
and
Further note that we have denoted to be the collection of random vectors sampled up until the iterate . Therefore, is a known vector given .
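This conditioning step is the standard tower-property computation: writing $\delta$ for the stochastic error term and $v$ for the vector that is known given $\mathcal{F}_k$ (the symbols here are generic placeholders matching the roles in the proof), unbiasedness of the oracle gives

```latex
\mathbb{E}\big[\langle \delta, v\rangle \,\big|\, \mathcal{F}_k\big]
= \big\langle \mathbb{E}[\delta \mid \mathcal{F}_k],\; v \big\rangle = 0 ,
```

since $v$ is $\mathcal{F}_k$-measurable and $\mathbb{E}[\delta \mid \mathcal{F}_k] = 0$.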
Putting the above two bounds back into (80), we arrive at the desired bound:
A.2 Proof of Theorem 3.2
By condition (21), we have . Let us start by dividing both sides of (18) by :
(81)
Note that we have by condition (21). It is elementary to verify that
and by rearranging terms in (81), we have the following
A recursive argument yields the following result:
Note that . The statement in Theorem 3.2 follows by letting .
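For completeness, the recursive argument used above is the standard unrolling of a contraction (the symbols below are generic placeholders, not the paper's notation): if a nonnegative sequence satisfies $r_{k+1} \le \rho\, r_k + c$ with $\rho \in (0,1)$, then

```latex
r_{k} \;\le\; \rho^{k} r_0 + c \sum_{j=0}^{k-1} \rho^{j}
\;\le\; \rho^{k} r_0 + \frac{c}{1-\rho} .
```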
A.3 Proof of Proposition 3.3
Now, from (10), we have
Dividing both sides by and noting that , we have:
We can move a part of to the LHS to obtain the following:
Note that . Finally, the LHS of the above inequality can be lower bounded by , thus completing the proof.
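The lower bound in this last step relies on the strong monotonicity of the mapping, which for reference reads (with $\mu$ denoting the strong-monotonicity modulus):

```latex
\langle F(u) - F(v),\; u - v \rangle \;\ge\; \mu \left\| u - v \right\|^2 ,
\qquad \forall\, u, v \in \mathcal{X} .
```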
A.4 Proof of Lemma 3.4
We start by using the 1-co-coerciveness of the projection operator :
Next, let us bound the above four terms separately:
(83)
and
(84)
where
(85)
and
(86)
and
(87)
where
(88)
Taking expectation on both sides gives us
(89)
Note that
Here we define to be the collection of random vectors sampled up until the iterate , and is known given .
A.5 Proof of Theorem 3.5
Finally, with the following bound:
we can lower bound as
Therefore
(90)
The statement in Theorem 3.5 follows by noting .
A.6 Proof of Lemma 4.2
We will derive the first bound in (59); the derivation of the second bound is similar and is omitted.
A.7 Proof of Lemma 4.3
The line of reasoning for this lemma is very similar to that of the proof in Appendix A.1, with the stochastic mapping replaced by the stochastic zeroth-order gradient . Therefore, we shall not repeat the analysis in full, but instead highlight the main differences. First, for , we shall have
(91)
where , because
Next, by denoting , we have
(92)
Therefore, for a similar bound as in (75), we have
For another similar bound as in (79), we have
Therefore, we reach the bound that
By (61), we have . Taking expectation on both sides for the above inequality, we have
(93)
Note that
where
Let us denote
as the collection of all random vectors at iteration and the collection of all such random vectors from iteration to , respectively. Notice that, given , is a deterministic vector; we then have the following bound
where in the last inequality we use the boundedness assumption of and denote . Combining the above bounds into (93), the desired result follows.
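For intuition, a zeroth-order gradient of the kind used in this proof can be sketched as a standard two-point smoothing estimator in the spirit of [27]. The function `two_point_grad`, the quadratic test function, and all constants below are hypothetical illustrations, not the paper's exact scheme:

```python
import random

def two_point_grad(f, x, mu=1e-3, rng=random):
    """One sample of a two-point zeroth-order gradient estimator:
    g = (d / (2*mu)) * (f(x + mu*u) - f(x - mu*u)) * u,
    with u drawn uniformly from the unit sphere."""
    d = len(x)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(gi * gi for gi in g) ** 0.5
    u = [gi / norm for gi in g]
    f_plus = f([xi + mu * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - mu * ui for xi, ui in zip(x, u)])
    scale = d * (f_plus - f_minus) / (2.0 * mu)
    return [scale * ui for ui in u]

# Averaging many samples approximately recovers the true gradient of
# the (hypothetical) test function f(x) = x1^2 + x2^2 at (1, 2),
# whose gradient is (2, 4); each single sample is noisy but unbiased.
random.seed(0)
f = lambda x: x[0] ** 2 + x[1] ** 2
N = 50000
avg = [0.0, 0.0]
for _ in range(N):
    sample = two_point_grad(f, [1.0, 2.0])
    avg = [a + s / N for a, s in zip(avg, sample)]
```

The bias of a single estimate is controlled by the smoothing radius `mu`, while the variance is controlled by the number of averaged samples, mirroring the bias/variance terms tracked in the analysis.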
A.8 Proof of Proposition 4.4
Note the variance is upper bounded by
Since we take and , the above upper bound can be written as
By substituting the specific parameter choice into (65), starting from the last iteration , we have
In the last inequality, we denote and , , and use the fact that .
Dividing both sides by and noting that , we obtain
By moving a part of to the LHS, we have:
With the sample size , we have:
It is then straightforward to see that for we have , with the sample complexity given by
Noting the more precise expression of , we have , ; the combined sample complexity is then given by .
A.9 Proof of Lemma 4.5
Following logic similar to the proof in Appendix A.4, we shall focus only on the main differences between the two proofs.
First, by a derivation similar to (92), we have the following bound:
Using the variance bound in (61), we reach the following inequality, analogous to the step in (89):
(94)
Denote by
the collection of all random vectors at iteration and the collection of all such random vectors from iteration to , respectively. Noting that, given , is a deterministic vector, we then have the following bound:
Substituting the above bound into (94), the desired result follows.
A.10 Proof of Proposition 4.6
Let us start from the potential function inequality (68) from iteration . With , let us also denote , then . Note that the defined here is used only within this proof and is not to be confused with that used in Appendix A.8. Then we have
In the second inequality, we take and note that . Then we have . In addition,
Therefore, we have
(95)
Now, let us lower bound by observing:
Then we have
Combining with (95), we have
It follows immediately that for we have . With the more precise expression , the sample complexity can be estimated as:
Appendix B Proof of the uniform sublinear convergence of the stochastic extra-point method
In order to establish a uniform sublinear convergence rate, we consider parameters that diminish with the iteration number . Let us return to the one-iteration relation (10) and consider the following choice of parameters:
where we omit the superscript of on the RHS for notational simplicity. Note that here depends on the iteration ; we follow the same simplification for the other parameters throughout the rest of this appendix unless noted otherwise.
By using the fact , we have:
where in the last inequality we use the boundedness of the feasible set.
Therefore, we can rewrite (10) as:
Substituting the respective parameter values into the remaining terms yields:
Dividing both sides by , and noting that , it follows that
From the above one-iteration inequality, we claim the following:
where
and we shall prove this inequality by induction. For , the inequality holds trivially.
Assuming the inequality holds for all indices up to , we then have
Note that in the last inequality we used the identities and . This completes the proof for the uniform convergence rate.
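As an illustrative numerical check of this uniform O(1/k) rate, consider a recursion of the same diminishing-parameter shape; the constant `C` and the offset `3` below are hypothetical choices for illustration only, since the paper's actual coefficients depend on the condition number and the diameter of the feasible set:

```python
# a plays the role of the squared distance to the solution;
# each step applies a diminishing contraction plus an O(1/k^2)
# noise term, as in the one-iteration inequality above.
C = 5.0
a = 1.0
scaled = []
for k in range(1, 50001):
    a = (1.0 - 2.0 / (k + 3)) * a + C / (k + 3) ** 2
    scaled.append((k + 1) * a)  # track k * a_k

# The scaled sequence stays bounded, i.e., a_k = O(1/k).
sup_scaled = max(scaled)
```

A short induction (of the same form as the proof above) shows that here $a_k \le C/(k+3)$ for all $k$, so the scaled sequence never exceeds $C$.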