High Probability Convergence for Accelerated Stochastic Mirror Descent
Abstract
In this work, we describe a generic approach to show convergence with high probability for stochastic convex optimization. In previous works, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution as opposed to the domain diameter. The algorithms use step sizes analogous to the standard settings and are universal, applying to Lipschitz functions, smooth functions, and their linear combinations.
1 Introduction
Stochastic convex optimization is a well-studied area with numerous applications in algorithms, machine learning, and beyond. Various algorithms have been shown to converge for many classes of functions including Lipschitz functions, smooth functions, and their linear combinations. However, one curious gap remains in the understanding of their convergence with high probability compared with convergence in expectation. Classical results show that, in expectation, the function value gap of the final solution is proportional to the distance between the initial solution and the optimal solution. On the other hand, classical results for convergence with high probability could only show that the function value gap of the final solution is proportional to the diameter of the domain, which can be much larger or even unbounded. In this work, we bridge this gap and establish a generic approach to show convergence with high probability where the final function value gap is proportional to the distance between the initial solution and the optimal solution. We instantiate our approach in two settings: stochastic mirror descent and stochastic accelerated gradient descent. The results are analogous to known results for convergence in expectation but now hold with high probability. The algorithms are universal for both Lipschitz functions and smooth functions.
The proof technique is inspired by classical works on concentration inequalities, specifically a type of martingale inequality where the variance of the martingale difference is bounded by a linear function of the previous value. This technique was first applied to showing high probability convergence by Harvey et al. [2]. Our proof is inspired by the proof of Theorem 7.3 by Chung and Lu [1]. In each time step $t$ with iterate $x_t$, let $e_t = \widehat{\nabla} f(x_t) - \nabla f(x_t)$ be the error in our gradient estimate. Classical proofs of convergence revolve around analyzing the sum of the error terms $\langle e_t, x_t - x^* \rangle$, which can be viewed as a martingale sequence. Assuming a bounded domain, the concentration of the sum can be shown via classical martingale inequalities. The key new insight is that instead of analyzing this sum, we analyze a related sum where the coefficients decrease over time, to account for the fact that we have a looser grip on the distance to the optimal solution as time increases. Nonetheless, the coefficients are kept within a constant factor of each other, and the same asymptotic convergence is attained with high probability.
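To make the constant-factor claim concrete, here is a small numerical illustration of our own (the per-step decay rate $1/T$ is a hypothetical stand-in for the schedules constructed later): if each coefficient shrinks by a factor of $(1 + 1/T)^{-1}$ per step, then after $T$ steps all coefficients remain within a factor of $e$ of one another.

import numpy as np

T = 1000
# Hypothetical weight schedule: each coefficient shrinks by (1 + 1/T)^{-1}.
# After T steps the total decay is (1 + 1/T)^{-T} >= 1/e, so all weights
# stay within a constant factor (at most e) of one another.
w = (1.0 + 1.0 / T) ** (-np.arange(T))
print(w[0] / w[-1], np.e)  # ratio ≈ 2.715, bounded by e ≈ 2.718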
Related work
Lan [5] establishes high probability bounds for the general setting of stochastic mirror descent and accelerated stochastic mirror descent under the assumption that the stochastic noise is subgaussian. The rates shown in [5] match the best rates known in expectation, but they depend on the Bregman diameter of the domain, which can be unbounded. Our work complements the analysis of [5] with a novel concentration argument that allows us to establish convergence with respect to the distance from the initial point. Our analysis applies to the general setting considered in [5], and we use the same subgaussian assumption on the stochastic noise.
The algorithms and step sizes we consider capture stochastic gradient descent with its standard step size settings for both smooth and non-smooth problems. The high-probability convergence of SGD has been studied in [4, 6, 3, 2]. These works assume either that the function is strongly convex or that the domain is compact. In contrast, our work applies to non-strongly convex optimization over a general domain.
2 Preliminaries
We consider the problem $\min_{x \in \mathcal{X}} f(x)$, where $f$ is a convex function and $\mathcal{X}$ is a convex domain. We consider the general setting where $f$ is potentially not strongly convex and the domain $\mathcal{X}$ is not necessarily compact.
We assume we have access to a stochastic gradient oracle that, when queried at the iterate $x_t$, returns a stochastic gradient $\widehat{\nabla} f(x_t)$ that satisfies the following two assumptions for any prior history:
1. Unbiased estimator: $\mathbb{E}\left[\widehat{\nabla} f(x_t) \mid \mathcal{F}_{t-1}\right] = \nabla f(x_t)$.
2. Sub-Gaussian noise: $\left\|\widehat{\nabla} f(x_t) - \nabla f(x_t)\right\|_*$ is a $\sigma$-subgaussian random variable (Definition 2.1).
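As a concrete example, an oracle that returns the true gradient plus independent Gaussian noise satisfies both assumptions, since a mean-zero Gaussian vector has subgaussian norm. The sketch below is our own illustration; grad_f and noise_std are hypothetical inputs, not objects from the paper.

import numpy as np

def make_oracle(grad_f, noise_std, rng):
    """Stochastic gradient oracle: unbiased, with Gaussian (hence
    subgaussian) noise added to each coordinate."""
    def oracle(x):
        g = grad_f(x)  # true gradient, so E[oracle(x)] = grad_f(x)
        return g + rng.normal(0.0, noise_std, size=x.shape)
    return oracle

rng = np.random.default_rng(0)
oracle = make_oracle(lambda x: 2 * x, noise_std=0.1, rng=rng)  # f(x) = ||x||^2
print(oracle(np.array([1.0, -2.0])))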
There are several equivalent definitions of subgaussian random variables up to an absolute constant scaling (see, e.g., Proposition 2.5.2 in [7]). For convenience, we use the following property as the definition.
Definition 2.1.
A random variable $X$ is $\sigma$-subgaussian if $\mathbb{E}\left[\exp\left(X^2 / \sigma^2\right)\right] \le 2$.
The above definition is equivalent, up to an absolute constant factor in $\sigma$, to the standard tail-bound and moment generating function characterizations of subgaussian random variables; see Proposition 2.5.2 in [7].
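As a quick sanity check of Definition 2.1 (our own illustration, not from the paper): if $X \sim N(0, s^2)$, then $\mathbb{E}[\exp(X^2/\sigma^2)] = (1 - 2s^2/\sigma^2)^{-1/2}$ whenever $\sigma^2 > 2s^2$, so taking $\sigma = 2s$ already gives the value $\sqrt{2} \le 2$.

import numpy as np

rng = np.random.default_rng(0)
s, sigma = 1.0, 2.0  # X ~ N(0, s^2); candidate subgaussian parameter sigma = 2s
x = rng.normal(0.0, s, size=10**6)
empirical = np.exp(x**2 / sigma**2).mean()
closed_form = 1.0 / np.sqrt(1.0 - 2.0 * s**2 / sigma**2)
print(empirical, closed_form)  # both ≈ sqrt(2) ≈ 1.414, below the threshold 2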
We will also use the following helper lemma whose proof we defer to the Appendix.
Lemma 2.3.
For any $a \in \left[0, \frac{1}{\sigma^2}\right]$ and a nonnegative $\sigma$-subgaussian random variable $X$, we have $\mathbb{E}\left[\exp\left(a X^2\right)\right] \le \exp\left(2 a \sigma^2\right)$.
3 Analysis of Stochastic Mirror Descent
Parameters: initial point $x_1 \in \mathcal{X}$, step sizes $\{\eta_t\}_{t=1}^{T}$
for $t = 1$ to $T$:
    $x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta_t \langle \widehat{\nabla} f(x_t), x \rangle + D_\psi(x, x_t) \right\}$
return $\bar{x}_T = \frac{\sum_{t=1}^{T} \eta_t x_{t+1}}{\sum_{t=1}^{T} \eta_t}$
In this section, we analyze the Stochastic Mirror Descent algorithm (Algorithm 1). For simplicity, here we consider the non-smooth setting and assume that $f$ is $G$-Lipschitz continuous, i.e., $\|\nabla f(x)\|_* \le G$ for all $x \in \mathcal{X}$. The analysis for the smooth setting follows via a simple modification of the analysis presented here, as does the analysis for the accelerated setting given in the next section.
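To make the algorithm concrete, the following is a minimal sketch of Algorithm 1 in the Euclidean setup, assuming $\psi(x) = \frac{1}{2}\|x\|^2$, in which case the mirror step reduces to a projected stochastic gradient step; the quadratic objective and nonnegative-orthant domain are hypothetical choices for illustration.

import numpy as np

def smd(oracle, project, x1, step_sizes):
    """Stochastic mirror descent with psi(x) = 0.5 * ||x||^2, in which case
    the mirror step is a projected stochastic gradient step."""
    x = x1.copy()
    iterates, steps = [], []
    for eta in step_sizes:
        x = project(x - eta * oracle(x))  # argmin of eta*<g, x> + 0.5*||x - x_t||^2
        iterates.append(x.copy())
        steps.append(eta)
    # Weighted average of the iterates, as in the averaged point of Algorithm 1.
    return np.average(iterates, axis=0, weights=np.array(steps))

rng = np.random.default_rng(0)
oracle = lambda x: 2 * x + rng.normal(0.0, 0.1, size=x.shape)  # noisy grad of ||x||^2
project = lambda x: np.maximum(x, 0.0)                          # domain: x >= 0
print(smd(oracle, project, np.array([5.0, 3.0]), [0.1] * 200))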
We define the gradient error $e_t = \widehat{\nabla} f(x_t) - \nabla f(x_t)$.
We let $\mathcal{F}_t = \sigma(e_1, \dots, e_t)$ denote the natural filtration. Note that $x_t$ is $\mathcal{F}_{t-1}$-measurable.
The starting point of our analysis is the following inequality that follows from the standard stochastic mirror descent analysis (see, e.g., [5]). We include the proof in the Appendix for completeness.
Lemma 3.1.
([5]) For every iteration $t$, we have
$\eta_t \left( f(x_t) - f(x^*) \right) \le D_\psi(x^*, x_t) - D_\psi(x^*, x_{t+1}) - \eta_t \langle e_t, x_t - x^* \rangle + \eta_t^2 \left( G^2 + \|e_t\|_*^2 \right).$
We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function of a random variable $Z_t$ that is closely related to the left-hand side of the inequality above. We let $(w_t)$ be a non-increasing sequence of positive weights, i.e., $w_t \le w_{t-1}$ for all $t$. We define $Z_t$ by combining, with these weights, the per-iteration inequalities from Lemma 3.1.
Before proceeding with the analysis, we provide intuition for our approach. The random variable $Z_t$ combines the function value gaps with weights given by the non-increasing sequence $(w_t)$. The intuition here is that we want to leverage the progress in function value to absorb the stochastic error terms on the RHS of Lemma 3.1. For the divergence terms, we use the same coefficient to allow the terms to telescope. In Theorem 3.2, we upper bound the moment generating function of $Z_t$ and derive a set of conditions on the weights that allow us to absorb the stochastic errors. In Corollaries 3.3 and 3.4, we show how to choose the weights and obtain a convergence rate that matches the standard rates that hold in expectation.
We now give our main concentration argument, which bounds the moment generating function of $Z_t$. The proof of the following theorem is inspired by the proof of Theorem 7.3 in [1].
Theorem 3.2.
Suppose that and for every . For every , we have
Proof.
We proceed by induction on $t$. The base case $t = 0$ is immediate from the definitions. Next, we consider $t \ge 1$. We have
(1)
We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t-1}$, $x_t$ is fixed. Using the inductive hypothesis, we obtain
(2)
Let . By Lemma 3.1, we have
and thus
Plugging into (2), we obtain
Plugging into (1), we obtain
(3)
Next, we analyze the expectation on the RHS of the above inequality. We have
(4)
On the first line, we used the Taylor expansion of the exponential function, and on the second line, we used that $\mathbb{E}\left[e_t \mid \mathcal{F}_{t-1}\right] = 0$. On the third line, we used Cauchy-Schwarz and obtained
On the fourth line, we applied Lemma 2.3 with $X = \|e_t\|_*$ and a suitable choice of $a$. On the fifth line, we used that $D_\psi(x, y) \ge \frac{1}{2}\|x - y\|^2$, which follows from the strong convexity of $\psi$.
Theorem 3.2 and Markov's inequality give us the following convergence guarantee.
Corollary 3.3.
Suppose the sequence $(w_t)$ satisfies the conditions of Theorem 3.2. For any $\delta \in (0, 1)$, the following event holds with probability at least $1 - \delta$:
Proof.
Let
By Theorem 3.2 and Markov's inequality, we have
Note that
Therefore, with probability at least $1 - \delta$, we have
∎
With the above result in hand, we complete the convergence analysis by showing how to define the sequence with the desired properties.
Corollary 3.4.
Suppose we run the Stochastic Mirror Descent algorithm with fixed step sizes . Let and for all . The sequence satisfies the conditions required by Corollary 3.3. By Corollary 3.3, for any , the following events hold with probability at least :
and
Setting $\eta$ to balance the two terms in the first inequality gives
and
Proof.
Recall from Corollary 3.3 that the sequence $(w_t)$ needs to satisfy the following conditions for all $t$:
Let . We set . For , we set so that the first condition holds with equality
We can show by induction that, for every , we have
The base case follows from the definition of . Consider . Using the definition of and the inductive hypothesis, we obtain
as needed.
Using this fact, we now show that satisfies the second condition. For every , we have
as needed.
Thus, by Corollary 3.3, with probability at least $1 - \delta$, we have
Note that and for all . Thus we obtain
Thus we have
and
∎
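For intuition on the balancing step in Corollary 3.4, here is the standard computation in schematic form (our own sketch, assuming the two terms of the bound have the usual shape, with $R$ denoting the initial distance $\|x_1 - x^*\|$):
$\min_{\eta > 0} \left\{ \frac{R^2}{\eta T} + \eta \left( G^2 + \sigma^2 \right) \right\}$ is attained at $\eta = \frac{R}{\sqrt{(G^2 + \sigma^2)\, T}}$, giving a rate of $\frac{2 R \sqrt{G^2 + \sigma^2}}{\sqrt{T}}$.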
The analysis readily extends to the setting where the time horizon $T$ is not known in advance and we use time-varying step sizes. Below, we include the analysis for the well-studied step sizes $\eta_t = \eta / \sqrt{t}$.
Corollary 3.5.
Proof.
Recall from Corollary 3.3 that the sequence $(w_t)$ needs to satisfy the following conditions for all $t$:
Let and . We set . For , we set so that the first condition holds with equality
We can show by induction that, for every , we have
The base case follows from the definition of . Consider . Using the definition of and the inductive hypothesis, we obtain
as needed.
Using this fact, we now show that satisfies the second condition. For every , we have
as needed.
Thus, by Corollary 3.3, with probability at least $1 - \delta$, we have
Note that and for all . Thus we obtain
Plugging in and simplifying, we obtain
Thus we have
and
∎
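As a usage illustration of the time-varying schedule (our own sketch, in the same Euclidean setup as the earlier example; the constants are hypothetical), only the step size sequence changes:

import numpy as np

rng = np.random.default_rng(0)
oracle = lambda x: 2 * x + rng.normal(0.0, 0.1, size=x.shape)  # noisy grad of ||x||^2
x = np.array([5.0, 3.0])
eta, T = 0.5, 400
for t in range(1, T + 1):
    x = np.maximum(x - (eta / np.sqrt(t)) * oracle(x), 0.0)    # eta_t = eta / sqrt(t)
print(x)  # last iterate shown for brevity; the analysis averages the iterates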
4 Analysis of Accelerated Stochastic Mirror Descent
Parameters: initial point $x_0 \in \mathcal{X}$, step sizes $\{\alpha_t\}$ and $\{\eta_t\}$
Set $y_0 = x_0$, $z_0 = x_0$
for $t = 1$ to $T$:
    $x_t = (1 - \alpha_t) y_{t-1} + \alpha_t z_{t-1}$
    $z_t = \arg\min_{x \in \mathcal{X}} \left\{ \eta_t \langle \widehat{\nabla} f(x_t), x \rangle + D_\psi(x, z_{t-1}) \right\}$
    $y_t = (1 - \alpha_t) y_{t-1} + \alpha_t z_t$
return $y_T$
In this section, we analyze the Accelerated Stochastic Mirror Descent algorithm (Algorithm 2). We assume that $f$ satisfies the following condition for all $x, y \in \mathcal{X}$:
$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le \frac{L}{2} \|y - x\|^2 + M \|y - x\|.$
$L$-smooth functions (with $M = 0$), $G$-Lipschitz functions (with $L = 0$ and $M = 2G$), and their sums all satisfy the above condition.
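For instance, assuming the condition takes the form stated above, a $G$-Lipschitz convex function satisfies it with $L = 0$ and $M = 2G$: by convexity and Lipschitz continuity,
$f(y) - f(x) - \langle \nabla f(x), y - x \rangle \le |f(y) - f(x)| + \|\nabla f(x)\|_* \|y - x\| \le 2G \|y - x\|.$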
As before, we define the gradient error $e_t = \widehat{\nabla} f(x_t) - \nabla f(x_t)$.
We let $\mathcal{F}_t = \sigma(e_1, \dots, e_t)$ denote the natural filtration. Note that $x_t$ is $\mathcal{F}_{t-1}$-measurable, and $y_t$ and $z_t$ are $\mathcal{F}_t$-measurable.
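As in Section 3, here is a minimal Euclidean sketch of Algorithm 2, assuming $\psi(x) = \frac{1}{2}\|x\|^2$; the parameter choices $\alpha_t = 2/(t+1)$ and $\eta_t \propto t$ are the standard accelerated ones, and the quadratic objective is a hypothetical test case.

import numpy as np

def accelerated_smd(oracle, project, x0, etas, alphas):
    """Accelerated stochastic mirror descent sketch, Euclidean case: the
    z-step is a projected gradient step from z_{t-1} with step size eta_t."""
    y, z = x0.copy(), x0.copy()
    for eta, alpha in zip(etas, alphas):
        x = (1 - alpha) * y + alpha * z   # extrapolation point x_t
        z = project(z - eta * oracle(x))  # mirror (here: projected gradient) step
        y = (1 - alpha) * y + alpha * z   # averaged iterate y_t
    return y

rng = np.random.default_rng(0)
L, T = 2.0, 200
oracle = lambda x: L * x + rng.normal(0.0, 0.01, size=x.shape)  # noisy grad of (L/2)||x||^2
identity = lambda v: v                                          # unconstrained domain
alphas = [2.0 / (t + 1) for t in range(1, T + 1)]               # alpha_t = 2/(t+1)
etas = [t / (2.0 * L) for t in range(1, T + 1)]                 # hypothetical eta_t ∝ t
print(accelerated_smd(oracle, identity, np.array([5.0, -3.0]), etas, alphas))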
We follow a similar analysis to the previous section. As before, we start with the inequalities shown in the standard analysis of the algorithm, and we combine them using coefficients $w_t$. The following lemma follows from the analysis given in [5], and we include the proof in the Appendix for completeness.
Lemma 4.1.
([5]) For every iteration , we have
We now turn our attention to our main concentration argument. Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function of a random variable $Z_t$ that is closely related to the left-hand side of the inequality above. We let $(w_t)$ be a non-increasing sequence of positive weights, i.e., $w_t \le w_{t-1}$ for all $t$. We define $Z_t$ by combining, with these weights, the per-iteration inequalities from Lemma 4.1.
Theorem 4.2.
Suppose that for every and for every . For every , we have
Proof.
We proceed by induction on $t$. The base case $t = 0$ is immediate from the definitions. Next, we consider $t \ge 1$. We have
(5)
We now analyze the inner expectation. Conditioned on $\mathcal{F}_{t-1}$, $x_t$ is fixed. Using the inductive hypothesis, we obtain
(6)
Let . By Lemma 4.1, we have
and thus
Plugging into (6), we obtain
Plugging into (5), we obtain
(7)
Next, we analyze the expectation on the RHS of the above inequality. We have
(8)
On the first line, we used the Taylor expansion of the exponential function, and on the second line, we used that $\mathbb{E}\left[e_t \mid \mathcal{F}_{t-1}\right] = 0$. On the third line, we used Cauchy-Schwarz and obtained
On the fourth line, we applied Lemma 2.3 with $X = \|e_t\|_*$ and a suitable choice of $a$. On the fifth line, we used that $D_\psi(x, y) \ge \frac{1}{2}\|x - y\|^2$, which follows from the strong convexity of $\psi$.
Theorem 4.2 and Markov's inequality give us the following convergence guarantee.
Corollary 4.3.
Suppose the sequence $(w_t)$ satisfies the conditions of Theorem 4.2. For any $\delta \in (0, 1)$, the following event holds with probability at least $1 - \delta$:
Proof.
Let
By Theorem 4.2 and Markov's inequality, we have
Note that
Therefore, with probability at least $1 - \delta$, we have
∎
With the above result in hand, we complete the convergence analysis by showing how to define the sequence with the desired properties.
Corollary 4.4.
Suppose we run the Accelerated Stochastic Mirror Descent algorithm with the standard choices and with . Let and for all . The sequence satisfies the conditions required by Corollary 4.3. By Corollary 4.3, for any , the following events hold with probability at least :
and
Setting $\eta$ to balance the two terms in the first inequality gives
and
Proof.
Recall from Corollary 4.3 that the sequence $(w_t)$ needs to satisfy the following conditions for all $t$:
(9)
(10)
We will set $(w_t)$ so that it satisfies the following additional condition, which will allow us to telescope the sum on the RHS of Corollary 4.3:
(11)
Given , we set for every so that the first condition (9) holds with equality:
Let . We set
Given this choice for , we now verify that, for all , we have
We proceed by induction on . The base case follows from the definition of . Consider . Using the definition of and the inductive hypothesis, we obtain
as needed.
Let us now verify that the second condition (10) also holds. Using that , , and , we obtain
as needed.
Let us now verify that the third condition (11) also holds. Since and , we have . Since , it follows that condition (11) holds.
We now turn our attention to the convergence. By Corollary 4.3, with probability at least $1 - \delta$, we have
Grouping terms on the LHS and using that , we obtain
Since satisfies condition (11), the coefficient of is non-negative and thus we can drop the above sum. We obtain
Using that and for all , we obtain
Thus
Using that and , we obtain
Plugging in and using that and , we obtain
We can further simplify the bound by lower bounding and upper bounding . We obtain
Thus we obtain
and
∎
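For orientation, here is a schematic of our own under the usual two-term structure of accelerated bounds, with $R$ the initial distance: acceleration improves the deterministic part to $O(L R^2 / T^2)$, while the statistical part remains of order $R / \sqrt{T}$,
$f(y_T) - f(x^*) = O\!\left( \frac{L R^2}{T^2} + \frac{(M + \sigma) R}{\sqrt{T}} \right).$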
References
- [1] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
- [2] Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
- [3] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
- [4] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. Advances in Neural Information Processing Systems, 21, 2008.
- [5] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
- [6] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
- [7] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
Appendix A Omitted Proofs
Proof.
(Lemma 2.3) Consider two cases either or . First suppose . We use the inequality ,
Next, let . We have
In the first inequality, we use the inequality . In the third inequality, we use . This inequality can be proved with the Taylor expansion.
∎
Proof.
(Lemma 3.1) By the optimality condition for $x_{t+1}$, we have
and thus
Note that
and thus
where we have used that $D_\psi(x_{t+1}, x_t) \ge \frac{1}{2}\|x_{t+1} - x_t\|^2$ by the strong convexity of $\psi$.
By convexity,
Combining the two inequalities, we obtain
Using the triangle inequality and the bounded gradient assumption $\|\nabla f(x_t)\|_* \le G$, we obtain
Thus
as needed. ∎
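The three-point identity for the Bregman divergence is the engine behind the first chain of inequalities above; in the Euclidean case $\psi(x) = \frac{1}{2}\|x\|^2$ it reads $D(x, z) = D(x, y) + D(y, z) + \langle \nabla \psi(y) - \nabla \psi(z), x - y \rangle$, which the quick numerical check below verifies (our own illustration).

import numpy as np

rng = np.random.default_rng(0)
D = lambda a, b: 0.5 * np.dot(a - b, a - b)  # Bregman divergence of psi = 0.5 * ||.||^2
x, y, z = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
lhs = D(x, z)
rhs = D(x, y) + D(y, z) + np.dot(y - z, x - y)  # grad psi(y) - grad psi(z) = y - z
print(np.isclose(lhs, rhs))  # True: the three-point identity holds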
Proof.
(Lemma 4.1) Starting with smoothness, we obtain
By the optimality condition for $z_t$,
Rearranging, we obtain
By combining the two inequalities, we obtain
Subtracting from both sides, rearranging, and using that , we obtain
Finally, we divide by , and obtain
∎