
Acceleration by Random Stepsizes:
Hedging, Equalization, and the Arcsine Stepsize Schedule

Jason M. Altschuler
UPenn
alts@upenn.edu
   Pablo A. Parrilo
LIDS - MIT
parrilo@mit.edu
Abstract

We show that for separable convex optimization, random stepsizes fully accelerate Gradient Descent. Specifically, using inverse stepsizes i.i.d. from the Arcsine distribution improves the iteration complexity from $O(\kappa)$ to $O(\kappa^{1/2})$, where $\kappa$ is the condition number. No momentum or other algorithmic modifications are required. This result is incomparable to the (deterministic) Silver Stepsize Schedule, which does not require separability but only achieves partial acceleration $O(\kappa^{\log_{1+\sqrt{2}}2}) \approx O(\kappa^{0.78})$. Our starting point is a conceptual connection to potential theory: the variational characterization for the distribution of stepsizes with fastest convergence rate mirrors the variational characterization for the distribution of charged particles with minimal logarithmic potential energy. The Arcsine distribution solves both variational characterizations due to a remarkable "equalization property" which in the physical context amounts to a constant potential over space, and in the optimization context amounts to an identical convergence rate over all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. We interpret this equalization as an extreme form of hedging: by using this random distribution over stepsizes, Gradient Descent converges at exactly the same rate for all functions in the function class.

Footnote: This result first appeared in Chapter 6 of the 2018 MS Thesis [2] of the first author (advised by the second author). This was the first convergence rate for any stepsize schedule (deterministic or randomized) that improved asymptotically over the standard unaccelerated rate for constant-stepsize GD, in any setting beyond quadratics.

1 Introduction

It is classically known that by using certain deterministic stepsize schedules $\{\alpha_t\}$, the Gradient Descent algorithm (GD)

x_{t+1} = x_t - \alpha_t \nabla f(x_t), \qquad t = 0, 1, 2, \dots \qquad (1.1)

solves convex optimization problems $\min_{x\in\mathbb{R}^d} f(x)$ to arbitrary precision from an arbitrary initialization $x_0$. This result can be found in any textbook on convex optimization, e.g., [9, 10, 40, 45, 30, 38, 6, 35]. The present paper investigates the following question:

Is it possible to accelerate the convergence of GD by using random stepsizes? (1.2)

In this paper, we show that for separable convex optimization, the answer is yes. Specifically, we provide an i.i.d. stepsize schedule that makes GD converge at the optimal accelerated rate. See §1.1 for a precise statement; here we first contextualize with the relevant literature on random stepsizes.

Note that the use of randomness for GD stepsizes is conceptually different from the standard use of randomness in Stochastic Gradient Descent (SGD). SGD has random, inexact gradient queries and typically uses deterministic stepsizes. GD has exact gradient queries, and using random stepsizes amounts to purposely injecting noise into the algorithm.

The special case of quadratics.

This question is understood (only) for optimizing convex quadratic functions. In particular, the line of work [47, 63, 32] shows that for quadratic functions $f$ that are $m$-strongly convex and $M$-smooth (i.e., $mI \preceq \nabla^2 f \preceq MI$), there is a unique optimal distribution for i.i.d. stepsizes that makes GD converge at a rate of

\left(\frac{\|x_n - x^*\|}{\|x_0 - x^*\|}\right)^{1/n} \longrightarrow \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,, \qquad (1.3)

where $x^*$ denotes the minimizer of $f$, $\kappa := \frac{M}{m}$ denotes the condition number, and this convergence is in an appropriate probabilistic sense (e.g., almost-sure or $L^1$). This stepsize distribution is most naturally stated in terms of the inverse stepsizes $\beta_t := 1/\alpha_t$ having the Arcsine distribution on the interval $(m, M)$, i.e., density

\frac{d\mu}{d\beta} = \frac{\mathds{1}_{\beta\in(m,M)}}{\pi\sqrt{(M-\beta)(\beta-m)}}\,. \qquad (1.4)

Briefly, this Arcsine distribution arises naturally for the special case of quadratics because: (1) the convergence rate is invariant w.r.t. the stepsize order and thus depends only on the empirical measure of the stepsizes; and (2) the Arcsine distribution is the limit as $n\to\infty$ of the empirical measure of the optimal $n$ (deterministic) stepsizes for GD [61]. See §2.1 for details.

The resulting convergence rate (1.3) requires roughly $\Theta(\sqrt{\kappa}\log\frac{1}{\varepsilon})$ iterations to produce an $\varepsilon$-accurate solution. Such a rate is called "accelerated" because its dependence on $\kappa$ improves over the standard $\Theta(\kappa\log\frac{1}{\varepsilon})$ iteration complexity required by GD with constant stepsizes. Moreover, this convergence rate $\Theta(\sqrt{\kappa}\log\frac{1}{\varepsilon})$ is information-theoretically optimal for any first-order algorithm [40, 39], and matches the worst-case convergence rate of accelerated algorithms that change GD beyond just its stepsizes, e.g., the Conjugate Gradient method [31], Polyak's Heavy Ball method [44], Nesterov-style momentum methods [41], and more [22]. See Table 1, left.

Figure 1: The induced distribution of stepsizes $\alpha$, for inverse stepsizes $\alpha^{-1}$ taken from the Arcsine distribution on $(m,M)$. The minimum stepsize is $1/M$, the maximum is $1/m$, and the median is $2/(M+m)$, which is the optimal value for constant stepsize schedules. For constant stepsize schedules, the dashed red line $\bar{\alpha} = 2/M$ is the threshold for convergence; larger stepsizes $\alpha$ lead to divergence, and short stepsizes $\alpha$ lead to slow convergence. This distribution optimally hedges between short and long steps. Note that the mean stepsize is $1/\sqrt{Mm}$, which is larger than the divergence threshold for constant stepsize schedules by a factor of $\Theta(\sqrt{\kappa})$. This plot sets $\kappa=10$, $m=1/\kappa$, $M=1$; the discrepancy between this distribution and standard constant stepsizes is even more dramatic for larger $\kappa$.
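The statistics in this caption are easy to check numerically. The following is a minimal sketch (ours, not from the paper) that draws inverse stepsizes via $\beta = \frac{M+m}{2} + \frac{M-m}{2}\cos(\Theta)$ with $\Theta$ uniform on $[0,\pi]$ (the standard way to sample Arcsine variates, cf. §2.1) and compares the induced stepsize statistics against the stated values:

import numpy as np

rng = np.random.default_rng(0)
kappa = 10.0
m, M = 1.0 / kappa, 1.0

# beta ~ Arcsine(m, M): beta = midpoint + halfwidth * cos(Theta), Theta ~ Unif(0, pi).
theta = rng.uniform(0.0, np.pi, size=10**7)
beta = (M + m) / 2 + (M - m) / 2 * np.cos(theta)
alpha = 1.0 / beta  # stepsizes

print("min stepsize:", alpha.min(), "  vs 1/M        =", 1 / M)
print("max stepsize:", alpha.max(), "  vs 1/m        =", 1 / m)
print("median      :", np.median(alpha), "  vs 2/(M+m)    =", 2 / (M + m))
print("mean        :", alpha.mean(), "  vs 1/sqrt(Mm) =", 1 / np.sqrt(M * m))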
Beyond quadratics?

For general convex optimization, momentum-based algorithms still achieve these optimally accelerated rates $\Theta(\sqrt{\kappa}\log\frac{1}{\varepsilon})$. However, existing acceleration results for random stepsizes do not extend beyond the special case of quadratic functions $f$—otherwise $\nabla f$ is non-linear, making GD a non-linear map, which completely breaks all previous analyses. This failure of the analyses is due to fundamental differences between quadratic optimization and convex optimization:

  • The optimal deterministic stepsizes for quadratic optimization [61] (which the random stepsizes are built upon) are provably bad for convex optimization [2].

  • The convergence rate is invariant w.r.t. the stepsize order for quadratic optimization (which motivates the opportunity for random stepsizes), but this is false for convex optimization [2].

  • Using time-varying stepsizes is equivalent to using momentum for quadratic optimization (which enables GD to achieve accelerated rates), but this is false for convex optimization [22].

Due to these challenges, it remained unclear whether random stepsizes could lead to the accelerated rate $\Theta(\sqrt{\kappa}\log\frac{1}{\varepsilon})$ for convex optimization. In fact, it was unknown even whether random stepsizes could provide an arbitrarily small improvement over the $\Theta(\kappa\log\frac{1}{\varepsilon})$ rate of constant-stepsize GD in any setting beyond quadratic optimization.

1.1 Contribution and discussion

This paper shows that random stepsizes provably accelerate GD beyond the setting of quadratics. Specifically, our main result (Theorem 1.2) shows that GD converges at the optimally accelerated rate for separable convex functions. This separability property is defined below and, for $C^2$ functions $f$, is equivalent to commutativity of the Hessians $\{\nabla^2 f(x)\}_{x\in\mathbb{R}^d}$; see Appendix B.1.

Definition 1.1 (Separable functions).

A function $f:\mathbb{R}^d\to\mathbb{R}$ is separable if it admits a decomposition of the form

f(x) = \sum_{i=1}^d f_i([Ux]_i) \qquad (1.5)

for some orthogonal matrix $U\in\mathbb{R}^{d\times d}$ and some functions $f_1,\dots,f_d:\mathbb{R}\to\mathbb{R}$.

Theorem 1.2 (Random stepsizes accelerate GD for separable convex optimization).

Consider any dimension $d$, any separable function $f:\mathbb{R}^d\to\mathbb{R}$ that is $m$-strongly convex and $M$-smooth, and any initialization point $x_0$ that is not equal to the minimizer $x^*$ of $f$. By using i.i.d. inverse stepsizes $\alpha_t^{-1}$ from the Arcsine distribution (1.4), GD converges at the rate

\left(\frac{\|x_n - x^*\|}{\|x_0 - x^*\|}\right)^{1/n} \longrightarrow \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,, \qquad (1.6)

where the convergence is in the almost-sure and $L^1$ sense. Moreover, this is the unique distribution for which GD achieves this optimal rate.

Stepsizes \ Function class          | Quadratic                        | Separable                             | Convex
Constant                            | $\Theta(\kappa)$ folklore        | $\Theta(\kappa)$ folklore             | $\Theta(\kappa)$ folklore
Deterministic (Chebyshev/Silver)    | $\Theta(\kappa^{1/2})$ [61]      | $O(\kappa^{\log_{1+\sqrt{2}}2})$ [3]  | $O(\kappa^{\log_{1+\sqrt{2}}2})$ [3]
Random (Arcsine)                    | $\Theta(\kappa^{1/2})$ [32, 47, 63] | $\Theta(\kappa^{1/2})$ Theorem 1.2 | unknown if randomness helps

Table 1: Iteration complexities for minimizing $\kappa$-conditioned functions using GD with different stepsize choices. The dependence on the accuracy is omitted as it is always $\log 1/\varepsilon$. Information-theoretic lower bounds are $\Omega(\kappa^{1/2})$ in all settings [39]. Constant stepsize schedules require $\Theta(\kappa)$ iterations; this is the textbook unaccelerated rate. For quadratics (left), accelerated rates of $\Theta(\kappa^{1/2})$ can be equivalently achieved via deterministic stepsizes [61], random stepsizes [32, 47, 63], or momentum-based methods such as Heavy Ball [44] and Conjugate Gradients [31], among others [22]. Beyond quadratics (middle, right), momentum-based methods are not equivalent to varying stepsizes. Acceleration was first achieved by Nesterov's Fast Gradient Method [41], and it has long been believed that beyond quadratics, any acceleration requires modifying GD by adding internal dynamics, e.g., momentum. [3] recently showed (conjecturally optimal) partially accelerated rates $O(\kappa^{\log_{1+\sqrt{2}}2}) \approx O(\kappa^{0.78})$ for convex optimization with deterministic stepsizes. We prove that random stepsizes enable fully accelerated rates $\Theta(\kappa^{1/2})$ beyond quadratics. Conditional on conjectures (see §7), this suggests potential gaps between deterministic and randomized stepsizes for GD, and gaps between separable and general convex optimization for GD.
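To accompany Theorem 1.2, here is a minimal numerical sketch (ours; the test function is a hypothetical choice, not from the paper). Each coordinate uses $f_i(x) = \frac{m}{2}x^2 + (M-m)\log\cosh(x)$, which makes $f$ separable, non-quadratic, $m$-strongly convex, and $M$-smooth; the empirical contraction factor should roughly match $\mathrm{R_{acc}} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$:

import numpy as np

rng = np.random.default_rng(1)
m, M = 1.0, 16.0
kappa = M / m

def grad(x):
    # f(x) = sum_i [ (m/2) x_i^2 + (M - m) log cosh(x_i) ]:
    # separable and non-quadratic, with second derivatives in (m, M].
    return m * x + (M - m) * np.tanh(x)

n, d = 600, 10
x = rng.standard_normal(d)
x0_norm = np.linalg.norm(x)
for _ in range(n):
    # Inverse stepsize beta ~ Arcsine(m, M).
    beta = (M + m) / 2 + (M - m) / 2 * np.cos(rng.uniform(0.0, np.pi))
    x = x - grad(x) / beta

print("empirical rate  :", (np.linalg.norm(x) / x0_norm) ** (1.0 / n))
print("accelerated Racc:", (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))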

We make several remarks about this result.

Optimality.

The rate in Theorem 1.2 is optimal since there is an exactly matching lower bound: there is an explicit $\kappa$-conditioned quadratic function for which any (Krylov) first-order algorithm cannot contract to the optimum faster than the rate $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ [41, Theorem 2.1.12].

Comparison to GD with different stepsize schedules.

Theorem 1.2 first appeared as Theorem 6.1 of the 2018 MS thesis [2] of the first author (advised by the second author). This was the first convergence rate for any stepsize schedule (deterministic or randomized) that improved over the textbook rate for constant-stepsize GD, in any setting beyond quadratics. Indeed, even though many stepsize schedules have been investigated in both theory and practice, none of these alternative schedules had led to an analysis that outperforms (even by an arbitrarily small amount) the "unaccelerated" contraction factor $(\frac{\|x_n-x^*\|}{\|x_0-x^*\|})^{1/n} \leqslant \frac{\kappa-1}{\kappa+1}$ of constant-stepsize GD, which corresponds to an iteration complexity of $\Theta(\kappa\log\frac{1}{\varepsilon})$. This comparison statement includes even adaptive stepsize schedules such as exact line search [42, 9, 45, 15], Armijo-Goldstein schedules [40], Polyak-type schedules [45], and Barzilai-Borwein-type schedules [5]. Some of these schedules fail to accelerate even on quadratics (and therefore also on separable convex problems), a classic example being exact line search [33, §3.2.2].

In the years since [2], a line of work has shown partially accelerated convergence rates for general convex optimization using deterministic stepsize schedules [2, 12, 23, 13, 26, 3, 4, 27, 28, 58, 62, 29, 8]. In particular, [3] showed the Silver Convergence Rate $O(\kappa^{\log_{1+\sqrt{2}}2}\log\frac{1}{\varepsilon}) \approx O(\kappa^{0.78}\log\frac{1}{\varepsilon})$ for minimizing $\kappa$-conditioned functions by using a certain fractal-like deterministic stepsize schedule called the Silver Stepsize Schedule. This rate is conjecturally optimal among all possible deterministic stepsize schedules [3] and naturally extends to non-strongly convex optimization [4]. Theorem 1.2 is incomparable: it shows a fully accelerated rate $O(\kappa^{1/2}\log\frac{1}{\varepsilon})$, but under the additional assumption of separability. This incomparability raises several interesting questions about the possible benefit of randomization for GD. The future work section §7 collects several such conjectures.

Comparison to arbitrary first-order algorithms.

Starting from the seminal work of Nesterov in 1983 [41], the traditional approach for obtaining accelerated convergence rates is to modify the GD algorithm beyond just changing stepsizes, for example by adding momentum, extra state, or other internal dynamics. These results hold for arbitrary convex optimization and do not require separability. See for example [11, 21, 52, 56] for some recent such algorithms, and see the recent survey [22] for a comprehensive account of this extensive line of work. The purpose of this paper is to investigate an alternative mechanism for acceleration: can GD be accelerated in its original form, by just choosing random stepsizes?

Generality.

The assumptions in Theorem 1.2 can be relaxed. Separability can be generalized to radial-separability, and strong convexity and smoothness can be relaxed to less stringent assumptions on the spectrum of the Hessian of $f$; see §4 for details. Note also that since GD does not depend on the choice of the basis $U$ in the separable decomposition (1.5), Theorem 1.2 only requires the existence of some such $U$ (which is the definition of separability). In particular, the algorithm does not need to know $U$. Also, Theorem 1.2 is stated for convergence in terms of distance to optimality, but extends to all standard desiderata; see §4. Theorem 1.2 also extends to settings where gradient descent is performed inexactly, e.g., due to using inexact gradients; see §6.1.

Opportunities for randomized stepsizes.

Randomized algorithms have variability in their trajectories. This provides potential opportunities for running multiple times in parallel and selecting the best run, see §5.3. This best run can be significantly better than the typical run (whose rate converges almost surely to the accelerated rate, see §5.2) and the worst run (whose rate can diverge but only with exponentially small probability, see §5.1).

Game-theoretic connections and lower bounds.

This paper considers random stepsizes in order to accelerate GD. This randomized perspective is also helpful for the dual problem of constructing lower bounds, i.e., exhibiting functions for which gradient descent converges slowly. In particular, lifting to distributions over functions leads to a game between an algorithm (who chooses a distribution over stepsizes) and an adversary (who chooses a distribution over functions) which is symmetric in that the Arcsine distribution is the optimal answer for both. This leads to new perspectives on classical lower bounds for GD; details in §6.2.

Techniques and connections to logarithmic potential theory.

A distinguishing feature of our analysis is a connection to potential theory: the variational characterization for the optimal stepsize distribution mirrors the variational characterization for a certain equilibrium distribution which configures a unit of charged particles so as to minimize the logarithmic potential energy. The optimal distribution satisfies a remarkable “equalization property” which in the physical context amounts to a constant potential over space, and in the optimization context amounts to an identical convergence rate over all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. While links between Krylov-subspace algorithms and potential theory are classical (see the excellent survey [19]), these connections are restricted to the special case of quadratics and moreover, even for that setting, these connections have not been exploited for the analysis of GD with random stepsizes. See §2 for a conceptual overview of our analysis techniques and how we exploit this equalization property to prove Theorem 1.2.

1.2 Organization

§2 provides a conceptual overview of our analysis approach. §3 proves our main result, Theorem 1.2. §4 discusses generalizations beyond separability, strong convexity, and smoothness. §5 investigates the variability from using randomized stepsizes, in terms of notions of convergence, high probability bounds, and opportunities for parallelized optimization. §6 provides auxiliary results about stability and game-theoretic interpretations. §7 describes several directions for future work that are motivated by our results. For brevity, several proofs are deferred to the appendix.

2 Conceptual overview: acceleration by random stepsizes

Here we provide informal derivations of our main result in order to overview the key conceptual ideas. We begin by briefly recalling in §2.1 how random stepsizes can accelerate GD for the special case of quadratic optimization. Then in §2.2, we describe the challenges for acceleration beyond quadratics, and our new techniques for circumventing these challenges.

2.1 Quadratic optimization

Consider running GD on the class $\mathcal{Q}$ of quadratic functions $f$ that are $m$-strongly convex and $M$-smooth. What is the optimal distribution from which to draw i.i.d. stepsizes? In order to understand this, we first recall the optimal deterministic stepsizes.

Optimal deterministic stepsizes.

Without loss of generality after translating,

f(x) = \frac{1}{2} x^T H x \qquad \text{ where } \qquad mI \preceq H \preceq MI\,. \qquad (2.1)

Since the function is quadratic, its gradient $\nabla f(x) = Hx$ is linear, hence GD is a linear map

x_{t+1} = x_t - \alpha_t \nabla f(x_t) = (I - \alpha_t H)\, x_t\,, \qquad (2.2)

and therefore the final iterate is

x_n = \prod_{t=0}^{n-1} (I - \alpha_t H)\, x_0\,. \qquad (2.3)

The convergence rate for a given stepsize schedule—in the worst case over objective functions in $\mathcal{Q}$ and initializations $x_0$—is therefore

\sup_{f\in\mathcal{Q},\, x_0\neq x^*} \left(\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right)^{1/n} = \sup_{mI\preceq H\preceq MI,\, x_0\neq x^*} \left(\frac{\|\prod_{t=0}^{n-1}(I-\alpha_t H)\, x_0\|}{\|x_0\|}\right)^{1/n}
 = \sup_{mI\preceq H\preceq MI} \left\|\prod_{t=0}^{n-1}(I-\alpha_t H)\right\|^{1/n} \qquad (2.4)
 = \sup_{\lambda\in[m,M]} \prod_{t=0}^{n-1} |1-\alpha_t\lambda|^{1/n}\,. \qquad (2.5)

Above, the penultimate step is by definition of the spectral norm, and the final step is by diagonalizing. It is a simple and celebrated fact, dating back to Young in 1953 [61], that minimizing this convergence rate (2.5) over stepsizes is equivalent to a certain extremal polynomial problem, and that as a consequence, the optimal stepsizes minimizing (2.5) are the inverses of the roots of a (translated and scaled) Chebyshev polynomial, in any order. Said more precisely: for any horizon length $n$, the optimal $n$ stepsizes are any permutation of

\left\{\left(\frac{M+m}{2}+\frac{M-m}{2}\cos\left(\frac{2t+1}{2n}\pi\right)\right)^{-1}\right\}_{t=0}^{n-1} \qquad (2.6)

and the corresponding $n$-step convergence rate in (2.5) is

\sup_{f\in\mathcal{Q},\,x_0\neq x^*}\left(\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right)^{1/n} = \left[\frac{2\,(\sqrt{\kappa}+1)^n(\sqrt{\kappa}-1)^n}{(\sqrt{\kappa}+1)^{2n}+(\sqrt{\kappa}-1)^{2n}}\right]^{1/n} \overset{n\to\infty}{\longrightarrow} \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,. \qquad (2.7)
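For concreteness, the following sketch (ours, not from the paper) constructs the Chebyshev stepsizes (2.6) and evaluates the worst-case $n$-step rate (2.5) on a fine grid of curvatures $\lambda\in[m,M]$, for comparison against the closed form (2.7):

import numpy as np

m, M, n = 1.0, 10.0, 50
kappa = M / m

# Chebyshev stepsizes (2.6): inverses of the roots of a translated, scaled Chebyshev polynomial.
t = np.arange(n)
alphas = 1.0 / ((M + m) / 2 + (M - m) / 2 * np.cos((2 * t + 1) / (2 * n) * np.pi))

# Worst-case n-step rate (2.5), approximated on a grid of curvatures lambda.
lams = np.linspace(m, M, 10**5)
log_rates = np.log(np.abs(1.0 - np.outer(alphas, lams))).sum(axis=0) / n
print("grid rate        :", np.exp(log_rates.max()))

closed = (2 * ((np.sqrt(kappa) + 1) * (np.sqrt(kappa) - 1)) ** n
          / ((np.sqrt(kappa) + 1) ** (2 * n) + (np.sqrt(kappa) - 1) ** (2 * n))) ** (1.0 / n)
print("closed form (2.7):", closed)
print("asymptotic Racc  :", (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))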
Optimal random stepsizes.

Much less well-known is the fact that this accelerated rate can also be asymptotically achieved by simply using i.i.d. stepsizes from an appropriate distribution [32, 47, 63]. The intuition is that the convergence rate (2.5) is invariant in the order of the stepsizes, instead depending only on their empirical measure, and as $n\to\infty$ one can achieve the same empirical measure (and thus the same convergence rate) by replacing the carefully chosen deterministic Chebyshev stepsize schedule (2.6) with i.i.d. stepsizes from their limiting empirical measure. See [2, Chapter 3] for a discussion of this through the lens of exchangeability and De Finetti's Theorem.

Concretely, if the inverse stepsizes $\beta_t := 1/\alpha_t$ are i.i.d. from some distribution $\mu$, then (forgoing measure-theoretic technicalities) the Law of Large Numbers ensures that the $n$-step convergence rate in (2.5) converges almost surely and in expectation to

\sup_{f\in\mathcal{Q},\,x_0\neq x^*}\left(\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right)^{1/n} = \sup_{\lambda\in[m,M]}\prod_{t=0}^{n-1}|1-\alpha_t\lambda|^{1/n}
 = \exp\left(\sup_{\lambda\in[m,M]}\frac{1}{n}\sum_{t=0}^{n-1}\log|1-\beta_t^{-1}\lambda|\right)
 \overset{n\to\infty}{\longrightarrow} \exp\left(\sup_{\lambda\in[m,M]}\mathbb{E}_{\beta\sim\mu}\log|1-\beta^{-1}\lambda|\right). \qquad (2.8)

This informal derivation suggests that the distribution $\mu$ minimizing the asymptotic convergence rate (2.8) should be the appropriate limit as $n\to\infty$ of the empirical measure for the inverse Chebyshev stepsize schedule (this limit is the $\mathrm{Arcsine}(m,M)$ distribution, since $a + b\cos(\Theta)$ has the Arcsine distribution on $[a-b, a+b]$ if $\Theta$ is uniformly distributed on $[0,\pi]$, see Figure 2), and moreover the optimal value should be the accelerated rate $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$. This intuition is formalized by the following lemma.

Figure 2: As $n\to\infty$, the empirical distribution (yellow) of the $n$-step inverse Chebyshev stepsize schedule (2.6) converges to the Arcsine distribution (blue). Panels show (a) $n = 10^2$, (b) $n = 10^3$, (c) $n = 10^4$; the fit is increasingly accurate as $n$ increases. Demonstrated here for $m=1$ and $M=10$.
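A sketch (ours, assuming matplotlib for plotting) reproducing the comparison in Figure 2: histogram the inverse Chebyshev stepsizes (2.6) and overlay the Arcsine density (1.4).

import numpy as np
import matplotlib.pyplot as plt

m, M = 1.0, 10.0
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, [10**2, 10**3, 10**4]):
    t = np.arange(n)
    # Inverse Chebyshev stepsizes, cf. (2.6).
    betas = (M + m) / 2 + (M - m) / 2 * np.cos((2 * t + 1) / (2 * n) * np.pi)
    ax.hist(betas, bins=50, density=True, label="empirical")
    lam = np.linspace(m + 1e-3, M - 1e-3, 500)
    ax.plot(lam, 1.0 / (np.pi * np.sqrt((M - lam) * (lam - m))), label="Arcsine density (1.4)")
    ax.set_title(f"n = {n}")
axes[0].legend()
plt.show()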
Lemma 2.1 (Extremal property of the Arcsine distribution).

The extremal problem

\inf_\mu \sup_{\lambda\in[m,M]} \mathbb{E}_{\beta\sim\mu} \log\left\lvert 1-\beta^{-1}\lambda\right\rvert \qquad (2.9)

has optimal value $\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, achieved by the unique optimal solution $\mu = \mathrm{Arcsine}(m,M)$.

This lemma was proven in the optimization literature via a long, tour-de-force integral calculation [32]. In Appendix A, we present an alternative, short proof from Altschuler’s thesis [2] that argues by connecting this extremal problem for stepsize distributions to a classical extremal problem in potential theory about electrostatic capacitors. The key feature of this connection is the following equalization property of the Arcsine distribution.

Lemma 2.2 (Equalization property of the Arcsine distribution).

For all $\lambda\in[m,M]$,

\mathbb{E}_{\beta\sim\mathrm{Arcsine}(m,M)} \log\left\lvert 1-\beta^{-1}\lambda\right\rvert = \log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,. \qquad (2.10)

In particular, this value does not depend on $\lambda$.

This equalization property uniquely defines the Arcsine distribution in that there is no other distribution satisfying (2.10); see Appendix A. In words, this property shows that the Arcsine distribution "equalizes" the convergence rate over all quadratic functions (parameterized by their curvatures $\lambda$). We emphasize that this is not just an algebraic identity: this equalization property also has a natural interpretation in terms of a game between an optimization algorithm (who chooses a stepsize to minimize the convergence rate) and an adversary (who chooses a quadratic function to maximize the convergence rate), see §6.2. In Appendix A we elaborate on the connections to potential theory, and leverage that connection to provide short proofs of these lemmas.
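Lemma 2.2 is also easy to probe numerically. The following Monte Carlo sketch (ours) estimates the left-hand side of (2.10) at several curvatures $\lambda$; the logarithmic singularity at $\beta = \lambda$ is integrable, so plain sampling suffices:

import numpy as np

rng = np.random.default_rng(2)
m, M = 1.0, 10.0
kappa = M / m

# beta ~ Arcsine(m, M), sampled via the cosine of a uniform angle.
beta = (M + m) / 2 + (M - m) / 2 * np.cos(rng.uniform(0.0, np.pi, size=10**7))

for lam in [m, 2.5, 5.0, 7.5, M]:
    est = np.mean(np.log(np.abs(1.0 - lam / beta)))
    print(f"lambda = {lam:5.2f}: E log|1 - lam/beta| ~ {est:.4f}")
print("log Racc      =", np.log((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)))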

2.2 Beyond quadratic optimization

We now turn to accelerating GD beyond the special case of quadratic functions $f$. The central challenge is that the curvature, as measured by the Hessian $\nabla^2 f$, is no longer constant. This breaks the above quadratic analysis in several ways since it is no longer possible to summarize or re-parameterize the Hessians along the GD trajectory in terms of a single scalar $\lambda\in[m,M]$. To explain these issues, let us attempt to adapt this analysis to the general convex setting, assuming for simplicity here that $f$ is twice differentiable (our actual analysis in §3 avoids any such assumption).

Time-varying curvature.

Clearly, if $f$ is not quadratic, then the gradient $\nabla f$ is not linear, and as a consequence the GD map is not linear. Assuming as before that $x^* = 0$ without loss of generality, one can generalize the linear GD map (2.3) to the non-quadratic setting as

x_n = \prod_{t=0}^{n-1}(I-\alpha_t H_t)\, x_0 \qquad\text{where}\qquad H_t := \int_0^1 \nabla^2 f(u\, x_t)\, du\,, \qquad (2.11)

since by the Fundamental Theorem of Calculus, $\nabla f(x_t) = H_t x_t$ and thus $x_{t+1} = x_t - \alpha_t \nabla f(x_t) = (I - \alpha_t H_t)x_t$. The interpretation of $H_t$ is that it is the average Hessian between $x^* = 0$ and $x_t$, i.e., it measures the relevant local curvature in the $t$-th iteration of GD. In the quadratic setting, $H_t = H$ for all $t$ since the Hessian $\nabla^2 f \equiv H$ is constant, which recovers (2.3) from (2.11). However, for the general non-quadratic setting, the curvature $H_t$ varies in $t$. This is the overarching challenge mentioned above. For the purpose of accelerating GD, this challenge manifests in two interrelated ways: the lack of commutativity and predictability of the curvature $H_t$ along the trajectory of GD. Below we describe both challenges and how we circumvent them.

2.2.1 Issue 1: commutativity (deriving the extremal problem)

The first issue is that for convex functions, the Hessians at different points do not necessarily commute. Thus, in general, $\{H_t\}$ do not commute and therefore are not simultaneously diagonalizable. This precludes the key step (2.5) in the quadratic analysis above, which decomposes the convergence rate across eigendirections, thereby reducing the (multivariate) problem of convergence analysis to an analytically tractable (univariate) extremal problem. This is the only reason we assume separability: it is equivalent to commuting Hessians (Lemma B.1). Concretely, separability allows us to show an analog of (2.5) in which the convergence rate of a stepsize schedule $\{\alpha_t\}$ is bounded by

\prod_{t=0}^{n-1} |1-\alpha_t\lambda_t|^{1/n}\,, \qquad (2.12)

the difference to the quadratic case being that there all $\lambda_t\in[m,M]$ are the same, whereas now $\lambda_t\in[m,M]$ can be time-varying and can depend on all previous stepsizes $\{\alpha_s\}_{s<t}$.

By taking an appropriate limit (see Footnote 1 below) of the $n$-step convergence rate (2.12), it can be shown that the (logarithm of the) asymptotic rate for i.i.d. inverse stepsizes $\beta_t := \alpha_t^{-1}$ from a distribution $\mu$ is governed by the extremal problem:

\inf_\mu \sup_{\{\mathcal{A}_t\}\text{-adapted process }\{\lambda_t\in[m,M]\}} \limsup_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1} \mathbb{E}_{\beta_t\sim\mu} \log\left\lvert 1-\beta_t^{-1}\lambda_t\right\rvert \qquad (2.13)

where $\{\mathcal{A}_t\}$ denotes the natural filtration corresponding to the inverse stepsizes $\{\beta_t\}$. Observe that (2.13) recovers the extremal problem (2.8) for the quadratic case if the $\lambda_t$ are constant, but in general is different because $\lambda_t$ can be time-varying. This brings us to the second issue.

Footnote 1: A subtle but important technical issue is that one cannot simply use the Law of Large Numbers to derive the asymptotic convergence rate from the $n$-step rate (c.f. (2.8) for how that worked in the quadratic case). Indeed, in the convex case here, the relevant empirical sum $\frac{1}{n}\sum_{t=0}^{n-1}\log|1-\beta_t^{-1}\lambda_t|$ has non-i.i.d. summands: they are neither identically distributed (since $\lambda_t$ is time-varying) nor independent (since $\lambda_t$ can depend on previous stepsizes). Briefly, we overcome the identically-distributed issue by using the equalization property of the Arcsine distribution (Lemma 2.2) to show that these summands still have the desired expectation, and we overcome the independence issue by using a martingale-based argument to show that the fresh randomness in $\alpha_t$ makes these summands "independent enough" to prove a specially tailored version of Kolmogorov's Strong Law of Large Numbers for this setting. Details in §3.1.

2.2.2 Issue 2: unpredictability (solving the extremal problem)

The second issue is that the curvature is time-varying (as measured by $\lambda_t$), thus less predictable, which makes it difficult to choose stepsizes that lead to fast convergence (as measured by the extremal problem (2.13)). The key upshot of our random stepsizes is that the randomness of $\beta_t$ prevents an adversarial choice of $\lambda_t\in[m,M]$, since $\lambda_t$ must be "chosen" before $\beta_t$ is drawn. More precisely, we exploit the equalization property of the Arcsine distribution (Lemma 2.2) to argue that for any choice of $\lambda_t$, the distribution of $\beta_t$ ensures that

\mathbb{E}_{\beta_t\sim\mu}\log|1-\beta_t^{-1}\lambda_t| = \log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,. \qquad (2.14)

Intuitively, this can be interpreted as using random stepsizes to hedge between the possible curvatures $\lambda_t$: the expected "progress" from an iteration is the same regardless of the curvature. Quantitatively, this equalization property (2.14) lets us argue an accelerated convergence rate using a martingale version of the Law of Large Numbers. In particular, this implies that the Arcsine distribution—which was optimal for the extremal problem (2.9) in the quadratic setting—achieves the same value for the extremal problem (2.13) in the convex setting, and therefore is also optimal for this setting (as it is more general).
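The following sketch (ours) illustrates this hedging numerically: an adversary picks each curvature $\lambda_t\in[m,M]$ as an arbitrary function of the past inverse stepsizes (one deliberately nasty, hypothetical choice is hard-coded below), yet the running average of $Z_t = \log|1-\beta_t^{-1}\lambda_t|$ still tends to $\log\mathrm{R_{acc}}$:

import numpy as np

rng = np.random.default_rng(3)
m, M = 1.0, 10.0
kappa = M / m
n = 10**6

betas = (M + m) / 2 + (M - m) / 2 * np.cos(rng.uniform(0.0, np.pi, size=n))
# Adversary: lambda_t may be any [m, M]-valued function of beta_0, ..., beta_{t-1};
# here lambda_t = clip(2 * beta_{t-1} - m, m, M), an arbitrary predictable choice.
lams = np.clip(2.0 * np.concatenate(([M], betas[:-1])) - m, m, M)
Z = np.log(np.abs(1.0 - lams / betas))

print("running average of Z_t:", Z.mean())
print("log Racc              :", np.log((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)))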

Remark 2.3 (Overcoming unpredictability via random hedging).

Observe that in the extremal problem (2.13), we allow $\{\lambda_t\}$ to vary arbitrarily in $t$, and in particular we do not enforce these curvatures to be consistent with a single convex function. This consistency is the defining feature of Performance Estimation Problems [20, 53] and is essential to the line of work on accelerating gradient descent using deterministic stepsize schedules [2, 12, 23, 13, 26, 3, 4, 27, 28, 58, 62, 29, 8]. Remarkably, random stepsizes obtain the fully accelerated rate without needing to enforce any consistency conditions. This can be interpreted through the lens of "hedging" between worst-case functions, as described in [2]. On one hand, the opportunity for faster convergence via deterministic time-varying stepsizes arises due to the rigidity of convex functions: a function being worst-case for the stepsize in one iteration may constrain its curvature from being worst-case in the next iteration. On the other hand, the opportunity for faster convergence via random stepsizes arises because the worst-case function does not "know" the realization of the random stepsizes. It is an interesting question if these two opportunities can be combined, e.g., can randomized analyses also exploit consistency conditions on $\{\lambda_t\}$ in order to generalize beyond separability? See §7.

3 Proof of main result

Here we prove Theorem 1.2. We build up to the full result by first proving it for univariate objectives, and then extending to the general (separable) multivariate setting. See §2 for a high-level overview of the analysis techniques and the two key conceptual challenges regarding unpredictability of the Hessian (dealt with in the univariate proof via the equalization property of the Arcsine distribution) and commutativity of the Hessian (dealt with in the multivariate proof via separability). For shorthand, we denote the accelerated rate by

\mathrm{R_{acc}} := \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,. \qquad (3.1)

3.1 Univariate setting

We begin by remarking that unforeseen complications can occur even in the univariate setting. For example, Polyak's Heavy Ball method fails to converge for certain such univariate functions [37, §4.6].

We first state a simple helper lemma that lets us avoid assumptions of twice-differentiability when expanding the convergence rate of GD in the form (2.12) described in the overview §2.2.

Lemma 3.1.

Suppose that $f:\mathbb{R}\to\mathbb{R}$ is $m$-strongly convex and $M$-smooth, and let $x^*$ denote its minimizer. Then for any $x\neq x^*$, it holds that

m \leqslant \frac{f'(x)}{x-x^*} \leqslant M\,.

For intuition, if $f$ is twice-differentiable, then this lemma follows immediately from the Mean Value Theorem since $f'(x) = f''(z)(x-x^*)$ for some $z$ between $x$ and $x^*$, and the second derivative $f''(z)\in[m,M]$ by $m$-strong convexity and $M$-smoothness. The proof is not much more difficult without the assumption of twice-differentiability.

Proof.

For the upper bound, $M$-smoothness of $f$ implies $|f'(x)| = |f'(x) - f'(x^*)| \leqslant M|x-x^*|$. Note that $\mathrm{sign}(f'(x) - f'(x^*)) = \mathrm{sign}(x-x^*)$ since $f$ is univariate (so $f'$ is non-decreasing by convexity). Thus, by re-arranging, we obtain the desired inequality $f'(x)/(x-x^*) \leqslant M$. The lower bound follows from an analogous argument, since $m$-strong convexity of $f$ implies $|f'(x) - f'(x^*)| \geqslant m|x-x^*|$. ∎

Proof of Theorem 1.2 for univariate $f$.

Let $\beta_t$ denote the inverse stepsizes; these are i.i.d. $\mathrm{Arcsine}(m,M)$. Defining $\lambda_t := \frac{f'(x_t)}{x_t - x^*}$, we have by definition of GD that

x_{t+1} - x^* = x_t - x^* - \beta_t^{-1} f'(x_t) = (1 - \beta_t^{-1}\lambda_t)(x_t - x^*).

Therefore for any $n\in\mathbb{N}$, the average rate after $n$ steps is

R_n := \left|\frac{x_n - x^*}{x_0 - x^*}\right|^{1/n} = \left|\prod_{t=0}^{n-1}(1-\beta_t^{-1}\lambda_t)\right|^{1/n}. \qquad (3.2)

Note that this sequence of random variables is uniformly integrable. Indeed for each $n\in\mathbb{N}$, a crude bound gives $|\prod_{t=0}^{n-1}(1-\beta_t^{-1}\lambda_t)|^{1/n} \leqslant \max_{\beta,\lambda\in[m,M]} |1-\beta^{-1}\lambda| = \kappa - 1 < \infty$. Therefore, by the Dominated Convergence Theorem, $L_1$ convergence of $R_n$ to $\mathrm{R_{acc}}$ will follow immediately from almost-sure convergence. To show the latter, note that since almost-sure convergence is preserved under continuous mappings (in particular the $\exp(\cdot)$ function), it suffices to show that

\frac{1}{n}\sum_{t=0}^{n-1} Z_t \overset{\text{a.s.}}{\longrightarrow} \log \mathrm{R_{acc}}\,,

where we define

Z_t := \log|1-\beta_t^{-1}\lambda_t|\,.

As mentioned in the overview section §2.2, the issue with applying the Law of Large Numbers is that $\{Z_t\}$ are not i.i.d. (For example, even though we show below that these random variables have the desired first moments—which allows us to adapt the Law of Large Numbers—one can check that $\{Z_t\}$ are neither independent nor identically distributed.)

Now observe that $\lambda_t$ depends on the realizations of only the previous $\beta_0,\dots,\beta_{t-1}$. Moreover, $\lambda_t$ lies inside $[m,M]$ by Lemma 3.1. Thus by the equalization property of the Arcsine distribution (Lemma 2.2), $\mathbb{E}[Z_t] = \log\mathrm{R_{acc}}$ for all $t$, even if $\lambda_t$ is chosen adversarially as a function of the previous $\beta_0,\dots,\beta_{t-1}$. Indeed, by the independence of $\{\beta_t\}$ and the Law of Iterated Expectations:

\mathbb{E}[Z_t] = \mathbb{E}_{\beta_0,\dots,\beta_t}\left[\log\left\lvert 1-\beta_t^{-1}\lambda_t\right\rvert\right]
 = \mathbb{E}_{\beta_0,\dots,\beta_{t-1}}\left[\mathbb{E}_{\beta_t}\left[\log\left\lvert 1-\beta_t^{-1}\lambda_t\right\rvert \,\middle|\, \beta_0,\dots,\beta_{t-1}\right]\right]
 = \mathbb{E}_{\beta_0,\dots,\beta_{t-1}}\left[\log \mathrm{R_{acc}}\right]
 = \log \mathrm{R_{acc}}. \qquad (3.3)

Therefore it suffices to show

\frac{1}{n}\sum_{t=0}^{n-1}\left(Z_t - \mathbb{E}[Z_t]\right) \overset{\text{a.s.}}{\longrightarrow} 0. \qquad (3.4)

We show this by adapting the standard proof of Kolmogorov's Strong Law of Large Numbers; see, e.g., [59] for the classical proof. We first argue that

M_n := \sum_{t=0}^{n-1} \frac{Z_t - \mathbb{E}[Z_t]}{t+1}

forms a martingale w.r.t. its own filtration $\{F_n\}$, as stated in the following lemma. Intuitively this follows from the calculation (3.3); a formal proof is deferred to Appendix B.2.

Lemma 3.2 (Martingale helper lemma).

$\{M_n\}$ is an $L_1$-bounded martingale w.r.t. $\{F_n\}$.

Thus we may apply the Martingale Convergence Theorem (see e.g., [59, §11.5]) to establish that $M_\infty := \lim_{n\to\infty} M_n$ exists and is finite almost surely. Therefore by Kronecker's Lemma (recalled below), we conclude that $\frac{1}{n}\sum_{t=0}^{n-1}(Z_t - \mathbb{E}[Z_t]) \overset{\text{a.s.}}{\longrightarrow} 0$, proving (3.4) as desired. ∎

Lemma 3.3 (Kronecker’s Lemma).

Let $\{z_n\}$ and $\{b_n\}$ be real-valued sequences such that the $b_n$ are strictly positive and strictly increasing to $\infty$, and $\sum_{t=0}^\infty \frac{z_t}{b_t}$ converges to a finite limit. Then $\lim_{n\to\infty} \frac{1}{b_n}\sum_{t=0}^{n-1} z_t = 0$.

3.2 Multivariate setting

Here we prove Theorem 1.2 in the general multivariate setting. The proof follows quickly from the univariate setting proved above and the definition of separability.

Proof of Theorem 1.2.

First, note that it suffices to prove the theorem for functions which are separable in the standard basis, i.e., functions $f$ of the form

f(x) = \sum_{i=1}^d f_i([x]_i)\,. \qquad (3.5)

Indeed, consider a general separable function $f(x) = \sum_{i=1}^d f_i([Ux]_i)$ where $U\in\mathbb{R}^{d\times d}$ is orthogonal. Then by the Chain Rule, $\nabla f(x) = U^T \nabla h(Ux)$ where $h(y) := \sum_{i=1}^d f_i([y]_i)$. Hence running GD $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$ on $f$ is equivalent to running GD $y_{t+1} = y_t - \alpha_t \nabla h(y_t)$ on $h$, in the sense that $y_t = Ux_t$ for all $t$. Thus, modulo an isometric change of coordinates which does not affect the convergence rate, we may assume henceforth that the objective function is separable of the form (3.5).

Since $f(x) = \sum_{i=1}^d f_i([x]_i)$, the $i$-th coordinate of $\nabla f(x)$ is $f_i'([x]_i)$. Thus the GD update $x_{t+1} = x_t - \beta_t^{-1}\nabla f(x_t)$ can be re-written coordinate-wise as

[x_{t+1}]_i = [x_t]_i - \beta_t^{-1} f_i'([x_t]_i)\,, \qquad \forall i\in[d]\,. \qquad (3.6)

Now we make two observations. First, (3.6) is the update that GD would perform to minimize $f_i$ from initialization $[x_0]_i$. Second, each $f_i$ is $m$-strongly convex and $M$-smooth (because strong convexity and smoothness are preserved under restrictions). Thus we may apply the univariate setting of Theorem 1.2 (proved above) to argue that the asymptotic convergence rate is

\lim_{n\to\infty}\left(\frac{\left\lvert[x_n]_i-[x^*]_i\right\rvert}{\left\lvert[x_0]_i-[x^*]_i\right\rvert}\right)^{1/n} \overset{\text{a.s.}}{=} \mathrm{R_{acc}}

for every coordinate $i\in S := \{i\in[d] : [x_0]_i \neq [x^*]_i\}$ for which the initialization is not already optimal. Note that $x_0 \neq x^*$ ensures $S$ is non-empty. Since $[x^*]_i$ is the unique minimizer of $f_i$, and since the number of coordinates $i\in S$ is finite, it follows that

\lim_{n\to\infty}\left(\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right)^{1/n} \overset{\text{a.s.}}{=} \lim_{n\to\infty}\max_{i\in S}\left(\frac{\left\lvert[x_n]_i-[x^*]_i\right\rvert}{\left\lvert[x_0]_i-[x^*]_i\right\rvert}\right)^{1/n} \overset{\text{a.s.}}{=} \mathrm{R_{acc}}\,.

This establishes the desired almost-sure convergence. $L_1$ convergence then follows from the Dominated Convergence Theorem, since we have uniform integrability by a similar argument as in the univariate setting above. ∎

4 Generalizations

The purpose of this paper is to show that random stepsizes can accelerate GD on classes of strongly convex and smooth functions that are more general than quadratics. For concreteness, Theorem 1.2 focuses on functions that are separable, strongly convex, and smooth. But this result extends to other classes of functions. We describe two extensions below, thematically splitting relaxations of the separability assumption (§4.1) from the strong convexity and smoothness assumptions (§4.2). These two types of relaxations are complementary in the sense that they individually deal with the two key issues discussed in §2.2: commutativity and unpredictability of the Hessian, respectively.

Remark 4.1 (Performance metrics).

We also mention that throughout, we measure convergence via the contraction factor $(\frac{\|x_n-x^*\|}{\|x_0-x^*\|})^{1/n}$. The results are unchanged if the progress measure is changed from distance to function suboptimality or gradient norm, at either the initialization $x_0$, the termination $x_n$, or both. This is because for $\kappa$-conditioned functions, all these progress measures $\|x-x^*\|^2$, $f(x)-f(x^*)$, and $\|\nabla f(x)\|^2$ are equivalent up to a multiplicative factor $c$ that is independent of $n$ (it depends only on the strong convexity and smoothness parameters), and therefore in the limit $n\to\infty$, the convergence rate is unaffected by changing these performance metrics since $\lim_{n\to\infty} c^{1/n} = 1$.

4.1 Beyond separability

Recall from Definition 1.1 that separability amounts to decomposability into univariate functions. Theorem 1.2 holds more generally for functions which are “radially-separable” in that they decompose into radial functions. This is formally defined as follows.

Definition 4.2 (Radial functions).

A function $r:\mathbb{R}^d\to\mathbb{R}$ is radial if it admits a decomposition of the form

r(x) = h(\|x - x^*\|) \qquad (4.1)

for some function $h:\mathbb{R}\to\mathbb{R}$ and point $x^*\in\mathbb{R}^d$.

Definition 4.3 (Radially-separable functions).

A function $f:\mathbb{R}^d\to\mathbb{R}$ is radially-separable if it admits a decomposition of the form

f(x) = \sum_{i=1}^b r_i([Ux]_{S_i})\,, \qquad (4.2)

for some orthogonal matrix $U\in\mathbb{R}^{d\times d}$, partition $S_1,\dots,S_b$ of $[d]$, and radial functions $r_1,\dots,r_b$. (Above, $[Ux]_{S_i}$ is shorthand for the subset of coordinates of $Ux$ in the set $S_i$.)

Clearly, radial-separability generalizes both separability (consider the partition $S_i = \{i\}$) and radiality (consider the trivial partition $S_1 = [d]$).

Theorem 4.4 (Random stepsizes accelerate GD on radially-separable functions).

Theorem 1.2 is unchanged if the assumption of separability is replaced with radial-separability.

Proof.

The proof is mostly similar to the proof of Theorem 1.2. By the same argument (see §3.2), it suffices to prove the accelerated convergence rate for a block of the decomposition. So, without loss of generality we may assume that $f$ is radial. In words, the key idea behind the rest of the proof is that although radial functions do not have commutative Hessians—which we exploited in our main result to reduce the (multivariate) problem of convergence analysis to an analytically tractable (univariate) extremal problem, see the overview §2.2—radial functions are univariate in nature (in particular all gradients in a GD trajectory are colinear), and this enables a similar reduction.

Formally, note that the center $x^*$ of a convex radial function is a minimizer. (Indeed, for any $x\in\mathbb{R}^d$, we have by radiality and then Jensen's inequality that $f(x) = \mathbb{E}_z f(z) \geqslant f(x^*)$, where the expectation is over $z$ drawn uniformly at random from the sphere of radius $\|x-x^*\|$ around $x^*$.) Thus after a possible translation, we may assume without loss of generality that $x^* = 0$, so that

f(x) = h(\|x\|^2)

for some univariate function $h:\mathbb{R}\to\mathbb{R}$. The chain rule implies $\nabla f(x) = 2h'(\|x\|^2)\,x$, hence the GD update $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$ expands to

x_n = \prod_{t=0}^{n-1}(1-\alpha_t\lambda_t)\, x_0 \qquad\text{ where }\qquad \lambda_t := 2h'(\|x_t\|^2)\,. \qquad (4.3)

It is straightforward to show that $\lambda_t\in[m,M]$, e.g., this follows from Lemma 3.1 after restricting $f$ to the line between $x^*$ and $x_t$. Thus this gives an alternative approach for reducing the (multivariate) convergence rate to the same (univariate) product form (2.12), at which point the rest of our analysis goes through unchanged. ∎
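A sketch (ours; the radial test function is a hypothetical choice, not from the paper) of the reduction (4.3): GD iterates on a radial function stay colinear with $x_0$, and the empirical rate matches $\mathrm{R_{acc}}$. Here $f(x) = \frac{M}{2}\|x\|^2 - (M-m)(\sqrt{1+\|x\|^2} - 1)$, whose Hessian eigenvalues all lie in $[m, M)$:

import numpy as np

rng = np.random.default_rng(4)
m, M = 1.0, 16.0
kappa = M / m

def grad(x):
    # f(x) = (M/2)||x||^2 - (M - m)(sqrt(1 + ||x||^2) - 1): radial and non-quadratic.
    # Gradient is 2 h'(||x||^2) x with 2 h'(s) = M - (M - m)/sqrt(1 + s) in [m, M).
    return (M - (M - m) / np.sqrt(1.0 + x @ x)) * x

n, d = 600, 5
x0 = rng.standard_normal(d)
x = x0.copy()
for _ in range(n):
    beta = (M + m) / 2 + (M - m) / 2 * np.cos(rng.uniform(0.0, np.pi))
    x = x - grad(x) / beta

cos_angle = abs(x @ x0) / (np.linalg.norm(x) * np.linalg.norm(x0))
print("colinearity |cos(x_n, x_0)|:", cos_angle)  # should be 1: iterates stay on the line through x_0
print("empirical rate             :", (np.linalg.norm(x) / np.linalg.norm(x0)) ** (1.0 / n))
print("accelerated Racc           :", (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))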

4.2 Beyond strong convexity and smoothness

Recall that for twice-differentiable functions $f$, the assumptions of $m$-strong convexity and $M$-smoothness are equivalent to the assumption that the spectrum of $\nabla^2 f(x)$ lies in the interval $[m,M]$ for all points $x$. The algorithmic techniques developed in this paper extend to functions $f$ whose Hessians have more structured spectra, lying within some set $E\subset\mathbb{R}_{\geqslant 0}$ that is not necessarily an interval. Note that (strong) convexity of $f$ ensures a (strictly) positive real spectrum (see Footnote 2). Such extensions to more structured Hessians have led to improved results for other algorithms in quadratic settings and beyond, see e.g., [34, 43, 24].

Footnote 2: In the case of (non-strongly) convex, smooth optimization, we have $m = 0$ and $E = [0, M]$. For that case, many of the ideas in this paper still apply at a high level, but the singularity at $\lambda = 0$ makes the problem degenerate in ways that must be addressed with refined versions of these techniques. This will be discussed in a later paper.

This extension is rooted in connections to logarithmic potential theory, and proceeds as follows. Recall from the overview §2 that the asymptotic convergence rate of GD with inverse stepsizes i.i.d. from a distribution $\mu$ is dictated by the maximum logarithmic potential of $\mu$ over $[m,M]$, i.e.,

\sup_{\lambda\in[m,M]} \mathbb{E}_{\beta\sim\mu} \log|1-\beta^{-1}\lambda|\,, \qquad (4.4)

and that the unique distribution minimizing this quantity is the Arcsine distribution (Lemma 2.1). This extends to spectral sets $E\subset\mathbb{R}_{\geqslant 0}$ beyond intervals with little change. Indeed, by an identical derivation, the convergence rate is dictated by the analogous extremal problem

\sup_{\lambda\in E} \mathbb{E}_{\beta\sim\mu} \log|1-\beta^{-1}\lambda|\,, \qquad (4.5)

and the unique distribution minimizing this quantity is the optimal stepsize distribution. What is this optimal distribution? This is a classical topic in potential theory, and there is well-developed machinery (typically based on computing conformal mappings, Green's functions, or Fekete sets) for determining this distribution explicitly in closed form when possible (typically this occurs only for very simple sets) or otherwise approximating it; we refer the interested reader to the surveys and textbooks [48, 50, 49, 19].

5 Opportunities and perils of random stepsizes

Here we investigate the implications of the variability arising from random stepsizes, see Figure 3. In §5.1, we show that trajectories can diverge but only with exponentially low probability. In §5.2, we investigate typical trajectories, showing that with overwhelming probability, their rate converges to the accelerated rate. In §5.3, we discuss running multiple trajectories in parallel and selecting the best.

Figure 3: The randomness of the proposed stepsizes leads to variability between runs. Illustrated here with a boxplot for 500 runs (7 full trajectories shown) from the same initialization on a univariate quadratic function, with $\kappa = 200$; similar phenomena occur for other problem instances. The horizontal axis is the number of iterations $n$; the vertical axis is $\log_{10}\frac{\|x_n-x^*\|}{\|x_0-x^*\|}$. The worst run can diverge exponentially, but this occurs with exponentially low probability (details in §5.1). The median run converges exponentially at the optimal accelerated rate $\mathrm{R_{acc}}$ (Theorem 1.2). The fit is extremely precise, as shown by the bolded blue line. The best run can be substantially better than the median run, leading to the possibility of faster parallel optimization (details in §5.3).

5.1 Divergent runs and notion of convergence

When considering randomized optimization algorithms, there are several notions for measuring the convergence rate. Two natural ones are:

\mathbb{E}\left[\left(\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right)^{1/n}\right] \qquad\text{ vs }\qquad \mathbb{E}\left[\frac{\|x_n-x^*\|}{\|x_0-x^*\|}\right]^{1/n}\,. \qquad (5.1)

These are different. Indeed, our main result shows that by running GD with random stepsizes, the former quantity is asymptotically equal to the accelerated rate $\mathrm{R_{acc}} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} < 1$. Whereas the latter quantity can be larger than $1$ (details in Appendix B.3). Why the discrepancy between these two notions of convergence rate? Which is relevant for optimization?

The discrepancy arises because the error $\|x_n - x^*\|$ is exponentially small with overwhelming probability (leading to our convergence result for the first notion), but can be exponentially large with exponentially small probability (leading to the instability of the second notion). The key point is that for optimization algorithms, it is very desirable to succeed with overwhelming probability, even if performance is poor in the unlikely complement event: simply re-run the algorithm if that happens. This is why we study the quantity on the left hand side of (5.1). For intuition, we conclude this section with an illustrative toy example that explains this discrepancy in its simplest form.

Example 5.1 (Measuring convergence when exponentially rare trajectories can be exponentially large).

Consider i.i.d. random variables $Z_t$ which are each equally likely to be $1/9$ or $3$. Then $(\prod_{t=1}^n Z_t)^{1/n} = 3^{\frac{1}{n}\sum_{t=1}^n \log_3 Z_t} \to 3^{\mathbb{E}[\log_3 Z_1]} = 3^{-1/2} < 1$ by the Law of Large Numbers, hence

\lim_{n\to\infty}\mathbb{E}\left[\left(\prod_{t=1}^n Z_t\right)^{1/n}\right] = \frac{1}{\sqrt{3}} < 1\,. \qquad (5.2)

Whereas

\lim_{n\to\infty}\mathbb{E}\left[\prod_{t=1}^n Z_t\right]^{1/n} = \mathbb{E}[Z_1] = \frac{14}{9} > 1\,. \qquad (5.3)

In words, this discrepancy arises because $\prod_{t=1}^n Z_t$ can be exponentially large in $n$, but the probability of this is exponentially small. For example, $Z_1 = \dots = Z_n = 3$ occurs with probability $2^{-n}$ and thus by itself, this event contributes $\lim_{n\to\infty}(3^n 2^{-n})^{1/n} = 3/2 > 1$ to the expectation in (5.3). Whereas the $1/n$ normalization inside the expectation in (5.2) dampens the effect of exponentially rare blow-ups, and thus agrees with the exponentially fast convergence that occurs with overwhelming probability.
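Both quantities in Example 5.1 can be computed exactly by independence: $\mathbb{E}[(\prod_t Z_t)^{1/n}] = (\mathbb{E}[Z_1^{1/n}])^n = ((3^{-2/n} + 3^{1/n})/2)^n$, while $\mathbb{E}[\prod_t Z_t]^{1/n} = \mathbb{E}[Z_1] = 14/9$ for every $n$. A short sketch (ours) tabulating the convergence:

import numpy as np

for n in [10, 100, 1000, 10000]:
    # E[(prod Z_t)^{1/n}] = (E[Z^{1/n}])^n, with Z equally likely 1/9 or 3.
    val = ((3.0 ** (-2.0 / n) + 3.0 ** (1.0 / n)) / 2.0) ** n
    print(f"n = {n:6d}:  E[(prod Z)^(1/n)] = {val:.4f}")
print("limit in (5.2): 1/sqrt(3) =", 1 / np.sqrt(3))
print("whereas (5.3): E[prod Z]^(1/n) = E[Z] = 14/9 =", 14 / 9, "for every n")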

5.2 Typical runs and high probability bounds

Theorem 1.2 establishes that random stepsizes lead to an accelerated rate in the limit as the number of iterations $n\to\infty$. The same analysis arguments can also establish non-asymptotic convergence rates that apply for any finite $n$ and hold with high probability. Indeed, recall from (3.2) that our argument in §3 relates the convergence rate (for any finite $n$) to an empirical sum, namely

\left|\frac{x_n - x^*}{x_0 - x^*}\right|^{1/n} = \exp\left(\frac{1}{n}\sum_{t=0}^{n-1}\log|1-\beta_t^{-1}\lambda_t|\right)\,. \qquad (5.4)

Now, instead of using the Law of Large Numbers to argue that this empirical sum converges to its expectation (giving the accelerated rate), one can use concentration-of-measure to argue about the deviations from its expectation.

Below we state such a result. This is established by bounding the variance of the aforementioned empirical sum via a martingale-type argument and applying Chebyshev's inequality, and then extending this convergence rate to general multivariate objectives $f$ by a union bound. Full proof details are deferred to Appendix B.4. We note that one can show more refined bounds by bounding higher-order moments or the moment generating function (see, e.g., the textbooks [55, 57, 17]), but that requires computing involved logarithmic integrals; thus, for simplicity of exposition, here we state only this simple illustrative result.

Theorem 5.2 (Non-asymptotic version of Theorem 1.2).

Consider the setup of Theorem 1.2. For any deviation $\varepsilon > 0$ and error probability $\delta > 0$,

((xnxx0x)1/neεRacc)δ\displaystyle\mathbb{P}\left(\left(\frac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\right)^{1/n}\geqslant e^{\varepsilon}\operatorname{\mathrm{R_{acc}}}\right)\leqslant\delta (5.5)

if n(π2+oκ(1))dδε2n\geqslant\left(\pi^{2}+o_{\kappa}(1)\right)\frac{d}{\delta\varepsilon^{2}}.

This implies that for a typical run, the convergence rate is close to the accelerated rate; see Figure 3 for an illustration.
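For concreteness, the following simulation sketch (not from the paper; all parameters are illustrative) runs GD with i.i.d. Arcsine inverse stepsizes on a diagonal quadratic and records the empirical rate $R_n$ of each run; the typical run concentrates near $\mathrm{R_{acc}}$, as Theorem 5.2 predicts.

```python
# Simulation sketch of Theorem 5.2 (illustrative parameters): the per-run
# convergence rate R_n of GD with i.i.d. Arcsine(m, M) inverse stepsizes on a
# diagonal quadratic concentrates around R_acc.
import numpy as np

rng = np.random.default_rng(1)
m, M, d, n, runs = 1.0, 100.0, 10, 2000, 200
kappa = M / m
R_acc = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
lam = np.exp(rng.uniform(np.log(m), np.log(M), size=d))   # Hessian eigenvalues

rates = []
for _ in range(runs):
    beta = m + (M - m) * np.sin(np.pi * rng.uniform(size=n) / 2) ** 2  # Arcsine(m, M)
    # log of the coordinate-wise error contraction prod_t |1 - lambda_i / beta_t|
    logs = np.log(np.abs(1 - lam[None, :] / beta[:, None])).sum(axis=0)
    rates.append(np.exp(logs.max() / n))          # worst coordinate dominates R_n
print(R_acc, np.median(rates), np.quantile(rates, [0.05, 0.95]))
```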

5.3 Best runs and parallelization

The randomness in the Arcsine Stepsize Schedule leads to a spread of convergence rates. Often the best run is substantially better than the median, which in turn is substantially better than the worst run; see Figure 3. This suggests a natural approach for parallelized optimization: run this randomized algorithm on pp different processors with i.i.d. stepsizes, and take the best final iterate. One can also perform a more refined parallelized scheme: synchronize every kk iterations by restarting all pp processors from the best current iterate.

How much of a speedup does this parallelization give? There are two sources of tension. First, the convergence rate Racc\operatorname{\mathrm{R_{acc}}} achieved by Theorem 1.2 for p=1p=1 processor is already minimax-optimal, but parallelization helps with instance-optimality which is important in practical settings, i.e., beyond worst-case settings. The second source of tension is that the results in §5.2 show that each of the pp i.i.d. runs has convergence rate Rn:=(xnx/x0x)1/nR_{n}:=(\|x_{n}-x^{*}\|/\|x_{0}-x^{*}\|)^{1/n} concentrating around Racc\operatorname{\mathrm{R_{acc}}} for nn large, which means that RnR_{n} will be asymptotically unaffected by parallelization in the regime of fixed pp and nn\to\infty. However, for regimes where pp and nn scale together, there are deviations from this concentration, and the maximal such deviation in the pp runs is precisely the speedup of this parallelization. See below for a calculation that illustrates this tension in a simple setting. Characterizing the tradeoff between the number of processors pp, the synchronization rate kk, the number of iterations nn, the condition number κ\kappa, and the precision ε\varepsilon is an interesting direction for future work but a full understanding is out of the scope of this paper.333Interesting phenomena arise when jointly optimizing kk and pp. For example, if k=1k=1, then moderate parallelism (p>1p>1) improves the rate in both practice (see Figure 3) and theory (see below), but pp\to\infty recovers exact line search which provably does not accelerate, see e.g., [9, §9.3]. This tradeoff warns of a potential drawback of over-parallelizing. On the other hand, when k>1k>1, the pp\to\infty parallelization is a familiarly good algorithm: it produces the optimal point in the kk-dimensional Krylov subspace, which yields minimax-optimal worst-case convergence rates and is also instance-adaptive; in fact it is equivalent to Conjugate Gradients in exact arithmetic for quadratics.

Simple illustrative calculation.

Here, we provide a short back-of-the-envelope calculation in order to demonstrate these tensions and the benefit of parallelization in the simple setting of univariate quadratic functions. For simplicity consider a single synchronization (i.e., k=nk=n); the general case of multiple synchronizations (k<nk<n) follows by replacing nn by kk in this calculation, and then repeating this convergence rate n/kn/k times. For shorthand, let xt(j)x_{t}^{(j)} denote the tt-th iterate on the jj-th machine, let αt(j)\alpha_{t}^{(j)} denote the corresponding stepsize, and let f(x)=λ2(xx)2f(x)=\frac{\lambda}{2}(x-x^{*})^{2} for λ[m,M]\lambda\in[m,M] (all univariate κ\kappa-conditioned quadratics are of this form).

Step 1: single machine. By the same analysis as in our main results (see §2.1 or §3.1) and then a Central Limit Theorem approximation, the convergence rate on the jj-th machine is

(|xn(j)x||x0(j)x|)1/n=exp(1nt=0n1log|1αt(j)λ|)exp(logRacc+σnξj)=Raccexp(σnξj),\displaystyle\left(\frac{\left\lvert x_{n}^{(j)}-x^{*}\right\rvert}{\left\lvert x_{0}^{(j)}-x^{*}\right\rvert}\right)^{1/n}=\exp\left(\frac{1}{n}\sum_{t=0}^{n-1}\log\left\lvert 1-\alpha_{t}^{(j)}\lambda\right\rvert\right)\approx\exp\left(\log\operatorname{\mathrm{R_{acc}}}+\frac{\sigma}{\sqrt{n}}\xi_{j}\right)=\operatorname{\mathrm{R_{acc}}}\cdot\exp\left(\frac{\sigma}{\sqrt{n}}\xi_{j}\right)\,,

where $\xi_j \sim \mathcal{N}(0,1)$ and $\sigma^2 = \operatorname{Var}_{\alpha}(\log|1-\alpha\lambda|)$. Note that the error of this approximation can be tightly controlled using standard techniques such as the Berry-Esseen CLT [57, Theorem 2.1.13].

Step 2: best machine. Taking the minimum over all pp processors gives the convergence rate of the parallelized scheme which takes the best final iterate:

minj[p](|xn(j)x||x0(j)x|)1/nRaccexp(σnminj[p]ξj)Raccexp(σ2logpn).\displaystyle\min_{j\in[p]}\left(\frac{\left\lvert x_{n}^{(j)}-x^{*}\right\rvert}{\left\lvert x_{0}^{(j)}-x^{*}\right\rvert}\right)^{1/n}\approx\operatorname{\mathrm{R_{acc}}}\cdot\exp\left(\frac{\sigma}{\sqrt{n}}\min_{j\in[p]}\xi_{j}\right)\approx\operatorname{\mathrm{R_{acc}}}\cdot\exp\left(-\sigma\sqrt{\frac{2\log p}{n}}\right)\,.

Above, the final step is due to the well-known fact that the minimum of pp i.i.d. standard Gaussians concentrates tightly around 2logp-\sqrt{2\log p}, see, e.g., [57, Chapter 2.5]. This factor exp(σ(2logp)/n)\exp(-\sigma\sqrt{(2\log p)/n}) quantifies the speedup of parallelization since it is the size of the variation in Figure 3. By a Taylor approximation of logRacc\log\operatorname{\mathrm{R_{acc}}}, this parallelization provides asymptotic speedups when σ(2logp)/n|logRacc|=|logκ1κ+1|1κ\sigma\sqrt{(2\log p)/n}\gtrsim|\log\operatorname{\mathrm{R_{acc}}}|=|\log\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}|\approx\tfrac{1}{\sqrt{\kappa}}, or equivalently when the number of processors pexp(n2κσ2)p\gtrsim\exp(\tfrac{n}{2\kappa\sigma^{2}}).
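The following back-of-the-envelope simulation (a sketch with illustrative parameters) implements this calculation: it draws $p$ i.i.d. runs on a univariate quadratic and compares the best empirical rate to the prediction $\mathrm{R_{acc}} \exp(-\sigma\sqrt{(2\log p)/n})$.

```python
# Sketch of the best-of-p parallelization heuristic (illustrative parameters,
# single synchronization k = n) on the univariate quadratic f(x) = (lam/2) x^2.
import numpy as np

rng = np.random.default_rng(2)
m, M, lam, n, p, reps = 1.0, 100.0, 50.0, 500, 64, 200
R_acc = (np.sqrt(M / m) - 1) / (np.sqrt(M / m) + 1)

def arcsine(size):                                # i.i.d. Arcsine(m, M) samples
    return m + (M - m) * np.sin(np.pi * rng.uniform(size=size) / 2) ** 2

sigma = np.std(np.log(np.abs(1 - lam / arcsine(10**6))))  # std of log|1 - alpha*lam|
best = [np.exp(np.mean(np.log(np.abs(1 - lam / arcsine((p, n)))), axis=1).min())
        for _ in range(reps)]
# Same ballpark; the -sqrt(2 log p) asymptotic for the Gaussian minimum
# overshoots somewhat at moderate p, so the prediction is slightly optimistic.
print(np.mean(best), R_acc * np.exp(-sigma * np.sqrt(2 * np.log(p) / n)))
```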

6 Discussion and related results

6.1 Stability

Throughout, we have assumed for simplicity that all calculations are performed with exact gradients and in exact arithmetic. It is of practical importance to understand how inexactness affects convergence, and there is a long line of work on such questions for first-order algorithms, see, e.g., [16, 46, 18, 14, 51, 36, 1] and the references within. There are many models of inexact computation. Here, as an illustrative example, we study one common model: the multiplicative error model for gradients [16, 46]. Specifically, suppose that all calculations are still performed in exact arithmetic, but GD performs updates using an estimate g~t\tilde{g}_{t} of the true gradient f(xt)\nabla f(x_{t}), where

g~tf(xt)εf(xt).\displaystyle\|\tilde{g}_{t}-\nabla f(x_{t})\|\leqslant\varepsilon\|\nabla f(x_{t})\|\,. (6.1)
Theorem 6.1 (Inexact-gradient version of Theorem 1.2).

Consider the setup of Theorem 1.2. Suppose that GD updates in each iteration using g~t\tilde{g}_{t} rather than f(xt)\nabla f(x_{t}), i.e.,

xt+1=xtαtg~t,\displaystyle x_{t+1}=x_{t}-\alpha_{t}\tilde{g}_{t}\,, (6.2)

where g~t\tilde{g}_{t} are inexact gradients satisfying the multiplicative-error assumption (6.1). Then the convergence rate of GD with i.i.d. Arcsine stepsizes is almost surely bounded by

limn(xnxx0x)1/n11+ε~(1+ε~)21slowdown from inexact gradientsκ1κ+1accelerated rate Racc where ε~:=2MMmε.\displaystyle\lim_{n\to\infty}\left(\frac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\right)^{1/n}\leqslant\;\;\underbrace{\frac{1}{1+\tilde{\varepsilon}-\sqrt{(1+\tilde{\varepsilon})^{2}-1}}}_{\text{slowdown from inexact gradients}}\cdot\underbrace{\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}}_{\text{accelerated rate }\operatorname{\mathrm{R_{acc}}}}\,\text{ where }\quad\tilde{\varepsilon}:=\frac{2M}{M-m}\,\varepsilon\,. (6.3)

Moreover this bound is tight, in that it holds with equality for some choice of ff and inexact gradients.

The proof is deferred to Appendix B.5, since it is similar to that of Theorem 1.2. Conceptually, the only difference is that the average Hessians λt[m,M]\lambda_{t}\in[m,M] are now perturbed, and thus the convergence rate is governed by exp(𝔼βArcsine(m,M)log|1β1λ~|)\exp(\mathbb{E}_{\beta\sim\operatorname{{\mathrm{Arcsine}}(m,M)}}\log|1-\beta^{-1}\tilde{\lambda}|) where λ~\tilde{\lambda} can now be slightly outside the interval [m,M][m,M].

We remark that for small error ε\varepsilon, the slowdown factor in Theorem 6.1 is of the order

11+ε~(1+ε~)21=1+2ε~+Θ(ε~)=1+4MεMm+Θ(ε).\displaystyle\frac{1}{1+\tilde{\varepsilon}-\sqrt{(1+\tilde{\varepsilon})^{2}-1}}=1+\sqrt{2\tilde{\varepsilon}}+\Theta(\tilde{\varepsilon})=1+\sqrt{\frac{4M\varepsilon}{M-m}}+\Theta(\varepsilon)\,.

In other words, for large condition number κ1\kappa\gg 1, the number of iterations required by GD increases by a multiplicative factor of roughly 1+2ε1+2\sqrt{\varepsilon} when using inexact gradients. Note that this slowdown is more than the (1+Θ(ε))(1+\Theta(\varepsilon)) slowdown for constant-stepsize schedules [16, Theorem 5.3]. This is a manifestation of the general phenomenon that accelerated algorithms are often less stable to model mis-specification, see e.g., [18].

We also remark that by a nearly identical argument, Theorem 6.1 can be extended to stochastic error models, i.e., where $\tilde{g}_t - \nabla f(x_t)$ is random. The slowdown factor then becomes an averaged version of the quantity in Theorem 6.1. However, this is more complicated to state, hence for simplicity the above discussion focuses on the deterministic error model.
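As a sanity check on the tightness claim, the following sketch (illustrative parameters; the worst case $\tilde{g}_t = (1+\varepsilon)f'(x_t)$ is taken from the proof in Appendix B.5) estimates the asymptotic rate under maximally overestimating gradients and compares it to the bound (6.3).

```python
# Sketch checking the tightness of Theorem 6.1 (illustrative parameters): on
# f(x) = (M/2) x^2 with maximally overestimating gradients
# g_t = (1 + eps) f'(x_t), the empirical rate matches the bound (6.3).
import numpy as np

rng = np.random.default_rng(3)
m, M, eps, n = 1.0, 100.0, 0.001, 200_000
kappa = M / m
R_acc = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
eps_tilde = 2 * M * eps / (M - m)
bound = R_acc / (1 + eps_tilde - np.sqrt((1 + eps_tilde) ** 2 - 1))

beta = m + (M - m) * np.sin(np.pi * rng.uniform(size=n) / 2) ** 2  # Arcsine(m, M)
rate = np.exp(np.mean(np.log(np.abs(1 - (1 + eps) * M / beta))))
print(rate, bound)                                # both ~ 0.87 for these parameters
```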

6.2 Game-theoretic connections and lower bounds

This paper considers random stepsizes for the algorithmic purpose of accelerating GD. It turns out that this randomized perspective is also helpful for the dual problem of constructing algorithmic lower bounds. We describe this duality here.

Recall from §2 that the optimal distribution μ\mu over i.i.d. inverse stepsizes β=α1\beta=\alpha^{-1} has the following extremal characterization:

infμ(E)supλE𝔼βμlog|1β1λ|.\displaystyle\inf_{\mu\in\mathcal{M}(E)}\;\sup_{\lambda\in E}\;\mathbb{E}_{\beta\sim\mu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert\,. (6.4)

Above, $\mu$ is constrained to be supported on $E := [m,M]$ without loss of generality (since this holds at optimality), $\lambda \in E$ should be interpreted as the curvature of a "worst-case" univariate quadratic function $f(x) = \tfrac{\lambda}{2}x^2$, and $\mathbb{E}_{\beta\sim\mu}\log|1-\beta^{-1}\lambda|$ is the logarithm of the asymptotic convergence rate.

To bridge this problem of accelerating GD with the problem of constructing lower bounds, we consider a standard two-player zero-sum game where players choose their actions simultaneously and independently. In our case, one player chooses inverse stepsizes β1\beta^{-1} and the other player chooses a quadratic function (parameterized by its curvature λ)\lambda). By lifting to mixed (i.e., randomized) strategies, one player chooses a distribution μ\mu over stepsizes, the other chooses a distribution ν\nu over quadratic functions, with payoff given by (μ,ν):=𝔼βμ,λνlog|1β1λ|{\cal R}(\mu,\nu):=\mathbb{E}_{\beta\sim\mu,\,\lambda\sim\nu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert.

The key observation is that due to the specific payoff function, the game is essentially symmetric between the two players (modulo inversion), and as a consequence, so are the optimal mixed strategies for both players. Below, let FlippedArcsine(m,M)\operatorname{\mathrm{FlippedArcsine}(m,M)} denote the law of λ\lambda when λ1Arcsine(M1,m1)\lambda^{-1}\sim\mathrm{Arcsine}(M^{-1},m^{-1}).

Lemma 6.2 (Optimal distribution for worst-case functions).

The (minimax) value of this game is logRacc\log\operatorname{\mathrm{R_{acc}}}, i.e., we have the saddle point condition

supν(μ,ν)=(μ,ν)=infμ(μ,ν)\sup_{\nu}{\cal R}(\mu^{*},\nu)={\cal R}(\mu^{*},\nu^{*})=\inf_{\mu}{\cal R}(\mu,\nu^{*})

with

(μ,ν)=𝔼βμ,λνlog|1β1λ|=logRacc,{\cal R}(\mu^{*},\nu^{*})=\mathbb{E}_{\beta\sim\mu^{*},\,\lambda\sim\nu^{*}}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert=\log\operatorname{\mathrm{R_{acc}}},

and the unique optimal mixed strategies are μ=Arcsine(m,M)\mu^{*}=\operatorname{{\mathrm{Arcsine}}(m,M)} and ν=FlippedArcsine(m,M)\nu^{*}=\operatorname{\mathrm{FlippedArcsine}(m,M)}.

Remark 6.3 (Game-theoretic interpretation of equalization).

Notice that both optimal distributions have full support. As a consequence, at equilibrium a player is indifferent among all of their pure actions, i.e., the equalization property (Lemma 2.2) must hold.
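As a quick numerical illustration of this indifference (a Monte Carlo sanity check with illustrative parameters, not part of the proofs), against the opponent's optimal mixed strategy every pure action yields payoff approximately $\log\mathrm{R_{acc}}$:

```python
# Monte Carlo sketch of the indifference at (mu*, nu*) in Lemma 6.2
# (illustrative parameters): against the opponent's optimal mixed strategy,
# every pure action yields payoff ~ log R_acc.
import numpy as np

rng = np.random.default_rng(4)
m, M, N = 1.0, 100.0, 10**6
log_R = np.log((np.sqrt(M / m) - 1) / (np.sqrt(M / m) + 1))

beta = m + (M - m) * np.sin(np.pi * rng.uniform(size=N) / 2) ** 2   # mu* = Arcsine
inv = 1 / M + (1 / m - 1 / M) * np.sin(np.pi * rng.uniform(size=N) / 2) ** 2
lam_star = 1 / inv                                                  # nu* = FlippedArcsine

for lam in [m, 10.0, M]:        # function player's pure actions vs mu*
    print(np.mean(np.log(np.abs(1 - lam / beta))), log_R)
for b in [m, 10.0, M]:          # stepsize player's pure actions vs nu*
    print(np.mean(np.log(np.abs(1 - lam_star / b))), log_R)
```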

Proof sketch for Lemma 6.2.

Lemma 2.1 implies that μ=Arcsine(m,M)\mu^{*}=\operatorname{{\mathrm{Arcsine}}(m,M)} is the unique optimal solution for the stepsize player, and that the value of the game is logRacc\log\operatorname{\mathrm{R_{acc}}}. See [2, Chapter 3.3.3] for a rigorous proof that the unique optimal solution for the other player is ν=FlippedArcsine(m,M)\nu^{*}=\operatorname{\mathrm{FlippedArcsine}(m,M)}; here we just provide intuition in terms of symmetry of the game. The underlying idea is to consider a “flipped game” in which strategies correspond to the inverses of strategies in the original game, i.e., one player chooses stepsizes β~=β1E~\tilde{\beta}=\beta^{-1}\in\tilde{E}, and the other player chooses a quadratic function parameterized by its inverse curvature λ~=λ1E~\tilde{\lambda}=\lambda^{-1}\in\tilde{E}. Here E~:=[m~,M~]\tilde{E}:=[\tilde{m},\tilde{M}] where m~:=M1\tilde{m}:=M^{-1} and M~:=m1\tilde{M}:=m^{-1}. By lifting to mixed strategies, one player chooses a distribution μ~\tilde{\mu} over stepsizes β~E~\tilde{\beta}\in\tilde{E}, and the other player chooses a distribution ν~\tilde{\nu} over quadratic functions λ~E~\tilde{\lambda}\in\tilde{E}, with payoff given by

~(μ~,ν~):=𝔼β~μ~,λ~ν~log|1λ~1β~|.{\cal\tilde{R}}(\tilde{\mu},\tilde{\nu}):=\mathbb{E}_{\tilde{\beta}\sim\tilde{\mu},\tilde{\lambda}\sim\tilde{\nu}}\log|1-\tilde{\lambda}^{-1}\tilde{\beta}|\,.

This flipped game is equivalent to the original game due to the invariance of both the payoff function, $\tilde{\mathcal{R}}(\tilde{\mu},\tilde{\nu}) = \mathcal{R}(\mu,\nu)$, and the condition number, $\tilde{M}/\tilde{m} = M/m$, under this inversion of the strategies. Because $\operatorname{Arcsine}(m,M)$ is the unique optimal solution $\mu^*$ in the original game, $\operatorname{FlippedArcsine}(\tilde{m},\tilde{M})$ is the unique optimal solution $\tilde{\mu}^*$ in the flipped game; in order to show that $\operatorname{FlippedArcsine}(m,M)$ is the unique optimal solution $\nu^*$ in the original game, it suffices to show that $\operatorname{Arcsine}(\tilde{m},\tilde{M})$ is the unique optimal solution $\tilde{\nu}^*$ in the flipped game. To this end, it suffices to show that an optimal solution $\tilde{\nu}^*$ must satisfy the equalization property in the flipped game, i.e., that

\tilde{\mathcal{R}}(\delta_{\tilde{\beta}},\tilde{\nu}^{*}) = \mathbb{E}_{\tilde{\lambda}\sim\tilde{\nu}^{*}}\log\left|1-\tilde{\lambda}^{-1}\tilde{\beta}\right|

is independent of β~E~\tilde{\beta}\in\tilde{E}. This suffices since the equalization property uniquely defines the Arcsine distribution by classical results from logarithmic potential theory, see Appendix A.

Why must an optimal solution $\tilde{\nu}^*$ for the flipped game satisfy this equalization property? An informal argument is as follows. By the definition of $\tilde{\mathcal{R}}$, the symmetry between $\tilde{\mathcal{R}}$ and $\mathcal{R}$, and then the equalization property of the Arcsine distribution (Lemma 2.2) applied to $\mu^*$,

\mathbb{E}_{\tilde{\beta}\sim\tilde{\mu}^{*}}\,\tilde{\mathcal{R}}(\delta_{\tilde{\beta}},\tilde{\nu}^{*}) = \tilde{\mathcal{R}}(\tilde{\mu}^{*},\tilde{\nu}^{*}) = \mathcal{R}(\mu^{*},\nu^{*}) = \log\mathrm{R_{acc}}\,.

Thus, forgoing measure-theoretic technicalities, if $\tilde{\nu}^*$ did not satisfy the equalization property, then by the probabilistic method there would exist some $\tilde{\beta}$ for which $\tilde{\mathcal{R}}(\delta_{\tilde{\beta}},\tilde{\nu}^{*})$ is strictly less than its average $\mathbb{E}_{\tilde{\beta}\sim\tilde{\mu}^{*}}\tilde{\mathcal{R}}(\delta_{\tilde{\beta}},\tilde{\nu}^{*}) = \log\mathrm{R_{acc}}$. But this means that the value of the flipped game is less than $\log\mathrm{R_{acc}}$, hence so is the value of the original game (since the two values are identical), a contradiction. ∎

As a corollary, Lemma 6.2 gives a new proof that κ1κ+1\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} is the fastest possible convergence rate for GD. This result is classically due to [39], who proved this by constructing a single, infinite-dimensional “worst-case function” for which any Krylov-subspace algorithm converges slowly. Here we construct a “worst-case distribution” over univariate functions for which any GD stepsize schedule converges slowly in expectation. By the probabilistic method, for any specific GD stepsize schedule, we can then conclude the existence of a single univariate function for which convergence is slow. For simplicity, we state this lower bound here for i.i.d. stepsizes; this can be extended to arbitrary GD stepsize schedules and also to non-asymptotic lower bounds, see [2, Chapter 5].

Corollary 6.4.

For any distribution of i.i.d. GD stepsizes, there exists a univariate quadratic function that is mm-strongly convex and MM-smooth such that the asymptotic convergence rate

limn𝔼[(xnxx0x)1/n]κ1κ+1.\displaystyle\lim_{n\to\infty}\mathbb{E}\left[\left(\frac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\right)^{1/n}\right]\geqslant\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,.
Proof.

Lemma 6.2 shows that in expectation over λFlippedArcsine(m,M)\lambda\sim\operatorname{\mathrm{FlippedArcsine}(m,M)}, the asymptotic convergence rate is Racc\operatorname{\mathrm{R_{acc}}} when run on the quadratic function fλ(x)=λ2x2f_{\lambda}(x)=\tfrac{\lambda}{2}x^{2} from initialization x0=1x_{0}=1. By the probabilistic method, there exists a λ\lambda^{*} for which the asymptotic convergence rate on fλ(x)f_{\lambda^{*}}(x) is at least this average value Racc\operatorname{\mathrm{R_{acc}}}. ∎

Remark 6.5.

The lower bounds in this section can be interpreted as an instantiation of Yao's minimax principle [60]: in order to lower bound the performance of a randomized algorithm (e.g., GD with the Arcsine Stepsize Schedule), we exhibit a distribution over difficult inputs (namely quadratics with curvature $\lambda$ drawn from the FlippedArcsine distribution) on which no deterministic algorithm performs well. This principle is implicit in this section's development.

7 Conclusion and future work

This paper shows that randomizing stepsizes provides an alternative mechanism for accelerated convex optimization beyond the special case of quadratics. This raises several natural questions about the possible benefit of randomness for GD, or, said equivalently, whether the Arcsine Stepsize Schedule can be derandomized. The current results differ on two fronts: Theorem 1.2 shows that random stepsizes achieve the fully accelerated rate $O(\kappa^{1/2}\log\tfrac{1}{\varepsilon})$ on separable functions, whereas [3] shows that deterministic stepsizes achieve the partially accelerated "Silver" rate $O(\kappa^{\log_{1+\sqrt{2}}2}\log\tfrac{1}{\varepsilon}) \approx O(\kappa^{0.78}\log\tfrac{1}{\varepsilon})$ on general convex functions. These two differences motivate the following questions:

  • Is it possible to de-randomize Theorem 1.2, i.e., construct a deterministic stepsize schedule which achieves the fully accelerated rate for separable functions? If so, a natural conjecture would be some ordering of the Chebyshev stepsize schedule.

  • Does the Arcsine Stepsize Schedule fully accelerate on general (non-separable) convex functions? If not, can any randomized stepsize schedule improve over the Silver Convergence Rate for general convex functions?

There are also many questions about extensions to other optimization settings of interest. For other settings, what is the optimal stepsize distribution if not Arcsine? In any setting, is there a provable benefit of using non-i.i.d. stepsizes? Can random stepsizes be alternatively understood through discretization analyses of continuous-time gradient flows? While other mechanisms for acceleration (e.g., momentum) are now well understood [22], little is known about acceleration via randomness.

Appendix A Connections to logarithmic potential theory

Recall from the overview section §2 that the optimal distribution μ\mu for inverse stepsizes β\beta is given by the extremal problem

infμ(E)supλE𝔼βμlog|1β1λ|,\displaystyle\inf_{\mu\in\mathcal{M}(E)}\;\sup_{\lambda\in E}\;\mathbb{E}_{\beta\sim\mu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert\,, (A.1)

where E:=[m,M]E:=[m,M] and (E)\mathcal{M}(E) denotes the probability distributions supported on EE.444Although the stepsize problem was not originally written with this constraint in §2, it is easily seen that this constraint is satisfied at optimality and therefore does not change the extremal problem. See Observation A.4. Indeed, for quadratic optimization, this extremal problem exactly characterizes the asymptotic convergence rate (see §2.1); and for the more general setting of separable convex optimization, even though the relevant extremal problem is somewhat different, the solution is identical (see §2.2).

In this appendix section, we connect this extremal problem over stepsize distributions with the extremal problem for electrostatic capacitors that is classically studied in potential theory. As we explain below, although these two problems seek probability distributions minimizing different objectives, the two problems are intimately linked because they have the same optimality condition: the optimal distribution μ\mu must satisfy the equalization property from Lemma 2.2 (i.e., the objective 𝔼βμlog|1β1λ|\mathbb{E}_{\beta\sim\mu}\log|1-\beta^{-1}\lambda| has equal value for all λ[m,M]\lambda\in[m,M]). Because this equalization property uniquely characterizes the distribution (a classical fact from potential theory, see Lemma A.9), this means that the two problems have the same unique optimal solution. In this way, the equalization property plays the central role connecting these two problems.

This appendix section is organized as follows. In §A.1 we recall the electrostatics problem and highlight the syntactic similarities and differences to the stepsize problem. In §A.2 we explain the equalization property in the context of both problems and how it arises as the optimality condition. This is the key link between the two problems. In §A.3 we exploit this connection concretely by using the machinery of potential theory to solve the stepsize problem. This provides alternative proofs of Lemmas 2.1 and 2.2 that are significantly shorter than the integral calculations of [32, 47].

A.1 Extremal problem for electrostatic capacitors

The extremal problem central to logarithmic potential theory is

infμ(E)log1|λω|dμ(ω)𝑑μ(λ),\displaystyle\inf_{\mu\in\mathcal{M}(E)}\iint\log\frac{1}{\left\lvert\lambda-\omega\right\rvert}d\mu(\omega)d\mu(\lambda)\,, (A.2)

given a set EE\subset\mathbb{C}. This problem is sometimes called the electrostatics problem for a capacitor due to its physical interpretation: how should you place a distribution of repelling particles on EE so as to minimize the total repulsive energy? Here the repulsive energy is measured via

I(μ):=log1|λω|dμ(ω)𝑑μ(λ)=𝔼λμϕμ(λ),I(\mu):=\iint\log\frac{1}{\left\lvert\lambda-\omega\right\rvert}d\mu(\omega)d\mu(\lambda)=\mathbb{E}_{\lambda\sim\mu}\phi_{\mu}(\lambda)\,,

which is the μ\mu-average of the logarithmic potential energy at λ\lambda, denoted

ϕμ(λ):=𝔼ωμlog1|λω|.\phi_{\mu}(\lambda):=\mathbb{E}_{\omega\sim\mu}\log\frac{1}{|\lambda-\omega|}\,.

Standard arguments based on lower semi-continuity and strict convexity of the objective ensure that an optimal distribution exists and is unique. For certain sets EE, this optimal distribution can be computed in closed form. For example it is well-known that for the interval E=[m,M]E=[m,M], the unique optimal distribution is the Arcsine(m,M)\operatorname{{\mathrm{Arcsine}}(m,M)} distribution, see e.g., [49, Example 1.11].

Syntactic similarities and differences.

Let E=[m,M]E=[m,M]. Then the stepsize extremal problem (A.1) clearly bears similarities to the electrostatics extremal problem (A.2), namely

infμ(E)supλEϕμ(0)ϕμ(λ)versusinfμ(E)𝔼λμϕμ(λ).\displaystyle\inf_{\mu\in\mathcal{M}(E)}\sup_{\lambda\in E}\phi_{\mu}(0)-\phi_{\mu}(\lambda)\qquad\text{versus}\qquad\inf_{\mu\in\mathcal{M}(E)}\mathbb{E}_{\lambda\sim\mu}\phi_{\mu}(\lambda)\,. (A.3)

Yet there are two differences. First, the stepsize problem evaluates the logarithmic potential ϕμ(λ)\phi_{\mu}(\lambda) at a specific point λ[m,M]\lambda\in[m,M], whereas the electrostatics problem averages this potential over λμ\lambda\sim\mu. Second, the stepsize problem has an additional term of ϕμ(0)\phi_{\mu}(0).

In general, changing the objective or constraints in an extremal problem can significantly alter the optimal solution. Yet, remarkably, the optimal solution μ\mu is identical for both problems. This is because both problems have the same optimality condition, described next.

A.2 Equalization property

Here we describe how the equalization property arises as the optimality condition in both problems.

A.2.1 Equalized potential for the electrostatics problem

A celebrated fact—sometimes called the Fundamental Theorem of Potential Theory [48, Chapter 3.3]—is that the optimal distribution μ\mu to the electrostatics problem (A.2) satisfies an equalization property: the potential λϕμ(λ)\lambda\mapsto\phi_{\mu}(\lambda) is constant on EE. (Note that this is equivalent to the constancy on EE of the objective for the stepsize problem, namely λ𝔼βlog|1β1λ|=ϕμ(0)ϕμ(λ)\lambda\mapsto\mathbb{E}_{\beta}\log|1-\beta^{-1}\lambda|=\phi_{\mu}(0)-\phi_{\mu}(\lambda).)

This equalization property has an intuitive physical interpretation: if the potential is not constant on EE, then particles will flow to points of EE with lower potential, so the current state is not at equilibrium and therefore cannot be optimal. This is why the optimal solution μ\mu is often called the equilibrium distribution. This result is attributed to Frostman; here we state a simple version of it that suffices for our setting (namely a compact interval E=[m,M]E=[m,M]).

Theorem A.1 (Frostman’s Theorem).

Let EE\subset\mathbb{C} be compact and simply connected, and suppose that the optimal value of (A.2) is not -\infty. Then there is a unique optimal distribution μ\mu^{*} for (A.2). Moreover it satisfies

  • ϕμ(λ)=I(μ)\phi_{\mu^{*}}(\lambda)=I(\mu^{*}) for all λE\lambda\in E.

  • ϕμ(λ)I(μ)\phi_{\mu^{*}}(\lambda)\leqslant I(\mu^{*}) for all λ\lambda\in\mathbb{C}.

A proof can be found in any textbook on logarithmic potential theory, e.g., [54, 48, 50]. Below we use calculus of variations to give an informal back-of-the-envelope calculation that provides intuition for how this equalization property arises as the optimality condition. Let

F(μ):=log1|xy|dμ(x)𝑑μ(y)+γ(𝑑μ(x)1)\displaystyle F(\mu):=\iint\log\frac{1}{|x-y|}d\mu(x)d\mu(y)+\gamma\left(\int d\mu(x)-1\right)

denote the original objective plus a Lagrangian term for the normalization constraint on μ\mu. Then a quick calculation shows that the first variation is

\delta F(\mu,h) = \frac{d}{d\varepsilon}F(\mu+\varepsilon h)\Big|_{\varepsilon=0} = \int\left[\gamma + 2\int\log\frac{1}{|x-y|}\,d\mu(x)\right]dh(y)\,.

In order for this first variation to vanish along any direction hh, it must hold that

\gamma + 2\int\log\frac{1}{|x-y|}\,d\mu(x) = 0\,,\qquad \forall y\in E\,.

This is the equalization property: the potential $\phi_\mu$ is constant (equal to $-\gamma/2$) on $E$.

A.2.2 Equalized convergence rate for the stepsize problem

As discussed in §2.1 and formally stated in Lemma 2.1, the stepsize problem has the same equalization property at optimality. This is rigorously proven in §A.3.2 using the machinery of potential theory; to provide intuition for how this arises as the optimality condition, here we give an informal back-of-the-envelope argument based on the game-theoretic perspective developed in §6.2.

Suppose that $\mu$ is an optimal solution to the stepsize problem (A.1). By the calculation in §6.2, there is a distribution $\nu$ (the FlippedArcsine distribution) for which

𝔼βμ,λνlog|1β1λ|=logRacc.\displaystyle\mathbb{E}_{\beta\sim\mu,\,\lambda\sim\nu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert=\log\operatorname{\mathrm{R_{acc}}}\,. (A.4)

Observe that this is the ν\nu-average of Pμ(λ):=𝔼βμlog|1β1λ|=ϕμ(0)ϕμ(λ)P_{\mu}(\lambda):=\mathbb{E}_{\beta\sim\mu}\log|1-\beta^{-1}\lambda|=\phi_{\mu}(0)-\phi_{\mu}(\lambda). Thus, if μ\mu does not equalize, then (forgoing measure-theoretic technicalities) there exists λsupp(ν)=[m,M]\lambda\in\operatorname{supp}(\nu)=[m,M] for which Pμ(λ)P_{\mu}(\lambda) is strictly larger than the ν\nu-average. That is, maxλ[m,M]Pμ(λ)>logRacc\max_{\lambda\in[m,M]}P_{\mu}(\lambda)>\log\operatorname{\mathrm{R_{acc}}}. But this contradicts the optimality of μ\mu since the value of the stepsize problem (A.1) is logRacc\log\operatorname{\mathrm{R_{acc}}}. Therefore the equalization property must be satisfied by the optimal distribution μ\mu.

A.3 Solution to the extremal problem over stepsizes

So far we have focused on developing conceptual connections between the stepsize problem and the electrostatics problem. We now exploit these connections concretely by using the machinery of potential theory to solve the stepsize problem. This provides short proofs of Lemmas 2.1 and 2.2.

A.3.1 Proof of Lemma 2.2

Here we prove the equalization property. In the optimization literature, this was proven by computing the integral directly using trigonometric identities [47], or by decomposing the logarithmic integrand into the Chebyshev polynomial basis [32]. Here we mention an alternative approach by appealing to the following well-known identity from potential theory, which gives the potential for the equilibrium distribution on [1,1][-1,1] at any point zz; see e.g., [50, Example 1.11].

Fact A.2.

Let $\gamma$ denote the Arcsine distribution on $[-1,1]$. Then for any $z\in\mathbb{C}$, with the branch of the square root chosen so that $|z-\sqrt{z^2-1}| \geqslant 1$,

𝔼tγlog1|zt|=log2log|zz21|𝟙z[1,1].\displaystyle\mathbb{E}_{t\sim\gamma}\log\frac{1}{\left\lvert z-t\right\rvert}=\log 2-\log\left\lvert z-\sqrt{z^{2}-1}\right\rvert\cdot\mathds{1}_{z\notin[-1,1]}\,.
Proof of Lemma 2.2.

Let μ\mu denote the Arcsine distribution on [m,M][m,M]. Note that μ\mu is the push-forward of γ\gamma under the linear map L(z):=M+m2+Mm2zL(z):=\tfrac{M+m}{2}+\tfrac{M-m}{2}z that sends [1,1][-1,1] to [m,M][m,M]. Thus by a change-of-measure, the μ\mu-potential at any ω\omega\in\mathbb{C} is

ϕμ(ω)=𝔼βμlog1|βω|=𝔼tγlog1|L(t)L(z)|=ϕγ(z)+log2Mm,\displaystyle\phi_{\mu}(\omega)=\mathbb{E}_{\beta\sim\mu}\log\frac{1}{\left\lvert\beta-\omega\right\rvert}=\mathbb{E}_{t\sim\gamma}\log\frac{1}{\left\lvert L(t)-L(z)\right\rvert}=\phi_{\gamma}(z)+\log\frac{2}{M-m}\,,

where z:=L1(ω)z:=L^{-1}(\omega). This gives the desired identity:

\mathbb{E}_{\beta\sim\mu}\log\left|1-\beta^{-1}\lambda\right| = \mathbb{E}_{\beta\sim\mu}\log\frac{1}{|\beta|} - \mathbb{E}_{\beta\sim\mu}\log\frac{1}{|\lambda-\beta|} = \phi_{\mu}(0)-\phi_{\mu}(\lambda) = -\log\left|z-\sqrt{z^{2}-1}\right| = \log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,.

Above the first step is by expanding the logarithm, the second step is by definition of the logarithmic potential ϕμ\phi_{\mu}, the third step is by Fact A.2 and the above formula for ϕμ\phi_{\mu} in terms of ϕγ\phi_{\gamma}, and the last step is by plugging in the value of z=L1(0)=κ+1κ1z=L^{-1}(0)=-\tfrac{\kappa+1}{\kappa-1} and simplifying. ∎
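One can check this identity numerically; the following quadrature sketch (a sanity check with illustrative parameters, not part of the proof) uses the substitution $\beta = \tfrac{M+m}{2} + \tfrac{M-m}{2}\cos\theta$ with $\theta$ uniform on $[0,\pi]$, under which the Arcsine expectation becomes an ordinary integral.

```python
# Quadrature sanity check of Lemma 2.2 (illustrative parameters): the
# Arcsine(m, M) expectation of log|1 - lam/beta| equals log R_acc for every
# lam in [m, M]. Uses beta = (M+m)/2 + ((M-m)/2) cos(theta), theta ~ Unif[0, pi].
import numpy as np

m, M, K = 1.0, 100.0, 10**6
kappa = M / m
log_R = np.log((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))

theta = (np.arange(K) + 0.5) * np.pi / K           # midpoint rule on [0, pi]
beta = (M + m) / 2 + (M - m) / 2 * np.cos(theta)   # pushes Unif to Arcsine(m, M)

for lam in [m, (m + M) / 2, M]:
    print(lam, np.mean(np.log(np.abs(1 - lam / beta))), log_R)  # all ~ equal
```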

Remark A.3 (Connection to Green’s function).

This expression $\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} = \log\mathrm{R_{acc}}$ is, up to sign, the evaluation at $0$ of Green's function for the interval $[m,M]$ (Green's function is nonnegative and equals $-\log\mathrm{R_{acc}}$ at $0$). This connection essentially arises because: 1) for quadratic optimization, polynomial approximation dictates the convergence rate of GD (see §2.1), and 2) the Bernstein-Walsh Theorem characterizes the optimal rate of convergence of polynomial approximations in terms of Green's function (see e.g., [19, Theorem 1]).

A.3.2 Proof of Lemma 2.1

Here we prove that the Arcsine distribution is the unique optimal solution for the stepsize problem (A.1) by mirroring the standard argument from potential theory that the Arcsine distribution is the unique optimal solution to the electrostatic problem (A.2), see e.g., [49, Example 2.9]. Specifically, our approach is to show that any optimal distribution μ\mu satisfies the equalization property (see §A.2.2 for a heuristic argument of this). Indeed, it then follows that the optimal distribution μ\mu is unique and is the Arcsine distribution by the converse of Frostman’s Theorem (stated below), which ensures that the Arcsine distribution is the unique distribution satisfying equalization.

We begin with a simple observation.

Observation A.4 (Support of the optimal stepsize distribution).

The constraint μ(E)\mu\in\mathcal{M}(E) does not change the value of the stepsize problem (A.1).

Proof.

It suffices to argue that for any distribution μ\mu (with arbitrary support), there exists some distribution μ~(E)\tilde{\mu}\in\mathcal{M}(E) whose objective value is at least as good, i.e., maxλ[m,M]𝔼β~μ~log|1β~1λ|maxλ[m,M]𝔼βμlog|1β1λ|\max_{\lambda\in[m,M]}\mathbb{E}_{\tilde{\beta}\sim\tilde{\mu}}\log|1-\tilde{\beta}^{-1}\lambda|\leqslant\max_{\lambda\in[m,M]}\mathbb{E}_{\beta\sim\mu}\log|1-\beta^{-1}\lambda|. To this end, it suffices to observe the elementary inequality

|1λα~||1λα|,λ[m,M],|1-\lambda\tilde{\alpha}|\leqslant|1-\lambda\alpha|\,,\qquad\forall\lambda\in[m,M]\,,

where $\alpha := \beta^{-1} = a+bi \in \mathbb{C}$ is arbitrary and $\tilde{\alpha} := \max(\min(a, m^{-1}), M^{-1})$ is the projection of $\alpha$ onto the interval $[M^{-1}, m^{-1}]$. Indeed, this pointwise inequality implies that we can project $\mu$ onto $\mathcal{M}(E)$ in this way without worsening the objective value of $\mu$. ∎

We now show that any optimal distribution, constrained to this interval, must satisfy the equalization property. For this it is helpful to recall several classical facts from potential theory. These can be found, e.g., on page 171 and in Theorems 3.2, 2.8, 2.2, and 3.1, respectively, of [49].

Lemma A.5 (Basic properties of logarithmic potentials).

If μ()\mu\in\mathcal{M}(\mathbb{C}), then ϕμ\phi_{\mu} is lower semicontinuous on \mathbb{C} and harmonic outside the support of μ\mu.

Lemma A.6 (Continuity of the equilibrium potential).

Consider the setting of Theorem A.1. Then the potential ϕμ\phi_{\mu^{*}} is continuous on \mathbb{C}.

Lemma A.7 (Principle of domination).

Suppose μ,ν()\mu,\nu\in\mathcal{M}(\mathbb{C}) are compactly supported and that I(μ)<I(\mu)<\infty. If for some constant cc, ϕμ(z)ϕν(z)c\phi_{\mu}(z)-\phi_{\nu}(z)\leqslant c holds μ\mu-a.e., then it holds for all zz\in\mathbb{C}.

Lemma A.8 (Maximum principle).

Suppose hh is a harmonic function on a domain Ω\Omega\subset\mathbb{C}. If hh attains a local maximum in Ω\Omega, then hh is constant on Ω\Omega.

Lemma A.9 (Converse to Frostman’s Theorem).

Suppose μ(E)\mu\in\operatorname{\mathcal{M}}(E) satisfies I(μ)<I(\mu)<\infty. If ϕμ(z)=c\phi_{\mu}(z)=c on EE for some constant cc, then μ\mu is the equilibrium distribution for EE.

Proof of Lemma 2.1.

For shorthand, let μ\mu denote the Arcsine(m,M)\operatorname{{\mathrm{Arcsine}}(m,M)} distribution and note that I(μ)<I(\mu)<\infty by Lemma 2.2. Suppose ν([m,M])\nu\in\mathcal{M}([m,M]) has objective value for the stepsize problem (2.9) that is at least as good as μ\mu. Then by the equalization property for μ\mu (Lemma 2.2),

ϕν(0)ϕν(λ)=𝔼βνlog|1β1λ|𝔼βμlog|1β1λ|=ϕμ(0)ϕμ(λ),λ[m,M].\phi_{\nu}(0)-\phi_{\nu}(\lambda)=\mathbb{E}_{\beta\sim\nu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert\leqslant\mathbb{E}_{\beta\sim\mu}\log\left\lvert 1-\beta^{-1}\lambda\right\rvert=\phi_{\mu}(0)-\phi_{\mu}(\lambda)\,,\qquad\forall\lambda\in[m,M]\,.

Re-arranging gives

f(λ)c,λ[m,M],\displaystyle f(\lambda)\leqslant c\,,\qquad\forall\lambda\in[m,M], (A.5)

where f:=ϕμϕνf:=\phi_{\mu}-\phi_{\nu} and c:=f(0)c:=f(0). By the principle of domination (Lemma A.7), this inequality (A.5) holds for all λ\lambda\in\mathbb{C}. Since 0Ω:=[m,M]0\in\Omega:=\mathbb{C}\setminus[m,M], we conclude that ff achieves its maximum on Ω\Omega at 0. But because ff is harmonic on Ω\Omega (Lemma A.5), the Maximum Principle (Lemma A.8) implies that ff is constant on Ω\Omega, i.e.,

f(λ)=c,λ[m,M].\displaystyle f(\lambda)=c,\qquad\forall\lambda\in\mathbb{C}\setminus[m,M]\,. (A.6)

Fix any λ[m,M]\lambda\in[m,M] and let {zn}n\{z_{n}\}_{n\in\mathbb{N}} be any sequence in [m,M]\mathbb{C}\setminus[m,M] converging to λ\lambda. Note that ϕν(λ)lim infnϕν(zn)\phi_{\nu}(\lambda)\leqslant\liminf_{n\to\infty}\phi_{\nu}(z_{n}) by lower semicontinuity (Lemma A.5), and ϕμ(λ)=limnϕμ(zn)\phi_{\mu}(\lambda)=\lim_{n\to\infty}\phi_{\mu}(z_{n}) by continuity of the equilibrium potential (Lemma A.6). Thus by (A.6),

f(λ)lim supnf(zn)=c,λ[m,M].\displaystyle f(\lambda)\geqslant\limsup_{n\to\infty}f(z_{n})=c\,,\qquad\forall\lambda\in[m,M].

Combining this with (A.5) shows that

f(λ)=c,λ[m,M].\displaystyle f(\lambda)=c\,,\qquad\forall\lambda\in[m,M].

Thus $\nu$ has constant logarithmic potential $\phi_\nu(\lambda) = \phi_\mu(\lambda) - c$ on $[m,M]$, and therefore by the converse to Frostman's Theorem (Lemma A.9), it must be the equilibrium distribution, i.e., $\nu = \operatorname{Arcsine}(m,M)$. We conclude that the Arcsine distribution is the unique optimal solution for the stepsize problem (A.1). The optimal value then follows from the equalization property (Lemma 2.2). ∎

Appendix B Deferred proofs

B.1 Equivalence of separability and commutative Hessians

Here we prove an alternative characterization of separability. We remark that this provides a simple way to certify that a function is not separable: it suffices to exhibit points xx and yy for which the Hessians 2f(x)\nabla^{2}f(x) and 2f(y)\nabla^{2}f(y) do not commute.

Lemma B.1 (Equivalence of separability and commutative Hessians).

Let fC2(d)f\in C^{2}(\mathbb{R}^{d}). Then the following are equivalent:

  • (i)

    f(x)=i=1dfi([y]i)f(x)=\sum_{i=1}^{d}f_{i}([y]_{i}) where y=Uxy=Ux for some orthogonal UU and functions fi:f_{i}:\mathbb{R}\to\mathbb{R}.

  • (ii)

    {2f(x)}xd\{\nabla^{2}f(x)\}_{x\in\mathbb{R}^{d}} commute.

Proof of (i) \Longrightarrow (ii). The function h(y):=f(UTy)=i=1dfi([y]i)h(y):=f(U^{T}y)=\sum_{i=1}^{d}f_{i}([y]_{i}) is separable, hence its Hessian 2h\nabla^{2}h is diagonal. The chain rule implies 2f=UT2hU\nabla^{2}f=U^{T}\nabla^{2}hU, thus the Hessians of ff are simultaneously diagonalizable in the basis UU, hence they commute.

Proof of (ii) \Longrightarrow (i). Since the Hessians commute, they are simultaneously diagonalizable, i.e., there exists an orthogonal matrix UU such that 2f(x)=UTD(x)U\nabla^{2}f(x)=U^{T}D(x)U for all xx, where D(x)D(x) is diagonal. Defining h(y):=f(UTy)h(y):=f(U^{T}y), it follows by the chain rule that 2h(y)=D(UTy)\nabla^{2}h(y)=D(U^{T}y) is diagonal. Thus all mixed partial derivatives vanish, hence [y]ih(y)\tfrac{\partial}{\partial[y]_{i}}h(y) is a function only of [y]i[y]_{i}. Integrating implies h(y)=i=1dfi([y]i)h(y)=\sum_{i=1}^{d}f_{i}([y]_{i}) for some functions fif_{i}. Thus f(x)=h(Ux)=i=1dfi([Ux]i)f(x)=h(Ux)=\sum_{i=1}^{d}f_{i}([Ux]_{i}) as desired.
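The commutator certificate mentioned above is easy to implement; the following sketch (with a hypothetical test function chosen purely for illustration) estimates Hessians by finite differences and reports the norm of their commutator.

```python
# Sketch of the non-separability certificate from Lemma B.1: exhibit points x, y
# whose Hessians do not commute. Finite-difference Hessians; the test function
# is a hypothetical illustrative choice.
import numpy as np

def hessian(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x."""
    d = len(x)
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4*h*h)
    return H

f = lambda x: np.log(1 + x[0]**2 * x[1]**2) + x[0]**4   # not separable in any basis
Hx = hessian(f, np.array([1.0, 0.5]))
Hy = hessian(f, np.array([-0.3, 2.0]))
print(np.linalg.norm(Hx @ Hy - Hy @ Hx))                # clearly nonzero
```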

B.2 Proof of Lemma 3.2

We first isolate a useful fact: the random variables $Z_t = \log|1-\beta_t^{-1}\lambda_t|$ defined in §3.1 are uncorrelated, despite not being independent. Intuitively, this is due to the equalization property of the Arcsine distribution (from which the inverse stepsizes $\beta_t$ are independently drawn), since it ensures that each $Z_t$ has the same conditional expectation regardless of the dependencies injected through the non-stationary process $\lambda_t$.

Lemma B.2.

Cov(Zs,Zt)=0\operatorname{\mathrm{Cov}}(Z_{s},Z_{t})=0 for all sts\neq t.

Proof.

Suppose s<ts<t without loss of generality. Then Cov(Zs,Zt)=𝔼β0,,βt[(ZslogRacc)(ZtlogRacc)]=𝔼β0,,βt1[𝔼βt[(ZslogRacc)(ZtlogRacc)|β0,,βt1]]=0\operatorname{\mathrm{Cov}}(Z_{s},Z_{t})=\mathbb{E}_{\beta_{0},\dots,\beta_{t}}[(Z_{s}-\log\operatorname{\mathrm{R_{acc}}})(Z_{t}-\log\operatorname{\mathrm{R_{acc}}})]=\mathbb{E}_{\beta_{0},\dots,\beta_{t-1}}[\mathbb{E}_{\beta_{t}}[(Z_{s}-\log\operatorname{\mathrm{R_{acc}}})(Z_{t}-\log\operatorname{\mathrm{R_{acc}}})\,|\,\beta_{0},\dots,\beta_{t-1}]]=0, where the first step is by the definition of covariance and (3.3), the second step is by the Law of Iterated Expectations, and the third step is because 𝔼βt[Zt|β0,,βt1]=logRacc\mathbb{E}_{\beta_{t}}[Z_{t}\;|\;\beta_{0},\dots,\beta_{t-1}]=\log\operatorname{\mathrm{R_{acc}}} by the equalization property of the Arcsine distribution (Lemma 2.2). ∎

Proof of Lemma 3.2.

First we check that {Mn}n\{M_{n}\}_{n\in\mathbb{N}} is indeed a martingale:

\mathbb{E}[M_{n}\,|\,F_{n-1}] = M_{n-1} + \frac{\mathbb{E}\left[Z_{n-1}-\mathbb{E}[Z_{n-1}]\,\middle|\,F_{n-1}\right]}{n} = M_{n-1}\,.

The final step uses $\mathbb{E}[Z_{n-1}\,|\,F_{n-1}] = \mathbb{E}[Z_{n-1}]$, which follows by an identical argument as in (3.3) above.

Next, we show {Mn}n\{M_{n}\}_{n\in\mathbb{N}} is L1L_{1}-bounded. By the Cauchy-Schwarz inequality,

\mathbb{E}|M_{n}| \leqslant \sqrt{\mathbb{E}[M_{n}^{2}]} = \sqrt{\operatorname{Var}(M_{n})}\,.

Thus it suffices to show that $\operatorname{Var}(M_n)$ is bounded uniformly in $n$. To this end, since $\operatorname{Cov}(Z_s, Z_t) = 0$ for any $s \neq t$ by Lemma B.2, it follows that

Var(Mn)=Var(t=0n1Ztt+1)=t=0n1Var(Ztt+1)Vt=1n1t2Vt=11t2=Vπ26\operatorname{\mathrm{Var}}(M_{n})=\operatorname{\mathrm{Var}}\left(\sum_{t=0}^{n-1}\frac{Z_{t}}{t+1}\right)=\sum_{t=0}^{n-1}\operatorname{\mathrm{Var}}\left(\frac{Z_{t}}{t+1}\right)\leqslant V\sum_{t=1}^{n}\frac{1}{t^{2}}\leqslant V\sum_{t=1}^{\infty}\frac{1}{t^{2}}=V\frac{\pi^{2}}{6}

where $V := \sup_{\lambda\in[m,M]}\operatorname{Var}_{\beta\sim\operatorname{Arcsine}(m,M)}(\log|1-\beta^{-1}\lambda|)$ is a uniform upper bound on the variance of $Z_t$. Note that $V < \infty$ because the logarithmic singularity is integrable (see also Appendix 5 of [32] for quantitative bounds on $V$). Thus $M_n$ has uniformly bounded variance, as desired. ∎
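The decorrelation in Lemma B.2 is easy to observe empirically; in the following sketch (an illustrative adapted process, not from the paper), $\lambda_1$ is chosen to depend strongly on $\beta_0$, yet $Z_0$ and $Z_1$ remain uncorrelated.

```python
# Monte Carlo sketch of Lemma B.2 (illustrative adapted process): even when
# lambda_1 depends on the past stepsize beta_0, equalization makes Z_0 and Z_1
# uncorrelated.
import numpy as np

rng = np.random.default_rng(5)
m, M, N = 1.0, 100.0, 10**6
u = rng.uniform(size=(N, 2))
beta = m + (M - m) * np.sin(np.pi * u / 2) ** 2   # beta_0, beta_1 i.i.d. Arcsine(m, M)

lam0 = 7.0                                        # lambda_0: deterministic (illustrative)
lam1 = beta[:, 0]                                 # lambda_1 in [m, M], adapted to beta_0
Z0 = np.log(np.abs(1 - lam0 / beta[:, 0]))
Z1 = np.log(np.abs(1 - lam1 / beta[:, 1]))
print(np.corrcoef(Z0, Z1)[0, 1])                  # ~ 0 despite the dependence
```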

B.3 Details for §5.1

Here we observe that random stepsizes can make GD unstable under alternative notions of convergence. Instability in the following lemma occurs for $\kappa > 4$ since then $\sqrt{\kappa} - 1 > 1$. See §5.1 for a discussion of this phenomenon and why the notion of convergence studied in the rest of this paper is more relevant to optimization practice.

Lemma B.3 (Instability in alternative notions of convergence).

Consider GD with i.i.d. inverse stepsizes from Arcsine(m,M)\operatorname{{\mathrm{Arcsine}}(m,M)}. There is an mm-strongly convex, MM-smooth, quadratic function ff such that for any initialization x0xx_{0}\neq x^{*} and any number of iterations nn,

𝔼[xnxx0x]1/nκ1.\displaystyle\mathbb{E}\Bigg{[}\frac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\Bigg{]}^{1/n}\geqslant\sqrt{\kappa}-1\,. (B.1)

To prove this, we make use of the following integral representation of the geometric mean of two positive scalars a,b>0a,b>0 in terms of the weighted harmonic mean Ht(a,b):=(ta1+(1t)b1)1H_{t}(a,b):=(ta^{-1}+(1-t)b^{-1})^{-1}. This identity is well-known (e.g., it follows from [7, Exercise V.4.20], or directly from [25, §3.121-2] by taking q=1/bq=1/b and p=1/b1/ap=1/b-1/a).

Lemma B.4 (Arcsine integral representation of the geometric mean).

For any a,b>0a,b>0,

𝔼tArcsine(0,1)[Ht(a,b)]=ab.\displaystyle\mathbb{E}_{t\sim\text{\emph{Arcsine(0,1)}}}\big{[}H_{t}(a,b)\big{]}=\sqrt{ab}\,.

This lets us compute the expectation of the proposed random stepsizes.

Lemma B.5 (Expectation of proposed random stepsizes).

For any 0<m<M<0<m<M<\infty,

\mathbb{E}_{\beta\sim\operatorname{Arcsine}(m,M)}\big[\beta^{-1}\big] = \frac{1}{\sqrt{Mm}}\,.
Proof.

Re-parameterize β=tM+(1t)m\beta=tM+(1-t)m so that β1=Ht(M1,m1)\beta^{-1}=H_{t}(M^{-1},m^{-1}). Since βArcsine(m,M)\beta\sim\mathrm{Arcsine}(m,M) implies tArcsine(0,1)t\sim\mathrm{Arcsine}(0,1), applying Lemma B.4 provides the desired identity

𝔼βArcsine(m,M)[β1]=𝔼tArcsine(0,1)[Ht(M1,m1)]=1Mm.\displaystyle\mathbb{E}_{\beta\sim\operatorname{{\mathrm{Arcsine}}(m,M)}}\big{[}\beta^{-1}\big{]}=\mathbb{E}_{t\sim\mathrm{Arcsine}(0,1)}\big{[}H_{t}(M^{-1},m^{-1})\big{]}=\frac{1}{\sqrt{Mm}}\,.
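Both identities admit a one-line numerical check (a sanity check only; the constants below are arbitrary choices):

```python
# Monte Carlo sanity check of Lemmas B.4 and B.5 (arbitrary constants): the
# Arcsine average of weighted harmonic means recovers the geometric mean.
import numpy as np

rng = np.random.default_rng(6)
t = np.sin(np.pi * rng.uniform(size=10**7) / 2) ** 2   # t ~ Arcsine(0, 1)

a, b = 0.25, 9.0
H = 1 / (t / a + (1 - t) / b)                          # weighted harmonic mean H_t(a, b)
print(H.mean(), np.sqrt(a * b))                        # Lemma B.4: both ~ 1.5

m, M = 1.0, 100.0
beta = t * M + (1 - t) * m                             # beta ~ Arcsine(m, M)
print(np.mean(1 / beta), 1 / np.sqrt(M * m))           # Lemma B.5: both ~ 0.1
```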

We are now ready to prove Lemma B.3.

Proof of Lemma B.3.

Let βt\beta_{t} denote the inverse stepsizes; recall that these are i.i.d. Arcsine(m,M)\operatorname{{\mathrm{Arcsine}}(m,M)}. By expanding the definition of GD for the function f(x)=M2x2f(x)=\tfrac{M}{2}x^{2},

xn=t=0n1(1βt1M)x0.\displaystyle x_{n}=\prod_{t=0}^{n-1}(1-\beta_{t}^{-1}M)x_{0}\,.

By taking expectations, using independence of the stepsizes, and applying Lemma B.5,

𝔼[xn]=(1κ)nx0.\displaystyle\mathbb{E}[x_{n}]=(1-\sqrt{\kappa})^{n}x_{0}\,.

Thus by Jensen’s inequality and then the fact that κ>1\kappa>1, we conclude the desired inequality

𝔼[xnxx0x]1/n(𝔼xnxx0x)1/n=κ1.\mathbb{E}\Bigg{[}\frac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\Bigg{]}^{1/n}\geqslant\Bigg{(}\frac{\|\mathbb{E}x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|}\Bigg{)}^{1/n}\ =\sqrt{\kappa}-1.

B.4 Proof of Theorem 5.2

For shorthand, let V:=supλ[m,M]VarβArcsine(m,M)(log|1β1λ|)V:=\sup_{\lambda\in[m,M]}\operatorname{\mathrm{Var}}_{\beta\sim\operatorname{{\mathrm{Arcsine}}(m,M)}}\left(\log\left\lvert 1-\beta^{-1}\lambda\right\rvert\right) denote a uniform upper bound on the variance of any ZtZ_{t}. By the integral calculations in [32, Appendix 5],

Vπ2+oκ(1).\displaystyle V\leqslant\pi^{2}+o_{\kappa}(1)\,.

Thus because Cov(Zs,Zt)=0\operatorname{\mathrm{Cov}}(Z_{s},Z_{t})=0 for sts\neq t by Lemma B.2,

Var(1nt=0n1Zt)=1n2t=0n1Var(Zt)Vnπ2+oκ(1)n.\displaystyle\operatorname{\mathrm{Var}}\left(\frac{1}{n}\sum_{t=0}^{n-1}Z_{t}\right)=\frac{1}{n^{2}}\sum_{t=0}^{n-1}\operatorname{\mathrm{Var}}\left(Z_{t}\right)\leqslant\frac{V}{n}\leqslant\frac{\pi^{2}+o_{\kappa}(1)}{n}\,.

Thus by Chebyshev’s inequality and the equalization property of the Arcsine distribution (Lemma 2.2),

(1nt=0n1log|1βt1λt|logRacc+ε)π2+oκ(1)nε2,\displaystyle\mathbb{P}\left(\frac{1}{n}\sum_{t=0}^{n-1}\log|1-\beta_{t}^{-1}\lambda_{t}|\geqslant\log\operatorname{\mathrm{R_{acc}}}+\varepsilon\right)\leqslant\frac{\pi^{2}+o_{\kappa}(1)}{n\varepsilon^{2}}\,, (B.2)

where βt\beta_{t} are i.i.d. Arcsine(m,M)\operatorname{{\mathrm{Arcsine}}(m,M)} and λt[m,M]\lambda_{t}\in[m,M] is any process adapted to the filtration σ({βs}s<t)\sigma(\{\beta_{s}\}_{s<t}). Thus by (3.2), the average convergence rate Rn,i:=(|[xn]i[x]i||[x0]i[x]i|)1/nR_{n,i}:=(\tfrac{|[x_{n}]_{i}-[x^{*}]_{i}|}{|[x_{0}]_{i}-[x^{*}]_{i}|})^{1/n} for the function fif_{i} satisfies

(Rn,ieεRacc)π2+oκ(1)nε2\displaystyle\mathbb{P}\left(R_{n,i}\geqslant e^{\varepsilon}\,\operatorname{\mathrm{R_{acc}}}\right)\leqslant\frac{\pi^{2}+o_{\kappa}(1)}{n\varepsilon^{2}} (B.3)

for any coordinate iS:={i[d]:[x0]i[x]i}i\in S:=\{i\in[d]:[x_{0}]_{i}\neq[x^{*}]_{i}\}, i.e., for any ii for which the initialization is not already optimal. Thus the overall convergence rate Rn:=(xnxx0x)1/nR_{n}:=(\tfrac{\|x_{n}-x^{*}\|}{\|x_{0}-x^{*}\|})^{1/n} for ff satisfies

(RneεRacc)(maxiSRn,ieεRacc)d(π2+oκ(1))nε2.\displaystyle\mathbb{P}\left(R_{n}\geqslant e^{\varepsilon}\,\operatorname{\mathrm{R_{acc}}}\right)\leqslant\mathbb{P}\left(\max_{i\in S}R_{n,i}\geqslant e^{\varepsilon}\,\operatorname{\mathrm{R_{acc}}}\right)\leqslant\frac{d(\pi^{2}+o_{\kappa}(1))}{n\varepsilon^{2}}\,. (B.4)

Above the first step is because RnmaxiSRn,iR_{n}\leqslant\max_{i\in S}R_{n,i} (since Rn2=iS|[xn]i[x]i|2iS|[x0]i[x]i|2maxiS|[xn]i[x]i|2|[x0]i[x]i|2=maxiSRn,i2R_{n}^{2}=\tfrac{\sum_{i\in S}|[x_{n}]_{i}-[x^{*}]_{i}|^{2}}{\sum_{i\in S}|[x_{0}]_{i}-[x^{*}]_{i}|^{2}}\leqslant\max_{i\in S}\tfrac{|[x_{n}]_{i}-[x^{*}]_{i}|^{2}}{|[x_{0}]_{i}-[x^{*}]_{i}|^{2}}=\max_{i\in S}R_{n,i}^{2}). The second step is by a union bound and the previous display. The proof is then complete by setting δ\delta equal to the right hand side of the above display.

B.5 Details for §6.1

Proof of Theorem 6.1.

The proof is similar to that of Theorem 1.2, so we just highlight the differences. By the same argument as in §3.2, it suffices to show the result for univariate objectives ff, so we henceforth restrict to this case. For shorthand, denote Et:=(g~tf(xt))/(xtx)E_{t}:=(\tilde{g}_{t}-f^{\prime}(x_{t}))/(x_{t}-x^{*}) and note that the multiplicative error assumption (6.1) implies |Et|ε:=Mε|E_{t}|\leqslant\varepsilon^{\prime}:=M\varepsilon by smoothness. By an identical calculation as in §3.1,

xt+1x=xtαtg~tx=(1αt(λt+Et))(xtx),x_{t+1}-x^{*}=x_{t}-\alpha_{t}\tilde{g}_{t}-x^{*}=(1-\alpha_{t}(\lambda_{t}+E_{t}))(x_{t}-x^{*})\,,

and thus

xnx=t=0n1(1αt(λt+Et))(x0x),x_{n}-x^{*}=\prod_{t=0}^{n-1}(1-\alpha_{t}(\lambda_{t}+E_{t}))(x_{0}-x^{*})\,,

and so in particular

log(|xnx||x0x|)1/n=1nt=0n1log|1αt(λt+Et)|.\log\left(\frac{|x_{n}-x^{*}|}{|x_{0}-x^{*}|}\right)^{1/n}=\frac{1}{n}\sum_{t=0}^{n-1}\log\left\lvert 1-\alpha_{t}(\lambda_{t}+E_{t})\right\rvert\,.

By a similar martingale-based Law of Large Numbers argument as in §3.1, it suffices to compute

maxλ~[mε,M+ε]𝔼βArcsine(m,M)log|1β1λ~|\displaystyle\max_{\tilde{\lambda}\in[m-\varepsilon^{\prime},M+\varepsilon^{\prime}]}\mathbb{E}_{\beta\sim\operatorname{{\mathrm{Arcsine}}(m,M)}}\log\left\lvert 1-\beta^{-1}\tilde{\lambda}\right\rvert (B.5)

and in particular to show that this is equal to the logarithm of the upper bound in (6.3). Above, $\tilde{\lambda}$ plays the role of $\lambda + E$ where $\lambda \in [m,M]$ and $E \in [-\varepsilon', \varepsilon']$. By the same calculation as in Appendix A.3.1, this expectation simplifies to

𝔼βArcsine(m,M)log|1β1λ~|=logRacclog(|z|z21)𝟙z[1,1],\displaystyle\mathbb{E}_{\beta\sim\operatorname{{\mathrm{Arcsine}}(m,M)}}\log\left\lvert 1-\beta^{-1}\tilde{\lambda}\right\rvert=\log\operatorname{\mathrm{R_{acc}}}-\log(|z|-\sqrt{z^{2}-1})\cdot\mathds{1}_{z\notin[-1,1]}\,, (B.6)

where z:=2Mm(λ~M+m2)z:=\tfrac{2}{M-m}(\tilde{\lambda}-\tfrac{M+m}{2}). This quantity is increasing in |z||z|, thus (B.5) is maximized at both extremes λ~{mε,M+ε}\tilde{\lambda}\in\{m-\varepsilon^{\prime},M+\varepsilon^{\prime}\} which lead to the same maximal value of |z|=1+2εMm=1+ε~|z|=1+\tfrac{2\varepsilon^{\prime}}{M-m}=1+\tilde{\varepsilon}. Plugging this into (B.5) yields the desired bound.

Finally, for tightness, note that the analysis is tight at the aforementioned extremes λ=m,E=ε\lambda=m,E=-\varepsilon^{\prime} and λ=M,E=ε\lambda=M,E=\varepsilon^{\prime}. These two extremes respectively correspond to functions f(x)=mx22f(x)=\tfrac{mx^{2}}{2} and f(x)=Mx22f(x)=\tfrac{Mx^{2}}{2}, with estimates g~t\tilde{g}_{t} that maximally underestimate or overestimate the true gradient. ∎

References

  • Agarwal et al. [2021] Naman Agarwal, Surbhi Goel, and Cyril Zhang. Acceleration via fractal learning rate schedules. In International Conference on Machine Learning, pages 87–99. PMLR, 2021.
  • Altschuler [2018] Jason M. Altschuler. Greed, hedging, and acceleration in convex optimization. Master’s thesis, Massachusetts Institute of Technology, 2018.
  • Altschuler and Parrilo [2023] Jason M Altschuler and Pablo A Parrilo. Acceleration by Stepsize Hedging I: Multi-Step Descent and the Silver Stepsize Schedule. arXiv preprint arXiv:2309.07879, 2023.
  • Altschuler and Parrilo [2024] Jason M Altschuler and Pablo A Parrilo. Acceleration by Stepsize Hedging: Silver Stepsize Schedule for Smooth Convex Optimization. Mathematical Programming, pages 1–14, 2024.
  • Barzilai and Borwein [1988] Jonathan Barzilai and Jonathan M Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
  • Bertsekas [1999] Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
  • Bhatia [2013] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business Media, 2013.
  • Bok and Altschuler [2024] Jinho Bok and Jason M. Altschuler. Accelerating proximal gradient descent via silver stepsizes. arXiv preprint, 2024.
  • Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Bubeck et al. [2015] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
  • Daccache [2019] Antoine Daccache. Performance estimation of the gradient method with fixed arbitrary step sizes. Master's thesis, Université Catholique de Louvain, 2019.
  • Das Gupta et al. [2023] Shuvomoy Das Gupta, Bart PG Van Parys, and Ernest K Ryu. Branch-and-bound performance estimation programming: a unified methodology for constructing optimal optimization methods. Mathematical Programming, pages 1–73, 2023.
  • d’Aspremont [2008] Alexandre d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
  • De Klerk et al. [2017] Etienne De Klerk, François Glineur, and Adrien B Taylor. On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optimization Letters, 11:1185–1199, 2017.
  • De Klerk et al. [2020] Etienne De Klerk, Francois Glineur, and Adrien B Taylor. Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation. SIAM Journal on Optimization, 30(3):2053–2082, 2020.
  • Dembo [2009] Amir Dembo. Large deviations techniques and applications. Springer, 2009.
  • Devolder et al. [2014] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146:37–75, 2014.
  • Driscoll et al. [1998] Tobin A Driscoll, Kim-Chuan Toh, and Lloyd N Trefethen. From potential theory to matrix iterations in six steps. SIAM Review, 40(3):547–578, 1998.
  • Drori and Teboulle [2014] Yoel Drori and Marc Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1-2):451–482, 2014.
  • Drusvyatskiy et al. [2018] Dmitriy Drusvyatskiy, Maryam Fazel, and Scott Roy. An optimal first order method based on optimal quadratic averaging. SIAM Journal on Optimization, 28(1):251–271, 2018.
  • d’Aspremont et al. [2021] Alexandre d’Aspremont, Damien Scieur, and Adrien Taylor. Acceleration methods. Foundations and Trends® in Optimization, 5(1-2):1–245, 2021.
  • Eloi [2022] Diego Eloi. Worst-case functions for the gradient method with fixed variable step sizes. Master's thesis, Université Catholique de Louvain, 2022.
  • Goujaud et al. [2022] Baptiste Goujaud, Damien Scieur, Aymeric Dieuleveut, Adrien B Taylor, and Fabian Pedregosa. Super-acceleration with cyclical step-sizes. In International Conference on Artificial Intelligence and Statistics, pages 3028–3065. PMLR, 2022.
  • Gradshteyn and Ryzhik [2014] Izrail Solomonovich Gradshteyn and Iosif Moiseevich Ryzhik. Table of integrals, series, and products. Academic Press, 2014.
  • Grimmer [2024] Benjamin Grimmer. Provably faster gradient descent via long steps. SIAM Journal on Optimization, 34(3):2588–2608, 2024.
  • Grimmer et al. [2023] Benjamin Grimmer, Kevin Shu, and Alex L. Wang. Accelerated gradient descent via long steps. arXiv preprint arXiv:2309.09961, 2023.
  • Grimmer et al. [2024a] Benjamin Grimmer, Kevin Shu, and Alex Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. arXiv preprint arXiv:2403.14045, 2024a.
  • Grimmer et al. [2024b] Benjamin Grimmer, Kevin Shu, and Alex L Wang. Composing optimized stepsize schedules for gradient descent. arXiv preprint arXiv:2410.16249, 2024b.
  • Hazan [2016] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Hestenes and Stiefel [1952] Magnus R Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
  • Kalousek [2017] Zdeněk Kalousek. Steepest descent method with random step lengths. Foundations of Computational Mathematics, 17(2):359–422, 2017.
  • Kelley [1999] Carl T Kelley. Iterative methods for optimization. SIAM, 1999.
  • Kelner et al. [2022] Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, and Honglin Yuan. Big-step-little-step: Efficient gradient methods for objectives with multiple scales. In Conference on Learning Theory, pages 2431–2540. PMLR, 2022.
  • Lan [2020] Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020.
  • Lebedev and Finogenov [1971] VI Lebedev and SA Finogenov. Ordering of the iterative parameters in the cyclical Chebyshev iterative method. USSR Computational Mathematics and Mathematical Physics, 11(2):155–170, 1971.
  • Lessard et al. [2016] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
  • Luenberger and Ye [1984] David G Luenberger and Yinyu Ye. Linear and nonlinear programming, volume 2. Springer, 1984.
  • Nemirovskii and Yudin [1983] Arkadii Nemirovskii and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
  • Nesterov [1983] Yurii E. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^{2})$. Soviet Math. Dokl., 27(2):372–376, 1983.
  • Nesterov [1998] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 1998.
  • Nocedal and Wright [1999] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
  • Oymak [2021] Samet Oymak. Provable super-convergence with a large cyclical learning rate. IEEE Signal Processing Letters, 28:1645–1649, 2021.
  • Polyak [1964] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Polyak [1971] Boris T Polyak. Convergence of methods of feasible directions in extremal problems. USSR Computational Mathematics and Mathematical Physics, 11(4):53–70, 1971.
  • Polyak [1987] Boris T Polyak. Introduction to optimization. Optimization Software, Inc., 1987.
  • Pronzato and Zhigljavsky [2011] Luc Pronzato and Anatoly Zhigljavsky. Gradient algorithms for quadratic optimization with fast convergence rates. Computational Optimization and Applications, 50(3):597–617, 2011.
  • Ransford [1995] Thomas Ransford. Potential theory in the complex plane, volume 28. Cambridge University Press, 1995.
  • Saff [2010] Edward B Saff. Logarithmic potential theory with applications to approximation theory. Surveys in Approximation Theory, 5:165–200, 2010.
  • Saff and Totik [2013] Edward B Saff and Vilmos Totik. Logarithmic potentials with external fields, volume 316. Springer Science & Business Media, 2013.
  • Schmidt et al. [2011] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems, 24, 2011.
  • Taylor and Drori [2023] Adrien Taylor and Yoel Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557–594, 2023.
  • Taylor et al. [2017] Adrien B Taylor, Julien M Hendrickx, and François Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
  • Tsuji [1959] Masatsugu Tsuji. Potential theory in modern function theory. Maruzen, 1959.
  • Van Handel [2014] Ramon Van Handel. Probability in high dimension. Lecture Notes (Princeton University), 2014.
  • Van Scoy et al. [2018] Bryan Van Scoy, Randy A Freeman, and Kevin M Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
  • Vershynin [2018] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
  • Wang et al. [2024] Bofan Wang, Shiqian Ma, Junfeng Yang, and Danqing Zhou. Relaxed proximal point algorithm: Tight complexity bounds and acceleration without momentum. arXiv preprint arXiv:2410.08890, 2024.
  • Williams [1991] David Williams. Probability with martingales. Cambridge University Press, 1991.
  • Yao [1977] Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In Symposium on Foundations of Computer Science (FOCS), pages 222–227. IEEE, 1977.
  • Young [1953] David Young. On Richardson’s method for solving linear systems with positive definite matrices. Journal of Mathematics and Physics, 32(1-4):243–255, 1953.
  • Zhang and Jiang [2024] Zehao Zhang and Rujun Jiang. Accelerated gradient descent by concatenation of stepsize schedules. arXiv preprint arXiv:2410.12395, 2024.
  • Zhigljavsky et al. [2013] Anatoly Zhigljavsky, Luc Pronzato, and Elena Bukina. An asymptotically optimal gradient algorithm for quadratic optimization with low computational cost. Optimization Letters, 7(6):1047–1059, 2013.