Adaptive Importance Sampling for Finite-Sum Optimization and Sampling with Decreasing Step-Sizes
Abstract
Reducing the variance of the gradient estimator is known to improve the convergence rate of stochastic gradient-based optimization and sampling algorithms. One way of achieving variance reduction is to design importance sampling strategies. Recently, the problem of designing such schemes was formulated as an online learning problem with bandit feedback, and algorithms with sub-linear static regret were designed. In this work, we build on this framework and propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes. Under standard technical conditions, we show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when run with $\mathcal{O}(1/t)$ step-sizes. We achieve this dynamic regret bound by leveraging our knowledge of the dynamics defined by the algorithm, and by combining ideas from online learning and variance-reduced stochastic optimization. We validate empirically the performance of our algorithm and identify settings in which it leads to significant improvements.
1 Introduction
Functions of the form:
$F(x) := \sum_{i=1}^{N} f_i(x) \qquad (1)$
are prevalent in modern machine learning and statistics. Important examples include the empirical risk in the empirical risk minimization framework (up to a normalizing factor of $1/N$, which does not affect the optimization), and the log-posterior of an exchangeable Bayesian model. When $N$ is large, the preferred methods for solving the resulting optimization or sampling problem usually rely on stochastic estimates of the gradient of $F$, using variants of stochastic gradient descent (SGD) [1] or stochastic gradient Langevin dynamics (SGLD) [2]:
$x_{t+1} = x_t - \eta_t\, N \nabla f_{i_t}(x_t) \qquad (2)$
$x_{t+1} = x_t - \eta_t\, N \nabla f_{i_t}(x_t) + \sqrt{2\eta_t}\,\xi_t, \quad \xi_t \sim \mathcal{N}(0, I) \qquad (3)$
where $(\eta_t)_{t \geq 1}$ is a sequence of step-sizes, and the index $i_t$ is sampled uniformly from $\{1, \dots, N\}$, making $N \nabla f_{i_t}(x_t)$ an unbiased estimator of the gradient of $F$. We use $(x_t)_{t \geq 1}$ to refer to either sequence when we do not wish to distinguish between them. It is well known that the quality of the answers given by these algorithms depends on the (trace of the) variance of the gradient estimator, and considerable efforts have been made to design methods that reduce this variance.
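For concreteness, here is a minimal sketch of one step of each recursion under uniform sampling. It assumes the convention $F(x) = \sum_i f_i(x)$ used above, so the unbiased gradient estimate is $N \nabla f_{i_t}(x_t)$; the function and variable names (`grad_f`, `eta`) are illustrative, not part of our released code.

```python
import numpy as np

def sgd_step(x, t, grad_f, N, eta, rng):
    """One SGD step (2): move along an unbiased single-sample gradient estimate."""
    i = rng.integers(N)                      # uniform index
    g = N * grad_f(i, x)                     # unbiased estimate of grad F(x)
    return x - eta(t) * g

def sgld_step(x, t, grad_f, N, eta, rng):
    """One SGLD step (3): an SGD step plus Gaussian noise of scale sqrt(2 * eta_t)."""
    i = rng.integers(N)
    g = N * grad_f(i, x)
    noise = rng.standard_normal(x.shape)
    return x - eta(t) * g + np.sqrt(2.0 * eta(t)) * noise
```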
In this paper, we focus on the importance sampling approach to achieving variance reduction. At each iteration, the algorithm samples an index $i_t$ according to a specified distribution $p_t$ over $\{1, \dots, N\}$, and estimates the gradient using:
$\hat{g}_t := \frac{\nabla f_{i_t}(x_t)}{p_t^{i_t}}, \qquad i_t \sim p_t \qquad (4)$
It is immediate to verify that $\hat{g}_t$ is an unbiased estimator of $\nabla F(x_t)$. By cleverly choosing the distributions $p_t$, one can achieve significant variance reduction (up to a factor of $N$) compared to the estimator based on uniform sampling. Unfortunately, computing the variance-minimizing distribution requires knowledge of the Euclidean norms of all $N$ individual gradients at each iteration, making it impractical [3, 4].
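As a sketch of the estimator (4) (again assuming $F(x) = \sum_i f_i(x)$, with illustrative quadratic $f_i$ and hypothetical helper names), together with a quick Monte Carlo check of its unbiasedness:

```python
import numpy as np

def is_gradient_estimate(x, p, grad_f, rng):
    """Importance-sampled gradient estimate (4): draw i ~ p, return grad f_i(x) / p_i."""
    i = rng.choice(len(p), p=p)
    return grad_f(i, x) / p[i]

rng = np.random.default_rng(0)
N, d = 50, 3
A = rng.standard_normal((N, d))
grad_f = lambda i, x: A[i] * (A[i] @ x)          # f_i(x) = 0.5 * (a_i^T x)^2
x = rng.standard_normal(d)
p = rng.uniform(0.5, 1.5, N); p /= p.sum()       # any distribution with p_i > 0
full_grad = sum(grad_f(i, x) for i in range(N))  # grad F(x)
est = np.mean([is_gradient_estimate(x, p, grad_f, rng) for _ in range(100_000)], axis=0)
print(np.linalg.norm(est - full_grad) / np.linalg.norm(full_grad))  # small Monte Carlo error
```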
Many methods have been proposed that attempt to construct sequences of distributions that result in efficient estimators [3, 4, 5, 6, 7, 8, 9]. Of particular interest to us, the task of designing such sequences was recently cast as an online learning problem with bandit feedback [10, 11, 12, 13]. In this formulation, one attempts to design algorithms with sub-linear expected static regret, defined as:
$\mathrm{Regret}_T := \sum_{t=1}^{T} \mathbb{E}\big[c_t(p_t)\big] - \min_{p \in \Delta^N} \sum_{t=1}^{T} \mathbb{E}\big[c_t(p)\big]$
where $\Delta^N$ denotes the probability simplex in $\mathbb{R}^N$, and $c_t$ is the trace of the covariance matrix of the gradient estimator (4), which is easily shown to be:
$c_t(p) = \sum_{i=1}^{N} \frac{\|\nabla f_i(x_t)\|^2}{p^i} - \|\nabla F(x_t)\|^2 \qquad (5)$
Note that the second term cancels in the definition of regret, and we omit it in the rest of our discussion. In this formulation, and to keep the computational load manageable, one only has access to partial feedback in the form of the norm of the sampled gradient, $\|\nabla f_{i_t}(x_t)\|$, and not to the complete cost function (5). Under the assumption of uniformly bounded gradients, the best result in this category can be found in [12], where an algorithm with provably sub-linear static regret is proposed. A more difficult but more natural performance measure, which makes the attempt to approximate the optimal distributions explicit, is the dynamic regret, defined by:
$\mathrm{D\text{-}Regret}_T := \sum_{t=1}^{T} \mathbb{E}\Big[c_t(p_t) - \min_{p \in \Delta^N} c_t(p)\Big] \qquad (6)$
Guarantees with respect to the expected dynamic regret are more difficult to obtain, and require that the cost functions do not change too rapidly with respect to some reasonable measure of variation. See [14, 15, 16, 17, 18] for examples of such measures and the corresponding regret bounds for general convex cost functions.
In this work, we propose Avare, an algorithm that achieves sub-linear dynamic regret for both SGD and SGLD when the sequence of step-sizes is decreasing. The name Avare is derived from adaptive variance minimization. Specifically, our contributions are as follows:
- We show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when $\eta_t$ is $\mathcal{O}(1/t)$.
- We propose a new mini-batch estimator that combines the benefits of sampling without replacement and importance sampling while preserving unbiasedness.
- We validate empirically the performance of our algorithm and identify settings in which it leads to significant improvements.
We would like to point out that while the decreasing step size requirement might seem restrictive, we argue that for SGD and SGLD, it is the right setting to consider for variance reduction. Indeed, it is well known that under suitable technical conditions, both algorithms converge to their respective solutions exponentially fast in the early stages. Variance reduction is primarily useful at later stages when the noise from the stochastic gradient dominates. In the absence of control variates, one is forced to use decreasing step-sizes to achieve convergence. This is precisely the regime we consider.
2 Related work
It is easy to see that the cost functions (5) are convex over the probability simplex. A celebrated algorithm for convex optimization over the simplex is entropic descent [19], an instance of mirror descent [20] where the Bregman divergence is taken to be the relative entropy. A slight modification of this algorithm for online learning is the EXP3 algorithm [21] which mixes the iterates of entropic descent with a uniform distribution to avoid the assignment of null probabilities. See [22, 23, 24] for a more thorough discussion of online learning (also called online convex optimization) in general and variants of this algorithm in particular.
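This is not our method, but to make the baseline update concrete, here is a minimal sketch of one entropic-descent (exponentiated-gradient) step mixed with the uniform distribution in the spirit of EXP3; the step size and mixing coefficient are illustrative.

```python
import numpy as np

def entropic_descent_step(p, grad, eta, gamma):
    """Mirror-descent step with the relative entropy as Bregman divergence,
    followed by mixing with the uniform distribution to keep every
    probability bounded away from zero (EXP3-style)."""
    w = p * np.exp(-eta * (grad - grad.min()))   # shift exponent for numerical stability
    w /= w.sum()
    N = len(p)
    return (1.0 - gamma) * w + gamma / N
```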
Since we are working over the simplex, the EXP3 algorithm is a natural choice. This is the approach taken in [10] and [11], although strictly speaking neither is able to prove sub-linear static regret bounds. The difficulty comes from the fact that the norm of the gradients of the cost functions (5) explodes to infinity on the boundary of the simplex. This is amplified by the use of stochastic gradients, whose expected magnitude grows as the iterates approach the boundary, making it very difficult to reach regions near the boundary. Algorithms based on entropic descent for dynamic regret problems also exist, including the fixed share algorithm and projected mirror descent [25, 26, 27, 28]. Building on these algorithms, and by artificially making the cost functions strongly convex to allow the use of decreasing step-sizes, we were only able to obtain a weaker dynamic regret bound using projected mirror descent for SGD with decreasing step-sizes and uniformly bounded gradients.
The approach taken in [12] is qualitatively different and is based on the follow-the-regularized-leader (FTRL) scheme. By solving the successive minimization problems stemming from the FTRL scheme analytically, the authors avoid the above-mentioned issue of exploding gradients. The rest of their analysis relies on constructing an unbiased estimate of the cost functions (5) using only the partial feedback, and probabilistically bounding the deviation of the estimated cost functions from the true cost functions using a martingale concentration inequality. The final result is an algorithm that enjoys a sub-linear static regret bound.
Our approach is similar to the one taken in [12] in that we work directly with the cost functions and not their gradients to avoid the exploding gradient issue. Beyond this point however, our analysis is substantially different. While we are still working within the online learning framework, our analysis is more closely related to the analysis of variance-reduced stochastic optimization algorithms that are based on control variates. In particular, we study a Lyapunov-like functional and show that it decreases along the trajectory of the dynamics defined by the algorithms. This yields simpler and more concise proofs, and opens the door for a unified analysis.
3 Algorithm
Most of the literature on online convex optimization with dynamic regret relies on the use of the gradients of the cost functions [14, 15, 16, 17, 18]. However, as explained in the previous section, such approaches are not viable for our problem. Unfortunately, the regret guarantees obtained from the FTRL scheme for static regret used in [12] do not directly translate into guarantees for dynamic regret.
We start by presenting the high level ideas that go into the construction of our algorithm. We then present our algorithm in explicit form, and discuss its implementation and computational complexity.
3.1 High level ideas
The simplest update rule one might consider when working with dynamic regret is a natural extension of the follow-the-leader approach:
$p_{t+1} \in \operatorname*{argmin}_{p \in \Delta^N} c_t(p) \qquad (7)$
Intuitively, if the cost functions do not change too quickly, then it is reasonable to expect good performance from this algorithm. In our case however, we do not have access to the full cost function. The traditional way of dealing with this problem is to build unbiased estimates of the cost functions using the bandit feedback, and then bounding the deviation of these unbiased estimates from the true cost functions. While this might work, we consider here a different approach based on constructing surrogate cost functions that are not necessarily unbiased estimates of the true costs.
For each $i \in \{1, \dots, N\}$, denote by $g_t^i$ the last observed gradient of $f_i$ at time $t$, with $g_0^i$ initialized arbitrarily (to $0$ for example). We consider the following surrogate cost for all $p \in \Delta^N$:
$\hat{c}_t(p) = \sum_{i=1}^{N} \frac{\|g_t^i\|^2}{p^i} \qquad (8)$
As successive iterates of the algorithm become closer and closer to each other, the squared norms of the $g_t^i$ become better and better approximations of the squared norms of the $\nabla f_i(x_t)$, thereby making the surrogate costs (8) accurate approximations of the true costs (5). The idea of using previously seen gradients to approximate current gradients is inspired by the celebrated variance-reduced stochastic optimization algorithm SAGA [29].
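As an illustration, the bookkeeping needed for the surrogate cost (8) is minimal: only the squared norm of the last observed gradient of each component is stored. A sketch (class and variable names are ours, not the released code):

```python
import numpy as np

class SurrogateCost:
    """Stores ||g_t^i||^2, the squared norm of the last observed gradient of f_i."""
    def __init__(self, N, init_sq_norm=1.0):
        self.sq_norms = np.full(N, init_sq_norm)   # arbitrary initialization

    def observe(self, i, grad_i):
        """Record the gradient of f_i just computed by the optimization algorithm."""
        self.sq_norms[i] = float(np.dot(grad_i, grad_i))

    def value(self, p):
        """Surrogate cost (8), with the constant term omitted."""
        return float(np.sum(self.sq_norms / p))
```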
We are now almost ready to formulate our algorithm. If we try to directly minimize the surrogate cost functions over the entire simplex at each iteration as in (7), we might end up assigning null or close to null probabilities to certain indices. Depending on how far we are in the algorithm, this might or might not be a problem. In the initial stages, this is clearly an issue: a near-zero probability might only be an artifact of the initialization (notably when the $g_0^i$ are initialized to $0$), or it might accurately reflect the current norm of the gradient while failing to be representative of later iterations as the algorithm progresses. On the other hand, in the later stages of the algorithm, successive cost functions are nearly identical, so that an assignment of near-zero probabilities is a reflection of the true optimal probabilities, and is not necessarily problematic.
The above discussion suggests the following algorithm. Define the $\varepsilon$-restricted probability simplex to be:
$\Delta^N_{\varepsilon} := \big\{ p \in \Delta^N : p^i \geq \varepsilon / N \ \text{for all } i \big\} \qquad (9)$
Let $(\varepsilon_t)_{t \geq 1}$ be a decreasing sequence of numbers in $(0, 1]$. Then we propose the following algorithm:
$p_{t+1} \in \operatorname*{argmin}_{p \in \Delta^N_{\varepsilon_{t+1}}} \hat{c}_t(p) \qquad (10)$
Our theoretical analysis in Section 4 suggests a specific decay rate for the sequence $(\varepsilon_t)$ that depends on the sequence of step-sizes $(\eta_t)$ and on whether SGD or SGLD is run.
3.2 Explicit form
Equation (10) defines our sequence of distributions $(p_t)$, but the question remains whether we can solve the implied optimization problems efficiently. In this section, we answer this question in the affirmative and provide an explicit algorithm for the computation of the sequence $(p_t)$.
We state our main result of this section in the following lemma. The proof can be found in appendix A.
Lemma 1.
Let $a_1, \dots, a_N$ be non-negative numbers, at least one of which is strictly positive, and let $\varepsilon \in (0, 1]$. Let $\sigma$ be a permutation that orders the $a_i$ in decreasing order ($a_{\sigma(1)} \geq \dots \geq a_{\sigma(N)}$). Define:
$\rho := \max\Big\{ k \in \{1, \dots, N\} : \Big(1 - (N - k)\tfrac{\varepsilon}{N}\Big)\, a_{\sigma(k)} \geq \tfrac{\varepsilon}{N} \sum_{j=1}^{k} a_{\sigma(j)} \Big\} \qquad (11)$
and:
$c := \frac{1 - (N - \rho)\,\varepsilon/N}{\sum_{j=1}^{\rho} a_{\sigma(j)}} \qquad (12)$
Then a solution of the optimization problem:
$\min_{p \in \Delta^N_{\varepsilon}} \; \sum_{i=1}^{N} \frac{a_i^2}{p^i} \qquad (13)$
is given by:
$p^{\sigma(k)} = \begin{cases} c\, a_{\sigma(k)} & \text{if } k \leq \rho \\ \varepsilon / N & \text{if } k > \rho \end{cases} \qquad (14)$
In the case where all the $a_i$ are zero, any $p \in \Delta^N_{\varepsilon}$ is a solution.
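A direct, non-incremental implementation of the minimizer described by Lemma 1 is given below as a sketch; it is derived from the KKT structure of (13) and assumes the floor $\varepsilon/N$ of (9). The faster incremental version is described in the next subsection and in appendix A.

```python
import numpy as np

def restricted_simplex_min(a, eps):
    """Minimize sum_i a_i^2 / p_i over the simplex with p_i >= eps / N.

    KKT structure: p_i is proportional to a_i unless that would fall below
    the floor eps / N, in which case p_i is clamped to the floor.
    Assumes a_i >= 0 and 0 <= eps <= 1.
    """
    N = len(a)
    floor = eps / N
    if np.all(a == 0):
        return np.full(N, 1.0 / N)             # any feasible point is optimal
    order = np.argsort(-a)                     # indices sorted by decreasing a_i
    a_sorted = a[order]
    csum = np.cumsum(a_sorted)
    p = np.full(N, floor)
    for k in range(N, 0, -1):                  # k = number of un-clamped entries
        c = (1.0 - (N - k) * floor) / csum[k - 1]
        if c * a_sorted[k - 1] >= floor:       # floor inactive for the k largest a_i
            p[order[:k]] = c * a_sorted[:k]
            break
    return p

# Example: large values get proportional mass, small ones sit on the floor.
print(restricted_simplex_min(np.array([3.0, 1.0, 0.1, 0.0]), eps=0.4))  # [0.6, 0.2, 0.1, 0.1]
```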
3.3 Implementation details
The naive way to perform the above computation is to do the following at each iteration:
- Sort the stored gradient norms in decreasing order.
- Find $\rho$ by traversing the sorted norms in increasing order of $k$ and finding the first $k$ for which the inequality in (11) does not hold.
- Compute the probabilities (14) along with their cumulative sums.
- Sample from $p_{t+1}$ using inverse transform sampling.
This has complexity $\mathcal{O}(N \log N)$ per iteration. In appendix A, we present an algorithm whose sequential (non-vectorized) cost per iteration is only polylogarithmic in $N$, with a single vectorized update and $\mathcal{O}(N)$ memory. The algorithm uses three data structures:
- An array storing the norms $\|g_t^i\|$ unsorted, indexed by the original indices.
- An order statistic tree storing the pairs $(\|g_t^i\|, i)$, sorted according to the norms.
- An array storing the cumulative sums of the norms, accumulated in the order given by the permutation that sorts them in the order statistic tree.
The order statistic tree, together with the array of cumulative sums, allows the efficient retrieval of $\rho$. The cumulative sums allow sampling from $p_{t+1}$ in $\mathcal{O}(\log N)$ time using binary search, and maintaining them is the only operation that requires a vectorized operation. All other operations run in $\mathcal{O}(\log N)$ time. See appendix A for the full version of the algorithm and a more complete discussion of its computational complexity.
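For illustration, the inverse-transform step given the cumulative sums can be written as follows; this is only a sketch, and the mapping of the sampled position back to an original index (done through the tree in the full algorithm) is omitted.

```python
import numpy as np

def sample_position(cumsum, rng):
    """Inverse-transform sampling via binary search on the (unnormalized)
    cumulative sums of the sorted probabilities; O(log N) per draw."""
    u = rng.random() * cumsum[-1]
    return int(np.searchsorted(cumsum, u, side="right"))

print(sample_position(np.cumsum([0.5, 0.3, 0.2]), np.random.default_rng(0)))
```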
4 Theory
In this section, we prove a sub-linear dynamic regret guarantee for our proposed algorithm when used with SGD and SGLD with decreasing step-sizes. We present our results in a more general setting, and show that they apply to our cases of interest. We start by stating our assumptions:
Assumption 1.
(Bounded gradients) There exists a $G > 0$ such that $\|\nabla f_i(x)\| \leq G$ for all $i \in \{1, \dots, N\}$ and all $x$.
Assumption 2.
(Smoothness) There exists an $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \leq L\,\|x - y\|$ for all $i \in \{1, \dots, N\}$ and all $x, y$.
Assumption 3.
(Contraction of the iterates in expectation) There exist constants $A$ and $B$ such that $\mathbb{E}\,\|x_{t+1} - x_t\| \leq A\,\eta_t + B\sqrt{\eta_t}$ for all $t \geq 1$.
The bounded gradients assumption has been used in all previous related work [10, 11, 12, 13], although it is admittedly quite strong. The smoothness assumption is standard in the study of optimization and sampling algorithms. Note that we chose to state our assumptions using index-independent constants to make the presentation clearer; this does not affect our derivation of the sequence $(\varepsilon_t)$.
Finally, Assumption 3 should really be derived from more basic assumptions, and is only stated to allow for a unified analysis. Note that in the optimization case this is a very mild assumption since we are generally only interested in convergent sequences, and for any such sequence with reasonably fast convergence this assumption holds. The following proposition shows that Assumption 3 holds for our cases of interest. All the proofs for this section can be found in appendix B.
Proposition 1.
We now state a proposition that relates the optimal function value for the problem (13) over the restricted simplex, with the optimal function value over the whole simplex. Its proof is taken from ([12], Lemma 6):
Proposition 2.
Let $a_1, \dots, a_N$ be non-negative numbers, and let $\varepsilon \in (0, 1]$. Then:
The following lemma gives our first bound on the regret per time step:
Lemma 2.
Proof outline.
Let . Then we have the following decomposition:
The terms (A) and (C) are the penalties we pay for using a surrogate cost function, while (B) is the price we pay for restricting the simplex. Using Assumption 1, Proposition 2, and the fact that the iterates $p_t$ are contained in the restricted simplex, each of these terms can be bounded to give the stated result. ∎
The expectation in the first term of the above lemma is highly reminiscent of the first term of the Lyapunov function used to study the convergence of SAGA, first proposed in [30] and subsequently refined in [31] (with the current gradient replaced by the gradient at the minimum). Inspired by this similarity, we prove the following recursion:
The natural thing to do at this stage is to unroll the above recursion, substitute into Lemma 2, sum over all time steps, and minimize the obtained regret bound over the choice of the sequence $(\varepsilon_t)$. However, even if we can solve the resulting optimization problem efficiently, the solution will still depend on the constants appearing in our assumptions and on the initial error due to the initialization of the $g_0^i$, both of which are usually unknown. Here instead, we make an asymptotic argument to find the optimal decay rate of $(\varepsilon_t)$, propose a sequence that satisfies this decay rate, and show that it gives rise to the stated dynamic regret bounds.
Denote by the expectation in the first term of the upper bound in Lemma 2. Suppose we take to be of order . Looking at the recursion in Lemma 3, we see that to control the positive term, we need the negative term to be of order at least , so that cannot be smaller than . The bound of Lemma 2 is therefore of order . The minimum is attained when the exponents are equal so we have:
We are now ready to guess the form of $(\varepsilon_t)$. Matching the positive term in Lemma 3, we consider the following sequence:
(15) |
for a free parameter whose range is chosen to ensure $\varepsilon_t \in (0, 1]$ for all $t$. With this choice of the sequence $(\varepsilon_t)$, we are now finally ready to state our main result of the paper:
Theorem 1.
Proof outline.
Note that since , we have , so is bounded by a constant independent of . Furthermore, setting makes the theorem hold for all . In practice however, it might be beneficial to set to overcome a bad initialization of the s.
5 A new mini-batch estimator
In most practical applications, one uses a mini-batch of samples to construct the gradient estimator instead of just a single sample. The most basic such estimator is the one formed by sampling a mini-batch of $m$ indices uniformly and independently, and taking the sample mean. This gives an unbiased estimator whose variance decreases as $1/m$. A simple way to make it more efficient is to sample the indices uniformly but without replacement. In that case, the variance decreases by an additional factor of $\frac{N - m}{N - 1}$. For $m \ll N$, the difference is negligible, and the additional cost of sampling without replacement is not justified. However, when using unequal probabilities, this argument no longer holds, and the additional variance reduction obtained from sampling without replacement can be significant even for small $m$.
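A quick numerical illustration of this classical fact for uniform sampling (the unequal-probability case treated next is more subtle); sizes, seed, and number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, reps = 100, 10, 50_000
v = rng.standard_normal(N)                # stand-ins for per-example gradients (scalars)
with_rep = [v[rng.integers(N, size=m)].mean() for _ in range(reps)]
without_rep = [v[rng.choice(N, size=m, replace=False)].mean() for _ in range(reps)]
print(np.var(without_rep) / np.var(with_rep), (N - m) / (N - 1))  # should roughly agree
```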
For our problem, besides the additional variance reduction, sampling without replacement allows a higher rate of replacement of the stored gradients $g_t^i$, which is directly related to our regret through the factor in front of the second term of Lemma 3; its proper generalization for mini-batch sampling is expressed in terms of the inclusion probability of each index. This makes sampling without replacement desirable for our purposes. Unfortunately, unlike in the uniform case, the sample-mean generalization of (4) is no longer unbiased. We propose instead the following estimator:
(19) |
We summarize some of its properties in the following proposition:
Proposition 3.
Let for and . We have:
-
(a)
-
(b)
-
(c)
-
(d)
where all the expectations in (d) are conditional on .
The proposed estimator is therefore unbiased, and its variance decreases super-linearly in the mini-batch size $m$ (by (b) and (d)). Although we were unable to prove a regret bound for this estimator, (c) suggests that it is still reasonable to use our algorithm. To take into account the higher rate of replacement of the stored gradients $g_t^i$, we propose using the following sequence, which is based on the mini-batch equivalent of Lemma 3 and the order relation between inclusion and selection probabilities from [32, 33, 34]:
(20) |
6 Experiments
In this section, we present results of experiments with our algorithm. We start by validating our theoretical results through a synthetic experiment. We then show that our proposed method outperforms existing methods on real world datasets. Finally, we identify settings in which adaptive importance sampling can lead to significant performance gains for the final optimization algorithm.
In all experiments, we added $\ell_2$-regularization to the model's loss (with a fixed regularization parameter). We ran SGD (2) with decreasing step-sizes chosen following [35] and scaled with the batch size. We experimented with 3 different samplers in addition to our proposed sampler: Uniform, Multi-armed bandit Sampler (Mabs) [11], and Variance Reducer Bandit (Vrb) [12]. The hyperparameters of both Mabs and Vrb are set to the ones prescribed by the theory in the original papers. For our proposed sampler Avare, we use the epsilon sequence given by (20), with the stored gradient norms initialized to the same value for all $i$. For each sampler, we ran SGD 10 times and averaged the results. The shaded areas represent a one standard deviation confidence interval.
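Schematically, the experimental loop looks as follows. The sampler interface (`sample`, `update`) and the particular step-size schedule are illustrative stand-ins, not the released code; the actual step-sizes follow [35].

```python
import numpy as np

def run_sgd(x0, grad_f, sampler, eta0, kappa, T):
    """Single-sample SGD with an adaptive importance sampler (sketch)."""
    x = x0
    for t in range(1, T + 1):
        eta = eta0 / (1.0 + kappa * t)             # a generic decreasing schedule
        i, p_i = sampler.sample()                  # index and its current probability
        g_i = grad_f(i, x)
        x = x - eta * g_i / p_i                    # importance-weighted step, as in (4)
        sampler.update(i, np.linalg.norm(g_i))     # bandit feedback: the gradient norm
    return x
```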
[Figure 1]
To validate our theoretical findings, we randomly generated a dataset for binary classification. We then trained a logistic regression model on the generated data, and used a batch size of 1 to match the setting of our regret bound. The results of this experiment are depicted in figure 1, and show that Avare outperforms the other methods, achieving significantly lower dynamic regret and faster convergence.
For our second experiment, we tested our algorithm on three real-world datasets: MNIST, IJCNN1 [36], and CIFAR10. We used a softmax regression model and a batch size of 128 sampled with replacement. The results of this experiment can be seen in figure 2. The third column shows the relative error. In all three cases, Avare achieves significantly smaller dynamic regret, with a relative error quickly decaying to zero. The performance gains in the final optimization algorithm are clear for both MNIST and IJCNN1, but are not noticeable in the case of CIFAR10.
To determine the effectiveness of non-uniform sampling in accelerating the optimization process, previous work [4, 11] has suggested looking at the ratio of the maximum smoothness constant to the average smoothness constant. We show here that this is the wrong measure to look at when using adaptive probabilities. In particular, we argue that the ratio of the variance under uniform sampling at the loss minimizer to the optimal variance at the loss minimizer is much more informative of the performance gains achievable through adaptive importance sampling. For each dataset, these ratios are displayed in Table 1, supporting our claim. We suspect that for large models capable of fitting the data nearly perfectly, our proposed ratio will be large, since many of the per-example gradients at the optimum will be zero. We therefore expect our method to be particularly effective in the training of models of this type. We leave such experiments for future work. Finally, in appendix D, we propose an extension of our method to constant step-size SGD, and show that it preserves the performance gains observed when using decreasing step-sizes.
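For concreteness, the variance ratio reported in Table 1 can be computed from the per-example gradient norms at the (regularized) loss minimizer. The sketch below assumes the convention $F(x) = \sum_i f_i(x)$ and the estimator (4), so that at the minimizer (where the full gradient vanishes) the variance reduces to $\sum_i \|\nabla f_i\|^2 / p^i$.

```python
import numpy as np

def variance_ratio(grad_norms):
    """Ratio of the estimator variance under uniform sampling to the optimal
    variance, both at the loss minimizer. grad_norms[i] = ||grad f_i(x*)||."""
    a = np.asarray(grad_norms, dtype=float)
    uniform = len(a) * np.sum(a ** 2)     # p_i = 1/N
    optimal = np.sum(a) ** 2              # attained by p_i proportional to ||grad f_i(x*)||
    return uniform / optimal              # >= 1 by the Cauchy-Schwarz inequality
```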
[Figure 2]
Table 1: smoothness and variance ratios for each dataset.

Dataset | smoothness ratio $L_{\max}/\bar{L}$ | variance ratio at the minimizer
---|---|---
Synthetic | 1.69 | 4.46
MNIST | 3.28 | 5.08
IJCNN1 | 1.12 | 4.83
CIFAR10 | 3.40 | 1.14
Broader Impact
Our work develops a new method for variance reduction for stochastic optimization and sampling algorithms. On the optimization side, we expect our method to be very useful in accelerating the training of large scale neural networks, particularly since our method is expected to provide significant performance gains when the model is able to fit the data nearly perfectly. On the sampling side, we expect our method to be useful in accelerating the convergence of MCMC algorithms, opening the door for the use of accurate Bayesian methods at a large scale.
Acknowledgments and Disclosure of Funding
This research was supported by an NSERC discovery grant. We would like to thank the anonymous reviewers for their useful comments and suggestions.
References
- [1] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Ann. Math. Statist., 22(3):400β407, 1951.
- [2] Max Welling and YeeΒ Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 681β688. Omnipress, 2011.
- [3] Deanna Needell, Rachel Ward, and Nathan Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. In Zoubin Ghahramani, Max Welling, Corinna Cortes, NeilΒ D Lawrence, and KilianΒ Q Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1017β1025, 2014.
- [4] Peilin Zhao and Tong Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In FrancisΒ R Bach and DavidΒ M Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volumeΒ 37 of JMLR Workshop and Conference Proceedings, pages 1β9. JMLR.org, 2015.
- [5] Guillaume Bouchard, ThΓ©o Trouillon, Julien Perez, and Adrien Gaidon. Accelerating Stochastic Gradient Descent via Online Learning to Sample. CoRR, abs/1506.0, 2015.
- [6] SebastianΒ U Stich, Anant Raj, and Martin Jaggi. Safe Adaptive Importance Sampling. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, HannaΒ M Wallach, Rob Fergus, SΒ VΒ N Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4381β4391, 2017.
- [7] Angelos Katharopoulos and FranΓ§ois Fleuret. Not All Samples Are Created Equal: Deep Learning with Importance Sampling. In JenniferΒ G Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, StockholmsmΓ€ssan, Stockholm, Sweden, July 10-15, 2018, volumeΒ 80 of Proceedings of Machine Learning Research, pages 2530β2539. PMLR, 2018.
- [8] TylerΒ B Johnson and Carlos Guestrin. Training Deep Models Faster with Robust, Approximate Importance Sampling. In Samy Bengio, HannaΒ M Wallach, Hugo Larochelle, Kristen Grauman, NicolΓ² Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, MontrΓ©al, Canada, pages 7276β7286, 2018.
- [9] Dominik Csiba and Peter RichtΓ‘rik. Importance Sampling for Minibatches. J. Mach. Learn. Res., 19:27:1β-27:21, 2018.
- [10] Hongseok Namkoong, Aman Sinha, Steve Yadlowsky, and JohnΒ C Duchi. Adaptive Sampling Probabilities for Non-Smooth Optimization. In Doina Precup and YeeΒ Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volumeΒ 70 of Proceedings of Machine Learning Research, pages 2574β2583. PMLR, 2017.
- [11] Farnood Salehi, LΒ Elisa Celis, and Patrick Thiran. Stochastic Optimization with Bandit Sampling. CoRR, abs/1708.0, 2017.
- [12] Zalan Borsos, Andreas Krause, and KfirΒ Y Levy. Online Variance Reduction for Stochastic Optimization. In SΓ©bastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volumeΒ 75 of Proceedings of Machine Learning Research, pages 324β357. PMLR, 2018.
- [13] ZalΓ‘n Borsos, Sebastian Curi, KfirΒ Yehuda Levy, and Andreas Krause. Online Variance Reduction with Mixtures. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volumeΒ 97 of Proceedings of Machine Learning Research, pages 705β714. PMLR, 2019.
- [14] Omar Besbes, Yonatan Gur, and AssafΒ J Zeevi. Non-Stationary Stochastic Optimization. Oper. Res., 63(5):1227β1244, 2015.
- [15] XiΒ Chen, Yining Wang, and Yu-Xiang Wang. Non-stationary Stochastic Optimization with Local Spatial and Temporal Changes. CoRR, abs/1708.0, 2017.
- [16] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, CDC 2016, Las Vegas, NV, USA, December 12-14, 2016, pages 7195β7201. IEEE, 2016.
- [17] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient. In Maria-Florina Balcan and KilianΒ Q Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volumeΒ 48 of JMLR Workshop and Conference Proceedings, pages 449β457. JMLR.org, 2016.
- [18] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive Online Learning in Dynamic Environments. In Samy Bengio, HannaΒ M Wallach, Hugo Larochelle, Kristen Grauman, NicolΓ² Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, MontrΓ©al, Canada, pages 1330β1340, 2018.
- [19] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.
- [20] A.Β S. Nemirovsky and D.Β B. Yudin. Problem Complexity and Method Efficiency in Optimization. The Journal of the Operational Research Society, 1984.
- [21] Peter Auer, NicolΓ² Cesa-Bianchi, Yoav Freund, and RobertΒ E Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48β77, 2002.
- [22] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928β936. AAAI Press, 2003.
- [23] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107β194, 2012.
- [24] Elad Hazan. Introduction to Online Convex Optimization. Found. Trends Optim., 2(3-4):157β325, 2016.
- [25] Mark Herbster and Manfred Warmuth. Tracking the Best Expert. In Armand Prieditis and Stuart Russell, editors, Machine Learning Proceedings 1995, pages 286β294. Morgan Kaufmann, San Francisco (CA), 1995.
- [26] Mark Herbster and ManfredΒ K Warmuth. Tracking the Best Linear Predictor. J. Mach. Learn. Res., 1:281β309, sep 2001.
- [27] NicolΓ² Cesa-Bianchi, Pierre Gaillard, GΓ‘bor Lugosi, and Gilles Stoltz. Mirror Descent Meets Fixed Share (and feels no regret). In PeterΒ L Bartlett, Fernando CΒ N Pereira, Christopher JΒ C Burges, LΓ©on Bottou, and KilianΒ Q Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 989β997, 2012.
- [28] AndrΓ‘s GyΓΆrgy and Csaba SzepesvΓ‘ri. Shifting Regret, Mirror Descent, and Matrices. In Maria-Florina Balcan and KilianΒ Q Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volumeΒ 48 of JMLR Workshop and Conference Proceedings, pages 2943β2951. JMLR.org, 2016.
- [29] Aaron Defazio, FrancisΒ R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. CoRR, abs/1407.0202, 2014.
- [30] Thomas Hofmann, AurΓ©lien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance Reduced Stochastic Gradient Descent with Neighbors. In Corinna Cortes, NeilΒ D Lawrence, DanielΒ D Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2305β2313, 2015.
- [31] Aaron Defazio. A Simple Practical Accelerated Method for Finite Sums. In DanielΒ D Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 676β684, 2016.
- [32] Yaming Yu. On the inclusion probabilities in some unequal probability sampling plans without replacement. Bernoulli, 18(1):279β289, 2012.
- [33] Hartmut Milbrodt. Comparing inclusion probabilities and drawing probabilities for rejective sampling and successive sampling. Statistics and Probability Letters, 14(3):243β246, 1992.
- [34] Talluri Rao, Samindranath Sengupta, and BΒ Sinha. Some order relations between selection and inclusion probabilities for PPSWOR sampling scheme. Metrika, 38:335β343, 1991.
- [35] SebastianΒ U. Stich. Unified optimal analysis of the (stochastic) gradient method, 2019.
- [36] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), May 2011.
Appendix A Algorithm
A.1 Proof of Lemma 1
Lemma 1 (restated).
Proof.
Edge case and well-definedness.
If for all , then the objective function is identically over for all , so that any is a solution (we set in the objective). Else there exists an such that , and therefore . Now:
The last inequality is true, so it implies the first, and we have:
Therefore is well defined and is . As a consequence, is also well defined and is , making in turn well-defined.
Optimality proof.
It is easily verified that problem (13) is convex. Its Lagrangian is given by:
and the KKT conditions are:
(Stationarity) | |||
(Complementary slackness) | |||
(Primal feasibility) | |||
(Dual feasibility) |
By convexity of the problem, the KKT conditions are sufficient conditions for global optimality. To show that our proposed solution is optimal, it therefore suffices to exhibit constants and that together with satisfy these conditions. Let:
Note that the s are well defined since when , , and for all . We claim that the triplet satisfies the KKT conditions.
Stationarity and complementary slackness are immediate from the definitions. The first clause of primal feasibility holds by definition of for . In the other case we have:
where in the third line we used the fact that orders in decreasing order, and in the last line we used the inequality in the definition of . For the second clause of primal feasibility:
Finally, dual feasibility holds by definition of for . In the other case we have:
The first factor is positive by positivity of and non-negativity of the s. For the second factor, we have by the maximality of :
(21) |
Therefore the second factor is also positive, and dual feasibility holds. ∎
A.2 Implementation and complexity
In this section of the appendix, we provide pseudocode for the implementation of the algorithm and discuss its computational complexity.
A.2.1 Implementation
We assume in algorithm 1 that an order statistic tree (OST) (see, for example, \citesuppDBLP:books/daglib/0023376, chapter 14) can be instantiated and that the ordering in the tree is such that the key of the left child of a node is greater than or equal to that of the node itself. Furthermore, we assume that the method returns the position of in the order determined by an inorder traversal of the tree. Finally, the method returns the value of the -largest key in the tree.
The algorithm works as follows. At initialization, three data structures are initialized: an array holding the gradient norms according to the original indices, an order statistic tree holding the gradient norms as keys and the original indices as values, and an array holding the cumulative sums of the gradient norms, where the sums are accumulated in the (decreasing) order that sorts the gradient norms in the tree.
The method allows to sample from the optimal distribution on . It uses the method to find by searching the tree and using the maximality property of . Once is determined, can be calculated. Using the fact that the cumulative sums are proportional to the CDF of the distribution, the algorithm then samples an index using inverse-transform sampling. The sampled index is then transformed back to an index in the original order using the method of the tree .
Finally, the method replaces the gradient norm of a given index by a new one. It calls the methods and which perform the deletion and insertion while maintaining the tree and array .
A.2.2 Complexity
First, let us analyze the cost of running the method. For the array , we only use random access and assignment, which are both . For the tree , we use the methods , , and , all of which are . Finally, for the array we add and subtract from a sub-array, which takes time, although this operation is vectorized and very fast in practice.
Let us now look at the cost of running . The method is recursive, but will only be called at most as many times as the height of the tree, which is . Now for each call of , both the and methods of the tree require time. The rest of the method only requires operations. The total cost of the method is therefore . For the rest of the method, the operations that dominate the cost are the method of the tree , which takes time, and the binary search in the else branch, which also runs in time. Consequently, the total cost of the sample method is .
The total per iteration cost of using the proposed sampler is therefore vectorized operations, and sequential (non-vectorized) operations. The total memory cost is .
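The cumulative-sum array is the only structure we update with a vectorized operation. If a fully sequential alternative is preferred, a Fenwick (binary indexed) tree supports point updates and prefix-sum queries in $\mathcal{O}(\log N)$; a minimal sketch follows (this is a standard alternative, not the structure used in our implementation).

```python
class Fenwick:
    """Binary indexed tree: point update and prefix sum in O(log N)."""
    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)

    def add(self, i, delta):
        """Add delta to entry i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix_sum(self, i):
        """Sum of entries 0..i inclusive."""
        i += 1
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s
```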
Appendix B Theory
B.1 Proof of proposition 1
Proposition 1 (restated).
Proof.
Conditioning on the knowledge of we have for SGD:
and for SGLD we have:
where in the first line we used the triangle inequality, in the second we used Jensen's inequality, and in the last we used the fact that $(\eta_t)$ is decreasing. Substituting the value of the step-size, we obtain the result. ∎
B.2 Proof of proposition 2
Proof.
By Lemma 1 we have:
where in the third line we used inequality (21) from the proof of Lemma 1. Now for the case we have , so the second term in the first line is zero and the inequality becomes an equality:
(22) |
The difference is therefore bounded by:
where in the last line we used an elementary inequality that is valid under the stated restriction on $\varepsilon$. ∎
B.3 Proof of Lemma 2
Lemma 2 (restated).
Proof.
Let and . We have the following decomposition:
We bound each term separately:
(A) | |||
where in the third line we used Assumption 1 and the reverse triangle inequality, and in the last line we used the fact that . Since , we can apply proposition 2 on (B) to obtain:
(B) |
where the second inequality uses Assumption 1. Finally, using the optimal function values (22) we have for (C):
(C) | |||
where we again used Assumption 1 and the reverse triangle inequality in the third line. The last inequality follows from the definition of the restricted simplex. Combining the three bounds gives the result. ∎
B.4 Proof of Lemma 3
Lemma 3 (restated).
B.5 Lemma 4
Before proving Theorem 1, we first state and prove the following solution of the recursion of Lemma 3 assuming the sequence is given by (15).
Lemma 4.
Proof.
A simple inequality.
Suppose , then:
otherwise, we have:
where the last inequality follows from the fact that and . We conclude that:
Induction proof.
Let , and let . For the statement holds since:
For we have by Lemma 3 and the above inequality:
where and where the last line follows by the induction hypothesis. To simplify notation let , , . Then the above inequality can be rewritten as:
Now by concavity of we have:
so that:
The last statement is true, and therefore so is the first. Replacing in the bound on we get:
which finishes the proof. ∎
B.6 Proof of Theorem 1
Theorem 1 (restated).
Appendix C A new mini-batch estimator
C.1 A class of unbiased estimators
It will be useful for our discussion to consider the following class of estimators:
where each sampling distribution is a distribution on $\{1, \dots, N\}$. The estimator we proposed in Section 5 is (19), where the indices are sampled without replacement according to $p_t$. Setting:
we see that our proposed estimator belongs to the class of estimators introduced above. The proofs of (a) and (b) of Proposition 3 below apply with almost no modification to any estimator in this class. A natural question then is which estimator in the above-defined class achieves minimum variance. We answer this in the proof of part (c) below, and show that our proposed estimator (19), with the choice above, achieves minimum variance.
C.2 Proof of proposition 3
Proposition 3 (restated).
Proof.
-
(a)
For and conditional on we have:
Taking expectation with respect to the choice of , and taking the average over we get the result.
-
(b)
We have:
To show the claim, it is therefore enough to show that second term is zero. Let . Conditional on we have:
where we used the fact that the conditional expectation is zero by part (a). Taking expectation with respect to on both sides yields the result.
-
(c)
As discussed in the previous section, (b) applies to all estimators in the class , so we have for all such estimators:
minimizing over by minimizing each term with respect to its variable we get:
where:
Recalling that our estimator is in , and noticing that the optimal probabilities over are feasible for our estimator we get the result.
-
(d)
Fix . We drop the superscript from , , and . We also drop the subscript from and to simplify notation. Define for :
We have from part (a):
Before proceeding with the proof, we first derive an identity relating and :
Now, conditional on , we have:
where the expectation in the last two lines is conditional on . Taking expectation with respect to on both sides yields the result.
∎
Appendix D Extension to constant step-size SGD
While our analysis heavily relies on the assumption of decreasing step-sizes, we have found empirically that a slight modification of our method works just as well when a constant step-size is used. We propose the following epsilon sequence to account for the use of a constant step-size:
(23) |
for a constant chosen to ensure that $\varepsilon_t$ remains in the valid range for all $t$. We ran the same experiment on MNIST, IJCNN1, and CIFAR10 as in Section 6, but with a constant step-size and the epsilon sequence (23). The results are displayed in figure 3, showing performance similar to the decreasing step-size case. Note that choosing the constant in (23) too small can start to deteriorate the performance of the algorithm. It is still unclear how to set it so as to guarantee good performance, but our experiments suggest that the value we used is a safe choice.
[Figure 3]