

Adaptive Importance Sampling for Finite-Sum Optimization and Sampling with Decreasing Step-Sizes

Ayoub El Hanchi
McGill University
ayoub.elhanchi@mail.mcgill.ca
David A. Stephens
McGill University
david.stephens@mcgill.ca
Abstract

Reducing the variance of the gradient estimator is known to improve the convergence rate of stochastic gradient-based optimization and sampling algorithms. One way of achieving variance reduction is to design importance sampling strategies. Recently, the problem of designing such schemes was formulated as an online learning problem with bandit feedback, and algorithms with sub-linear static regret were designed. In this work, we build on this framework and propose Avare, a simple and efficient algorithm for adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes. Under standard technical conditions, we show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when run with $\mathcal{O}(1/t)$ step-sizes. We achieve this dynamic regret bound by leveraging our knowledge of the dynamics defined by the algorithm, and combining ideas from online learning and variance-reduced stochastic optimization. We validate empirically the performance of our algorithm and identify settings in which it leads to significant improvements.

1 Introduction

Functions $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ of the form:

f(x) = \sum_{i=1}^{N} f_{i}(x) \qquad (1)

are prevalent in modern machine learning and statistics. Important examples include the empirical risk in the empirical risk minimization framework (up to a normalizing factor of $\frac{1}{N}$, which does not affect the optimization), and the log-posterior of an exchangeable Bayesian model. When $N$ is large, the preferred methods for solving the resulting optimization or sampling problem usually rely on stochastic estimates of the gradient of $f$, using variants of stochastic gradient descent (SGD) [1] or stochastic gradient Langevin dynamics (SGLD) [2]:

x^{SGD}_{t+1} = x^{SGD}_{t} - \alpha_{t} N \nabla f_{I_{t}}(x^{SGD}_{t}) \qquad (2)
x^{SGLD}_{t+1} = x^{SGLD}_{t} - \alpha_{t} N \nabla f_{I_{t}}(x^{SGLD}_{t}) + \xi_{t}, \qquad \xi_{t} \sim \mathcal{N}(0, 2\alpha_{t}) \qquad (3)

where $\{\alpha_t\}_{t=1}^{T}$ is a sequence of step-sizes, and the index $I_t$ is sampled uniformly from $[N]$, making $N\nabla f_{I_t}(x)$ an unbiased estimator of the gradient of $f$. We use $\{x_t\}_{t=1}^{T}$ to refer to either sequence when we do not wish to distinguish between them. It is well known that the quality of the answers given by these algorithms depends on the (trace of the) variance of the gradient estimator, and considerable efforts have been made to design methods that reduce this variance.
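To make the updates concrete, the following is a minimal Python sketch (not the paper's implementation) of one step of (2) and of (3) under uniform sampling; the helper `grad_fi(i, x)`, returning $\nabla f_i(x)$, is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(x, alpha_t, grad_fi, N):
    """One SGD step (2): uniform index I_t, with N * grad f_{I_t} as the
    unbiased estimate of the full gradient."""
    i = rng.integers(N)
    return x - alpha_t * N * grad_fi(i, x)

def sgld_step(x, alpha_t, grad_fi, N):
    """One SGLD step (3): the SGD drift plus Gaussian noise xi_t ~ N(0, 2*alpha_t)."""
    i = rng.integers(N)
    xi = rng.normal(scale=np.sqrt(2.0 * alpha_t), size=np.shape(x))
    return x - alpha_t * N * grad_fi(i, x) + xi
```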

In this paper, we focus on the importance sampling approach to achieving variance reduction. At each iteration, the algorithm samples $I_t$ according to a specified distribution $p^t$, and estimates the gradient using:

\hat{g}^{t} := \frac{1}{p^{t}_{I_{t}}} \nabla f_{I_{t}}(x_{t}) \qquad (4)

It is immediate to verify that $\hat{g}^t$ is an unbiased estimator of $g^t := \nabla f(x_t)$. By cleverly choosing the distributions $p^t$, one can achieve significant variance reduction (up to a factor of $N$) compared to the estimator based on uniform sampling. Unfortunately, computing the variance-minimizing distribution at each iteration requires knowledge of the Euclidean norms of all the individual gradients $g_i^t := \nabla f_i(x_t)$, making it impractical [3, 4].
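As an illustration (a sketch under the same assumed `grad_fi` helper, not the paper's code), the estimator (4) and the classical variance-minimizing distribution $p_i \propto \lVert g_i^t\rVert_2$ [3, 4] look as follows; the latter makes the impracticality concrete, since it needs all $N$ per-example gradient norms at every iteration.

```python
import numpy as np

def importance_sampled_gradient(x, p, grad_fi, rng):
    """Unbiased estimator (4): sample I_t ~ p^t, reweight by 1 / p_{I_t}."""
    i = rng.choice(len(p), p=p)
    return grad_fi(i, x) / p[i]

def variance_minimizing_distribution(grad_norms):
    """Optimal p_i proportional to ||g_i^t||_2 (requires all N norms)."""
    grad_norms = np.asarray(grad_norms, dtype=float)
    return grad_norms / grad_norms.sum()
```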

Many methods have been proposed that attempt to construct sequences of distributions $\{p^t\}_{t=1}^{T}$ that result in efficient estimators [3, 4, 5, 6, 7, 8, 9]. Of particular interest to us, the task of designing such sequences was recently cast as an online learning problem with bandit feedback [10, 11, 12, 13]. In this formulation, one attempts to design algorithms with sub-linear expected static regret, which is defined as:

\text{Regret}_{S}(T) := \sum_{t=1}^{T} c_{t}(p^{t}) - \min_{p\in\Delta} \sum_{t=1}^{T} c_{t}(p)

where $\Delta$ denotes the probability simplex in $\mathbb{R}^{N}$, and $c_t(p)$ is the trace of the covariance matrix of the gradient estimator (4), which is easily shown to be:

c_{t}(p) := \sum_{i=1}^{N} \frac{1}{p_{i}} \lVert g_{i}^{t}\rVert_{2}^{2} - \lVert g^{t}\rVert_{2}^{2} \qquad (5)

Note that the second term cancels in the definition of regret, and we omit it in the rest of our discussion. In this formulation, and to keep the computational load manageable, one only has access to partial feedback in the form of the norm of the $I_t^{\text{th}}$ gradient, and not to the complete cost function (5). Under the assumption of uniformly bounded gradients, the best result in this category can be found in [12], where an algorithm with $\tilde{O}(T^{2/3})$ static regret is proposed. A more difficult but more natural performance measure, which makes the attempt to approximate the optimal distributions explicit, is the dynamic regret, defined by:

\text{Regret}_{D}(T) := \sum_{t=1}^{T} c_{t}(p^{t}) - \sum_{t=1}^{T} \min_{p\in\Delta} c_{t}(p) \qquad (6)

Guarantees with respect to the expected dynamic regret are more difficult to obtain, and require that the cost functions $c_t(p)$ do not change too rapidly with respect to some reasonable measure of variation. See [14, 15, 16, 17, 18] for examples of such measures and the corresponding regret bounds for general convex cost functions.

In this work, we propose Avare, an algorithm that achieves sub-linear dynamic regret for both SGD and SGLD when the sequence of step-sizes $\{\alpha_t\}_{t=1}^{T}$ is decreasing. The name Avare is derived from adaptive variance minimization. Specifically, our contributions are as follows:

  • We show that Avare achieves $\mathcal{O}(T^{2/3})$ and $\mathcal{O}(T^{5/6})$ dynamic regret for SGD and SGLD respectively when $\alpha_t$ is $\mathcal{O}(1/t)$.

  • We propose a new mini-batch estimator that combines the benefits of sampling without replacement and importance sampling while preserving unbiasedness.

  • We validate empirically the performance of our algorithm and identify settings in which it leads to significant improvements.

We would like to point out that while the decreasing step size requirement might seem restrictive, we argue that for SGD and SGLD, it is the right setting to consider for variance reduction. Indeed, it is well known that under suitable technical conditions, both algorithms converge to their respective solutions exponentially fast in the early stages. Variance reduction is primarily useful at later stages when the noise from the stochastic gradient dominates. In the absence of control variates, one is forced to use decreasing step-sizes to achieve convergence. This is precisely the regime we consider.

2 Related work

It is easy to see that the cost functions (5) are convex over the probability simplex. A celebrated algorithm for convex optimization over the simplex is entropic descent [19], an instance of mirror descent [20] where the Bregman divergence is taken to be the relative entropy. A slight modification of this algorithm for online learning is the EXP3 algorithm [21] which mixes the iterates of entropic descent with a uniform distribution to avoid the assignment of null probabilities. See [22, 23, 24] for a more thorough discussion of online learning (also called online convex optimization) in general and variants of this algorithm in particular.

Since we are working over the simplex, the EXP3 algorithm is a natural choice. This is the approach taken in [10] and [11], although strictly speaking neither is able to prove sub-linear static regret bounds. The difficulty comes from the fact that the norms of the gradients of the cost functions (5) explode to infinity on the boundary of the simplex. This is amplified by the use of stochastic gradients, which grow as $1/p_{\min}^{5}$ in expectation, making it very difficult to reach regions near the boundary. Algorithms based on entropic descent for dynamic regret problems also exist, including the fixed share algorithm and projected mirror descent [25, 26, 27, 28]. Building on these algorithms, and by artificially making the cost functions strongly convex to allow the use of decreasing step-sizes, we were only able to show $\tilde{O}(T^{7/8})$ dynamic regret using projected mirror descent for SGD with $O(1/t)$ decreasing step-sizes and uniformly bounded gradients.

The approach taken in [12] is qualitatively different and is based on the follow-the-regularized-leader (FTRL) scheme. By solving the successive minimization problems stemming from the FTRL scheme analytically, the authors avoid the above-mentioned issue of exploding gradients. The rest of their analysis relies on constructing an unbiased estimate of the cost functions (5) using only the partial feedback $\lVert g_{I_t}^{t}\rVert_{2}^{2}$, and probabilistically bounding the deviation of the estimated cost functions from the true cost functions using a martingale concentration inequality. The final result is an algorithm that enjoys an $\tilde{O}(T^{2/3})$ static regret bound.

Our approach is similar to the one taken in [12] in that we work directly with the cost functions and not their gradients to avoid the exploding gradient issue. Beyond this point however, our analysis is substantially different. While we are still working within the online learning framework, our analysis is more closely related to the analysis of variance-reduced stochastic optimization algorithms that are based on control variates. In particular, we study a Lyapunov-like functional and show that it decreases along the trajectory of the dynamics defined by the algorithms. This yields simpler and more concise proofs, and opens the door for a unified analysis.

3 Algorithm

Most of the literature on online convex optimization with dynamic regret relies on the use of the gradients of the cost functions [14, 15, 16, 17, 18]. However, as explained in the previous section, such approaches are not viable for our problem. Unfortunately, the regret guarantees obtained from the FTRL scheme for static regret used in [12] do not directly translate into guarantees for dynamic regret.

We start by presenting the high level ideas that go into the construction of our algorithm. We then present our algorithm in explicit form, and discuss its implementation and computational complexity.

3.1 High level ideas

The simplest update rule one might consider when working with dynamic regret is a natural extension of the follow-the-leader approach:

p^{t+1} := \operatorname*{argmin}_{p\in\Delta} \left\{ c_{t}(p) \right\} \qquad (7)

Intuitively, if the cost functions do not change too quickly, then it is reasonable to expect good performance from this algorithm. In our case, however, we do not have access to the full cost function. The traditional way of dealing with this problem is to build unbiased estimates of the cost functions using the bandit feedback, and then bound the deviation of these unbiased estimates from the true cost functions. While this might work, we consider here a different approach based on constructing surrogate cost functions that are not necessarily unbiased estimates of the true costs.

For each $i\in[N]$, denote by $h_i^t$ the last observed gradient of $f_i$ at time $t$, with $h_i^1$ initialized arbitrarily (to $0$, for example). We consider the following surrogate cost for all $t\in[T]$:

\widetilde{c}_{t}(p) := \sum_{i=1}^{N} \frac{1}{p_{i}} \lVert h_{i}^{t}\rVert_{2}^{2} \qquad (8)

As successive iterates of the algorithm become closer to each other, the squared norms of the $h_i^t$ become better and better approximations of the squared norms of the $g_i^t$, thereby making the surrogate costs (8) accurate approximations of the true costs (5). The idea of using previously seen gradients to approximate current gradients is inspired by the celebrated variance-reduced stochastic optimization algorithm SAGA [29].
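A minimal sketch of this SAGA-style bookkeeping (ours for illustration, not the paper's implementation): a table of last-seen gradient norms, overwritten whenever an index is sampled, from which the surrogate cost (8) can be evaluated.

```python
import numpy as np

class GradientNormTable:
    """Maintains the last observed gradient norms ||h_i^t||_2."""
    def __init__(self, N):
        self.norms = np.zeros(N)  # corresponds to initializing h_i^1 = 0

    def update(self, i, grad_i):
        # overwrite the last-seen norm for the sampled index i
        self.norms[i] = np.linalg.norm(grad_i)

    def surrogate_cost(self, p):
        """Surrogate cost (8): sum_i ||h_i^t||_2^2 / p_i."""
        return float(np.sum(self.norms ** 2 / p))
```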

We are now almost ready to formulate our algorithm. If we try to directly minimize the surrogate cost functions over the entire simplex at each iteration, as in (7), we might end up assigning null or near-null probabilities to certain indices. Depending on how far along the algorithm is, this may or may not be a problem. In the initial stages it clearly is: a near-zero probability might be an artifact of the initialization (notably when we initialize the $h_i^1$ to $0$), or it might accurately reflect the current gradient norm while failing to be representative later on as the algorithm progresses. In the later stages, on the other hand, the cost functions are nearly identical across iterations, so an assignment of near-zero probabilities reflects the true optimal probabilities and is not necessarily problematic.

The above discussion suggests the following algorithm. Define the $\varepsilon$-restricted probability simplex to be:

\Delta(\varepsilon) := \left\{ p\in\mathbb{R}^{N} \mid p_{i} \geq \varepsilon,\ \sum_{i=1}^{N} p_{i} = 1 \right\} \qquad (9)

and let $\{\varepsilon_t\}_{t=1}^{T}$ be a decreasing sequence of positive numbers with $\varepsilon_1 \leq \frac{1}{N}$. Then we propose the following algorithm:

p^{t} := \operatorname*{argmin}_{p\in\Delta(\varepsilon_{t})} \left\{ \widetilde{c}_{t}(p) \right\} \qquad (10)

Our theoretical analysis in Section 4 suggests a specific decay rate for the sequence $\{\varepsilon_t\}_{t=1}^{T}$ that depends on the sequence of step-sizes $\{\alpha_t\}_{t=1}^{T}$ and on whether SGD or SGLD is run.

3.2 Explicit form

Equation (10) defines our sequence of distributions $\{p^t\}_{t=1}^{T}$, but the question remains whether we can solve the implied optimization problems efficiently. In this section we answer this question in the affirmative, and provide an explicit algorithm for the computation of the sequence $\{p^t\}_{t=1}^{T}$.

We state our main result of this section in the following lemma. The proof can be found in appendix A.

Lemma 1.

Let $\{a_i\}_{i=1}^{N}$ be a set of non-negative numbers, at least one of which is strictly positive, and let $\varepsilon\in[0,1/N]$. Let $\pi:[N]\rightarrow[N]$ be a permutation that orders $\{a_i\}_{i=1}^{N}$ in decreasing order ($a_{\pi(1)} \geq a_{\pi(2)} \geq \dots \geq a_{\pi(N)}$). Define:

\rho := \max\left\{ i\in[N] \;\middle|\; a_{\pi(i)} \geq \varepsilon \frac{\sum_{j=1}^{i} a_{\pi(j)}}{1-(N-i)\varepsilon} \right\} \qquad (11)

and:

\lambda := \frac{\sum_{j=1}^{\rho} a_{\pi(j)}}{1-(N-\rho)\varepsilon} \qquad (12)

Then a solution of the optimization problem:

\min_{p\in\Delta(\varepsilon)} \sum_{i=1}^{N} \frac{1}{p_{i}} a_{i}^{2} \qquad (13)

is given by:

p^{*}_{i} = \begin{cases} a_{i}/\lambda & \text{if } i\in\{\pi(1),\dots,\pi(\rho)\} \\ \varepsilon & \text{otherwise} \end{cases} \qquad (14)

If all the $a_i$ are zero, any $p\in\Delta(\varepsilon)$ is a solution.

In light of Lemma 1, to compute $p^t$ as defined in (10), it is enough to know the value of $\rho_t$ as defined in (11), replacing $a_i$ with $\lVert h_i^t\rVert_2$ and $\varepsilon$ with $\varepsilon_t$. Using $\rho_t$ we can then compute $\lambda_t$ using (12), and obtain $p^t$ from (14). It remains to specify an efficient way to perform this computation.
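The following is a direct NumPy transcription of Lemma 1 (the naive $O(N\log N)$ route of the next subsection, not the tree-based implementation of appendix A); in our setting one takes $a_i = \lVert h_i^t\rVert_2$ and $\varepsilon = \varepsilon_t$. Note that $\varepsilon = 0$ recovers the classical optimal distribution $p_i = a_i/\sum_j a_j$.

```python
import numpy as np

def restricted_simplex_argmin(a, eps):
    """Solve min_{p in Delta(eps)} sum_i a_i^2 / p_i via Lemma 1."""
    a = np.asarray(a, dtype=float)
    N = len(a)
    if not a.any():
        return np.full(N, 1.0 / N)          # all a_i zero: any feasible p works
    order = np.argsort(-a)                  # permutation pi (decreasing order)
    prefix = np.cumsum(a[order])            # sum_{j <= i} a_{pi(j)}
    i = np.arange(1, N + 1)
    ok = a[order] >= eps * prefix / (1.0 - (N - i) * eps)
    rho = int(np.max(np.where(ok)[0])) + 1  # the max in (11); always >= 1
    lam = prefix[rho - 1] / (1.0 - (N - rho) * eps)   # lambda from (12)
    p = np.full(N, eps)
    p[order[:rho]] = a[order[:rho]] / lam   # the solution (14)
    return p
```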

3.3 Implementation details

The naive way to perform the above computation is to do the following at each iteration:

  • Sort $\{\lVert h_i^t\rVert_2\}_{i=1}^{N}$ in decreasing order.

  • Find $\rho_t$ by traversing $(\pi(i))_{i=1}^{N}$ in increasing order and finding the first $i\in[N]$ for which the inequality in (11) does not hold.

  • Explicitly compute the probabilities using (12) and (14).

  • Sample from $p^t$ using inverse transform sampling.

This has complexity $O(N\log N)$. In appendix A, we present an algorithm that requires only $O(N)$ vectorized operations, $O(\log^2 N)$ sequential (non-vectorized) operations, and $cN$ memory for a small constant $c$. The algorithm uses three data structures:

  • An array storing $\{\lVert h_i^t\rVert_2\}_{i=1}^{N}$ unsorted.

  • An order statistic tree storing the pairs $(\lVert h_i^t\rVert_2, i)_{i=1}^{N}$ sorted according to $\lVert h_i^t\rVert_2$.

  • An array storing $\{\sum_{j=1}^{i} \lVert h_{\pi(j)}^t\rVert_2\}_{i=1}^{N}$, where $\pi$ is the permutation that sorts $\{\lVert h_i^t\rVert_2\}_{i=1}^{N}$ in the order statistic tree.

The order statistic tree, together with the array storing the cumulative sums, allows the retrieval of $\rho_t$ in $O(\log^2 N)$ time. The cumulative sums allow sampling from $p^t$ in $O(\log N)$ time using binary search, and maintaining them is the only step that requires a vectorized $O(N)$ operation. All other operations run in $O(\log N)$ time. See appendix A for the full version of the algorithm and a more complete discussion of its computational complexity.
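For the sampling step alone, here is a minimal sketch of the inverse-transform draw, assuming the (unnormalized) cumulative sums are already maintained as in the array above.

```python
import numpy as np

def sample_from_cumsums(cs, rng):
    """cs[i] = sum of the first i+1 unnormalized probabilities; a uniform
    draw on [0, cs[-1]) is mapped to an index by binary search, O(log N)."""
    u = rng.random() * cs[-1]
    return int(np.searchsorted(cs, u, side="right"))
```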

4 Theory

In this section, we prove a sub-linear dynamic regret guarantee for our proposed algorithm when used with SGD and SGLD with decreasing step-sizes. We present our results in a more general setting, and show that they apply to our cases of interest. We start by stating our assumptions:

Assumption 1.

(Bounded gradients) There exists $G>0$ such that $\lVert\nabla f_i(x)\rVert_2 \leq G$ for all $x\in\mathbb{R}^{d}$ and for all $i\in[N]$.

Assumption 2.

(Smoothness) There exists $L>0$ such that $\lVert\nabla f_i(x) - \nabla f_i(y)\rVert_2 \leq L\lVert x-y\rVert_2$ for all $x,y\in\mathbb{R}^{d}$ and for all $i\in[N]$.

Assumption 3.

(Contraction of the iterates in expectation) There exist constants $A\geq 0$, $B\geq 1$, and $\delta\in(0,1]$ such that $\operatorname{\mathbb{E}}\left[\lVert x_{t+1}-x_t\rVert_2 \mid I_1,\dots,I_{t-1}\right] \leq A/(B+t-1)^{\delta}$ for all $t\in[T]$.

The bounded gradients assumption has been used in all previous related work [10, 11, 12, 13], although it is admittedly quite strong. The smoothness assumption is standard in the study of optimization and sampling algorithms. Note that we chose to state our assumptions using index-independent constants to make the presentation clearer; this does not affect our derivation of the sequence $\{\varepsilon_t\}_{t=1}^{T}$.

Finally, Assumption 3 should really be derived from more basic assumptions, and is only stated to allow for a unified analysis. Note that in the optimization case this is a very mild assumption since we are generally only interested in convergent sequences, and for any such sequence with reasonably fast convergence this assumption holds. The following proposition shows that Assumption 3 holds for our cases of interest. All the proofs for this section can be found in appendix B.

Proposition 1.

For any choice of $\{p^t\}_{t=1}^{T}$, the iterates of SGD (2) with the gradient estimator (4) and decreasing step-sizes $\alpha_t := E/(F+t-1)^{\beta}$ with $E\geq 0$, $F\geq 1$, and $\beta\in(0,1]$ satisfy Assumption 3 with $A := NGE$, $B := F$, and $\delta := \beta$. Under the same conditions, the iterates of SGLD (3) satisfy Assumption 3 with $A := \sqrt{E}\left(NG\sqrt{\alpha_1} + \sqrt{2d}\right)$, $B := F$, and $\delta := \beta/2$.

We now state a proposition that relates the optimal function value for the problem (13) over the restricted simplex, with the optimal function value over the whole simplex. Its proof is taken from ([12], Lemma 6):

Proposition 2.

Let $\{a_i\}_{i=1}^{N}$ be a set of non-negative numbers, and let $\varepsilon\in[0, 1/(2N)]$. Then:

\min_{p\in\Delta(\varepsilon)} \sum_{i=1}^{N} \frac{1}{p_{i}} a_{i}^{2} - \min_{p\in\Delta} \sum_{i=1}^{N} \frac{1}{p_{i}} a_{i}^{2} \leq 6\varepsilon N \left(\sum_{i=1}^{N} a_{i}\right)^{2}

The following lemma gives our first bound on the regret per time step:

Lemma 2.

Let $q^{t} := \operatorname*{argmin}_{p\in\Delta}\{c_{t}(p)\}$. Under Assumption 1, and when using the sequence of distributions defined by (10), we have the following bound for $t\in\{t_0,\dots,T\}$:

\operatorname{\mathbb{E}}\left[c_{t}(p^{t}) - c_{t}(q^{t})\right] \leq \frac{4G}{\varepsilon_{t}} \operatorname{\mathbb{E}}\left[\sum_{i=1}^{N} \lVert g_{i}^{t} - h_{i}^{t}\rVert_{2}\right] + 6\varepsilon_{t} G^{2} N^{3}

where $t_0 := \min\{t\in[T] \mid \varepsilon_t \leq \frac{1}{2N}\}$.

Proof outline.

Let $\widetilde{p}^{t} := \operatorname*{argmin}_{p\in\Delta}\{\widetilde{c}_{t}(p)\}$. Then we have the following decomposition:

\operatorname{\mathbb{E}}\left[c_{t}(p^{t}) - c_{t}(q^{t})\right] = \underbrace{\operatorname{\mathbb{E}}\left[c_{t}(p^{t}) - \widetilde{c}_{t}(p^{t})\right]}_{\text{(A)}} + \underbrace{\operatorname{\mathbb{E}}\left[\widetilde{c}_{t}(p^{t}) - \widetilde{c}_{t}(\widetilde{p}^{t})\right]}_{\text{(B)}} + \underbrace{\operatorname{\mathbb{E}}\left[\widetilde{c}_{t}(\widetilde{p}^{t}) - c_{t}(q^{t})\right]}_{\text{(C)}}

The terms (A) and (C) are the penalties we pay for using a surrogate cost function, while (B) is the price we pay for restricting the simplex. Using Assumption 1, Proposition 2, and the fact that $p^t$ is contained in the $\varepsilon_t$-restricted simplex, each of these terms can be bounded to give the stated result. ∎

The expectation in the first term of the above lemma is highly reminiscent of the first term of the Lyapunov function used to study the convergence of SAGA, first proposed in [30] and subsequently refined in [31] (with $g_i^t$ replaced by $g_i^*$, the gradient at the minimum). Inspired by this similarity, we prove the following recursion:

Lemma 3.

Under Assumptions 2 and 3, we have:

\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N} \lVert g_{i}^{t+1} - h_{i}^{t+1}\rVert_{2}\right] \leq \frac{NLA}{(B+t-1)^{\delta}} + (1-\varepsilon_{t}) \operatorname{\mathbb{E}}\left[\sum_{i=1}^{N} \lVert g_{i}^{t} - h_{i}^{t}\rVert_{2}\right]

The natural thing to do at this stage is to unroll the above recursion, substitute into Lemma 2, sum over all time steps, and minimize the resulting regret bound over the choice of the sequence $\{\varepsilon_t\}_{t=1}^{T}$. However, even if we could solve the resulting optimization problem efficiently, the solution would still depend on the constants $G$ and $L$ and on the initial error due to the initialization of the $h_i$, both of which are usually unknown. Instead, we make an asymptotic argument to find the optimal decay rate of $\{\varepsilon_t\}_{t=1}^{T}$, propose a sequence that satisfies this decay rate, and show that it gives rise to the stated dynamic regret bounds.

Denote by $\varphi(t)$ the expectation in the first term of the upper bound in Lemma 2. Suppose we take $\varepsilon_t$ to be of order $t^{-\beta}$. Looking at the recursion in Lemma 3, we see that to control the positive term, we need the negative term to be of order at least $t^{-\delta}$, so that $\varphi(t)$ cannot be smaller than $t^{\beta-\delta}$. The bound of Lemma 2 is therefore of order $t^{2\beta-\delta} + t^{-\beta}$. The minimum is attained when the exponents are equal, so we have:

2\beta - \delta = -\beta \implies \beta = \frac{\delta}{3}

We are now ready to guess the form of $\varepsilon_t$. Matching the positive term in Lemma 3, we consider the following sequence:

\varepsilon_{t} := \frac{1}{C^{1-\delta/3}(C+t-1)^{\delta/3}} \qquad (15)

for a free parameter $C$ satisfying $C\geq N$, which ensures $\varepsilon_1 \leq 1/N$. With this choice of the sequence $\{\varepsilon_t\}_{t=1}^{T}$, we are now ready to state the main result of the paper:

Theorem 1.

Under Assumptions 1 and 2 on the functions $f_i$ and Assumption 3 on the sequence $\{x_t\}_{t=1}^{T}$, algorithm (10) with the sequence $\{\varepsilon_t\}_{t=1}^{T}$ given in (15) satisfies the following dynamic regret bound for all $T\geq t_0$:

\operatorname{\mathbb{E}}\left[\text{Regret}_{D}(T)\right] \leq \mathcal{O}(T^{1-\delta/3}) \qquad (16)

where $t_0 := \min\{t\in[T] \mid \varepsilon_t \leq \frac{1}{2N}\}$, as in Lemma 2.

Proof outline.

Using the sequence given by (15) and the recursion in Lemma 3, we show by induction that $\varphi(t) \leq \mathcal{O}(t^{-2\delta/3})$. Substituting into Lemma 2, summing over all time steps, and bounding the resulting sums by the appropriate integrals, we obtain the result. ∎

Note that since $C\geq N$, we have $t_0 \leq (2^{3/\delta}-1)N + 1$, so $t_0$ is bounded by a constant independent of $T$. Furthermore, setting $C=2N$ makes the theorem hold for all $T\in\mathbb{N}$. In practice, however, it might be beneficial to set $C=N$ to overcome a bad initialization of the $h_i$.

Combining Theorem 1 with Proposition 1, we obtain the following corollary:

Corollary 1.

Under Assumptions 1 and 2 on the functions $f_i$, if SGD (2) is run with step-sizes $\mathcal{O}(1/t)$ using the estimator (4) and the probabilities (10) with the sequence $\{\varepsilon_t\}_{t=1}^{T}$ given by (15), then for all $T\geq t_0$:

\operatorname{\mathbb{E}}\left[\text{Regret}_{D}(T)\right] \leq \mathcal{O}(T^{2/3}) \qquad (17)

and for SGLD (3):

\operatorname{\mathbb{E}}\left[\text{Regret}_{D}(T)\right] \leq \mathcal{O}(T^{5/6}) \qquad (18)

under the same conditions.

5 A new mini-batch estimator

In most practical applications, one uses a mini-batch of samples to construct the gradient estimator instead of a single sample. The most basic such estimator is formed by sampling a mini-batch of indices $S_t = \{I_t^1,\dots,I_t^m\}$ uniformly and independently, and taking the sample mean. This gives an unbiased estimator whose variance decreases as $1/m$. A simple way to make it more efficient is to sample the indices uniformly but without replacement. In that case, the variance decreases by an additional factor of $(1-(m-1)/(N-1))$. For $m \ll N$, the difference is negligible, and the additional cost of sampling without replacement is not justified. However, when using unequal probabilities, this argument no longer holds, and the additional variance reduction obtained from sampling without replacement can be significant even for small $m$.

For our problem, besides the additional variance reduction, sampling without replacement allows a higher rate of replacement of the $h_i$, which is directly related to our regret through the factor in front of the second term of Lemma 3, whose proper generalization for mini-batch sampling is $(1-\min_{i\in[N]}\pi_i^t)$, where $\pi_i^t = P(i\in S_t)$ is the inclusion probability of index $i$. This makes sampling without replacement desirable for our purposes. Unfortunately, unlike in the uniform case, the sample-mean generalization of (4) is no longer unbiased. We propose instead the following estimator:

\hat{g}^{t}_{b} = \frac{1}{m}\sum_{j=1}^{m} \hat{g}^{t}_{j}, \qquad \hat{g}^{t}_{j} := \frac{1}{q^{t,j}_{I_{t}^{j}}} g_{I_{t}^{j}}^{t} + \sum_{k=1}^{j-1} g_{I_{t}^{k}}^{t}, \qquad q^{t,j}_{i} := \frac{p^{t}_{i}}{1-\sum_{k=1}^{j-1} p^{t}_{I_{t}^{k}}} \qquad (19)
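A hedged sketch of the estimator (19), with the same hypothetical `grad_fi` helper as before: indices are drawn sequentially without replacement, renormalizing the remaining probability mass at each draw, and the per-draw estimators are averaged.

```python
import numpy as np

def minibatch_estimator(x, p, m, grad_fi, rng):
    """Estimator (19): q^{t,j} renormalizes p over the not-yet-drawn indices."""
    N = len(p)
    avail = np.ones(N, dtype=bool)
    mass_left = 1.0
    seen_sum = 0.0          # sum of gradients of indices drawn so far
    total = 0.0
    for _ in range(m):
        q = np.where(avail, p / mass_left, 0.0)   # q^{t,j}_i
        q = q / q.sum()                           # guard against round-off
        i = rng.choice(N, p=q)
        g_i = grad_fi(i, x)
        total = total + g_i / q[i] + seen_sum     # accumulate \hat{g}^t_j
        seen_sum = seen_sum + g_i
        mass_left -= p[i]
        avail[i] = False
    return total / m                              # \hat{g}^t_b
```

Each $\hat{g}^t_j$ is conditionally unbiased by construction, since the reweighted term accounts exactly for the gradients of the already-drawn indices.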

We summarize some of its properties in the following proposition:

Proposition 3.

Let $S_t^j := \{I_t^1,\dots,I_t^j\}$ for $j\in[m]$ and $S_t^0 := \emptyset$. We have:

  (a) $\operatorname{\mathbb{E}}\left[\hat{g}_b^t\right] = g^t$

  (b) $\operatorname{\mathbb{E}}\left[\lVert\hat{g}_b^t - g^t\rVert_2^2\right] = (1/m^2)\sum_{j=1}^{m}\operatorname{\mathbb{E}}\left[\lVert\hat{g}^t_j - g^t\rVert_2^2\right]$

  (c) $\operatorname*{argmin}_{p\in\Delta}\{\operatorname{\mathbb{E}}\left[\lVert\hat{g}_b^t - g^t\rVert_2^2\right]\} = \operatorname*{argmin}_{p\in\Delta}\{c_t(p)\}$

  (d) $\operatorname{\mathbb{E}}\left[\lVert\hat{g}^t_{j+1} - g^t\rVert_2^2\right] = \left(1 - \operatorname{\mathbb{E}}\left[q^{t,j}_{I_t^j}\right]\right)\operatorname{\mathbb{E}}\left[\lVert\hat{g}^t_j - g^t\rVert_2^2\right] - \operatorname{\mathbb{E}}\left[q^{t,j}_{I_t^j}\lVert\hat{g}^t_j - g^t\rVert_2^2\right]$

where all the expectations in (d) are conditional on $S_t^{j-1}$.

The proposed estimator is therefore unbiased, and its variance decreases super-linearly in $m$ (by (b) and (d)). Although we were unable to prove a regret bound for this estimator, (c) suggests that it is still reasonable to use our algorithm. To take into account the higher rate of replacement of the $h_i$, we propose using the following $\varepsilon_t$ sequence, which is based on the mini-batch equivalent of Lemma 3 and the inequality $\min_{i\in[N]}\pi_i^t \geq m\varepsilon_t$ [32, 33, 34]:

\varepsilon_{t} := \frac{1}{C^{1-\delta/3}(C+m(t-1))^{\delta/3}} \qquad (20)
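Both schedules are one-liners; a sketch, with (15) recovered as the $m=1$ case of (20):

```python
def epsilon_schedule(t, C, delta, m=1):
    """Restriction level at step t: equation (15) when m = 1, (20) otherwise."""
    return 1.0 / (C ** (1.0 - delta / 3.0) * (C + m * (t - 1)) ** (delta / 3.0))
```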

6 Experiments

In this section, we present results of experiments with our algorithm. We start by validating our theoretical results through a synthetic experiment. We then show that our proposed method outperforms existing methods on real world datasets. Finally, we identify settings in which adaptive importance sampling can lead to significant performance gains for the final optimization algorithm.

In all experiments, we added $\ell_2$-regularization to the model's loss and set the regularization parameter $\mu=1$. We ran SGD (2) with decreasing step-sizes $\alpha_t = \frac{m}{2NL+m\mu t}$, where $m$ is the batch size, following [35]. We experimented with three different samplers in addition to our proposed sampler: Uniform, Multi-armed bandit sampler (Mabs) [11], and Variance Reducer Bandit (Vrb) [12]. The hyperparameters of both Mabs and Vrb are set to the ones prescribed by the theory in the original papers. For our proposed sampler Avare, we use the epsilon sequence given by (20) with $C=N$ and $\delta=1$, and initialize $h_i=0$ for all $i\in[N]$. For each sampler, we ran SGD 10 times and averaged the results. The shaded areas represent a one-standard-deviation confidence interval.
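For reference, the experimental step-size schedule stated above, as a one-line sketch:

```python
def experiment_step_size(t, m, N, L, mu=1.0):
    """Decreasing step-size alpha_t = m / (2NL + m*mu*t) used in all experiments."""
    return m / (2.0 * N * L + m * mu * t)
```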

Figure 1: Evolution of the function sub-optimality (left) and dynamic regret (right) as a function of data passes, on a synthetic dataset with an $\ell_2$-regularized logistic regression model and different samplers.

To validate our theoretical findings, we randomly generated a dataset for binary classification with $N=100$ and $d=10$. We then trained a logistic regression model on the generated data, and used a batch size of 1 to match the setting of our regret bound. The results of this experiment are depicted in Figure 1, and show that Avare outperforms the other methods, achieving significantly lower dynamic regret and faster convergence.

For our second experiment, we tested our algorithm on three real-world datasets: MNIST, IJCNN1 [36], and CIFAR10. We used a softmax regression model, and a batch size of 128 sampled with replacement. The results of this experiment can be seen in Figure 2. The third column shows the relative error, which we define as $\left[c_t(p^t) - \min_{p\in\Delta} c_t(p)\right]/\min_{p\in\Delta} c_t(p)$. In all three cases, Avare achieves significantly smaller dynamic regret, with a relative error quickly decaying to zero. The performance gains in the final optimization algorithm are clear for both MNIST and IJCNN1, but are not noticeable in the case of CIFAR10.

To determine the effectiveness of non-uniform sampling in accelerating the optimization process, previous work [4, 11] has suggested looking at the ratio of the maximum smoothness constant to the average smoothness constant. We show here that this is the wrong measure to look at when using adaptive probabilities. In particular, we argue that the ratio of the variance under uniform sampling at the loss minimizer to the optimal variance at the loss minimizer is much more informative of the performance gains achievable through adaptive importance sampling. For each dataset, these ratios are displayed in Table 1, supporting our claim. We suspect that for large models capable of fitting the data nearly perfectly, our proposed ratio will be large, since many of the per-example gradients at the optimum will be zero. We therefore expect our method to be particularly effective in the training of models of this type. We leave such experiments for future work. Finally, in appendix D, we propose an extension of our method to constant step-size SGD, and show that it preserves the performance gains observed when using decreasing step-sizes.

Figure 2: Comparison of the performance of importance samplers on an $\ell_2$-regularized softmax regression model on three real-world datasets: MNIST (top), IJCNN1 (middle), CIFAR10 (bottom).
Table 1: Useful ratios for determining the effectiveness of variance reduction through importance sampling. $L_i$ is the smoothness constant of $f_i$, $L_{\max} = \max_{i\in[N]} L_i$, and $g_i^*$ is the gradient of $f_i$ at the loss minimizer $x^*$.

Dataset      $\frac{NL_{\max}}{\sum_{i=1}^{N} L_i}$      $\frac{N\sum_{i=1}^{N}\lVert g_i^*\rVert_2^2}{(\sum_{i=1}^{N}\lVert g_i^*\rVert_2)^2}$
Synthetic    1.69    4.46
MNIST        3.28    5.08
IJCNN1       1.12    4.83
CIFAR10      3.40    1.14

Broader Impact

Our work develops a new method for variance reduction for stochastic optimization and sampling algorithms. On the optimization side, we expect our method to be very useful in accelerating the training of large scale neural networks, particularly since our method is expected to provide significant performance gains when the model is able to fit the data nearly perfectly. On the sampling side, we expect our method to be useful in accelerating the convergence of MCMC algorithms, opening the door for the use of accurate Bayesian methods at a large scale.

Acknowledgments and Disclosure of Funding

This research was supported by an NSERC discovery grant. We would like to thank the anonymous reviewers for their useful comments and suggestions.

References

  • [1] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Ann. Math. Statist., 22(3):400–407, 1951.
  • [2] Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 681–688. Omnipress, 2011.
  • [3] Deanna Needell, Rachel Ward, and Nathan Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D Lawrence, and Kilian Q Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1017–1025, 2014.
  • [4] Peilin Zhao and Tong Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In Francis R Bach and David M Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 1–9. JMLR.org, 2015.
  • [5] Guillaume Bouchard, Théo Trouillon, Julien Perez, and Adrien Gaidon. Accelerating Stochastic Gradient Descent via Online Learning to Sample. CoRR, abs/1506.0, 2015.
  • [6] Sebastian U Stich, Anant Raj, and Martin Jaggi. Safe Adaptive Importance Sampling. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M Wallach, Rob Fergus, S V N Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4381–4391, 2017.
  • [7] Angelos Katharopoulos and François Fleuret. Not All Samples Are Created Equal: Deep Learning with Importance Sampling. In Jennifer G Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2530–2539. PMLR, 2018.
  • [8] Tyler B Johnson and Carlos Guestrin. Training Deep Models Faster with Robust, Approximate Importance Sampling. In Samy Bengio, Hanna M Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7276–7286, 2018.
  • [9] Dominik Csiba and Peter Richtárik. Importance Sampling for Minibatches. J. Mach. Learn. Res., 19:27:1–27:21, 2018.
  • [10] Hongseok Namkoong, Aman Sinha, Steve Yadlowsky, and John C Duchi. Adaptive Sampling Probabilities for Non-Smooth Optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2574–2583. PMLR, 2017.
  • [11] Farnood Salehi, L Elisa Celis, and Patrick Thiran. Stochastic Optimization with Bandit Sampling. CoRR, abs/1708.0, 2017.
  • [12] Zalan Borsos, Andreas Krause, and Kfir Y Levy. Online Variance Reduction for Stochastic Optimization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 324–357. PMLR, 2018.
  • [13] Zalán Borsos, Sebastian Curi, Kfir Yehuda Levy, and Andreas Krause. Online Variance Reduction with Mixtures. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 705–714. PMLR, 2019.
  • [14] Omar Besbes, Yonatan Gur, and Assaf J Zeevi. Non-Stationary Stochastic Optimization. Oper. Res., 63(5):1227–1244, 2015.
  • [15] Xi Chen, Yining Wang, and Yu-Xiang Wang. Non-stationary Stochastic Optimization with Local Spatial and Temporal Changes. CoRR, abs/1708.0, 2017.
  • [16] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, CDC 2016, Las Vegas, NV, USA, December 12-14, 2016, pages 7195–7201. IEEE, 2016.
  • [17] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient. In Maria-Florina Balcan and Kilian Q Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 449–457. JMLR.org, 2016.
  • [18] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive Online Learning in Dynamic Environments. In Samy Bengio, Hanna M Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 1330–1340, 2018.
  • [19] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.
  • [20] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. The Journal of the Operational Research Society, 1984.
  • [21] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48–77, 2002.
  • [22] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 928–936. AAAI Press, 2003.
  • [23] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
  • [24] Elad Hazan. Introduction to Online Convex Optimization. Found. Trends Optim., 2(3-4):157–325, 2016.
  • [25] Mark Herbster and Manfred Warmuth. Tracking the Best Expert. In Armand Prieditis and Stuart Russell, editors, Machine Learning Proceedings 1995, pages 286–294. Morgan Kaufmann, San Francisco (CA), 1995.
  • [26] Mark Herbster and Manfred K Warmuth. Tracking the Best Linear Predictor. J. Mach. Learn. Res., 1:281–309, September 2001.
  • [27] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror Descent Meets Fixed Share (and feels no regret). In Peter L Bartlett, Fernando C N Pereira, Christopher J C Burges, Léon Bottou, and Kilian Q Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 989–997, 2012.
  • [28] András György and Csaba Szepesvári. Shifting Regret, Mirror Descent, and Matrices. In Maria-Florina Balcan and Kilian Q Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2943–2951. JMLR.org, 2016.
  • [29] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. CoRR, abs/1407.0202, 2014.
  • [30] Thomas Hofmann, Aurélien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance Reduced Stochastic Gradient Descent with Neighbors. In Corinna Cortes, Neil D Lawrence, Daniel D Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2305–2313, 2015.
  • [31] Aaron Defazio. A Simple Practical Accelerated Method for Finite Sums. In Daniel D Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 676–684, 2016.
  • [32] Yaming Yu. On the inclusion probabilities in some unequal probability sampling plans without replacement. Bernoulli, 18(1):279–289, 2012.
  • [33] Hartmut Milbrodt. Comparing inclusion probabilities and drawing probabilities for rejective sampling and successive sampling. Statistics and Probability Letters, 14(3):243–246, 1992.
  • [34] Talluri Rao, Samindranath Sengupta, and B Sinha. Some order relations between selection and inclusion probabilities for PPSWOR sampling scheme. Metrika, 38:335–343, 1991.
  • [35] Sebastian U. Stich. Unified optimal analysis of the (stochastic) gradient method, 2019.
  • [36] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), May 2011.

Appendix A Algorithm

A.1 Proof of Lemma 1

Lemma 1 (restated).

Proof.
Edge case and well-definedness.

If $a_i = 0$ for all $i\in[N]$, then the objective function is identically $0$ over $\Delta(\varepsilon)$ for all $\varepsilon\in[0,1/N]$, so that any $p\in\Delta(\varepsilon)$ is a solution (we set $(1/0)\cdot 0 := 0$ in the objective). Otherwise, there exists an $i\in[N]$ such that $a_i > 0$, and therefore $a_{\pi(1)} > 0$. Now:

\frac{\varepsilon}{1-(N-1)\varepsilon} \leq 1
\Leftrightarrow \varepsilon \leq 1-(N-1)\varepsilon
\Leftrightarrow 0 \leq 1-N\varepsilon
\Leftrightarrow N\varepsilon \leq 1
\Leftrightarrow \varepsilon \leq \frac{1}{N}

The last inequality is true, so it implies the first, and we have:

a_{\pi(1)} \geq \varepsilon \frac{a_{\pi(1)}}{1-(N-1)\varepsilon}

Therefore $\rho$ is well defined and $\rho \geq 1$. As a consequence, $\lambda$ is also well defined and $\lambda > 0$, making $p^*$ in turn well defined.

Optimality proof.

It is easily verified that problem (13) is convex. Its Lagrangian is given by:

\mathcal{L}(p,\mu,\nu) = \sum_{i=1}^{N} \frac{1}{p_{i}} a_{i}^{2} - \sum_{i=1}^{N} \mu_{i}(p_{i}-\varepsilon) + \nu\left(\sum_{i=1}^{N} p_{i} - 1\right)

and the KKT conditions are:

(Stationarity) \quad p_{i} = \frac{a_{i}}{\sqrt{\nu-\mu_{i}}}
(Complementary slackness) \quad \mu_{i} = 0 \,\lor\, p_{i} = \varepsilon
(Primal feasibility) \quad p_{i} \geq \varepsilon \,\land\, \sum_{j=1}^{N} p_{j} = 1
(Dual feasibility) \quad \mu_{i} \geq 0

By convexity of the problem, the KKT conditions are sufficient for global optimality. To show that our proposed solution is optimal, it therefore suffices to exhibit constants $(\mu_i^*)_{i=1}^{N}$ and $\nu^*$ that, together with $p^*$, satisfy these conditions. Let:

\nu^{*} := \lambda^{2}
\mu^{*}_{i} := \begin{cases} 0 & \text{if } i\in\{\pi(1),\dots,\pi(\rho)\} \\ \nu^{*} - a_{i}^{2}/\varepsilon^{2} & \text{otherwise} \end{cases}

Note that the $\mu_i^*$ are well defined, since when $\varepsilon = 0$ we have $\rho = N$ and $\mu_i^* = 0$ for all $i\in[N]$. We claim that the triplet $(p^*, (\mu_i^*)_{i=1}^{N}, \nu^*)$ satisfies the KKT conditions.

Stationarity and complementary slackness are immediate from the definitions. The first clause of primal feasibility holds by definition of $p_i^*$ for $i\in\{\pi(\rho+1),\dots,\pi(N)\}$. In the other case we have:

p_{i}^{*} = \frac{a_{i}}{\lambda}
= \frac{a_{i}}{\sum_{j=1}^{\rho} a_{\pi(j)}}\left(1-(N-\rho)\varepsilon\right)
\geq \frac{a_{\pi(\rho)}}{\sum_{j=1}^{\rho} a_{\pi(j)}}\left(1-(N-\rho)\varepsilon\right)
\geq \varepsilon

where in the third line we used the fact that $\pi$ orders $\{a_i\}_{i=1}^{N}$ in decreasing order, and in the last line we used the inequality in the definition of $\rho$. For the second clause of primal feasibility:

\sum_{j=1}^{N} p^{*}_{j} = \sum_{j=1}^{N} p^{*}_{\pi(j)} = \sum_{j=1}^{\rho} \frac{a_{\pi(j)}}{\lambda} + \sum_{j=\rho+1}^{N} \varepsilon = \left(1-(N-\rho)\varepsilon\right) + (N-\rho)\varepsilon = 1

Finally, dual feasibility holds by definition of $\mu_i^*$ for $i\in\{\pi(1),\dots,\pi(\rho)\}$. In the other case we have:

\mu_{i}^{*} = \nu^{*} - \frac{a_{i}^{2}}{\varepsilon^{2}}
= \lambda^{2} - \frac{a_{i}^{2}}{\varepsilon^{2}}
= \left(\lambda + \frac{a_{i}}{\varepsilon}\right)\left(\lambda - \frac{a_{i}}{\varepsilon}\right)
\geq \left(\lambda + \frac{a_{i}}{\varepsilon}\right)\left(\lambda - \frac{a_{\pi(\rho+1)}}{\varepsilon}\right)

The first factor is positive by positivity of $\lambda$ and non-negativity of the $a_i$. For the second factor, we have by the maximality of $\rho$:

a_{\pi(\rho+1)} < \varepsilon \frac{\sum_{j=1}^{\rho+1} a_{\pi(j)}}{1-(N-\rho-1)\varepsilon}
\Rightarrow a_{\pi(\rho+1)}\left(1-(N-\rho)\varepsilon+\varepsilon\right) < \varepsilon\sum_{j=1}^{\rho} a_{\pi(j)} + \varepsilon a_{\pi(\rho+1)}
\Rightarrow a_{\pi(\rho+1)}\left(1-(N-\rho)\varepsilon\right) < \varepsilon\sum_{j=1}^{\rho} a_{\pi(j)}
\Rightarrow \frac{a_{\pi(\rho+1)}}{\varepsilon} < \lambda \qquad (21)

Therefore the second factor is also positive, and dual feasibility holds. ∎

A.2 Implementation and complexity

In this section of the appendix, we provide pseudocode for the implementation of the algorithm and discuss its computational complexity.

Algorithm 1 Implementation of the proposed sampler
1: class Sampler:
2:   procedure Initialize($\{\lVert h_i\rVert_2\}_{i=1}^{N}$)
3:     self.H ← Array($\{\lVert h_i\rVert_2\}_{i=1}^{N}$)
4:     self.T ← OST(keys = $\{\lVert h_i\rVert_2\}_{i=1}^{N}$, values = $[N]$)
5:     self.CS ← Array($\{\sum_{j=1}^{i}\lVert h_{\pi(j)}\rVert_2\}_{i=1}^{N}$)
6:   procedure Delete($x$)
7:     r ← self.T.rank($x$)
8:     self.T.delete($x$)
9:     self.CS[r:N] ← self.CS[r:N] − $x$
10:  procedure Insert($x$, $i$)
11:    self.T.insert(key = $x$, value = $i$)
12:    r ← self.T.rank($x$)
13:    self.CS[r:N] ← self.CS[r:N] + $x$
14:  procedure Update($\lVert h_I\rVert_2$, $I$)
15:    self.delete(self.H[$I$])
16:    self.H[$I$] ← $\lVert h_I\rVert_2$
17:    self.insert($\lVert h_I\rVert_2$, $I$)
18:  function Search($\varepsilon$, node)
19:    r ← self.T.rank(node)
20:    if r == N then return r
21:    c ← $1-(N-r)\varepsilon$
22:    if c · node.key < $\varepsilon$ · self.CS[r] then
23:      return self.search($\varepsilon$, node.left)
24:    else
25:      d ← $1-(N-r-1)\varepsilon$
26:      if d · node.successor.key < $\varepsilon$ · self.CS[r+1] then
27:        return r
28:      else
29:        return self.search($\varepsilon$, node.right)
30:  function Sample($\varepsilon$)
31:    $\rho$ ← self.search($\varepsilon$, self.T.root)
32:    $\lambda$ ← self.CS[$\rho$] / $(1-(N-\rho)\varepsilon)$
33:    b ∼ Bernoulli($(N-\rho)\varepsilon$)
34:    if b == 1 then
35:      Sample an index $\tilde{I}$ uniformly from $\{\rho+1,\dots,N\}$
36:    else
37:      u ∼ Uniform([0, 1])
38:      Find the first index $\tilde{I}$ for which $\lambda u \leq$ self.CS[$\tilde{I}$] using binary search
39:    $I$ ← self.T.select($\tilde{I}$)
40:    return $I$
Algorithm 2 AVARE

Input: $x_1$, $T$, $\{\alpha_t\}_{t=1}^{T}$, $\{\lVert h_i^1\rVert_2\}_{i=1}^{N}$, $C\geq N$

1: sampler ← Sampler($\{\lVert h_i^1\rVert_2\}_{i=1}^{N}$)
2: for $t = 1, 2, \dots, T$ do
3:   Set $\varepsilon_t$ according to (15)
4:   $I_t$ ← sampler.sample($\varepsilon_t$)
5:   Obtain $x_{t+1}$ using (2) or (3) and the estimator (4)
6:   sampler.update($\lVert g_{I_t}\rVert_2$, $I_t$)
7: return $x_{T+1}$

A.2.1 Implementation

We assume in Algorithm 1 that an order statistic tree (OST) can be instantiated (see, for example, Cormen et al., Introduction to Algorithms, chapter 14), and that the ordering in the tree is such that the key of the left child of a node is greater than or equal to that of the node itself. Furthermore, we assume that the rank($x$) method returns the position of $x$ in the order determined by an inorder traversal of the tree. Finally, the select($i$) method returns the value of the $i^{th}$-largest key in the tree.

The algorithm works as follows. At initialization, three data structures are created: an array $H$ holding the gradient norms according to the original indices, an order statistic tree $T$ holding the gradient norms as keys and the original indices as values, and an array $CS$ holding the cumulative sums of the gradient norms, where the sums are accumulated in the (decreasing) order that sorts the gradient norms in $T$.

The sample($\varepsilon$) method samples from the optimal distribution on $\Delta(\varepsilon)$. It uses the search($\varepsilon$, node) method to find $\rho$ by searching the tree $T$ and using the maximality property of $\rho$. Once $\rho$ is determined, $\lambda$ can be calculated. Using the fact that the cumulative sums are proportional to the CDF of the distribution, the algorithm then samples an index using inverse-transform sampling. The sampled index is then mapped back to an index in the original order using the select method of the tree $T$.

Finally, the update(‖h_I‖_2, I) method replaces the gradient norm of a given index with a new one. It calls the methods delete(x) and insert(x, i), which perform the deletion and insertion while maintaining the tree T and the array CS; a sketch follows.
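In this sketch the update is a delete/insert pair on the tree followed by a refresh of CS; recomputing CS is O(N) like the sub-array patch the paper describes, at the cost of a constant factor.

    def update(self, new_norm, i):
        self.T.remove((-self.H[i], i))     # delete(x): drop the stale key
        self.T.add((-new_norm, i))         # insert(x, i): add the new key
        self.H[i] = new_norm
        self._refresh_cs()                 # O(N), vectorized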

A.2.2 Complexity

First, let us analyze the cost of running the u​p​d​a​t​e​(βˆ₯hIβˆ₯2,I)update(\left\lVert h_{I}\right\rVert_{2},I) method. For the array HH, we only use random access and assignment, which are both O​(1)O(1). For the tree TT, we use the methods i​n​s​e​r​tinsert, d​e​l​e​t​edelete, and r​a​n​krank, all of which are O​(log⁑N)O(\log N). Finally, for the array C​SCS we add and subtract from a sub-array, which takes O​(N)O(N) time, although this operation is vectorized and very fast in practice.

Let us now look at the cost of running s​a​m​p​l​e​(Ξ΅)sample(\varepsilon). The s​e​a​r​c​hsearch method is recursive, but will only be called at most as many times as the height of the tree, which is O​(log⁑N)O(\log N). Now for each call of s​e​a​r​c​hsearch, both the r​a​n​krank and s​u​c​c​e​s​s​o​rsuccessor methods of the tree TT require O​(log⁑N)O(\log N) time. The rest of the s​e​a​r​c​hsearch method only requires O​(1)O(1) operations. The total cost of the s​e​a​r​c​hsearch method is therefore O​(log2⁑N)O(\log^{2}N). For the rest of the s​a​m​p​l​esample method, the operations that dominate the cost are the s​e​l​e​c​tselect method of the tree TT, which takes O​(log⁑N)O(\log N) time, and the binary search in the else branch, which also runs in O​(log⁑N)O(\log N) time. Consequently, the total cost of the sample method is O​(log2⁑N)O(\log^{2}N).

The total per iteration cost of using the proposed sampler is therefore O​(N)O(N) vectorized operations, and O​(log2⁑N)O(\log^{2}N) sequential (non-vectorized) operations. The total memory cost is O​(N)O(N).

Appendix B Theory

Throughout this appendix we use Assumptions 1, 2, and 3; for ease of reference, see their statements in the main text.

B.1 Proof of proposition 1


Proof.

Conditioning on the knowledge of {I1,…,Itβˆ’1}\{I_{1},\dots,I_{t-1}\} we have for SGD:

𝔼⁑[βˆ₯xt+1S​G​Dβˆ’xtS​G​Dβˆ₯2]\displaystyle\operatorname{\mathbb{E}}\left[\left\lVert x_{t+1}^{SGD}-x_{t}^{SGD}\right\rVert_{2}\right] =𝔼⁑[Ξ±t​1pItt​βˆ₯gIttβˆ₯2]\displaystyle=\operatorname{\mathbb{E}}\left[\alpha_{t}\frac{1}{p_{I_{t}}^{t}}\left\lVert g_{I_{t}}^{t}\right\rVert_{2}\right]
=Ξ±tβ€‹βˆ‘i=1Nβˆ₯gitβˆ₯2\displaystyle=\alpha_{t}\sum_{i=1}^{N}\left\lVert g_{i}^{t}\right\rVert_{2}
≀αt​N​G\displaystyle\leq\alpha_{t}NG

and for SGLD we have:

𝔼⁑[βˆ₯xt+1S​G​L​Dβˆ’xtS​G​L​Dβˆ₯2]\displaystyle\operatorname{\mathbb{E}}\left[\left\lVert x_{t+1}^{SGLD}-x_{t}^{SGLD}\right\rVert_{2}\right] ≀𝔼⁑[Ξ±t​1pItt​βˆ₯gIttβˆ₯2]+𝔼⁑[βˆ₯ΞΎtβˆ₯2]\displaystyle\leq\operatorname{\mathbb{E}}\left[\alpha_{t}\frac{1}{p_{I_{t}}^{t}}\left\lVert g_{I_{t}}^{t}\right\rVert_{2}\right]+\operatorname{\mathbb{E}}\left[\left\lVert\xi_{t}\right\rVert_{2}\right]
≀αtβ€‹βˆ‘i=1Nβˆ₯gitβˆ₯2+𝔼⁑[βˆ₯ΞΎtβˆ₯22]\displaystyle\leq\alpha_{t}\sum_{i=1}^{N}\left\lVert g_{i}^{t}\right\rVert_{2}+\sqrt{\operatorname{\mathbb{E}}\left[\left\lVert\xi_{t}\right\rVert_{2}^{2}\right]}
≀αt​N​G+Ξ±t​2​d\displaystyle\leq\alpha_{t}NG+\sqrt{\alpha_{t}}\sqrt{2d}
≀αt​(N​G​α1+2​d)\displaystyle\leq\sqrt{\alpha_{t}}\left(NG\sqrt{\alpha_{1}}+\sqrt{2d}\right)

where in the first line we used the triangle inequality, in the second we used Jensen’s inequality, in the third we used Assumption 1 together with 𝔼[βˆ₯ξ_tβˆ₯_2^2] = 2α_t d, and in the last we used the fact that {α_t}_{t=1}^T is decreasing, so that α_t ≀ α_1. Substituting the value of α_t gives the result. ∎

B.2 Proof of proposition 2

The following proof is taken from ([12], Lemma 6).

Proof.

By Lemma 1 we have:

minpβˆˆΞ”β€‹(Ξ΅)β€‹βˆ‘i=1N1pi​ai2\displaystyle\min_{p\in\Delta(\varepsilon)}\sum_{i=1}^{N}\frac{1}{p_{i}}a_{i}^{2} =Ξ»β€‹βˆ‘i=1ρaπ​(i)+βˆ‘i=ρ+1Naπ​(i)2Ξ΅\displaystyle=\lambda\sum_{i=1}^{\rho}a_{\pi(i)}+\sum_{i=\rho+1}^{N}\frac{a^{2}_{\pi(i)}}{\varepsilon}
≀λ2​(1βˆ’(Nβˆ’Ο)​Ρ)+Ξ΅β€‹βˆ‘i=ρ+1Naπ​(ρ+1)2Ξ΅2\displaystyle\leq\lambda^{2}\left(1-(N-\rho)\varepsilon\right)+\varepsilon\sum_{i=\rho+1}^{N}\frac{a^{2}_{\pi(\rho+1)}}{\varepsilon^{2}}
≀λ2​(1βˆ’(Nβˆ’Ο)​Ρ)+(Nβˆ’Ο)​Ρ​λ2\displaystyle\leq\lambda^{2}\left(1-(N-\rho)\varepsilon\right)+(N-\rho)\varepsilon\lambda^{2}
=Ξ»2\displaystyle=\lambda^{2}

where in the third line we used inequality (21) from the proof of Lemma 1. Now for the case Ρ=0\varepsilon=0 we have ρ=N\rho=N, so the second term in the first line is zero and the inequality becomes an equality:

minpβˆˆΞ”β€‹βˆ‘i=1N1pi​ai2=(βˆ‘i=1Nai)2\min_{p\in\Delta}\sum_{i=1}^{N}\frac{1}{p_{i}}a_{i}^{2}=\left(\sum_{i=1}^{N}a_{i}\right)^{2} (22)

The difference is therefore bounded by:

minpβˆˆΞ”β€‹(Ξ΅)β€‹βˆ‘i=1N1pi​ai2βˆ’minpβˆˆΞ”β€‹βˆ‘i=1N1pi​ai2\displaystyle\min_{p\in\Delta(\varepsilon)}\sum_{i=1}^{N}\frac{1}{p_{i}}a_{i}^{2}-\min_{p\in\Delta}\sum_{i=1}^{N}\frac{1}{p_{i}}a_{i}^{2} ≀(βˆ‘i=1ρaπ​(i))2(1βˆ’(Nβˆ’Ο)​Ρ)2βˆ’(βˆ‘i=1Nai)2\displaystyle\leq\frac{\left(\sum_{i=1}^{\rho}a_{\pi(i)}\right)^{2}}{(1-(N-\rho)\varepsilon)^{2}}-\left(\sum_{i=1}^{N}a_{i}\right)^{2}
≀(1(1βˆ’N​Ρ)2βˆ’1)​(βˆ‘i=1Nai)2\displaystyle\leq\left(\frac{1}{(1-N\varepsilon)^{2}}-1\right)\left(\sum_{i=1}^{N}a_{i}\right)^{2}
≀6​Ρ​N​(βˆ‘i=1Nai)2\displaystyle\leq 6\varepsilon N\left(\sum_{i=1}^{N}a_{i}\right)^{2}

where in the last line we used the inequality 1/(1-x)^2 - 1 ≀ 6x for x ∈ [0, 1/2], which holds since the left-hand side is convex in x, vanishes at x = 0, and equals 3 at x = 1/2, so it lies below the chord 6x on [0, 1/2]; this gives the restriction Ρ ∈ [0, 1/(2N)]. ∎

B.3 Proof of Lemma 2


Proof.

Let tβ‰₯t0t\geq t_{0} and p~t:=argminpβˆˆΞ”{c~t​(p)}\widetilde{p}^{t}:=\operatorname*{argmin}_{p\in\Delta}\{\widetilde{c}_{t}(p)\}. We have the following decomposition:

𝔼⁑[ct​(pt)βˆ’ct​(qt)]=𝔼⁑[ct​(pt)βˆ’c~t​(pt)]⏟(A)+𝔼⁑[c~t​(pt)βˆ’c~t​(p~t)]⏟(B)+𝔼⁑[c~t​(p~t)βˆ’ct​(qt)]⏟(C)\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-c_{t}(q^{t})\right]=\underbrace{\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-\widetilde{c}_{t}(p^{t})\right]}_{\text{(A)}}+\underbrace{\operatorname{\mathbb{E}}\left[\widetilde{c}_{t}(p^{t})-\widetilde{c}_{t}(\widetilde{p}^{t})\right]}_{\text{(B)}}+\underbrace{\operatorname{\mathbb{E}}\left[\widetilde{c}_{t}(\widetilde{p}^{t})-c_{t}(q^{t})\right]}_{\text{(C)}}

We bound each term separately:

(A) =𝔼⁑[βˆ‘i=1N1pit​(βˆ₯gitβˆ₯22βˆ’βˆ₯hitβˆ₯22)]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\frac{1}{p_{i}^{t}}\left(\left\lVert g_{i}^{t}\right\rVert_{2}^{2}-\left\lVert h_{i}^{t}\right\rVert_{2}^{2}\right)\right]
=𝔼⁑[βˆ‘i=1N1pit​(βˆ₯gitβˆ₯2βˆ’βˆ₯hitβˆ₯2)​(βˆ₯gitβˆ₯2+βˆ₯hitβˆ₯2)]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\frac{1}{p_{i}^{t}}\left(\left\lVert g_{i}^{t}\right\rVert_{2}-\left\lVert h_{i}^{t}\right\rVert_{2}\right)\left(\left\lVert g_{i}^{t}\right\rVert_{2}+\left\lVert h_{i}^{t}\right\rVert_{2}\right)\right]
≀𝔼⁑[βˆ‘i=1N2​Gpit​βˆ₯gitβˆ’hitβˆ₯2]\displaystyle\leq\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\frac{2G}{p_{i}^{t}}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]
≀2​GΞ΅t​𝔼⁑[βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2]\displaystyle\leq\frac{2G}{\varepsilon_{t}}\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]

where in the third line we used Assumption 1 and the reverse triangle inequality, and in the last line we used the fact that p^t ∈ Δ(Ρ_t), so that p_i^t β‰₯ Ρ_t. Since t β‰₯ t_0, we can apply proposition 2 to (B) to obtain:

(B) ≀𝔼⁑[6​Ρt​N​(βˆ‘i=1Nβˆ₯hitβˆ₯2)2]≀6​Ρt​G2​N3\displaystyle\leq\operatorname{\mathbb{E}}\left[6\varepsilon_{t}N\left(\sum_{i=1}^{N}\left\lVert h_{i}^{t}\right\rVert_{2}\right)^{2}\right]\leq 6\varepsilon_{t}G^{2}N^{3}

where the second inequality uses Assumption 1. Finally, using the optimal function values (22) we have for (C):

(C) =𝔼⁑[(βˆ‘i=1Nβˆ₯hitβˆ₯2)2βˆ’(βˆ‘i=1Nβˆ₯gitβˆ₯2)2]\displaystyle=\operatorname{\mathbb{E}}\left[\left(\sum_{i=1}^{N}\left\lVert h_{i}^{t}\right\rVert_{2}\right)^{2}-\left(\sum_{i=1}^{N}\left\lVert g_{i}^{t}\right\rVert_{2}\right)^{2}\right]
=𝔼⁑[(βˆ‘i=1N(βˆ₯hitβˆ₯2βˆ’βˆ₯gitβˆ₯2))​(βˆ‘i=1N(βˆ₯gitβˆ₯2+βˆ₯hitβˆ₯2))]\displaystyle=\operatorname{\mathbb{E}}\left[\left(\sum_{i=1}^{N}\left(\left\lVert h_{i}^{t}\right\rVert_{2}-\left\lVert g_{i}^{t}\right\rVert_{2}\right)\right)\left(\sum_{i=1}^{N}(\left\lVert g_{i}^{t}\right\rVert_{2}+\left\lVert h_{i}^{t}\right\rVert_{2})\right)\right]
≀2​G​N​𝔼⁑[βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2]\displaystyle\leq 2GN\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]
≀2​GΞ΅t​𝔼⁑[βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2]\displaystyle\leq\frac{2G}{\varepsilon_{t}}\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]

where we again used Assumption 1 and the reverse triangle inequality in the third line. The last inequality follows from the fact that Ρ_t ≀ 1/N, so that N ≀ 1/Ρ_t. Combining the three bounds gives the result. ∎

B.4 Proof of Lemma 3


Proof.

Conditioning on the knowledge of {I1,…,Itβˆ’1}\{I_{1},\dots,I_{t-1}\} we have:

𝔼⁑[βˆ‘i=1Nβˆ₯git+1βˆ’hit+1βˆ₯2]\displaystyle\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t+1}-h_{i}^{t+1}\right\rVert_{2}\right] =βˆ‘j=1Npjt​[βˆ₯gjt+1βˆ’gjtβˆ₯2+βˆ‘i=1iβ‰ jNβˆ₯git+1βˆ’hitβˆ₯2]\displaystyle=\sum_{j=1}^{N}p_{j}^{t}\left[\left\lVert g_{j}^{t+1}-g_{j}^{t}\right\rVert_{2}+\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\left\lVert g_{i}^{t+1}-h_{i}^{t}\right\rVert_{2}\right]
β‰€βˆ‘j=1Npjt​[βˆ₯gjt+1βˆ’gjtβˆ₯2+βˆ‘i=1iβ‰ jN(βˆ₯git+1βˆ’gitβˆ₯2+βˆ₯gitβˆ’hitβˆ₯2)]\displaystyle\leq\sum_{j=1}^{N}p_{j}^{t}\left[\left\lVert g_{j}^{t+1}-g_{j}^{t}\right\rVert_{2}+\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\left(\left\lVert g_{i}^{t+1}-g_{i}^{t}\right\rVert_{2}+\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right)\right]
=βˆ‘j=1Npjt​[βˆ‘i=1Nβˆ₯git+1βˆ’gitβˆ₯2+βˆ‘i=1iβ‰ jNβˆ₯gitβˆ’hitβˆ₯2]\displaystyle=\sum_{j=1}^{N}p_{j}^{t}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t+1}-g_{i}^{t}\right\rVert_{2}+\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]
≀N​Lβ€‹βˆ‘j=1Npjt​βˆ₯xt+1βˆ’xtβˆ₯2+βˆ‘i=1N(βˆ‘j=1jβ‰ iNpjt)​βˆ₯gitβˆ’hitβˆ₯2\displaystyle\leq NL\sum_{j=1}^{N}p_{j}^{t}\left\lVert x_{t+1}-x_{t}\right\rVert_{2}+\sum_{i=1}^{N}\left(\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{N}p_{j}^{t}\right)\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}
=N​L​𝔼⁑[βˆ₯xt+1βˆ’xtβˆ₯2]+βˆ‘i=1N(1βˆ’pit)​βˆ₯gitβˆ’hitβˆ₯2\displaystyle=NL\operatorname{\mathbb{E}}\left[\left\lVert x_{t+1}-x_{t}\right\rVert_{2}\right]+\sum_{i=1}^{N}(1-p_{i}^{t})\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}
≀N​L​A(B+tβˆ’1)Ξ΄+(1βˆ’Ξ΅t)β€‹βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2\displaystyle\leq\frac{NLA}{(B+t-1)^{\delta}}+(1-\varepsilon_{t})\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}

where in the fourth line we used Assumption 2, and in the last line we used Assumption 3 and the fact that ptβˆˆΞ”β€‹(Ξ΅t)p^{t}\in\Delta(\varepsilon_{t}). Taking expectation with respect to the choice of {I1,…,Itβˆ’1}\{I_{1},\dots,I_{t-1}\} on both sides we get the result. ∎

B.5 Lemma 4

Before proving Theorem 1, we first state and prove the following bound, obtained by solving the recursion of Lemma 3 when the sequence {Ρ_t}_{t=1}^T is given by (15).

Lemma 4.

Assuming the use of the sequence {Ξ΅t}t=1T\{\varepsilon_{t}\}_{t=1}^{T} given by (15) we have:

𝔼⁑[βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2]≀K(C+tβˆ’1)2​δ/3\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right]\leq\frac{K}{(C+t-1)^{2\delta/3}}

where:

K:=\max\left\{\frac{3C^{1-\delta/3}D}{3-2\delta},C^{2\delta/3}\sum_{i=1}^{N}\left\lVert g_{i}^{1}-h_{i}^{1}\right\rVert_{2}\right\}

and:

D:={N​L​AifΒ Bβ‰₯C(CB)δ​N​L​AifΒ B<CD:=\begin{cases}NLA&\text{if $B\geq C$}\\ (\frac{C}{B})^{\delta}NLA&\text{if $B<C$}\end{cases}

where AA, BB, and Ξ΄\delta are as in Assumption 3.

Proof.
A simple inequality.

Suppose Bβ‰₯CB\geq C, then:

N​L​A(B+tβˆ’1)δ≀N​L​A(C+tβˆ’1)Ξ΄\frac{NLA}{(B+t-1)^{\delta}}\leq\frac{NLA}{(C+t-1)^{\delta}}

otherwise, we have:

N​L​A(B+tβˆ’1)Ξ΄=(CB)δ​N​L​A(C+(CB)​(tβˆ’1))δ≀(CB)δ​N​L​A(C+tβˆ’1)Ξ΄\displaystyle\frac{NLA}{(B+t-1)^{\delta}}=\frac{(\frac{C}{B})^{\delta}NLA}{(C+(\frac{C}{B})(t-1))^{\delta}}\leq\frac{(\frac{C}{B})^{\delta}NLA}{(C+t-1)^{\delta}}

where the last inequality follows from the fact that C>BC>B and tβ‰₯1t\geq 1. We conclude that:

N​L​A(B+tβˆ’1)δ≀D(C+tβˆ’1)Ξ΄\frac{NLA}{(B+t-1)^{\delta}}\leq\frac{D}{(C+t-1)^{\delta}}

Induction proof.

Let φ​(t):=𝔼⁑[βˆ‘i=1Nβˆ₯gitβˆ’hitβˆ₯2]\varphi(t):=\operatorname{\mathbb{E}}\left[\sum_{i=1}^{N}\left\lVert g_{i}^{t}-h_{i}^{t}\right\rVert_{2}\right], and let Kβ€²:=C2βˆ’2​δ/3​KK^{\prime}:=C^{2-2\delta/3}K. For t=1t=1 the statement holds since:

φ​(1)=C2​δ/3​φ​(1)(C+tβˆ’1)2​δ/3≀K(C+tβˆ’1)2​δ/3\varphi(1)=\frac{C^{2\delta/3}\varphi(1)}{(C+t-1)^{2\delta/3}}\leq\frac{K}{(C+t-1)^{2\delta/3}}

For tβ‰₯1t\geq 1 we have by Lemma 3 and the above inequality:

φ​(t+1)\displaystyle\varphi(t+1) ≀D(C+tβˆ’1)Ξ΄+(1βˆ’1C1βˆ’Ξ΄/3​(C+tβˆ’1)Ξ΄/3)​φ​(t)\displaystyle\leq\frac{D}{(C+t-1)^{\delta}}+\left(1-\frac{1}{C^{1-\delta/3}(C+t-1)^{\delta/3}}\right)\varphi(t)
=a​C3βˆ’Ξ΄β€‹Da​C3βˆ’Ξ΄β€‹(C+tβˆ’1)Ξ΄+(1βˆ’1C1βˆ’Ξ΄/3​(C+tβˆ’1)Ξ΄/3)​φ​(t)\displaystyle=\frac{aC^{3-\delta}D}{aC^{3-\delta}(C+t-1)^{\delta}}+\left(1-\frac{1}{C^{1-\delta/3}(C+t-1)^{\delta/3}}\right)\varphi(t)
≀Kβ€²a​C3βˆ’Ξ΄β€‹(C+tβˆ’1)Ξ΄+(1βˆ’1C1βˆ’Ξ΄/3​(C+tβˆ’1)Ξ΄/3)​Kβ€²C2βˆ’2​δ/3​(C+tβˆ’1)2​δ/3\displaystyle\leq\frac{K^{\prime}}{aC^{3-\delta}(C+t-1)^{\delta}}+\left(1-\frac{1}{C^{1-\delta/3}(C+t-1)^{\delta/3}}\right)\frac{K^{\prime}}{C^{2-2\delta/3}(C+t-1)^{2\delta/3}}

where a=33βˆ’2​δa=\frac{3}{3-2\delta} and where the last line follows by the induction hypothesis. To simplify notation let x:=(C+tβˆ’1)x:=(C+t-1), E:=C1βˆ’Ξ΄/3E:=C^{1-\delta/3}, Ξ³:=(1βˆ’1a)=2​δ3\gamma:=(1-\frac{1}{a})=\frac{2\delta}{3}. Then the above inequality can be rewritten as:

φ​(t+1)\displaystyle\varphi(t+1) ≀K′​(1E2​x2​δ/3βˆ’Ξ³E3​xΞ΄)\displaystyle\leq K^{\prime}\left(\frac{1}{E^{2}x^{2\delta/3}}-\frac{\gamma}{E^{3}x^{\delta}}\right)
=K′​E​xΞ΄/3βˆ’Ξ³E3​xΞ΄\displaystyle=K^{\prime}\frac{Ex^{\delta/3}-\gamma}{E^{3}x^{\delta}}
=K′​E3​xΞ΄βˆ’Ξ³3E3​xδ​(E2​x2​δ/3+E​γ​xΞ΄/3+Ξ³2)\displaystyle=K^{\prime}\frac{E^{3}x^{\delta}-\gamma^{3}}{E^{3}x^{\delta}(E^{2}x^{2\delta/3}+E\gamma x^{\delta/3}+\gamma^{2})}
≀K′​1E2​x2​δ/3+E​γ​xΞ΄/3\displaystyle\leq K^{\prime}\frac{1}{E^{2}x^{2\delta/3}+E\gamma x^{\delta/3}}

Now by concavity of x2​δ/3x^{2\delta/3} we have:

E2​[(x+1)2​δ/3βˆ’x2​δ/3]≀E2​2​δ3​x2​δ/3βˆ’1E^{2}\left[(x+1)^{2\delta/3}-x^{2\delta/3}\right]\leq E^{2}\frac{2\delta}{3}x^{2\delta/3-1}

so that:

E2​x2​δ/3+E​γ​xΞ΄/3β‰₯E2​(x+1)2​δ/3\displaystyle E^{2}x^{2\delta/3}+E\gamma x^{\delta/3}\geq E^{2}(x+1)^{2\delta/3}
⇔E​γ​xΞ΄/3β‰₯E2​[(x+1)2​δ/3βˆ’x2​δ/3]\displaystyle\Leftrightarrow E\gamma x^{\delta/3}\geq E^{2}\left[(x+1)^{2\delta/3}-x^{2\delta/3}\right]
⇐γ​E​xΞ΄/3β‰₯2​δ3​E2​x2​δ/3βˆ’1\displaystyle\Leftarrow\gamma Ex^{\delta/3}\geq\frac{2\delta}{3}E^{2}x^{2\delta/3-1}
⇔x1βˆ’Ξ΄/3β‰₯C1βˆ’Ξ΄/3\displaystyle\Leftrightarrow x^{1-\delta/3}\geq C^{1-\delta/3}
⇔xβ‰₯C\displaystyle\Leftrightarrow x\geq C
⇔(C+tβˆ’1)β‰₯C\displaystyle\Leftrightarrow(C+t-1)\geq C
⇔tβ‰₯1\displaystyle\Leftrightarrow t\geq 1

The last statement is true, and therefore so is the first. Substituting into the bound on φ(t+1) we get:

φ​(t+1)≀Kβ€²C2βˆ’2​δ/3​(C+(t+1)βˆ’1)2​δ/3=K(C+(t+1)βˆ’1)2​δ/3\varphi(t+1)\leq\frac{K^{\prime}}{C^{2-2\delta/3}(C+(t+1)-1)^{2\delta/3}}=\frac{K}{(C+(t+1)-1)^{2\delta/3}}

which finishes the proof. ∎

B.6 Proof of Theorem 1


Proof.

Combining Lemma 4 with Lemma 2 we have the following per-step bound for tβ‰₯t0t\geq t_{0}:

𝔼[ct(pt)βˆ’ct(qt)]≀4​G​K​C1βˆ’Ξ΄/3+6​G2​N3C1βˆ’Ξ΄/3(C+tβˆ’1)Ξ΄/3=:Kβ€²(C+tβˆ’1)Ξ΄/3\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-c_{t}(q^{t})\right]\leq\frac{4GKC^{1-\delta/3}+\frac{6G^{2}N^{3}}{C^{1-\delta/3}}}{(C+t-1)^{\delta/3}}=:\frac{K^{\prime}}{(C+t-1)^{\delta/3}}

Summing over the time steps {t0,…,T}\{t_{0},\dots,T\} we get:

βˆ‘t=t0T𝔼⁑[ct​(pt)βˆ’ct​(qt)]β‰€βˆ‘t=t0TKβ€²(C+tβˆ’1)Ξ΄/3\displaystyle\sum_{t=t_{0}}^{T}\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-c_{t}(q^{t})\right]\leq\sum_{t=t_{0}}^{T}\frac{K^{\prime}}{(C+t-1)^{\delta/3}}

Therefore:

𝔼⁑[RegretD​(T)]\displaystyle\operatorname{\mathbb{E}}\left[\text{Regret}_{D}(T)\right] =βˆ‘t=1t0βˆ’1𝔼⁑[ct​(pt)βˆ’ct​(qt)]+βˆ‘t=t0T𝔼⁑[ct​(pt)βˆ’ct​(qt)]\displaystyle=\sum_{t=1}^{t_{0}-1}\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-c_{t}(q^{t})\right]+\sum_{t=t_{0}}^{T}\operatorname{\mathbb{E}}\left[c_{t}(p^{t})-c_{t}(q^{t})\right]
β‰€βˆ‘t=1t0βˆ’1𝔼⁑[ct​(pt)]+βˆ‘t=t0TKβ€²(C+tβˆ’1)Ξ΄/3\displaystyle\leq\sum_{t=1}^{t_{0}-1}\operatorname{\mathbb{E}}\left[c_{t}(p^{t})\right]+\sum_{t=t_{0}}^{T}\frac{K^{\prime}}{(C+t-1)^{\delta/3}}
β‰€βˆ‘t=1t0βˆ’1βˆ‘i=1N𝔼⁑[1pit​βˆ₯gitβˆ₯22]+Kβ€²β€‹βˆ«t=t0βˆ’1T1(C+tβˆ’1)Ξ΄/3​𝑑t\displaystyle\leq\sum_{t=1}^{t_{0}-1}\sum_{i=1}^{N}\operatorname{\mathbb{E}}\left[\frac{1}{p_{i}^{t}}\left\lVert g_{i}^{t}\right\rVert_{2}^{2}\right]+K^{\prime}\int_{t=t_{0}-1}^{T}\frac{1}{(C+t-1)^{\delta/3}}dt
≀(t0βˆ’1)​N​G2Ξ΅t0+K′​(C+Tβˆ’1)1βˆ’Ξ΄/3βˆ’(C+t0βˆ’2)1βˆ’Ξ΄/31βˆ’Ξ΄/3\displaystyle\leq(t_{0}-1)\frac{NG^{2}}{\varepsilon_{t_{0}}}+K^{\prime}\frac{(C+T-1)^{1-\delta/3}-(C+t_{0}-2)^{1-\delta/3}}{1-\delta/3}
≀(23/Ξ΄βˆ’1)​N2​G2Ξ΅t0+K′​(C+Tβˆ’1)1βˆ’Ξ΄/3βˆ’(Cβˆ’1)1βˆ’Ξ΄/31βˆ’Ξ΄/3\displaystyle\leq(2^{3/\delta}-1)\frac{N^{2}G^{2}}{\varepsilon_{t_{0}}}+K^{\prime}\frac{(C+T-1)^{1-\delta/3}-(C-1)^{1-\delta/3}}{1-\delta/3}
=π’ͺ​(T1βˆ’Ξ΄/3)\displaystyle=\mathcal{O}(T^{1-\delta/3})

where the penultimate line follows from the fact that 1 ≀ t_0 ≀ (2^{3/δ}-1)N + 1, which holds since C β‰₯ N. ∎

Appendix C A new mini-batch estimator

C.1 A class of unbiased estimators

It will be useful for our discussion to consider the following class π’žβ€‹(pt,1,…,pt,m)\mathcal{C}(p^{t,1},\dots,p^{t,m}) of estimators:

g^bt​(pt,1,…,pt,m):=1mβ€‹βˆ‘j=1mg^jt​(pt,j)g^jt​(pt,j):=[1pItjt,j​gItjt+βˆ‘k=1jβˆ’1gItkt]\hat{g}_{b}^{t}(p^{t,1},\dots,p^{t,m}):=\frac{1}{m}\sum_{j=1}^{m}\hat{g}_{j}^{t}(p^{t,j})\quad\quad\hat{g}_{j}^{t}(p^{t,j}):=\left[\frac{1}{p^{t,j}_{I_{t}^{j}}}g_{I_{t}^{j}}^{t}+\sum_{k=1}^{j-1}g_{I_{t}^{k}}^{t}\right]

where each p^{t,j} is a distribution on [N]\setminus\{I_t^{1},\dots,I_t^{j-1}\}. The estimator we proposed in Section 5 is the one given in (19), where the indices S_t=\{I_t^{1},\dots,I_t^{m}\} are sampled without replacement according to p^{t}. Setting:

pit,j:={0ifΒ i∈{It1,…,Itjβˆ’1}qit,jotherwisep^{t,j}_{i}:=\begin{cases}0&\text{if $i\in\{I_{t}^{1},\dots,I_{t}^{j-1}\}$}\\ q_{i}^{t,j}&\text{otherwise}\end{cases}

we see that our proposed estimator belongs to the class of estimators \mathcal{C} introduced above. The proofs of (a) and (b) of proposition 3 below apply with almost no modification to any estimator in the class \mathcal{C}. A natural question, then, is which estimator in this class achieves minimum variance. We answer this in the proof of part (c) below, and show that our proposed estimator (19) with p^{t}:=\operatorname*{argmin}_{p\in\Delta}\{c_{t}(p)\} achieves minimum variance.
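As an illustration, the following Python sketch implements this estimator under the choice of p^{t,j} above: indices are drawn one at a time, with p^t renormalized over the indices not yet drawn, and the per-step estimates ĝ_j^t are averaged. The array names are illustrative.

import numpy as np

def minibatch_estimate(grads, p, m, rng=None):
    # grads: (N, d) array whose rows are the g_i^t; p: the distribution p^t
    if rng is None:
        rng = np.random.default_rng()
    N, d = grads.shape
    q = np.asarray(p, dtype=float).copy()
    partial = np.zeros(d)                  # running sum of g_{I^k}, k < j
    est = np.zeros(d)
    S = []
    for j in range(m):
        qj = q / q.sum()                   # q^{t,j}: p^t restricted and renormalized
        i = int(rng.choice(N, p=qj))
        est += grads[i] / qj[i] + partial  # per-step estimate g_hat_j^t from the class C
        partial += grads[i]
        q[i] = 0.0                         # exclude i from later draws
        S.append(i)
    return est / m, S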

C.2 Proof of proposition 3


Proof.
  1. (a)

    For j∈[m]j\in[m] and conditional on Stjβˆ’1S_{t}^{j-1} we have:

    𝔼⁑[g^jt]\displaystyle\operatorname{\mathbb{E}}\left[\hat{g}_{j}^{t}\right] =βˆ‘i=1iβˆ‰Stjβˆ’1Nqit,j​[1qit,j​git+βˆ‘k∈Stjβˆ’1gkt]\displaystyle=\sum_{\begin{subarray}{c}i=1\\ i\notin S_{t}^{j-1}\end{subarray}}^{N}q^{t,j}_{i}\left[\frac{1}{q_{i}^{t,j}}g_{i}^{t}+\sum_{k\in S_{t}^{j-1}}g_{k}^{t}\right]
    =βˆ‘i=1iβˆ‰Stjβˆ’1Ngit+(βˆ‘k∈Stjβˆ’1gkt)​(βˆ‘i=1iβˆ‰Stjβˆ’1Nqit,j)⏟=1\displaystyle=\sum_{\begin{subarray}{c}i=1\\ i\notin S_{t}^{j-1}\end{subarray}}^{N}g_{i}^{t}+\left(\sum_{k\in S_{t}^{j-1}}g_{k}^{t}\right)\underbrace{\left(\sum_{\begin{subarray}{c}i=1\\ i\notin S_{t}^{j-1}\end{subarray}}^{N}q_{i}^{t,j}\right)}_{=1}
    =βˆ‘i=1Ngit\displaystyle=\sum_{i=1}^{N}g_{i}^{t}
    =gt\displaystyle=g^{t}

    Taking expectation with respect to the choice of Stjβˆ’1S_{t}^{j-1}, and taking the average over j∈[m]j\in[m] we get the result.

  2. (b)

    We have:

    𝔼⁑[βˆ₯g^btβˆ’gtβˆ₯22]\displaystyle\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{b}^{t}-g^{t}\right\rVert_{2}^{2}\right] =𝔼⁑[βˆ₯1mβ€‹βˆ‘j=1mg^jtβˆ’gtβˆ₯22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\lVert\frac{1}{m}\sum_{j=1}^{m}\hat{g}_{j}^{t}-g^{t}\right\rVert_{2}^{2}\right]
    =1m2β€‹βˆ‘j=1m𝔼⁑[βˆ₯g^jtβˆ’gtβˆ₯22]+2m2β€‹βˆ‘j,ij<im𝔼⁑[⟨g^jtβˆ’gt,g^itβˆ’gt⟩]\displaystyle=\frac{1}{m^{2}}\sum_{j=1}^{m}\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{j}^{t}-g^{t}\right\rVert_{2}^{2}\right]+\frac{2}{m^{2}}\sum_{\begin{subarray}{c}j,i\\ j<i\end{subarray}}^{m}\operatorname{\mathbb{E}}\left[\langle\hat{g}_{j}^{t}-g^{t},\hat{g}_{i}^{t}-g^{t}\rangle\right]

To show the claim, it is therefore enough to show that the second term is zero. Let 1 ≀ j < i ≀ m. Conditional on S_t^{i-1} we have:

    𝔼⁑[⟨g^jtβˆ’gt,g^itβˆ’gt⟩]=⟨g^jtβˆ’gt,𝔼⁑[g^it]βˆ’gt⟩=0\operatorname{\mathbb{E}}\left[\langle\hat{g}_{j}^{t}-g^{t},\hat{g}_{i}^{t}-g^{t}\rangle\right]=\langle\hat{g}_{j}^{t}-g^{t},\operatorname{\mathbb{E}}\left[\hat{g}_{i}^{t}\right]-g^{t}\rangle=0

where we used the fact that, by part (a), 𝔼[ĝ_i^t] = g^t conditional on S_t^{i-1}. Taking expectation with respect to S_t^{i-1} on both sides yields the result.

  3. (c)

    As discussed in the previous section, (b) applies to all estimators in the class π’ž\mathcal{C}, so we have for all such estimators:

    𝔼⁑[βˆ₯g^bt​(pt,1,…,pt,m)βˆ’gtβˆ₯22]=1m2β€‹βˆ‘j=1m𝔼⁑[βˆ₯g^jt​(pt,j)βˆ’gtβˆ₯22]\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{b}^{t}(p^{t,1},\dots,p^{t,m})-g^{t}\right\rVert_{2}^{2}\right]=\frac{1}{m^{2}}\sum_{j=1}^{m}\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{j}^{t}(p^{t,j})-g^{t}\right\rVert_{2}^{2}\right]

Minimizing over (p^{t,1},\dots,p^{t,m}) by minimizing each term with respect to its own variable, we get:

    argmin(pt,1,…,pt,m){𝔼⁑[βˆ₯g^bt​(pt,1,…,pt,m)βˆ’gtβˆ₯22]}=(pt,βˆ—1βˆ’βˆ‘k=1jβˆ’1pItkt,βˆ—)j=1m\operatorname*{argmin}_{(p^{t,1},\dots,p^{t,m})}\{\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{b}^{t}(p^{t,1},\dots,p^{t,m})-g^{t}\right\rVert_{2}^{2}\right]\}=\left(\frac{p^{t,*}}{1-\sum_{k=1}^{j-1}p_{I_{t}^{k}}^{t,*}}\right)_{j=1}^{m}

    where:

p^{t,*}:=\operatorname*{argmin}_{p\in\Delta}\{c_{t}(p)\},\qquad p_{i}^{t,*}=\frac{\left\lVert g_{i}^{t}\right\rVert_{2}}{\sum_{j=1}^{N}\left\lVert g_{j}^{t}\right\rVert_{2}}

    Recalling that our estimator is in π’ž\mathcal{C}, and noticing that the optimal probabilities over π’ž\mathcal{C} are feasible for our estimator we get the result.

  4. (d)

    Fix t∈[T]t\in[T]. We drop the superscript tt from ptp^{t}, qit,jq_{i}^{t,j}, and gitg_{i}^{t}. We also drop the subscript tt from ItjI_{t}^{j} and StjS_{t}^{j} to simplify notation. Define for j∈[m]j\in[m]:

    xj:=1qIjj​gIjΞΌj:=gtβˆ’βˆ‘k=1jβˆ’1gIkx_{j}:=\frac{1}{q_{I^{j}}^{j}}g_{I^{j}}\quad\quad\mu_{j}:=g^{t}-\sum_{k=1}^{j-1}g_{I^{k}}

    We have from part (a):

    𝔼⁑[xj]=ΞΌj\operatorname{\mathbb{E}}\left[x_{j}\right]=\mu_{j}

    Before proceeding with the proof, we first derive an identity relating qij+1q_{i}^{j+1} and qijq_{i}^{j}:

    1qij+1\displaystyle\frac{1}{q_{i}^{j+1}} =1βˆ’βˆ‘k∈Sjpkpi\displaystyle=\frac{1-\sum_{k\in S^{j}}p_{k}}{p_{i}}
    =1βˆ’βˆ‘k∈Sjβˆ’1pkβˆ’pIjpi\displaystyle=\frac{1-\sum_{k\in S^{j-1}}p_{k}-p_{I^{j}}}{p_{i}}
    =(1βˆ’pIj1βˆ’βˆ‘k∈Sjβˆ’1pk)​(1βˆ’βˆ‘k∈Sjβˆ’1pkpi)\displaystyle=\left(1-\frac{p_{I^{j}}}{1-\sum_{k\in S^{j-1}}p_{k}}\right)\left(\frac{1-\sum_{k\in S^{j-1}}p_{k}}{p_{i}}\right)
    =(1βˆ’qIjj)​1qij\displaystyle=\left(1-q_{I^{j}}^{j}\right)\frac{1}{q_{i}^{j}}

    Now, conditional on StjS_{t}^{j}, we have:

    𝔼⁑[βˆ₯g^j+1tβˆ’gtβˆ₯22]\displaystyle\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{j+1}^{t}-g^{t}\right\rVert_{2}^{2}\right]
    =𝔼⁑[βˆ₯xj+1βˆ’ΞΌj+1βˆ₯22]\displaystyle=\operatorname{\mathbb{E}}\left[\left\lVert x_{j+1}-\mu_{j+1}\right\rVert_{2}^{2}\right]
    =𝔼⁑[βˆ₯xj+1βˆ₯22]βˆ’βˆ₯ΞΌj+1βˆ₯22\displaystyle=\operatorname{\mathbb{E}}\left[\left\lVert x_{j+1}\right\rVert_{2}^{2}\right]-\left\lVert\mu_{j+1}\right\rVert_{2}^{2}
    =(βˆ‘i=1iβˆ‰SjN1qij+1​βˆ₯giβˆ₯22)βˆ’βˆ₯ΞΌj+1βˆ₯22\displaystyle=\left(\sum_{\begin{subarray}{c}i=1\\ i\notin S^{j}\end{subarray}}^{N}\frac{1}{q_{i}^{j+1}}\left\lVert g_{i}\right\rVert_{2}^{2}\right)-\left\lVert\mu_{j+1}\right\rVert_{2}^{2}
    =(βˆ‘i=1iβˆ‰Sjβˆ’1N1qij+1​βˆ₯giβˆ₯22)βˆ’1qIjj+1​βˆ₯gIjβˆ₯22βˆ’βˆ₯ΞΌjβˆ’gIjβˆ₯22\displaystyle=\left(\sum_{\begin{subarray}{c}i=1\\ i\notin S^{j-1}\end{subarray}}^{N}\frac{1}{q_{i}^{j+1}}\left\lVert g_{i}\right\rVert_{2}^{2}\right)-\frac{1}{q_{I^{j}}^{j+1}}\left\lVert g_{I^{j}}\right\rVert_{2}^{2}-\left\lVert\mu_{j}-g_{I^{j}}\right\rVert_{2}^{2}
    =(1βˆ’qIjj)​(βˆ‘i=1iβˆ‰Sjβˆ’1N1qij​βˆ₯giβˆ₯22)βˆ’(1βˆ’qIjj)​1qIjj​βˆ₯gIjβˆ₯22βˆ’(βˆ₯ΞΌjβˆ₯22βˆ’2β€‹βŸ¨ΞΌj,gIj⟩+βˆ₯gIjβˆ₯22)\displaystyle=\left(1-q_{I^{j}}^{j}\right)\left(\sum_{\begin{subarray}{c}i=1\\ i\notin S^{j-1}\end{subarray}}^{N}\frac{1}{q_{i}^{j}}\left\lVert g_{i}\right\rVert_{2}^{2}\right)-\left(1-q_{I^{j}}^{j}\right)\frac{1}{q_{I^{j}}^{j}}\left\lVert g_{I^{j}}\right\rVert_{2}^{2}-\left(\left\lVert\mu_{j}\right\rVert_{2}^{2}-2\langle\mu_{j},g_{I^{j}}\rangle+\left\lVert g_{I^{j}}\right\rVert_{2}^{2}\right)
    =(1βˆ’qIjj)​(βˆ‘i=1iβˆ‰Sjβˆ’1N1qij​βˆ₯giβˆ₯22βˆ’βˆ₯ΞΌjβˆ₯22)βˆ’(1qIjj​βˆ₯gIjβˆ₯22βˆ’2β€‹βŸ¨gIj,ΞΌj⟩+qIjj​βˆ₯ΞΌjβˆ₯22)\displaystyle=\left(1-q_{I^{j}}^{j}\right)\left(\sum_{\begin{subarray}{c}i=1\\ i\notin S^{j-1}\end{subarray}}^{N}\frac{1}{q_{i}^{j}}\left\lVert g_{i}\right\rVert_{2}^{2}-\left\lVert\mu_{j}\right\rVert_{2}^{2}\right)-\left(\frac{1}{q^{j}_{I^{j}}}\left\lVert g_{I^{j}}\right\rVert_{2}^{2}-2\langle g_{I^{j}},\mu_{j}\rangle+q_{I^{j}}^{j}\left\lVert\mu_{j}\right\rVert_{2}^{2}\right)
    =(1βˆ’qIjj)​𝔼⁑[βˆ₯xjβˆ’ΞΌjβˆ₯22]βˆ’qIjj​βˆ₯xjβˆ’ΞΌjβˆ₯22\displaystyle=\left(1-q_{I^{j}}^{j}\right)\operatorname{\mathbb{E}}\left[\left\lVert x_{j}-\mu_{j}\right\rVert_{2}^{2}\right]-q_{I^{j}}^{j}\left\lVert x_{j}-\mu_{j}\right\rVert_{2}^{2}
    =(1βˆ’qIjj)​𝔼⁑[βˆ₯g^jtβˆ’gtβˆ₯22]βˆ’qIjj​βˆ₯g^jtβˆ’gtβˆ₯22\displaystyle=\left(1-q_{I^{j}}^{j}\right)\operatorname{\mathbb{E}}\left[\left\lVert\hat{g}_{j}^{t}-g^{t}\right\rVert_{2}^{2}\right]-q_{I^{j}}^{j}\left\lVert\hat{g}_{j}^{t}-g^{t}\right\rVert_{2}^{2}

    where the expectation in the last two lines is conditional on Sjβˆ’1S^{j-1}. Taking expectation with respect to Sjβˆ’1S^{j-1} on both sides yields the result.

∎

Appendix D Extension to constant step-size SGD

While our analysis heavily relies on the assumption of decreasing step-sizes, we have found empirically that a slight modification of our method works just as well when a constant step-size is used. We propose the following epsilon sequence to account for the use of a constant step-size:

Ξ΅t=1C1βˆ’Ξ΄/3​(C+m​(tβˆ’1))Ξ΄/3+pm​i​n\varepsilon_{t}=\frac{1}{C^{1-\delta/3}(C+m(t-1))^{\delta/3}}+p_{min} (23)

for a constant pm​i​n∈[0,1/N]p_{min}\in[0,1/N] and the following condition on CC:

C\geq\frac{1}{\frac{1}{N}-p_{min}}

which ensures that Ρ_1 ≀ 1/N, since Ρ_1 = 1/C + p_min. We ran the same experiment on MNIST, IJCNN1, and CIFAR10 as in Section 6, but with a constant step-size α_t = α = m/(2NL) and the epsilon sequence (23) with p_min = 1/(5N), C = 1/(1/N - p_min), and δ = 1. The results are displayed in Figure 3 and show performance similar to the decreasing step-size case. Note that choosing p_min too small can deteriorate the performance of the algorithm. It is still unclear how to set p_min so as to guarantee good performance, but our experiments suggest that p_min = 1/(5N) is a safe choice.
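For concreteness, a small Python sketch of this schedule with the experimental defaults above (p_min = 1/(5N), C at its bound, δ = 1); the function name is illustrative.

import numpy as np

def eps_schedule_constant_step(N, T, m, delta=1.0, p_min=None):
    # epsilon sequence (23) for constant step-size SGD
    if p_min is None:
        p_min = 1.0 / (5.0 * N)
    C = 1.0 / (1.0 / N - p_min)            # boundary choice, so eps_1 = 1/N
    t = np.arange(1, T + 1)
    return 1.0 / (C ** (1.0 - delta / 3.0) * (C + m * (t - 1)) ** (delta / 3.0)) + p_min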

Figure 3: Comparison of the performance of importance samplers on an l2-regularized softmax regression model on three real world datasets: MNIST (top), IJCNN1 (middle), CIFAR10 (bottom). For this set of experiments, SGD was run using a constant step size.
\bibliographystylesupp{plain}
\bibliographysupp{mendeley}