
Variance reduction techniques for stochastic proximal point algorithms

Cheik Traoré, Vassilis Apidopoulos, Saverio Salzo, and Silvia Villa

Cheik Traoré (corresponding author): Malga Center, DIMA, Università degli Studi di Genova, Genova, Italy (traore@dima.unige.it). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 861137.
Vassilis Apidopoulos: Malga Center, Dibris, Università degli Studi di Genova, Genova, Italy (vassilis.apid@gmail.com).
Saverio Salzo: Sapienza Università di Roma, Roma, Italy (salzo@diag.uniroma1.it).
Silvia Villa: Malga Center, DIMA, Università degli Studi di Genova, Genova, Italy (silvia.villa@unige.it). She acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390, FA8655-22-1-7034, and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.
Abstract

In the context of finite sums minimization, variance reduction techniques are widely used to improve the performance of state-of-the-art stochastic gradient methods, and both their practical impact and their theoretical properties are by now well understood. Stochastic proximal point algorithms have been studied as an alternative to stochastic gradient algorithms since they are more stable with respect to the choice of the step size. However, their variance-reduced versions are not as well studied as the gradient ones. In this work, we propose the first unified study of variance reduction techniques for stochastic proximal point algorithms. We introduce a generic stochastic proximal-based algorithm that can be specialized to give the proximal versions of SVRG, SAGA, and some of their variants. For this algorithm, in the smooth setting, we provide several convergence rates for the iterates and the objective function values, which are faster than those of the vanilla stochastic proximal point algorithm. More specifically, for convex functions, we prove a sublinear convergence rate of O(1/k). In addition, under the Polyak-Łojasiewicz (PL) condition, we obtain linear convergence rates. Finally, our numerical experiments demonstrate the advantage of the proximal variance reduction methods over their gradient counterparts in terms of stability with respect to the choice of the step size, in most cases and especially for difficult problems.

Keywords. Stochastic optimization, proximal point algorithm, variance reduction techniques, SVRG, SAGA.
AMS Mathematics Subject Classification: 65K05, 90C25, 90C06, 49M27

1 Introduction

The objective of the paper is to solve the following finite-sum optimization problem

minimize_{x ∈ H}  F(x) = (1/n) ∑_{i=1}^{n} f_i(x),   (1.1)

where H is a separable Hilbert space and, for all i ∈ {1, 2, …, n}, f_i : H → ℝ.

Several problems can be expressed as in (1.1). The most popular example is the Empirical Risk Minimization (ERM) problem in machine learning [41, Section 2.2]. In that setting, n is the number of data points, x ∈ ℝ^d collects the parameters of a machine learning model (linear functions, neural networks, etc.), and the function f_i is the loss of the model x at the i-th data point. Due to the large number of data points used in machine learning and deep learning, applying gradient descent (GD) to problem (1.1) can be excessively costly, both in terms of computation and of storage. To overcome these issues, several stochastic variants of gradient descent have been proposed in recent years.

Stochastic Gradient Descent.

The most common version of stochastic approximation [37] applied to (1.1) is the one where, at each step, the full gradient is replaced by ∇f_i, the gradient of a single function f_i, with i sampled uniformly from {1, …, n}. This procedure yields stochastic gradient descent (SGD), often referred to as incremental SGD. In its vanilla version or in modified ones (AdaGrad [16], ADAM [27], etc.), SGD is ubiquitous in modern machine learning and deep learning. Stochastic approximation [37] provides the appropriate framework to study the theoretical properties of SGD, which are nowadays well understood [10, 11, 31, 25]. Theoretical analysis shows that SGD has a worse convergence rate than its deterministic counterpart GD. Indeed, GD exhibits a convergence rate for the function values ranging from O(1/k) for convex functions to O(exp{−C_μ k}) for μ-strongly convex functions, while for SGD the convergence rates of the function values vary from O(1/√k) to O(1/k). In addition, convergence of SGD is guaranteed only if the step size sequence is square summable but not summable. In practice, this requirement gives little guidance, and the appropriate choice of the step size is one of the major issues in SGD implementations. In particular, if the initial step size is too large, SGD blows up even if the sequence of step sizes satisfies the suitable decrease requirement; see, e.g., [31]. Therefore, the step size needs to be tuned by hand, which, for problem (1.1), is typically time consuming.

Stochastic Proximal Point Algorithm.

Whenever the computation of the proximity operator prox_{f_i} of f_i (see the notation in Section 2.1) is tractable, an alternative to SGD is the stochastic proximal point algorithm (SPPA). Instead of the gradient ∇f_i, the proximity operator of a randomly chosen f_i is used at each iteration. Recent works, in particular [38, 3, 26], showed that SPPA is more robust than SGD to the choice of the step size. In addition, the convergence rates are the same as those of SGD in various settings [6, 34, 3], possibly including a momentum term [26, 44, 42, 43].

Variance reduction methods.

As observed above, the convergence rates of SGD are worse than those of its deterministic counterpart. This is due to the non-vanishing variance of the stochastic estimator of the true gradient. To circumvent this issue, starting from SVRG [22, 1], a new wave of SGD algorithms was developed with the aim of reducing the variance and recovering standard GD rates with a constant step size. Different methods were developed, and all share a O(1/k) and a O(e^{−C_μ k}) convergence rate for function values for convex and μ-strongly convex objective functions, respectively. In the convex case, the convergence is ergodic. Apart from SVRG, the other methods share the idea of reducing the variance by aggregating different stochastic estimates of the gradient. Among them we mention SAG [39] and SAGA [14]. In subsequent years, a plethora of papers have appeared on variance reduction techniques; see, for example, [28, 17, 45, 33]. The paper [20] provided a unified study of variance reduction techniques for SGD that encompasses many of them, and it has inspired our unified study of variance reduction for SPPA.

Variance-reduced Stochastic Proximal Point Algorithms.

The application of variance reduction techniques to SPPA is very recent and limited: the existing methods are Point-SAGA in [13], the proximal version of L-SVRG [28], and SNSPP proposed in [30]. All existing convergence results are provided in the smooth case, except for Point-SAGA in [13], where an ergodic and sublinear convergence rate with a constant step size is provided for nonsmooth strongly convex functions.

Contributions.

Our contribution can be summarized as follows:

  • Assuming that the functions f_i are smooth, we propose a unified variance reduction technique for the stochastic proximal point algorithm (SPPA). We devise a unified analysis that extends several variance reduction techniques used for SGD to SPPA, as listed in Section 4, with improved rates over SPPA. In particular, we prove a sublinear convergence rate O(1/k) of the function values for convex functions. The analysis in the convex case is new in the literature. Assuming additionally that the objective function F satisfies the Polyak-Łojasiewicz (PL) condition, we prove linear convergence rates both for the iterates and for the function values. The PL condition on F is weaker than the strong convexity of F, or even of the f_i, used in related previous work. Finally, we show that these results are achieved with constant step sizes.

  • As a byproduct, we derive and analyze some stochastic variance-reduced proximal point algorithms, in analogy to SVRG [22], SAGA [14], and L-SVRG [28].

  • The experiments show that, in most cases and especially for difficult problems, the proposed methods are more robust to the choice of the step size and converge with larger step sizes, while retaining at least the same speed of convergence as their gradient counterparts. This generalizes the advantages of SPPA over SGD (see [3, 26]) to the variance reduction setting.

Organization.

The rest of the paper is organized as follows: In Section 2, we present our generic algorithm and the assumptions we will need in subsequent sections. In Section 3, we show the results pertaining to that algorithm. Then, in Section 4, we specialize the general results to particular variance reduction algorithms. Section 5 collects our numerical experiments. Proofs of auxiliary results can be found in Appendix A.

2 Algorithm and assumptions

2.1 Notation

We first introduce the notation and recall a few basic notions that will be needed throughout the paper. We denote by ℕ the set of natural numbers (including zero) and by ℝ₊ = [0, +∞[ the set of nonnegative real numbers. For every integer ℓ ≥ 1, we define [ℓ] = {1, …, ℓ}. H is a Hilbert space endowed with scalar product ⟨·,·⟩ and induced norm ‖·‖. If S ⊂ H is convex and closed and x ∈ H, we set dist(x, S) = inf_{z ∈ S} ‖x − z‖. The projection of x onto S is denoted by P_S(x).

Bold font is used for random variables taking values in H, while bold sans serif font is used for their realizations or for deterministic variables in H. The probability space underlying the random variables is denoted by (Ω, 𝔄, P). For every random variable x, E[x] denotes its expectation, while, if 𝔉 ⊂ 𝔄 is a sub-σ-algebra, E[x | 𝔉] denotes the conditional expectation of x given 𝔉. Also, σ(y) is the σ-algebra generated by the random variable y.

Let φ : H → ℝ be a function. The set of minimizers of φ is argmin φ = {x ∈ H | φ(x) = inf φ}. If inf φ is finite, it is denoted by φ_*. When φ is differentiable, ∇φ denotes its gradient. We recall that the proximity operator of φ is defined as prox_φ(x) = argmin_{y ∈ H} φ(y) + (1/2)‖y − x‖².

In this work, ℓ¹ denotes the space of sequences whose norms are summable and ℓ² the space of sequences whose norms are square summable.
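As a concrete sanity check on the definition of the proximity operator, the following sketch (our own illustration, not from the paper) verifies the closed form prox_φ(x) = x/(1 + c) for φ(y) = (c/2)‖y‖² against the optimality condition 0 = c·y + (y − x) of the proximal subproblem:

```python
import numpy as np

# Hypothetical example: for phi(y) = (c/2)*||y||^2 the proximal subproblem
#   min_y  (c/2)*||y||^2 + (1/2)*||y - x||^2
# has optimality condition 0 = c*y + (y - x), hence prox_phi(x) = x / (1 + c).
def prox_quadratic(x, c):
    return x / (1.0 + c)

x = np.array([3.0, -1.5])
c = 2.0
y = prox_quadratic(x, c)
assert np.allclose(c * y + (y - x), 0.0)  # optimality condition holds
```

For most functions of interest no such closed form exists, and the proximal subproblem must be solved numerically; the closed-form quadratic case is only meant to make the definition tangible.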

2.2 Algorithm

In this paragraph, we describe a generic method for solving problem (1.1), based on the stochastic proximal point algorithm.

Algorithm 2.1.

Let (e^k)_{k∈ℕ} be a sequence of random vectors in H and let (i_k)_{k∈ℕ} be a sequence of i.i.d. random variables uniformly distributed on {1, …, n}, such that i_k is independent of e^0, …, e^{k−1}. Let α > 0 and set the initial (deterministic) point x^0 ∈ H. Then define

for k = 0, 1, …
⌊ x^{k+1} = prox_{α f_{i_k}}(x^k + α e^k).

Algorithm 2.1 is a stochastic proximal point method with an additional term e^k, which will be suitably chosen to reduce the variance. As we shall see in Section 4, depending on the specific algorithm (SPPA, SVRP, L-SVRP, SAPA), e^k may be defined in various ways. Note that x^{k+1} is a random vector depending on x^k, i_k, and e^k. By definition of the proximity operator, we derive from Algorithm 2.1

x^{k+1} = prox_{α f_{i_k}}(x^k + α e^k)
        = argmin_{x ∈ H} f_{i_k}(x) + (1/(2α))‖x^k + α e^k − x‖².

By the optimality condition and assuming fikf_{i_{k}} is differentiable, we have

0 = ∇f_{i_k}(x^{k+1}) + (1/α)(x^{k+1} − x^k − α e^k).   (2.1)

Let w^k = ∇f_{i_k}(x^{k+1}) − e^k. Thanks to (2.1), the update in Algorithm 2.1 can be rewritten as

x^{k+1} = x^k − α[∇f_{i_k}(x^{k+1}) − e^k] = x^k − α w^k.   (2.2)

Keeping in mind that w^k depends on ∇f_{i_k}(x^{k+1}), equation (2.2) shows that Algorithm 2.1 can be seen as an implicit stochastic gradient method, in contrast to an explicit one, where w^k is replaced by

v^k := ∇f_{i_k}(x^k) − e^k.   (2.3)

This point of view has been exploited in [20, Appendix A], to provide a unified theory for variance reduction techniques for SGD.
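The stability gained by the implicit form can be seen on a one-dimensional toy problem (our own illustration, with e^k = 0 and f(x) = (L/2)x²): the explicit step contracts only when α < 2/L, while the implicit (proximal) step contracts for every α > 0.

```python
# Toy comparison (our own illustration): f(x) = (L/2)*x^2, e^k = 0.
# Explicit step:  x_{k+1} = x_k - alpha*L*x_k        = (1 - alpha*L) * x_k
# Implicit step:  x_{k+1} = x_k - alpha*L*x_{k+1}  =>  x_{k+1} = x_k / (1 + alpha*L)
L_, alpha, x0 = 10.0, 1.0, 1.0   # alpha is far above the explicit threshold 2/L = 0.2

x_exp = x_imp = x0
for _ in range(50):
    x_exp = (1.0 - alpha * L_) * x_exp   # multiplies by -9 at every step
    x_imp = x_imp / (1.0 + alpha * L_)   # multiplies by 1/11 at every step

assert abs(x_exp) > 1e10    # explicit iteration diverges
assert abs(x_imp) < 1e-10   # implicit iteration converges to the minimizer 0
```

This is the mechanism behind the robustness of SPPA to large step sizes mentioned in the introduction: the implicit step never overshoots, whatever the value of α.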

Remark 2.2.

Due to the dependence of x^{k+1} on i_k, in general E[∇f_{i_k}(x^{k+1}) | x^k] ≠ ∇F(x^{k+1}) and E[f_{i_k}(x^{k+1}) | x^k] ≠ F(x^{k+1}), making the analysis of Algorithm 2.1 tricky. We circumvent this problem by using v^k = ∇f_{i_k}(x^k) − e^k as an auxiliary variable; see (A.3). This is helpful since x^k is independent of i_k. Therefore, even though v^k does not appear explicitly in Algorithm 2.1, it is still relevant in the convergence analysis, and some assumptions will be required on v^k.

Remark 2.3.

Variance reduction techniques have already been combined with the proximal gradient algorithm to solve structured problems of the form ∑_i f_i + g, where the f_i are assumed to be smooth and g is prox-friendly. Indeed, this is the model corresponding to regularized empirical risk minimization [41]. For this class of objective functions, the stochastic proximal gradient algorithm reads as follows:

x^{k+1} = prox_{γ_k g}(x^k − γ_k ∇f_{i_k}(x^k)).

As can be seen from the definition, no sum structure is assumed on the function g, which is the one activated through the proximity operator. On the contrary, stochasticity arises from the computation of the gradient of ∑_i f_i, and variance reduction techniques can be exploited at this level. In this paper we tackle the same problem, in the special case g = 0, but we analyze a different algorithm, where the functions f_i are activated through their proximity operators instead of their gradients. The addition of a regularizer g could be considered, but at the moment our analysis requires differentiability.

2.3 Assumptions

The first assumptions are made on the functions f_i : H → ℝ, i ∈ [n], as well as on the objective function F : H → ℝ.

Assumptions 2.4.
  1. (A.i)

    argmin F ≠ ∅.

  2. (A.ii)

    For all i ∈ [n], f_i is convex and L-smooth, i.e., differentiable and such that

    (∀x, y ∈ H)  ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖

    for some L > 0. As a consequence, F is convex and L-smooth.

  3. (A.iii)

    F satisfies the PL condition with constant μ > 0, i.e.,

    (∀x ∈ H)  F(x) − F_* ≤ (1/(2μ))‖∇F(x)‖²,   (2.4)

    which is equivalent, when F is convex, to the following quadratic growth condition:

    (∀x ∈ H)  (μ/2) dist(x, argmin F)² ≤ F(x) − F_*.   (2.5)

Assumptions (A.i) and (A.ii) constitute the common assumptions that we use for all the convergence results presented in Sections 3 and 4. Assumption (A.iii) is the Polyak-Łojasiewicz condition, introduced in [29] (see also [35]); it is closely connected with the quadratic growth condition (2.5), the two being equivalent in the convex setting (see, e.g., [9]). Conditions (2.4) and (2.5) are both relaxations of strong convexity and are key tools in establishing linear convergence for many iterative schemes, both in the convex [23, 9, 18, 15] and in the non-convex setting [35, 36, 8, 5, 2].

In the same fashion, in this work, Assumption (A.iii) will be used to deduce linear convergence rates in terms of objective function values for the sequence generated by Algorithm 2.1.
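To make concrete how (A.iii) is weaker than strong convexity, the following sketch (our own numerical example) checks the PL inequality (2.4) for a rank-deficient convex quadratic, which is not strongly convex; for such a quadratic the PL constant μ can be taken as the smallest nonzero eigenvalue of the Hessian.

```python
import numpy as np

# Our own illustration (not from the paper): F(x) = 0.5 * x^T Q x with Q
# positive semidefinite and rank deficient is NOT strongly convex, yet it
# satisfies the PL inequality (2.4) with mu = smallest *nonzero* eigenvalue
# of Q, since ||grad F(x)||^2 = x^T Q^2 x >= mu * x^T Q x = 2*mu*(F(x) - F_*).
rng = np.random.default_rng(0)
d, r = 6, 3                      # dimension 6, rank 3 => not strongly convex
U = rng.standard_normal((d, r))
Q = U @ U.T                      # PSD matrix of rank 3
eigs = np.linalg.eigvalsh(Q)
mu = min(e for e in eigs if e > 1e-8)   # smallest nonzero eigenvalue

F = lambda x: 0.5 * x @ Q @ x    # here F_* = 0
grad = lambda x: Q @ x

for _ in range(100):
    x = rng.standard_normal(d)
    assert F(x) <= grad(x) @ grad(x) / (2 * mu) + 1e-8  # PL condition (2.4)
```
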

Assumptions 2.5.

Let, for all k ∈ ℕ, v^k := ∇f_{i_k}(x^k) − e^k. Then there exist nonnegative real numbers A, B, C ∈ ℝ₊ and ρ ∈ [0,1], and a non-positive real-valued random variable D, such that, for every k ∈ ℕ,

  1. (B.i)

    E[e^k | 𝔉_k] = 0  a.s.,

  2. (B.ii)

    E[‖v^k‖² | 𝔉_k] ≤ 2A(F(x^k) − F_*) + Bσ_k² + D  a.s.,

  3. (B.iii)

    E[σ_{k+1}²] ≤ (1 − ρ)E[σ_k²] + 2C E[F(x^k) − F_*],

where σ_k is a real-valued random variable and (𝔉_k)_{k∈ℕ} is a sequence of σ-algebras such that, for all k ∈ ℕ, 𝔉_k ⊂ 𝔉_{k+1} ⊂ 𝔄, i_{k−1} and x^k are 𝔉_k-measurable, and i_k is independent of 𝔉_k.

Assumption (B.i) ensures that E[v^k | 𝔉_k] = E[∇f_{i_k}(x^k) | 𝔉_k] = ∇F(x^k), so that the direction v^k is an unbiased estimator of the full gradient of F at x^k, a standard assumption in the related literature. Assumption (B.ii) on E[‖v^k‖² | 𝔉_k] is the analogue of what is called, in the literature [25, 20], the ABC condition on E[‖∇f_{i_k}(x^k)‖² | 𝔉_k], with σ_k = ‖∇F(x^k)‖ and D constant (see also [20]). Assumption (B.iii) is justified by the fact that it is needed for the theoretical analysis and is satisfied by many variance reduction techniques. For additional discussion of these assumptions, especially (B.iii), see [20].
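For intuition on Assumption (B.i), the following sketch (our own check, using an SVRG-style choice of e^k built from an anchor point x̃, in the spirit of the schemes listed in Section 4) verifies numerically that e^k = ∇f_{i_k}(x̃) − ∇F(x̃) averages to zero when i_k is uniform on {1, …, n}:

```python
import numpy as np

# Our own check, under an assumed SVRG-style choice: with an anchor point
# x_tilde, set e^k = grad f_{i_k}(x_tilde) - grad F(x_tilde). Averaging over
# i_k uniform on {1,...,n} gives E[e^k] = grad F(x_tilde) - grad F(x_tilde) = 0,
# which is exactly condition (B.i).
rng = np.random.default_rng(0)
n, d = 5, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_fi(x, i):                      # f_i(x) = 0.5*(<a_i, x> - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def grad_F(x):                          # full gradient: average of the grad f_i
    return sum(grad_fi(x, i) for i in range(n)) / n

x_tilde = rng.standard_normal(d)
e_mean = sum(grad_fi(x_tilde, i) - grad_F(x_tilde) for i in range(n)) / n
assert np.allclose(e_mean, 0.0)         # (B.i) holds for this choice of e^k
```
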

3 Main results

In the rest of the paper, we will always suppose that Assumption (A.i) holds.

Before stating the main results of this work, we start with a technical proposition that constitutes the cornerstone of our analysis. The proof can be found in Appendix A.1.

Proposition 3.1.

Suppose that Assumptions 2.5 and (A.ii) hold and that the sequence (x^k)_{k∈ℕ} is generated by Algorithm 2.1. Let M > 0. Then, for all k ∈ ℕ,

E[dist(x^{k+1}, argmin F)²] + α²M E[σ_{k+1}²] ≤ E[dist(x^k, argmin F)²] + α²[M + B − ρM] E[σ_k²]
    − 2α[1 − α(A + MC)] E[F(x^k) − F_*]
    + α² E[D].

We now state two theorems that can be derived from the previous proposition. The first deals with the case where the function F is merely convex.

Theorem 3.2.

Suppose that Assumptions (A.ii) and 2.5 hold with ρ > 0 and that the sequence (x^k)_{k∈ℕ} is generated by Algorithm 2.1. Let M > 0 and α > 0 be such that M ≥ B/ρ and α < 1/(A + MC). Then, for all k ≥ 1,

E[F(x̄^k) − F_*] ≤ (dist(x^0, argmin F)² + α²M E[σ_0²]) / (2αk[1 − α(A + MC)]),

with x̄^k = (1/k) ∑_{t=0}^{k−1} x^t.

Proof.

Since B − ρM ≤ 0 and E[D] ≤ 0, it follows from Proposition 3.1 that

2α[1 − α(A + MC)] E[F(x^k) − F_*] ≤ E[dist(x^k, argmin F)²] + α²M E[σ_k²]
    − (E[dist(x^{k+1}, argmin F)²] + α²M E[σ_{k+1}²]).

Summing from 0 up to k − 1 and dividing both sides by k, we obtain

2α[1 − α(A + MC)] (1/k) ∑_{t=0}^{k−1} E[F(x^t) − F_*] ≤ (1/k)(dist(x^0, argmin F)² + α²M E[σ_0²])
    − (1/k)(E[dist(x^k, argmin F)²] + α²M E[σ_k²])
  ≤ (1/k)(dist(x^0, argmin F)² + α²M E[σ_0²]).

Finally, by convexity of F and Jensen's inequality, we get

E[F(x̄^k) − F_*] ≤ (dist(x^0, argmin F)² + α²M E[σ_0²]) / (2αk[1 − α(A + MC)]). ∎

The next theorem shows that, when F additionally satisfies the PL property (2.4), the sequence generated by Algorithm 2.1 exhibits a linear convergence rate, both in terms of the distance to the set of minimizers and of the objective function values.

Theorem 3.3.

Suppose that Assumptions 2.4 and 2.5 hold with ρ > 0 and that the sequence (x^k)_{k∈ℕ} is generated by Algorithm 2.1. Let M > B/ρ and let α > 0 be such that α < 1/(A + MC). Set q := max{1 − αμ(1 − α(A + MC)), 1 + B/M − ρ}. Then q ∈ ]0,1[ and, for all k ∈ ℕ,

V^{k+1} ≤ q V^k,

with V^k = E[dist(x^k, argmin F)²] + α²M E[σ_k²] for all k ∈ ℕ. Moreover, for every k ∈ ℕ,

E[dist(x^k, argmin F)²] ≤ q^k (dist(x^0, argmin F)² + α²M E[σ_0²]),
E[F(x^k) − F_*] ≤ (q^k L/2)(dist(x^0, argmin F)² + α²M E[σ_0²]).
Proof.

Since α < 1/(A + MC), we obtain, thanks to Assumption (A.iii) and Proposition 3.1,

E[dist(x^{k+1}, argmin F)²] + α²M E[σ_{k+1}²] ≤ [1 − αμ(1 − α(A + MC))] E[dist(x^k, argmin F)²]
    + α²M[1 + B/M − ρ] E[σ_k²].   (3.1)

Since α < 1/(A + MC) and M > B/ρ, we obtain from (3.1) that

E[dist(x^{k+1}, argmin F)²] + α²M E[σ_{k+1}²]
  ≤ max{1 − αμ(1 − α(A + MC)), 1 + B/M − ρ} (E[dist(x^k, argmin F)²] + α²M E[σ_k²]).

Since 0 < 1 − ρ ≤ 1 + B/M − ρ < 1, it is clear that q ∈ ]0,1[. Iterating down on k, we obtain

E[dist(x^k, argmin F)²] ≤ q^k (dist(x^0, argmin F)² + α²M E[σ_0²]).   (3.2)

Let x* ∈ argmin F. Since F is L-smooth, the Descent Lemma [32, Lemma 1.2.3] yields

(∀x ∈ H)  F(x) − F_* ≤ (L/2)‖x − x*‖².

In particular,

(∀x ∈ H)  F(x) − F_* ≤ (L/2) dist(x, argmin F)².   (3.3)

Using (3.3) with x^k in (3.2), we get

E[F(x^k) − F_*] ≤ (q^k L/2)(dist(x^0, argmin F)² + α²M E[σ_0²]). ∎
Remark 3.4.
  1. (i)

    The convergence rate of order O(1/k) for the general variance reduction scheme of Algorithm 2.1, with constant step size, as stated in Theorem 3.2, improves and extends the one found for the vanilla stochastic proximal point method (see, e.g., [3, Proposition 3.8]). It is important to mention that convergence with a constant step size no longer holds when D in Assumption (B.ii) is positive. As we shall see in Section 4, several choices of the variance reduction term e^k in Algorithm 2.1 are beneficial in this respect, since they satisfy Assumption (B.ii) with D ≤ 0.

  2. (ii)

    The linear rates stated in Theorem 3.3 are similar to the ones found in [20, Theorem 4.1], where the authors present a unified study of variance-reduced stochastic (explicit) gradient methods. However, we note that the Polyak-Łojasiewicz condition (2.4) on F used here is slightly weaker than the quasi-strong convexity used in [20, Assumption 4.2].

4 Derivation of stochastic proximal point type algorithms

In this section, we provide and analyze several instances of the general scheme of Algorithm 2.1, corresponding to different choices of the variance reduction term e^k. In particular, in the next paragraphs we describe four different schemes, namely the Stochastic Proximal Point Algorithm (SPPA), the Stochastic Variance-Reduced Proximal algorithm (SVRP), Loopless SVRP (L-SVRP), and the Stochastic Average Proximal Algorithm (SAPA).

4.1 Stochastic Proximal Point Algorithm

We start by presenting the classic vanilla stochastic proximal point method (SPPA); see, e.g., [6, 7, 34]. We suppose that Assumptions (A.i) and (A.ii) hold and that, for all k ∈ ℕ,

σ_k² := sup_{x* ∈ argmin F} (1/n) ∑_{i=1}^{n} ‖∇f_i(x*)‖² < +∞.   (4.1)
Algorithm 4.1 (SPPA).

Let (i_k)_{k∈ℕ} be a sequence of i.i.d. random variables uniformly distributed on {1, …, n}. Let α_k > 0 for all k ∈ ℕ and set the initial point x^0 ∈ H. Define

for k = 0, 1, …
⌊ x^{k+1} = prox_{α_k f_{i_k}}(x^k).   (4.2)

Algorithm 4.1 can be directly identified with the general scheme of Algorithm 2.1 by setting e^k = 0 and v^k = ∇f_{i_k}(x^k). The following lemma provides a bound in expectation on the sequence ‖v^k‖ and can be found in the related literature; see, e.g., [40, Lemma 1].
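For f_i(x) = (1/2)(⟨a_i, x⟩ − b_i)², the proximal subproblem in (4.2) has a closed-form solution, so SPPA can be run exactly. The following sketch (our own illustration on a consistent least-squares problem, where the variance term (4.1) vanishes) is one minimal instance of Algorithm 4.1:

```python
import numpy as np

# Minimal SPPA sketch (our own example) on a consistent least-squares problem:
# for f_i(x) = 0.5*(<a_i, x> - b_i)^2, the optimality condition of the proximal
# subproblem gives the closed form
#   prox_{alpha f_i}(x) = x - alpha*(<a_i, x> - b_i)/(1 + alpha*||a_i||^2) * a_i.
rng = np.random.default_rng(1)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                         # b = A x_star, so F_* = 0 and (4.1) is 0

def prox_fi(x, i, alpha):
    r = A[i] @ x - b[i]
    return x - alpha * r / (1.0 + alpha * (A[i] @ A[i])) * A[i]

x = np.zeros(d)
alpha = 0.5                            # constant step size, for illustration
for _ in range(5000):
    x = prox_fi(x, int(rng.integers(n)), alpha)

assert np.linalg.norm(x - x_star) < 1e-3   # converges on this interpolation problem
```

Note that convergence with a constant step size is possible here only because the system is consistent, so that the gradients at the minimizer all vanish; for general problems, Theorem 4.3 requires the step sizes (α_k) to decrease.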

Lemma 4.2.

Suppose that Assumption (A.ii) holds, that (x^k)_{k∈ℕ} is a sequence generated by Algorithm 4.1, and that v^k = ∇f_{i_k}(x^k). Then, for all k ∈ ℕ, it holds that

E[‖v^k‖² | 𝔉_k] ≤ 4L(F(x^k) − F_*) + 2σ_k²,
E[σ_{k+1}²] = σ_{k+1}² = σ_k² = E[σ_k²],

where 𝔉_k = σ(i_0, …, i_{k−1}) and σ_k² is the constant random variable defined in (4.1).

From Lemma 4.2 we immediately see that Assumptions 2.5 hold with A = 2L, B = 2, C = ρ = 0, and D ≡ 0. In this setting, we are able to recover the following convergence result (see also [3, Lemma 3.10 and Proposition 3.8]).

Theorem 4.3.

Suppose that Assumption (A.ii) holds and let (α_k)_{k∈ℕ} be a positive real-valued sequence such that α_k ≤ 1/(4L) for all k ∈ ℕ. Suppose also that the sequence (x^k)_{k∈ℕ} is generated by Algorithm 4.1 with the step sizes (α_k)_{k∈ℕ}. Then, for all k ≥ 1,

𝖤[F(¯𝒙k)F]dist(𝒙0,argminF)2t=0k1αt+2σ12t=0k1αt2t=0k1αt,\mathsf{E}[F(\bm{\bar{}}{\bm{x}}^{k})-F_{*}]\leq\frac{\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}+2\sigma^{2}_{1}\frac{\sum_{t=0}^{k-1}\alpha_{t}^{2}}{\sum_{t=0}^{k-1}\alpha_{t}},

where ¯𝐱k=t=0k1αtt=0k1αt𝐱t\displaystyle\bm{\bar{}}{\bm{x}}^{k}=\sum_{t=0}^{k-1}\frac{\alpha_{t}}{\sum_{t=0}^{k-1}\alpha_{t}}\bm{x}^{t}.

Proof.

From Proposition 3.1 adapted to the update (4.2), it follows that

2αk[12αkL]𝖤[F(𝒙k)F]𝖤[dist(𝒙k,argminF)]2𝖤[dist(𝒙k+1,argminF)]2+2αk2σk2.2\alpha_{k}\left[1-2\alpha_{k}L\right]\mathsf{E}[F(\bm{x}^{k})-F_{*}]\leq\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2}-\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)]^{2}+2\alpha_{k}^{2}\sigma^{2}_{k}. (4.3)

Since αk1/4L\alpha_{k}\leq 1/4L, for all kk\in\mathbb{N}, we have from (4.3),

αk𝖤[F(𝒙k)F]\displaystyle\alpha_{k}\mathsf{E}[F(\bm{x}^{k})-F_{*}] 𝖤[dist(𝒙k,argminF)]2𝖤[dist(𝒙k+1,argminF)]2+2αk2σk2\displaystyle\leq\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2}-\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)]^{2}+2\alpha_{k}^{2}\sigma^{2}_{k}
=𝖤[dist(𝒙k,argminF)]2𝖤[dist(𝒙k+1,argminF)]2+2αk2σ12.\displaystyle=\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2}-\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)]^{2}+2\alpha_{k}^{2}\sigma^{2}_{1}.

Let k1k\geq 1. Summing from 0 up to k1k-1 and dividing both sides by t=0k1αt\sum_{t=0}^{k-1}\alpha_{t}, we obtain

t=0k1αtt=0k1αt𝖤[F(𝒙t)F]\displaystyle\sum_{t=0}^{k-1}\frac{\alpha_{t}}{\sum_{t=0}^{k-1}\alpha_{t}}\mathsf{E}[F(\bm{x}^{t})-F_{*}] 1t=0k1αt(dist(𝒙0,argminF)2𝖤[dist(𝒙k,argminF)]2)\displaystyle\leq\frac{1}{\sum_{t=0}^{k-1}\alpha_{t}}\left(\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}-\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2}\right)
+2σ12t=0k1αt2t=0k1αt\displaystyle\qquad+2\sigma^{2}_{1}\frac{\sum_{t=0}^{k-1}\alpha_{t}^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}
dist(𝒙0,argminF)2t=0k1αt+2σ12t=0k1αt2t=0k1αt.\displaystyle\leq\frac{\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}+2\sigma^{2}_{1}\frac{\sum_{t=0}^{k-1}\alpha_{t}^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}.

Finally, by convexity of FF and Jensen’s inequality, we obtain

𝖤[F(¯𝒙k)F]dist(𝒙0,argminF)2t=0k1αt+2σ12t=0k1αt2t=0k1αt.\mathsf{E}[F(\bm{\bar{}}{\bm{x}}^{k})-F_{*}]\leq\frac{\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}+2\sigma^{2}_{1}\frac{\sum_{t=0}^{k-1}\alpha_{t}^{2}}{\sum_{t=0}^{k-1}\alpha_{t}}.\qed

4.2 Stochastic Variance-Reduced Proximal point algorithm

In this paragraph, we present a Stochastic Proximal Point Algorithm coupled with a variance reduction term, in the spirit of the Stochastic Variance Reduced Gradient (SVRG) method introduced in [22]. We call it the Stochastic Variance-Reduced Proximal point algorithm (SVRP).

The SVRP method involves two levels of iterative procedure: outer iterations and inner iterations. We stress that the framework presented in the previous section covers only the inner iteration procedure, and thus the convergence analysis of SVRP requires additional care. In contrast to the subsequent schemes, Theorems 3.2 and 3.3 do not apply directly to SVRP. In particular, as can be noted below, in the case of SVRP the constant ρ\rho appearing in (B.iii) of Assumption 2.5 is null. Nevertheless, the convergence analysis still relies on Proposition 3.1.

Algorithm 4.4 (SVRP).

Let mm\in\mathbb{N}, with m1m\geq 1, and (ξs)s(\xi_{s})_{s\in\mathbb{N}}, (it)t(i_{t})_{t\in\mathbb{N}} be two independent sequences of i.i.d. random variables uniformly distributed on {0,1,,m1}\{0,1,\dots,m-1\} and {1,,n}\{1,\dots,n\} respectively. Let α>0\alpha>0 and set the initial point 𝒙~0𝘅~0𝖧\tilde{\bm{x}}^{0}\equiv\tilde{\bm{\mathsf{x}}}^{0}\in\mathsf{H}. Then

fors=0,1,𝒙0=𝒙~sfork=0,,m1𝒙k+1=proxαfism+k(𝒙k+αfism+k(𝒙~s)αF(𝒙~s))𝒙~s+1=k=0m1δk,ξs𝒙k,or𝒙~s+1=1mk=0m1𝒙k,\begin{array}[]{l}\text{for}\;s=0,1,\ldots\\ \left\lfloor\begin{array}[]{l}\bm{x}^{0}=\tilde{\bm{x}}^{s}\\ \text{for}\;k=0,\dots,m-1\\[3.01385pt] \left\lfloor\begin{array}[]{l}\bm{x}^{k+1}=\text{{prox}}_{\alpha f_{i_{sm+k}}}\left(\bm{x}^{k}+\alpha\nabla f_{i_{sm+k}}(\tilde{\bm{x}}^{s})-\alpha\nabla F(\tilde{\bm{x}}^{s})\right)\end{array}\right.\\ \tilde{\bm{x}}^{s+1}=\sum_{k=0}^{m-1}\delta_{k,\xi_{s}}\bm{x}^{k},\\ \text{or}\\ {\tilde{\bm{x}}^{s+1}=\frac{1}{m}\sum_{k=0}^{m-1}\bm{x}^{k},}\end{array}\right.\end{array}

where δk,h\delta_{k,h} is the Kronecker symbol. With the first option, one iterate is randomly selected among the inner iterates 𝒙0,𝒙1,,𝒙m1\bm{x}^{0},\bm{x}^{1},\dots,\bm{x}^{m-1}, possibly discarding much of the information computed in the inner loop. With the second option, the inner iterates are averaged and most of that information is used.
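As an illustration, the following sketch instantiates Algorithm 4.4 on the least-squares losses of Section 5.1.2, whose proximity operator has the closed form recalled there, and uses the averaged option for 𝒙~s+1; the function name and signature are ours and purely illustrative.

```python
import numpy as np

def svrp(A, b, alpha, m, n_outer, rng):
    # SVRP (Algorithm 4.4) for F(x) = (1/2n) ||Ax - b||^2, with the
    # closed-form prox of f_i(z) = 0.5 * (<a_i, z> - b_i)^2.
    n, d = A.shape
    x_tilde = np.zeros(d)
    for s in range(n_outer):
        grad_F = A.T @ (A @ x_tilde - b) / n       # full gradient at x~^s
        x, acc = x_tilde.copy(), np.zeros(d)
        for k in range(m):
            acc += x                               # accumulate x^k, k = 0..m-1
            i = rng.integers(n)
            a_i, b_i = A[i], b[i]
            g_i = (a_i @ x_tilde - b_i) * a_i      # grad f_i(x~^s)
            y = x + alpha * (g_i - grad_F)         # variance-reduced shift
            x = y + alpha * (b_i - a_i @ y) / (alpha * (a_i @ a_i) + 1.0) * a_i
        x_tilde = acc / m                          # averaged outer update
    return x_tilde
```

Note that each outer iteration costs one full gradient plus mm stochastic prox evaluations, which is the accounting used in the experiments of Section 5.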

Let ss\in\mathbb{N}. In this case, for all k{0,1,,m1}k\in\{0,1,\cdots,m-1\}, setting jk=ism+kj_{k}=i_{sm+k}, the inner iteration procedure of Algorithm 4.4 can be identified with the general scheme 2.1 by setting 𝒆kfjk(𝒙~s)F(𝒙~s)\bm{e}^{k}\coloneqq\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla F(\tilde{\bm{x}}^{s}). In addition, let us define

σk21ni=1nfi(𝒙~s)fi(𝒚~s)2\sigma_{k}^{2}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_{i}(\tilde{\bm{x}}^{s})-\nabla f_{i}(\tilde{\bm{y}}^{s})\right\|^{2} (4.4)

where 𝒚~sargminF\tilde{\bm{y}}^{s}\in\operatorname*{\text{\rm{argmin}}}F is such that 𝒙~s𝒚~s=dist(𝒙~s,argminF)\lVert\tilde{\bm{x}}^{s}-\tilde{\bm{y}}^{s}\rVert=\operatorname*{\text{\rm{dist}}}(\tilde{\bm{x}}^{s},\operatorname*{\text{\rm{argmin}}}F). Moreover, setting 𝔉s,k=σ(ξ0,,ξs1,i0,,ism+k1)\mathfrak{F}_{s,k}=\sigma(\xi_{0},\dots,\xi_{s-1},i_{0},\dots,i_{sm+k-1}), we have that 𝒙~s,𝒚~s\tilde{\bm{x}}^{s},\tilde{\bm{y}}^{s}, and 𝒙k\bm{x}^{k} are 𝔉s,k\mathfrak{F}_{s,k}-measurable and jkj_{k} is independent of 𝔉s,k\mathfrak{F}_{s,k}. The following result is proved in Appendix A.2.

Lemma 4.5.

Suppose that Assumption (A.ii) holds true. Let ss\in\mathbb{N} and let (𝐱k)k[m](\bm{x}^{k})_{k\in[m]} be the (finite) sequence generated by the inner iteration in Algorithm 4.4. Set 𝐯k=fjk(𝐱k)fjk(𝐱~s)+F(𝐱~s)\bm{v}^{k}=\nabla f_{j_{k}}\left(\bm{x}^{k}\right)-\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})+\nabla F(\tilde{\bm{x}}^{s}) and σk\sigma_{k} as defined in (4.4). Then, for every k{0,1,,m1}k\in\{0,1,\cdots,m-1\}, it holds

𝖤[𝒗k2|𝔉s,k]4L(F(𝒙k)F)+2σk22F(𝒙~s)2,\mathsf{E}[\|\bm{v}^{k}\|^{2}\,|\,\mathfrak{F}_{s,k}]\leq 4L\big{(}F(\bm{x}^{k})-F_{*}\big{)}+2\sigma_{k}^{2}-2\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}, (4.5)

and

𝖤[σk+12|𝔉s,0]=𝖤[σk2|𝔉s,0].\mathsf{E}\big{[}\sigma_{k+1}^{2}\,|\,\mathfrak{F}_{s,0}\big{]}=\mathsf{E}\big{[}\sigma_{k}^{2}\,|\,\mathfrak{F}_{s,0}\big{]}.

As an immediate consequence of Proposition 3.1, we have the following corollary regarding the inner iteration procedure of Algorithm 4.4.

Corollary 4.6.

Suppose that Assumption (A.ii) holds. Let ss\in\mathbb{N} and let (𝐱k)k[m](\bm{x}^{k})_{k\in[m]} be the sequence generated by the inner iteration in Algorithm 4.4. Then, for all k{0,1,,m1}k\in\{0,1,\cdots,m-1\},

𝖤[dist(𝒙k+1,argminF)2]\displaystyle\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)^{2}] 𝖤[dist(𝒙k,argminF)2]\displaystyle\leq\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)^{2}] (4.6)
2α(12αL)𝖤[F(𝒙k)F]\displaystyle\qquad-2\alpha\left(1-2\alpha L\right)\mathsf{E}[F(\bm{x}^{k})-F_{*}]
+4Lα2𝖤[F(𝒙~s)F]2α2𝖤[F(𝒙~s)2].\displaystyle\qquad+4L\alpha^{2}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]-2\alpha^{2}\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}].
Proof.

Since Assumption (A.ii) is true, by Lemma 4.5 Assumptions 2.5 are satisfied with 𝒗k=fjk(𝒙k)fjk(𝒙~s)+F(𝒙~s),A=2L,B=2,ρ=C=0,\bm{v}^{k}=\nabla f_{j_{k}}\left(\bm{x}^{k}\right)-\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})+\nabla F(\tilde{\bm{x}}^{s}),A=2L,B=2,\rho=C=0, and D=2F(𝒙~s)2D=-2\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}. So Proposition 3.1 yields

𝖤[dist(𝒙k+1,argminF)2|𝔉s,0]\displaystyle\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)^{2}\,|\,\mathfrak{F}_{s,0}] 𝖤[dist(𝒙k,argminF)2|𝔉s,0]\displaystyle\leq\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)^{2}\,|\,\mathfrak{F}_{s,0}]
2α(12αL)𝖤[F(𝒙k)F|𝔉s,0]\displaystyle\qquad-2\alpha\left(1-2\alpha L\right)\mathsf{E}[F(\bm{x}^{k})-F_{*}\,|\,\mathfrak{F}_{s,0}]
+4Lα2𝖤[F(𝒙~s)F|𝔉s,0]2α2𝖤[F(𝒙~s)2|𝔉s,0].\displaystyle\qquad+4L\alpha^{2}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\,|\,\mathfrak{F}_{s,0}\right]-2\alpha^{2}\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}\,|\,\mathfrak{F}_{s,0}].

Thus, taking the total expectation, the statement follows. ∎

The next theorem shows that, under some additional assumptions on the choice of the step size α\alpha and the number of inner iterations mm\in\mathbb{N}, Algorithm 4.4 yields a linear convergence rate in terms of the expectation of the objective function values of the outer iterates (𝒙~s)s(\tilde{\bm{x}}^{s})_{s\in\mathbb{N}}.

Theorem 4.7.

Suppose that Assumptions 2.4 are satisfied and that the sequence (𝐱~s)s(\tilde{\bm{x}}^{s})_{s\in\mathbb{N}} is generated by Algorithm 4.4 with

0<α<12(2Lμ)andm>1μα(12α(2Lμ)).0<\alpha<\frac{1}{2(2L-\mu)}\quad\text{and}\quad m>\frac{1}{\mu\alpha(1-2\alpha(2L-\mu))}. (4.7)

Then, for all ss\in\mathbb{N}, it holds

𝖤[F(𝒙~s+1)F]qs(F(𝒙0)F),withq(1μα(12Lα)m+2α(Lμ)12Lα)<1.\mathsf{E}\left[F\left(\tilde{\bm{x}}^{s+1}\right)-F_{*}\right]\leq q^{s}\left(F(\bm{x}^{0})-F_{*}\right),\quad\text{with}\quad q\coloneqq\left(\frac{1}{\mu\alpha(1-2L\alpha)m}+\frac{2\alpha(L-\mu)}{1-2L\alpha}\right)<1. (4.8)
Remark 4.8.
  1. (i)

Conditions (4.7) are stated in this form for when α\alpha is set first and mm is chosen afterwards. They are needed to ensure that q<1q<1. Indeed, it is clear that 0<α<12(2Lμ)0<\alpha<\frac{1}{2(2L-\mu)} is needed to have 2α(Lμ)12Lα<1\frac{2\alpha(L-\mu)}{1-2L\alpha}<1; otherwise qq cannot be less than 11. Then, once α\alpha is fixed, we need m>1μα(12α(2Lμ))m>\frac{1}{\mu\alpha(1-2\alpha(2L-\mu))}.

  2. (ii)

    Conditions (4.7) can be equivalently stated as follows

    m8(2Lμ)μand118(2κ1)/m4(2Lμ)<α1+18(2κ1)/m4(2Lμ),κ=Lμ.m\geq\frac{8(2L-\mu)}{\mu}\quad\text{and}\quad\frac{1-\sqrt{1-8(2\kappa-1)/m}}{4(2L-\mu)}<\alpha\leq\frac{1+\sqrt{1-8(2\kappa-1)/m}}{4(2L-\mu)},\quad\kappa=\frac{L}{\mu}.

    The above formulas can be useful if one prefers to set the parameter mm first and set the step size α\alpha afterwards.

  3. (iii)

The convergence rate in Theorem 4.7 establishes the improvement from the outer step ss to s+1s+1. Of course, it depends on the number of inner iterations mm. As expected, and as we can see from Equation 4.8, increasing mm improves the bound on the rate, and since mm is not bounded from above, the best choice would be to let mm go to ++\infty. In practice, there is no best choice of mm; empirically, a balance between the number of inner and outer iterations should be found. Consequently, there is no optimal choice for α\alpha either.

  4. (iv)

It is worth mentioning that the linear convergence factor qq in (4.8) is better (smaller) than the one provided in [22, Theorem 1] for the SVRG method for strongly convex functions. There, it is 1μα(12Lα)m+2αL12Lα\frac{1}{\mu\alpha(1-2L\alpha)m}+\frac{2\alpha L}{1-2L\alpha}. The linear convergence factor in (4.8) is also better than the one in [46, Proposition 3.1], and the one in [19, Theorem 1], dealing with a proximal version of SVRG for functions satisfying the PL condition (2.4). In both papers, it is 12μα(14Lα)m+4αL(m+1)(14Lα)m\frac{1}{2\mu\alpha(1-4L\alpha)m}+\frac{4\alpha L(m+1)}{(1-4L\alpha)m}. However, we note that this improvement can also be obtained for the aforementioned SVRG methods using a similar analysis.

  5. (v)

We stress that, following the lines of the proof of Theorem 4.7, the same linear convergence rate found in (4.8) also holds for the averaged iterate 𝒙~s+1=1mk=0m1𝒙k\tilde{\bm{x}}^{s+1}=\frac{1}{m}\sum_{k=0}^{m-1}\bm{x}^{k}. However, the analysis does not cover the last iterate of the inner loop.

Remark 4.9.

In [30], a variant of Algorithm 4.4 (SVRP), called SNSPP, has also been proposed and analyzed. The SNSPP algorithm includes a subroutine to compute the proximity operator, which can be useful in practice when a closed-form solution of the proximity operator is not available. Under some additional conditions on the conjugate of fif_{i}, and assuming semismoothness of the proximity mapping, [30] provides a linear convergence rate for SNSPP when FF is Lipschitz smooth and strongly convex. An ergodic sublinear convergence rate is also proved for weakly convex functions.

Proof of Theorem 4.7.

Fix a stage ss\in\mathbb{N} and let 𝒙~s+1\tilde{\bm{x}}^{s+1} be defined as in Algorithm 4.4. By summing inequality (4.6) in Corollary 4.6 over k=0,,m1k=0,\ldots,m-1 and taking the total expectation, we obtain

𝖤[dist(𝒙m,argminF)]2\displaystyle\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{m},\operatorname*{\text{\rm{argmin}}}F)]^{2} +2α(12Lα)m𝖤[F(𝒙~s+1)F]\displaystyle+2\alpha(1-2L\alpha)m\mathsf{E}[F\left(\tilde{\bm{x}}^{s+1}\right)-F_{*}] (4.9)
𝖤[dist(𝒙0,argminF)]2+4Lmα2𝖤[F(𝒙~s)F]2α2m𝖤[F(𝒙~s)2]\displaystyle\leq\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)]^{2}+4Lm\alpha^{2}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]-2\alpha^{2}m\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}]
=𝖤[dist(𝒙~s,argminF)]2+4Lmα2𝖤[F(𝒙~s)F]2α2m𝖤[F(𝒙~s)2]\displaystyle=\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\tilde{\bm{x}}^{s},\operatorname*{\text{\rm{argmin}}}F)]^{2}+4Lm\alpha^{2}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]-2\alpha^{2}m\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}]
2μ𝖤[F(𝒙~s)F]+4Lmα2𝖤[F(𝒙~s)F]2α2m𝖤[F(𝒙~s)2]\displaystyle\leq\frac{2}{\mu}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]+4Lm\alpha^{2}\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]-2\alpha^{2}m\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}]
=2(μ1+2Lmα2)𝖤[F(𝒙~s)F]2α2m𝖤[F(𝒙~s)2]\displaystyle=2\left(\mu^{-1}+2Lm\alpha^{2}\right)\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right]-2\alpha^{2}m\mathsf{E}[\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}]
2(μ1+2Lmα22μmα2)𝖤[F(𝒙~s)F].\displaystyle\leq 2\left(\mu^{-1}+2Lm\alpha^{2}-2\mu m\alpha^{2}\right)\mathsf{E}\left[F(\tilde{\bm{x}}^{s})-F_{*}\right].

In the first inequality, we used the fact that

k=0m1F(𝒙k)=mk=0m11mF(𝒙k)=mξ=0m11mF(k=0m1δk,ξ𝒙k)=m𝖤[F(𝒙~s+1)|𝔉s,m1].\sum_{k=0}^{m-1}F(\bm{x}^{k})=m\sum_{k=0}^{m-1}\frac{1}{m}F(\bm{x}^{k})=m\sum_{\xi=0}^{m-1}\frac{1}{m}F\left({\textstyle\sum_{k=0}^{m-1}\delta_{k,\xi}\bm{x}^{k}}\right)=m\mathsf{E}[F(\tilde{\bm{x}}^{s+1})\,|\,\mathfrak{F}_{s,m-1}].

Notice that relation (4.9) remains valid when choosing 𝒙~s+1=k=0m11m𝒙k\tilde{\bm{x}}^{s+1}=\sum_{k=0}^{m-1}\frac{1}{m}\bm{x}^{k} in Algorithm 4.4, using Jensen’s inequality to lower bound k=0m1F(𝒙k)\sum_{k=0}^{m-1}F(\bm{x}^{k}) by mF(𝒙~s+1)mF(\tilde{\bm{x}}^{s+1}).

The second and the last inequalities use, respectively, the quadratic growth (2.5) and the PL condition (2.4). We thus obtain

𝖤[F(𝒙~s+1)F](1μα(12Lα)m+2α(Lμ)12Lα)𝖤[F(𝒙~s)F].\mathsf{E}\left[F\left(\tilde{\bm{x}}^{s+1}\right)-F_{*}\right]\leq\left(\frac{1}{\mu\alpha(1-2L\alpha)m}+\frac{2\alpha(L-\mu)}{1-2L\alpha}\right)\mathsf{E}\left[F\left(\tilde{\bm{x}}^{s}\right)-F_{*}\right].\qed

4.3 Loopless SVRP

In this paragraph, we propose a single-loop variant of the SVRP algorithm presented previously, removing the burden of choosing the number of inner iterations. This idea is inspired by the Loopless Stochastic Variance-Reduced Gradient (L-SVRG) method proposed in [28, 21] (see also [20]); here, we present the stochastic proximal variant, which we call L-SVRP.

Algorithm 4.10 (L-SVRP).

Let (ik)k(i_{k})_{k\in\mathbb{N}} be a sequence of i.i.d. random variables uniformly distributed on {1,,n}\{1,\dots,n\} and let (εk)k(\varepsilon^{k})_{k\in\mathbb{N}} be a sequence of i.i.d Bernoulli random variables such that 𝖯(εk=1)=p]0,1]\mathsf{P}(\varepsilon^{k}=1)=p\in\left]0,1\right]. Let α>0\alpha>0 and set the initial points 𝒙0=𝒖0𝖧\bm{x}^{0}=\bm{u}^{0}\in\mathsf{H}. Then

fork=0,1,𝒙k+1=proxαfik(𝒙k+αfik(𝒖k)αF(𝒖k))𝒖k+1=(1εk)𝒖k+εk𝒙k\begin{array}[]{l}\text{for}\;k=0,1,\ldots\\ \left\lfloor\begin{array}[]{l}\bm{x}^{k+1}=\text{{prox}}_{\alpha f_{i_{k}}}\left(\bm{x}^{k}+\alpha\nabla f_{i_{k}}(\bm{u}^{k})-\alpha\nabla F(\bm{u}^{k})\right)\\ \bm{u}^{k+1}=(1-\varepsilon^{k})\bm{u}^{k}+\varepsilon^{k}\bm{x}^{k}\end{array}\right.\end{array}
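The single-loop structure above can be sketched as follows, again on the least-squares losses of Section 5.1.2 with their closed-form proximity operator; the refresh probability p and all function names are illustrative.

```python
import numpy as np

def l_svrp(A, b, alpha, p, n_iter, rng):
    # L-SVRP (Algorithm 4.10) for F(x) = (1/2n) ||Ax - b||^2: the anchor
    # point u is refreshed with probability p at each iteration.
    n, d = A.shape
    x = np.zeros(d)
    u = x.copy()
    grad_F = A.T @ (A @ u - b) / n                 # full gradient at u
    for k in range(n_iter):
        i = rng.integers(n)
        a_i, b_i = A[i], b[i]
        g_i = (a_i @ u - b_i) * a_i                # grad f_i(u^k)
        y = x + alpha * (g_i - grad_F)
        x_new = y + alpha * (b_i - a_i @ y) / (alpha * (a_i @ a_i) + 1.0) * a_i
        if rng.random() < p:                       # epsilon^k = 1
            u = x.copy()                           # u^{k+1} = x^k, not x^{k+1}
            grad_F = A.T @ (A @ u - b) / n
        x = x_new
    return x
```

With p = 1/n, the expected per-iteration cost matches that of the inner loop of SVRP while avoiding the choice of m.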

Here we note that Algorithm 4.10 can be identified with the general scheme 2.1, by setting 𝒆kfik(𝒖k)F(𝒖k)\bm{e}^{k}\coloneqq\nabla f_{i_{k}}(\bm{u}^{k})-\nabla F(\bm{u}^{k}). In addition, we define

σk21ni=1nfi(𝒖k)fi(𝒚k)2\sigma_{k}^{2}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{u}^{k})-\nabla f_{i}(\bm{y}^{k})\big{\|}^{2} (4.10)

with 𝒚kargminF\bm{y}^{k}\in\operatorname*{\text{\rm{argmin}}}F such that 𝒖k𝒚k=dist(𝒖k,argminF)\lVert\bm{u}^{k}-\bm{y}^{k}\rVert=\operatorname*{\text{\rm{dist}}}(\bm{u}^{k},\operatorname*{\text{\rm{argmin}}}F). Moreover, setting 𝔉k=σ(i0,,ik1,ε0,,εk1)\mathfrak{F}_{k}=\sigma(i_{0},\dots,i_{k-1},\varepsilon^{0},\dots,\varepsilon^{k-1}), we have that 𝒙k\bm{x}^{k}, 𝒖k\bm{u}^{k} and 𝒚k\bm{y}^{k} are 𝔉k\mathfrak{F}_{k}-measurable, and iki_{k} and εk\varepsilon^{k} are independent of 𝔉k\mathfrak{F}_{k}.

Lemma 4.11.

Suppose that Assumption (A.ii) is satisfied. Let (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} be the sequence generated by Algorithm 4.10, with 𝐯k=fik(𝐱k)fik(𝐮k)+F(𝐮k)\bm{v}^{k}=\nabla f_{i_{k}}(\bm{x}^{k})-\nabla f_{i_{k}}(\bm{u}^{k})+\nabla F(\bm{u}^{k}) and σk\sigma_{k} as defined in (4.10). Then for all kk\in\mathbb{N}, it holds

𝖤[𝒗k2|𝔉k]4L(F(𝒙k)F)+2σk2,\mathsf{E}[\|\bm{v}^{k}\|^{2}\,|\,\mathfrak{F}_{k}]\leq 4L\big{(}F(\bm{x}^{k})-F_{*}\big{)}+2\sigma_{k}^{2}, (4.11)

and

𝖤[σk+12](1p)𝖤[σk2]+2pL𝖤[F(𝒙k)F].\mathsf{E}\big{[}\sigma_{k+1}^{2}\big{]}\leq(1-p)\mathsf{E}\big{[}\sigma_{k}^{2}\big{]}+2pL\mathsf{E}\big{[}F(\bm{x}^{k})-F_{*}\big{]}. (4.12)

Lemma 4.11, whose proof can be found in Appendix A.2, ensures that Assumptions 2.5 hold true with constants A=2L,B=2,C=pLA=2L,B=2,C=pL, ρ=p\rho=p and D0D\equiv 0. The following corollaries are then obtained by applying Theorems 3.2 and 3.3, respectively, to Algorithm 4.10.

Corollary 4.12.

Suppose that Assumption (A.ii) holds. Suppose also that the sequence (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} is generated by Algorithm 4.10. Let MM be such that M2pM\geq\frac{2}{p} and let α>0\alpha>0 be such that α<1L(2+pM)\alpha<\frac{1}{L(2+pM)}. Then, for all kk\in\mathbb{N},

𝖤[F(¯𝒙k)F]\displaystyle\mathsf{E}[F(\bm{\bar{}}{\bm{x}}^{k})-F_{*}] dist(𝒙0,argminF)2+α2M𝖤[σ02]2αk[1αL(2+pM)],\displaystyle\leq\frac{\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]}{2\alpha k\left[1-\alpha L(2+pM)\right]},

with ¯𝐱k=1kt=0k1𝐱t\bm{\bar{}}{\bm{x}}^{k}=\frac{1}{k}\sum_{t=0}^{k-1}\bm{x}^{t}.

Corollary 4.13.

Suppose that Assumptions 2.4 are verified and that the sequence (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} is generated by Algorithm 4.10. Let MM be such that M>2/pM>2/p and let α>0\alpha>0 be such that α<1L(2+pM)\alpha<\frac{1}{L(2+pM)}. Set qmax{1αμ(1αL(2+pM)),1+2Mp}q\coloneqq\max\left\{1-\alpha\mu\left(1-\alpha L(2+pM)\right),1+\frac{2}{M}-p\right\}. Then q]0,1[q\in]0,1[ and for all kk\in\mathbb{N},

𝖤[dist(𝒙k,argminF)]2\displaystyle\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2} qk(dist(𝒙0,argminF)2+α2M𝖤[σ02]),\displaystyle\leq q^{k}\left(\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]\right),
𝖤[F(𝒙k)F]\displaystyle\mathsf{E}[F(\bm{x}^{k})-F_{*}] qkL2(dist(𝒙0,argminF)2+α2M𝖤[σ02]).\displaystyle\leq\frac{q^{k}L}{2}\left(\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]\right).
Remark 4.14.

The proximal version of L-SVRG [28] has been concurrently proposed in [24], where linear convergence is shown in the smooth strongly convex setting. In [24], an approximation of the proximity operator is used at each iteration. Moreover, Lipschitz continuity of the gradient is replaced by the weaker “second-order similarity”, namely

1ni=1nfi(𝘅)f(𝘅)[fi(𝘆)f(𝘆)]2δ2𝘅𝘆2.\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_{i}(\bm{\mathsf{x}})-\nabla f(\bm{\mathsf{x}})-\left[\nabla f_{i}(\bm{\mathsf{y}})-\nabla f(\bm{\mathsf{y}})\right]\right\|^{2}\leq\delta^{2}\|\bm{\mathsf{x}}-\bm{\mathsf{y}}\|^{2}.

In [24] the convex case is not analyzed.

4.4 Stochastic Aggregated Proximal Algorithm

In this paragraph, we propose a new stochastic proximal point method in analogy to SAGA [14], called Stochastic Aggregated Proximal Algorithm (SAPA).

Algorithm 4.15 (SAPA).

Let (ik)k(i_{k})_{k\in\mathbb{N}} be a sequence of i.i.d. random variables uniformly distributed on {1,,n}\{1,\dots,n\}. Let α>0\alpha>0 and set, for every i[n]i\in[n], ϕi0=𝒙0𝖧\bm{\phi}_{i}^{0}=\bm{x}^{0}\in\mathsf{H}. Then

fork=0,1,𝒙k+1=proxαfik(𝒙k+αfik(ϕikk)αni=1nfi(ϕik)),i[n]:ϕik+1=ϕik+δi,ik(𝒙kϕik),\begin{array}[]{l}\text{for}\;k=0,1,\ldots\\ \left\lfloor\begin{array}[]{l}\bm{x}^{k+1}=\text{{prox}}_{\alpha f_{i_{k}}}\left(\bm{x}^{k}+\alpha\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})-\frac{\alpha}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k})\right),\\ \forall\,i\in[n]\colon\ \bm{\phi}^{k+1}_{i}=\bm{\phi}^{k}_{i}+\delta_{i,i_{k}}(\bm{x}^{k}-\bm{\phi}^{k}_{i}),\end{array}\right.\end{array}

where δi,j\delta_{i,j} is the Kronecker symbol.
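The table-based update above can be sketched as follows, once more on the least-squares losses of Section 5.1.2; instead of storing the anchor points φi, the sketch stores directly the gradients ∇fi(φi) together with their running average, an equivalent and cheaper bookkeeping. Function names are illustrative.

```python
import numpy as np

def sapa(A, b, alpha, n_iter, rng):
    # SAPA (Algorithm 4.15) for F(x) = (1/2n) ||Ax - b||^2. The row G[i]
    # stores grad f_i(phi_i^k) and g_avg keeps their average up to date.
    n, d = A.shape
    x = np.zeros(d)
    G = A * (A @ x - b)[:, None]         # grad f_i(phi_i^0), phi_i^0 = x^0
    g_avg = G.mean(axis=0)
    for k in range(n_iter):
        i = rng.integers(n)
        a_i, b_i = A[i], b[i]
        y = x + alpha * (G[i] - g_avg)   # variance-reduced shift
        x_new = y + alpha * (b_i - a_i @ y) / (alpha * (a_i @ a_i) + 1.0) * a_i
        g_new = (a_i @ x - b_i) * a_i    # gradient at the new anchor x^k
        g_avg = g_avg + (g_new - G[i]) / n
        G[i] = g_new                     # SAGA-style rule: uses x^k, not x^{k+1}
        x = x_new
    return x
```

The O(1) average update g_avg += (g_new - G[i]) / n avoids recomputing the mean of the n stored gradients at each step.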

Remark 4.16.

Algorithm 4.15 is similar to, but different from, the Point-SAGA algorithm proposed in [13]. The update of the stored gradients for Point-SAGA is

i[n]:ϕik+1=ϕik+δi,ik(𝒙k+1ϕik),\forall\,i\in[n]\colon\ \bm{\phi}^{k+1}_{i}=\bm{\phi}^{k}_{i}+\delta_{i,i_{k}}(\bm{x}^{k+1}-\bm{\phi}^{k}_{i}),

whereas in SAGA [14] and its proximal version that we proposed in Algorithm 4.15, the update is

i[n]:ϕik+1=ϕik+δi,ik(𝒙kϕik).\forall\,i\in[n]\colon\ \bm{\phi}^{k+1}_{i}=\bm{\phi}^{k}_{i}+\delta_{i,i_{k}}(\bm{x}^{k}-\bm{\phi}^{k}_{i}).

As in the previous cases, SAPA can be identified with Algorithm 2.1 by setting 𝒆kfik(ϕikk)1ni=1nfi(ϕik)\bm{e}^{k}\coloneqq\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})-\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k}) for all kk\in\mathbb{N}. In addition, let

σk21ni=1nfi(ϕik)fi(𝒙)2\sigma_{k}^{2}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2} (4.13)

with 𝒙argminF\bm{x}^{*}\in\operatorname*{\text{\rm{argmin}}}F such that 𝒙0𝒙=dist(𝒙0,argminF)\lVert\bm{x}^{0}-\bm{x}^{*}\rVert=\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F). Setting 𝔉k=σ(i0,,ik1)\mathfrak{F}_{k}=\sigma(i_{0},\dots,i_{k-1}), we have that 𝒙k\bm{x}^{k} and ϕik\bm{\phi}_{i}^{k} are 𝔉k\mathfrak{F}_{k}-measurable and iki_{k} is independent of 𝔉k\mathfrak{F}_{k}.

Lemma 4.17.

Suppose that Assumption (A.ii) holds. Let (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} be the sequence generated by Algorithm 4.15, with 𝐯k=fik(𝐱k)fik(ϕikk)+1ni=1nfi(ϕik)\bm{v}^{k}=\nabla f_{i_{k}}\left(\bm{x}^{k}\right)-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})+\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k}) and σk\sigma_{k} as defined in (4.13). Then, for all kk\in\mathbb{N}, it holds

𝖤[𝒗k2|𝔉k]4L(F(𝒙k)F)+2σk2,\mathsf{E}\big{[}\|\bm{v}^{k}\|^{2}\,|\,\mathfrak{F}_{k}\big{]}\leq 4L(F(\bm{x}^{k})-F_{*})+2\sigma_{k}^{2},

and

𝖤[σk+12](11n)𝖤[σk2]+2Ln𝖤[F(𝒙k)F].\mathsf{E}\left[\sigma_{k+1}^{2}\right]\leq\left(1-\frac{1}{n}\right)\mathsf{E}\left[\sigma_{k}^{2}\right]+\frac{2L}{n}\mathsf{E}\big{[}F(\bm{x}^{k})-F_{*}\big{]}. (4.14)

From the above lemma, we know that Assumptions 2.5 are verified with A=2L,B=2,C=LnA=2L,B=2,C=\frac{L}{n}, ρ=1n\rho=\frac{1}{n} and D0D\equiv 0. This allows us to state the next corollaries, obtained by applying Theorems 3.2 and 3.3, respectively, to Algorithm 4.15.

Corollary 4.18.

Suppose that Assumption (A.ii) is verified and that the sequence (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} is generated by Algorithm 4.15. Let MM be such that M2nM\geq 2n and let α>0\alpha>0 be such that α<1L(2+M/n)\alpha<\frac{1}{L(2+M/n)}. Then, for all kk\in\mathbb{N},

𝖤[F(¯𝒙k)F]\displaystyle\mathsf{E}[F(\bm{\bar{}}{\bm{x}}^{k})-F_{*}] dist(𝒙0,argminF)2+α2M𝖤[σ02]2αk[1αL(2+M/n)],\displaystyle\leq\frac{\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]}{2\alpha k\left[1-\alpha L(2+M/n)\right]},

with ¯𝐱k=1kt=0k1𝐱t\bm{\bar{}}{\bm{x}}^{k}=\frac{1}{k}\sum_{t=0}^{k-1}\bm{x}^{t}.

Corollary 4.19.

Suppose that Assumptions 2.4 hold and that the sequence (𝐱k)k(\bm{x}^{k})_{k\in\mathbb{N}} is generated by Algorithm 4.15. Let MM be such that M>2nM>2n and let α>0\alpha>0 be such that α<1L(2+M/n)\alpha<\frac{1}{L(2+M/n)}. Set qmax{1αμ(1αL(2+M/n)),1+2M1n}q\coloneqq\max\left\{1-\alpha\mu\left(1-\alpha L(2+M/n)\right),1+\frac{2}{M}-\frac{1}{n}\right\}. Then q]0,1[q\in]0,1[ and for all kk\in\mathbb{N},

𝖤[dist(𝒙k,argminF)]2\displaystyle\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)]^{2} qk(dist(𝒙0,argminF)2+α2M𝖤[σ02]),\displaystyle\leq q^{k}\left(\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]\right),
𝖤[F(𝒙k)F]\displaystyle\mathsf{E}[F(\bm{x}^{k})-F_{*}] qkL2(dist(𝒙0,argminF)2+α2M𝖤[σ02]).\displaystyle\leq\frac{q^{k}L}{2}\left(\operatorname*{\text{\rm{dist}}}(\bm{x}^{0},\operatorname*{\text{\rm{argmin}}}F)^{2}+\alpha^{2}M\mathsf{E}[\sigma_{0}^{2}]\right).
Remark 4.20.

Now, we compare our results with those in [13] for Point-SAGA. In the smooth case, Point-SAGA converges linearly when every fif_{i} is differentiable with LL-Lipschitz gradient and μ\mu-strongly convex, whereas we require only convexity of the fif_{i} and the PL condition on FF. The work [13] does not provide any rate for convex functions, but it does provide an ergodic sublinear convergence rate for nonsmooth, strongly convex functions; notably, in that setting the algorithm converges with a constant step size.

5 Experiments

In this section, we perform some experiments on synthetic data to compare the schemes presented and analyzed in Section 4. We compare the variance-reduced algorithms SAPA (Algorithm 4.15) and SVRP (Algorithm 4.4) with their vanilla counterpart SPPA (Algorithm 4.1) and with their explicit gradient counterparts, SAGA [14] and SVRG [22]. The plots presented in this section are averages over 10 or 5 runs, depending on the computational demand of the problem; the deviation from the average is also plotted. All the codes are available on GitHub (https://github.com/cheiktraore/Variance-reduction-for-SPPA).

5.1 Comparing SAPA and SVRP to SPPA

The cost of each iteration of SVRP is different from that of SPPA. More precisely, SVRP consists of two nested loops, where each outer iteration requires one full gradient and mm stochastic gradient computations. We consider the minimization of the sum of nn functions; therefore, we run SPPA for NN iterations, with N=S(m+n+1)N=S(m+n+1), where SS is the maximum number of outer iterations and mm is the number of inner iterations as defined in Algorithm 4.4; we denote by ss the outer iteration counter. As in [22], mm is set to 2n2n, and the step size α\alpha in SVRP is fixed at 1/(5L)1/(5L). SAPA is run for NnN-n iterations, since it requires a full gradient computation at the beginning, and its step size is also set to 1/(5L)1/(5L). Finally, the SPPA step size is chosen as αk=1/(k0.55)\alpha_{k}=1/(k^{0.55}).

For all three algorithms, we normalize the abscissa so as to present the convergence with respect to the outer iterates (𝒙~s)s(\tilde{\bm{x}}^{s})_{s\in\mathbb{N}}, as in Theorem 4.7.

The algorithms are run for n{1000,5000,10000}n\in\{1000,5000,10000\}, with dd fixed at 500500 and cond(AA)cond(A^{\top\!}A), the condition number of AAA^{\top\!}A, fixed at 100100.

5.1.1 Logistic regression

First, we consider experiments on the logistic loss:

F(𝘅)=1ni=1nlog(1+exp{bi𝗮i,𝘅}),F(\bm{\mathsf{x}})=\frac{1}{n}\sum_{i=1}^{n}\log(1+\exp\{-b_{i}\langle\bm{\mathsf{a}}_{i},\bm{\mathsf{x}}\rangle\}), (5.1)

where 𝗮i\bm{\mathsf{a}}_{i} is the ithi^{th} row of a matrix An×dA\in\mathbb{R}^{n\times d} and bi{1,1}b_{i}\in\{-1,1\} for all i[n]i\in[n]. If we set fi(𝘅)=log(1+exp{bi𝗮i,𝘅})f_{i}(\bm{\mathsf{x}})=\log(1+\exp\{-b_{i}\langle\bm{\mathsf{a}}_{i},\bm{\mathsf{x}}\rangle\}) for all i[n]i\in[n], then F(𝘅)=i=1n1nfi(𝘅)F(\bm{\mathsf{x}})=\sum_{i=1}^{n}\frac{1}{n}f_{i}(\bm{\mathsf{x}}). The matrix AA is generated randomly: we first generate a matrix MM according to the standard normal distribution, and a singular value decomposition gives M=UDVM=UDV^{\top\!}. We set the smallest singular value to zero and rescale the remaining singular values so that the largest equals a given condition number and the second smallest equals one, obtaining a new diagonal matrix DD^{\prime}. Then AA is given by UDVUD^{\prime}V^{\top\!}. For this problem, we have L=0.25×maxi𝗮i22L=0.25\times\max_{i}\|\bm{\mathsf{a}}_{i}\|_{2}^{2}. We compute the proximity operator of the logistic function fif_{i} using the formula and the subroutine code available in [12], together with the calculus rule for the proximity operator of a function composed with a linear map [4, Corollary 24.15].
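The generation of the conditioned matrix described above can be sketched as follows; we follow the recipe literally (largest singular value set to the given number, second smallest to one, smallest to zero), and the function name is ours.

```python
import numpy as np

def conditioned_matrix(n, d, cond, rng):
    # Gaussian matrix rescaled via SVD: the singular values are mapped
    # affinely so that the largest equals `cond` and the second smallest
    # equals 1; the smallest is then set to 0.
    M = rng.standard_normal((n, d))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # s sorted descending
    s_new = 1.0 + (s - s[-2]) * (cond - 1.0) / (s[0] - s[-2])
    s_new[-1] = 0.0                                   # zero out the smallest
    return (U * s_new) @ Vt                           # A = U D' V^T
```

The affine map is increasing, so the ordering of the singular values is preserved and the resulting spectrum spans [1, cond] apart from the single zero.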

Even though we do not have any theoretical result for SVRP in the convex case, we perform some experiments in this setting as well. As can readily be seen in Figure 1, both SAPA and SVRP outperform SPPA.

5.1.2 Ordinary least squares (OLS)

To analyze the practical behavior of the proposed methods when the PL condition holds, we test all the algorithms on an ordinary least squares problem:

minimize𝘅dF(𝘅)=12nA𝘅𝒃2=1ni=1n12(𝗮i,𝘅bi)2,\underset{\begin{subarray}{c}{\bm{\mathsf{x}}\in\mathbb{R}^{d}}\end{subarray}}{\text{{minimize}}}\;\;F(\bm{\mathsf{x}})=\frac{1}{2n}\|A\bm{\mathsf{x}}-\bm{b}\|^{2}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\left(\langle\bm{\mathsf{a}}_{i},\bm{\mathsf{x}}\rangle-b_{i}\right)^{2}, (5.2)

where 𝗮i\bm{\mathsf{a}}_{i} is the ithi^{th} row of the matrix An×dA\in\mathbb{R}^{n\times d} and bib_{i}\in\mathbb{R} for all i[n]i\in[n]. In this setting, we have F(𝘅)=i=1n1nfi(𝘅)F(\bm{\mathsf{x}})=\sum_{i=1}^{n}\frac{1}{n}f_{i}(\bm{\mathsf{x}}) with fi(𝘅)=12(𝗮i,𝘅bi)2f_{i}(\bm{\mathsf{x}})=\frac{1}{2}\left(\langle\bm{\mathsf{a}}_{i},\bm{\mathsf{x}}\rangle-b_{i}\right)^{2} for all i[n]i\in[n].

Here, the matrix AA was generated as in Section 5.1.1. The proximity operator is computed with the following closed-form expression:

proxαfi(𝒙)=𝒙+αbi𝗮i,𝒙α𝗮i2+1𝗮i.\text{{prox}}_{\alpha f_{i}}(\bm{x})=\bm{x}+\alpha\frac{b_{i}-\langle\bm{\mathsf{a}}_{i},\bm{x}\rangle}{\alpha\|\bm{\mathsf{a}}_{i}\|^{2}+1}\bm{\mathsf{a}}_{i}.
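The closed form above can be checked numerically against the defining optimality condition of the proximity operator, namely α∇fi(z) + z − x = 0 at z = proxαfi(x); the data below are random and purely illustrative.

```python
import numpy as np

def prox_ls(x, a, b, alpha):
    # Closed-form prox of f_i(z) = 0.5 * (<a, z> - b)^2, as displayed above.
    return x + alpha * (b - a @ x) / (alpha * (a @ a) + 1.0) * a

# Optimality condition of the prox:
#   alpha * grad f_i(z) + z - x = 0,  i.e.  alpha * (<a, z> - b) * a + z - x = 0.
rng = np.random.default_rng(0)
a, x = rng.standard_normal(4), rng.standard_normal(4)
z = prox_ls(x, a, b=0.7, alpha=0.3)
residual = 0.3 * (a @ z - 0.7) * a + z - x   # should vanish
```

Taking the inner product of the optimality condition with 𝗮i and solving the resulting scalar equation is exactly how the displayed formula is derived.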

As in the logistic case, SVRP and SAPA exhibit faster convergence than SPPA; see Figure 2.

Refer to caption
Refer to caption
Refer to caption
Figure 1: Evolution of F(𝒙~s)FF(\tilde{\bm{x}}^{s})-F_{*}, with respect to the normalized iterations counter, for different values of number of functions nn.We compare the performance of SAPA (blue) and SVRP (green) to SPPA (orange) in the logistic regression case.
Refer to caption
Refer to caption
Refer to caption
Figure 2: Evolution of F(𝒙~s)FF(\tilde{\bm{x}}^{s})-F_{*}, with respect to the normalized iterations counter, for different values of number of functions nn. We compare the performance of SAPA (blue) and SVRP (green) to SPPA (orange) in the ordinary least squares case.

5.2 Comparing SAPA to SAGA

In the next experiments, we implement SAPA and SAGA to solve the least-squares problem (5.2) and logistic problem (5.1). The data are exactly as in Section 5.1.2 and the matrix AA is generated as explained in Section 5.1.1.

The aim of these experiments is twofold: on the one hand, we aim to establish the practical performance of the proposed method in terms of convergence rate and to compare it with the corresponding variance reduction gradient algorithm; on the other hand, we want to assess the stability of SAPA with respect to the step size selection. Indeed, SPPA has been shown to be more robust and stable with respect to the choice of the step size than stochastic gradient algorithms, see [3, 37, 26]. In Figure 3 for OLS and Figure 4 for logistic regression, we plot the number of iterations needed for SAPA and SAGA to achieve an accuracy of at least F(xt)Fε=0.01F(x_{t})-F_{\ast}\leq\varepsilon=0.01, over a fixed range of step sizes. The algorithms run for n{1000,5000,10000}n\in\{1000,5000,10000\}, with dd fixed at 500500 and cond(AA)cond(A^{\top\!}A) at 100100.

Our experimental results show that if the step size is small enough, SAPA and SAGA behave very similarly for both problems, see Figure 3 and Figure 4. This behavior is in agreement with the theoretical results that establish convergence rates for SAPA that are similar to those of SAGA. By increasing the step sizes, we observe that SAPA is more stable than SAGA: the range of step sizes for which SAPA converges is wider than that for SAGA.
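To make the comparison concrete, the following minimal sketch implements one SAPA-style iteration for the OLS problem (5.2). It assumes the update x^{k+1} = prox_{αf_{i_k}}(x^k + αe^k) with the SAGA-type correction e^k = ∇f_{i_k}(ϕ^k_{i_k}) − (1/n)∑_j ∇f_j(ϕ^k_j) and the table refresh ϕ^{k+1}_{i_k} = x^k; all helper names are illustrative, not from the paper's code:

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def grad_fi(a, bi, x):
    """Gradient of the OLS summand f_i(x) = 0.5*(<a, x> - b_i)^2."""
    r = dot(a, x) - bi
    return [r * ai for ai in a]

def prox_fi(a, bi, x, alpha):
    """Closed-form prox of alpha*f_i (cf. the formula in Section 5.1.2)."""
    c = alpha * (bi - dot(a, x)) / (alpha * dot(a, a) + 1.0)
    return [xi + c * ai for ai, xi in zip(a, x)]

def sapa_step(x, phi, A, b, alpha, rng):
    """One SAPA-style iteration (sketch): draw i, form the SAGA-type
    correction e = grad f_i(phi_i) - (1/n) sum_j grad f_j(phi_j), take a
    proximal step from x + alpha*e, then store the current iterate in phi_i."""
    n, d = len(A), len(x)
    i = rng.randrange(n)
    avg = [sum(grad_fi(A[j], b[j], phi[j])[t] for j in range(n)) / n
           for t in range(d)]
    e = [g - m for g, m in zip(grad_fi(A[i], b[i], phi[i]), avg)]
    x_new = prox_fi(A[i], b[i], [x[t] + alpha * e[t] for t in range(d)], alpha)
    phi[i] = list(x)
    return x_new
```

Replacing the final proximal step with an explicit gradient step along −(∇f_{i_k}(x^k) − e^k) would give the corresponding SAGA iteration, which is the pairing probed in Figures 3 and 4.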

Figure 3: Number of iterations needed in order to achieve an accuracy of at least 0.010.01 for different step sizes when solving an OLS problem. A cap is put at 40000 iterations. Here we compare SAPA (in blue) with SAGA (in orange) for different values of the number of functions n.
Figure 4: Number of iterations needed in order to achieve an accuracy of at least 0.010.01 for different step sizes when solving a logistic problem. A cap is put at 2.5×1072.5\times 10^{7} iterations. Here we compare SAPA (in blue) with SAGA (in orange) for different values of the number of functions n.

5.3 Comparing SVRP to SVRG

In this section, we compare the performance of Algorithm 4.4 (SVRP) with SVRG [22] for the least-squares problem (5.2), in terms of the number of inner iterations (oracle calls) needed to achieve an accuracy of at least F(xt)Fε=0.01F(x_{t})-F_{\ast}\leq\varepsilon=0.01, over a fixed range of step sizes. In this experiment, the condition number is set to cond(AA)=100\text{cond}(A^{\top}A)=100, n=2000n=2000, and the dimension of the objective variable dd varies in {1000,1500,2000,3000}\{1000,1500,2000,3000\}. The number of outer and inner iterations is set to s=40s=40 and m=1000m=1000, respectively, and the maximum number of iterations (oracle calls) is set to N=s(m+n+1)N=s(m+n+1).

Two main observations can be made. First, when the step size is optimally tuned, SVRP always performs better than SVRG (overall, it requires fewer iterations to achieve ε=0.01\varepsilon=0.01 accuracy). Secondly, regarding the stability of the two methods with respect to the choice of the step size, while SVRG seems to be a bit more stable for easy problems (d=1000d=1000), the situation is reversed for harder problems (e.g., d{2000,3000}d\in\{2000,3000\}), where SVRP is more robust. The results are reported in Figure 5.
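Both methods are driven by the same variance-reduced estimator v = ∇f_j(x) − ∇f_j(x̃) + ∇F(x̃) built from the anchor point x̃ (recalled in Appendix A.2); SVRG takes an explicit gradient step along −v, while SVRP incorporates the same anchor-point correction into a proximal step. A minimal check, on illustrative least-squares data, that the estimator is unbiased for ∇F(x):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Illustrative least-squares data: f_j(x) = 0.5*(<a_j, x> - b_j)^2.
A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [1.0, 0.0, -2.0]
n, d = len(A), 2

def grad(j, x):
    r = dot(A[j], x) - b[j]
    return [r * aj for aj in A[j]]

def grad_F(x):
    g = [0.0] * d
    for j in range(n):
        gj = grad(j, x)
        g = [gt + gjt / n for gt, gjt in zip(g, gj)]
    return g

def vr_direction(j, x, x_tilde):
    """SVRG-type estimator: grad f_j(x) - grad f_j(x_tilde) + grad F(x_tilde)."""
    gF = grad_F(x_tilde)
    gx, ga = grad(j, x), grad(j, x_tilde)
    return [gx[t] - ga[t] + gF[t] for t in range(d)]

x, x_tilde = [0.4, -0.3], [1.0, 1.0]
# Averaging over the uniformly drawn index j recovers the full gradient at x.
avg = [sum(vr_direction(j, x, x_tilde)[t] for j in range(n)) / n for t in range(d)]
assert all(abs(at - gt) < 1e-12 for at, gt in zip(avg, grad_F(x)))
```

Unbiasedness holds for any anchor x̃; the variance reduction comes from the fact that the estimator's variance vanishes as both x and x̃ approach the solution.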

Figure 5: Number of iterations needed in order to achieve an accuracy of at least 0.010.01 for different step sizes when solving an OLS problem. Here we compare SVRP (in blue) with SVRG (in orange) for four different values of the dimension dd in problem (5.2), starting from d=1000d=1000 (easy case) to d=3000d=3000 (hard case).

6 Conclusion and future work

In this work, we proposed a general scheme of variance reduction, based on the stochastic proximal point algorithm (SPPA). In particular, we introduced some new variants of SPPA coupled with various variance reduction strategies, namely SVRP, SAPA and L-SVRP (see Section 4) for which we provide improved convergence results compared to the vanilla version. As the experiments suggest, the convergence behavior of these stochastic proximal variance-reduced methods is more stable with respect to the step size, rendering them advantageous over their explicit-gradient counterparts.

Since differentiability of the summands is required in this paper, an interesting line of future research includes the extension of these results to the nonsmooth case and the model-based framework, as well as the consideration of inertial (accelerated) variants leading, hopefully, to faster rates.

Appendices

Appendix A Additional proofs

In this appendix we provide the proofs of some auxiliary lemmas used in Sections 3 and 4 in the main body of the paper.

A.1 Proofs of Section 𝟑\mathbf{3}

Proof of Proposition 3.1. Notice that from (2.2), 𝒙k+1\bm{x}^{k+1} can be identified as

𝒙k+1=argmin𝘅𝖧{fik(𝒙)𝒆k,𝘅𝒙k:=Rik(𝘅)+12α𝘅𝒙k2}.\bm{x}^{k+1}=\operatorname*{\text{\rm{argmin}}}_{\bm{\mathsf{x}}\in\mathsf{H}}\{\underbrace{f_{i_{k}}(\bm{x})-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle}_{:=R_{i_{k}}(\bm{\mathsf{x}})}+\frac{1}{2\alpha}\lVert\bm{\mathsf{x}}-\bm{x}^{k}\rVert^{2}\}.

Let 𝘅𝖧\bm{\mathsf{x}}\in\mathsf{H}. The function Rik(𝘅)=fik(𝘅)𝒆k,𝘅𝒙kR_{i_{k}}(\bm{\mathsf{x}})=f_{i_{k}}(\bm{\mathsf{x}})-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle is convex and continuously differentiable with Rik(𝘅)=fik(𝘅)𝒆k\nabla R_{i_{k}}(\bm{\mathsf{x}})=\nabla f_{i_{k}}(\bm{\mathsf{x}})-\bm{e}^{k}.

By convexity of RikR_{i_{k}}, we have:

Rik(𝒙k+1),𝘅𝒙k+1\displaystyle\langle\nabla R_{i_{k}}(\bm{x}^{k+1}),\bm{\mathsf{x}}-\bm{x}^{k+1}\rangle Rik(𝘅)Rik(𝒙k+1)\displaystyle\leq R_{i_{k}}(\bm{\mathsf{x}})-R_{i_{k}}(\bm{x}^{k+1}) (A.1)
=fik(𝘅)fik(𝒙k+1)𝒆k,𝘅𝒙k+𝒆k,𝒙k+1𝒙k.\displaystyle=f_{i_{k}}(\bm{\mathsf{x}})-f_{i_{k}}(\bm{x}^{k+1})-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle+\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle.

Since Rik(𝒙k+1)=𝒙k𝒙k+1α\nabla R_{i_{k}}(\bm{x}^{k+1})=\frac{\bm{x}^{k}-\bm{x}^{k+1}}{\alpha}, from (A.1), it follows:

1α𝒙k𝒙k+1,𝘅𝒙k+1fik(𝘅)fik(𝒙k+1)𝒆k,𝘅𝒙k+𝒆k,𝒙k+1𝒙k.\frac{1}{\alpha}\langle\bm{x}^{k}-\bm{x}^{k+1},\bm{\mathsf{x}}-\bm{x}^{k+1}\rangle\leq f_{i_{k}}(\bm{\mathsf{x}})-f_{i_{k}}(\bm{x}^{k+1})-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle+\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle.

By using the identity 𝒙k𝒙k+1,𝘅𝒙k+1=12𝒙k+1𝒙k2+12𝒙k+1𝘅212𝒙k𝘅2\langle\bm{x}^{k}-\bm{x}^{k+1},\bm{\mathsf{x}}-\bm{x}^{k+1}\rangle=\frac{1}{2}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}+\frac{1}{2}\lVert\bm{x}^{k+1}-\bm{\mathsf{x}}\rVert^{2}-\frac{1}{2}\lVert\bm{x}^{k}-\bm{\mathsf{x}}\rVert^{2} in the previous inequality, we find

𝒆k,𝒙k+1𝒙k+\displaystyle-\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+ fik(𝒙k+1)fik(𝘅)+12α𝒙k+1𝒙k2\displaystyle f_{i_{k}}(\bm{x}^{k+1})-f_{i_{k}}(\bm{\mathsf{x}})+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
12α𝒙k𝘅212α𝒙k+1𝘅2𝒆k,𝘅𝒙k.\displaystyle\leq\frac{1}{2\alpha}\lVert\bm{x}^{k}-\bm{\mathsf{x}}\rVert^{2}-\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{\mathsf{x}}\rVert^{2}-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle. (A.2)
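For completeness, the identity used above follows by expanding the squares with u := x^k − x^{k+1} and v := 𝘅 − x^{k+1}, so that u − v = x^k − 𝘅:

```latex
\frac{1}{2}\lVert u\rVert^{2}+\frac{1}{2}\lVert v\rVert^{2}-\frac{1}{2}\lVert u-v\rVert^{2}
=\frac{1}{2}\lVert u\rVert^{2}+\frac{1}{2}\lVert v\rVert^{2}
-\frac{1}{2}\bigl(\lVert u\rVert^{2}-2\langle u,v\rangle+\lVert v\rVert^{2}\bigr)
=\langle u,v\rangle.
```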

Recalling the definition of 𝒗k\bm{v}_{k} in (2.3), we can lower bound the left-hand side as follows:

𝒆k\displaystyle-\langle\bm{e}^{k} ,𝒙k+1𝒙k+fik(𝒙k+1)fik(𝘅)+12α𝒙k+1𝒙k2\displaystyle,\bm{x}^{k+1}-\bm{x}^{k}\rangle+f_{i_{k}}(\bm{x}^{k+1})-f_{i_{k}}(\bm{\mathsf{x}})+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
=𝒆k,𝒙k+1𝒙k+fik(𝒙k+1)fik(𝒙k)+12α𝒙k+1𝒙k2\displaystyle=-\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+f_{i_{k}}(\bm{x}^{k+1})-f_{i_{k}}(\bm{x}^{k})+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
+fik(𝒙k)fik(𝘅)\displaystyle\qquad+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}})
𝒆k,𝒙k+1𝒙k+fik(𝒙k),𝒙k+1𝒙k+12α𝒙k+1𝒙k2\displaystyle\geq-\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+\langle\nabla f_{i_{k}}(\bm{x}^{k}),\bm{x}^{k+1}-\bm{x}^{k}\rangle+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
+fik(𝒙k)fik(𝘅)\displaystyle\qquad+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}})
=𝒆k+fik(𝒙k),𝒙k+1𝒙k+12α𝒙k+1𝒙k2\displaystyle=\langle-\bm{e}^{k}+\nabla f_{i_{k}}(\bm{x}^{k}),\bm{x}^{k+1}-\bm{x}^{k}\rangle+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
+fik(𝒙k)fik(𝘅)\displaystyle\qquad+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}})
=𝒗k,𝒙k+1𝒙k+12α𝒙k+1𝒙k2\displaystyle=\langle\bm{v}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
+fik(𝒙k)fik(𝘅),\displaystyle\qquad+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}}), (A.3)

where in the inequality, we used the convexity of fif_{i}, for all i[n]i\in[n]. Note that

𝒗k,𝒙k+1𝒙k+12α𝒙k+1𝒙k2=α2𝒗k+12α(𝒙k+1𝒙k)2α2𝒗k2.\langle\bm{v}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}=\left\|\frac{\sqrt{\alpha}}{\sqrt{2}}\bm{v}^{k}+\frac{1}{\sqrt{2\alpha}}(\bm{x}^{k+1}-\bm{x}^{k})\right\|^{2}-\frac{\alpha}{2}\|\bm{v}^{k}\|^{2}.

Combining this identity with (A.3), it follows that

𝒆k,𝒙k+1𝒙k+\displaystyle-\langle\bm{e}^{k},\bm{x}^{k+1}-\bm{x}^{k}\rangle+ fik(𝒙k+1)fik(𝘅)+12α𝒙k+1𝒙k2\displaystyle f_{i_{k}}(\bm{x}^{k+1})-f_{i_{k}}(\bm{\mathsf{x}})+\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{x}^{k}\rVert^{2}
α2𝒗k+12α(𝒙k+1𝒙k)2α2𝒗k2+fik(𝒙k)fik(𝘅)\displaystyle\geq\left\|\frac{\sqrt{\alpha}}{\sqrt{2}}\bm{v}^{k}+\frac{1}{\sqrt{2\alpha}}(\bm{x}^{k+1}-\bm{x}^{k})\right\|^{2}-\frac{\alpha}{2}\|\bm{v}^{k}\|^{2}+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}})
=α2fik(𝒙k)fik(𝒙k+1)2α2𝒗k2+fik(𝒙k)fik(𝘅).\displaystyle=\frac{\alpha}{2}\left\|\nabla f_{i_{k}}(\bm{x}^{k})-\nabla f_{i_{k}}(\bm{x}^{k+1})\right\|^{2}-\frac{\alpha}{2}\|\bm{v}^{k}\|^{2}+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}}). (A.4)

By using (A.4) in (A.2), we obtain

α2𝒗k2+fik(𝒙k)fik(𝘅)\displaystyle-\frac{\alpha}{2}\|\bm{v}^{k}\|^{2}+f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}}) 12α𝒙k𝘅212α𝒙k+1𝘅2𝒆k,𝘅𝒙k.\displaystyle\leq\frac{1}{2\alpha}\lVert\bm{x}^{k}-\bm{\mathsf{x}}\rVert^{2}-\frac{1}{2\alpha}\lVert\bm{x}^{k+1}-\bm{\mathsf{x}}\rVert^{2}-\langle\bm{e}^{k},\bm{\mathsf{x}}-\bm{x}^{k}\rangle. (A.5)

Now, define 𝖤k[]=𝖤[|𝔉k]\mathsf{E}_{k}[\cdot]=\mathsf{E}[\cdot\,|\,\mathfrak{F}_{k}], where 𝔉k\mathfrak{F}_{k} is defined in Assumption 2.5 and is such that 𝒙k\bm{x}^{k} is 𝔉k\mathfrak{F}_{k}-measurable and iki_{k} is independent of 𝔉k\mathfrak{F}_{k}. Thus, taking the conditional expectation of inequality (A.5) and rearranging the terms, we have

𝖤k[𝒙k+1𝘅2]\displaystyle\mathsf{E}_{k}[\lVert\bm{x}^{k+1}-\bm{\mathsf{x}}\rVert^{2}] 𝒙k𝘅22α𝖤k[fik(𝒙k)fik(𝘅)]+α2𝖤k𝒗k2\displaystyle\leq\lVert\bm{x}^{k}-\bm{\mathsf{x}}\rVert^{2}-2\alpha\mathsf{E}_{k}\big{[}f_{i_{k}}(\bm{x}^{k})-f_{i_{k}}(\bm{\mathsf{x}})\big{]}+\alpha^{2}\mathsf{E}_{k}\|\bm{v}^{k}\|^{2}
=𝒙k𝘅22α(F(𝒙k)F(𝘅))+α2𝖤k𝒗k2.\displaystyle=\lVert\bm{x}^{k}-\bm{\mathsf{x}}\rVert^{2}-2\alpha(F(\bm{x}^{k})-F(\bm{\mathsf{x}}))+\alpha^{2}\mathsf{E}_{k}\|\bm{v}^{k}\|^{2}.

Replacing 𝘅\bm{\mathsf{x}} by 𝒚k\bm{y}^{k}, with 𝒚kargminF\bm{y}^{k}\in\operatorname*{\text{\rm{argmin}}}F such that 𝒙k𝒚k=dist(𝒙k,argminF)\lVert\bm{x}^{k}-\bm{y}^{k}\rVert=\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F) and using Assumption (B.ii), we get

𝖤k𝒙k+1𝒚k2\displaystyle\mathsf{E}_{k}\lVert\bm{x}^{k+1}-\bm{y}^{k}\rVert^{2} 𝒙k𝒚k2\displaystyle\leq\lVert\bm{x}^{k}-\bm{y}^{k}\rVert^{2}
2α(F(𝒙k)F)+α2(2A(F(𝒙k)F)+Bσk2+D)a.s.\displaystyle\qquad-2\alpha(F(\bm{x}^{k})-F_{*})+\alpha^{2}\left(2A(F(\bm{x}^{k})-F_{*})+B\sigma_{k}^{2}+D\right)\;a.s.
=𝒙k𝒚k2+α2D\displaystyle=\lVert\bm{x}^{k}-\bm{y}^{k}\rVert^{2}+\alpha^{2}D
2α(1αA)(F(𝒙k)F)+α2Bσk2a.s.\displaystyle\qquad-2\alpha(1-\alpha A)(F(\bm{x}^{k})-F_{*})+\alpha^{2}B\sigma_{k}^{2}\;\;a.s. (A.6)

By taking the total expectation in (A.6) and using (B.iii), for all M>0M>0 we have

𝖤𝒙k+1𝒚k2+α2M𝖤[σk+12]\displaystyle\mathsf{E}\lVert\bm{x}^{k+1}-\bm{y}^{k}\rVert^{2}+\alpha^{2}M\mathsf{E}[\sigma_{k+1}^{2}] 𝖤𝒙k𝒚k2+α2𝖤[D]\displaystyle\leq\mathsf{E}\lVert\bm{x}^{k}-\bm{y}^{k}\rVert^{2}+\alpha^{2}\mathsf{E}[D]
2α(1αA)𝖤[F(𝒙k)F]+α2B𝖤[σk2]\displaystyle\qquad-2\alpha(1-\alpha A)\mathsf{E}[F(\bm{x}^{k})-F_{*}]+\alpha^{2}B\mathsf{E}[\sigma_{k}^{2}]
+α2M(1ρ)𝖤[σk2]+2α2MC𝖤[F(𝒙k)F]\displaystyle\qquad+\alpha^{2}M(1-\rho)\mathsf{E}[\sigma_{k}^{2}]+2\alpha^{2}MC\mathsf{E}[F(\bm{x}^{k})-F_{*}]
=𝖤𝒙k𝒚k2+α2𝖤[D]\displaystyle=\mathsf{E}\lVert\bm{x}^{k}-\bm{y}^{k}\rVert^{2}+\alpha^{2}\mathsf{E}[D]
2α[1α(A+MC)]𝖤[F(𝒙k)F]\displaystyle\qquad-2\alpha\left[1-\alpha(A+MC)\right]\mathsf{E}[F(\bm{x}^{k})-F_{*}]
+α2[M+BρM]𝖤[σk2]\displaystyle\qquad+\alpha^{2}\left[M+B-\rho M\right]\mathsf{E}[\sigma_{k}^{2}]
=𝖤[dist(𝒙k,argminF)2]+α2𝖤[D]\displaystyle=\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k},\operatorname*{\text{\rm{argmin}}}F)^{2}]+\alpha^{2}\mathsf{E}[D]
2α[1α(A+MC)]𝖤[F(𝒙k)F]\displaystyle\qquad-2\alpha\left[1-\alpha(A+MC)\right]\mathsf{E}[F(\bm{x}^{k})-F_{*}]
+α2[M+BρM]𝖤[σk2].\displaystyle\qquad+\alpha^{2}\left[M+B-\rho M\right]\mathsf{E}[\sigma_{k}^{2}].

Since 𝖤[dist(𝒙k+1,argminF)2]𝖤𝒙k+1𝒚k2\mathsf{E}[\operatorname*{\text{\rm{dist}}}(\bm{x}^{k+1},\operatorname*{\text{\rm{argmin}}}F)^{2}]\leq\mathsf{E}\lVert\bm{x}^{k+1}-\bm{y}^{k}\rVert^{2}, the statement follows. ∎

A.2 Proofs of Section 𝟒\mathbf{4}

Proof of Lemma 4.5. Given any i[n]i\in[n], since fif_{i} is convex and LL-Lipschitz smooth, it is a standard fact (see [32, Equation 2.1.7]) that

(𝘅,𝘆𝖧)fi(𝘅)fi(𝘆)22L[fi(𝘅)fi(𝘆)fi(𝘆),𝘅𝘆].(\forall\,\bm{\mathsf{x}},\bm{\mathsf{y}}\in\mathsf{H})\quad\big{\|}\nabla f_{i}(\bm{\mathsf{x}})-\nabla f_{i}(\bm{\mathsf{y}})\big{\|}^{2}\leq 2L\big{[}f_{i}(\bm{\mathsf{x}})-f_{i}(\bm{\mathsf{y}})-\langle\nabla f_{i}(\bm{\mathsf{y}}),\bm{\mathsf{x}}-\bm{\mathsf{y}}\rangle\big{]}.

By summing the above inequality over i=1,,ni=1,\ldots,n, we have

i=1n1nfi(𝘅)fi(𝘆)22L[F(𝘅)F(𝘆)]2LF(𝘆),𝘅𝘆\sum_{i=1}^{n}\frac{1}{n}\big{\|}\nabla f_{i}(\bm{\mathsf{x}})-\nabla f_{i}(\bm{\mathsf{y}})\big{\|}^{2}\leq 2L\big{[}F(\bm{\mathsf{x}})-F(\bm{\mathsf{y}})\big{]}-2L\langle\nabla F(\bm{\mathsf{y}}),\bm{\mathsf{x}}-\bm{\mathsf{y}}\rangle

and hence if we suppose that F(𝘆)=0\nabla F\left(\bm{\mathsf{y}}\right)=0, then we obtain

i=1n1nfi(𝘅)fi(𝘆)22L[F(𝘅)F(𝘆)].\sum_{i=1}^{n}\frac{1}{n}\big{\|}\nabla f_{i}(\bm{\mathsf{x}})-\nabla f_{i}(\bm{\mathsf{y}})\big{\|}^{2}\leq 2L\big{[}F(\bm{\mathsf{x}})-F(\bm{\mathsf{y}})\big{]}. (A.7)
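Inequality (A.7) can be probed numerically. The sketch below checks it for the least-squares summands of Section 5.1.2 on a toy instance where the minimizer is known in closed form (data and helper names are illustrative):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Toy instance: f_i(x) = 0.5*(<a_i, x> - b_i)^2, F = (1/n) sum_i f_i,
# with A square and invertible so the minimizer y solves A y = b exactly,
# hence grad F(y) = 0 and (A.7) applies.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [3.0, 4.0]
y = [3.0, 2.0]                       # A y = b
n = len(A)
L = max(dot(a, a) for a in A)        # each f_i is ||a_i||^2-smooth

def grad(i, x):
    r = dot(A[i], x) - b[i]
    return [r * ai for ai in A[i]]

def F(x):
    return sum(0.5 * (dot(A[i], x) - b[i]) ** 2 for i in range(n)) / n

for x in ([1.0, -2.0], [0.5, 5.0], [-3.0, 0.25]):
    lhs = 0.0
    for i in range(n):
        diff = [gx - gy for gx, gy in zip(grad(i, x), grad(i, y))]
        lhs += dot(diff, diff) / n
    assert lhs <= 2 * L * (F(x) - F(y)) + 1e-12   # inequality (A.7)
```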

Now, let ss\in\mathbb{N} and k{0,,m1}k\in\{0,\dots,m-1\} and set, for the sake of brevity, jk=ism+kj_{k}=i_{sm+k}. Defining 𝔉k=𝔉s,k=σ(ξ0,,ξs1,i0,,ism+k1)\mathfrak{F}_{k}=\mathfrak{F}_{s,k}=\sigma(\xi_{0},\dots,\xi_{s-1},i_{0},\dots,i_{sm+k-1}) and 𝖤k[]=𝖤[|𝔉k]\mathsf{E}_{k}[\cdot]=\mathsf{E}[\cdot\,|\,\mathfrak{F}_{k}], and recalling that

𝒗k=fjk(𝒙k)fjk(𝒙~s)+F(𝒙~s)and𝒚~s=PargminF(𝒙~s),\bm{v}^{k}=\nabla f_{j_{k}}(\bm{x}^{k})-\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})+\nabla F(\tilde{\bm{x}}^{s})\quad\text{and}\quad\tilde{\bm{y}}^{s}=P_{\operatorname*{\text{\rm{argmin}}}F}(\tilde{\bm{x}}^{s}),

we obtain

𝖤k[𝒗k2]\displaystyle\mathsf{E}_{k}\big{[}\|\bm{v}^{k}\|^{2}\big{]} 2𝖤kfjk(𝒙k)fjk(𝒚~s)2+2𝖤k[fjk(𝒙~s)fjk(𝒚~s)]F(𝒙~s)2\displaystyle\leq 2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\bm{x}^{k})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}+2\mathsf{E}_{k}\big{\|}[\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})]-\nabla F(\tilde{\bm{x}}^{s})\big{\|}^{2}
=2𝖤k[fjk(𝒙~s)fjk(𝒚~s)]𝖤k[fjk(𝒙~s)fjk(𝒚~s)]2\displaystyle=2\mathsf{E}_{k}\big{\|}[\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{]}-\mathsf{E}_{k}[\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})]\big{\|}^{2}
+2𝖤kfjk(𝒙k)fjk(𝒚~s)2\displaystyle\quad+2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\bm{x}^{k})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}
=2𝖤kfjk(𝒙~s)fjk(𝒚~s)22𝖤k[fjk(𝒙~s)fjk(𝒚~s)]2\displaystyle=2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}-2\big{\|}\mathsf{E}_{k}[\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})]\big{\|}^{2}
+2𝖤kfjk(𝒙k)fjk(𝒚~s)2\displaystyle\qquad+2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\bm{x}^{k})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}
=2𝖤kfjk(𝒙k)fjk(𝒚~s)2+2𝖤kfjk(𝒙~s)fjk(𝒚~s)2\displaystyle=2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\bm{x}^{k})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}+2\mathsf{E}_{k}\big{\|}\nabla f_{j_{k}}(\tilde{\bm{x}}^{s})-\nabla f_{j_{k}}(\tilde{\bm{y}}^{s})\big{\|}^{2}
2F(𝒙~s)2\displaystyle\qquad-2\|\nabla F(\tilde{\bm{x}}^{s})\|^{2}
4L(F(𝒙k)F(𝒚~s))+2σk22F(𝒙~s)2,\displaystyle\leq 4L\big{(}F(\bm{x}^{k})-F(\tilde{\bm{y}}^{s})\big{)}+2\sigma_{k}^{2}-2\|\nabla F(\tilde{\bm{x}}^{s})\|^{2},

where in the last inequality, we used (A.7) with 𝘆=𝒚~s\bm{\mathsf{y}}=\tilde{\bm{y}}^{s}. ∎

Proof of Lemma 4.11. The proof of equation (4.11) is analogous to that of (4.5) in Lemma 4.5, using 𝒖k\bm{u}^{k} instead of 𝒙~s\tilde{\bm{x}}^{s} and 𝖤k[]=𝖤[|𝔉k]\mathsf{E}_{k}[\cdot]=\mathsf{E}[\cdot\,|\,\mathfrak{F}_{k}] with 𝔉k=σ(i0,,ik1,ε0,,εk1)\mathfrak{F}_{k}=\sigma(i_{0},\dots,i_{k-1},\varepsilon^{0},\dots,\varepsilon^{k-1}), which ensures that 𝒙k\bm{x}^{k} and 𝒖k\bm{u}^{k} are 𝔉k\mathfrak{F}_{k}-measurable, and iki_{k} and εk\varepsilon^{k} are independent of 𝔉k\mathfrak{F}_{k}. Concerning (4.12), we note that

σk+12=1ni=1nfi((1εk)𝒖k+εk𝒙k)fi(PargminF((1εk)𝒖k+εk𝒙k))2.\sigma_{k+1}^{2}=\frac{1}{n}\sum_{i=1}^{n}\big{\lVert}\nabla f_{i}((1-\varepsilon^{k})\bm{u}^{k}+\varepsilon^{k}\bm{x}^{k})-\nabla f_{i}(P_{\operatorname*{\text{\rm{argmin}}}F}((1-\varepsilon^{k})\bm{u}^{k}+\varepsilon^{k}\bm{x}^{k}))\big{\rVert}^{2}.

Therefore, we have

𝖤k[σk+12]\displaystyle\mathsf{E}_{k}\left[\sigma^{2}_{k+1}\right] =(1p)σk2+p1ni=1nfi(𝒙k)fi(PargminF(𝒙k))2\displaystyle=(1-p)\sigma^{2}_{k}+p\frac{1}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{x}^{k})-\nabla f_{i}(P_{\operatorname*{\text{\rm{argmin}}}F}(\bm{x}^{k}))\big{\|}^{2}
(1p)σk2+2pL(F(𝒙k)F),\displaystyle\leq(1-p)\sigma^{2}_{k}+2pL(F(\bm{x}^{k})-F_{*}),

where in the last inequality we used (A.7) with 𝘆=PargminF(𝒙k)\bm{\mathsf{y}}=P_{\operatorname*{\text{\rm{argmin}}}F}(\bm{x}^{k}). ∎

Proof of Lemma 4.17. We recall that, by definition,

(i[n])ϕik+1=ϕik+δi,ik(𝒙kϕik).(\forall\,i\in[n])\quad\bm{\phi}^{k+1}_{i}=\bm{\phi}^{k}_{i}+\delta_{i,i_{k}}(\bm{x}^{k}-\bm{\phi}^{k}_{i}).

Now, set 𝖤k[]=𝖤[|𝔉k]\mathsf{E}_{k}[\cdot]=\mathsf{E}[\cdot|\,\mathfrak{F}_{k}] with 𝔉k=σ(i0,,ik1)\mathfrak{F}_{k}=\sigma(i_{0},\dots,i_{k-1}), so that ϕik\bm{\phi}^{k}_{i} and 𝒙k\bm{x}^{k} are 𝔉k\mathfrak{F}_{k}-measurable and iki_{k} is independent of 𝔉k\mathfrak{F}_{k}. By definition of 𝒗k\bm{v}^{k}, and using the fact that F(𝒙)=0\nabla F(\bm{x}^{*})=0 and inequality (A.7), we have

𝖤k[𝒗k2]\displaystyle\mathsf{E}_{k}\big{[}\|\bm{v}^{k}\|^{2}\big{]} =𝖤k[fik(𝒙k)fik(ϕikk)+1ni=1nfi(ϕik)F(𝒙)2]\displaystyle=\mathsf{E}_{k}\left[\Big{\|}\nabla f_{i_{k}}(\bm{x}^{k})-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})+\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla F(\bm{x}^{*})\Big{\|}^{2}\right]
=𝖤k[fik(𝒙k)fik(𝒙)+fik(𝒙)fik(ϕikk)\displaystyle=\mathsf{E}_{k}\bigg{[}\Big{\|}\nabla f_{i_{k}}(\bm{x}^{k})-\nabla f_{i_{k}}(\bm{x}^{*})+\nabla f_{i_{k}}(\bm{x}^{*})-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})
+1ni=1nfi(ϕik)F(𝒙)2]\displaystyle\qquad\qquad+\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla F(\bm{x}^{*})\Big{\|}^{2}\bigg{]}
2𝖤k[fik(𝒙k)fik(𝒙)2]\displaystyle\leq 2\mathsf{E}_{k}\left[\big{\|}\nabla f_{i_{k}}(\bm{x}^{k})-\nabla f_{i_{k}}(\bm{x}^{*})\big{\|}^{2}\right]
+2𝖤k[fik(𝒙)fik(ϕikk)𝖤k[fik(𝒙)fik(ϕikk)]2]\displaystyle\qquad\qquad+2\mathsf{E}_{k}\left[\big{\|}\nabla f_{i_{k}}(\bm{x}^{*})-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})-\mathsf{E}_{k}[\nabla f_{i_{k}}(\bm{x}^{*})-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})]\big{\|}^{2}\right]
=2ni=1nfi(𝒙k)fi(𝒙)2+2𝖤k[fik(𝒙)fik(ϕikk)2]\displaystyle=\frac{2}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{x}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}+2\mathsf{E}_{k}\left[\big{\|}\nabla f_{i_{k}}(\bm{x}^{*})-\nabla f_{i_{k}}(\bm{\phi}_{i_{k}}^{k})\big{\|}^{2}\right]
21ni=1nfi(ϕik)2\displaystyle\qquad\qquad-2\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k})\bigg{\|}^{2}
4L(F(𝒙k)F)+21ni=1nfi(ϕik)fi(𝒙)2σk221ni=1nfi(ϕik)2.\displaystyle\leq 4L\big{(}F(\bm{x}^{k})-F_{*}\big{)}+2\underbrace{\frac{1}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}}_{\sigma_{k}^{2}}-2\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\bm{\phi}_{i}^{k})\bigg{\|}^{2}.

For the second part (4.14), we proceed as follows

𝖤k[σk+12]\displaystyle\mathsf{E}_{k}\left[\sigma_{k+1}^{2}\right] =1ni=1n𝖤k[fi(ϕik+1)fi(𝒙)2]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathsf{E}_{k}\left[\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k+1})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}\right]
=1ni=1n𝖤k[fi(ϕik+δi,ik(𝒙kϕik))fi(𝒙)2]\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\mathsf{E}_{k}\left[\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k}+\delta_{i,i_{k}}(\bm{x}^{k}-\bm{\phi}^{k}_{i}))-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}\right]
=1ni=1n(1nj=1nfi(ϕik+δi,j(𝒙kϕik))fi(𝒙)2)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bigg{(}\frac{1}{n}\sum_{j=1}^{n}\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k}+\delta_{i,j}(\bm{x}^{k}-\bm{\phi}^{k}_{i}))-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}\bigg{)}
=1ni=1n(n1nfi(ϕik)fi(𝒙)2+1nfi(𝒙k)fi(𝒙)2)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{n-1}{n}\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}+\frac{1}{n}\big{\|}\nabla f_{i}(\bm{x}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}\right)
=(11n)1ni=1nfi(ϕik)fi(𝒙)2+1n2i=1nfi(𝒙k)fi(𝒙)2\displaystyle=\left(1-\frac{1}{n}\right)\frac{1}{n}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{\phi}_{i}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}+\frac{1}{n^{2}}\sum_{i=1}^{n}\big{\|}\nabla f_{i}(\bm{x}^{k})-\nabla f_{i}(\bm{x}^{*})\big{\|}^{2}
(11n)σk2+2Ln(F(𝒙k)F),\displaystyle\leq\left(1-\frac{1}{n}\right)\sigma_{k}^{2}+\frac{2L}{n}\big{(}F(\bm{x}^{k})-F_{*}\big{)},

where in the last inequality we used inequality (A.7) with 𝘆=𝒙\bm{\mathsf{y}}=\bm{x}^{*}. ∎
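The case analysis above (index i is replaced by x^k with probability 1/n, kept otherwise) can be verified numerically; the sketch below checks the resulting exact identity for E_k[σ_{k+1}²], before the final inequality is applied, on a toy least-squares instance (illustrative data):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sq(u):
    return dot(u, u)

# Toy instance: f_i(x) = 0.5*(<a_i, x> - b_i)^2 with A x* = b,
# so that grad F(x*) = 0 and x* plays the role of the minimizer.
A = [[1.0, 0.0], [0.0, 2.0]]
b = [1.0, 2.0]
x_star = [1.0, 1.0]
n = len(A)

def grad(i, x):
    r = dot(A[i], x) - b[i]
    return [r * ai for ai in A[i]]

def sigma2(table):
    """sigma_k^2 = (1/n) sum_i ||grad f_i(phi_i) - grad f_i(x*)||^2."""
    return sum(sq([g - gs for g, gs in zip(grad(i, table[i]), grad(i, x_star))])
               for i in range(n)) / n

phi = [[0.3, -0.5], [1.2, 0.7]]      # current table phi^k
x = [0.8, 0.2]                       # current iterate x^k

# Left side: average of sigma_{k+1}^2 over the n equally likely updates
# phi_j <- x^k (all other entries are kept).
lhs = sum(sigma2([x if i == j else phi[i] for i in range(n)])
          for j in range(n)) / n
# Right side: the closed form derived in the proof.
rhs = (1 - 1 / n) * sigma2(phi) + sum(
    sq([g - gs for g, gs in zip(grad(i, x), grad(i, x_star))])
    for i in range(n)) / n ** 2
assert abs(lhs - rhs) < 1e-12
```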

References

  • [1] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1080–1089, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • [2] V. Apidopoulos, N. Ginatta, and S. Villa. Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition. Journal of Global Optimization, 84(3):563–589, 2022.
  • [3] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.
  • [4] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer New York, NY, 2017.
  • [5] P. Bégout, J. Bolte, and M. A. Jendoubi. On damped second-order gradient systems. Journal of Differential Equations, 259(7):3115–3143, 2015.
  • [6] D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical programming, 129(2):163–195, 2011.
  • [7] P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.
  • [8] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  • [9] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471–507, 2017.
  • [10] L. Bottou. On-line Learning and Stochastic Approximations, page 9–42. Publications of the Newton Institute. Cambridge University Press, 1999.
  • [11] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of COMPSTAT’2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.
  • [12] G. Chierchia, E. Chouzenoux, P. L. Combettes, and J.-C. Pesquet. The proximity operator repository: user’s guide. http://proximity-operator.net/download/guide.pdf, 6, 2020.
  • [13] A. Defazio. A simple practical accelerated method for finite sums. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27:1646–1654, 2014.
  • [15] D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
  • [16] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
  • [17] C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 689–699. Curran Associates, Inc., 2018.
  • [18] G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: beyond the worst-case with the help of geometry. Mathematical Programming, 198:937–996, 2023.
  • [19] P. Gong and J. Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.
  • [20] E. Gorbunov, F. Hanzely, and P. Richtarik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 680–690. PMLR, 26–28 Aug 2020.
  • [21] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2305–2313. Curran Associates, Inc., 2015.
  • [22] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26, pages 315–323. Curran Associates, Inc., 2013.
  • [23] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In P. Frasconi, N. Landwehr, G. Manco, and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages 795–811, Cham, 2016. Springer International Publishing.
  • [24] A. Khaled and C. Jin. Faster federated optimization under second-order similarity. In The Eleventh International Conference on Learning Representations, 2023.
  • [25] A. Khaled and P. Richtárik. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023. Survey Certification.
  • [26] J. L. Kim, P. Toulis, and A. Kyrillidis. Convergence and stability of the stochastic proximal point algorithm with momentum. In R. Firoozi, N. Mehr, E. Yel, R. Antonova, J. Bohg, M. Schwager, and M. Kochenderfer, editors, Proceedings of The 4th Annual Learning for Dynamics and Control Conference, volume 168 of Proceedings of Machine Learning Research, pages 1034–1047. PMLR, 23–24 Jun 2022.
  • [27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [28] D. Kovalev, S. Horváth, and P. Richtárik. Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In A. Kontorovich and G. Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 451–467. PMLR, 08 Feb–11 Feb 2020.
  • [29] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
  • [30] A. Milzarek, F. Schaipp, and M. Ulbrich. A semismooth newton stochastic proximal point algorithm with variance reduction. SIAM Journal on Optimization, 34(1):1157–1185, 2024.
  • [31] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 451–459. Curran Associates, Inc., 2011.
  • [32] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • [33] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 06–11 Aug 2017.
  • [34] A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. The Journal of Machine Learning Research, 18(1):7204–7245, 2017.
  • [35] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.
  • [36] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [37] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • [38] E. K. Ryu and S. Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014.
  • [39] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.
  • [40] O. Sebbouh, R. M. Gower, and A. Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In M. Belkin and S. Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3935–3971. PMLR, 15–19 Aug 2021.
  • [41] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [42] P. Toulis and E. M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
  • [43] P. Toulis, T. Horel, and E. M. Airoldi. The proximal robbins–monro method. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(1):188–212, 2021.
  • [44] P. Toulis, D. Tran, and E. Airoldi. Towards stability and optimality in stochastic gradient descent. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1290–1298, Cadiz, Spain, 09–11 May 2016. PMLR.
  • [45] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 2406–2416. Curran Associates, Inc., 2019.
  • [46] J. Zhang and X. Zhu. Linear convergence of prox-svrg method for separable non-smooth convex optimization problems under bounded metric subregularity. Journal of Optimization Theory and Applications, 192(2):564–597, 2022.