
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm

Ji Liu (ji.liu.uwisc@gmail.com)
Stephen J. Wright (swright@cs.wisc.edu)
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706
Christopher Ré (chrismre@cs.stanford.edu)
Department of Computer Science, Stanford University, Stanford, CA 94305
Victor Bittorf (victor.bittorf@cloudera.com)
Cloudera, Inc., Palo Alto, CA 94304
Srikrishna Sridhar (srikris@graphlab.com)
GraphLab Inc., Seattle, WA 98103
Abstract

We describe an asynchronous parallel stochastic coordinate descent algorithm for minimizing smooth unconstrained or separably constrained functions. The method achieves a linear convergence rate on functions that satisfy an essential strong convexity property and a sublinear rate ($1/K$) on general convex functions. Near-linear speedup on a multicore system can be expected if the number of processors is $O(n^{1/2})$ in unconstrained optimization and $O(n^{1/4})$ in the separable-constrained case, where $n$ is the number of variables. We describe results from implementation on 40-core processors.

Keywords: Asynchronous Parallel Optimization, Stochastic Coordinate Descent

1 Introduction

Consider the convex optimization problem

\min_{x\in\Omega}\; f(x), \qquad (1)

where $\Omega\subset\mathbb{R}^{n}$ is a closed convex set and $f$ is a smooth convex mapping from an open neighborhood of $\Omega$ to $\mathbb{R}$. We consider two particular cases of $\Omega$ in this paper: the unconstrained case $\Omega=\mathbb{R}^{n}$, and the separable case

\Omega=\Omega_{1}\times\Omega_{2}\times\dotsb\times\Omega_{n}, \qquad (2)

where each $\Omega_{i}$, $i=1,2,\dotsc,n$, is a closed subinterval of the real line.

Formulations of the type (1), (2) arise in many data analysis and machine learning problems, for example, support vector machines (linear or nonlinear dual formulation) (Cortes and Vapnik, 1995), LASSO (after decomposing $x$ into positive and negative parts) (Tibshirani, 1996), and logistic regression. Algorithms based on gradient and approximate or partial gradient information have proved effective in these settings. We mention in particular gradient projection and its accelerated variants (Nesterov, 2004), accelerated proximal gradient methods for regularized objectives (Beck and Teboulle, 2009), and stochastic gradient methods (Nemirovski et al., 2009; Shamir and Zhang, 2013). These methods are inherently serial, in that each iteration depends on the result of the previous iteration. Recently, parallel multicore versions of stochastic gradient and stochastic coordinate descent have been described for problems involving large data sets; see for example Niu et al. (2011); Richtárik and Takáč (2012b); Avron et al. (2014).

This paper proposes an asynchronous stochastic coordinate descent (AsySCD) algorithm for convex optimization. Each step of AsySCD chooses an index $i\in\{1,2,\dotsc,n\}$ and subtracts a short, constant, positive multiple of the $i$th partial gradient $\nabla_{i}f(x):=\partial f/\partial x_{i}$ from the $i$th component of $x$. When separable constraints (2) are present, the update is “clipped” to maintain feasibility with respect to $\Omega_{i}$. Updates take place in parallel across the cores of a multicore system, without any attempt to synchronize computation between cores. We assume that there is a bound $\tau$ on the age of the updates, that is, no more than $\tau$ updates to $x$ occur between the time at which a processor reads $x$ (and uses it to evaluate one element of the gradient) and the time at which this processor makes its update to a single element of $x$. (A similar model of parallel asynchronous computation was used in Hogwild! (Niu et al., 2011).) Our implementation, described in Section 6, is a little more complex than this simple model would suggest, as it is tailored to the architecture of the Intel Xeon machine that we use for experiments.

We show that linear convergence can be attained if an “essential strong convexity” property (3) holds, while sublinear convergence at a “$1/K$” rate can be proved for general convex functions. Our analysis also defines a sufficient condition for near-linear speedup in the number of cores used. This condition relates the value of the delay parameter $\tau$ (which is related to the number of cores / threads used in the computation) to the problem dimension $n$. A parameter that quantifies the cross-coordinate interactions in $\nabla f$ also appears in this relationship. When the Hessian of $f$ is nearly diagonal, the minimization problem can almost be separated along the coordinate axes, so higher degrees of parallelism are possible.

We review related work in Section 2. Section 3 specifies the proposed algorithm. Convergence results for unconstrained and constrained cases are described in Sections 4 and 5, respectively, with proofs given in the appendix. Computational experience is reported in Section 6. We discuss several variants of AsySCD in Section 7. Some conclusions are given in Section 8.

Notation and Assumption

We use the following notation.

  • $e_{i}\in\mathbb{R}^{n}$ denotes the $i$th natural basis vector $(0,\dotsc,0,1,0,\dotsc,0)^{T}$, with the “1” in the $i$th position.

  • $\|\cdot\|$ denotes the Euclidean norm $\|\cdot\|_{2}$.

  • $S\subset\Omega$ denotes the set on which $f$ attains its optimal value, which is denoted by $f^{*}$.

  • $\mathcal{P}_{S}(\cdot)$ and $\mathcal{P}_{\Omega}(\cdot)$ denote Euclidean projection onto $S$ and $\Omega$, respectively.

  • We use $x_{i}$ for the $i$th element of $x$, and $\nabla_{i}f(x)$ for the $i$th element of the gradient vector $\nabla f(x)$.

  • We define the following essential strong convexity condition for a convex function $f$ with respect to the optimal set $S$, with parameter $l>0$:

    f(x)-f(y)\geq\langle\nabla f(y),x-y\rangle+{l\over 2}\|x-y\|^{2}\quad\mbox{for all $x,y\in\Omega$ with $\mathcal{P}_{S}(x)=\mathcal{P}_{S}(y)$.} \qquad (3)

    This condition is significantly weaker than the usual strong convexity condition, which requires the inequality to hold for all $x,y\in\Omega$. In particular, it allows for non-singleton solution sets $S$, provided that $f$ increases at a uniformly quadratic rate with distance from $S$. (This property holds, for example, for convex quadratic $f$ whose Hessian is rank deficient.) Other examples of essentially strongly convex functions that are not strongly convex include:

    • $f(Ax)$ with arbitrary linear transformation $A$, where $f(\cdot)$ is strongly convex;

    • $f(x)=\max(a^{T}x-b,0)^{2}$, for $a\neq 0$.

  • Define $L_{\rm res}$ as the restricted Lipschitz constant for $\nabla f$, where the “restriction” is to the coordinate directions: We have

    \|\nabla f(x)-\nabla f(x+te_{i})\|\leq L_{\rm res}|t|,\quad\mbox{for all $i=1,2,\dotsc,n$ and $t\in\mathbb{R}$, with $x,x+te_{i}\in\Omega$.}

  • Define $L_{i}$ as the coordinate Lipschitz constant for $\nabla f$ in the $i$th coordinate direction: We have

    f(x+te_{i})-f(x)\leq\langle\nabla_{i}f(x),\,t\rangle+{L_{i}\over 2}t^{2},\quad\mbox{for $i\in\{1,2,\dotsc,n\}$, and $x,x+te_{i}\in\Omega$,}

    or equivalently

    |\nabla_{i}f(x)-\nabla_{i}f(x+te_{i})|\leq L_{i}|t|.

  • $L_{\max}:=\max_{i=1,2,\dotsc,n}L_{i}$.

  • For the initial point $x_{0}$, we define

    R_{0}:=\|x_{0}-\mathcal{P}_{S}(x_{0})\|. \qquad (4)

Note that $L_{\rm res}\geq L_{\max}$.

We use $\{x_{j}\}_{j=0,1,2,\dotsc}$ to denote the sequence of iterates generated by the algorithm from starting point $x_{0}$. Throughout the paper, we assume that $S$ is nonempty.

Lipschitz Constants

The nonstandard Lipschitz constants $L_{\rm res}$, $L_{\max}$, and $L_{i}$, $i=1,2,\dotsc,n$, defined above are crucial in the analysis of our method. Besides bounding the nonlinearity of $f$ along various directions, these quantities capture the interactions between the various components in the gradient $\nabla f$, as quantified in the off-diagonal terms of the Hessian $\nabla^{2}f(x)$ — although the stated conditions do not require this matrix to exist.

We have noted already that $L_{\rm res}/L_{\max}\geq 1$. Let us consider upper bounds on this ratio under certain conditions. When $f$ is twice continuously differentiable, we have

L_{\max}=\sup_{x\in\Omega}\,\max_{i=1,2,\dotsc,n}\,[\nabla^{2}f(x)]_{ii}.

Since $\nabla^{2}f(x)\succeq 0$ for $x\in\Omega$, we have that

|[\nabla^{2}f(x)]_{ij}|\leq\sqrt{L_{i}L_{j}}\leq L_{\max},\quad\forall\,i,j=1,2,\dotsc,n.

Thus $L_{\rm res}$, which is a bound on the largest column norm for $\nabla^{2}f(x)$ over all $x\in\Omega$, is bounded by $\sqrt{n}L_{\max}$, so that

\frac{L_{\rm res}}{L_{\max}}\leq\sqrt{n}.

If the Hessian is structurally sparse, having at most $p$ nonzeros per row/column, the same argument leads to $L_{\rm res}/L_{\max}\leq\sqrt{p}$.

If $f(x)$ is a convex quadratic with Hessian $Q$, we have

L_{\max}=\max_{i}\,Q_{ii},\qquad L_{\rm res}=\max_{i}\|Q_{\cdot i}\|_{2},

where $Q_{\cdot i}$ denotes the $i$th column of $Q$. If $Q$ is diagonally dominant, we have for any column $i$ that

\|Q_{\cdot i}\|_{2}\leq Q_{ii}+\|[Q_{ji}]_{j\neq i}\|_{2}\leq Q_{ii}+\sum_{j\neq i}|Q_{ji}|\leq 2Q_{ii},

which, by taking the maximum of both sides, implies that $L_{\rm res}/L_{\max}\leq 2$ in this case.

Finally, consider the objective $f(x)=\frac{1}{2}\|Ax-b\|^{2}$, and assume that $A\in\mathbb{R}^{m\times n}$ is a random matrix whose entries are i.i.d. from $\mathcal{N}(0,1)$. The diagonals of the Hessian are $A_{\cdot i}^{T}A_{\cdot i}$ (where $A_{\cdot i}$ is the $i$th column of $A$), which have expected value $m$, so we can expect $L_{\max}$ to be not less than $m$. Recalling that $L_{\rm res}$ is the maximum column norm of $A^{T}A$, we have

\mathbb{E}(\|A^{T}A_{\cdot i}\|) \leq\mathbb{E}(|A_{\cdot i}^{T}A_{\cdot i}|)+\mathbb{E}(\|[A_{\cdot j}^{T}A_{\cdot i}]_{j\neq i}\|)
  =m+\mathbb{E}\sqrt{\sum_{j\neq i}|A_{\cdot j}^{T}A_{\cdot i}|^{2}}
  \leq m+\sqrt{\sum_{j\neq i}\mathbb{E}|A_{\cdot j}^{T}A_{\cdot i}|^{2}}
  =m+\sqrt{(n-1)m},

where the second inequality uses Jensen’s inequality and the final equality uses

\mathbb{E}(|A_{\cdot j}^{T}A_{\cdot i}|^{2})=\mathbb{E}(A_{\cdot j}^{T}\mathbb{E}(A_{\cdot i}A_{\cdot i}^{T})A_{\cdot j})=\mathbb{E}(A_{\cdot j}^{T}IA_{\cdot j})=\mathbb{E}(A_{\cdot j}^{T}A_{\cdot j})=m.

We can thus estimate the upper bound on $L_{\rm res}/L_{\max}$ roughly by $1+\sqrt{n/m}$ for this case.
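
To make this estimate concrete, the following small numerical sketch (our own illustration, not part of the paper) computes $L_{\max}$ and $L_{\rm res}$ for a random Gaussian least-squares Hessian $Q=A^{T}A$ and compares the observed ratio with the rough estimate $1+\sqrt{n/m}$; the sizes m and n are chosen only to keep the check cheap.

    import numpy as np

    # Sketch: estimate L_max and L_res for f(x) = 0.5*||Ax - b||^2, whose
    # Hessian is Q = A^T A, with i.i.d. Gaussian A.
    rng = np.random.default_rng(0)
    m, n = 600, 2000                 # small sizes, chosen only for a quick check
    A = rng.standard_normal((m, n))
    Q = A.T @ A                      # Hessian of f

    L_i = np.diag(Q)                             # coordinate Lipschitz constants
    L_max = L_i.max()
    L_res = np.linalg.norm(Q, axis=0).max()      # largest column norm of Q

    print(L_res / L_max)             # observed ratio (typically below the estimate)
    print(1 + np.sqrt(n / m))        # rough upper estimate derived in the text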

2 Related Work

This section reviews some related work on coordinate relaxation and stochastic gradient algorithms.

Among cyclic coordinate descent algorithms, Tseng (2001) proved the convergence of a block coordinate descent method for nondifferentiable functions under certain conditions. Local and global linear convergence were established under additional assumptions by Luo and Tseng (1992) and Wang and Lin (2014), respectively. A global linear (sublinear) convergence rate for strongly (weakly) convex optimization was proved by Beck and Tetruashvili (2013). Block-coordinate approaches based on proximal-linear subproblems are described by Tseng and Yun (2009, 2010). Wright (2012) uses acceleration on reduced spaces (corresponding to the optimal manifold) to improve the local convergence properties of this approach.

Stochastic coordinate descent is almost identical to cyclic coordinate descent, except that coordinates are selected randomly rather than cyclically. Nesterov (2012) studied the convergence rate of a stochastic block coordinate descent method for unconstrained and separably constrained convex smooth optimization, proving linear convergence for the strongly convex case and a sublinear $1/K$ rate for the convex case. Extensions to minimization of composite functions are described by Richtárik and Takáč (2012a) and Lu and Xiao (2013).

Synchronous parallel methods distribute the workload and data among multiple processors and coordinate the computation among processors. Ferris and Mangasarian (1994) proposed to distribute variables among multiple processors and optimize concurrently over each subset. The synchronization step searches the affine hull formed by the current iterate and the points found by each processor. Similar ideas appeared in (Mangasarian, 1995), with a different synchronization step. Goldfarb and Ma (2012) considered a multiple splitting algorithm for functions of the form $f(x)=\sum_{k=1}^{N}f_{k}(x)$ in which $N$ models are optimized separately and concurrently, then combined in a synchronization step. The alternating direction method of multipliers (ADMM) framework (Boyd et al., 2011) can also be implemented in parallel. This approach dissects the problem into multiple subproblems (possibly after replication of primal variables), optimizes them concurrently, then synchronizes to update multiplier estimates. Duchi et al. (2012) described a subgradient dual-averaging algorithm for partially separable objectives, with subgradient evaluations distributed between cores and combined in ways that reflect the structure of the objective. Parallel stochastic gradient approaches have received broad attention; see Agarwal and Duchi (2011) for an approach that allows delays between evaluation and update, and Cotter et al. (2011) for a minibatch stochastic gradient approach with Nesterov acceleration. Shalev-Shwartz and Zhang (2013) proposed an accelerated stochastic dual coordinate ascent method.

Among synchronous parallel methods for (block) coordinate descent, Richtárik and Takáč (2012b) described a method of this type for convex composite optimization problems. All processors update randomly selected coordinates or blocks, concurrently and synchronously, at each iteration. Speedup depends on the sparsity of the data matrix that defines the loss functions. Several variants that select blocks greedily are considered by Scherrer et al. (2012) and Peng et al. (2013). Yang (2013) studied the parallel stochastic dual coordinate ascent method and emphasized the balance between computation and communication.

We turn now to asynchronous parallel methods. Bertsekas and Tsitsiklis (1989) introduced an asynchronous parallel implementation for general fixed point problems $x=q(x)$ over a separable convex closed feasible region. (The optimization problem (1) can be formulated in this way by defining $q(x):=\mathcal{P}_{\Omega}[(I-\alpha\nabla f)(x)]$ for some fixed $\alpha>0$.) Their analysis allows inconsistent reads of $x$, that is, the coordinates of the read $x$ may have different “ages.” Linear convergence is established if all ages are bounded and $\nabla^{2}f(x)$ satisfies a diagonal dominance condition guaranteeing that the iteration $x=q(x)$ is a maximum-norm contraction mapping for sufficiently small $\alpha$. However, this condition is strong — stronger, in fact, than the strong convexity condition. For convex quadratic optimization $f(x)={1\over 2}x^{T}Ax+b^{T}x$, the contraction condition requires diagonal dominance of the Hessian: $A_{ii}>\sum_{j\neq i}|A_{ij}|$ for all $i=1,2,\dotsc,n$. By comparison, AsySCD guarantees a linear convergence rate under the essential strong convexity condition (3), though we do not allow inconsistent reads. (We require the vector $x$ used for each evaluation of $\nabla_{i}f(x)$ to have existed at a certain point in time.)

Hogwild! (Niu et al., 2011) is a lock-free, asynchronous parallel implementation of a stochastic-gradient method, targeted to a multicore computational model similar to the one considered here. Its analysis assumes consistent reading of $x$, and it is implemented without locking or coordination between processors. Under certain conditions, convergence of Hogwild! approximately matches the sublinear $1/K$ rate of its serial counterpart, which is the constant-steplength stochastic gradient method analyzed in Nemirovski et al. (2009).

We also note recent work by Avron et al. (2014), who proposed an asynchronous solver for linear systems $Ax=b$ in which $A$ is a symmetric positive definite matrix, proving a linear convergence rate. Both inconsistent- and consistent-read cases are analyzed in that paper, with the convergence result for the inconsistent-read case being slightly weaker.

The AsySCD algorithm described in this paper was extended in later work (Liu and Wright, 2014) to composite objectives consisting of a smooth convex function plus a separable convex function; that work pays particular attention to the inconsistent-read case.

3 Algorithm

In AsySCD, multiple processors have access to a shared data structure for the vector $x$, and each processor is able to compute a randomly chosen element of the gradient vector $\nabla f(x)$. Each processor repeatedly runs the following coordinate descent process (the steplength parameter $\gamma$ is discussed further in the next section):

  • R:

    Choose an index $i\in\{1,2,\dotsc,n\}$ at random, read $x$, and evaluate $\nabla_{i}f(x)$;

  • U:

    Update component $i$ of the shared $x$ by taking a step of length $\gamma/L_{\max}$ in the direction $-\nabla_{i}f(x)$.

Algorithm 1 Asynchronous Stochastic Coordinate Descent Algorithm $x_{K+1}=\mbox{\sc AsySCD}(x_{0},\gamma,K)$
Require: $x_{0}\in\Omega$, $\gamma$, and $K$
Ensure: $x_{K+1}$
1: Initialize $j\leftarrow 0$;
2: while $j\leq K$ do
3:   Choose $i(j)$ from $\{1,\dotsc,n\}$ with equal probability;
4:   $x_{j+1}\leftarrow\mathcal{P}_{\Omega}\left(x_{j}-\frac{\gamma}{L_{\max}}e_{i(j)}\nabla_{i(j)}f(x_{k(j)})\right)$;
5:   $j\leftarrow j+1$;
6: end while

Since these processors are being run concurrently and without synchronization, $x$ may change between the time at which it is read (in step R) and the time at which it is updated (step U). We capture the system-wide behavior of AsySCD in Algorithm 1. There is a global counter $j$ for the total number of updates; $x_{j}$ denotes the state of $x$ after $j$ updates. The index $i(j)\in\{1,2,\dotsc,n\}$ denotes the component updated at step $j$, and $k(j)$ denotes the $x$-iterate at which the update applied at iteration $j$ was calculated. Obviously, we have $k(j)\leq j$, but we assume that the delay between the time of evaluation and updating is bounded uniformly by a positive integer $\tau$, that is, $j-k(j)\leq\tau$ for all $j$. The value of $\tau$ captures the essential parallelism in the method, as it indicates the number of processors that are involved in the computation.

The projection operation $\mathcal{P}_{\Omega}$ onto the feasible set is not needed in the case of unconstrained optimization. For separable constraints (2), it requires a simple clipping operation on the $i(j)$ component of $x$.

We note several differences with earlier asynchronous approaches. Unlike the asynchronous scheme in Bertsekas and Tsitsiklis (1989, Section 6.1), the latest value of $x$ is updated at each step, not an earlier iterate. Although our model of computation is similar to that of Hogwild! (Niu et al., 2011), the algorithm differs in that each iteration of AsySCD evaluates a single component of the gradient exactly, while Hogwild! computes only a (usually crude) estimate of the full gradient. Our analysis of AsySCD below differs substantially from that of Niu et al. (2011), and we obtain stronger convergence results.
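
For concreteness, the following is a minimal Python sketch of the R/U cycle above, specialized to a quadratic objective $f(x)=\frac{1}{2}x^{T}Qx+c^{T}x$; it is our own illustration under the stated assumptions, not the implementation described in Section 6, and Python threads (limited by the GIL) only illustrate the structure of the asynchronous updates rather than true multicore speedup.

    import threading
    import numpy as np

    # Minimal sketch of AsySCD for f(x) = 0.5 x^T Q x + c^T x.
    def asyscd(Q, c, gamma, num_iters, num_threads=4, lo=None, hi=None):
        n = Q.shape[0]
        L_max = np.max(np.diag(Q))
        step = gamma / L_max
        x = np.zeros(n)                    # shared iterate, read and written without locks

        def worker(iters, seed):
            rng = np.random.default_rng(seed)
            for _ in range(iters):
                i = rng.integers(n)        # step R: choose a coordinate uniformly at random
                g_i = Q[i] @ x + c[i]      # read (possibly stale) x and evaluate grad_i f
                x_i = x[i] - step * g_i    # step U: short constant step along -grad_i f
                if lo is not None:         # clip to [lo_i, hi_i] in the separable case (2)
                    x_i = min(max(x_i, lo[i]), hi[i])
                x[i] = x_i                 # single-component write to the shared x

        threads = [threading.Thread(target=worker, args=(num_iters // num_threads, s))
                   for s in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x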

4 Unconstrained Smooth Convex Case

This section presents results about the convergence of AsySCD in the unconstrained case $\Omega=\mathbb{R}^{n}$. The theorem encompasses both the linear rate for essentially strongly convex $f$ and the sublinear rate for general convex $f$. The result depends strongly on the delay parameter $\tau$. (Proofs of results in this section appear in Appendix A.) In Algorithm 1, the indices $i(j)$, $j=0,1,2,\dotsc$ are random variables. We denote the expectation over all random variables by $\mathbb{E}$, and the conditional expectation with respect to $i(j)$ given $i(0),i(1),\dotsc,i(j-1)$ by $\mathbb{E}_{i(j)}$.

A crucial issue in AsySCD is the choice of steplength parameter $\gamma$. This choice involves a tradeoff: We would like $\gamma$ to be long enough that significant progress is made at each step, but not so long that the gradient information computed at step $k(j)$ is stale and irrelevant by the time the update is applied at step $j$. We enforce this tradeoff by means of a bound on the ratio of expected squared norms of $\nabla f$ at successive iterates; specifically,

\rho^{-1}\leq\frac{\mathbb{E}\|\nabla f(x_{j+1})\|^{2}}{\mathbb{E}\|\nabla f(x_{j})\|^{2}}\leq\rho, \qquad (5)

where $\rho>1$ is a user-defined parameter. The analysis becomes a delicate balancing act in the choice of $\rho$ and steplength $\gamma$ between aggressiveness and excessive conservatism. We find, however, that these values can be chosen to ensure steady convergence for the asynchronous method at a linear rate, with rate constants that are almost consistent with those of vanilla short-step full-gradient descent.

We use the following assumption in some of the results of this section.

Assumption 1

There is a real number $R$ such that

\|x_{j}-\mathcal{P}_{S}(x_{j})\|\leq R,\quad\mbox{for all $j=0,1,2,\dotsc$.}

Note that this assumption is not needed for our convergence results in the case of strongly convex functions; in the theorems below, it is invoked only when considering general convex functions.

Theorem 1

Suppose that $\Omega=\mathbb{R}^{n}$ in (1). For any $\rho>1$, define the quantity $\psi$ as follows:

\psi:=1+\frac{2\tau\rho^{\tau}L_{\rm res}}{\sqrt{n}L_{\max}}. \qquad (6)

Suppose that the steplength parameter $\gamma>0$ satisfies the following three upper bounds:

\gamma\leq\frac{1}{\psi}, \qquad (7a)
\gamma\leq\frac{(\rho-1)\sqrt{n}L_{\max}}{2\rho^{\tau+1}L_{\rm res}}, \qquad (7b)
\gamma\leq\frac{(\rho-1)\sqrt{n}L_{\max}}{L_{\rm res}\rho^{\tau}(2+\frac{L_{\rm res}}{\sqrt{n}L_{\max}})}. \qquad (7c)

Then for any $j\geq 0$ we have

\rho^{-1}\mathbb{E}(\|\nabla f(x_{j})\|^{2})\leq\mathbb{E}(\|\nabla f(x_{j+1})\|^{2})\leq\rho\,\mathbb{E}(\|\nabla f(x_{j})\|^{2}). \qquad (8)

Moreover, if the essential strong convexity property (3) holds with $l>0$, we have

\mathbb{E}(f(x_{j})-f^{*})\leq\left(1-\frac{2l\gamma}{nL_{\max}}\left(1-\frac{\psi}{2}\gamma\right)\right)^{j}(f(x_{0})-f^{*}). \qquad (9)

For general smooth convex functions $f$, assuming additionally that Assumption 1 holds, we have

\mathbb{E}(f(x_{j})-f^{*})\leq\frac{1}{(f(x_{0})-f^{*})^{-1}+j\gamma(1-\frac{\psi}{2}\gamma)/(nL_{\max}R^{2})}. \qquad (10)

This theorem demonstrates linear convergence (9) for AsySCD in the unconstrained essentially strongly convex case. This result is better than that obtained for Hogwild! (Niu et al., 2011), which guarantees only sublinear convergence under the stronger assumption of strict convexity.

The following corollary proposes an interesting particular choice of the parameters for which the convergence expressions become more comprehensible. The result requires a condition on the delay bound $\tau$ in terms of $n$ and the ratio $L_{\max}/L_{\rm res}$.

Corollary 2

Suppose that

\tau+1\leq\frac{\sqrt{n}L_{\max}}{2eL_{\rm res}}. \qquad (11)

Then if we choose

\rho=1+{2eL_{\rm res}\over\sqrt{n}L_{\max}}, \qquad (12)

define $\psi$ by (6), and set $\gamma=1/\psi$, we have for the essentially strongly convex case (3) with $l>0$ that

\mathbb{E}(f(x_{j})-f^{*})\leq\left(1-\frac{l}{2nL_{\max}}\right)^{j}(f(x_{0})-f^{*}). \qquad (13)

For the case of general convex $f$, if we assume additionally that Assumption 1 is satisfied, we have

\mathbb{E}(f(x_{j})-f^{*})\leq\frac{1}{(f(x_{0})-f^{*})^{-1}+{j}/(4nL_{\max}R^{2})}. \qquad (14)

We note that the linear rate (13) is broadly consistent with the linear rate for the classical steepest descent method applied to strongly convex functions, which has a rate constant of $(1-2l/L)$, where $L$ is the standard Lipschitz constant for $\nabla f$. If we assume (not unreasonably) that $n$ steps of stochastic coordinate descent cost roughly the same as one step of steepest descent, and note from (13) that $n$ steps of stochastic coordinate descent would achieve a reduction factor of about $(1-l/(2L_{\max}))$, a standard argument would suggest that stochastic coordinate descent would require about $4L_{\max}/L$ times more computation. (Note that $L_{\max}/L\in[1/n,1]$.) The stochastic approach may gain an advantage from the parallel implementation, however. Steepest descent requires synchronization and careful division of gradient evaluations, whereas the stochastic approach can be implemented in an asynchronous fashion.

For the general convex case, (14) defines a sublinear rate whose relationship to the rate of steepest descent for general convex optimization is similar to that described in the previous paragraph.

As noted in Section 1, the parameter $\tau$ is closely related to the number of cores that can be involved in the computation without degrading the convergence performance of the algorithm. In other words, if the number of cores is small enough that (11) holds, the convergence expressions (13), (14) do not depend on the number of cores, implying that linear speedup can be expected. A small value for the ratio $L_{\rm res}/L_{\max}$ (not much greater than $1$) implies a greater degree of potential parallelism. As we noted at the end of Section 1, this ratio tends to be small in some important applications — a situation that would allow $O(\sqrt{n})$ cores to be used with near-linear speedup.
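
The following short sketch (our own, not the authors' code) makes the parameter choices of Corollary 2 operational: it checks the delay bound (11), sets $\rho$ by (12) and $\psi$ by (6), and returns $\gamma=1/\psi$ together with the contraction factor in (13); the inputs $L_{\rm res}$, $L_{\max}$, $n$, $\tau$, and (optionally) $l$ are assumed known or estimated.

    import math

    def corollary2_params(n, L_res, L_max, tau, l=None):
        max_delay = math.sqrt(n) * L_max / (2 * math.e * L_res) - 1      # from (11)
        if tau > max_delay:
            raise ValueError(f"tau={tau} violates (11); need tau <= {max_delay:.1f}")
        rho = 1 + 2 * math.e * L_res / (math.sqrt(n) * L_max)            # (12)
        psi = 1 + 2 * tau * rho**tau * L_res / (math.sqrt(n) * L_max)    # (6)
        gamma = 1.0 / psi
        # Per-iteration contraction factor appearing in the linear rate (13).
        rate = 1 - l / (2 * n * L_max) if l is not None else None
        return gamma, rho, psi, rate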

We conclude this section with a high-probability estimate for convergence of the sequence of function values.

Theorem 3

Suppose that the assumptions of Corollary 2 hold, including the definitions of $\rho$ and $\psi$. Then for any $\epsilon\in(0,f(x_{0})-f^{*})$ and $\eta\in(0,1)$, we have that

\mathbb{P}\left(f(x_{j})-f^{*}\leq\epsilon\right)\geq 1-\eta, \qquad (15)

provided that either of the following sufficient conditions holds for the index $j$. In the essentially strongly convex case (3) with $l>0$, it suffices to have

j\geq\frac{2nL_{\max}}{l}\left(\log{f(x_{0})-f^{*}\over\epsilon\eta}\right). \qquad (16)

For the general convex case, if we assume additionally that Assumption 1 holds, a sufficient condition is

j\geq 4nL_{\max}R^{2}\left({1\over\epsilon\eta}-{1\over f(x_{0})-f^{*}}\right). \qquad (17)

5 Constrained Smooth Convex Case

This section considers the case of separable constraints (2). We show results about convergence rates and high-probability complexity estimates, analogous to those of the previous section. Proofs appear in Appendix B.

As in the unconstrained case, the steplength $\gamma$ should be chosen to ensure steady progress while ensuring that update information does not become too stale. Because constraints are present, the ratio (5) is no longer appropriate. We use instead a ratio of squares of expected differences in successive primal iterates:

\frac{\mathbb{E}\|x_{j-1}-\bar{x}_{j}\|^{2}}{\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}}, \qquad (18)

where $\bar{x}_{j+1}$ is the hypothesized full update obtained by applying the single-component update to every component of $x_{j}$, that is,

\bar{x}_{j+1}:=\arg\min_{x\in\Omega}\,\langle\nabla f(x_{k(j)}),x-x_{j}\rangle+\frac{L_{\max}}{2\gamma}\|x-x_{j}\|^{2}. \qquad (19)

In the unconstrained case $\Omega=\mathbb{R}^{n}$, the ratio (18) reduces to

\frac{\mathbb{E}\|\nabla f(x_{k(j-1)})\|^{2}}{\mathbb{E}\|\nabla f(x_{k(j)})\|^{2}},

which is evidently related to (5), but not identical.

We have the following result concerning convergence of the expected error to zero.

Theorem 4

Suppose that $\Omega$ has the form (2) and that $n\geq 5$. Let $\rho$ be a constant with $\rho>\left(1-2/\sqrt{n}\right)^{-1}$, and define the quantity $\psi$ as follows:

\psi:=1+\frac{L_{\rm res}\tau\rho^{\tau}}{\sqrt{n}L_{\max}}\left(2+{L_{\max}\over\sqrt{n}L_{\rm res}}+{2\tau\over n}\right). \qquad (20)

Suppose that the steplength parameter $\gamma>0$ satisfies the following two upper bounds:

\gamma\leq{1\over\psi},\qquad\gamma\leq\left(1-{1\over\rho}-{2\over\sqrt{n}}\right)\frac{\sqrt{n}L_{\max}}{4L_{\rm res}\tau\rho^{\tau}}. \qquad (21)

Then we have

\mathbb{E}\|x_{j-1}-\bar{x}_{j}\|^{2}\leq\rho\,\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2},\quad j=1,2,\dotsc. \qquad (22)

If the essential strong convexity property (3) holds with $l>0$, we have for $j=1,2,\dotsc$ that

\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})\leq\left(1-\frac{l}{n(l+\gamma^{-1}L_{\max})}\right)^{j}\left(R_{0}^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})\right), \qquad (23)

where $R_{0}$ is defined in (4). For general smooth convex functions $f$, we have

\mathbb{E}f(x_{j})-f^{*}\leq\frac{n(R_{0}^{2}L_{\max}+2\gamma(f(x_{0})-f^{*}))}{2\gamma(n+j)}. \qquad (24)

Similarly to the unconstrained case, the following corollary proposes an interesting particular choice for the parameters for which the convergence expressions become more comprehensible. The result requires a condition on the delay bound $\tau$ in terms of $n$ and the ratio $L_{\max}/L_{\rm res}$.

Corollary 5

Suppose that $\tau\geq 1$ and $n\geq 5$ and that

\tau(\tau+1)\leq\frac{\sqrt{n}L_{\max}}{4eL_{\rm res}}. \qquad (25)

If we choose

\rho=1+{4e\tau L_{\rm res}\over\sqrt{n}L_{\max}}, \qquad (26)

then the steplength $\gamma=1/2$ will satisfy the bounds (21). In addition, for the essentially strongly convex case (3) with $l>0$, we have for $j=1,2,\dotsc$ that

\mathbb{E}(f(x_{j})-f^{*})\leq\left(1-\frac{l}{n(l+2L_{\max})}\right)^{j}(L_{\max}R_{0}^{2}+f(x_{0})-f^{*}), \qquad (27)

while for the case of general convex $f$, we have

\mathbb{E}(f(x_{j})-f^{*})\leq\frac{n(L_{\max}R_{0}^{2}+f(x_{0})-f^{*})}{j+n}. \qquad (28)

Similarly to Section 4, provided $\tau$ satisfies (25), the convergence rate is not affected appreciably by the delay bound $\tau$, and near-linear speedup can be expected for multicore implementations when (25) holds. This condition is more restrictive than (11) in the unconstrained case, but still holds in many problems for interesting values of $\tau$. When $L_{\rm res}/L_{\max}$ is bounded independently of dimension, the maximal number of cores allowed is of the order of $n^{1/4}$, which is smaller than the $O(n^{1/2})$ value obtained for the unconstrained case.

We conclude this section with another high-probability bound, whose proof tracks that of Theorem 3.

Theorem 6

Suppose that the conditions of Corollary 5 hold, including the definitions of $\rho$ and $\psi$. Then for $\epsilon>0$ and $\eta\in(0,1)$, we have that

\mathbb{P}\left(f(x_{j})-f^{*}\leq\epsilon\right)\geq 1-\eta,

provided that one of the following conditions holds: In the essentially strongly convex case (3) with $l>0$, we require

j\geq\frac{n(l+2L_{\max})}{l}\left|\log{L_{\max}R_{0}^{2}+f(x_{0})-f^{*}\over\epsilon\eta}\right|,

while in the general convex case, it suffices that

j\geq\frac{n(L_{\max}R_{0}^{2}+f(x_{0})-f^{*})}{\epsilon\eta}-n.

6 Experiments

We illustrate the behavior of two variants of the stochastic coordinate descent approach on test problems constructed from several data sets. Our interests are in the efficiency of multicore implementations (by comparison with a single-threaded implementation) and in performance relative to alternative solvers for the same problems.

All our test problems have the form (1), with either $\Omega=\mathbb{R}^{n}$ or $\Omega$ separable as in (2). The objective $f$ is quadratic, that is,

f(x)=\frac{1}{2}x^{T}Qx+c^{T}x,

with $Q$ symmetric positive definite.

Our implementation of AsySCD is called DIMM-WITTED (or DW for short). It runs on various numbers of threads, from 1 to 40, each thread being assigned to a single core in our 40-core Intel Xeon architecture. Cores on the Xeon architecture are arranged into four sockets — ten cores per socket, with each socket having its own memory. Non-uniform memory access (NUMA) means that memory accesses to local memory (on the same socket as the core) are less expensive than accesses to memory on another socket. In our DW implementation, we assign each socket an equal-sized “slice” of $Q$, a row submatrix. The components of $x$ are partitioned between cores, each core being responsible for updating its own partition of $x$ (though it can read the components of $x$ from other cores). The components of $x$ assigned to a core correspond to the rows of $Q$ assigned to that core’s socket. Computation is grouped into “epochs,” where an epoch is defined to be the period of computation during which each component of $x$ is updated exactly once. We use the parameter $p$ to denote the number of epochs that are executed between reorderings (shufflings) of the coordinates of $x$. We investigate both shuffling after every epoch ($p=1$) and after every tenth epoch ($p=10$). Access to $x$ is lock-free, and updates are performed asynchronously. This update scheme does not implement exactly the “sampling with replacement” scheme analyzed in previous sections, but can be viewed as a high-performance, practical adaptation of the AsySCD method.
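
The following schematic (our reconstruction, not the actual DIMM-WITTED code) illustrates the update schedule just described: each thread owns a fixed partition of the coordinates, sweeps its partition once per epoch, and the within-partition order is reshuffled every p epochs. The sweeps below run sequentially; in DW they run concurrently, one thread per core, with lock-free access to x.

    import numpy as np

    def dw_schedule(Q, c, gamma, n_threads, n_epochs, p, seed=0):
        n = Q.shape[0]
        L_max = np.max(np.diag(Q))
        x = np.zeros(n)                                    # shared iterate
        rng = np.random.default_rng(seed)
        partitions = np.array_split(rng.permutation(n), n_threads)
        orders = [np.arange(len(part)) for part in partitions]

        for epoch in range(n_epochs):
            if epoch % p == 0:                             # reshuffle every p epochs
                orders = [rng.permutation(len(part)) for part in partitions]
            for part, order in zip(partitions, orders):    # one epoch: every coordinate once
                for i in part[order]:
                    x[i] -= (gamma / L_max) * (Q[i] @ x + c[i])
        return x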

To perform a coordinate descent update, a thread must read the latest value of $x$. Most components are already in the cache for that core, so it needs to fetch only those components that have changed recently. When a thread writes to $x_{i}$, the hardware ensures that this $x_{i}$ is removed from the caches of the other cores, signaling that they must fetch the updated version before proceeding with their respective computations.

Although DW is not a precise implementation of AsySCD, it largely achieves the consistent-read condition that is assumed by the analysis. Inconsistent read happens on a core only if the following three conditions are satisfied simultaneously:

  • A core does not finish reading the recently changed coordinates of $x$ (note that it needs to read no more than $\tau$ coordinates);

  • Among these recently changed coordinates, modifications take place both to coordinates that have already been read and to coordinates that are still to be read by this core;

  • Modification of the already-read coordinates happens earlier than the modification of the still-unread coordinates.

Inconsistent read will occur only if at least two coordinates of $x$ are modified twice during a stretch of approximately $\tau$ updates to $x$ (that is, iterations of Algorithm 1). For the DW implementation, inconsistent read would require repeated updating of a particular component in a stretch of approximately $\tau$ iterations that straddles two epochs. This event would be rare, for typical values of $n$ and $\tau$. Of course, one can avoid the inconsistent-read issue altogether by changing the shuffling rule slightly, enforcing the requirement that no coordinate can be modified twice in a span of $\tau$ iterations. From the practical perspective, this change does not improve performance, and detracts from the simplicity of the approach. From the theoretical perspective, however, the analysis for the inconsistent-read model would be interesting and meaningful, and we plan to study this topic in future work.

The first test problem QP is an unconstrained, regularized least squares problem constructed with synthetic data. It has the form

\min_{x\in\mathbb{R}^{n}}\;f(x):={1\over 2}\|Ax-b\|^{2}+{\alpha\over 2}\|x\|^{2}. \qquad (29)

All elements of $A\in\mathbb{R}^{m\times n}$, the true model $\tilde{x}\in\mathbb{R}^{n}$, and the observation noise vector $\delta\in\mathbb{R}^{m}$ are generated in i.i.d. fashion from the Gaussian distribution $\mathcal{N}(0,1)$, following which each column of $A$ is scaled to have a Euclidean norm of $1$. The observation $b\in\mathbb{R}^{m}$ is constructed as $A\tilde{x}+\delta\|A\tilde{x}\|/(5m)$. We choose $m=6000$, $n=20000$, and $\alpha=0.5$. We therefore have $L_{\max}=1+\alpha=1.5$ and

\frac{L_{\rm res}}{L_{\max}}\approx{1+\sqrt{n/m}+\alpha\over 1+\alpha}\approx 2.2.

With this estimate, the condition (11) is satisfied when the delay parameter $\tau$ is less than about $95$. In Algorithm 1, we set the steplength parameter $\gamma$ to $1$, and we choose the initial iterate to be $x_{0}=0$. We measure convergence via the residual norm $\|\nabla f(x)\|$.
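
A short sketch of this data-generation procedure, reconstructed from the description above (not the authors' script), is given below; reduce m and n for a quick test, since the stated sizes allocate a dense 6000 x 20000 matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, alpha = 6000, 20000, 0.5
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)               # scale each column to unit Euclidean norm
    x_true = rng.standard_normal(n)
    delta = rng.standard_normal(m)
    Ax = A @ x_true
    b = Ax + delta * np.linalg.norm(Ax) / (5 * m)

    def grad_f(x):                               # gradient of the objective in (29)
        return A.T @ (A @ x - b) + alpha * x

    x0 = np.zeros(n)
    print(np.linalg.norm(grad_f(x0)))            # residual ||grad f(x)|| at the initial point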

Our second problem QPc is a bound-constrained version of (29):

\min_{x\in\mathbb{R}^{n}_{+}}\;f(x):={1\over 2}(x-\tilde{x})^{T}(A^{T}A+\alpha I)(x-\tilde{x}). \qquad (30)

The methodology for generating $A$ and $\tilde{x}$ and for choosing the values of $m$, $n$, $\gamma$, and $x_{0}$ is the same as for (29). We measure convergence via the residual $\|x-\mathcal{P}_{\Omega}(x-\nabla f(x))\|$, where $\Omega$ is the nonnegative orthant $\mathbb{R}^{n}_{+}$. At the solution of (30), about half the components of $x$ are at their lower bound of 0.

Our third and fourth problems are quadratic penalty functions for linear programming relaxations of vertex cover problems on large graphs. The vertex cover problem for an undirected graph with edge set $E$ and vertex set $V$ can be written as a binary linear program:

\min_{y\in\{0,1\}^{|V|}}\,\sum_{v\in V}y_{v}\quad\mbox{subject to}\;\;y_{u}+y_{v}\geq 1,\quad\forall\,(u,v)\in E.

By relaxing each binary constraint to the interval $[0,1]$ and introducing slack variables for the cover inequalities, we obtain a problem of the form

\min_{y_{v}\in[0,1],\,s_{uv}\in[0,1]}\,\sum_{v\in V}y_{v}\quad\mbox{subject to}\;\;y_{u}+y_{v}-s_{uv}=0,\quad\forall\,(u,v)\in E.

This has the form

\min_{x\in[0,1]^{n}}\,c^{T}x\quad\mbox{subject to}\;\;Ax=b,

for $n=|V|+|E|$. The test problem (31) is a regularized quadratic penalty reformulation of this linear program for some penalty parameter $\beta$:

\min_{x\in[0,1]^{n}}\;c^{T}x+{\beta\over 2}\|Ax-b\|^{2}+{1\over 2\beta}\|x\|^{2}, \qquad (31)

with $\beta=5$. The two test data sets Amazon and DBLP have dimensions $n=561050$ and $n=520891$, respectively.
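
The following sketch (our reconstruction, not the authors' code) shows how the penalty objective (31) and its gradient can be assembled from an edge list, using the constraint matrix defined above.

    import numpy as np
    import scipy.sparse as sp

    def vertex_cover_penalty(edges, num_vertices, beta=5.0):
        num_edges = len(edges)
        n = num_vertices + num_edges              # x = (y, s): vertex and slack variables
        rows, cols, vals = [], [], []
        for k, (u, v) in enumerate(edges):        # constraint y_u + y_v - s_uv = 0
            rows += [k, k, k]
            cols += [u, v, num_vertices + k]
            vals += [1.0, 1.0, -1.0]
        A = sp.csr_matrix((vals, (rows, cols)), shape=(num_edges, n))
        b = np.zeros(num_edges)
        c = np.concatenate([np.ones(num_vertices), np.zeros(num_edges)])

        def f(x):
            r = A @ x - b
            return c @ x + 0.5 * beta * r @ r + 0.5 / beta * x @ x

        def grad(x):
            return c + beta * (A.T @ (A @ x - b)) + x / beta

        return f, grad, n                         # minimize over the box [0, 1]^n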

We tracked the behavior of the residual as a function of the number of epochs, when executed on different numbers of cores. Figure 1 shows convergence behavior for each of our four test problems on various numbers of cores with two different shuffling periods: $p=1$ and $p=10$. We note the following points.

  • The total amount of computation to achieve any level of precision appears to be almost independent of the number of cores, at least up to 40 cores. In this respect, the performance of the algorithm does not change appreciably as the number of cores is increased. Thus, any deviation from linear speedup is due not to degradation of convergence speed in the algorithm but rather to systems issues in the implementation.

  • When we reshuffle after every epoch ($p=1$), convergence is slightly faster on the synthetic unconstrained QP but slightly slower on Amazon and DBLP than when we reshuffle only occasionally ($p=10$). Overall, the convergence rates with different shuffling periods are comparable in terms of epochs. However, when the dimension of the variable is large, the shuffling operation becomes expensive, so we recommend using a large value of $p$ for large-dimensional problems.

Figure 1: Residuals vs epoch number for the four test problems. Results are reported for variants in which indices are reshuffled after every epoch ($p=1$) and after every tenth epoch ($p=10$).

Results for speedup of multicore implementations are shown in Figures 2 and 3 for DW with $p=10$. Speedup is defined as follows:

\frac{\text{runtime on a single core using DW}}{\text{runtime on $P$ cores}}.

Near-linear speedup can be observed for the two QP problems with synthetic data. For Problems 3 and 4, speedup is at most 12-14; there are few gains when the number of cores exceeds about 12. We believe that the degradation is due mostly to memory contention. Although these problems have high dimension, the matrix $Q$ is very sparse (in contrast to the dense $Q$ for the synthetic data set). Thus, the ratio of computation to data movement / memory access is much lower for these problems, making memory contention effects more significant.

Figures 2 and 3 also show results of a global-locking strategy for the parallel stochastic coordinate descent method, in which the vector $x$ is locked by a core whenever it performs a read or update. The performance curve for this strategy hugs the horizontal axis; it is not competitive.

Wall clock times required for the four test problems on $1$ and $40$ cores, to reduce residuals below $10^{-5}$, are shown in Table 1. (Similar speedups are noted when we use a convergence tolerance looser than $10^{-5}$.)

Figure 2: Test problems 1 and 2: Speedup of multicore implementations of DW on up to 40 cores of an Intel Xeon architecture. Ideal (linear) speedup curve is shown for reference, along with poor speedups obtained for a global-locking strategy.
Figure 3: Test problems 3 and 4: Speedup of multicore implementations of DW on up to 40 cores of an Intel Xeon architecture. Ideal (linear) speedup curve is shown for reference, along with poor speedups obtained for a global-locking strategy.
Problem   1 core   40 cores
QP        98.4     3.03
QPc       59.7     1.82
Amazon    17.1     1.25
DBLP      11.5     0.91
Table 1: Runtimes (seconds) for the four test problems on 1 and 40 cores.
#cores Time(sec) Speedup
1 55.9 1
10 5.19 10.8
20 2.77 20.2
30 2.06 27.2
40 1.81 30.9
Table 2: Runtimes (seconds) and speedup for multicore implementations of DW on different numbers of cores for the weakly convex QPc problem (with $\alpha=0$) to achieve a residual below $0.06$.
#cores   Time (sec): SynGD / AsySCD   Speedup: SynGD / AsySCD
1 96.8 / 27.1 0.28 / 1.00
10 11.4 / 2.57 2.38 / 10.5
20 6.00 / 1.36 4.51 / 19.9
30 4.44 / 1.01 6.10 / 26.8
40 3.91 / 0.88 6.93 / 30.8
Table 3: Efficiency comparison between SynGD and AsySCD for the QP problem. The running times and speedups are based on the residual achieving a tolerance of $10^{-5}$.
Dataset   # Samples   # Features   Train time (sec): LIBSVM   AsySCD
adult 32561 123 16.15 1.39
news 19996 1355191 214.48 7.22
rcv 20242 47236 40.33 16.06
reuters 8293 18930 1.63 0.81
w8a 49749 300 33.62 5.86
Table 4: Efficiency comparison between LIBSVM and AsySCD for kernel SVM on 40 cores, using homogeneous kernels ($K(x_{i},x_{j})=(x_{i}^{T}x_{j})^{2}$). The running times are based on reducing the “residual” below $10^{-3}$. Here, to make the two algorithms comparable, the “residual” is defined by $\|x-\mathcal{P}_{\Omega}(x-\nabla f(x))\|_{\infty}$.

All problems reported on above are essentially strongly convex. Similar speedup properties can be obtained in the weakly convex case as well. We show speedups for the QPc problem with $\alpha=0$. Table 2 demonstrates similar speedup to the essentially strongly convex case shown in Figure 2.

Turning now to comparisons between AsySCD and alternative algorithms, we start by considering the basic gradient descent method. We implement gradient descent in a parallel, synchronous fashion, distributing the gradient computation load over multiple cores and updating the variable $x$ in parallel at each step. The resulting implementation is called SynGD. Table 3 reports running times and speedups for both SynGD and AsySCD, showing a clear advantage for AsySCD. A high price is paid for the synchronization requirement in SynGD.

Next we compare AsySCD to LIBSVM (Chang and Lin, 2011), a popular multi-thread parallel solver for kernel support vector machines (SVM). Both algorithms are run on 40 cores to solve the dual formulation of kernel SVM, without an intercept term. All data sets used in Table 4, except reuters, were obtained from the LIBSVM dataset repository (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The dataset reuters is a sparse binary text classification dataset constructed as a one-versus-all version of Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/). Our comparisons, shown in Table 4, indicate that AsySCD outperforms LIBSVM on these test sets.
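
For illustration, the following serial sketch (our own, not the paper's solver) applies clipped stochastic coordinate descent to the kernel SVM dual without intercept, min_{0 <= alpha_i <= C} 0.5 alpha^T Qbar alpha - sum_i alpha_i with Qbar_ij = y_i y_j K(x_i, x_j), using the homogeneous kernel from Table 4; the box constraint makes it an instance of the separable-constrained setting (2).

    import numpy as np

    def svm_dual_scd(X, y, C=1.0, gamma=0.5, num_iters=100000, seed=0):
        n = X.shape[0]
        Qbar = (y[:, None] * y[None, :]) * (X @ X.T) ** 2   # K(x, z) = (x^T z)^2
        L_max = np.max(np.diag(Qbar))
        alpha = np.zeros(n)
        rng = np.random.default_rng(seed)
        for _ in range(num_iters):
            i = rng.integers(n)
            g_i = Qbar[i] @ alpha - 1.0                     # i-th partial gradient
            alpha[i] = np.clip(alpha[i] - (gamma / L_max) * g_i, 0.0, C)  # clipped update
        return alpha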

7 Extension

The AsySCD algorithm can be extended by partitioning the coordinates into blocks, and modifying Algorithm 1 to work with these blocks rather than with single coordinates. If $L_{i}$, $L_{\max}$, and $L_{\rm res}$ are defined in the block sense, as follows:

\|\nabla f(x)-\nabla f(x+E_{i}t)\| \leq L_{\rm res}\|t\|\quad\forall x,i,\,t\in\mathbb{R}^{|i|},
\|\nabla_{i}f(x)-\nabla_{i}f(x+E_{i}t)\| \leq L_{i}\|t\|\quad\forall x,i,\,t\in\mathbb{R}^{|i|},
L_{\max} =\max_{i}L_{i},

where $E_{i}$ is the projection from the $i$th block to $\mathbb{R}^{n}$ and $|i|$ denotes the number of components in block $i$, our analysis can be extended appropriately.

To make the AsySCD algorithm more efficient, one can redefine the steplength in Algorithm 1 to be $\frac{\gamma}{L_{i(j)}}$ rather than $\frac{\gamma}{L_{\max}}$. Our analysis can be applied to this variant by doing a change of variables to $\tilde{x}$, with $x_{i}=\frac{L_{i}}{L_{\max}}\tilde{x}_{i}$, and defining $L_{i}$, $L_{\rm res}$, and $L_{\max}$ in terms of $\tilde{x}$.

8 Conclusion

This paper proposes an asynchronous parallel stochastic coordinate descent algorithm for minimizing convex objectives, in the unconstrained and separable-constrained cases. Sublinear convergence (at rate $1/K$) is proved for general convex functions, with stronger linear convergence results for functions that satisfy an essential strong convexity property. Our analysis indicates the extent to which parallel implementations can be expected to yield near-linear speedup, in terms of a parameter that quantifies the cross-coordinate interactions in the gradient $\nabla f$ and a parameter $\tau$ that bounds the delay in updating. Our computational experience confirms the theory.


Acknowledgments

This project is supported by NSF Grants DMS-0914524, DMS-1216318, and CCF-1356918; NSF CAREER Award IIS-1353606; ONR Awards N00014-13-1-0129 and N00014-12-1-0041; AFOSR Award FA9550-13-1-0138; a Sloan Research Fellowship; and grants from Oracle, Google, and ExxonMobil.

A Proofs for Unconstrained Case

This section contains convergence proofs for AsySCD in the unconstrained case.

We start with a technical result, then move to the proofs of the three main results of Section 4.

Lemma 7

For any $x$, we have

\|x-\mathcal{P}_{S}(x)\|^{2}\,\|\nabla f(x)\|^{2}\geq(f(x)-f^{*})^{2}.

If the essential strong convexity property (3) holds, we have

\|\nabla f(x)\|^{2}\geq 2l(f(x)-f^{*}).

Proof  The first inequality is proved as follows:

f(x)-f^{*}\leq\langle\nabla f(x),x-\mathcal{P}_{S}(x)\rangle\leq\|\nabla f(x)\|\,\|\mathcal{P}_{S}(x)-x\|.

For the second bound, we have from the definition (3), setting $y\leftarrow x$ and $x\leftarrow\mathcal{P}_{S}(x)$, that

f^{*}-f(x) \geq\langle\nabla f(x),\mathcal{P}_{S}(x)-x\rangle+{l\over 2}\|x-\mathcal{P}_{S}(x)\|^{2}
  ={l\over 2}\|\mathcal{P}_{S}(x)-x+\frac{1}{l}\nabla f(x)\|^{2}-{1\over 2l}\|\nabla f(x)\|^{2}
  \geq-{1\over 2l}\|\nabla f(x)\|^{2},

as required.  

Proof  (Theorem 1) We prove each of the two inequalities in (8) by induction. We start with the left-hand inequality. For all values of $j$, we have

\mathbb{E}\left(\|\nabla f(x_{j})\|^{2}-\|\nabla f(x_{j+1})\|^{2}\right)
  =\mathbb{E}\langle\nabla f(x_{j})+\nabla f(x_{j+1}),\nabla f(x_{j})-\nabla f(x_{j+1})\rangle
  =\mathbb{E}\langle 2\nabla f(x_{j})+\nabla f(x_{j+1})-\nabla f(x_{j}),\nabla f(x_{j})-\nabla f(x_{j+1})\rangle
  \leq 2\mathbb{E}\langle\nabla f(x_{j}),\nabla f(x_{j})-\nabla f(x_{j+1})\rangle
  \leq 2\mathbb{E}(\|\nabla f(x_{j})\|\,\|\nabla f(x_{j})-\nabla f(x_{j+1})\|)
  \leq 2L_{\rm res}\mathbb{E}(\|\nabla f(x_{j})\|\,\|x_{j}-x_{j+1}\|)
  \leq\frac{2L_{\rm res}\gamma}{L_{\max}}\mathbb{E}(\|\nabla f(x_{j})\|\,\|\nabla_{i(j)}f(x_{k(j)})\|)
  \leq\frac{L_{\rm res}\gamma}{L_{\max}}\mathbb{E}(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{1/2}\|\nabla_{i(j)}f(x_{k(j)})\|^{2})
  =\frac{L_{\rm res}\gamma}{L_{\max}}\mathbb{E}(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{1/2}\mathbb{E}_{i(j)}(\|\nabla_{i(j)}f(x_{k(j)})\|^{2}))
  =\frac{L_{\rm res}\gamma}{L_{\max}}\mathbb{E}(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{-1/2}\|\nabla f(x_{k(j)})\|^{2})
  \leq{L_{\rm res}\gamma\over\sqrt{n}L_{\max}}\mathbb{E}\left(\|\nabla f(x_{j})\|^{2}+\|\nabla f(x_{k(j)})\|^{2}\right). \qquad (32)

We can use this bound to show that the left-hand inequality in (8) holds for $j=0$. By setting $j=0$ in (32) and noting that $k(0)=0$, we obtain

\mathbb{E}\left(\|\nabla f(x_{0})\|^{2}-\|\nabla f(x_{1})\|^{2}\right)\leq\frac{2L_{\rm res}\gamma}{\sqrt{n}L_{\max}}\mathbb{E}(\|\nabla f(x_{0})\|^{2}). \qquad (33)

From (7b), we have

\frac{2L_{\rm res}\gamma}{\sqrt{n}L_{\max}}\leq\frac{\rho-1}{\rho^{\tau}}\leq\frac{\rho-1}{\rho}=1-\rho^{-1},

where the second inequality follows from $\rho>1$. By substituting into (33), we obtain $\rho^{-1}\mathbb{E}(\|\nabla f(x_{0})\|^{2})\leq\mathbb{E}(\|\nabla f(x_{1})\|^{2})$, establishing the left-hand inequality in (8) for $j=0$. For the inductive step, we use (32) again, assuming that the left-hand inequality in (8) holds up to stage $j$, and thus that

\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})\leq\rho^{\tau}\mathbb{E}(\|\nabla f(x_{j})\|^{2}),

provided that $0\leq j-k(j)\leq\tau$, as assumed. By substituting into the right-hand side of (32) again, and using $\rho>1$, we obtain

\mathbb{E}\left(\|\nabla f(x_{j})\|^{2}-\|\nabla f(x_{j+1})\|^{2}\right)\leq\frac{2L_{\rm res}\gamma\rho^{\tau}}{\sqrt{n}L_{\max}}\mathbb{E}\left(\|\nabla f(x_{j})\|^{2}\right).

By substituting (7b) we conclude that the left-hand inequality in (8) holds for all $j$.

We now work on the right-hand inequality in (8). For all $j$, we have the following:

\mathbb{E}\left(\|\nabla f(x_{j+1})\|^{2}-\|\nabla f(x_{j})\|^{2}\right)
  =\mathbb{E}\langle\nabla f(x_{j})+\nabla f(x_{j+1}),\nabla f(x_{j+1})-\nabla f(x_{j})\rangle
  \leq\mathbb{E}(\|\nabla f(x_{j})+\nabla f(x_{j+1})\|\,\|\nabla f(x_{j})-\nabla f(x_{j+1})\|)
  \leq L_{\rm res}\mathbb{E}(\|\nabla f(x_{j})+\nabla f(x_{j+1})\|\,\|x_{j}-x_{j+1}\|)
  \leq L_{\rm res}\mathbb{E}((2\|\nabla f(x_{j})\|+\|\nabla f(x_{j+1})-\nabla f(x_{j})\|)\|x_{j}-x_{j+1}\|)
  \leq L_{\rm res}\mathbb{E}(2\|\nabla f(x_{j})\|\,\|x_{j}-x_{j+1}\|+L_{\rm res}\|x_{j}-x_{j+1}\|^{2})
  \leq L_{\rm res}\mathbb{E}\left(\frac{2\gamma}{L_{\max}}\|\nabla f(x_{j})\|\,\|\nabla_{i(j)}f(x_{k(j)})\|+\frac{L_{\rm res}\gamma^{2}}{L_{\max}^{2}}\|\nabla_{i(j)}f(x_{k(j)})\|^{2}\right)
  \leq L_{\rm res}\mathbb{E}\left(\frac{\gamma}{L_{\max}}\bigl(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{1/2}\|\nabla_{i(j)}f(x_{k(j)})\|^{2}\bigr)+\frac{L_{\rm res}\gamma^{2}}{L_{\max}^{2}}\|\nabla_{i(j)}f(x_{k(j)})\|^{2}\right)
  =L_{\rm res}\mathbb{E}\left(\frac{\gamma}{L_{\max}}\bigl(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{1/2}\mathbb{E}_{i(j)}(\|\nabla_{i(j)}f(x_{k(j)})\|^{2})\bigr)+\frac{L_{\rm res}\gamma^{2}}{L_{\max}^{2}}\mathbb{E}_{i(j)}(\|\nabla_{i(j)}f(x_{k(j)})\|^{2})\right)
  =L_{\rm res}\mathbb{E}\left(\frac{\gamma}{L_{\max}}\bigl(n^{-1/2}\|\nabla f(x_{j})\|^{2}+n^{-1/2}\|\nabla f(x_{k(j)})\|^{2}\bigr)+\frac{L_{\rm res}\gamma^{2}}{nL_{\max}^{2}}\|\nabla f(x_{k(j)})\|^{2}\right)
  ={\gamma L_{\rm res}\over\sqrt{n}L_{\max}}\mathbb{E}\left(\|\nabla f(x_{j})\|^{2}+\|\nabla f(x_{k(j)})\|^{2}\right)+{\gamma^{2}L_{\rm res}^{2}\over nL_{\max}^{2}}\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})
  \leq{\gamma L_{\rm res}\over\sqrt{n}L_{\max}}\mathbb{E}(\|\nabla f(x_{j})\|^{2})+\left({\gamma L_{\rm res}\over\sqrt{n}L_{\max}}+{\gamma L_{\rm res}^{2}\over nL_{\max}^{2}}\right)\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2}), \qquad (34)

where the last inequality follows from the observation $\gamma\leq 1$. By setting $j=0$ in this bound, and noting that $k(0)=0$, we obtain

\mathbb{E}\left(\|\nabla f(x_{1})\|^{2}-\|\nabla f(x_{0})\|^{2}\right)\leq\left(\frac{2\gamma L_{\rm res}}{\sqrt{n}L_{\max}}+\frac{\gamma L_{\rm res}^{2}}{nL_{\max}^{2}}\right)\mathbb{E}(\|\nabla f(x_{0})\|^{2}). \qquad (35)

By using (7c), we have

\frac{2\gamma L_{\rm res}}{\sqrt{n}L_{\max}}+\frac{\gamma L_{\rm res}^{2}}{nL_{\max}^{2}}=\frac{L_{\rm res}\gamma}{\sqrt{n}L_{\max}}\left(2+\frac{L_{\rm res}}{\sqrt{n}L_{\max}}\right)\leq\frac{\rho-1}{\rho^{\tau}}<\rho-1,

where the last inequality follows from ρ>1\rho>1. By substituting into (35), we obtain 𝔼(f(x1)2)ρ𝔼(f(x0)2)\mathbb{E}(\|\nabla f(x_{1})\|^{2})\leq\rho\mathbb{E}(\|\nabla f(x_{0})\|^{2}), so the right-hand bound in (8) is established for j=0j=0. For the inductive step, we use (34) again, assuming that the right-hand inequality in (8) holds up to stage jj, and thus that

𝔼(f(xj)2)ρτ𝔼(f(xk(j))2),\mathbb{E}(\|\nabla f(x_{j})\|^{2})\leq\rho^{\tau}\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2}),

provided that 0\leq j-k(j)\leq\tau, as assumed. From (34) and the left-hand inequality in (8) — which, chained over these at most \tau steps, gives \mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})\leq\rho^{\tau}\mathbb{E}(\|\nabla f(x_{j})\|^{2}) — we obtain, using also \rho^{\tau}\geq 1, that

𝔼(f(xj+1)2f(xj)2)(2γLresρτnLmax+γLres2ρτnLmax2)𝔼(f(xj)2).\mathbb{E}\left(\|\nabla f(x_{j+1})\|^{2}-\|\nabla f(x_{j})\|^{2}\right)\leq\left(\frac{2\gamma L_{\mbox{\rm\scriptsize res}}\rho^{\tau}}{\sqrt{n}{L_{\max}}}+\frac{\gamma L_{\mbox{\rm\scriptsize res}}^{2}\rho^{\tau}}{nL_{\max}^{2}}\right)\mathbb{E}(\|\nabla f(x_{j})\|^{2}). (36)

It follows immediately from (7c) that the term in parentheses in (36) is bounded above by ρ1\rho-1. By substituting this bound into (36), we obtain 𝔼(f(xj+1)2)ρ𝔼(f(xj)2)\mathbb{E}(\|\nabla f(x_{j+1})\|^{2})\leq\rho\mathbb{E}(\|\nabla f(x_{j})\|^{2}), as required.

At this point, we have shown that both inequalities in (8) are satisfied for all jj.
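A consequence of (8) that is used repeatedly below: for any two iteration counters s and t with |s-t|\leq\tau, chaining the appropriate inequality in (8) over the steps between them gives

\mathbb{E}(\|\nabla f(x_{s})\|^{2})\leq\rho^{|s-t|}\mathbb{E}(\|\nabla f(x_{t})\|^{2})\leq\rho^{\tau}\mathbb{E}(\|\nabla f(x_{t})\|^{2}).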

Next we prove (9) and (10). Taking the expectation of f(x_{j+1}) with respect to the choice of i(j), we have

𝔼i(j)f(xj+1)\displaystyle~~\mathbb{E}_{i(j)}f(x_{j+1})
=𝔼i(j)f(xjγLmaxei(j)i(j)f(xk(j)))\displaystyle=\mathbb{E}_{i(j)}f\left(x_{j}-\frac{\gamma}{L_{\max}}e_{i(j)}\nabla_{i(j)}f(x_{k(j)})\right)
=1ni=1nf(xjγLmaxeiif(xk(j)))\displaystyle={1\over n}\sum_{i=1}^{n}f\left(x_{j}-\frac{\gamma}{L_{\max}}e_{i}\nabla_{i}f(x_{k(j)})\right)
1ni=1nf(xj)γLmaxf(xj),eiif(xk(j))+Li2Lmax2γ2if(xk(j))2\displaystyle\leq{1\over n}\sum_{i=1}^{n}f(x_{j})-\frac{\gamma}{L_{\max}}\langle\nabla f(x_{j}),e_{i}\nabla_{i}f(x_{k(j)})\rangle+{L_{i}\over 2L_{\max}^{2}}\gamma^{2}\|\nabla_{i}f(x_{k(j)})\|^{2}
f(xj)γnLmaxf(xj),f(xk(j))+γ22nLmaxf(xk(j))2\displaystyle\leq f(x_{j})-{\gamma\over n{L_{\max}}}\langle\nabla f(x_{j}),\nabla f(x_{k(j)})\rangle+{\gamma^{2}\over 2n{L_{\max}}}\|\nabla f(x_{k(j)})\|^{2}
=f(xj)+γnLmaxf(xk(j))f(xj),f(xk(j))T1\displaystyle=f(x_{j})+{\gamma\over n{L_{\max}}}\underbrace{\langle\nabla f(x_{k(j)})-\nabla f(x_{j}),\nabla f(x_{k(j)})\rangle}_{T_{1}}
(γnLmaxγ22nLmax)f(xk(j))2.\displaystyle\quad\quad-\left({\gamma\over n{L_{\max}}}-{\gamma^{2}\over 2n{L_{\max}}}\right)\|\nabla f(x_{k(j)})\|^{2}. (37)

The term T_{1} arises from the delay: if there were no delay, we would have x_{k(j)}=x_{j}, hence \nabla f(x_{j})=\nabla f(x_{k(j)}) and T_{1}=0. We now bound \|\nabla f(x_{k(j)})-\nabla f(x_{j})\|:

f(xk(j))f(xj)\displaystyle\|\nabla f(x_{k(j)})-\nabla f(x_{j})\| d=k(j)j1f(xd+1)f(xd)\displaystyle\leq\sum_{d=k(j)}^{j-1}\left\|\nabla f(x_{d+1})-\nabla f(x_{d})\right\|
Lresd=k(j)j1xd+1xd\displaystyle\leq L_{\mbox{\rm\scriptsize res}}\sum_{d=k(j)}^{j-1}\left\|x_{d+1}-x_{d}\right\|
=LresγLmaxd=k(j)j1i(d)f(xk(d)).\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}\gamma}{L_{\max}}\sum_{d=k(j)}^{j-1}\left\|\nabla_{i(d)}f(x_{k(d)})\right\|. (38)

Then 𝔼(|T1|)\mathbb{E}(|T_{1}|) can be bounded by

𝔼(|T1|)\displaystyle\mathbb{E}(|T_{1}|) 𝔼(f(xk(j))f(xj)f(xk(j)))\displaystyle\leq\mathbb{E}(\|\nabla f(x_{k(j)})-\nabla f(x_{j})\|\|\nabla f(x_{k(j)})\|)
LresγLmax𝔼(d=k(j)j1i(d)f(xk(d))f(xk(j)))\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}\gamma}{L_{\max}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\|\nabla_{i(d)}f(x_{k(d)})\|\|\nabla f(x_{k(j)})\|\right)
Lresγ2Lmax𝔼(d=k(j)j1n1/2i(d)f(xk(d))2+n1/2f(xk(j))2)\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}\gamma}{2L_{\max}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{1/2}\|\nabla_{i(d)}f(x_{k(d)})\|^{2}+n^{-1/2}\|\nabla f(x_{k(j)})\|^{2}\right)
=Lresγ2Lmax𝔼(d=k(j)j1n1/2𝔼i(d)(i(d)f(xk(d))2)+n1/2f(xk(j))2)\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}\gamma}{2L_{\max}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{1/2}\mathbb{E}_{i(d)}(\|\nabla_{i(d)}f(x_{k(d)})\|^{2})+n^{-1/2}\|\nabla f(x_{k(j)})\|^{2}\right)
=Lresγ2Lmax𝔼(d=k(j)j1n1/2f(xk(d))2+n1/2f(xk(j))2)\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}\gamma}{2L_{\max}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{-1/2}\|\nabla f(x_{k(d)})\|^{2}+n^{-1/2}\|\nabla f(x_{k(j)})\|^{2}\right)
=Lresγ2nLmaxd=k(j)j1𝔼(f(xk(d))2+f(xk(j))2)\displaystyle={L_{\mbox{\rm\scriptsize res}}\gamma\over 2\sqrt{n}{L_{\max}}}\sum_{d=k(j)}^{j-1}\mathbb{E}(\|\nabla f(x_{k(d)})\|^{2}+\|\nabla f(x_{k(j)})\|^{2})
τρτLresγnLmax𝔼(f(xk(j))2)\displaystyle\leq{\tau\rho^{\tau}L_{\mbox{\rm\scriptsize res}}\gamma\over\sqrt{n}{L_{\max}}}\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2}) (39)

where the second line uses (38), and the final inequality uses the fact that, for each d between k(j) and j-1, the index k(d) lies between k(j)-\tau and j-1, so that |k(d)-k(j)|\leq\tau and (8) yields \mathbb{E}(\|\nabla f(x_{k(d)})\|^{2})\leq\rho^{\tau}\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2}); moreover the sum contains at most \tau terms.

Taking the expectation of both sides of (37) with respect to all random variables, and using (39), we obtain

𝔼(f(xj+1)f)\displaystyle\mathbb{E}(f(x_{j+1})-f^{*})
𝔼(f(xj)f)+γnLmax𝔼(|T1|)(γnLmaxγ22nLmax)𝔼(f(xk(j))2)\displaystyle\quad\leq\mathbb{E}(f(x_{j})-f^{*})+{\gamma\over n{L_{\max}}}\mathbb{E}(|T_{1}|)-\left({\gamma\over n{L_{\max}}}-{\gamma^{2}\over 2n{L_{\max}}}\right)\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})
𝔼(f(xj)f)(γnLmaxτρτLresγ2n3/2Lmax2γ22nLmax)𝔼(f(xk(j))2)\displaystyle\quad\leq\mathbb{E}(f(x_{j})-f^{*})-\left({\gamma\over n{L_{\max}}}-{\tau\rho^{\tau}L_{\mbox{\rm\scriptsize res}}\gamma^{2}\over{n^{3/2}}{L_{\max}^{2}}}-{\gamma^{2}\over 2n{L_{\max}}}\right)\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})
=𝔼(f(xj)f)γnLmax(1ψ2γ)𝔼(f(xk(j))2),\displaystyle\quad=\mathbb{E}(f(x_{j})-f^{*})-\frac{\gamma}{nL_{\max}}\left(1-\frac{\psi}{2}\gamma\right)\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2}), (40)

which (because of (7a)) implies that 𝔼(f(xj)f)\mathbb{E}(f(x_{j})-f^{*}) is monotonically decreasing.

Assume now that the essential strong convexity property (3) holds. From Lemma 7 and the fact that 𝔼(f(xj)f)\mathbb{E}(f(x_{j})-f^{*}) is monotonically decreasing, we have

\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})\geq 2l\mathbb{E}(f(x_{k(j)})-f^{*})\geq 2l\mathbb{E}(f(x_{j})-f^{*}).

So by substituting in (40), we obtain

𝔼(f(xj+1)f)(12lγnLmax(1ψ2γ))𝔼(f(xj)f),\mathbb{E}(f(x_{j+1})-f^{*})\leq\left(1-\frac{2l\gamma}{nL_{\max}}\left(1-\frac{\psi}{2}\gamma\right)\right)\mathbb{E}(f(x_{j})-f^{*}),

from which the linear convergence claim (9) follows by an obvious induction.
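Spelling out this induction: applying the contraction above recursively from step j down to step 0 (and using that x_{0} is deterministic) gives

\mathbb{E}(f(x_{j+1})-f^{*})\leq\left(1-\frac{2l\gamma}{nL_{\max}}\left(1-\frac{\psi}{2}\gamma\right)\right)^{j+1}(f(x_{0})-f^{*}),

which is the geometric decrease yielding (9).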

For the case of general smooth convex ff, we have from Lemma 7, Assumption 1, and the monotonic decreasing property for 𝔼(f(xj)f)\mathbb{E}(f(x_{j})-f^{*}) that

𝔼(f(xk(j))2)\displaystyle\mathbb{E}(\|\nabla f(x_{k(j)})\|^{2})\geq 𝔼[(f(xk(j))f)2xk(j)𝒫S(xk(j))2]\displaystyle\mathbb{E}\left[\frac{(f(x_{k(j)})-f^{*})^{2}}{\|x_{k(j)}-\mathcal{P}_{S}(x_{k(j)})\|^{2}}\right]
\displaystyle\geq 𝔼[(f(xk(j))f)2]R2\displaystyle\frac{\mathbb{E}[(f(x_{k(j)})-f^{*})^{2}]}{R^{2}}
\displaystyle\geq [𝔼(f(xk(j))f)]2R2\displaystyle\frac{[\mathbb{E}(f(x_{k(j)})-f^{*})]^{2}}{R^{2}}
\displaystyle\geq [𝔼(f(xj)f)]2R2,\displaystyle\frac{[\mathbb{E}(f(x_{j})-f^{*})]^{2}}{R^{2}},

where the third inequality uses Jensen’s inequality 𝔼(v2)(𝔼(v))2\mathbb{E}(v^{2})\geq(\mathbb{E}(v))^{2}. By substituting into (40), we obtain

𝔼(f(xj+1)f)\displaystyle\mathbb{E}(f(x_{j+1})-f^{*}) 𝔼(f(xj)f)γnLmaxR2(1ψ2γ)[𝔼(f(xj)f)]2.\displaystyle\leq\mathbb{E}(f(x_{j})-f^{*})-\frac{\gamma}{nL_{\max}R^{2}}\left(1-\frac{\psi}{2}\gamma\right)[\mathbb{E}(f(x_{j})-f^{*})]^{2}.

Defining

C:=γnLmaxR2(1ψ2γ),C:=\frac{\gamma}{nL_{\max}R^{2}}\left(1-\frac{\psi}{2}\gamma\right),

we have

𝔼(f(xj+1)f)𝔼(f(xj)f)C(𝔼(f(xj)f))2\displaystyle\mathbb{E}(f(x_{j+1})-f^{*})\leq\mathbb{E}(f(x_{j})-f^{*})-{C}(\mathbb{E}(f(x_{j})-f^{*}))^{2}
\displaystyle\Rightarrow~ 1𝔼(f(xj)f)1𝔼(f(xj+1)f)C𝔼(f(xj)f)𝔼(f(xj+1)f)\displaystyle{1\over\mathbb{E}(f(x_{j})-f^{*})}\leq{1\over\mathbb{E}(f(x_{j+1})-f^{*})}-{C}{\mathbb{E}(f(x_{j})-f^{*})\over\mathbb{E}(f(x_{j+1})-f^{*})}
\displaystyle\Rightarrow~ 1𝔼(f(xj+1)f)1𝔼(f(xj)f)C𝔼(f(xj)f)𝔼(f(xj+1)f)C\displaystyle{1\over\mathbb{E}(f(x_{j+1})-f^{*})}-{1\over\mathbb{E}(f(x_{j})-f^{*})}\geq{C}{\mathbb{E}(f(x_{j})-f^{*})\over\mathbb{E}(f(x_{j+1})-f^{*})}\geq C
\displaystyle\Rightarrow~ 1𝔼(f(xj+1)f)1f(x0)f+C(j+1)\displaystyle{1\over\mathbb{E}(f(x_{j+1})-f^{*})}\geq{1\over f(x_{0})-f^{*}}+{C(j+1)}
\displaystyle\Rightarrow~ 𝔼(f(xj+1)f)1(f(x0)f)1+C(j+1),\displaystyle\mathbb{E}(f(x_{j+1})-f^{*})\leq{1\over{(f(x_{0})-f^{*})^{-1}}+{C(j+1)}},

which completes the proof of the sublinear rate (10).  
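The final chain of implications can also be checked numerically. The following minimal sketch (the values of a0 and C are arbitrary placeholders, not constants of any problem instance) simulates the worst case of the per-step inequality for the expected gap and confirms that the resulting sequence respects the bound 1/((f(x_{0})-f^{*})^{-1}+Cj) derived above:

```python
# Hypothetical illustration of the recursion used above: if the expected gap
# a_j = E(f(x_j) - f*) satisfies a_{j+1} <= a_j - C * a_j**2 and decreases
# monotonically, then a_j <= 1 / (1/a_0 + C*j).  a0 and C are placeholders.
a0, C = 1.0, 0.01
gaps = [a0]
for _ in range(1000):
    a = gaps[-1]
    gaps.append(a - C * a * a)          # worst case: equality in the recursion
for j in (1, 10, 100, 1000):
    bound = 1.0 / (1.0 / a0 + C * j)    # the bound established in the proof
    assert gaps[j] <= bound + 1e-12, (j, gaps[j], bound)
```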

Proof  (Corollary 2) Note first that for ρ\rho defined by (12), we have

ρτρτ+1=((1+2eLresnLmax)nLmax2eLres)2eLres(τ+1)nLmaxe2eLres(τ+1)nLmaxe,\rho^{\tau}\leq\rho^{\tau+1}=\left(\left(1+{2eL_{\mbox{\rm\scriptsize res}}\over\sqrt{n}L_{\max}}\right)^{\sqrt{n}L_{\max}\over 2eL_{\mbox{\rm\scriptsize res}}}\right)^{{2eL_{\mbox{\rm\scriptsize res}}(\tau+1)\over\sqrt{n}L_{\max}}}\leq e^{{2eL_{\mbox{\rm\scriptsize res}}(\tau+1)\over\sqrt{n}L_{\max}}}\leq e,

and thus from the definition of ψ\psi (6) that

ψ=1+2τρτLresnLmax1+2τeLresnLmax2.\psi=1+\frac{2\tau\rho^{\tau}L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}\leq 1+\frac{2\tau eL_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}\leq 2. (41)

We show now that the steplength parameter choice γ=1/ψ\gamma=1/\psi satisfies all the bounds in (7), by showing that the second and third bounds are implied by the first. For the second bound (7b), we have

(ρ1)nLmax2ρτ+1Lres(ρ1)nLmax2eLres11ψ,\frac{(\rho-1)\sqrt{n}L_{\max}}{2\rho^{\tau+1}L_{\mbox{\rm\scriptsize res}}}\geq\frac{(\rho-1)\sqrt{n}L_{\max}}{2eL_{\mbox{\rm\scriptsize res}}}\geq 1\geq\frac{1}{\psi},

where the second inequality follows from (12). To verify that the right hand side of the third bound (7c) is greater than 1/ψ1/\psi, we consider the cases τ=0\tau=0 and τ1\tau\geq 1 separately. For τ=0\tau=0, we have ψ=1\psi=1 from (6) and

(ρ1)nLmaxLresρτ(2+LresnLmax)=2e2+LresnLmax2e2+(2e)111ψ,\frac{(\rho-1)\sqrt{n}L_{\max}}{L_{\mbox{\rm\scriptsize res}}\rho^{\tau}(2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}})}=\frac{2e}{2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}}\geq\frac{2e}{2+(2e)^{-1}}\geq 1\geq\frac{1}{\psi},

where the first inequality is from (11). For the other case τ1\tau\geq 1, we have

(ρ1)nLmaxLresρτ(2+LresnLmax)=2eLresLresρτ(2+LresnLmax)2eLresLrese(2+LresnLmax)=22+LresnLmax1ψ.\frac{(\rho-1)\sqrt{n}L_{\max}}{L_{\mbox{\rm\scriptsize res}}\rho^{\tau}(2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}})}=\frac{2eL_{\mbox{\rm\scriptsize res}}}{L_{\mbox{\rm\scriptsize res}}\rho^{\tau}(2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}})}\geq\frac{2eL_{\mbox{\rm\scriptsize res}}}{L_{\mbox{\rm\scriptsize res}}e(2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}})}=\frac{2}{2+\frac{L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}}\geq\frac{1}{\psi}.

We can thus set γ=1/ψ\gamma=1/\psi, and by substituting this choice into (9) and using (41), we obtain (13). We obtain (14) by making the same substitution into (10).  
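The verification above can also be traced numerically. The following minimal sketch uses hypothetical values of n, \tau, L_{\max}, and L_{\mbox{\rm\scriptsize res}} (placeholders chosen to lie in the regime of (11)), computes \rho and \psi from the expressions displayed above, and checks the inequalities established in the proof:

```python
import math

# Hypothetical parameters (placeholders, not from any experiment), chosen so
# that 2*e*(tau+1)*L_res <= sqrt(n)*L_max, i.e., the regime of (11).
n, tau = 20000, 10
L_max, L_res = 1.0, 2.0
sqrtn = math.sqrt(n)
assert 2 * math.e * (tau + 1) * L_res <= sqrtn * L_max

rho = 1 + 2 * math.e * L_res / (sqrtn * L_max)           # expression for rho in (12)
psi = 1 + 2 * tau * rho**tau * L_res / (sqrtn * L_max)   # expression for psi in (6)
gamma = 1.0 / psi                                        # choice made in Corollary 2

assert rho**(tau + 1) <= math.e          # the bound leading to (41)
assert psi <= 2                          # bound (41)
rhs_7b = (rho - 1) * sqrtn * L_max / (2 * rho**(tau + 1) * L_res)
rhs_7c = (rho - 1) * sqrtn * L_max / (
    L_res * rho**tau * (2 + L_res / (sqrtn * L_max)))
assert gamma <= rhs_7b and gamma <= rhs_7c   # (7b) and (7c) hold for gamma = 1/psi
```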

Proof  (Theorem 3) From Markov’s inequality, we have

(f(xj)fϵ)\displaystyle\mathbb{P}(f(x_{j})-f^{*}\geq\epsilon) ϵ1𝔼(f(xj)f)\displaystyle\leq\epsilon^{-1}{\mathbb{E}(f(x_{j})-f^{*})}
ϵ1(1l2nLmax)j(f(x0)f)\displaystyle\leq{\epsilon}^{-1}{\left(1-{l\over 2nL_{\max}}\right)^{j}(f(x_{0})-f^{*})}
ϵ1(1c)(1/c)|logf(x0)fηϵ|(f(x0)f)with c=l/(2nLmax)\displaystyle\leq{\epsilon}^{-1}{\left(1-c\right)^{(1/c)\left|{\log{f(x_{0})-f^{*}\over\eta\epsilon}}\right|}(f(x_{0})-f^{*})}\quad\mbox{with $c=l/(2nL_{\max})$}
ϵ1(f(x0)f)e|logf(x0)fηϵ|\displaystyle\leq{\epsilon}^{-1}(f(x_{0})-f^{*})e^{-\left|{\log{f(x_{0})-f^{*}\over\eta\epsilon}}\right|}
=ηelog(f(x0)f)ηϵe|logf(x0)fηϵ|\displaystyle={\eta}e^{\log{(f(x_{0})-f^{*})\over\eta\epsilon}}e^{-\left|{\log{f(x_{0})-f^{*}\over\eta\epsilon}}\right|}
η,\displaystyle\leq\eta,

where the second inequality applies (13), the third inequality uses the definition of jj (16), and the second last inequality uses the inequality (1c)1/ce1c(0,1)(1-c)^{1/c}\leq e^{-1}~\forall\,c\in(0,1), which proves the essentially strongly convex case. Similarly, the general convex case is proven by

(f(xj)fϵ)ϵ1𝔼(f(xj)f)f(x0)fϵ(1+jf(x0)f4nLmaxR2)η,\displaystyle\mathbb{P}(f(x_{j})-f^{*}\geq\epsilon)\leq\epsilon^{-1}{\mathbb{E}(f(x_{j})-f^{*})}\leq\frac{f(x_{0})-f^{*}}{\epsilon\left(1+j{f(x_{0})-f^{*}\over 4nL_{\max}R^{2}}\right)}\leq\eta,

where the second inequality uses (14) and the last inequality uses the definition of jj (17).  
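For completeness, the elementary inequality (1-c)^{1/c}\leq e^{-1} used in the strongly convex case above follows from \log(1-c)\leq-c for all c\in(0,1):

\frac{1}{c}\log(1-c)\leq-1\quad\Longrightarrow\quad(1-c)^{1/c}=e^{\frac{1}{c}\log(1-c)}\leq e^{-1}.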

B Proofs for Constrained Case

We start by introducing notation and proving several preliminary results. Define

(Δj)i(j):=(xjxj+1)i(j),(\Delta_{j})_{i(j)}:=(x_{j}-x_{j+1})_{i(j)}, (42)

and formulate the update in Step 4 of Algorithm 1 in the following way:

xj+1=argminxΩi(j)f(xk(j)),(xxj)i(j)+Lmax2γxxj2.x_{j+1}=\arg\min_{x\in\Omega}\,\langle\nabla_{i(j)}f(x_{k(j)}),(x-x_{j})_{i(j)}\rangle+\frac{L_{\max}}{2\gamma}\|x-x_{j}\|^{2}.

(Note that (xj+1)i=(xj)i(x_{j+1})_{i}=(x_{j})_{i} for ii(j)i\neq i(j).) The optimality condition for this formulation is

(xxj+1)i(j),i(j)f(xk(j))Lmaxγ(Δj)i(j)0,for all xΩ.\displaystyle\left\langle(x-x_{j+1})_{i(j)},\nabla_{i(j)}f(x_{k(j)})-{L_{\max}\over\gamma}(\Delta_{j})_{i(j)}\right\rangle\geq 0,\quad\mbox{\rm for all $x\in\Omega$}.

This implies in particular that for all xΩx\in\Omega, we have

(𝒫S(x)xj+1)i(j),i(j)f(xk(j))Lmaxγ(𝒫S(x)xj+1)i(j),(Δj)i(j).\displaystyle\left\langle(\mathcal{P}_{S}(x)-x_{j+1})_{i(j)},\nabla_{i(j)}f(x_{k(j)})\right\rangle\geq\frac{L_{\max}}{\gamma}\left\langle(\mathcal{P}_{S}(x)-x_{j+1})_{i(j)},(\Delta_{j})_{i(j)}\right\rangle. (43)

From the definition of LmaxL_{\max}, and using the notation (42), we have

f(xj+1)f(xj)+i(j)f(xj),(Δj)i(j)+Lmax2(Δj)i(j)2,\displaystyle f(x_{j+1})\leq f(x_{j})+\langle\nabla_{i(j)}f(x_{j}),-(\Delta_{j})_{i(j)}\rangle+{L_{\max}\over 2}\|(\Delta_{j})_{i(j)}\|^{2},

which indicates that

i(j)f(xj),(Δj)i(j)f(xj)f(xj+1)+Lmax2(Δj)i(j)2.\displaystyle\langle\nabla_{i(j)}f(x_{j}),(\Delta_{j})_{i(j)}\rangle\leq f(x_{j})-f(x_{j+1})+{L_{\max}\over 2}\|(\Delta_{j})_{i(j)}\|^{2}. (44)

From optimality conditions for the problem (19), which defines the vector x¯j+1\bar{x}_{j+1}, we have

xx¯j+1,f(xk(j))+Lmaxγ(x¯j+1xj)0xΩ.\displaystyle\left\langle x-\bar{x}_{j+1},\nabla f(x_{k(j)})+\frac{L_{\max}}{\gamma}(\bar{x}_{j+1}-x_{j})\right\rangle\geq 0\quad\forall\,x\in\Omega. (45)

We now define \Delta_{j}:=x_{j}-\bar{x}_{j+1}, and note that this definition is consistent with (\Delta_{j})_{i(j)} defined in (42), since the objective in (19) is separable across the coordinates. It can be seen that

𝔼i(j)(xj+1xj2)=1nx¯j+1xj2.\mathbb{E}_{i(j)}(\|x_{j+1}-x_{j}\|^{2})={1\over n}\|\bar{x}_{j+1}-x_{j}\|^{2}.

We now proceed to prove the main results of Section 5.
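Before doing so, a brief illustration may help: when each \Omega_{i} is a box interval, the single-coordinate update formulated above is simply a clipped coordinate gradient step. The following minimal sketch (the function and variable names are hypothetical, and this is not the multicore implementation used in the experiments) shows one such update, with grad_i_stale standing in for the possibly delayed partial gradient \nabla_{i(j)}f(x_{k(j)}):

```python
import numpy as np

def clipped_coordinate_step(x, i, grad_i_stale, gamma, L_max, lower, upper):
    """One update of the kind analyzed here: minimize the quadratic model over
    the interval Omega_i = [lower[i], upper[i]], leaving every other coordinate
    of x unchanged.  grad_i_stale stands in for the (possibly delayed) partial
    gradient used by the algorithm."""
    x_new = x.copy()
    target = x[i] - (gamma / L_max) * grad_i_stale   # unconstrained minimizer along coordinate i
    x_new[i] = np.clip(target, lower[i], upper[i])   # "clipping" = projection onto Omega_i
    return x_new
```

In this notation, (\Delta_{j})_{i(j)} from (42) equals x[i] - x_new[i], with grad_i_stale playing the role of \nabla_{i(j)}f(x_{k(j)}) for some k(j) with j-k(j)\leq\tau.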

Proof  (Theorem 4) We prove (22) by induction. First, note that for any vectors aa and bb, we have

\displaystyle\|a\|^{2}-\|b\|^{2}=2\|a\|^{2}-(\|a\|^{2}+\|b\|^{2})\leq 2\|a\|^{2}-2\langle a,b\rangle=2\langle a,a-b\rangle\leq 2\|a\|\|a-b\|.

Thus for all jj, we have

xj1x¯j2xjx¯j+122xj1x¯jxjx¯j+1xj1+x¯j.\|x_{j-1}-\bar{x}_{j}\|^{2}-\|x_{j}-\bar{x}_{j+1}\|^{2}\leq 2\|x_{j-1}-\bar{x}_{j}\|{\|x_{j}-\bar{x}_{j+1}-x_{j-1}+\bar{x}_{j}\|}. (46)

The second factor in the r.h.s. of (46) is bounded as follows:

xjx¯j+1xj1+x¯j\displaystyle\|x_{j}-\bar{x}_{j+1}-x_{j-1}+\bar{x}_{j}\|
=xj𝒫Ω(xjγLmaxf(xk(j)))(xj1𝒫Ω(xj1γLmaxf(xk(j1))))\displaystyle\quad=\left\|x_{j}-\mathcal{P}_{\Omega}(x_{j}-{\gamma\over L_{\max}}\nabla f(x_{k(j)}))-(x_{j-1}-\mathcal{P}_{\Omega}(x_{j-1}-{\gamma\over L_{\max}}\nabla f(x_{k(j-1)})))\right\|
xjγLmaxf(xk(j))𝒫Ω(xjγLmaxf(xk(j)))(xj1γLmaxf(xk(j1))\displaystyle\quad\leq\left\|x_{j}-{\gamma\over L_{\max}}\nabla f(x_{k(j)})-\mathcal{P}_{\Omega}(x_{j}-{\gamma\over L_{\max}}\nabla f(x_{k(j)}))-(x_{j-1}-{\gamma\over L_{\max}}\nabla f(x_{k(j-1)})\right.
𝒫Ω(xj1γLmaxf(xk(j1))))+γLmaxf(xk(j1))f(xk(j))\displaystyle\quad\quad\left.-\mathcal{P}_{\Omega}(x_{j-1}-{\gamma\over L_{\max}}\nabla f(x_{k(j-1)})))\right\|+{\gamma\over L_{\max}}\left\|\nabla f(x_{k(j-1)})-\nabla f(x_{k(j)})\right\|
xjγLmaxf(xk(j))xj1+γLmaxf(xk(j1))\displaystyle\quad\leq\left\|x_{j}-{\gamma\over L_{\max}}\nabla f(x_{k(j)})-x_{j-1}+{\gamma\over L_{\max}}\nabla f(x_{k(j-1)})\right\|
+γLmaxf(xk(j1))f(xk(j))\displaystyle\quad\quad+{\gamma\over L_{\max}}\left\|\nabla f(x_{k(j-1)})-\nabla f(x_{k(j)})\right\|
xjxj1+2γLmaxf(xk(j))f(xk(j1))\displaystyle\quad\leq\|x_{j}-x_{j-1}\|+2{\gamma\over L_{\max}}\left\|\nabla f(x_{k(j)})-\nabla f(x_{k(j-1)})\right\|
xjxj1+2γLmaxd=min{k(j1),k(j)}max{k(j1),k(j)}1f(xd)f(xd+1)\displaystyle\quad\leq\|x_{j}-x_{j-1}\|+2{\gamma\over L_{\max}}\sum_{d=\min\{k(j-1),k(j)\}}^{\max\{k(j-1),k(j)\}-1}\|\nabla f(x_{d})-\nabla f(x_{d+1})\|
xjxj1+2γLresLmaxd=min{k(j1),k(j)}max{k(j1),k(j)}1xdxd+1,\displaystyle\quad\leq\|x_{j}-x_{j-1}\|+2{\gamma L_{\mbox{\rm\scriptsize res}}\over L_{\max}}\sum_{d=\min\{k(j-1),k(j)\}}^{\max\{k(j-1),k(j)\}-1}\|x_{d}-x_{d+1}\|, (47)

where the first inequality follows by adding and subtracting a term, and the second inequality uses the nonexpansive property of projection:

(z𝒫Ω(z))(y𝒫Ω(y))zy.\|(z-\mathcal{P}_{\Omega}(z))-(y-\mathcal{P}_{\Omega}(y))\|\leq\|z-y\|.

One can see that j1τk(j1)j1j-1-\tau\leq k(j-1)\leq j-1 and jτk(j)jj-\tau\leq k(j)\leq j, which implies that j1τdj1j-1-\tau\leq d\leq j-1 for each index dd in the summation in (47). It also follows that

max{k(j1),k(j)}1min{k(j1),k(j)}τ.\displaystyle\max\{k(j-1),k(j)\}-1-\min\{k(j-1),k(j)\}\leq\tau. (48)

We set j=1, and note that k(0)=0 and k(1)\leq 1. Thus, in this case, the summation in (47) contains at most the single term \|x_{0}-x_{1}\| (it is empty when k(1)=0), and we have

\|x_{1}-\bar{x}_{2}-x_{0}+\bar{x}_{1}\|\leq\left(1+2\frac{\gamma L_{\mbox{\rm\scriptsize res}}}{L_{\max}}\right)\|x_{1}-x_{0}\|.

By substituting this bound in (46) and setting j=1j=1, we obtain

𝔼(x0x¯12)𝔼(x1x¯22)(2+4γLresLmax)𝔼(x1x0x¯1x0).\mathbb{E}(\|x_{0}-\bar{x}_{1}\|^{2})-\mathbb{E}(\|x_{1}-\bar{x}_{2}\|^{2})\leq\left(2+4\frac{\gamma L_{\mbox{\rm\scriptsize res}}}{L_{\max}}\right)\mathbb{E}(\|x_{1}-x_{0}\|\|\bar{x}_{1}-x_{0}\|). (49)

For any jj, we have

𝔼(xjxj1x¯jxj1)\displaystyle\mathbb{E}(\|x_{j}-x_{j-1}\|\|\bar{x}_{j}-x_{j-1}\|) 12𝔼(n1/2xjxj12+n1/2x¯jxj12)\displaystyle\leq{1\over 2}\mathbb{E}(n^{1/2}\|x_{j}-x_{j-1}\|^{2}+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
=12𝔼(n1/2𝔼i(j1)(xjxj12)+n1/2x¯jxj12)\displaystyle={1\over 2}\mathbb{E}(n^{1/2}\mathbb{E}_{i(j-1)}(\|x_{j}-x_{j-1}\|^{2})+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
=12𝔼(n1/2x¯jxj12+n1/2x¯jxj12)\displaystyle={1\over 2}\mathbb{E}(n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2}+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
=n1/2𝔼x¯jxj12.\displaystyle=n^{-1/2}\mathbb{E}\|\bar{x}_{j}-x_{j-1}\|^{2}. (50)

Returning to (49), we have

𝔼(x0x¯12)𝔼(x1x¯22)(2n+4γLresnLmax)𝔼x¯1x02\mathbb{E}(\|x_{0}-\bar{x}_{1}\|^{2})-\mathbb{E}(\|x_{1}-\bar{x}_{2}\|^{2})\leq\left({2\over\sqrt{n}}+\frac{4\gamma L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}\right)\mathbb{E}\|\bar{x}_{1}-x_{0}\|^{2}

which implies that

𝔼(x0x¯12)(12n4γLresnLmax)1𝔼(x1x¯22)ρ𝔼(x1x¯22).\mathbb{E}(\|x_{0}-\bar{x}_{1}\|^{2})\leq\left(1-{2\over\sqrt{n}}-\frac{4\gamma L_{\mbox{\rm\scriptsize res}}}{\sqrt{n}L_{\max}}\right)^{-1}\mathbb{E}(\|x_{1}-\bar{x}_{2}\|^{2})\leq\rho\mathbb{E}(\|x_{1}-\bar{x}_{2}\|^{2}).

To see the last inequality above, we only need to verify that

γ(1ρ12n)nLmax4Lres.\gamma\leq\left(1-\rho^{-1}-{2\over\sqrt{n}}\right)\frac{\sqrt{n}L_{\max}}{4L_{\mbox{\rm\scriptsize res}}}.

This proves that (22) holds for j=1j=1.

To take the inductive step, we assume that (22) holds up to index j1j-1. We have for j1τdj2j-1-\tau\leq d\leq j-2 that

𝔼(xdxd+1x¯jxj1)\displaystyle\mathbb{E}(\|x_{d}-x_{d+1}\|\|\bar{x}_{j}-x_{j-1}\|) 12𝔼(n1/2xdxd+12+n1/2x¯jxj12)\displaystyle\leq{1\over 2}\mathbb{E}(n^{1/2}\|x_{d}-x_{d+1}\|^{2}+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
=12𝔼(n1/2𝔼i(d)(xdxd+12)+n1/2x¯jxj12)\displaystyle={1\over 2}\mathbb{E}(n^{1/2}\mathbb{E}_{i(d)}(\|x_{d}-{x}_{d+1}\|^{2})+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
=12𝔼(n1/2xdx¯d+12+n1/2x¯jxj12)\displaystyle={1\over 2}\mathbb{E}(n^{-1/2}\|x_{d}-\bar{x}_{d+1}\|^{2}+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
12𝔼(n1/2ρτxj1x¯j2+n1/2x¯jxj12)\displaystyle\leq{1\over 2}\mathbb{E}(n^{-1/2}\rho^{\tau}\|x_{j-1}-\bar{x}_{j}\|^{2}+n^{-1/2}\|\bar{x}_{j}-x_{j-1}\|^{2})
ρτn1/2𝔼(x¯jxj12),\displaystyle\leq{\rho^{\tau}\over n^{1/2}}\mathbb{E}(\|\bar{x}_{j}-x_{j-1}\|^{2}), (51)

where the second inequality uses the inductive hypothesis. By substituting (47) into (46) and taking expectation on both sides of (46), we obtain

𝔼(xj1x¯j2)𝔼(xjx¯j+12)\displaystyle\mathbb{E}(\|x_{j-1}-\bar{x}_{j}\|^{2})-\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2})
2𝔼(x¯jxj1x¯jx¯j+1+xjxj1)\displaystyle\quad\leq 2\mathbb{E}(\|\bar{x}_{j}-x_{j-1}\|\|\bar{x}_{j}-\bar{x}_{j+1}+x_{j}-x_{j-1}\|)
2𝔼(x¯jxj1(xjxj1+2γLresLmaxd=min{k(j1),k(j)}max{k(j1),k(j)}1xdxd+1))\displaystyle\quad\leq 2\mathbb{E}\left(\|\bar{x}_{j}-x_{j-1}\|\left(\|x_{j}-x_{j-1}\|+2{\gamma L_{\mbox{\rm\scriptsize res}}\over L_{\max}}\sum_{d=\min\{k(j-1),k(j)\}}^{\max\{k(j-1),k(j)\}-1}\|x_{d}-x_{d+1}\|\right)\right)
=2𝔼(x¯jxj1xjxj1)+\displaystyle\quad=2\mathbb{E}(\|\bar{x}_{j}-x_{j-1}\|\|x_{j}-x_{j-1}\|)+
4γLresLmax𝔼(d=min{k(j1),k(j)}max{k(j1),k(j)}1(x¯jxj1xdxd+1))\displaystyle\quad\quad\quad\quad 4{\gamma L_{\mbox{\rm\scriptsize res}}\over L_{\max}}\mathbb{E}\left(\sum_{d=\min\{k(j-1),k(j)\}}^{\max\{k(j-1),k(j)\}-1}(\|\bar{x}_{j}-x_{j-1}\|\|x_{d}-x_{d+1}\|)\right)
n1/2(2+4γLresτρτLmax)𝔼(xj1x¯j2),\displaystyle\quad\leq n^{-1/2}\left(2+\frac{4\gamma L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{L_{\max}}\right)\mathbb{E}(\|{x}_{j-1}-\bar{x}_{j}\|^{2}),

where the last line uses (48), (50), and (51). It follows that

𝔼(xj1x¯j2)(1n1/2(2+4γLresτρτLmax))1𝔼(xjx¯j+12)ρ𝔼(xjx¯j+12).\displaystyle\mathbb{E}(\|x_{j-1}-\bar{x}_{j}\|^{2})\leq\left(1-n^{-1/2}\left(2+\frac{4\gamma L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{L_{\max}}\right)\right)^{-1}\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2})\leq\rho\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2}).

To see the last inequality, one only needs to verify that

ρ111n(2+4γLresτρτLmax)γ(1ρ12n)nLmax4Lresτρτ,\rho^{-1}\leq 1-{1\over\sqrt{n}}\left(2+{4\gamma L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}\over L_{\max}}\right)\;\Leftrightarrow\;\gamma\leq\left({1-\rho^{-1}-{2\over\sqrt{n}}}\right){{\sqrt{n}L_{\max}\over 4L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}},

and the last inequality holds because of the upper bound on \gamma in (21). This proves (22).

Next we show that the expected objective value is monotonically decreasing. Using the definition (42), we have

\displaystyle\mathbb{E}_{i(j)}(f(x_{j+1}))=n^{-1}\sum_{i=1}^{n}f(x_{j}-(\Delta_{j})_{i}e_{i})
n1i=1n[f(xj)+if(xj),(x¯j+1xj)i+Lmax2(xj+1xj)i2]\displaystyle\quad\leq n^{-1}\sum_{i=1}^{n}\left[f(x_{j})+\langle\nabla_{i}f(x_{j}),(\bar{x}_{j+1}-x_{j})_{i}\rangle+\frac{L_{\max}}{2}\|(x_{j+1}-x_{j})_{i}\|^{2}\right]
=f(xj)+n1(f(xj),x¯j+1xj+Lmax2x¯j+1xj2)\displaystyle\quad=f(x_{j})+n^{-1}\left(\langle\nabla f(x_{j}),\bar{x}_{j+1}-x_{j}\rangle+\frac{L_{\max}}{2}\|\bar{x}_{j+1}-x_{j}\|^{2}\right)
=f(xj)+1n(f(xk(j)),x¯j+1xj+Lmax2x¯j+1xj2)+1nf(xj)f(xk(j)),x¯j+1xj\displaystyle\quad=f(x_{j})+{1\over n}\left(\langle\nabla f(x_{k(j)}),\bar{x}_{j+1}-x_{j}\rangle+\frac{L_{\max}}{2}\|\bar{x}_{j+1}-x_{j}\|^{2}\right)+{1\over n}\langle\nabla f(x_{j})-\nabla f(x_{k(j)}),\bar{x}_{j+1}-x_{j}\rangle
f(xj)+1n(Lmax2x¯j+1xj2Lmaxγx¯j+1xj2)+1nf(xj)f(xk(j)),x¯j+1xj\displaystyle\quad\leq f(x_{j})+{1\over n}\left(\frac{L_{\max}}{2}\|\bar{x}_{j+1}-x_{j}\|^{2}-{L_{\max}\over\gamma}\|\bar{x}_{j+1}-x_{j}\|^{2}\right)+{1\over n}\langle\nabla f(x_{j})-\nabla f(x_{k(j)}),\bar{x}_{j+1}-x_{j}\rangle
=f(xj)(1γ12)Lmaxnx¯j+1xj2+1nf(xj)f(xk(j)),x¯j+1xj,\displaystyle\quad=f(x_{j})-\left({\frac{1}{\gamma}-\frac{1}{2}}\right)\frac{L_{\max}}{n}\|\bar{x}_{j+1}-x_{j}\|^{2}+{1\over n}\langle\nabla f(x_{j})-\nabla f(x_{k(j)}),\bar{x}_{j+1}-x_{j}\rangle, (52)

where the second inequality uses (45). Consider the expectation of the last term on the right-hand side of this expression. We have

𝔼f(xj)f(xk(j)),x¯j+1xj\displaystyle\mathbb{E}\langle\nabla f(x_{j})-\nabla f(x_{k(j)}),\bar{x}_{j+1}-x_{j}\rangle
𝔼(f(xj)f(xk(j))x¯j+1xj)\displaystyle\quad\leq\mathbb{E}\left(\|\nabla f(x_{j})-\nabla f(x_{k(j)})\|\|\bar{x}_{j+1}-x_{j}\|\right)
𝔼(d=k(j)j1f(xd)f(xd+1)x¯j+1xj)\displaystyle\quad\leq\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\|\nabla f(x_{d})-\nabla f(x_{d+1})\|\|\bar{x}_{j+1}-x_{j}\|\right)
Lres𝔼(d=k(j)j1xdxd+1x¯j+1xj)\displaystyle\quad\leq L_{\mbox{\rm\scriptsize res}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\|x_{d}-x_{d+1}\|\|\bar{x}_{j+1}-x_{j}\|\right)
Lres2𝔼(d=k(j)j1(n1/2xdxd+12+n1/2x¯j+1xj2))\displaystyle\quad\leq{L_{\mbox{\rm\scriptsize res}}\over 2}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}(n^{1/2}\|x_{d}-x_{d+1}\|^{2}+n^{-1/2}\|\bar{x}_{j+1}-x_{j}\|^{2})\right)
=Lres2𝔼(d=k(j)j1(n1/2𝔼i(d)(xdxd+12)+n1/2x¯j+1xj2))\displaystyle\quad={L_{\mbox{\rm\scriptsize res}}\over 2}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\left(n^{1/2}\mathbb{E}_{i(d)}(\|x_{d}-x_{d+1}\|^{2})+n^{-1/2}\|\bar{x}_{j+1}-x_{j}\|^{2}\right)\right)
=Lres2𝔼(d=k(j)j1(n1/2xdx¯d+12+n1/2x¯j+1xj2))\displaystyle\quad={L_{\mbox{\rm\scriptsize res}}\over 2}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}(n^{-1/2}\|x_{d}-\bar{x}_{d+1}\|^{2}+n^{-1/2}\|\bar{x}_{j+1}-x_{j}\|^{2})\right)
Lres2n1/2𝔼(d=k(j)j1(1+ρτ)x¯j+1xj2)\displaystyle\quad\leq{L_{\mbox{\rm\scriptsize res}}\over 2n^{1/2}}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}(1+\rho^{\tau})\|\bar{x}_{j+1}-x_{j}\|^{2}\right)
Lresτρτn1/2𝔼x¯j+1xj2,\displaystyle\quad\leq{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}\over n^{1/2}}\mathbb{E}\|\bar{x}_{j+1}-x_{j}\|^{2}, (53)

where the fifth inequality uses (22). By taking expectation on both sides of (52) and substituting (53), we have

𝔼(f(xj+1))𝔼(f(xj))1n((1γ12)LmaxLresτρτn1/2)𝔼x¯j+1xj2.\displaystyle\mathbb{E}(f(x_{j+1}))\leq\mathbb{E}(f(x_{j}))-{1\over n}\left(\left({\frac{1}{\gamma}-\frac{1}{2}}\right)L_{\max}-{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}\over n^{1/2}}\right)\mathbb{E}\|\bar{x}_{j+1}-x_{j}\|^{2}.

To see (1γ12)LmaxLresτρτn1/20\left({\frac{1}{\gamma}-\frac{1}{2}}\right)L_{\max}-{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}\over n^{1/2}}\geq 0, we only need to verify

γ(12+LresτρτnLmax)1\displaystyle\gamma\leq\left({1\over 2}+{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}\over\sqrt{n}L_{\max}}\right)^{-1}

which is implied by the first upper bound on \gamma in (21). Therefore, we have proved the monotonicity \mathbb{E}(f(x_{j+1}))\leq\mathbb{E}(f(x_{j})).
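To see this implication, note from the definition (20) of \psi that

\psi\geq 1+\frac{2L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{\sqrt{n}L_{\max}}\geq\frac{1}{2}+\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{\sqrt{n}L_{\max}},

so that the first bound \gamma\leq 1/\psi in (21) implies the condition displayed above.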

Next we prove the sublinear convergence rate for the constrained smooth convex case in (24). We have

xj+1𝒫S(xj+1)2\displaystyle\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}
xj+1𝒫S(xj)2\displaystyle\quad\leq\|x_{j+1}-\mathcal{P}_{S}(x_{j})\|^{2}
=xj(Δj)i(j)ei(j)𝒫S(xj)2\displaystyle\quad=\|x_{j}-(\Delta_{j})_{i(j)}e_{i(j)}-\mathcal{P}_{S}(x_{j})\|^{2}
=xj𝒫S(xj)2+|(Δj)i(j)|22(xj𝒫S(xj))i(j)(Δj)i(j)\displaystyle\quad=\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+|(\Delta_{j})_{i(j)}|^{2}-2(x_{j}-\mathcal{P}_{S}(x_{j}))_{i(j)}(\Delta_{j})_{i(j)}
=xj𝒫S(xj)2|(Δj)i(j)|22((xj𝒫S(xj))i(j)(Δj)i(j))(Δj)i(j)\displaystyle\quad=\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}-|(\Delta_{j})_{i(j)}|^{2}-2\left((x_{j}-\mathcal{P}_{S}(x_{j}))_{i(j)}-(\Delta_{j})_{i(j)}\right)(\Delta_{j})_{i(j)}
=xj𝒫S(xj)2|(Δj)i(j)|2+2(𝒫S(xj)xj+1)i(j)(Δj)i(j)\displaystyle\quad=\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}-|(\Delta_{j})_{i(j)}|^{2}+2(\mathcal{P}_{S}(x_{j})-x_{j+1})_{i(j)}(\Delta_{j})_{i(j)}
xj𝒫S(xj)2|(Δj)i(j)|2+2γLmax(𝒫S(xj)xj+1)i(j)i(j)f(xk(j))\displaystyle\quad\leq\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}-|(\Delta_{j})_{i(j)}|^{2}+\frac{2\gamma}{L_{\max}}(\mathcal{P}_{S}(x_{j})-x_{j+1})_{i(j)}\nabla_{i(j)}f(x_{k(j)}) (54)

where the last inequality uses (43). We bound the last term in (54) by

2γLmax(𝒫S(xj)xj+1)i(j)i(j)f(xk(j))\displaystyle\frac{2\gamma}{L_{\max}}(\mathcal{P}_{S}(x_{j})-x_{j+1})_{i(j)}\nabla_{i(j)}f(x_{k(j)})
=2γLmax(𝒫S(xj)xj)i(j)i(j)f(xk(j))\displaystyle\quad=\frac{2\gamma}{L_{\max}}(\mathcal{P}_{S}(x_{j})-x_{j})_{i(j)}\nabla_{i(j)}f(x_{k(j)})
+2γLmax((Δj)i(j)i(j)f(xj)+(Δj)i(j)(i(j)f(xk(j))i(j)f(xj)))\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}\left((\Delta_{j})_{i(j)}\nabla_{i(j)}f(x_{j})+(\Delta_{j})_{i(j)}\left(\nabla_{i(j)}f(x_{k(j)})-\nabla_{i(j)}f(x_{j})\right)\right)
2γLmax(𝒫S(xj)xj)i(j)i(j)f(xk(j))\displaystyle\quad\leq\frac{2\gamma}{L_{\max}}(\mathcal{P}_{S}(x_{j})-x_{j})_{i(j)}\nabla_{i(j)}f(x_{k(j)})
+2γLmax(f(xj)f(xj+1)+Lmax2|(Δj)i(j)|2\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}\bigg{(}f(x_{j})-f(x_{j+1})+{L_{\max}\over 2}|(\Delta_{j})_{i(j)}|^{2}
+(Δj)i(j)(i(j)f(xk(j))i(j)f(xj)))\displaystyle\quad\quad+(\Delta_{j})_{i(j)}\left(\nabla_{i(j)}f(x_{k(j)})-\nabla_{i(j)}f(x_{j})\right)\bigg{)}
=γ|(Δj)i(j)|2+2γLmax(f(xj)f(xj+1))\displaystyle\quad=\gamma|(\Delta_{j})_{i(j)}|^{2}+\frac{2\gamma}{L_{\max}}(f(x_{j})-f(x_{j+1}))
+2γLmax(𝒫S(xj)xj)i(j)i(j)f(xk(j))\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}(\mathcal{P}_{S}(x_{j})-x_{j})_{i(j)}\nabla_{i(j)}f(x_{k(j)})
+2γLmax(Δj)i(j)(i(j)f(xk(j))i(j)f(xj))\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}(\Delta_{j})_{i(j)}\left(\nabla_{i(j)}f(x_{k(j)})-\nabla_{i(j)}f(x_{j})\right) (55)

where the inequality uses (44).

Together with (54), we obtain

xj+1𝒫S(xj+1)2xj𝒫S(xj)2(1γ)|(Δj)i(j)|2\displaystyle\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}\leq\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}-(1-\gamma)|(\Delta_{j})_{i(j)}|^{2}
+2γLmax(f(xj)f(xj+1))\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}(f(x_{j})-f(x_{j+1}))
+2γLmax(𝒫S(xj)xj)i(j)i(j)f(xk(j))T1\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}\underbrace{(\mathcal{P}_{S}(x_{j})-x_{j})_{i(j)}\nabla_{i(j)}f(x_{k(j)})}_{T_{1}} (56)
+2γLmax(Δj)i(j)(i(j)f(xk(j))i(j)f(xj))T2.\displaystyle\quad\quad+\frac{2\gamma}{L_{\max}}\underbrace{(\Delta_{j})_{i(j)}\left(\nabla_{i(j)}f(x_{k(j)})-\nabla_{i(j)}f(x_{j})\right)}_{T_{2}}.

We now seek upper bounds on the quantities T1T_{1} and T2T_{2} in the expectation sense. For T1T_{1}, we have

𝔼(T1)=n1𝔼𝒫S(xj)xj,f(xk(j))\displaystyle\mathbb{E}(T_{1})=n^{-1}\mathbb{E}\langle\mathcal{P}_{S}(x_{j})-x_{j},\nabla f(x_{k(j)})\rangle
=n1𝔼𝒫S(xj)xk(j),f(xk(j))+n1𝔼d=k(j)j1xdxd+1,f(xk(j))\displaystyle=n^{-1}\mathbb{E}\langle\mathcal{P}_{S}(x_{j})-x_{k(j)},\nabla f(x_{k(j)})\rangle+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})\rangle
=n1𝔼𝒫S(xj)xk(j),f(xk(j))\displaystyle=n^{-1}\mathbb{E}\langle\mathcal{P}_{S}(x_{j})-x_{k(j)},\nabla f(x_{k(j)})\rangle
+n1𝔼d=k(j)j1xdxd+1,f(xd)+xdxd+1,f(xk(j))f(xd)\displaystyle\quad+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\langle x_{d}-x_{d+1},\nabla f(x_{d})\rangle+\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
n1𝔼(ff(xk(j)))+n1𝔼d=k(j)j1(f(xd)f(xd+1)+Lmax2xdxd+12)\displaystyle\leq n^{-1}\mathbb{E}(f^{*}-f(x_{k(j)}))+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\left(f(x_{d})-f(x_{d+1})+{L_{\max}\over 2}\|x_{d}-x_{d+1}\|^{2}\right)
+n1𝔼d=k(j)j1xdxd+1,f(xk(j))f(xd)\displaystyle\quad+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
=n1𝔼(ff(xj))+Lmax2n𝔼d=k(j)j1xdxd+12\displaystyle=n^{-1}\mathbb{E}(f^{*}-f(x_{j}))+{L_{\max}\over 2n}\mathbb{E}\sum_{d={k(j)}}^{j-1}\|x_{d}-x_{d+1}\|^{2}
+n1𝔼d=k(j)j1xdxd+1,f(xk(j))f(xd)\displaystyle\quad+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
=n1𝔼(ff(xj))+Lmax2n2𝔼d=k(j)j1xdx¯d+12\displaystyle=n^{-1}\mathbb{E}(f^{*}-f(x_{j}))+{L_{\max}\over 2n^{2}}\mathbb{E}\sum_{d={k(j)}}^{j-1}\|x_{d}-\bar{x}_{d+1}\|^{2}
+n1𝔼d=k(j)j1xdxd+1,f(xk(j))f(xd)\displaystyle\quad+n^{-1}\mathbb{E}\sum_{d={k(j)}}^{j-1}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
n1𝔼(ff(xj))+Lmaxτρτ2n2𝔼xjx¯j+12\displaystyle\leq n^{-1}\mathbb{E}(f^{*}-f(x_{j}))+{L_{\max}\tau\rho^{\tau}\over 2n^{2}}\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}
+n1d=k(j)j1𝔼xdxd+1,f(xk(j))f(xd)T3,\displaystyle\quad+n^{-1}\sum_{d={k(j)}}^{j-1}\underbrace{\mathbb{E}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle}_{T_{3}},

where the first inequality uses (44), applied at steps d=k(j),\dotsc,j-1, together with the convexity of f:

f(\mathcal{P}_{S}(x_{j}))-f(x_{k(j)})\geq\langle\mathcal{P}_{S}(x_{j})-x_{k(j)},\nabla f(x_{k(j)})\rangle,

and the last inequality uses (22).

We bound \mathbb{E}(T_{3}) as follows:

𝔼(T3)\displaystyle\mathbb{E}(T_{3}) =𝔼xdxd+1,f(xk(j))f(xd)\displaystyle=\mathbb{E}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
=𝔼(𝔼i(d)xdxd+1,f(xk(j))f(xd))\displaystyle=\mathbb{E}(\mathbb{E}_{i(d)}\langle x_{d}-x_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle)
=n1𝔼xdx¯d+1,f(xk(j))f(xd)\displaystyle=n^{-1}\mathbb{E}\langle x_{d}-\bar{x}_{d+1},\nabla f(x_{k(j)})-\nabla f(x_{d})\rangle
n1𝔼xdx¯d+1f(xk(j))f(xd)\displaystyle\leq n^{-1}\mathbb{E}\|x_{d}-\bar{x}_{d+1}\|\|\nabla f(x_{k(j)})-\nabla f(x_{d})\|
n1𝔼(xdx¯d+1t=k(j)d1f(xt)f(xt+1))\displaystyle\leq n^{-1}\mathbb{E}\left(\|x_{d}-\bar{x}_{d+1}\|\sum_{t=k(j)}^{d-1}\|\nabla f(x_{t})-\nabla f(x_{t+1})\|\right)
Lresn𝔼(t=k(j)d1xdx¯d+1xtxt+1)\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}}{n}\mathbb{E}\left(\sum_{t=k(j)}^{d-1}\|x_{d}-\bar{x}_{d+1}\|\|x_{t}-x_{t+1}\|\right)
Lres2n𝔼(t=k(j)d1(n1/2xdx¯d+12+n1/2xtxt+12))\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}}{2n}\mathbb{E}\left(\sum_{t=k(j)}^{d-1}(n^{-1/2}\|x_{d}-\bar{x}_{d+1}\|^{2}+n^{1/2}\|x_{t}-x_{t+1}\|^{2})\right)
Lres2nt=k(j)d1𝔼(n1/2xdx¯d+12+n1/2xtx¯t+12)\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}}{2n}\sum_{t=k(j)}^{d-1}\mathbb{E}(n^{-1/2}\|x_{d}-\bar{x}_{d+1}\|^{2}+n^{-1/2}\|x_{t}-\bar{x}_{t+1}\|^{2})
Lresρτn3/2t=k(j)d1𝔼(xjx¯j+12)\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}\rho^{\tau}}{n^{3/2}}\sum_{t=k(j)}^{d-1}\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2})
Lresτρτn3/2𝔼(xjx¯j+12),\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{n^{3/2}}\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2}),

where the first inequality uses the Cauchy-Schwarz inequality and the second-last inequality uses (22). Therefore, \mathbb{E}(T_{1}) can be bounded by

𝔼(T1)\displaystyle\mathbb{E}(T_{1}) =𝔼(𝒫S(xj)xj)i(j),i(j)f(xk(j))\displaystyle=\mathbb{E}\langle(\mathcal{P}_{S}(x_{j})-x_{j})_{i(j)},\nabla_{i(j)}f(x_{k(j)})\rangle
1n𝔼(ff(xj))+Lmaxτρτ2n2𝔼xjx¯j+12+d=k(j)j1Lresτρτn5/2𝔼(xjx¯j+12)\displaystyle\leq\frac{1}{n}\mathbb{E}(f^{*}-f(x_{j}))+{L_{\max}\tau\rho^{\tau}\over 2n^{2}}\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}+\sum_{d=k(j)}^{j-1}\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{n^{5/2}}\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2})
1n(f𝔼f(xj)+(Lmaxτρτ2n+Lresτ2ρτn3/2)𝔼(xjx¯j+12)).\displaystyle\leq\frac{1}{n}\left(f^{*}-\mathbb{E}f(x_{j})+\left({L_{\max}\tau\rho^{\tau}\over 2n}+\frac{L_{\mbox{\rm\scriptsize res}}\tau^{2}\rho^{\tau}}{n^{3/2}}\right)\mathbb{E}(\|x_{j}-\bar{x}_{j+1}\|^{2})\right). (57)

For T2T_{2}, we have

𝔼(T2)\displaystyle\mathbb{E}(T_{2}) =𝔼(Δj)i(j)(i(j)f(xk(j))i(j)f(xj))\displaystyle=\mathbb{E}(\Delta_{j})_{i(j)}\left(\nabla_{i(j)}f(x_{k(j)})-\nabla_{i(j)}f(x_{j})\right)
=n1𝔼Δj,f(xk(j))f(xj)\displaystyle=n^{-1}\mathbb{E}\langle\Delta_{j},\nabla f(x_{k(j)})-\nabla f(x_{j})\rangle
n1𝔼(Δjf(xk(j))f(xj))\displaystyle\leq n^{-1}\mathbb{E}(\|\Delta_{j}\|\|\nabla f(x_{k(j)})-\nabla f(x_{j})\|)
n1𝔼(d=k(j)j1Δjf(xd)f(xd+1))\displaystyle\leq n^{-1}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\|\Delta_{j}\|\|\nabla f(x_{d})-\nabla f(x_{d+1})\|\right)
Lresn𝔼(d=k(j)j1Δjxdxd+1)\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}}{n}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}\|\Delta_{j}\|\|x_{d}-x_{d+1}\|\right)
\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}}{2n}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{-1/2}\|\Delta_{j}\|^{2}+n^{1/2}\|x_{d}-x_{d+1}\|^{2}\right)
=Lres2n𝔼(d=k(j)j1n1/2xjx¯j+12+n1/2𝔼i(d)xdxd+12)\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}}{2n}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{-1/2}\|x_{j}-\bar{x}_{j+1}\|^{2}+n^{1/2}\mathbb{E}_{i(d)}\|x_{d}-x_{d+1}\|^{2}\right)
=Lres2n𝔼(d=k(j)j1n1/2xjx¯j+12+n1/2xdx¯d+12)\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}}{2n}\mathbb{E}\left(\sum_{d=k(j)}^{j-1}n^{-1/2}\|x_{j}-\bar{x}_{j+1}\|^{2}+n^{-1/2}\|x_{d}-\bar{x}_{d+1}\|^{2}\right)
=Lres2n3/2(d=k(j)j1𝔼xjx¯j+12+𝔼xdx¯d+12)\displaystyle=\frac{L_{\mbox{\rm\scriptsize res}}}{2n^{3/2}}\left(\sum_{d=k(j)}^{j-1}\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}+\mathbb{E}\|x_{d}-\bar{x}_{d+1}\|^{2}\right)
Lres(1+ρτ)2n3/2d=k(j)j1𝔼xjx¯j+12\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}(1+\rho^{\tau})}{2n^{3/2}}\sum_{d=k(j)}^{j-1}\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}
Lresτρτn3/2𝔼xjx¯j+12,\displaystyle\leq\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{n^{3/2}}\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}, (58)

where the second last inequality uses (22).

By taking the expectation on both sides of (56), using 𝔼i(j)(|(Δj)i(j)|2)=n1xjx¯j+12\mathbb{E}_{i(j)}(|(\Delta_{j})_{i(j)}|^{2})=n^{-1}\|x_{j}-\bar{x}_{j+1}\|^{2}, and substituting the upper bounds from (57) and (58), we obtain

𝔼xj+1\displaystyle\mathbb{E}\|x_{j+1} 𝒫S(xj+1)2𝔼xj𝒫S(xj)2\displaystyle-\mathcal{P}_{S}(x_{j+1})\|^{2}\leq\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}
1n(1γ2γLresτρτLmaxn1/2γτρτn2γLresτ2ρτLmaxn3/2)𝔼xjx¯j+12\displaystyle\quad-{1\over n}\left(1-\gamma-\frac{2\gamma L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{L_{\max}n^{1/2}}-{{\gamma\tau\rho^{\tau}\over n}-\frac{2\gamma L_{\mbox{\rm\scriptsize res}}\tau^{2}\rho^{\tau}}{L_{\max}n^{3/2}}}\right)\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2}
+2γLmaxn(f𝔼f(xj))+2γLmax(𝔼f(xj)𝔼f(xj+1))\displaystyle\quad+\frac{2\gamma}{L_{\max}n}(f^{*}-\mathbb{E}f(x_{j}))+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-\mathbb{E}f(x_{j+1}))
𝔼xj𝒫S(xj)2+2γLmaxn(f𝔼f(xj))+2γLmax(𝔼f(xj)𝔼f(xj+1)).\displaystyle\leq\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}n}(f^{*}-\mathbb{E}f(x_{j}))+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-\mathbb{E}f(x_{j+1})). (59)

In the second inequality, we were able to drop the term involving 𝔼xjx¯j+12\mathbb{E}\|x_{j}-\bar{x}_{j+1}\|^{2} by using the fact that

1γ2γLresτρτLmaxn1/2γτρτn2γLresτ2ρτLmaxn3/2=1γψ0,1-\gamma-\frac{2\gamma L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{L_{\max}n^{1/2}}-{{\gamma\tau\rho^{\tau}\over n}-\frac{2\gamma L_{\mbox{\rm\scriptsize res}}\tau^{2}\rho^{\tau}}{L_{\max}n^{3/2}}}=1-\gamma\psi\geq 0,

which follows from the definition (20) of ψ\psi and from the first upper bound on γ\gamma in (21). It follows that

𝔼xj+1𝒫S(xj+1)2+2γLmax(𝔼f(xj+1)f)\displaystyle\mathbb{E}\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j+1})-f^{*})
𝔼xj𝒫S(xj)2+2γLmax(𝔼f(xj)f)2γLmaxn(𝔼f(xj)f)\displaystyle\quad\leq\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})-\frac{2\gamma}{L_{\max}n}(\mathbb{E}f(x_{j})-f^{*}) (60)
x0𝒫S(x0)2+2γLmax(f(x0)f)2γLmaxnt=0j(𝔼f(xt)f)\displaystyle\quad\leq\|x_{0}-\mathcal{P}_{S}(x_{0})\|^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})-\frac{2\gamma}{L_{\max}n}\sum_{t=0}^{j}(\mathbb{E}f(x_{t})-f^{*})
R02+2γLmax(f(x0)f)2γ(j+1)Lmaxn(𝔼f(xj+1)f),\displaystyle\quad\leq R_{0}^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})-\frac{2\gamma(j+1)}{L_{\max}n}(\mathbb{E}f(x_{j+1})-f^{*}),

where the second inequality follows by applying induction to the inequality

Sj+1Sj2γLmaxn𝔼(f(xj)f),S_{j+1}\leq S_{j}-{2\gamma\over L_{\max}n}\mathbb{E}(f(x_{j})-f^{*}),

where

S_{j}:=\mathbb{E}(\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2})+{2\gamma\over L_{\max}}\mathbb{E}(f(x_{j})-f^{*}),

and the last line uses the monotonicity of \mathbb{E}f(x_{j}) (proved above) and the definition R_{0}=\|x_{0}-\mathcal{P}_{S}(x_{0})\|. This implies that

𝔼xj+1𝒫S(xj+1)2+2γLmax(𝔼f(xj+1)f)+2γ(j+1)Lmaxn(𝔼f(xj+1)f)\displaystyle\quad\mathbb{E}\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j+1})-f^{*})+\frac{2\gamma(j+1)}{L_{\max}n}(\mathbb{E}f(x_{j+1})-f^{*})
R02+2γLmax(f(x0)f)\displaystyle\quad\quad\leq R_{0}^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})
2γ(n+j+1)Lmaxn(𝔼f(xj+1)f)R02+2γLmax(f(x0)f)\displaystyle\Rightarrow\frac{2\gamma(n+j+1)}{L_{\max}n}(\mathbb{E}f(x_{j+1})-f^{*})\leq R_{0}^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})
𝔼f(xj+1)fn(R02Lmax+2γ(f(x0)f))2γ(n+j+1).\displaystyle\Rightarrow\mathbb{E}f(x_{j+1})-f^{*}\leq\frac{n(R_{0}^{2}L_{\max}+2\gamma(f(x_{0})-f^{*}))}{2\gamma(n+j+1)}.

This completes the proof of the sublinear convergence rate (24).

Finally, we prove the linear convergence rate (23) for the essentially strongly convex case. All bounds proven above hold, and we make use of the following additional property:

f(xj)ff(𝒫S(xj)),xj𝒫S(xj)+l2xj𝒫S(xj)2l2xj𝒫S(xj)2,\displaystyle f(x_{j})-f^{*}\geq\langle\nabla f(\mathcal{P}_{S}(x_{j})),x_{j}-\mathcal{P}_{S}(x_{j})\rangle+\frac{l}{2}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}\geq\frac{l}{2}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2},

where the second inequality holds because x_{j} is feasible, so that optimality of \mathcal{P}_{S}(x_{j}) implies \langle\nabla f(\mathcal{P}_{S}(x_{j})),x_{j}-\mathcal{P}_{S}(x_{j})\rangle\geq 0. Using this result together with some elementary manipulation, we obtain

f(xj)f\displaystyle f(x_{j})-f^{*} =(1Lmaxlγ+Lmax)(f(xj)f)+Lmaxlγ+Lmax(f(xj)f)\displaystyle=\left(1-\frac{L_{\max}}{l\gamma+L_{\max}}\right)(f(x_{j})-f^{*})+\frac{L_{\max}}{l\gamma+L_{\max}}(f(x_{j})-f^{*})
(1Lmaxlγ+Lmax)(f(xj)f)+Lmaxl2(lγ+Lmax)xj𝒫S(xj)2\displaystyle\geq\left(1-\frac{L_{\max}}{l\gamma+L_{\max}}\right)(f(x_{j})-f^{*})+\frac{L_{\max}l}{2(l\gamma+L_{\max})}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}
=Lmaxl2(lγ+Lmax)(xj𝒫S(xj)2+2γLmax(f(xj)f)).\displaystyle=\frac{L_{\max}l}{2(l\gamma+L_{\max})}\left(\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(f(x_{j})-f^{*})\right). (61)

Recalling (60), we have

𝔼xj+1𝒫S(xj+1)2+2γLmax(𝔼f(xj+1)f)\displaystyle\mathbb{E}\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j+1})-f^{*})
𝔼xj𝒫S(xj)2+2γLmax(𝔼f(xj)f)2γLmaxn(𝔼f(xj)f).\displaystyle\quad\leq\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})-\frac{2\gamma}{L_{\max}n}(\mathbb{E}f(x_{j})-f^{*}). (62)

By taking the expectation of both sides in (61) and substituting in the last term of (62), we obtain

𝔼xj+1𝒫S(xj+1)2+2γLmax(𝔼f(xj+1)f)\displaystyle\mathbb{E}\|x_{j+1}-\mathcal{P}_{S}(x_{j+1})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j+1})-f^{*})
𝔼xj𝒫S(xj)2+2γLmax(𝔼f(xj)f)\displaystyle\quad\leq\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})
2γLmaxn(Lmaxl2(lγ+Lmax)(𝔼xj𝒫S(xj)2+2γLmax(𝔼f(xj)f)))\displaystyle\quad\quad-\frac{2\gamma}{L_{\max}n}\left(\frac{L_{\max}l}{2(l\gamma+L_{\max})}\left(\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})\right)\right)
=(1ln(l+γ1Lmax))(𝔼xj𝒫S(xj)2+2γLmax(𝔼f(xj)f))\displaystyle\quad=\left(1-\frac{l}{n(l+\gamma^{-1}L_{\max})}\right)\left(\mathbb{E}\|x_{j}-\mathcal{P}_{S}(x_{j})\|^{2}+\frac{2\gamma}{L_{\max}}(\mathbb{E}f(x_{j})-f^{*})\right)
(1ln(l+γ1Lmax))j+1(x0𝒫S(x0)2+2γLmax(f(x0)f)),\displaystyle\quad\leq\left(1-\frac{l}{n(l+\gamma^{-1}L_{\max})}\right)^{j+1}\left(\|x_{0}-\mathcal{P}_{S}(x_{0})\|^{2}+\frac{2\gamma}{L_{\max}}(f(x_{0})-f^{*})\right),

which yields (23).  

Proof  (Corollary 5) To apply Theorem 4, we first show ρ>(12n)1\rho>\left(1-{2\over\sqrt{n}}\right)^{-1}. Using the bound (25), together with Lres/Lmax1L_{\mbox{\rm\scriptsize res}}/L_{\max}\geq 1, we obtain

(12n)(1+4eτLresnLmax)=(1+4eτLresnLmax)(1+4eτLresnLmax)2n\displaystyle\left(1-{2\over\sqrt{n}}\right)\left(1+{4e\tau L_{\mbox{\rm\scriptsize res}}\over\sqrt{n}L_{\max}}\right)=\left(1+{4e\tau L_{\mbox{\rm\scriptsize res}}\over\sqrt{n}L_{\max}}\right)-\left(1+{4e\tau L_{\mbox{\rm\scriptsize res}}\over\sqrt{n}L_{\max}}\right){2\over\sqrt{n}}
(1+4eτn)(1+1τ+1)2n=1+(2eτ11τ+1)2n>1,\displaystyle\quad\geq\left(1+{4e\tau\over\sqrt{n}}\right)-\left(1+{1\over\tau+1}\right){2\over\sqrt{n}}=1+\left({2e\tau}-1-{1\over\tau+1}\right){2\over\sqrt{n}}>1,

where the last inequality uses τ1\tau\geq 1. Note that for ρ\rho defined by (26), and using (25), we have

ρτρτ+1=((1+4eτLresnLmax)nLmax4eτLres)4eτLres(τ+1)nLmaxe4eτLres(τ+1)nLmaxe.\rho^{\tau}\leq\rho^{\tau+1}=\left(\left(1+{4e\tau L_{\mbox{\rm\scriptsize res}}\over\sqrt{n}L_{\max}}\right)^{\sqrt{n}L_{\max}\over 4e\tau L_{\mbox{\rm\scriptsize res}}}\right)^{{4e\tau L_{\mbox{\rm\scriptsize res}}(\tau+1)\over\sqrt{n}L_{\max}}}\leq e^{{4e\tau L_{\mbox{\rm\scriptsize res}}(\tau+1)\over\sqrt{n}L_{\max}}}\leq e.

Thus from the definition of ψ\psi (20), we have that

ψ=1+LresτρτnLmax(2+LmaxnLres+2τn)1+Lresτρτ4eLresτ(τ+1)(2+1n+2τn)\displaystyle\psi={1+\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{\sqrt{n}L_{\max}}\left({2+{L_{\max}\over\sqrt{n}L_{\mbox{\rm\scriptsize res}}}+{2\tau\over n}}\right)}\leq 1+\frac{L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}{4eL_{\mbox{\rm\scriptsize res}}\tau(\tau+1)}\left(2+{1\over{\sqrt{n}}}+{2\tau\over n}\right)
1+14(τ+1)(2+1n+2τn)1+(14+116+110)2.\displaystyle\quad\leq 1+\frac{1}{4(\tau+1)}\left(2+{1\over{\sqrt{n}}}+{2\tau\over n}\right)\leq 1+\left({1\over 4}+{1\over 16}+{1\over 10}\right)\leq 2. (63)

(The second last inequality uses n5n\geq 5 and τ1\tau\geq 1.) Thus, the steplength parameter choice γ=1/2\gamma=1/2 satisfies the first bound in (21). To show that the second bound in (21) holds also, we have

(11ρ2n)nLmax4Lresτρτ=(ρ1ρ2n)nLmax4Lresτρτ\displaystyle\left(1-{1\over\rho}-{2\over\sqrt{n}}\right)\frac{\sqrt{n}L_{\max}}{4L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}=\left({\rho-1\over\rho}-{2\over\sqrt{n}}\right)\frac{\sqrt{n}L_{\max}}{4L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}
=4eτLres4Lresτρτ+1Lmax2Lresτρτ112=12.\displaystyle\quad=\frac{4e\tau L_{\mbox{\rm\scriptsize res}}}{4L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau+1}}-\frac{L_{\max}}{2L_{\mbox{\rm\scriptsize res}}\tau\rho^{\tau}}\geq 1-{1\over 2}={1\over 2}.

We can thus set γ=1/2\gamma=1/2, and by substituting this choice into (23), we obtain (27). We obtain (28) by making the same substitution into (24).  
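As with Corollary 2, the parameter verification here can be traced numerically. The sketch below uses hypothetical values of n, \tau, L_{\max}, and L_{\mbox{\rm\scriptsize res}} (placeholders lying in the regime of (25), with n\geq 5 and \tau\geq 1), computes \rho and \psi from the expressions displayed above, and checks the bounds established in the proof, including that \gamma=1/2 satisfies both bounds in (21):

```python
import math

# Hypothetical parameters (placeholders only), chosen so that
# 4*e*tau*(tau+1)*L_res <= sqrt(n)*L_max, i.e., the regime of (25).
n, tau = 10000, 2
L_max, L_res = 1.0, 1.0
sqrtn = math.sqrt(n)
assert 4 * math.e * tau * (tau + 1) * L_res <= sqrtn * L_max

rho = 1 + 4 * math.e * tau * L_res / (sqrtn * L_max)      # expression for rho in (26)
psi = 1 + (L_res * tau * rho**tau / (sqrtn * L_max)) * (
    2 + L_max / (sqrtn * L_res) + 2 * tau / n)            # expression for psi in (20)
gamma = 0.5                                               # choice made in Corollary 5

assert rho > 1 / (1 - 2 / sqrtn)      # condition needed to apply Theorem 4
assert rho**(tau + 1) <= math.e       # bound used above
assert psi <= 2 and gamma <= 1 / psi  # first bound in (21)
second_bound_21 = (1 - 1 / rho - 2 / sqrtn) * sqrtn * L_max / (4 * L_res * tau * rho**tau)
assert gamma <= second_bound_21       # second bound in (21)
```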


References

  • Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 24, pages 873–881. 2011. URL http://papers.nips.cc/paper/4247-distributed-delayed-stochastic-optimization.pdf.
  • Avron et al. [2014] H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. IPDPS, 2014.
  • Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
  • Beck and Tetruashvili [2013] A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
  • Bertsekas and Tsitsiklis [1989] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
  • Boyd et al. [2011] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
  • Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
  • Cotter et al. [2011] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, pages 1647–1655. 2011. URL http://papers.nips.cc/paper/4432-better-mini-batch-algorithms-via-accelerated-gradient-methods.pdf.
  • Duchi et al. [2012] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
  • Ferris and Mangasarian [1994] M. C. Ferris and O. L. Mangasarian. Parallel variable distribution. SIAM Journal on Optimization, 4(4):815–832, 1994.
  • Goldfarb and Ma [2012] D. Goldfarb and S. Ma. Fast multiple-splitting algorithms for convex optimization. SIAM Journal on Optimization, 22(2):533–556, 2012.
  • Liu and Wright [2014] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. Technical Report arXiv:1403.3862, 2014.
  • Lu and Xiao [2013] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical Report arXiv:1305.4723, Simon Fraser University, 2013.
  • Luo and Tseng [1992] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72:7–35, 1992.
  • Mangasarian [1995] O. L. Mangasarian. Parallel gradient distribution in unconstrained optimization. SIAM Journal on Control and Optimization, 33(6):1916–1925, 1995.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
  • Nesterov [2004] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
  • Nesterov [2012] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • Niu et al. [2011] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems 24, pages 693–701, 2011.
  • Peng et al. [2013] Z. Peng, M. Yan, and W. Yin. Parallel and distributed sparse optimization. Preprint, 2013.
  • Richtárik and Takáč [2012a] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144:1–38, 2012a.
  • Richtárik and Takáč [2012b] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Technical Report arXiv:1212.0873, 2012b.
  • Scherrer et al. [2012] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. Advances in Neural Information Processing Systems 25, pages 28–36, 2012.
  • Shalev-Shwartz and Zhang [2013] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. Advances in Neural Information Processing Systems 26, pages 378–385, 2013.
  • Shamir and Zhang [2013] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
  • Tseng [2001] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:475–494, 2001.
  • Tseng and Yun [2009] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, Series B, 117:387–423, June 2009.
  • Tseng and Yun [2010] P. Tseng and S. Yun. A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Computational Optimization and Applications, 47(2):179–206, 2010.
  • Wang and Lin [2014] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523–1548, 2014.
  • Wright [2012] S. J. Wright. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22(1):159–186, 2012.
  • Yang [2013] T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. Advances in Neural Information Processing Systems 26, pages 629–637, 2013.