
A Stochastic Second-Order Proximal Method
for Distributed Optimization

Chenyang Qiu, Shanying Zhu, Zichong Ou and Jie Lu

C. Qiu, Z. Ou and J. Lu are with the School of Information Science and Technology, ShanghaiTech University, 201210 Shanghai, China. Email: {qiuchy, ouzch, lujie}@shanghaitech.edu.cn. S. Zhu is with the Department of Automation and the Key Laboratory of System Control and Information Processing, Shanghai Jiao Tong University, 200240 Shanghai, China. Email: shyzhu@sjtu.edu.cn.
Abstract

In this paper, we propose a distributed stochastic second-order proximal method that enables agents in a network to cooperatively minimize the sum of their local loss functions without any centralized coordination. The proposed algorithm, referred to as St-SoPro, incorporates a decentralized second-order approximation into an augmented Lagrangian function, and then randomly samples the local gradients and Hessian matrices of the agents, so that it is computationally and memory-wise efficient, particularly for large-scale optimization problems. We show that for globally restricted strongly convex problems, the expected optimality error of St-SoPro asymptotically drops below an explicit error bound at a linear rate, and the error bound can be arbitrarily small with proper parameter settings. Simulations over real machine learning datasets demonstrate that St-SoPro outperforms several state-of-the-art distributed stochastic first-order methods in terms of convergence speed as well as computation and communication costs.

I Introduction

Stochastic optimization algorithms have been flourishing recently due to their appealing efficiency in machine learning [1, 2]. In the context of large-scale machine learning, parallel stochastic algorithms are often used to process large datasets [3, 4]. However, such methods rely on a central node to keep the variables of all the nodes consistent, so that the communication burden of the central node becomes a bottleneck that restricts the algorithm performance.

On the other hand, a collection of distributed optimization algorithms has been proposed over the past decade to tackle various network control and resource allocation problems, where agents in a network communicate only with their neighbors and do not rely on any central coordination, eliminating potential communication bottlenecks in the computing infrastructure [5]. Typical methods include the distributed gradient descent (DGD) method [5], the decentralized exact first-order algorithm (EXTRA) [6], and the distributed gradient tracking algorithms [7].

Inheriting the merits of the above two algorithm types, distributed stochastic optimization algorithms have been attracting a lot of recent interest. For smooth, strongly convex optimization problems, [8] develops a distributed stochastic gradient descent (DSGD) method based on DGD, which is shown to achieve the optimal sublinear convergence rate (independent of the network) of a centralized stochastic gradient descent (SGD) method [9, 10]. The same convergence rate is attained by the exact diffusion method with adaptive step-sizes (EDAS) in [11]. In addition, the distributed stochastic gradient tracking (DSGT) method proposed in [12] is guaranteed to linearly converge to a neighborhood of the optimal solution in expectation. For non-convex problems, [13] designs a distributed primal-dual SGD algorithm with both fixed step-sizes (DPD-SGD-F) and adaptive step-sizes (DPD-SGD-T), where the former achieves linear convergence to suboptimality under the Polyak-Łojasiewicz (PL) condition.

The aforementioned distributed stochastic algorithms are all built upon deterministic first-order methods and evolve using only stochastic gradients. As second-order information often yields a more accurate approximation and accelerates problem solving, we endeavor to develop a second-order distributed stochastic optimization algorithm.

To this end, we choose SoPro [14], a deterministic distributed second-order proximal algorithm, as the cornerstone. SoPro is developed by virtue of a decentralized second-order approximation of the augmented Lagrangian function in the classic method of multipliers [15], and its convergence performance outperforms that of various distributed first-order methods in the deterministic setting. In this paper, we adapt SoPro to the stochastic setting. Specifically, instead of letting each agent compute the exact local gradient and Hessian matrix determined by all of its local data as SoPro does, we allow each agent to update using stochastic approximations of its local gradient and Hessian, obtained from two batches of samples randomly and uniformly drawn from its local loss function. Such a stochastic variant of SoPro can significantly enhance the computational and memory efficiency of the agents. We refer to this algorithm as the stochastic second-order proximal algorithm (St-SoPro).

Under the assumptions that the local loss functions are smooth and convex, and that their sum (i.e., the global objective function) is globally restricted strongly convex, we show that our proposed St-SoPro algorithm linearly converges to a neighborhood of the optimal solution in expectation over undirected networks. In particular, we provide an explicit upper bound on its ultimate suboptimality, and illustrate that this upper bound can be made arbitrarily small as long as the parameters are properly set. Finally, we validate the superior performance of St-SoPro in comparison with several recent distributed stochastic optimization methods on real machine-learning classification datasets, with respect to convergence speed, communication load, computational efficiency, and classification accuracy.

The paper is organized as follows. Section II formulates the optimization problem to be solved, and Section III describes the proposed St-SoPro algorithm. Section IV provides the convergence analysis, Section V presents the numerical results, and Section VI concludes the paper.

Notation: For any differentiable function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, its gradient at $x\in\mathbb{R}^{d}$ is denoted by $\nabla f(x)$, and if $f$ is twice differentiable, we use $\nabla^{2}f(x)$ to denote its Hessian matrix at $x$. For any set $S$, $|S|$ represents the cardinality of $S$. In addition, $\otimes$ is the Kronecker product, $\langle\cdot,\cdot\rangle$ is the Euclidean inner product, and $\|\cdot\|$ is the $\ell_{2}$ norm. We use $\mathbf{0}_{d}$, $\mathbf{1}_{d}$, $\mathbf{O}_{d}$, and $I_{d}$ to denote the $d$-dimensional all-zero vector, all-one vector, zero matrix, and identity matrix, respectively. Also, $\operatorname{diag}(A_{1},\ldots,A_{n})$ represents the block diagonal matrix whose diagonal blocks are sequentially $A_{1},\ldots,A_{n}$. Given a matrix $A\in\mathbb{R}^{d\times d}$, we write $A\succeq\mathbf{O}_{d}$ if it is positive semidefinite and $A\succ\mathbf{O}_{d}$ if it is positive definite. For any $A\succeq\mathbf{O}_{d}$ and $\mathbf{x}\in\mathbb{R}^{d}$, $\|\mathbf{x}\|_{A}^{2}:=\mathbf{x}^{T}A\mathbf{x}$; $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the largest and smallest real eigenvalues of $A$, respectively, and $A^{\dagger}$ is $A$'s pseudoinverse.

II Problem Formulation

Consider a set $\mathcal{V}=\{1,\ldots,N\}$ of agents, where the agents are connected through the link set $\mathcal{E}\subseteq\{\{i,j\}\subseteq\mathcal{V}\times\mathcal{V}\mid i\neq j\}$. We model such a network as a connected undirected graph $(\mathcal{V},\mathcal{E})$, and denote the set of each agent $i$'s neighbors by $\mathcal{N}_{i}=\{j\in\mathcal{V}\mid\{i,j\}\in\mathcal{E}\}$. Suppose each agent $i$ observes a finite number of local samples $\xi_{i,j}\in\mathbb{R}^{m}$, $j=1,\ldots,\mathcal{C}_{i}$, which are independent random vectors, and attempts to solve the following optimization problem:

$$\begin{array}{ll}\underset{x_{1},\ldots,x_{N}\in\mathbb{R}^{d}}{\operatorname{minimize}}&\sum_{i\in\mathcal{V}}f_{i}(x_{i})\\ \text{subject to}&x_{1}=x_{2}=\cdots=x_{N}.\end{array} \tag{1}$$

In problem (1), $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is the local loss function of agent $i$, which is the average of every sample's loss $l_{i,j}(x_{i},\xi_{i,j}):\mathbb{R}^{d}\rightarrow\mathbb{R}$ associated with agent $i$, i.e.,

$$f_{i}(x_{i})=\frac{1}{\mathcal{C}_{i}}\sum_{j=1}^{\mathcal{C}_{i}}l_{i,j}(x_{i},\xi_{i,j}).$$

Below, we impose the following assumptions on problem (1).

Assumption 1.

Problem (1) satisfies the following:

a) There exists an optimal solution $x^{\star}\in\mathbb{R}^{d}$ to problem (1), and $\sum_{i\in\mathcal{V}}f_{i}(x_{i})$ is globally restricted strongly convex with respect to $x^{\star}$ with convexity parameter $m_{\bar{f}}>0$.

b) For any given $\xi_{i,j}$, $l_{i,j}(\cdot,\xi_{i,j})$ is twice continuously differentiable and convex.

c) There exists $M_{i}>0$ such that $l_{i,j}(x,\xi_{i,j})$ is $M_{i}$-smooth for all $\xi_{i,j}$.

Assumption 1 leads to the following inequalities: for any given $x,y\in\mathbb{R}^{d}$ and $\xi_{i,j}$, $m_{i}\|x-y\|^{2}\leq\langle\nabla l_{i,j}(x,\xi_{i,j})-\nabla l_{i,j}(y,\xi_{i,j}),x-y\rangle\leq M_{i}\|x-y\|^{2}$ and

$$m_{i}I_{d}\preceq\nabla^{2}l_{i,j}(x,\xi_{i,j})\preceq M_{i}I_{d} \tag{2}$$

for some $m_{i}\in[0,M_{i}]$. Also, the globally restricted strong convexity in Assumption 1a) guarantees that the optimal solution $x^{\star}$ is unique.

Problem (1) requires that the agents reach a consensus while minimizing all the sample losses throughout the network. Indeed, a wide range of real-world problems can be cast into the form of problem (1), such as distributed model predictive control [16], distributed spectrum sensing [17], and logistic regression [18]. Under many circumstances, these engineering problems involve huge datasets. Thus, we focus on solving problem (1) in a fully decentralized and stochastic fashion. Specifically, we only allow each agent to communicate with its neighbors and to compute using a randomly chosen subset of its local samples.

III Stochastic Second-order Proximal Method

In this section, we develop a distributed stochastic algorithm for solving problem (1) over undirected networks.

To this end, we first provide a brief review of the distributed (deterministic) second-order proximal algorithm (SoPro) in [14]. Note that problem (1) is equivalent to

$$\begin{array}{ll}\underset{\mathbf{x}\in\mathbb{R}^{Nd}}{\operatorname{minimize}}&f(\mathbf{x}):=\sum_{i\in\mathcal{V}}f_{i}(x_{i})\\ \text{subject to}&W^{\frac{1}{2}}\mathbf{x}=\mathbf{0}_{Nd},\end{array} \tag{3}$$

where $\mathbf{x}=(x_{1}^{T},\ldots,x_{N}^{T})^{T}$, $W=P\otimes I_{d}\succeq\mathbf{O}_{Nd}$, and

$$[P]_{ij}=\begin{cases}\sum_{s\in\mathcal{N}_{i}}p_{is},&i=j,\\ -p_{ij},&j\in\mathcal{N}_{i},\\ 0,&\text{otherwise},\end{cases}\quad\forall i,j\in\mathcal{V},$$

with $p_{ij}=p_{ji}>0$ $\forall\{i,j\}\in\mathcal{E}$, so that the null space of $P$ is $\operatorname{span}\{\mathbf{1}_{N}\}$. Also, the unique optimal solution of problem (3) is $\mathbf{x}^{\star}=((x^{\star})^{T},\ldots,(x^{\star})^{T})^{T}$.
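To make the weight construction concrete, the following is a minimal sketch of building $P$ and $W=P\otimes I_{d}$ and checking that $P\mathbf{1}_{N}=\mathbf{0}_{N}$; the edge list and weights are illustrative assumptions of ours, not values used in the paper.

```python
import numpy as np

# Sketch: the weighted Laplacian P of (3) and W = P kron I_d for a small
# undirected graph; the edge list and weights p_ij are illustrative assumptions.
N, d = 4, 2
edges = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 1.0, (3, 0): 0.8}  # p_ij = p_ji > 0

P = np.zeros((N, N))
for (i, j), p_ij in edges.items():
    P[i, j] = P[j, i] = -p_ij   # off-diagonal entries: -p_ij for j in N_i
    P[i, i] += p_ij             # diagonal entries: sum of incident weights
    P[j, j] += p_ij

W = np.kron(P, np.eye(d))
# For a connected graph the null space of P is span{1_N}, so W^{1/2} x = 0
# exactly when x_1 = ... = x_N.
print(np.allclose(P @ np.ones(N), 0))  # True
```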

The application of the method of multipliers [15] to solve (3) gives the following: starting from any $\mathbf{v}^{0}\in\mathbb{R}^{Nd}$,

$$\mathbf{x}^{k+1}=\arg\min_{\mathbf{x}\in\mathbb{R}^{Nd}}L_{\beta}\left(\mathbf{x},\mathbf{v}^{k}\right), \tag{4}$$

$$\mathbf{v}^{k+1}=\mathbf{v}^{k}+\beta W^{\frac{1}{2}}\mathbf{x}^{k+1}, \tag{5}$$

where $\mathbf{x}^{k}=((x_{1}^{k})^{T},\ldots,(x_{N}^{k})^{T})^{T}$ and $\mathbf{v}^{k}$ are the primal and dual variables, respectively, and $L_{\beta}:\mathbb{R}^{Nd}\times\mathbb{R}^{Nd}\rightarrow\mathbb{R}$ is the augmented Lagrangian function given by $L_{\beta}(\mathbf{x},\mathbf{v})=f(\mathbf{x})+\mathbf{v}^{T}W^{\frac{1}{2}}\mathbf{x}+\frac{\beta}{2}\|W^{\frac{1}{2}}\mathbf{x}\|^{2}$ with $\beta>0$.
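As a sanity check on (4)-(5), the following minimal sketch runs the method of multipliers on problem (3) for the illustrative quadratic choice $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{c}\|^{2}$, for which the primal step (4) reduces to the linear solve $(I+\beta W)\mathbf{x}=\mathbf{c}-W^{\frac{1}{2}}\mathbf{v}$; the ring graph, unit weights, and $\beta$ are our own assumptions.

```python
import numpy as np

# Sketch: method of multipliers (4)-(5) on problem (3) with the toy choice
# f(x) = 0.5 * ||x - c||^2; the graph, weights, and beta are assumptions.
N, d, beta = 4, 2, 1.0
rng = np.random.default_rng(0)

# W = P kron I_d for a ring graph with unit weights.
P = 2 * np.eye(N) - np.roll(np.eye(N), 1, axis=1) - np.roll(np.eye(N), -1, axis=1)
W = np.kron(P, np.eye(d))
evals, evecs = np.linalg.eigh(W)
W_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T

c = rng.normal(size=N * d)   # stacked local targets c_i
v = np.zeros(N * d)
for k in range(100):
    x = np.linalg.solve(np.eye(N * d) + beta * W, c - W_half @ v)   # step (4)
    v = v + beta * (W_half @ x)                                     # step (5)

# x converges to the consensus optimum: every block equals the mean of the c_i.
print(np.allclose(x.reshape(N, d), c.reshape(N, d).mean(axis=0), atol=1e-6))
```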

Since (4)-(5) cannot be implemented in a distributed way, the SoPro algorithm in [14] introduces a decentralized second-order proximal approximation of $L_{\beta}(\mathbf{x},\mathbf{v}^{k})$ in (4) and applies a change of variables to (5). Specifically, it replaces $L_{\beta}(\mathbf{x},\mathbf{v}^{k})$ with its second-order Taylor expansion at $\mathbf{x}^{k}$. Then, it replaces the remaining coupling term $\frac{1}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}\beta W(\mathbf{x}-\mathbf{x}^{k})$ in the primal update with $\frac{1}{2}(\mathbf{x}-\mathbf{x}^{k})^{T}D(\mathbf{x}-\mathbf{x}^{k})$, where $D=\operatorname{diag}(D_{1},\ldots,D_{N})$ is a symmetric block diagonal matrix satisfying $\nabla^{2}f_{i}(x)+D_{i}\succ\mathbf{O}_{d}$ $\forall x\in\mathbb{R}^{d}$ $\forall i\in\mathcal{V}$. Furthermore, we define $\mathbf{q}^{k}=((q_{1}^{k})^{T},\ldots,(q_{N}^{k})^{T})^{T}=W^{\frac{1}{2}}\mathbf{v}^{k}$ as a substitute for $\mathbf{v}^{k}$, and $\mathbf{q}^{k}$ can be ensured to identically stay in the range of $W^{\frac{1}{2}}$ by letting $\sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}_{d}$. To summarize, SoPro takes the following form: starting from $\mathbf{q}^{0}$ satisfying $\sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}_{d}$,

$$\mathbf{x}^{k+1}=\mathbf{x}^{k}-(\nabla^{2}f(\mathbf{x}^{k})+D)^{-1}(\nabla f(\mathbf{x}^{k})+\beta W\mathbf{x}^{k}+\mathbf{q}^{k}), \tag{6}$$

$$\mathbf{q}^{k+1}=\mathbf{q}^{k}+\beta W\mathbf{x}^{k+1},\quad\forall k\geq 0,$$

where $\nabla f(\mathbf{x}^{k})=(\nabla f_{1}(x_{1}^{k})^{T},\ldots,\nabla f_{N}(x_{N}^{k})^{T})^{T}$ and $\nabla^{2}f(\mathbf{x}^{k})=\operatorname{diag}(\nabla^{2}f_{1}(x_{1}^{k}),\ldots,\nabla^{2}f_{N}(x_{N}^{k}))$, which satisfies $\nabla^{2}f(\mathbf{x}^{k})+D\succ\mathbf{O}_{Nd}$.

The primal update (6) of SoPro requires each agent to use all of its local samples. However, the agents may only be able to access or process a portion of their local samples at a time, especially in big-data scenarios. Motivated by this, we consider approximating the gradient $\nabla f(\mathbf{x}^{k})$ and the Hessian $\nabla^{2}f(\mathbf{x}^{k})$ in (6) via a stochastic gradient $g(\mathbf{x}^{k})$ and a stochastic Hessian $h(\mathbf{x}^{k})$ given by

$$g(\mathbf{x}^{k})=(g_{1}(x_{1}^{k})^{T},\ldots,g_{N}(x_{N}^{k})^{T})^{T},\quad\text{where each } g_{i}(x_{i}^{k})=\frac{1}{|\mathcal{G}_{i}^{k}|}\sum_{j\in\mathcal{G}_{i}^{k}}\nabla l_{i,j}(x_{i}^{k},\xi_{i,j}), \tag{7}$$

$$h(\mathbf{x}^{k})=\operatorname{diag}(h_{1}(x_{1}^{k}),\ldots,h_{N}(x_{N}^{k})),\quad\text{where each } h_{i}(x_{i}^{k})=\frac{1}{|\mathcal{S}_{i}^{k}|}\sum_{j\in\mathcal{S}_{i}^{k}}\nabla^{2}l_{i,j}(x_{i}^{k},\xi_{i,j}). \tag{8}$$

Here, for each agent $i\in\mathcal{V}$, $\mathcal{G}_{i}^{k}$ and $\mathcal{S}_{i}^{k}$ are two independent random sample sets chosen uniformly from $\{1,\ldots,\mathcal{C}_{i}\}$ without replacement, so that $g(\mathbf{x}^{k})$ and $h(\mathbf{x}^{k})$ are unbiased, i.e.,

$$\mathbf{E}_{\mathcal{G}_{i}^{k}}[g_{i}(x_{i}^{k})]=\nabla f_{i}(x_{i}^{k}),\quad\mathbf{E}_{\mathcal{S}_{i}^{k}}[h_{i}(x_{i}^{k})]=\nabla^{2}f_{i}(x_{i}^{k})$$

for all $\mathbf{x}^{k}$. Due to (2), $m_{i}I_{d}\preceq h_{i}(x_{i}^{k})\preceq M_{i}I_{d}$ $\forall i\in\mathcal{V}$ and

$$\Lambda_{m}\preceq h(\mathbf{x}^{k})\preceq\Lambda_{M}, \tag{9}$$

where $\Lambda_{m}=\operatorname{diag}(m_{1},\ldots,m_{N})\otimes I_{d}\succeq\mathbf{O}_{Nd}$ and $\Lambda_{M}=\operatorname{diag}(M_{1},\ldots,M_{N})\otimes I_{d}\succ\mathbf{O}_{Nd}$.

Using the above randomly sampled gradient and Hessian, we obtain the following stochastic variant of SoPro: starting from any $\mathbf{q}^{0}$ such that $\sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}_{d}$,

$$\mathbf{x}^{k+1}=\mathbf{x}^{k}-(h(\mathbf{x}^{k})+D)^{-1}(g(\mathbf{x}^{k})+\beta W\mathbf{x}^{k}+\mathbf{q}^{k}), \tag{10}$$

$$\mathbf{q}^{k+1}=\mathbf{q}^{k}+\beta W\mathbf{x}^{k+1},\quad\forall k\geq 0,$$

where each $D_{i}$ satisfies $h_{i}(x)+D_{i}\succ\mathbf{O}_{d}$ $\forall x\in\mathbb{R}^{d}$, i.e., $h(\mathbf{x})+D\succ\mathbf{O}_{Nd}$ $\forall\mathbf{x}\in\mathbb{R}^{Nd}$, so that (10) is well-posed. The above initialization and updates constitute our proposed stochastic second-order proximal (St-SoPro) method. The distributed implementation of St-SoPro over the undirected network $(\mathcal{V},\mathcal{E})$ is described in Algorithm 1, where the $y_{i}^{k}$ $\forall i\in\mathcal{V}$ are auxiliary variables introduced for clarity of presentation.

Algorithm 1 St-SoPro
1: Initialization:
2: Each agent $i\in\mathcal{V}$ sets $q_{i}^{0}\in\mathbb{R}^{d}$ such that $\sum_{i\in\mathcal{V}}q_{i}^{0}=\mathbf{0}_{d}$ (or simply sets $q_{i}^{0}=\mathbf{0}_{d}$).
3: Each agent $i\in\mathcal{V}$ arbitrarily sets $x_{i}^{0}\in\mathbb{R}^{d}$, and sends $x_{i}^{0}$ to each neighbor $j\in\mathcal{N}_{i}$.
4: After receiving $x_{j}^{0}$ $\forall j\in\mathcal{N}_{i}$, each agent $i\in\mathcal{V}$ sets $y_{i}^{0}=\sum_{j\in\mathcal{N}_{i}}p_{ij}(x_{i}^{0}-x_{j}^{0})$.
5: for $k=0,1,2,\ldots$ do
6:   Each agent $i\in\mathcal{V}$ randomly and uniformly chooses two independent subsets $\mathcal{G}_{i}^{k}$ and $\mathcal{S}_{i}^{k}$ of $\{1,\ldots,\mathcal{C}_{i}\}$, and then computes $g_{i}(x_{i}^{k})$ and $h_{i}(x_{i}^{k})$ according to (7) and (8).
7:   Each agent $i\in\mathcal{V}$ computes $x_{i}^{k+1}=x_{i}^{k}-(h_{i}(x_{i}^{k})+D_{i})^{-1}(g_{i}(x_{i}^{k})+\beta y_{i}^{k}+q_{i}^{k})$, and then sends $x_{i}^{k+1}$ to each neighbor $j\in\mathcal{N}_{i}$.
8:   After receiving $x_{j}^{k+1}$ $\forall j\in\mathcal{N}_{i}$, each agent $i\in\mathcal{V}$ computes $y_{i}^{k+1}=\sum_{j\in\mathcal{N}_{i}}p_{ij}(x_{i}^{k+1}-x_{j}^{k+1})$ and $q_{i}^{k+1}=q_{i}^{k}+\beta y_{i}^{k+1}$.
9: end for
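To make Algorithm 1 concrete, below is a minimal end-to-end sketch on a synthetic least-squares problem. The ring network, the per-sample losses $l_{i,j}(x)=\frac{1}{2}(a_{i,j}^{T}x-b_{i,j})^{2}$, the batch sizes, $\beta$, and the choice $D_{i}=\alpha I_{d}$ are all illustrative assumptions, not the settings used in Section V.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, C = 5, 3, 50            # agents, dimension, samples per agent (assumed)
G_size, S_size = 10, 5        # |G_i^k| and |S_i^k| (assumed)
beta = 1.0

neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}
p_ij = 1.0                    # uniform weights: P is the ring-graph Laplacian

# Local samples: agent i holds (a_ij, b_ij), with smooth convex per-sample loss.
A = rng.normal(size=(N, C, d))
b = rng.normal(size=(N, C))

def grad_batch(i, x, idx):    # g_i(x) over the batch idx, cf. (7)
    Ai, bi = A[i, idx], b[i, idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def hess_batch(i, idx):       # h_i(x) over the batch idx, cf. (8)
    Ai = A[i, idx]
    return Ai.T @ Ai / len(idx)

alpha = 10.0   # D_i = alpha*I, an illustrative guess, not computed from (15)

x = rng.normal(size=(N, d))
q = np.zeros((N, d))          # sum_i q_i^0 = 0 (line 2)
y = np.array([p_ij * sum(x[i] - x[j] for j in neighbors[i]) for i in range(N)])

for k in range(300):
    x_new = np.empty_like(x)
    for i in range(N):
        Gk = rng.choice(C, size=G_size, replace=False)   # G_i^k (line 6)
        Sk = rng.choice(C, size=S_size, replace=False)   # S_i^k (line 6)
        g = grad_batch(i, x[i], Gk)
        H = hess_batch(i, Sk) + alpha * np.eye(d)        # h_i(x_i^k) + D_i
        x_new[i] = x[i] - np.linalg.solve(H, g + beta * y[i] + q[i])  # line 7
    x = x_new
    y = np.array([p_ij * sum(x[i] - x[j] for j in neighbors[i]) for i in range(N)])
    q = q + beta * y                                     # line 8

# The iterates settle in a neighborhood of consensus, as Theorem 1 predicts.
print("consensus spread:", np.max(np.abs(x - x.mean(axis=0))))
```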

IV Convergence Analysis

This section provides the convergence analysis of St-SoPro.

We first impose an assumption on the expected deviation of each sample loss $l_{i,j}(x_{i},\xi_{i,j})$ from the corresponding local loss $f_{i}(x_{i})$ in terms of their gradients.

Assumption 2.

The random vectors $\xi_{i,j}$ $\forall i\in\mathcal{V}$ $\forall j=1,\ldots,\mathcal{C}_{i}$ are independent, and there is some $\sigma>0$ such that

$$\mathbf{E}_{\xi_{i,j}}\left[\|\nabla l_{i,j}(x_{i},\xi_{i,j})-\nabla f_{i}(x_{i})\|^{2}\right]\leq\sigma^{2},\quad\forall x_{i}\in\mathbb{R}^{d}.$$

To simplify the notation, below we let $\mathcal{C}_{i}=\mathcal{C}$ $\forall i\in\mathcal{V}$ and let the sets $\mathcal{G}_{i}^{k}$ $\forall i\in\mathcal{V}$ $\forall k\geq 0$ all have the same size $\mathcal{G}$. We also abbreviate $\mathbf{E}_{\mathcal{G}_{i}^{k}}[\cdot]$ and $\mathbf{E}_{\mathcal{S}_{i}^{k}}[\cdot]$ to $\mathbf{E}[\cdot]$. According to [19, Chapter 2],

$$\mathbf{E}\left[\|g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})\|^{2}\right]=\sum_{i\in\mathcal{V}}\mathbf{E}\left[\|g_{i}(x_{i}^{k})-\nabla f_{i}(x_{i}^{k})\|^{2}\right]\leq N\tau\sigma^{2},\quad\text{where }\tau\coloneqq\frac{\mathcal{C}-\mathcal{G}}{\mathcal{C}\mathcal{G}}. \tag{11}$$

This is consistent with the fact that when computing the stochastic gradient $g(\mathbf{x}^{k})$, reducing the number $\mathcal{G}$ of randomly selected samples enlarges the expected discrepancy between $g(\mathbf{x}^{k})$ and $\nabla f(\mathbf{x}^{k})$; in the full-batch case $\mathcal{G}=\mathcal{C}$, we have $\tau=0$ and the stochastic gradient is exact.
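For instance, the following snippet evaluates $\tau$ for the batch size used later in Table I (mushrooms: $\mathcal{C}=600$, $|\mathcal{G}_{i}^{k}|=80$) alongside a few illustrative alternatives of ours:

```python
# Variance factor tau = (C - G) / (C * G) from (11); C = 600 and G = 80 match
# the mushrooms row of Table I, the other batch sizes are illustrative.
C = 600
for G in (10, 80, 300, 600):
    tau = (C - G) / (C * G)
    print(f"G = {G:3d}  tau = {tau:.5f}")
# tau shrinks as G grows and vanishes at G = C (full batch, exact gradient).
```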

Next, for the sake of presenting the convergence result, we introduce the following notation and definitions. According to [14], any $\mathbf{v}\in\mathbb{R}^{Nd}$ satisfying $\nabla f(\mathbf{x}^{\star})=-W^{\frac{1}{2}}\mathbf{v}$ is a dual optimum of problem (3). Thus, we define

$$\mathbf{v}^{\star}=-(W^{\dagger})^{\frac{1}{2}}\nabla f(\mathbf{x}^{\star}) \tag{12}$$

as a particular dual optimum of (3). Also, throughout this section, we let $\mathbf{v}^{k}=(W^{\dagger})^{\frac{1}{2}}\mathbf{q}^{k}$, $\mathbf{z}^{k}=((\mathbf{x}^{k})^{T},(\mathbf{v}^{k})^{T})^{T}$, and $\mathbf{z}^{\star}=((\mathbf{x}^{\star})^{T},(\mathbf{v}^{\star})^{T})^{T}$. Since such $\mathbf{v}^{k}$ satisfies (5),

$$\mathbf{v}^{k},\ \mathbf{v}^{\star},\ \mathbf{v}^{k}-\mathbf{v}^{\star}\in\{\mathbf{x}\in\mathbb{R}^{Nd}\mid x_{1}+\cdots+x_{N}=\mathbf{0}_{d}\}. \tag{13}$$

In addition, we define $f_{\beta}(\mathbf{x})=f(\mathbf{x})+\frac{\beta}{4}\|\mathbf{x}\|_{W}^{2}$. Based on [14, Lemma 1], for any $\mathbf{x}\in\mathbb{R}^{Nd}$,

$$\left\langle\nabla f_{\beta}(\mathbf{x})-\nabla f_{\beta}(\mathbf{x}^{\star}),\mathbf{x}-\mathbf{x}^{\star}\right\rangle\geq\zeta(\gamma)\left\|\mathbf{x}-\mathbf{x}^{\star}\right\|^{2}, \tag{14}$$

where $\zeta:(0,\infty)\rightarrow\mathbb{R}$ is given by $\zeta(\gamma)=\min\{\frac{m_{\bar{f}}}{N}-2M\gamma,\ \frac{\beta\lambda_{W}}{2(1+1/\gamma^{2})}\}$, $m_{\bar{f}}$ is given in Assumption 1a), $M=\max_{i\in\mathcal{V}}M_{i}>0$, and $\lambda_{W}$ is the smallest nonzero eigenvalue of $W$. It can be shown that $\zeta(\gamma)>0$ if and only if $\gamma\in(0,m_{\bar{f}}/(2MN))$, and that its maximum is attained at the unique positive root of the cubic equation $4MN\gamma^{3}+(\beta N\lambda_{W}-2m_{\bar{f}})\gamma^{2}+4MN\gamma-2m_{\bar{f}}=0$. We denote the maximum value of $\zeta(\gamma)$ by $m_{\beta}$, which is the convexity parameter of $f_{\beta}$ (and indeed can be taken as any positive value of $\zeta(\gamma)$).
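Since $m_{\beta}$ is defined through the root of a cubic, it is straightforward to compute numerically; below is a small sketch in which the values of $N$, $M$, $m_{\bar{f}}$, $\beta$, and $\lambda_{W}$ are illustrative assumptions of ours.

```python
import numpy as np

# Sketch: gamma* is the unique positive root of
# 4MN g^3 + (beta*N*lam_W - 2*m_fbar) g^2 + 4MN g - 2*m_fbar = 0,
# and m_beta = zeta(gamma*); all constants below are illustrative assumptions.
N, M, m_fbar, beta, lam_W = 10, 4.0, 2.0, 1.0, 0.5

coeffs = [4 * M * N, beta * N * lam_W - 2 * m_fbar, 4 * M * N, -2 * m_fbar]
roots = np.roots(coeffs)
gamma = next(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)

m_beta = min(m_fbar / N - 2 * M * gamma,
             beta * lam_W / (2 * (1 + 1 / gamma**2)))
print(f"gamma* = {gamma:.6f}, m_beta = {m_beta:.6f}")
```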

Our convergence analysis relies on the following parameter condition: suppose there exists $\eta_{s}\in(0,1)$ such that

$$D\succ\frac{\Lambda_{M}}{2(1-\eta_{s})}+\frac{(\Lambda_{M}-\Lambda_{m})^{2}}{8\eta_{s}m_{\beta}}+\frac{\Lambda_{M}-3\Lambda_{m}}{2}+\beta\Big(\frac{I_{Nd}}{2}+W\Big). \tag{15}$$

Let $R=\frac{\Lambda_{m}+\Lambda_{M}}{2}+D$ and $Q=\operatorname{diag}(\beta R,I_{Nd})$, both of which are positive definite due to (15). Furthermore, it follows from (9) and (15) that the condition $h(\mathbf{x})+D\succ\mathbf{O}_{Nd}$ required in Section III holds.

We provide our main result in the theorem below.

Theorem 1.

Suppose Assumptions 1 and 2 hold. If (15) holds for some $\eta_{s}\in(0,1)$, then $\mathbf{z}^{k}$ converges linearly to a neighborhood of $\mathbf{z}^{\star}$ in expectation, i.e., there exist $\delta_{s}\in(0,1)$ and $\Gamma>0$ such that for each $k\geq 0$,

$$\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\leq(1-\delta_{s})\,\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]+\Gamma N\tau\sigma^{2}, \tag{16}$$

$$\limsup_{k\rightarrow\infty}\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\leq\frac{\Gamma N\tau\sigma^{2}}{\delta_{s}}. \tag{17}$$

In particular, given any $c_{1}>0$, $\Gamma=\frac{2(1+c_{1})\delta_{s}}{\lambda_{W}}+2$ and

$$\delta_{s}=\sup_{c_{2}>0}\min\left\{\frac{\beta\lambda_{W}\kappa_{c_{0},\eta_{s}}}{2(1+c_{1})\left\|\Lambda_{M}+D\right\|^{2}},\ \frac{1-\eta_{s}}{(1+1/c_{1})(1+c_{2})},\ \frac{2\eta_{s}m_{\beta}-c_{0}}{\lambda_{\max}\left(R+(1+1/c_{1})(1+1/c_{2})\Lambda_{M}^{2}/(\beta\lambda_{W})\right)}\right\}, \tag{18}$$

where $c_{0}\in(0,2\eta_{s}m_{\beta})$ is such that $\kappa_{c_{0},\eta_{s}}\coloneqq\lambda_{\min}\big(R-\frac{\Lambda_{M}}{2(1-\eta_{s})}-\frac{(\Lambda_{M}-\Lambda_{m})^{2}}{4c_{0}}+\Lambda_{m}-\Lambda_{M}-\beta(\frac{I_{Nd}}{2}+W)\big)>0$ (such a $c_{0}$ always exists).

Proof.

See Appendix -A. ∎

It can be shown that a larger $\min_{i\in\mathcal{V}}m_{i}$, a smaller $\max_{i\in\mathcal{V}}M_{i}$, or a larger $\lambda_{W}$ leads to faster convergence (i.e., a larger $\delta_{s}$) of St-SoPro. Such analysis follows the idea in [14] and is thus omitted here due to space limitations. In addition, it can be seen from (17) that the error bound on $\limsup_{k\rightarrow\infty}\mathbf{E}[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}]$ drops with the decrease of $\tau$. Hence, essentially, larger random sample sets for computing the stochastic gradients lead to a smaller optimality error.

In fact, the expected distance between $\mathbf{x}^{k}$ and $\mathbf{x}^{\star}$ can eventually reach an arbitrarily small value under proper parameter settings. To see this, for simplicity, let $m_{i}=m>0$ and $M_{i}=M\geq m$ for all $i\in\mathcal{V}$, and pick any $\eta_{s}\in(0,1)$, $c_{0}\in(0,2\eta_{s}m_{\beta})$, and $c_{1}>0$. We choose, for example, $D=\alpha I_{Nd}$ with $\alpha=(\frac{1}{2}+\lambda_{\max}(W))\beta+\mu$, $\beta>0$, and $\mu>\frac{M-3m}{2}+\frac{M}{2(1-\eta_{s})}+\frac{(M-m)^{2}}{4c_{0}}$, so that (15) holds. This also yields $\kappa_{c_{0},\eta_{s}}=\mu-\frac{M-3m}{2}-\frac{M}{2(1-\eta_{s})}-\frac{(M-m)^{2}}{4c_{0}}>0$. From (17), we obtain $\limsup_{k\rightarrow\infty}\mathbf{E}[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}]\leq\frac{\Gamma N\tau\sigma^{2}}{\delta_{s}\beta[(m+M)/2+\beta(1/2+\lambda_{\max}(W))+\mu]}$. It can then be shown that as $\beta\rightarrow\infty$, this upper bound on $\limsup_{k\rightarrow\infty}\mathbf{E}[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}]$ goes to zero. Since the bound is continuous in $\beta$, given any $\epsilon>0$, the above parameter setting with a sufficiently large $\beta$ guarantees $\limsup_{k\rightarrow\infty}\mathbf{E}[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}]<\epsilon$.
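A minimal sketch of this parameter recipe is given below; the values of $m$, $M$, $\eta_{s}$, $c_{0}$, $\beta$, and $\lambda_{\max}(W)$ are illustrative assumptions, and in particular $c_{0}<2\eta_{s}m_{\beta}$ must be verified separately since $m_{\beta}$ is problem-dependent.

```python
# Sketch of the parameter recipe above: D = alpha * I_{Nd} with
# alpha = (1/2 + lambda_max(W)) * beta + mu. All constants are assumptions;
# c0 < 2*eta_s*m_beta must be checked against the problem at hand.
m, M = 1.0, 4.0
eta_s, c0 = 0.5, 0.1
beta, lam_max_W = 50.0, 2.0

mu_min = (M - 3 * m) / 2 + M / (2 * (1 - eta_s)) + (M - m) ** 2 / (4 * c0)
mu = 1.01 * mu_min            # any mu > mu_min satisfies (15) here (mu_min > 0)
alpha = (0.5 + lam_max_W) * beta + mu
print(f"mu > {mu_min:.3f}, alpha = {alpha:.3f}")
```

Increasing $\beta$ (and hence $\alpha$) shrinks the ultimate error bound, at the price of a more conservative, heavily damped primal step.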

V Numerical Experiment

This section compares the practical convergence performance of St-SoPro with several state-of-the-art distributed stochastic optimization algorithms.

In the numerical experiment, we intend to learn linear classifiers by solving $\ell_{2}$-regularized logistic regression problems of the following form over a randomly generated, undirected, and connected network:

$$\min_{x\in\mathbb{R}^{d}}\ \sum_{i\in\mathcal{V}}\frac{1}{\mathcal{C}_{i}}\sum_{j=1}^{\mathcal{C}_{i}}\left(\frac{\lambda}{2}\|x\|^{2}+\ln\left(1+e^{-(a_{i,j}^{T}x)b_{i,j}}\right)\right), \tag{19}$$

where $\lambda>0$ is the regularization parameter and $\{a_{i,j},b_{i,j}\}$ are the data samples. Our experiment is conducted on two standard real datasets, a4a and mushrooms, from the LIBSVM library [20]. Table I lists the problem and network parameters corresponding to these two datasets, including the problem dimension $d$, the number $N$ of agents, the network's average degree $d_{a}=\sum_{i\in\mathcal{V}}|\mathcal{N}_{i}|/N$, the total number $\mathcal{C}_{i}$ of samples assigned to each agent $i$, the sizes $|\mathcal{G}_{i}^{k}|$ and $|\mathcal{S}_{i}^{k}|$ of the random sample sets that each agent $i$ chooses per iteration, and the regularization parameter $\lambda$.

TABLE I: Parameter values in the numerical experiment.

| Dataset | $d$ | $N$ | $d_{a}$ | $\mathcal{C}_{i}$ | $|\mathcal{G}_{i}^{k}|$ | $|\mathcal{S}_{i}^{k}|$ | $\lambda$ |
| a4a | 123 | 20 | 5 | 239 | 80 | 10 | $10^{-2}$ |
| mushrooms | 112 | 10 | 3 | 600 | 80 | 25 | $10^{-2}$ |
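For the per-sample losses in (19), $l_{i,j}(x)=\frac{\lambda}{2}\|x\|^{2}+\ln(1+e^{-(a_{i,j}^{T}x)b_{i,j}})$, the gradients and Hessians that enter the batch averages (7) and (8) have the closed forms sketched below, assuming labels $b_{i,j}\in\{-1,+1\}$ as in the LIBSVM convention:

```python
import numpy as np

lam = 1e-2  # regularization parameter lambda from Table I

def grad_sample(x, a, b):
    # gradient of l_ij: lam*x - b * sigmoid(-b * a^T x) * a
    s = 1.0 / (1.0 + np.exp(b * (a @ x)))
    return lam * x - b * s * a

def hess_sample(x, a, b):
    # Hessian of l_ij: lam*I + s*(1-s) * a a^T  (b^2 = 1 for labels in {-1,+1})
    s = 1.0 / (1.0 + np.exp(b * (a @ x)))
    return lam * np.eye(len(x)) + s * (1.0 - s) * np.outer(a, a)
```

Each agent then averages these quantities over $\mathcal{G}_{i}^{k}$ and $\mathcal{S}_{i}^{k}$ to form $g_{i}(x_{i}^{k})$ and $h_{i}(x_{i}^{k})$ in line 6 of Algorithm 1.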

The simulations include DSGD [8], EDAS [11], DSGT [12], and DPD-SGD-T [13], which are all first-order methods, for comparison with our proposed St-SoPro. We fine-tune all the algorithm parameters so that each algorithm reaches a given accuracy ($2\times 10^{-1}$ for a4a and $10^{-1}$ for mushrooms) within the fewest possible iterations.

Figures 1(a)-(c) and 2(a)-(c) plot the evolution of the optimality error $\frac{1}{N}\sum_{i\in\mathcal{V}}\|x_{i}^{k}-x^{\star}\|^{2}$ generated by the aforementioned algorithms over a4a and mushrooms, with respect to the number of iterations, the number of communicated bits (set to $32$ times the number of transmitted real scalars, following [21]), and the computation time. Observe that St-SoPro reaches the given accuracy faster than the other algorithms, validating its computational and communication efficiency. It is worth mentioning that although St-SoPro is a second-order method, its computational cost is comparable with that of the first-order methods when addressing such common machine learning problems. Figures 1(d) and 2(d) present the classification accuracy on the test sets upon completing each iteration of these algorithms, where St-SoPro outperforms the others in training effect.

Figure 1: Convergence performance of St-SoPro, DSGT, DSGD, DPD-SGD-T, and EDAS on dataset a4a.
Figure 2: Convergence performance of St-SoPro, DSGT, DSGD, DPD-SGD-T, and EDAS on dataset mushrooms.

VI Conclusion

We have developed St-SoPro, a distributed stochastic second-order proximal method, for addressing strongly convex and smooth optimization over undirected networks. Different from the existing first-order distributed stochastic algorithms, St-SoPro incorporates a second-order approximation of an augmented Lagrangian function and randomly samples each local gradient and Hessian. We show that St-SoPro linearly converges to a neighborhood of the optimal solution in expectation, and the neighborhood can be arbitrarily small. Simulations over two real datasets demonstrate that St-SoPro is both computationally and communication-wise efficient.

-A Proof of Theorem 1

The following lemma bounds the difference between $\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]$ and $\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]$.

Lemma 1.

For each $k\geq 0$,

$$\begin{aligned}&\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]-\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\\ &\geq\beta\left(2\eta_{s}m_{\beta}-c_{0}\right)\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\right]+\beta^{2}(1-\eta_{s})\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]-\beta\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\mathcal{A}_{c_{0},\eta_{s}}+\beta W-R}^{2}\right]-2N\tau\sigma^{2},\end{aligned} \tag{20}$$

where $c_{0}>0$ can be arbitrary and $\mathcal{A}_{c_{0},\eta_{s}}\coloneqq\frac{\Lambda_{M}}{2(1-\eta_{s})}+\frac{(\Lambda_{M}-\Lambda_{m})^{2}}{4c_{0}}+\Lambda_{M}-\Lambda_{m}+\frac{\beta I_{Nd}}{2}$.

Proof.

We first equivalently expand the left-hand side of (20). Similar to [14, Eq. (27)], we derive

$$\begin{aligned}&\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]-\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\\ &=\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{k+1}\|_{Q}^{2}\right]+2\beta\,\mathbf{E}\left[\left\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},R\left(\mathbf{x}^{k}-\mathbf{x}^{k+1}\right)\right\rangle\right]+2\,\mathbf{E}\left[\left\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\right\rangle\right].\end{aligned} \tag{21}$$

Then, using (5) and $W^{\frac{1}{2}}\mathbf{x}^{\star}=\mathbf{0}_{Nd}$, we obtain $\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle=-\beta\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},W^{\frac{1}{2}}(\mathbf{v}^{k+1}-\mathbf{v}^{\star})\rangle$. From (10) and (5), we have $W^{\frac{1}{2}}\mathbf{v}^{k+1}=W^{\frac{1}{2}}(\mathbf{v}^{k+1}-\mathbf{v}^{k})+W^{\frac{1}{2}}\mathbf{v}^{k}=(\beta W-\tilde{H}^{k})(\mathbf{x}^{k+1}-\mathbf{x}^{k})-g(\mathbf{x}^{k})$, where $\tilde{H}^{k}\coloneqq h(\mathbf{x}^{k})+D$. The above two equations, together with (12), give

$$\langle\mathbf{v}^{k}-\mathbf{v}^{k+1},\mathbf{v}^{k+1}-\mathbf{v}^{\star}\rangle=\beta\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle+\beta\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(\tilde{H}^{k}-\beta W)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle. \tag{22}$$

Moreover, based on (5), $W\mathbf{x}^{\star}=\mathbf{0}_{Nd}$, and [14, Eq. (26)],

$$\begin{aligned}-2\beta\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},\beta W(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle&=-\beta\|\mathbf{x}^{k+1}\|_{\beta W}^{2}+\beta\|\mathbf{x}^{k}\|_{\beta W}^{2}-\beta\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\beta W}^{2}\\ &=-\|\mathbf{v}^{k+1}-\mathbf{v}^{k}\|^{2}+\beta^{2}\|\mathbf{x}^{k}\|_{W}^{2}-\beta\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\beta W}^{2}.\end{aligned} \tag{23}$$

By incorporating (23) into (22) and then combining the resulting equation with (21), we have

$$\begin{aligned}&\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]-\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\\ &=2\beta\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]+\beta^{2}\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]+2\beta\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(\tilde{H}^{k}-R)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle\right]+\beta\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{k+1}\|_{R-\beta W}^{2}\right].\end{aligned} \tag{24}$$

Subsequently, we provide a lower bound for the first term on the right-hand side of (24). To do so, we utilize the AM-GM inequality and (11) to derive

$$\begin{aligned}&\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]\\ &=\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]+\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{k},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})\rangle\right]\\ &\geq-(1-\eta_{s})\mathbf{E}\left[\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\|_{\Lambda_{M}^{-1}}^{2}\right]-\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\frac{\Lambda_{M}}{4(1-\eta_{s})}}^{2}\right]-\frac{\beta}{4}\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|^{2}\right]-\frac{1}{\beta}\mathbf{E}\left[\|g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})\|^{2}\right]\\ &\geq-(1-\eta_{s})\mathbf{E}\left[\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\|_{\Lambda_{M}^{-1}}^{2}\right]-\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\frac{\Lambda_{M}}{4(1-\eta_{s})}+\frac{\beta I_{Nd}}{4}}^{2}\right]-\frac{1}{\beta}N\tau\sigma^{2}.\end{aligned} \tag{25}$$

Due to the Lipschitz continuity of each $\nabla f_{i}$ and the unbiasedness of $g(\mathbf{x}^{k})$, we have $\mathbf{E}[\langle\mathbf{x}^{k}-\mathbf{x}^{\star},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle]=\mathbf{E}[\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle]\geq\mathbf{E}[\|\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\|_{\Lambda_{M}^{-1}}^{2}]$. We multiply this inequality by $(1-\eta_{s})$ and then add it to (25), which leads to

$$\begin{aligned}&\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]-\eta_{s}\mathbf{E}\left[\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]\\ &\geq-\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\frac{\Lambda_{M}}{4(1-\eta_{s})}+\frac{\beta I_{Nd}}{4}}^{2}\right]-\frac{1}{\beta}N\tau\sigma^{2}.\end{aligned} \tag{26}$$

Because of the restricted strong convexity of $f_{\beta}(\mathbf{x})=f(\mathbf{x})+\frac{\beta}{4}\|\mathbf{x}\|_{W}^{2}$ shown in Section IV and $W\mathbf{x}^{\star}=\mathbf{0}_{Nd}$, we have $\mathbf{E}[\langle\mathbf{x}^{k}-\mathbf{x}^{\star},\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle]\geq m_{\beta}\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\right]-\frac{\beta}{2}\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]$. This, along with (26), results in

$$\begin{aligned}&\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},g(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\rangle\right]\\ &\geq-\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\frac{\Lambda_{M}}{4(1-\eta_{s})}+\frac{\beta I_{Nd}}{4}}^{2}\right]-\frac{1}{\beta}N\tau\sigma^{2}+\eta_{s}m_{\beta}\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\right]-\frac{\eta_{s}\beta}{2}\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right].\end{aligned} \tag{27}$$

Next, we bound the third term on the right-hand side of (24). Because $\tilde{H}^{k}-R=h(\mathbf{x}^{k})-\frac{\Lambda_{m}+\Lambda_{M}}{2}$ and because of (9), we have $\frac{\Lambda_{m}-\Lambda_{M}}{2}\preceq\tilde{H}^{k}-R\preceq\frac{\Lambda_{M}-\Lambda_{m}}{2}$. Let $c_{0}>0$. Then, similar to [14, Eq. (30)], we obtain

$$\begin{aligned}&\mathbf{E}\left[\langle\mathbf{x}^{k+1}-\mathbf{x}^{\star},(\tilde{H}^{k}-R)(\mathbf{x}^{k+1}-\mathbf{x}^{k})\rangle\right]\\ &\geq-\frac{c_{0}}{2}\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|^{2}\right]-\frac{1}{8c_{0}}\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(\Lambda_{M}-\Lambda_{m})^{2}}^{2}\right]-\frac{1}{2}\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\Lambda_{M}-\Lambda_{m}}^{2}\right].\end{aligned} \tag{28}$$

Combining (27) and (28) with (24) gives (20). ∎

In addition to Lemma 1, below we provide an upper bound on $\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]$. For any $c_{1},c_{2}>0$, through (11), (10), (12), (13), and the AM-GM inequality,

$$\begin{aligned}\mathbf{E}\left[\|\mathbf{v}^{k}-\mathbf{v}^{\star}\|^{2}\right]&=\mathbf{E}\left[\|(W^{\dagger})^{\frac{1}{2}}W^{\frac{1}{2}}(\mathbf{v}^{k}-\mathbf{v}^{\star})\|^{2}\right]\\ &=\mathbf{E}\left[\|(W^{\dagger})^{\frac{1}{2}}\big(\tilde{H}^{k}(\mathbf{x}^{k}-\mathbf{x}^{k+1})-\beta W\mathbf{x}^{k}-g(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{\star})\big)\|^{2}\right]\\ &\leq(1+c_{1})\mathbf{E}\left[\|(W^{\dagger})^{\frac{1}{2}}\big(\tilde{H}^{k}(\mathbf{x}^{k}-\mathbf{x}^{k+1})-g(\mathbf{x}^{k})+\nabla f(\mathbf{x}^{k})\big)\|^{2}\right]+\Big(1+\frac{1}{c_{1}}\Big)\mathbf{E}\left[\|(W^{\dagger})^{\frac{1}{2}}\big(\beta W\mathbf{x}^{k}+\nabla f(\mathbf{x}^{k})-\nabla f(\mathbf{x}^{\star})\big)\|^{2}\right]\\ &\leq\frac{2(1+c_{1})}{\lambda_{W}}\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(\tilde{H}^{k})^{2}}^{2}\right]+\frac{2(1+c_{1})}{\lambda_{W}}N\tau\sigma^{2}+\beta^{2}\Big(1+\frac{1}{c_{1}}\Big)(1+c_{2})\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]+\frac{(1+1/c_{1})(1+1/c_{2})}{\lambda_{W}}\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{\Lambda_{M}^{2}}^{2}\right],\end{aligned}$$

leading to

$$\begin{aligned}\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\leq{}&\frac{2(1+c_{1})}{\lambda_{W}}\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{(\Lambda_{M}+D)^{2}}^{2}\right]+\frac{2(1+c_{1})}{\lambda_{W}}N\tau\sigma^{2}\\ &+\beta^{2}\Big(1+\frac{1}{c_{1}}\Big)(1+c_{2})\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]+\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{\beta R+\frac{(1+1/c_{1})(1+1/c_{2})\Lambda_{M}^{2}}{\lambda_{W}}}^{2}\right].\end{aligned} \tag{29}$$

Pick an arbitrary $\delta_{s}\in(0,1)$. By subtracting (29) multiplied by $\delta_{s}$ from (20), we have

$$\begin{aligned}&(1-\delta_{s})\mathbf{E}\left[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}\right]-\mathbf{E}\left[\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|_{Q}^{2}\right]\\ &\geq\beta\mathbf{E}\left[\|\mathbf{x}^{k}-\mathbf{x}^{\star}\|_{\mathbf{\Omega}_{1}}^{2}\right]+\Omega_{2}\mathbf{E}\left[\|\mathbf{x}^{k}\|_{W}^{2}\right]-\mathbf{E}\left[\|\mathbf{x}^{k+1}-\mathbf{x}^{k}\|_{\mathbf{\Omega}_{3}}^{2}\right]-\Big(\frac{2(1+c_{1})\delta_{s}}{\lambda_{W}}+2\Big)N\tau\sigma^{2},\end{aligned} \tag{30}$$

where $\mathbf{\Omega}_{1}=(2\eta_{s}m_{\beta}-c_{0})I_{Nd}-\delta_{s}\big(R+\frac{(1+1/c_{1})(1+1/c_{2})\Lambda_{M}^{2}}{\beta\lambda_{W}}\big)$, $\Omega_{2}=\beta^{2}(1-\eta_{s})-\delta_{s}\beta^{2}(1+1/c_{1})(1+c_{2})$, and $\mathbf{\Omega}_{3}=\frac{2(1+c_{1})\delta_{s}}{\lambda_{W}}(\Lambda_{M}+D)^{2}+\beta(\mathcal{A}_{c_{0},\eta_{s}}+\beta W-R)$. To make (16) hold based on (30), it suffices to let $\mathbf{\Omega}_{1}\succeq\mathbf{O}_{Nd}$, $\Omega_{2}\geq 0$, and $\mathbf{\Omega}_{3}\preceq\mathbf{O}_{Nd}$, i.e.,

$$\delta_{s}\leq\frac{2\eta_{s}m_{\beta}-c_{0}}{\lambda_{\max}\left(R+(1+1/c_{1})(1+1/c_{2})\Lambda_{M}^{2}/(\beta\lambda_{W})\right)}, \tag{31}$$

$$\delta_{s}\leq\frac{1-\eta_{s}}{(1+1/c_{1})(1+c_{2})}, \tag{32}$$

$$\delta_{s}\leq\frac{\beta\lambda_{W}\kappa_{c_{0},\eta_{s}}}{2(1+c_{1})\left\|\Lambda_{M}+D\right\|^{2}}. \tag{33}$$

To guarantee the existence of $\delta_{s}\in(0,1)$ subject to (31)-(33), we need $c_{0}<2\eta_{s}m_{\beta}$ and $\kappa_{c_{0},\eta_{s}}>0$. Note from (15) that $\kappa_{c_{0},\eta_{s}}>0$ at $c_{0}=2\eta_{s}m_{\beta}$. Then, due to the continuity of $\kappa_{c_{0},\eta_{s}}$ with respect to $c_{0}$, there is $c_{0}\in(0,2\eta_{s}m_{\beta})$ such that $\kappa_{c_{0},\eta_{s}}>0$. Therefore, we ensure (16) with $\delta_{s}$ given by (18).

Finally, from (16), we have $\mathbf{E}[\|\mathbf{z}^{k}-\mathbf{z}^{\star}\|_{Q}^{2}]\leq(1-\delta_{s})^{k}\mathbf{E}[\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|_{Q}^{2}]+\sum_{t=0}^{k-1}(1-\delta_{s})^{t}\Gamma N\tau\sigma^{2}$ $\forall k\geq 0$. Since $\sum_{t=0}^{\infty}(1-\delta_{s})^{t}=1/\delta_{s}$, letting $k\rightarrow\infty$ shows that (17) holds.

References

  • [1] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010.   Springer, 2010, pp. 177–186.
  • [2] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics.   PMLR, 2017, pp. 1273–1282.
  • [4] S. U. Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations (ICLR), 2019.
  • [5] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [6] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
  • [7] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, pp. 2597–2633, 2017.
  • [8] S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Transactions on Automatic Control, vol. 67, no. 11, pp. 5900–5915, 2022.
  • [9] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in 29th International Conference on Machine Learning, 2012, pp. 1571–1578.
  • [10] A. Nemirovski, A. B. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, pp. 1574–1609, 2009.
  • [11] K. Huang and S. Pu, “Improving the transient times for distributed stochastic gradient methods,” IEEE Transactions on Automatic Control (Early Access), 2022.
  • [12] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, 2021.
  • [13] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 5, pp. 812–833, 2022.
  • [14] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, pp. 1864–1871, 2021.
  • [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [16] P. Giselsson, M. D. Doan, T. Keviczky, B. D. Schutter, and A. Rantzer, “Accelerated gradient methods and dual decomposition in distributed model predictive control,” Automatica, vol. 49, pp. 829–833, 2013.
  • [17] J. A. Bazerque and G. B. Giannakis, “Distributed spectrum sensing for cognitive radio networks by exploiting sparsity,” IEEE Transactions on Signal Processing, vol. 58, pp. 1847–1862, 2010.
  • [18] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 595–627, 2014.
  • [19] S. L. Lohr, Sampling: Design and Analysis.   Chapman and Hall/CRC, 2021.
  • [20] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
  • [21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017.