A Stochastic Second-Order Proximal Method
for Distributed Optimization
Abstract
In this paper, we propose a distributed stochastic second-order proximal method that enables agents in a network to cooperatively minimize the sum of their local loss functions without any centralized coordination. The proposed algorithm, referred to as St-SoPro, incorporates a decentralized second-order approximation into an augmented Lagrangian function, and then randomly samples the local gradients and Hessian matrices of the agents, so that it is computationally and memory-wise efficient, particularly for large-scale optimization problems. We show that for globally restricted strongly convex problems, the expected optimality error of St-SoPro asymptotically drops below an explicit error bound at a linear rate, and the error bound can be arbitrarily small with proper parameter settings. Simulations over real machine learning datasets demonstrate that St-SoPro outperforms several state-of-the-art distributed stochastic first-order methods in terms of convergence speed as well as computation and communication costs.
I Introduction
Stochastic optimization algorithms have been flourishing recently due to their appealing efficiency in machine learning [1, 2]. In the context of large-scale machine learning, parallel stochastic algorithms are often used to process large datasets [3, 4]. However, such methods require a central node to keep the variables of all nodes consistent, so the communication burden on the central node becomes a bottleneck that restricts algorithm performance.
On the other hand, a variety of distributed optimization algorithms have been proposed over the past decade to tackle network control and resource allocation problems, where agents in a network communicate only with their neighbors and do not rely on any central coordination, eliminating potential communication bottlenecks in the computing infrastructure [5]. Typical methods include distributed gradient descent (DGD) [5], the decentralized exact first-order algorithm (EXTRA) [6], and distributed gradient tracking algorithms [7].
Inheriting the merits of the above two algorithm types, distributed stochastic optimization algorithms have been attracting a lot of recent interest. For smooth, strongly convex optimization problems, [8] develops a distributed stochastic gradient descent (DSGD) method based on DGD, which is shown to achieve the optimal sublinear convergence rate (independent of the network) of a centralized stochastic gradient descent (SGD) method [9, 10]. The same convergence rate is attained by the exact diffusion method with adaptive step-sizes (EDAS) in [11]. In addition, the distributed stochastic gradient tracking (DSGT) method proposed in [12] is guaranteed to linearly converge to a neighborhood of the optimal solution in expectation. For non-convex problems, [13] designs a distributed primal-dual SGD algorithm with both fixed step-sizes (DPD-SGD-F) and adaptive step-sizes (DPD-SGD-T), where the former converges linearly to a neighborhood of the optimum under the Polyak-Łojasiewicz (PL) condition.
The aforementioned distributed stochastic algorithms are all constructed upon deterministic first-order methods and evolve using only stochastic gradients. Since second-order information often yields more accurate local approximations and accelerates convergence, we endeavor to develop a distributed stochastic second-order optimization algorithm.
To this end, we choose SoPro [14], a deterministic distributed second-order proximal algorithm, as the cornerstone. SoPro is developed by virtue of a decentralized second-order approximation of the augmented Lagrangian function in the classic method of multipliers [15], and its convergence performance outperforms that of various distributed first-order methods in the deterministic setting. In this paper, we adapt SoPro to the stochastic scenario. Specifically, instead of letting each agent compute the exact local gradient and Hessian determined by all its local data as SoPro does, we allow each agent to update using stochastic approximations of its local gradient and Hessian, obtained from two batches of samples drawn randomly and uniformly from its local loss function. Such a stochastic variant of SoPro can significantly enhance the computational and memory efficiency of the agents. We refer to this algorithm as the stochastic second-order proximal algorithm (St-SoPro).
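To make this sampling step concrete, the following minimal Python sketch (our illustration, not the paper's implementation; all function and variable names are ours) forms the two mini-batch estimates from independent batches drawn uniformly without replacement:

```python
import numpy as np

def sampled_grad_and_hessian(samples, grad_fn, hess_fn, x, s_g, s_h, rng):
    """Mini-batch estimates of one agent's local gradient and Hessian.

    samples: the agent's local data (length m_i); grad_fn/hess_fn return
    the gradient/Hessian of a single sample's loss at x; s_g, s_h are the
    two batch sizes.
    """
    m = len(samples)
    # Two independent batches, drawn uniformly without replacement, so the
    # averaged estimates are unbiased for the full local gradient/Hessian.
    batch_g = rng.choice(m, size=s_g, replace=False)
    batch_h = rng.choice(m, size=s_h, replace=False)
    g = sum(grad_fn(samples[j], x) for j in batch_g) / s_g
    H = sum(hess_fn(samples[j], x) for j in batch_h) / s_h
    return g, H
```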
Under the assumptions that the local loss functions are smooth and convex, and that their sum (i.e., the global objective function) is globally restricted strongly convex, we show that our proposed St-SoPro algorithm linearly converges to a neighborhood of the optimal solution in expectation over undirected networks. In particular, we provide an explicit upper bound on its ultimate suboptimality, and illustrate that this upper bound can be made arbitrarily small with properly set parameters. Finally, we validate the superior performance of St-SoPro over several recent distributed stochastic optimization methods on real machine-learning classification datasets, in terms of convergence speed, communication load, computational efficiency, and classification accuracy.
The paper is organized as follows. Section II formulates the optimization problem to be solved, and Section III describes the proposed St-SoPro algorithm. Section IV provides the convergence analysis, Section V presents the numerical results, and Section VI concludes the paper.
Notation: For any differentiable function $f$, its gradient at $x$ is denoted by $\nabla f(x)$, and if $f$ is twice differentiable, we use $\nabla^2 f(x)$ to denote its Hessian matrix at $x$. For any set $\mathcal{S}$, $|\mathcal{S}|$ represents the cardinality of $\mathcal{S}$. In addition, $\otimes$ is the Kronecker product, $\langle \cdot, \cdot \rangle$ is the Euclidean inner product, and $\|\cdot\|$ is the Euclidean norm. We use $\mathbf{0}_d$, $\mathbf{1}_d$, $O_d$, and $I_d$ to denote the $d$-dimensional all-zero vector, all-one vector, zero matrix, and identity matrix, respectively. Also, $\operatorname{diag}(A_1, \ldots, A_n)$ represents the block diagonal matrix whose diagonal blocks are sequentially $A_1, \ldots, A_n$. Given a symmetric matrix $P$, we write $P \succeq O$ if it is positive semidefinite and $P \succ O$ if it is positive definite. For any symmetric $P$, $\lambda_{\max}(P)$ and $\lambda_{\min}(P)$ are the largest and smallest real eigenvalues of $P$, respectively, and $P^{\dagger}$ is $P$'s pseudoinverse.
II Problem Formulation
Consider a set $\mathcal{V} = \{1, \ldots, N\}$ of agents, where the agents are connected through the link set $\mathcal{E}$. We model such a network as a connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, and denote the set of each agent $i$'s neighbors by $\mathcal{N}_i = \{j \in \mathcal{V} : \{i, j\} \in \mathcal{E}\}$. Suppose each agent $i \in \mathcal{V}$ observes a finite number $m_i$ of local samples that are independent random vectors, and attempts to solve the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \; f(x) := \sum_{i \in \mathcal{V}} f_i(x). \qquad (1)$$
In problem (1), $f_i : \mathbb{R}^d \to \mathbb{R}$ is the local loss function of agent $i$, which is the average of every sample's loss associated with agent $i$, i.e., $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} f_{i,j}(x)$, where $f_{i,j}$ denotes the loss of agent $i$'s $j$th sample.
Below, we impose the following assumptions on problem (1).
Assumption 1.
a) The global objective function $f = \sum_{i \in \mathcal{V}} f_i$ is globally restricted strongly convex with respect to the optimal solution of problem (1).
b) Each local loss function $f_i$, $i \in \mathcal{V}$, is convex and twice continuously differentiable, and each gradient $\nabla f_i$ is Lipschitz continuous.
Assumption 1 leads to the following inequalities: for any given $x \in \mathbb{R}^d$ and $i \in \mathcal{V}$, $O_d \preceq \nabla^2 f_i(x)$ and
$$\nabla^2 f_i(x) \preceq M_i I_d \qquad (2)$$
for some $M_i > 0$. Also, the globally restricted strong convexity in Assumption 1a) guarantees that the optimal solution $x^\star$ of problem (1) is unique.
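For reference, a standard way to state this restricted strong convexity property (our phrasing, with a generic constant $\beta$; the paper's exact statement was lost in extraction) is
$$f(x) \;\ge\; f(x^\star) + \nabla f(x^\star)^T (x - x^\star) + \frac{\beta}{2} \|x - x^\star\|^2 \qquad \forall x \in \mathbb{R}^d$$
for some $\beta > 0$; that is, strong convexity is required only relative to the optimum $x^\star$ rather than between arbitrary pairs of points, which is weaker than standard strong convexity.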
Problem (1) requires that the agents reach a consensus while minimizing all the sample losses throughout the network. Indeed, a wide range of real-world problems can be cast into the form of problem (1), such as distributed model predictive control [16], distributed spectrum sensing [17], and logistic regression [18]. Under many circumstances, these engineering problems involve huge datasets. Thus, we focus on solving problem (1) in a fully decentralized and stochastic fashion: each agent communicates only with its neighbors and computes using only a randomly chosen subset of its local samples.
III Stochastic Second-order Proximal Method
In this section, we develop a distributed stochastic algorithm for solving problem (1) over undirected networks.
To this end, we first provide a brief review of the distributed (deterministic) second-order proximal algorithm (SoPro) in [14]. Note that problem (1) is equivalent to
$$\min_{\mathbf{x} \in \mathbb{R}^{Nd}} \; F(\mathbf{x}) := \sum_{i \in \mathcal{V}} f_i(x_i) \quad \text{s.t.} \quad W^{1/2} \mathbf{x} = \mathbf{0}_{Nd}, \qquad (3)$$
where $\mathbf{x} = (x_1^T, \ldots, x_N^T)^T$, each $x_i \in \mathbb{R}^d$ is agent $i$'s local copy of the decision variable, and $W \in \mathbb{R}^{Nd \times Nd}$ is a symmetric positive semidefinite matrix associated with the graph $\mathcal{G}$, chosen so that the null space of $W$ is $\{\mathbf{1}_N \otimes y : y \in \mathbb{R}^d\}$. Also, the unique optimal solution of problem (3) is $\mathbf{x}^\star = \mathbf{1}_N \otimes x^\star$.
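As an illustration, one standard choice of $W$ consistent with this null-space requirement is the graph Laplacian lifted by a Kronecker product; a minimal sketch, assuming that choice (the helper name is ours):

```python
import numpy as np

def lifted_laplacian(n_agents, edges, d):
    """Graph Laplacian of G, lifted to the stacked space R^{Nd x Nd}.

    For a connected graph, the null space of L is span(1_N), so the null
    space of kron(L, I_d) is {1_N (x) y : y in R^d}; hence W^{1/2} x = 0
    exactly when x_1 = ... = x_N, matching the constraint in (3).
    """
    L = np.zeros((n_agents, n_agents))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.kron(L, np.eye(d))
```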
The application of the method of multipliers [15] to solve (3) gives the following: starting from any $(\mathbf{x}^0, q^0)$,
$$\mathbf{x}^{k+1} = \arg\min_{\mathbf{x} \in \mathbb{R}^{Nd}} L_\rho(\mathbf{x}, q^k), \qquad (4)$$
$$q^{k+1} = q^k + \rho W^{1/2} \mathbf{x}^{k+1}, \qquad (5)$$
where $\mathbf{x}$ and $q$ are the primal and dual variables, respectively, and $L_\rho$ is the augmented Lagrangian function given by $L_\rho(\mathbf{x}, q) = F(\mathbf{x}) + \langle q, W^{1/2} \mathbf{x} \rangle + \frac{\rho}{2} \mathbf{x}^T W \mathbf{x}$ with penalty parameter $\rho > 0$.
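A compact centralized sketch of (4)–(5), assuming the augmented Lagrangian form above (the inner argmin is handed to a generic solver, which is precisely the expensive step that SoPro's decentralized second-order model replaces):

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize

def method_of_multipliers(F, gradF, W, x0, rho=1.0, iters=50):
    """Illustrative method of multipliers for problem (3)."""
    W_half = np.real(sqrtm(W))          # W^{1/2}
    x, q = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        # (4): primal step, minimize L_rho(., q) over x.
        L = lambda z, q=q: F(z) + q @ (W_half @ z) + 0.5 * rho * z @ (W @ z)
        gL = lambda z, q=q: gradF(z) + W_half @ q + rho * (W @ z)
        x = minimize(L, x, jac=gL, method="L-BFGS-B").x
        # (5): dual ascent step along the constraint residual.
        q = q + rho * (W_half @ x)
    return x
```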
Since (4)–(5) cannot be implemented in a distributed way, the SoPro algorithm in [14] introduces a decentralized second-order proximal approximation of $L_\rho$ in (4) and applies a variable change to (5). Specifically, it replaces $F(\mathbf{x})$ with its second-order Taylor expansion at $\mathbf{x}^k$. Then, it replaces the remaining coupling term $\frac{\rho}{2} \mathbf{x}^T W \mathbf{x}$ in the primal update with a proximal term $\frac{1}{2} \|\mathbf{x} - \mathbf{x}^k\|_D^2$, where $D$ is a symmetric block diagonal matrix satisfying a positive definiteness condition that makes the approximation well-posed. Furthermore, we define $v^k := W^{1/2} q^k$ as a substitute for the dual variable $q^k$, and $v^k$ can be ensured to identically stay in the range of $W$ by letting $v^0 \in \operatorname{range}(W)$. To summarize, SoPro takes the following form: starting from $(\mathbf{x}^0, v^0)$ satisfying $v^0 \in \operatorname{range}(W)$,
(6)
where the matrix $D$ and the parameter $\rho$ satisfy the condition that makes the primal update in (6) well-posed.
The primal update of SoPro (6) requires that each agent uses up all its local samples. However, the agents may only be able to access or process a portion of their local samples at one time, especially in the big-data scenario. Motivated by this, we consider approximating the gradient and the Hessian in (6) via a stochastic gradient and a stochastic Hessian given by
$$g_i^k = \frac{1}{|\mathcal{S}_{g,i}^k|} \sum_{j \in \mathcal{S}_{g,i}^k} \nabla f_{i,j}(x_i^k), \qquad (7)$$
$$H_i^k = \frac{1}{|\mathcal{S}_{h,i}^k|} \sum_{j \in \mathcal{S}_{h,i}^k} \nabla^2 f_{i,j}(x_i^k). \qquad (8)$$
Here, for each agent $i \in \mathcal{V}$, $\mathcal{S}_{g,i}^k$ and $\mathcal{S}_{h,i}^k$ are two independent random sample sets uniformly chosen from $\{1, \ldots, m_i\}$ without replacement, so that $g_i^k$ and $H_i^k$ are unbiased, i.e., $\mathbb{E}[g_i^k \mid x_i^k] = \nabla f_i(x_i^k)$ and $\mathbb{E}[H_i^k \mid x_i^k] = \nabla^2 f_i(x_i^k)$ for all $k \ge 0$. Due to (2), $O_d \preceq H_i^k \preceq M_i I_d$, and
$$O_{Nd} \preceq H^k \preceq \Lambda, \qquad (9)$$
where $H^k = \operatorname{diag}(H_1^k, \ldots, H_N^k)$ and $\Lambda = \operatorname{diag}(M_1 I_d, \ldots, M_N I_d)$.
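The unbiasedness claim is easy to check numerically; below is a toy Monte Carlo verification for the gradient estimate (the same pattern applies to the Hessian), using a linear per-sample loss $f_{i,j}(x) = a_j^T x$ so that each sample's gradient is simply $a_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, s = 200, 5, 20
A = rng.normal(size=(m, d))                 # row j = gradient of sample j
full_grad = A.mean(axis=0)                  # exact local gradient
# Average many without-replacement batch means; by unbiasedness this
# should converge to the full gradient.
est = np.mean([A[rng.choice(m, size=s, replace=False)].mean(axis=0)
               for _ in range(20000)], axis=0)
assert np.allclose(est, full_grad, atol=1e-2)
```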
Using the above randomly sampled gradient and Hessian, we obtain the following stochastic variant of SoPro: starting from any $(\mathbf{x}^0, v^0)$ such that $v^0 \in \operatorname{range}(W)$,
(10)
where each diagonal block of $D$ is chosen so that the primal update in (10) is well-posed. The above initialization and updates compose our proposed stochastic second-order proximal (St-SoPro) method. The distributed implementation of St-SoPro over the undirected network $\mathcal{G}$ is described in Algorithm 1, in which auxiliary variables are introduced for clearer presentation.
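For intuition, the following sketch shows one iteration of an update with the structure described above (a simplified illustration of (10), not the paper's verbatim update; `grad_hat` and `hess_hat` stack the sampled estimates (7)–(8), and `W`, `D`, `rho` are as in Section III):

```python
import numpy as np

def st_sopro_step(x, v, grad_hat, hess_hat, W, D, rho):
    """One illustrative St-SoPro-style primal-dual iteration."""
    # Primal step: minimize the decentralized second-order model. Since
    # hess_hat + D is block diagonal, the linear solve decouples, so each
    # agent only inverts its own d-by-d block locally.
    rhs = grad_hat + v + rho * (W @ x)
    x_new = x - np.linalg.solve(hess_hat + D, rhs)
    # Dual step: v stays in range(W) whenever v^0 is initialized there.
    v_new = v + rho * (W @ x_new)
    return x_new, v_new
```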
IV Convergence Analysis
This section provides the convergence analysis of St-SoPro.
We first impose an assumption on the expected deviation of each sample's gradient from the corresponding full local gradient.
Assumption 2.
The random local samples are independent, and there exists some $\sigma > 0$ such that for each $i \in \mathcal{V}$ and every $x \in \mathbb{R}^d$,
$$\frac{1}{m_i} \sum_{j=1}^{m_i} \|\nabla f_{i,j}(x) - \nabla f_i(x)\|^2 \le \sigma^2.$$
To simplify the notation, below we let the two random sample sets $\mathcal{S}_{g,i}^k$ and $\mathcal{S}_{h,i}^k$ of each agent be of the same size $s$, and we let each $m_i = m$. According to [19, Chapter 2],
$$\mathbb{E}\big[\|g_i^k - \nabla f_i(x_i^k)\|^2\big] \le \frac{m - s}{s(m-1)} \, \sigma^2. \qquad (11)$$
This is consistent with the fact that, when computing the stochastic gradient $g_i^k$, reducing the number of randomly selected samples enlarges the expected discrepancy between $g_i^k$ and $\nabla f_i(x_i^k)$.
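The finite-population scaling in (11) can be illustrated empirically: the variance of a without-replacement batch mean matches $(1 - s/m)S^2/s$, where $S^2$ is the population variance with the $m-1$ denominator (a toy check, with scalar data standing in for per-sample gradients):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
pop = rng.normal(size=m)                    # stand-in for per-sample gradients
for s in (10, 50, 250):
    means = [pop[rng.choice(m, size=s, replace=False)].mean()
             for _ in range(5000)]
    empirical = np.var(means)
    predicted = pop.var(ddof=1) / s * (1.0 - s / m)
    print(f"s={s}: empirical {empirical:.5f} vs predicted {predicted:.5f}")
```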
Next, for the sake of presenting the convergence result, we introduce the following notation and definitions. According to [14], any $q$ satisfying $\nabla F(\mathbf{x}^\star) + W^{1/2} q = \mathbf{0}_{Nd}$ is a dual optimum of problem (3). Thus, we define
(12)
as a particular dual optimum $v^\star$ of (3). Also, throughout this section, we let $(\mathbf{x}^k, v^k)_{k \ge 0}$ denote the iterates generated by St-SoPro. Since such a $v^\star$ satisfies (5),
(13)
In addition, we define . Based on [14, Lemma 1], for any ,
(14)
where is given by , is given in Assumption 1a), , and is the smallest non-zero eigenvalue of . It can be shown that if and only if , and its maximum value is attained at the unique positive root of the cubic equation . We denote the maximum value of by , which is the convexity parameter of (and indeed can be taken as any positive ).
Our convergence analysis relies on the following parameter condition. Suppose there exists such that
(15)
Let and , which are guaranteed to be positive definite due to (15). Furthermore, it follows from (9) and (15) that the condition required in Section III holds.
We provide our main result in the theorem below.
Theorem 1.
Proof.
See Appendix A. ∎
It can be shown that certain parameter choices lead to faster convergence of St-SoPro; such analysis follows the idea in [14] and is omitted here due to the space limitation. In addition, it can be seen from (17) that the error bound drops as the stochastic-gradient variance bound in (11) decreases. Hence, essentially, larger random sample sets for computing the stochastic gradients lead to smaller optimality error.
In fact, the expected distance between $\mathbf{x}^k$ and $\mathbf{x}^\star$ can eventually reach an arbitrarily small value under proper parameter settings. To see this, for simplicity, let $m_i = m$ and let both random sample sets be of size $s$ for all $i \in \mathcal{V}$, and choose the remaining parameters so that (15) holds. From (17), it can be shown that as $s \to m$, the resulting upper bound on the ultimate optimality error goes to zero. Since this bound is continuous in $s$, given any $\epsilon > 0$, the above parameter setting with a sufficiently large $s$ guarantees that the expected optimality error ultimately drops below $\epsilon$.
V Numerical Experiment
This section compares the practical convergence performance of St-SoPro with several state-of-the-art distributed stochastic optimization algorithms.
In the numerical experiment, we intend to learn linear classifiers by solving the $\ell_2$-regularized logistic regression problem of the following form over a randomly generated, undirected, and connected network:
$$\min_{x \in \mathbb{R}^d} \; \sum_{i \in \mathcal{V}} \left( \frac{1}{m_i} \sum_{j=1}^{m_i} \ln\big(1 + \exp(-b_{ij} a_{ij}^T x)\big) + \frac{\lambda}{2} \|x\|^2 \right), \qquad (19)$$
where $\lambda > 0$ is the regularization parameter and $(a_{ij}, b_{ij}) \in \mathbb{R}^d \times \{-1, +1\}$, $j = 1, \ldots, m_i$, are the data samples of agent $i$. Our experiment is conducted on two standard real datasets, a4a and mushrooms, from the LIBSVM library [20]. Table I lists the problem and network parameters corresponding to these two datasets, including the problem dimension $d$, the number $N$ of agents, the network's average degree, the number $m_i$ of samples assigned to each agent, the sizes $s_g$ and $s_h$ of the random sample sets that each agent chooses per iteration, as well as the regularization parameter $\lambda$.
Table I: Problem and network parameters for the two datasets.
| Dataset | $d$ | $N$ | avg. degree | $m_i$ | $s_g$ | $s_h$ | $\lambda$ |
|---|---|---|---|---|---|---|---|
| a4a | 123 | 20 | 5 | 239 | 80 | 10 | |
| mushrooms | 112 | 10 | 3 | 600 | 80 | 25 | |
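For completeness, a per-sample gradient and Hessian for the loss in (19) can be sketched as follows (assuming the standard form $\ln(1 + e^{-b a^T x}) + \frac{\lambda}{2}\|x\|^2$ with labels $b \in \{-1, +1\}$; these are the quantities each agent averages over its sampled batches in (7)–(8)):

```python
import numpy as np

def logistic_grad_hess(a, b, x, lam):
    """Gradient and Hessian of one sample's l2-regularized logistic loss."""
    p = 1.0 / (1.0 + np.exp(b * (a @ x)))   # sigmoid(-b a^T x)
    g = -b * p * a + lam * x
    # b^2 = 1 and sigmoid'(z) = p(1-p), so the data term is p(1-p) a a^T.
    H = p * (1.0 - p) * np.outer(a, a) + lam * np.eye(a.size)
    return g, H
```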
The simulations include DSGD [8], EDAS [11], DSGT [12], and DPD-SGD-T [13], all first-order methods, for comparison with our proposed St-SoPro. We fine-tune all the algorithm parameters so that each algorithm reaches a given accuracy on each dataset within the fewest possible iterations.
Figures 1(a)–(c) and 2(a)–(c) plot the evolution of the optimality error generated by the aforementioned algorithms over a4a and mushrooms with respect to the number of iterations, the number of communicated bits (computed from the number of transmitted real scalars according to [21]), and the computation time. Observe that St-SoPro converges fastest to the given accuracy, validating its computational and communication efficiency. It is worth mentioning that although St-SoPro is a second-order method, its computational cost is comparable with that of the first-order methods on such common machine learning problems. Figures 1(d) and 2(d) present the classification accuracy on the test sets after each training iteration of these algorithms, where St-SoPro outperforms the other methods.
Figure 1: Results on a4a: (a) optimality error vs. iterations; (b) optimality error vs. communicated bits; (c) optimality error vs. computation time; (d) test classification accuracy.
Figure 2: Results on mushrooms: (a) optimality error vs. iterations; (b) optimality error vs. communicated bits; (c) optimality error vs. computation time; (d) test classification accuracy.
VI Conclusion
We have developed St-SoPro, a distributed stochastic second-order proximal method, for addressing strongly convex and smooth optimization over undirected networks. Different from the existing first-order distributed stochastic algorithms, St-SoPro incorporates a second-order approximation of an augmented Lagrangian function and randomly samples each local gradient and Hessian. We show that St-SoPro linearly converges to a neighborhood of the optimal solution in expectation, and the neighborhood can be arbitrarily small. Simulations over two real datasets demonstrate that St-SoPro is both computationally and communication-wise efficient.
A Proof of Theorem 1
The following lemma intends to bound the difference between and .
Lemma 1.
For each ,
(20)
where can be arbitrary and .
Proof.
We first equivalently expand the left-hand side of (20). Similar to [14, Eq. (27)], we derive
(21)
Then, using (5) and , we obtain . From (10) and (5), we have , where . The above two equations, together with (12), give
(22)
Moreover, based on (5), , and [14, Eq. (26)],
(23)
By incorporating (23) into (22) and then combining the resulting equation with (21), we have
(24)
Subsequently, we provide a lower bound for the first term on the right-hand side of (24). To do so, we utilize the AM-GM inequality and (11) to derive
(25)
Due to the Lipschitz continuity of each and the unbiasedness of , we have . We multiply this inequality by and then add it to (25), which leads to
(26)
Because of the restricted strong convexity of shown in Section IV and , we have . This, along with (26), results in
(27)
In addition to Lemma 1, below we provide an upper bound on . For any , through (11), (10), (12), (13), and the AM-GM inequality,
leading to
(29)
Pick an arbitrary . By subtracting (29) multiplied by from (20), we have
(30)
where , , and . To make (16) hold based on (30), it suffices to let , , , i.e.,
(31)
(32)
(33)
To guarantee the existence of subject to (31)–(33), we need and . Note from (15) that at . Then, due to the continuity of with respect to , there is such that . Therefore, we ensure (16) with given by (18).
References
- [1] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
- [2] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
- [3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
- [4] S. U. Stich, “Local SGD converges fast and communicates little,” in International Conference on Learning Representations (ICLR), 2019.
- [5] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [6] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- [7] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, pp. 2597–2633, 2017.
- [8] S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Transactions on Automatic Control, vol. 67, no. 11, pp. 5900–5915, 2022.
- [9] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in 29th International Conference on Machine Learning, 2012, pp. 1571–1578.
- [10] A. Nemirovski, A. B. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, pp. 1574–1609, 2009.
- [11] K. Huang and S. Pu, “Improving the transient times for distributed stochastic gradient methods,” IEEE Transactions on Automatic Control (Early Access), 2022.
- [12] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Mathematical Programming, vol. 187, no. 1, pp. 409–457, 2021.
- [13] X. Yi, S. Zhang, T. Yang, T. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 5, pp. 812–833, 2022.
- [14] X. Wu, Z. Qu, and J. Lu, “A second-order proximal algorithm for consensus optimization,” IEEE Transactions on Automatic Control, vol. 66, pp. 1864–1871, 2021.
- [15] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
- [16] P. Giselsson, M. D. Doan, T. Keviczky, B. D. Schutter, and A. Rantzer, “Accelerated gradient methods and dual decomposition in distributed model predictive control,” Automatica, vol. 49, pp. 829–833, 2013.
- [17] J. A. Bazerque and G. B. Giannakis, “Distributed spectrum sensing for cognitive radio networks by exploiting sparsity,” IEEE Transactions on Signal Processing, vol. 58, pp. 1847–1862, 2010.
- [18] F. Bach, “Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 595–627, 2014.
- [19] S. L. Lohr, Sampling: Design and Analysis. Chapman and Hall/CRC, 2021.
- [20] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
- [21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017.