
Decentralized Federated Averaging

Tao Sun, Dongsheng Li, and Bao Wang This work is sponsored in part by the National Key R&D Program of China under Grant (2018YFB0204300) and the National Natural Science Foundation of China under Grants (61932001 and 61906200). Tao Sun and Dongsheng Li are with the College of Computer, National University of Defense Technology, Changsha, 410073, Hunan, China. (e-mails: nudtsuntao@163.com, dsli@nudt.edu.cn) Bao Wang is with the Scientific Computing & Imaging Institute, University of Utah, USA. (e-mail: wangbaonj@gmail.com) Dongsheng Li and Bao Wang are the co-corresponding authors.
Abstract

Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training with an enormous number of clients. In FedAvg, clients keep their data locally for privacy protection; a central parameter server is used to communicate between clients. The central server distributes the parameters to each client and collects the updated parameters from clients. FedAvg is mostly studied in the centralized fashion, which requires massive server-client communication in each round. Moreover, attacking the central server can break the privacy of the whole system. In this paper, we study decentralized FedAvg with momentum (DFedAvgM), which is implemented on clients connected by an undirected graph. In DFedAvgM, all clients perform stochastic gradient descent with momentum and communicate with their neighbors only. To further reduce the communication cost, we also consider quantized DFedAvgM. We prove convergence of (quantized) DFedAvgM under mild assumptions; the convergence rate can be improved when the loss function satisfies the PŁ property. Finally, we numerically verify the efficacy of DFedAvgM.

Index Terms:
Decentralized Optimization, Federated Averaging, Momentum, Stochastic Gradient Descent

1 Introduction

Federated learning (FL) is a privacy-preserving distributed machine learning (ML) paradigm [1]. In FL, a central server connects with an enormous number of clients (e.g., mobile phones, tablets, etc.); the clients keep their data without sharing it with the server. In each communication round, clients receive the current global model from the server, and a small portion of clients is selected to update the global model by running stochastic gradient descent (SGD) [2] for multiple iterations on local data. The central server then aggregates the updated parameters to obtain the new global model. This learning algorithm is known as federated averaging (FedAvg) [1]. In particular, if the clients are homogeneous, FedAvg is equivalent to local SGD [3]. FedAvg performs multiple local SGD updates and one server aggregation in each communication round, which significantly reduces the communication cost between server and clients compared to conventional distributed training, where every single local SGD update is followed by one communication.

In FL applications, large companies and government organizations usually play the role of the central server. On the one hand, since the number of clients in FL is massive, the communication cost between the server and clients can be a bottleneck [4]. On the other hand, the updated models collected from clients encode private information of the local data; hackers can attack the central server to break the privacy of the whole system, which makes privacy a serious concern. To this end, decentralized federated learning has been proposed [5, 6], where all clients are connected by an undirected graph. Decentralized FL replaces the server-client communication of FL with client-client communication.

In this paper, we consider two issues in decentralized FL: 1) Although there is no expensive server-client communication in decentralized FL, the communication between clients is still costly when the ML model itself is large. Therefore, it is crucial to ask: can we reduce the communication cost between clients? 2) Momentum is a well-known acceleration technique for SGD [7]. It is natural to ask: can we use SGD with momentum to improve the training of ML models in decentralized FL with theoretical convergence guarantees?

1.1 Our Contributions

We answer the above questions affirmatively by proposing decentralized FedAvg with momentum (DFedAvgM). To further reduce the communication cost between clients, we also integrate quantization into DFedAvgM. Our contributions are threefold, as elaborated below.

  • Algorithmically, we extend FedAvg to the decentralized setting, where all clients are connected by an undirected graph, and we motivate DFedAvgM from the decentralized SGD (DSGD) algorithm. In particular, we use SGD with momentum to train ML models on each client. To reduce the communication cost between clients, we further introduce a quantized version of DFedAvgM, in which each client sends and receives a quantized model.

  • Theoretically, we prove convergence of (quantized) DFedAvgM. Our theoretical results show that the convergence rate of (quantized) DFedAvgM is not inferior to that of SGD or DSGD. More specifically, we show that the convergence rates of both DFedAvgM and quantized DFedAvgM depend on the local training and on the graph that connects all clients. Besides convergence results under general nonconvex assumptions, we also establish convergence guarantees under the Polyak-Łojasiewicz (PŁ) condition, which has been widely studied in nonconvex optimization; under the PŁ condition, we establish a faster convergence rate for (quantized) DFedAvgM. Furthermore, we present a sufficient condition under which quantization is guaranteed to reduce communication costs.

  • Empirically, we perform extensive numerical experiments on training deep neural networks (DNNs) on various datasets in both IID and Non-IID settings. Our results show the effectiveness of (quantized) DFedAvgM for training ML models, saving communication costs, and protecting training data’s membership privacy.

1.2 More Related Works

We briefly review three lines of work that are most related to this paper, i.e., federated learning, decentralized training, and decentralized federated learning.

Federated Learning. Many variants of FedAvg have been developed with theoretical guarantees. [8] uses the momentum method for local clients in FedAvg. [9] proposes adaptive FedAvg, whose central parameter server uses an adaptive learning rate to aggregate local models. Lazy and quantized gradients are used to reduce communication [10, 11]. [12] proposes a Newton-type scheme for FL. The convergence of FedAvg on heterogeneous data is discussed in [13, 14]. Advances and open problems in FL are surveyed in [15, 16].

Decentralized Training. Decentralized algorithms were originally developed to compute the mean of data distributed over multiple sensors [17, 18, 19, 20]. Decentralized (sub)gradient descent (DGD), one of the simplest and most efficient decentralized algorithms, has been studied in [21, 22, 23, 24, 25]. In DGD, the convexity assumption is unnecessary [26], which makes DGD useful for nonconvex optimization. Provably convergent DSGD schemes are proposed in [27, 28, 4]. [27] provides the complexity result of a stochastic decentralized algorithm. [28] designs a stochastic decentralized algorithm with dual information and provides a theoretical convergence guarantee. [4] proves that DSGD outperforms SGD in communication efficiency. Asynchronous DSGD is analyzed in [29]. DGD with momentum is proposed in [30, 31], and quantized DSGD in [32].

Decentralized Federated Learning. Decentralized FL is a learning paradigm of choice when the edge devices do not trust the central server in protecting their privacy [33]. The authors in [34] propose a novel FL framework without a central server for medical applications, and the new method offers a highly dynamic peer-to-peer environment. [6] considers training an ML model with a connected network whose nodes take a Bayesian-like approach by introducing a belief of the parameter space.

1.3 Organizations

We organize this paper as follows: in section 2, we present the mathematical formulation of our problem and the necessary assumptions. In section 3, we present DFedAvgM and its quantized variant. We present the convergence results for the proposed algorithms in section 4, with the key proofs collected in section 5. We provide extensive numerical verification of DFedAvgM in section 6. The paper ends with concluding remarks. Technical proofs and more experimental details are provided in the appendix.

1.4 Notation

We denote scalars and vectors by lower-case and lower-case boldface letters, respectively, and matrices by upper-case boldface letters. For a vector ${\bf x}=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$, we denote its $\ell_{p}$ norm ($p\geq 1$) by $\|{\bf x}\|_{p}=(\sum_{i=1}^{d}|x_{i}|^{p})^{1/p}$, its $\ell_{\infty}$ norm by $\|{\bf x}\|_{\infty}=\max_{1\leq i\leq d}|x_{i}|$, and its $\ell_{2}$ norm simply by $\|{\bf x}\|$. For a matrix ${\bf A}$, we denote its transpose by ${\bf A}^{\top}$. Given two sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}=\mathcal{O}(b_{n})$ if there exists a constant $0<C<+\infty$ such that $a_{n}\leq Cb_{n}$, and $a_{n}=\Theta(b_{n})$ if there exist positive constants $C_{1}$ and $C_{2}$ such that $a_{n}\leq C_{1}b_{n}$ and $b_{n}\leq C_{2}a_{n}$; $\widetilde{\mathcal{O}}(a_{n})$ hides logarithmic factors of $a_{n}$. For a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, we denote its gradient by $\nabla f({\bf x})$, its Hessian by $\nabla^{2}f({\bf x})$, and its minimum by $\min f$. We use $\mathbb{E}[\cdot]$ to denote the expectation with respect to the underlying probability space.

2 Problem Formulation and Assumptions

We consider the following optimization task

$\min_{{\bf x}\in\mathbb{R}^{d}}f({\bf x}):=\frac{1}{m}\sum_{i=1}^{m}f_{i}({\bf x}),\qquad f_{i}({\bf x})=\mathbb{E}_{\xi\sim\mathcal{D}_{i}}F_{i}({\bf x};\xi),$   (1)

where $\mathcal{D}_{i}$ denotes the data distribution on the $i$-th client and $F_{i}({\bf x};\xi)$ is the loss function associated with the training sample $\xi$. Problem (1), known as empirical risk minimization (ERM), models many applications in ML. We list several assumptions for the subsequent analysis.

Assumption 1

The function $f_{i}$ is differentiable and $\nabla f_{i}$ is $L$-Lipschitz continuous for every $i\in\{1,2,\ldots,m\}$, i.e., $\|\nabla f_{i}({\bf x})-\nabla f_{i}({\bf y})\|\leq L\|{\bf x}-{\bf y}\|$ for all ${\bf x},{\bf y}\in\mathbb{R}^{d}$.

The first-order Lipschitz assumption is standard in the ML community. For simplicity, we assume all functions share the same Lipschitz constant $L$; assuming non-uniform Lipschitz constants instead does not affect our convergence analysis.

Assumption 2

The gradient of each function $f_{i}$ has $\sigma_{l}$-bounded variance, i.e., $\mathbb{E}[\|\nabla F_{i}({\bf x};\xi)-\nabla f_{i}({\bf x})\|^{2}]\leq\sigma_{l}^{2}$ for all ${\bf x}\in\mathbb{R}^{d}$ and all $i\in\{1,2,\ldots,m\}$. This paper also assumes the (global) variance is bounded, i.e., $\frac{1}{m}\sum_{i=1}^{m}\|\nabla f_{i}({\bf x})-\nabla f({\bf x})\|^{2}\leq\sigma_{g}^{2}$ for all ${\bf x}\in\mathbb{R}^{d}$.

The uniform local variance assumption is likewise made for ease of presentation and is straightforward to generalize to the non-uniform case. The global variance assumption is used in [9, 35]. The constant $\sigma_{g}$ reflects the heterogeneity of the datasets $(\mathcal{D}_{i})_{1\leq i\leq m}$; when all $\mathcal{D}_{i}$ follow the same distribution, $\sigma_{g}=0$.

Assumption 3

[36, 4] For any $i\in\{1,2,\ldots,m\}$ and ${\bf x}\in\mathbb{R}^{d}$, we have $\|\nabla f_{i}({\bf x})\|\leq B$ for some $B>0$.

An important notion in decentralized optimization is the mixing matrix, which is usually associated with a connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with vertex set $\mathcal{V}=\{1,\ldots,m\}$ and edge set $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$. An edge $(i,l)\in\mathcal{E}$ represents a communication link between nodes $i$ and $l$. We recall the definition of the mixing matrix associated with the graph $\mathcal{G}$.

Definition 1 (Mixing matrix)

The mixing matrix ${\bf W}=[w_{i,j}]\in\mathbb{R}^{m\times m}$ is assumed to have the following properties: 1. (Graph) If $i\neq j$ and $(i,j)\notin\mathcal{E}$, then $w_{i,j}=0$; otherwise, $w_{i,j}>0$; 2. (Symmetry) ${\bf W}={\bf W}^{\top}$; 3. (Null space property) $\mathrm{null}\{{\bf I}-{\bf W}\}=\mathrm{span}\{{\bf 1}\}$; 4. (Spectral property) ${\bf I}\succeq{\bf W}\succ-{\bf I}$.

For a given graph, the mixing matrix is not unique; given the adjacency matrix of a graph, both the maximum-degree weights and the Metropolis-Hastings weights [37] yield valid mixing matrices. The symmetry of ${\bf W}$ implies that its eigenvalues are real and can be sorted in non-increasing order. Let $\lambda_{i}({\bf W})$ denote the $i$-th largest eigenvalue of ${\bf W}$; by the spectral property of the mixing matrix, $\lambda_{1}({\bf W})=1>\lambda_{2}({\bf W})\geq\cdots\geq\lambda_{m}({\bf W})>-1$. The mixing matrix also serves as the probability transition matrix of a Markov chain. An important constant of ${\bf W}$ is $\lambda=\lambda({\bf W}):=\max\{|\lambda_{2}({\bf W})|,|\lambda_{m}({\bf W})|\}$, which describes how fast the Markov chain induced by the mixing matrix converges to its stationary state.
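To make Definition 1 concrete, the Metropolis-Hastings construction mentioned above can be sketched in a few lines (an illustrative NumPy sketch, not code from the paper; the ring graph and all variable names are our own):

```python
import numpy as np

def metropolis_hastings_weights(adj):
    """Build a mixing matrix W from the 0/1 adjacency matrix of a
    connected undirected graph (no self-loops), using the
    Metropolis-Hastings rule w_ij = 1 / (1 + max(deg_i, deg_j))."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # make each row sum to one
    return W

# Ring graph over m = 4 clients: 0-1-2-3-0.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_hastings_weights(adj)
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]  # lambda_1 = 1 > ... > -1
lam = max(abs(eigs[1]), abs(eigs[-1]))       # the constant lambda(W)
```

One can check that the resulting $\bf W$ is symmetric, each row sums to one, and $\lambda({\bf W})<1$ (for this ring, $\lambda=1/3$).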

3 Decentralized Federated Averaging

3.1 Decentralized FedAvg with Momentum

We first briefly review previous work on decentralized training, which proceeds in the following fashion:

  1. (gradient computation) Client $i$ holds an approximate copy ${\bf x}(i)\in\mathbb{R}^{d}$ of the parameters and calculates an unbiased estimate ${\bf g}(i)$ of $\nabla f_{i}$ at ${\bf x}(i)$; the copies $({\bf x}(i))_{1\leq i\leq m}$ can be non-consensus;

  2. (communication) Client $i$ updates its local parameters ${\bf x}(i)$ as the weighted average of its neighbors': $\tilde{{\bf x}}(i)=\sum_{l\in\mathcal{N}(i)}w_{i,l}{\bf x}(l)$;

  3. (training) Client $i$ updates its parameters as ${\bf x}(i)\leftarrow\tilde{{\bf x}}(i)-\eta{\bf g}(i)$ with a learning rate $\eta>0$.
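The three steps above can be sketched in a few lines (a toy illustration, not the authors' implementation: exact gradients of simple quadratics $f_{i}({\bf x})=\frac{1}{2}\|{\bf x}-{\bf c}_{i}\|^{2}$ stand in for the stochastic estimates ${\bf g}(i)$, and a complete graph with uniform weights serves as the mixing matrix):

```python
import numpy as np

def dsgd_round(X, W, G, eta):
    """One DSGD round. Rows of X are the clients' copies x(i).
    Step 2 (communication): mix with neighbors, W @ X.
    Step 3 (training): local gradient step with the rows of G."""
    return W @ X - eta * G

# Toy problem: client i minimizes 0.5 * ||x - c_i||^2, so the global
# minimizer of (1/m) * sum_i f_i is the mean of the c_i.
rng = np.random.default_rng(0)
m, d = 4, 3
C = rng.standard_normal((m, d))
W = np.full((m, m), 1.0 / m)   # complete graph, uniform weights
X = np.zeros((m, d))
for _ in range(300):
    X = dsgd_round(X, W, X - C, eta=0.1)  # exact gradients for clarity
```

After enough rounds the average of the copies reaches the global minimizer, while the individual copies stay within an $\mathcal{O}(\eta)$ neighborhood of consensus.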

(a) Traditional Decentralization (DSGD)   (b) DFedAvgM

Figure 1: Comparison of the communication and training patterns of traditional decentralized stochastic gradient descent (DSGD) and the proposed decentralized federated averaging with momentum (DFedAvgM). In DSGD, each client communicates with its neighbors after every single training step. In DFedAvgM, each client communicates with its neighbors only after multiple training iterations.
Algorithm 1 DFedAvgM
1: Parameters: $\eta>0$, $K\in\mathbb{Z}^{+}$, $0\leq\theta<1$.
2: Initialization: ${\bf x}^{0}={\bf 0}$
3: for $t=1,2,\ldots$ do
4:     for $i\in\{1,2,\ldots,m\}$ do
5:         node $i$ performs local training (4) $K$ times and sends ${\bf z}^{t}(i)={\bf y}^{t,K}(i)$ to $\mathcal{N}(i)$
6:         node $i$ updates as (5)
7:     end for
8: end for

The traditional decentralized scheme is depicted in Figure 1 (a): a communication step follows each training iteration. This indicates that the vanilla decentralized algorithm differs from FedAvg, as the latter performs multiple local training steps before each communication. We therefore slightly modify the decentralized scheme. For simplicity, we start from DSGD to motivate our decentralized FedAvg algorithm. When the original DGD is applied to solve (1), we end up with the following iteration

${\bf x}^{t+1}(i)=\sum_{l\in\mathcal{N}(i)}w_{i,l}{\bf x}^{t}(l)-\gamma{\bf g}^{t}(i)=\sum_{l\in\mathcal{N}(i)}w_{i,l}[{\bf x}^{t}(l)-\gamma{\bf g}^{t}(i)],$   (2)

where we used the fact that $\sum_{l\in\mathcal{N}(i)}w_{i,l}=1$. If we replace ${\bf x}^{t}(l)$ by ${\bf x}^{t}(i)$ in (2), the algorithm iterates as

${\bf x}^{t+1}(i)=\sum_{l\in\mathcal{N}(i)}w_{i,l}[{\bf x}^{t}(i)-\gamma{\bf g}^{t}(i)].$   (3)

In (3), clients communicate with their neighbors after one training iteration, which makes the scheme amenable to the federated optimization setting: we replace the single SGD iteration in (3) with multiple iterations of SGD with heavy-ball momentum [38]. DFedAvgM then proceeds as follows. For each round $t\in\mathbb{Z}^{+}$ and each client $i\in\{1,2,\ldots,m\}$, let ${\bf y}^{t,-1}(i)={\bf y}^{t,0}(i)={\bf x}^{t}(i)$. The inner iteration on each node performs

${\bf y}^{t,k+1}(i)={\bf y}^{t,k}(i)-\eta\tilde{{\bf g}}^{t,k}(i)+\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)),$   (4)

where $\mathbb{E}\tilde{{\bf g}}^{t,k}(i)=\nabla f_{i}({\bf y}^{t,k}(i))$. After $K$ inner iterations on each local client, the resulting parameters ${\bf z}^{t}(i)\leftarrow{\bf y}^{t,K}(i)$ are sent to the neighbors $\mathcal{N}(i)$. Every client then updates its parameters by local averaging:

${\bf x}^{t+1}(i)=\sum_{l\in\mathcal{N}(i)}w_{i,l}{\bf z}^{t}(l).$   (5)

The procedure of DFedAvgM is illustrated in Figure 1 (b). DFedAvgM trades off local computation against communication. Since communication costs are usually much higher than computation costs [39], DFedAvgM can be more efficient than DSGD.
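One communication round of DFedAvgM, i.e., the local iteration (4) followed by the averaging (5), can be sketched as follows (a toy sketch under our own assumptions: deterministic gradients of quadratics $f_{i}({\bf x})=\frac{1}{2}\|{\bf x}-{\bf c}_{i}\|^{2}$ replace the stochastic gradients $\tilde{{\bf g}}^{t,k}(i)$, and a complete graph serves as the mixing matrix):

```python
import numpy as np

def local_training(x, grad, eta, theta, K):
    """K heavy-ball steps, Eq. (4), started from x^t(i) with the
    momentum buffer reset: y^{t,-1} = y^{t,0} = x^t(i)."""
    y_prev, y = x, x
    for _ in range(K):
        y_next = y - eta * grad(y) + theta * (y - y_prev)
        y_prev, y = y, y_next
    return y                      # z^t(i) = y^{t,K}(i)

def dfedavgm_round(X, W, grads, eta, theta, K):
    """One round of Algorithm 1: K local momentum steps per client (4),
    then neighborhood averaging (5)."""
    Z = np.stack([local_training(X[i], grads[i], eta, theta, K)
                  for i in range(len(grads))])
    return W @ Z

# Toy quadratics f_i(x) = 0.5 * ||x - c_i||^2.
rng = np.random.default_rng(1)
m, d = 4, 3
C = rng.standard_normal((m, d))
grads = [lambda y, c=C[i]: y - c for i in range(m)]
W = np.full((m, m), 1.0 / m)     # complete graph, uniform weights
X = np.zeros((m, d))
for _ in range(100):
    X = dfedavgm_round(X, W, grads, eta=0.05, theta=0.9, K=5)
```

On this toy problem all clients converge to the global minimizer, the mean of the ${\bf c}_{i}$; note that, as in the algorithm, the momentum buffer is reset at every communication round.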

Algorithm 2 Quantized DFedAvgM
1: Parameters: $\eta>0$, $K\in\mathbb{Z}^{+}$, $0\leq\theta<1$, $s$, $b$.
2: Initialization: ${\bf x}^{0}={\bf 0}$
3: for $t=1,2,\ldots$ do
4:     for $i\in\{1,2,\ldots,m\}$ do
5:         node $i$ performs local training (4) $K$ times and sends ${\bf q}^{t}(i)=Q[{\bf y}^{t,K}(i)-{\bf x}^{t}(i)]$ to $\mathcal{N}(i)$
6:         node $i$ updates as (7)
7:     end for
8: end for

3.2 Efficient Communication via Quantization

In DFedAvgM, client $i$ needs to send its local parameters to its neighbors $\mathcal{N}(i)$. Thus, as the number of neighbors grows, client-client communication becomes the major bottleneck on the algorithm's efficiency. We leverage quantization to reduce the communication cost [40, 41]. In particular, we consider the following quantization procedure. Given a constant $s>0$ and a bit budget $b\in\mathbb{Z}^{+}$, the representable values are $\{-2^{b-1}s,-(2^{b-1}-1)s,\ldots,0,s,2s,\ldots,(2^{b-1}-1)s\}$. For any $a\in\mathbb{R}$ with $-2^{b-1}s\leq a<(2^{b-1}-1)s$, we can find an integer $k\in\mathbb{Z}$ such that $ks\leq a<(k+1)s$, and we then replace $a$ by $ks$. This quantization scheme is deterministic and can be written as $q(a):=\lfloor\frac{a}{s}\rfloor s$ for $a\in\mathbb{R}$. Besides the deterministic rule, stochastic quantization uses the following scheme

$q(a):=\begin{cases}ks,&\text{w.p.}~1-\frac{a-ks}{s},\\ (k+1)s,&\text{w.p.}~\frac{a-ks}{s}.\end{cases}$

It is easy to see that stochastic quantization is unbiased, i.e., $\mathbb{E}[q(a)]=a$ for any $a\in\mathbb{R}$. When $s$ is small, the deterministic and stochastic quantization schemes perform very similarly. For a vector ${\bf x}=[x_{1},x_{2},\ldots,x_{d}]\in\mathbb{R}^{d}$ whose coordinates are all stored with 32 bits, we consider quantizing all coordinates of ${\bf x}$. The multi-dimensional quantization operator is then defined as

$Q({\bf x}):=[q(x_{1}),q(x_{2}),\ldots,q(x_{d})].$   (6)

For both the deterministic and stochastic quantization schemes, we have $\mathbb{E}\|Q({\bf x})-{\bf x}\|^{2}\leq\frac{d}{4}s^{2}$ if $x_{i}\in[-2^{b-1}s,(2^{b-1}-1)s]$ for all $i\in\{1,2,\ldots,d\}$. In this paper, we consider a quantization operator satisfying the following assumption, which holds for both schemes mentioned above.
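The two quantization rules can be sketched as follows (an illustrative sketch with our own names; for brevity the clipping to the $b$-bit representable range is omitted, so values are assumed to lie in range):

```python
import numpy as np

def q_deterministic(x, s):
    """Deterministic rule q(a) = floor(a / s) * s."""
    return np.floor(x / s) * s

def q_stochastic(x, s, rng):
    """Stochastic rule: round a down to ks or up to (k+1)s, with the
    round-up probability (a - ks)/s chosen so that E[q(a)] = a."""
    k = np.floor(x / s)
    up = rng.random(x.shape) < (x / s - k)
    return (k + up) * s

rng = np.random.default_rng(2)
x = np.array([0.37, -1.24, 2.08])
s = 0.1
# A single draw never errs by more than the grid spacing s, and
# averaging many draws recovers x (unbiasedness).
draws = np.stack([q_stochastic(x, s, rng) for _ in range(20000)])
```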

Assumption 4

The quantization operator $Q:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ satisfies $\mathbb{E}\|Q({\bf x})-{\bf x}\|^{2}\leq\frac{d}{4}s^{2}$ with $s>0$ for any ${\bf x}\in\mathbb{R}^{d}$.

Directly quantizing the parameters is feasible for sufficiently smooth loss functions, but may be infeasible for DNNs. We therefore quantize the difference of parameters instead. Quantized DFedAvgM can be summarized as follows: after running (4) $K$ times, client $i$ quantizes ${\bf q}^{t}(i)\leftarrow Q({\bf y}^{t,K}(i)-{\bf x}^{t}(i))$ and sends it to $\mathcal{N}(i)$. After receiving $[{\bf q}^{t}(j)]_{j\in\mathcal{N}(i)}$, every client updates its local parameters as

${\bf x}^{t+1}(i)={\bf x}^{t}(i)+\sum_{l\in\mathcal{N}(i)}w_{i,l}{\bf q}^{t}(l).$   (7)

In each communication, client $i$ only needs to send the pair $(s,{\bf q}^{t}(i))$ to $\mathcal{N}(i)$, whose representation requires $(32+db)\,\mathrm{deg}(\mathcal{N}(i))$ bits rather than the $32d\,\mathrm{deg}(\mathcal{N}(i))$ bits needed for the unquantized version. When $d$ is large and $b<32$, the communication is significantly reduced.
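The bit accounting above can be checked with a few lines (a small illustration with hypothetical sizes: $d=10^{6}$ parameters, four neighbors, $b=8$ bits):

```python
def bits_per_round(d, deg, b=None):
    """Bits client i must send to its deg neighbors in one round.
    Unquantized: d coordinates at 32 bits each.  Quantized: the scalar
    s (32 bits) plus d coordinates at b bits each."""
    per_neighbor = 32 * d if b is None else 32 + d * b
    return per_neighbor * deg

full  = bits_per_round(10**6, 4)       # 32d bits per neighbor
quant = bits_per_round(10**6, 4, b=8)  # (32 + db) bits per neighbor
ratio = full / quant                   # roughly 4x fewer bits with b = 8
```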

4 Convergence Analysis

In this section, we analyze the convergence of the proposed (quantized) DFedAvgM. The analysis is much more involved than that of SGD, SGD with momentum, or DSGD; the technical difficulty is that ${\bf z}^{t}(i)-{\bf x}^{t}(i)$ fails to be an unbiased estimate of the gradient $\nabla f_{i}({\bf x}^{t}(i))$ after multiple iterations of SGD or SGD with momentum on each client. In the following, we consider the convergence of the averaged point, defined as $\overline{{\bf x}^{t}}:=\sum_{i=1}^{m}{\bf x}^{t}(i)/m$. We first present the convergence of DFedAvgM for general nonconvex objective functions in the following theorem.

Theorem 1 (General nonconvexity)

Let the sequence $\{{\bf x}^{t}(i)\}_{t\geq 0}$ be generated by DFedAvgM for $i\in\{1,2,\ldots,m\}$, and suppose Assumptions 1, 2 and 3 hold. Moreover, assume the stepsize $\eta$ of the momentum SGD used to train the client models satisfies

$0<\eta\leq\tfrac{1}{8LK}\quad\mbox{and}\quad 64L^{2}K^{2}\eta^{2}+64LK\eta<1,$

where $L$ is the Lipschitz constant of $\nabla f$ and $K$ is the number of client iterations before each communication. Then

$\min_{1\leq t\leq T}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}\leq\frac{2f(\overline{{\bf x}^{1}})-2\min f}{\gamma(K,\eta)T}+\alpha(K,\eta)+\beta(K,\eta,\lambda),$

where $T$ is the total number of communication rounds and the constants are given as

$\gamma(K,\eta):=\frac{\eta(K-\theta)}{1-\theta}-\frac{64(1-\theta)L^{2}K^{4}\eta^{3}}{K-\theta}-64LK^{2}\eta^{2},$

$\alpha(K,\eta):=\frac{\big(\frac{(1-\theta)L^{2}K^{2}\eta^{3}}{K-\theta}+L\eta^{2}\big)\big(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}(\sigma_{l}^{2}+B^{2})}{(1-\theta)^{2}}\big)}{\gamma(K,\eta)},$

and

$\beta(K,\eta,\lambda):=\frac{\big(\frac{64(1-\theta)L^{4}K^{4}\eta^{5}}{K-\theta}+64L^{3}K^{2}\eta^{4}\big)\big(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+32K^{2}B^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})\big)}{(1-\lambda)\,\gamma(K,\eta)}.$

To get an explicit rate in $T$ from Theorem 1, we choose $\eta=\Theta(1/(LK\sqrt{T}))$; when $T$ is large enough, $64L^{2}K^{2}\eta^{2}+64LK\eta<1$ holds. Then $\gamma(K,\eta)=\Theta(1/((1-\theta)\sqrt{T}))$, $\alpha(K,\eta)=\Theta\big(\frac{(1-\theta)\sigma_{l}^{2}+(1-\theta)K\sigma_{g}^{2}+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}\big)$, and $\beta(K,\eta,\lambda)=\Theta\big(\frac{(1-\theta)(\sigma_{l}^{2}+K\sigma_{g}^{2}+KB^{2})+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{(1-\lambda)KT^{3/2}}\big)$. Based on this choice of $\eta$ and Theorem 1, we have the following convergence rate for DFedAvgM.

Proposition 1

When the number of communication rounds $T$ is large enough, it holds that

$\min_{1\leq t\leq T}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}=\mathcal{O}\Big(\frac{(1-\theta)(f(\overline{{\bf x}^{1}})-\min f)}{\sqrt{T}}+\frac{(1-\theta)\sigma_{l}^{2}+(1-\theta)K\sigma_{g}^{2}+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}+\frac{(1-\theta)(\sigma_{l}^{2}+K\sigma_{g}^{2}+KB^{2})+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{(1-\lambda)KT^{3/2}}\Big).$

From Proposition 1, we see that the rate of DFedAvgM improves as the number of local iterations $K$ increases. Moreover, when the momentum $\theta$ is $0$ and $K$ is large enough, the bound is dominated by $\mathcal{O}\big(\frac{1}{\sqrt{T}}+\frac{\sigma_{g}^{2}}{\sqrt{T}}+\frac{\sigma_{g}^{2}+B^{2}}{(1-\lambda)T^{3/2}}\big)$, in which the local variance term vanishes. This coincides with our intuitive understanding: with a large $K$, each local client can reach a local minimizer, so the local variance no longer hurts. To reach any given error $\epsilon>0$, DFedAvgM needs $\mathcal{O}(\frac{1}{\epsilon^{2}})$ communication rounds, the same as SGD and DSGD. It is worth mentioning that the above theoretical results show that whether the momentum $\theta$ accelerates the algorithm depends on the relation between $f(\overline{{\bf x}^{1}})-\min f+\sigma_{g}^{2}$ and $\sigma_{l}^{2}+B^{2}$: if $f(\overline{{\bf x}^{1}})-\min f+\sigma_{g}^{2}\gg\sigma_{l}^{2}+B^{2}$, the rate improves as $\theta\in[0,1)$ increases; if $f(\overline{{\bf x}^{1}})-\min f+\sigma_{g}^{2}\ll\sigma_{l}^{2}+B^{2}$, a large $\theta$ may degrade the performance of DFedAvgM.

The convergence results established above, which only require smoothness of the objectives, are quite general but somewhat loose, since no additional structural properties are exploited. For example, recent (non)convex studies [42, 43, 44] have exploited algorithmic performance under the PŁ property, named after Polyak and Łojasiewicz [45, 46]. For a smooth function $f$, we say it satisfies the PŁ-$\nu$ property provided

$\|\nabla f({\bf x})\|^{2}\geq 2\nu(f({\bf x})-\min f),\quad\forall{\bf x}\in\mathrm{dom}(f).$   (8)

The well-known strong convexity implies the PŁ condition, but not vice versa. In the following, we present the convergence of DFedAvgM under the PŁ condition.
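To see why strong convexity implies (8) (a standard argument, included here for completeness): if $f$ is $\mu$-strongly convex, then for all ${\bf x},{\bf y}\in\mathbb{R}^{d}$,

```latex
f({\bf y}) \;\geq\; f({\bf x}) + \langle\nabla f({\bf x}),\,{\bf y}-{\bf x}\rangle + \frac{\mu}{2}\|{\bf y}-{\bf x}\|^{2}.
% Minimizing both sides over y (the right-hand side is minimized at
% y = x - \nabla f(x)/\mu) yields
\min f \;\geq\; f({\bf x}) - \frac{1}{2\mu}\|\nabla f({\bf x})\|^{2}
\quad\Longrightarrow\quad
\|\nabla f({\bf x})\|^{2} \;\geq\; 2\mu\,\big(f({\bf x})-\min f\big),
```

which is (8) with $\nu=\mu$. The converse fails: a classic example is $f(x)=x^{2}+3\sin^{2}x$, which is nonconvex yet satisfies the PŁ condition.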

Theorem 2 (PŁ condition)

Assume the function $f$ satisfies the PŁ-$\nu$ condition. Then the following convergence rate holds:

$\mathbb{E}f(\overline{{\bf x}^{T}})-\min f\leq[1-\nu\gamma(K,\eta)]^{T}(f(\overline{{\bf x}^{0}})-\min f)+\frac{\alpha(K,\eta)}{2\nu}+\frac{\beta(K,\eta,\lambda)}{2\nu}.$

Since $f(\overline{{\bf x}^{0}})-\min f\geq 0$, the right-hand side is larger than $\frac{\alpha(K,\eta)}{2\nu}+\frac{\beta(K,\eta,\lambda)}{2\nu}=\mathcal{O}(\eta)$. If we still let $\eta=\Theta(\frac{1}{LK\sqrt{T}})$, the convergence rate is at least $\mathcal{O}(1/\sqrt{T})$. However, we cannot choose a very small $\eta$; otherwise, the dominating term $[1-\nu\gamma(K,\eta)]^{T}(f(\overline{{\bf x}^{0}})-\min f)$ decays very slowly. If the learning rate takes the form $\eta=c_{1}\ln^{c_{3}}T/(LKT^{c_{2}})$ with $c_{1},c_{2}>0$, a form commonly used in the ML community, we can prove the following results on the optimal choices of $c_{1},c_{2},c_{3}$.

Proposition 2

Let $\eta=c_{1}\ln^{c_{3}}T/(LKT^{c_{2}})$ with $c_{1},c_{2}>0$. The optimal rate of DFedAvgM is $\widetilde{\mathcal{O}}(1/T)$, attained with $c_{1}=L/\nu$, $c_{2}=1$, and $c_{3}=-1$, that is, $\eta=1/(\nu KT\ln T)$.

This finding coincides with existing results on the optimal rate of SGD under strong convexity [47, 48]. Under the PŁ condition, the convergence rate of DFedAvgM is thus improved.

Next, we provide the convergence guarantee for the quantized DFedAvgM, which is stated in the following theorem.

Theorem 3

Let the sequence $\{{\bf x}^{t}(i)\}_{t\geq 0}$ be generated by quantized DFedAvgM for all $i\in\{1,2,\ldots,m\}$, and let all the assumptions of Theorem 1 as well as Assumption 4 hold. Let $\eta=\Theta(\frac{1}{LK\sqrt{T}})$; when $T$ is sufficiently large, it holds that

$\min_{1\leq t\leq T}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}=\mathcal{O}\Big(\frac{(1-\theta)(f(\overline{{\bf x}^{1}})-\min f)}{\sqrt{T}}+\frac{(1-\theta)(\sigma_{l}^{2}+K\sigma_{g}^{2})+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}+\frac{(1-\theta)(\sigma_{l}^{2}+K\sigma_{g}^{2}+KB^{2})+\frac{\theta^{2}}{1-\theta}K(\sigma_{l}^{2}+B^{2})}{(1-\lambda)KT^{3/2}}+\sqrt{T}s\Big).$

If the function $f$ further satisfies the PŁ condition and $\eta=\frac{1}{\nu TK\ln T}$, it follows that

$\mathbb{E}(f(\overline{{\bf x}^{T}})-\min f)=\widetilde{\mathcal{O}}(\tfrac{1}{T}+Ts).$

According to Theorem 3, to reach any given error $\epsilon>0$ in the general case, we need to set $s=\mathcal{O}(\epsilon^{2})$ and the number of communication rounds $T=\Theta(\frac{1}{\epsilon^{2}})$. Under the PŁ condition, we instead set $T=\Theta(\frac{1}{\epsilon})$ and $s=\mathcal{O}(\epsilon^{2})$, which yields $\mathbb{E}(f(\overline{{\bf x}^{T}})-\min f)=\widetilde{\mathcal{O}}(\epsilon)$. Therefore, under the PŁ condition, the number of communication rounds is reduced.

In the following, we provide a sufficient condition under which the two quantization rules of Sec. 3.2, when used in quantized DFedAvgM, save communication.

Proposition 3

Assume we use the stochastic or deterministic quantization rule with bb bits using stepsize η=1LKT\eta=\frac{1}{LK\sqrt{T}}. Assume that the parameters trained in all clients do not overflow, that is, all coordinates are contained in [2b1s,(2b11)s][-2^{b-1}s,(2^{b-1}-1)s]. Let Assumptions 1, 2 and 3 hold. If the desired error

ϵ\displaystyle\epsilon >(1θ)3LBsd14×\displaystyle>(1-\theta)\sqrt{3LBs}d^{\frac{1}{4}}\times
2(f(𝐱0¯)minf)+8σl2K+32σg2+64θ2(σl2+B2)(1θ)2\displaystyle\sqrt{2(f(\overline{{\bf x}^{0}})-\min f)+\frac{8\sigma_{l}^{2}}{K}+32\sigma_{g}^{2}+\frac{64\theta^{2}(\sigma_{l}^{2}+B^{2})}{(1-\theta)^{2}}}

and b<128932db<\frac{128}{9}-\frac{32}{d}, the quantized DFedAvgM can beat DFedAvgM with 32 bits in terms of the communications required to reach ϵ\epsilon.

Proposition 3 indicates that the superiority of the quantized DFedAvgM is retained when the desired error ϵ\epsilon is not smaller than 𝒪((1θ)s)\mathcal{O}((1-\theta)\sqrt{s}). We can also see that as KK increases, the guaranteed lower bound on ϵ\epsilon decreases, which demonstrates the necessity of multiple local iterations. Moreover, a larger θ\theta also reduces this lower bound.
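For concreteness, the following Python sketch implements a deterministic bb-bit quantizer with grid width ss and the no-overflow range of Proposition 3 (the implementation details are our assumption, not necessarily the exact rules of Sec. 3.2), and checks that the squared quantization error of a dd-dimensional vector is at most ds2/4ds2ds^{2}/4\leq ds^{2}:

```python
import numpy as np

def quantize(x, s, b):
    """Deterministic quantizer: round each coordinate to the nearest
    multiple of s, then clip to the b-bit range [-2^(b-1) s, (2^(b-1)-1) s]."""
    q = np.clip(np.round(x / s), -2 ** (b - 1), 2 ** (b - 1) - 1)
    return q * s

rng = np.random.default_rng(0)
d, s, b = 1000, 0.01, 8
x = rng.uniform(-1.0, 1.0, size=d)  # stays inside the b-bit range: no overflow
err = np.sum((quantize(x, s, b) - x) ** 2)
assert err <= d * s ** 2 / 4  # per-coordinate error is at most s/2
```

In the no-overflow regime each coordinate is rounded by at most s/2s/2, which is exactly the per-round error scale that enters the analysis through ds2ds^{2}.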

5 Proofs

5.1 Technical Lemmas

We define 𝟏:=[1,1,,1]m{\bf 1}:=[1,1,\ldots,1]^{\top}\in\mathbb{R}^{m} and

𝐏:=11mm×m.{\bf P}:=\frac{\textbf{1}\textbf{1}^{\top}}{m}\in\mathbb{R}^{m\times m}.

For a matrix 𝐀{\bf A}, we denote its spectral norm as 𝐀op\|{\bf A}\|_{\textrm{op}}. We also define 𝐗:=[𝐱(1),𝐱(2),,𝐱(m)]m×d{\bf X}:=\begin{bmatrix}{\bf x}(1),{\bf x}(2),\ldots,{\bf x}(m)\end{bmatrix}^{\top}\in\mathbb{R}^{m\times d}.

Lemma 1

[Lemma 4, [4]] For any k+k\in\mathbb{Z}^{+}, the mixing matrix 𝐖m×m{\bf W}\in\mathbb{R}^{m\times m} satisfies

𝐖k𝐏opλk,\|{\bf W}^{k}-{\bf P}\|_{\emph{op}}\leq\lambda^{k},

where λ:=max{|λ2|,|λm(W)|}\lambda:=\max\{|\lambda_{2}|,|\lambda_{m}(W)|\}.

In [Proposition 1, [21]], the author also proved that 𝐖k𝐏opCλk\|{\bf W}^{k}-{\bf P}\|_{\textrm{op}}\leq C\lambda^{k} for some C>0C>0 that depends on the matrix.
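Lemma 1 is easy to verify numerically. The sketch below builds a symmetric, doubly stochastic mixing matrix for a ring of mm clients (an illustrative choice of 𝐖{\bf W}; any matrix satisfying the mixing assumption works) and checks 𝐖k𝐏opλk\|{\bf W}^{k}-{\bf P}\|_{\textrm{op}}\leq\lambda^{k}:

```python
import numpy as np

m = 10
W = np.zeros((m, m))  # ring topology: average self with the two neighbours
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = 0.25
    W[i, (i + 1) % m] = 0.25

P = np.ones((m, m)) / m
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]  # eigenvalues, descending
lam = max(abs(eigs[1]), abs(eigs[-1]))       # lambda = max{|l_2|, |l_m|}

for k in range(1, 20):
    gap = np.linalg.norm(np.linalg.matrix_power(W, k) - P, ord=2)
    assert gap <= lam ** k + 1e-10           # Lemma 1: ||W^k - P||_op <= lam^k
```

For a symmetric doubly stochastic 𝐖{\bf W}, 𝐖k𝐏{\bf W}^{k}-{\bf P} has eigenvalues λik\lambda_{i}^{k} for i2i\geq 2, so the operator norm decays geometrically at rate λ\lambda.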

Lemma 2

Assume that Assumptions 2 and 3 hold, and 0θ<10\leq\theta<1. Let (𝐲t,k(i))t0({\bf y}^{t,k}(i))_{t\geq 0} be generated by the (quantized) DFedAvgM. It then follows

𝔼𝐲t,k+1(i)𝐲t,k(i)21(1θ)2(2η2σl2+2η2B2)\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf y}^{t,k}(i)\|^{2}\leq\frac{1}{(1-\theta)^{2}}(2\eta^{2}\sigma_{l}^{2}+2\eta^{2}B^{2})

when 0kK10\leq k\leq K-1.

Lemma 3

Given the stepsize 0<η18LK0<\eta\leq\tfrac{1}{8LK}, let (𝐲t,k(i))t0({\bf y}^{t,k}(i))_{t\geq 0} and (𝐱t(i))t0({\bf x}^{t}(i))_{t\geq 0} be generated by the (quantized) DFedAvgM for all i{1,2,,m}i\in\{1,2,\ldots,m\}. If Assumption 3 holds, it then follows

1mi=1m𝔼𝐲t,k(i)𝐱t(i)2\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}
C1η2+32K2η2i=1m𝔼f(𝐱t(i))2m,\displaystyle\quad\leq C_{1}\eta^{2}+32K^{2}\eta^{2}\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m},

where C1:=8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2)C_{1}:=8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2}) when 0kK0\leq k\leq K.

With the fact that 𝐲t,K(i)=𝐳t(i){\bf y}^{t,K}(i)={\bf z}^{t}(i), Lemma 3 also holds for 𝐳t(i){\bf z}^{t}(i).

Lemma 4

Given the stepsize η>0\eta>0, let {𝐱t(i)}t0\{{\bf x}^{t}(i)\}_{t\geq 0} be generated by DFedAvgM for all i{1,2,,m}i\in\{1,2,\ldots,m\}. If Assumption 3 holds, we have the following bound

1mi=1m𝔼𝐱t(i)𝐱t¯2C2η21λ,\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|{\bf x}^{t}(i)-\overline{{\bf x}^{t}}\|^{2}\leq C_{2}\frac{\eta^{2}}{1-\lambda}, (9)

where C2:=8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2)+32K2B2C_{2}:=8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})+32K^{2}B^{2}.

Lemma 5

Given the stepsize η>0\eta>0, let {𝐱t(i)}t0\{{\bf x}^{t}(i)\}_{t\geq 0} be generated by the quantized DFedAvgM for all i{1,2,,m}i\in\{1,2,\ldots,m\}. If Assumption 3 holds, it follows that

1mi=1m𝔼𝐱t(i)𝐱t¯22C2η21λ+2ds21λ.\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\|{\bf x}^{t}(i)-\overline{{\bf x}^{t}}\|^{2}\leq 2C_{2}\frac{\eta^{2}}{1-\lambda}+\frac{2ds^{2}}{1-\lambda}. (10)

5.2 Proof of Technical Lemmas

5.2.1 Proof of Lemma 2

Given any ψ>0\psi>0, the Cauchy inequality gives us

𝔼𝐲t,k+1(i)𝐲t,k(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf y}^{t,k}(i)\|^{2}
=𝔼η𝐠~k(i)+θ(𝐲t,k(i)𝐲t,k1(i))2\displaystyle=\mathbb{E}\|-\eta\tilde{{\bf g}}^{k}(i)+\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i))\|^{2}
a)(1+ψ)θ2𝔼𝐲t,k(i)𝐲t,k1(i)2\displaystyle\overset{a)}{\leq}(1+\psi)\theta^{2}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)\|^{2}
+(1+1ψ)η2𝔼𝐠~k(i)fi(𝐲t,k(i))+fi(𝐲t,k(i))2\displaystyle\qquad+(1+\frac{1}{\psi})\eta^{2}\mathbb{E}\|\tilde{{\bf g}}^{k}(i)-\nabla f_{i}({\bf y}^{t,k}(i))+\nabla f_{i}({\bf y}^{t,k}(i))\|^{2}
(1+ψ)θ2𝔼𝐲t,k(i)𝐲t,k1(i)2\displaystyle\leq(1+\psi)\theta^{2}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)\|^{2}
+(2+2ψ)η2fi(𝐲t,k(i))2\displaystyle\qquad+(2+\frac{2}{\psi})\eta^{2}\|\nabla f_{i}({\bf y}^{t,k}(i))\|^{2}
+2(1+1ψ)η2𝔼𝐠~k(i)fi(𝐲t,k(i))2,\displaystyle\qquad+2(1+\frac{1}{\psi})\eta^{2}\mathbb{E}\|\tilde{{\bf g}}^{k}(i)-\nabla f_{i}({\bf y}^{t,k}(i))\|^{2},

where a)a) uses the Cauchy’s inequality 𝔼𝐚+𝐛2(1+1ψ)𝔼𝐚2+(1+ψ)𝔼𝐛2\mathbb{E}\|{\bf a}+{\bf b}\|^{2}\leq(1+\frac{1}{\psi})\mathbb{E}\|{\bf a}\|^{2}+(1+\psi)\mathbb{E}\|{\bf b}\|^{2} with 𝐚=η𝐠~k(i){\bf a}=-\eta\tilde{{\bf g}}^{k}(i) and 𝐛=θ(𝐲t,k(i)𝐲t,k1(i)){\bf b}=\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)). Without loss of generality, we assume θ0\theta\neq 0. Let ψ=1θ1\psi=\frac{1}{\theta}-1, we get

𝔼𝐲t,k+1(i)𝐲t,k(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf y}^{t,k}(i)\|^{2}
θ𝔼𝐲t,k(i)𝐲t,k1(i)2+2η2σl21θ+2η2B21θ.\displaystyle\qquad\leq\theta\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)\|^{2}+\frac{2\eta^{2}\sigma_{l}^{2}}{1-\theta}+\frac{2\eta^{2}B^{2}}{1-\theta}.

By mathematical induction, for any integer 0kK10\leq k\leq K-1, we have

𝔼𝐲t,k+1(i)𝐲t,k(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf y}^{t,k}(i)\|^{2}
2η2σl2+2η2B21θ(i=0k1θi)2η2σl2+2η2B2(1θ)2.\displaystyle\quad\leq\frac{2\eta^{2}\sigma_{l}^{2}+2\eta^{2}B^{2}}{1-\theta}(\sum_{i=0}^{k-1}\theta^{i})\leq\frac{2\eta^{2}\sigma_{l}^{2}+2\eta^{2}B^{2}}{(1-\theta)^{2}}.
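The geometric-series bound just derived can be illustrated by directly simulating the momentum recursion 𝐲t,k+1=𝐲t,kη𝐠~k+θ(𝐲t,k𝐲t,k1){\bf y}^{t,k+1}={\bf y}^{t,k}-\eta\tilde{{\bf g}}^{k}+\theta({\bf y}^{t,k}-{\bf y}^{t,k-1}). With synthetic gradients of norm at most GG (a hypothetical stand-in for the stochastic gradients bounded through σl2\sigma_{l}^{2} and B2B^{2}), every local step satisfies 𝐲t,k+1𝐲t,k2η2G2/(1θ)2\|{\bf y}^{t,k+1}-{\bf y}^{t,k}\|^{2}\leq\eta^{2}G^{2}/(1-\theta)^{2}:

```python
import numpy as np

eta, theta, K, d, G = 0.1, 0.9, 50, 5, 1.0
rng = np.random.default_rng(1)

# Hypothetical bounded "gradients": random directions scaled to norm G.
g = rng.normal(size=(K, d))
g = G * g / np.linalg.norm(g, axis=1, keepdims=True)

y_prev = y = np.zeros(d)  # y^{t,0} = y^{t,-1} = x^t
for k in range(K):
    y_next = y - eta * g[k] + theta * (y - y_prev)
    step = np.linalg.norm(y_next - y) ** 2
    # Geometric series: ||y^{k+1} - y^{k}|| <= eta*G*(1+theta+...) <= eta*G/(1-theta)
    assert step <= (eta * G / (1 - theta)) ** 2 + 1e-12
    y_prev, y = y, y_next
```

The successive step sizes form a geometrically weighted sum of past gradients, which is exactly why the factor 1/(1θ)21/(1-\theta)^{2} appears in Lemma 2.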

5.2.2 Proof of Lemma 3

Note that for any k{0,1,,K1}k\in\{0,1,\ldots,K-1\}, in node ii it holds

𝔼𝐲t,k+1(i)𝐱t(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf x}^{t}(i)\|^{2}
=𝔼𝐲t,k(i)η𝐠~k(i)𝐱t(i)+θ(𝐲t,k(i)𝐲t,k1(i))2\displaystyle=\mathbb{E}\|{\bf y}^{t,k}(i)-\eta\tilde{{\bf g}}^{k}(i)-{\bf x}^{t}(i)+\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i))\|^{2}
𝔼𝐲t,k(i)𝐱t(i)η(𝐠~k(i)fi(𝐲t,k(i))+fi(𝐲t,k(i))\displaystyle\leq\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)-\eta\Big{(}\tilde{{\bf g}}^{k}(i)-\nabla f_{i}({\bf y}^{t,k}(i))+\nabla f_{i}({\bf y}^{t,k}(i))
fi(𝐱t(i))+fi(𝐱t(i))f(𝐱t(i))+f(𝐱t(i)))\displaystyle\quad-\nabla f_{i}({\bf x}^{t}(i))+\nabla f_{i}({\bf x}^{t}(i))-\nabla f({\bf x}^{t}(i))+\nabla f({\bf x}^{t}(i))\Big{)}
+θ(𝐲t,k(i)𝐲t,k1(i))2I+II,\displaystyle\quad+\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i))\|^{2}\leq\textrm{I}+\textrm{II},

where I=(1+12K1)𝔼𝐲t,k(i)𝐱t(i)η(𝐠~k(i)fi(𝐲t,k(i)))2\textrm{I}=(1+\frac{1}{2K-1})\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)-\eta(\tilde{{\bf g}}^{k}(i)-\nabla f_{i}({\bf y}^{t,k}(i)))\|^{2} and II=2K𝔼η(fi(𝐲t,k(i))fi(𝐱t(i))+fi(𝐱t(i))f(𝐱t(i))+f(𝐱t(i)))+θ(𝐲t,k(i)𝐲t,k1(i))2\textrm{II}=2K\mathbb{E}\|\eta(\nabla f_{i}({\bf y}^{t,k}(i))-\nabla f_{i}({\bf x}^{t}(i))+\nabla f_{i}({\bf x}^{t}(i))-\nabla f({\bf x}^{t}(i))+\nabla f({\bf x}^{t}(i)))+\theta({\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i))\|^{2}. The unbiased expectation property of 𝐠~k(i)\tilde{{\bf g}}^{k}(i) gives us

I =(1+12K1)(𝔼𝐲t,k(i)𝐱t(i)2\displaystyle=(1+\frac{1}{2K-1})\Big{(}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}
+η2𝔼𝐠~k(i)fi(𝐲t,k(i))2).\displaystyle+\eta^{2}\mathbb{E}\|\tilde{{\bf g}}^{k}(i)-\nabla f_{i}({\bf y}^{t,k}(i))\|^{2}\Big{)}.

On the other hand, with Lemma 2, we have the following bound

II8Kη2𝔼fi(𝐲t,k(i))fi(𝐱t(i))2\displaystyle\textrm{II}\leq 8K\eta^{2}\mathbb{E}\|\nabla f_{i}({\bf y}^{t,k}(i))-\nabla f_{i}({\bf x}^{t}(i))\|^{2}
+8Kη2𝔼fi(𝐱t(i))f(𝐱t(i))2\displaystyle+8K\eta^{2}\mathbb{E}\|\nabla f_{i}({\bf x}^{t}(i))-\nabla f({\bf x}^{t}(i))\|^{2}
+8Kη2𝔼f(𝐱t(i))2+8Kθ2𝔼𝐲t,k(i)𝐲t,k1(i)2\displaystyle+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}+8K\theta^{2}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf y}^{t,k-1}(i)\|^{2}
8L2Kη2𝔼𝐲t,k(i)𝐱t(i)2+8Kη2σg2\displaystyle\leq 8L^{2}K\eta^{2}\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}+8K\eta^{2}\sigma_{g}^{2}
+8Kη2𝔼f(𝐱t(i))2+16Kθ2(1θ)2(η2σl2+η2B2).\displaystyle+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}+\frac{16K\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2}).

Thus, we can obtain

𝔼𝐲t,k+1(i)𝐱t(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k+1}(i)-{\bf x}^{t}(i)\|^{2}
(1+12K1+8L2Kη2)𝔼𝐲t,k(i)𝐱t(i)2+2η2σl2\displaystyle\leq(1+\frac{1}{2K-1}+8L^{2}K\eta^{2})\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}+2\eta^{2}\sigma_{l}^{2}
+8Kη2σg2+8Kη2𝔼f(𝐱t(i))2+16Kθ2(1θ)2(η2σl2+η2B2)\displaystyle+8K\eta^{2}\sigma_{g}^{2}+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}+\frac{16K\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2})
(1+1K1)𝔼𝐲t,k(i)𝐱t(i)2+2η2σl2+8Kη2σg2\displaystyle\leq(1+\frac{1}{K-1})\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}+2\eta^{2}\sigma_{l}^{2}+8K\eta^{2}\sigma_{g}^{2}
+16Kθ2(1θ)2(η2σl2+η2B2)+8Kη2𝔼f(𝐱t(i))2,\displaystyle+\frac{16K\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2})+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2},

where the last inequality depends on the selection of the stepsize. Applying the recursion from j=0j=0 to kk yields

𝔼𝐲t,k(i)𝐱t(i)2\displaystyle\mathbb{E}\|{\bf y}^{t,k}(i)-{\bf x}^{t}(i)\|^{2}
j=0K1(1+1K1)j[2η2σl2+8Kη2σg2\displaystyle\leq\sum_{j=0}^{K-1}(1+\frac{1}{K-1})^{j}\Big{[}2\eta^{2}\sigma_{l}^{2}+8K\eta^{2}\sigma_{g}^{2}
+16Kθ2(1θ)2(η2σl2+η2B2)+8Kη2𝔼f(𝐱t(i))2]\displaystyle\quad+\frac{16K\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2})+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}\Big{]}
(K1)[(1+1K1)K1]×[2η2σl2+8Kη2σg2\displaystyle\leq(K-1)\Big{[}(1+\frac{1}{K-1})^{K}-1\Big{]}\times\Big{[}2\eta^{2}\sigma_{l}^{2}+8K\eta^{2}\sigma_{g}^{2}
+16Kθ2(1θ)2(η2σl2+η2B2)+8Kη2𝔼f(𝐱t(i))2]\displaystyle\quad+\frac{16K\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2})+8K\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}\Big{]}
8Kη2σl2+32K2η2σg2+64K2θ2(1θ)2(η2σl2+η2B2)\displaystyle\leq 8K\eta^{2}\sigma_{l}^{2}+32K^{2}\eta^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\eta^{2}\sigma_{l}^{2}+\eta^{2}B^{2})
+32K2η2𝔼f(𝐱t(i))2,\displaystyle\quad+32K^{2}\eta^{2}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2},

where we used the inequality (1+1K1)K5(1+\frac{1}{K-1})^{K}\leq 5, which holds for any integer K2K\geq 2.
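This elementary inequality can be checked directly: (1+1/(K1))K(1+1/(K-1))^{K} decreases from 44 at K=2K=2 toward ee, so the constant 55 is safe (the case K=1K=1 is trivial since the prefactor K1K-1 vanishes):

```python
# Verify (1 + 1/(K-1))**K <= 5 for the range of step counts used in Lemma 3.
vals = [(1 + 1 / (K - 1)) ** K for K in range(2, 1000)]
assert max(vals) <= 5
assert vals[0] == 4.0  # K = 2 attains the maximum: 2**2 = 4
```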

5.2.3 Proof of Lemma 4

We denote 𝐙t:=[𝐳t(1),𝐳t(2),,𝐳t(m)]m×d{\bf Z}^{t}:=\begin{bmatrix}{\bf z}^{t}(1),{\bf z}^{t}(2),\ldots,{\bf z}^{t}(m)\end{bmatrix}^{\top}\in\mathbb{R}^{m\times d}. With this notation, we have

𝐗t+1=𝐖𝐙t=𝐖𝐗tζt,\displaystyle{\bf X}^{t+1}={\bf W}{\bf Z}^{t}={\bf W}{\bf X}^{t}-{\bf\zeta}^{t}, (11)

where ζt:=𝐖𝐗t𝐖𝐙t{\bf\zeta}^{t}:={\bf W}{\bf X}^{t}-{\bf W}{\bf Z}^{t}. The iteration (11) can be rewritten as the following expression

𝐗t=𝐖t𝐗0j=0t1𝐖t1jζj.\displaystyle{\bf X}^{t}={\bf W}^{t}{\bf X}^{0}-\sum_{j=0}^{t-1}{\bf W}^{t-1-j}{\bf\zeta}^{j}. (12)

Obviously, it follows

𝐖𝐏=𝐏𝐖=𝐏.{\bf W}{\bf P}={\bf P}{\bf W}={\bf P}. (13)

According to Lemma 1, it holds

𝐖t𝐏opλt.\|{\bf W}^{t}-{\bf P}\|_{\textrm{op}}\leq\lambda^{t}.

Multiplying both sides of (12) with 𝐏{\bf P} and using (13), we then get

𝐏𝐗t=𝐏𝐗0j=0t1𝐏ζj=j=0t1𝐏ζj,\displaystyle{\bf P}{\bf X}^{t}={\bf P}{\bf X}^{0}-\sum_{j=0}^{t-1}{\bf P}{\bf\zeta}^{j}=-\sum_{j=0}^{t-1}{\bf P}{\bf\zeta}^{j}, (14)

where we used initialization 𝐗0=0{\bf X}^{0}=\textbf{0}. Then, we are led to

𝐗t𝐏𝐗t=j=0t1(𝐏𝐖t1j)ζj\displaystyle\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|=\|\sum_{j=0}^{t-1}({\bf P}-{\bf W}^{t-1-j}){\bf\zeta}^{j}\| (15)
j=0t1𝐏𝐖t1jopζjj=0t1λt1jζj.\displaystyle\leq\sum_{j=0}^{t-1}\|{\bf P}-{\bf W}^{t-1-j}\|_{\textrm{op}}\|{\bf\zeta}^{j}\|\leq\sum_{j=0}^{t-1}\lambda^{t-1-j}\|{\bf\zeta}^{j}\|.

With Cauchy inequality,

𝔼𝐗t𝐏𝐗t2𝔼(j=0t1λt1j2λt1j2ζj)2\displaystyle\mathbb{E}\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|^{2}\leq\mathbb{E}(\sum_{j=0}^{t-1}\lambda^{\frac{t-1-j}{2}}\cdot\lambda^{\frac{t-1-j}{2}}\|{\bf\zeta}^{j}\|)^{2}
(j=0t1λt1j)(j=0t1λt1j𝔼ζj2)\displaystyle\leq(\sum_{j=0}^{t-1}\lambda^{t-1-j})(\sum_{j=0}^{t-1}\lambda^{t-1-j}\mathbb{E}\|{\bf\zeta}^{j}\|^{2})

Direct calculation gives us

𝔼ζj2𝐖2𝔼𝐗j𝐙j2𝔼𝐗j𝐙j2.\mathbb{E}\|{\bf\zeta}^{j}\|^{2}\leq\|{\bf W}\|^{2}\cdot\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}\leq\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}.

With Lemma 3 and Assumption 3, for any jj,

𝔼𝐗j𝐙j2\displaystyle\mathbb{E}\|{\bf X}^{j}-{\bf Z}^{j}\|^{2}
m(8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2)+32K2B2)η2.\leq m(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})+32K^{2}B^{2})\eta^{2}.

Thus, we get

𝔼𝐗t𝐏𝐗t2\displaystyle\mathbb{E}\|{\bf X}^{t}-{\bf P}{\bf X}^{t}\|^{2}
m(8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2)+32K2B2)η21λ.\leq\frac{m(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})+32K^{2}B^{2})\eta^{2}}{1-\lambda}.

The fact that 𝐗t𝐏𝐗t=(𝐱t(1)𝐱t¯𝐱t(2)𝐱t¯𝐱t(m)𝐱t¯){\bf X}^{t}-{\bf P}{\bf X}^{t}=\left(\begin{array}[]{c}{\bf x}^{t}(1)-\overline{{\bf x}^{t}}\\ {\bf x}^{t}(2)-\overline{{\bf x}^{t}}\\ \vdots\\ {\bf x}^{t}(m)-\overline{{\bf x}^{t}}\\ \end{array}\right) then proves the result.

5.2.4 Proof of Lemma 5

Let 𝐙~t:=𝐘t,K\widetilde{{\bf Z}}^{t}:={\bf Y}^{t,K}. Obviously, it holds

𝐗t+1=𝐖𝐗tζ~t,\displaystyle{\bf X}^{t+1}={\bf W}{\bf X}^{t}-\widetilde{{\bf\zeta}}^{t}, (16)

where ζ~t=𝐖𝐗t𝐖(𝐐(𝐙~t𝐗t)+𝐗t).\widetilde{{\bf\zeta}}^{t}={\bf W}{\bf X}^{t}-{\bf W}({\bf Q}(\widetilde{{\bf Z}}^{t}-{\bf X}^{t})+{\bf X}^{t}). We just need to bound 𝔼ζ~t2\mathbb{E}\|\widetilde{{\bf\zeta}}^{t}\|^{2},

𝔼ζ~t22𝔼𝐖𝐗t𝐖𝐙~t2\displaystyle\mathbb{E}\|\widetilde{{\bf\zeta}}^{t}\|^{2}\leq 2\mathbb{E}\|{\bf W}{\bf X}^{t}-{\bf W}\widetilde{{\bf Z}}^{t}\|^{2}
+2𝔼𝐖𝐙~t𝐖(𝐐(𝐙~t𝐗t)+𝐗t)2\displaystyle\qquad+2\mathbb{E}\|{\bf W}\widetilde{{\bf Z}}^{t}-{\bf W}({\bf Q}(\widetilde{{\bf Z}}^{t}-{\bf X}^{t})+{\bf X}^{t})\|^{2}
2𝔼𝐗t𝐙~t2+2𝔼𝐖(𝐙~t𝐗t)𝐖(𝐐(𝐙~t𝐗t))2\displaystyle\leq 2\mathbb{E}\|{\bf X}^{t}-\widetilde{{\bf Z}}^{t}\|^{2}+2\mathbb{E}\|{\bf W}(\widetilde{{\bf Z}}^{t}-{\bf X}^{t})-{\bf W}({\bf Q}(\widetilde{{\bf Z}}^{t}-{\bf X}^{t}))\|^{2}
2𝔼𝐗t𝐙~t2+2mds2\displaystyle\leq 2\mathbb{E}\|{\bf X}^{t}-\widetilde{{\bf Z}}^{t}\|^{2}+2mds^{2}
2mη2(8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2)+32K2B2)\displaystyle\leq 2m\eta^{2}(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})+32K^{2}B^{2})
+2mds2,\displaystyle\qquad+2mds^{2},

where the last inequality uses Lemma 3.

5.3 Proof of Theorem 1

Noting that 𝐏𝐗t+1=𝐏𝐖𝐙t=𝐏𝐙t{\bf P}{\bf X}^{t+1}={\bf P}{\bf W}{\bf Z}^{t}={\bf P}{\bf Z}^{t}, which is equivalent to

𝐱t+1¯=𝐳t¯,\overline{{\bf x}^{t+1}}=\overline{{\bf z}^{t}},

we have

𝐱t+1¯𝐱t¯=𝐱t+1¯𝐳t¯+𝐳t¯𝐱t¯=𝐳t¯𝐱t¯,\displaystyle\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}=\overline{{\bf x}^{t+1}}-\overline{{\bf z}^{t}}+\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}=\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}, (17)

where 𝐳t¯:=i=1m𝐳t(i)m\overline{{\bf z}^{t}}:=\frac{\sum_{i=1}^{m}{\bf z}^{t}(i)}{m}. With the local update scheme on each node,

𝐳t¯𝐱t¯=i=1m(𝐳t(i)𝐱t(i))m\displaystyle\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}=\frac{\sum_{i=1}^{m}({\bf z}^{t}(i)-{\bf x}^{t}(i))}{m}
=i=1m(k=0K1𝐲t,k+1(i)𝐲t,k(i))m\displaystyle=\frac{\sum_{i=1}^{m}(\sum_{k=0}^{K-1}{\bf y}^{t,k+1}(i)-{\bf y}^{t,k}(i))}{m}
=ηi=1mk=0K1fi(𝐲t,k(i))m+θ(𝐳t¯𝐱t¯)\displaystyle=-\eta\frac{\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla f_{i}({\bf y}^{t,k}(i))}{m}+\theta(\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}})
+θηi=1mfi(𝐲t,K1(i))m.\displaystyle\qquad+\theta\eta\frac{\sum_{i=1}^{m}\nabla f_{i}({\bf y}^{t,K-1}(i))}{m}.

Thus, we get

𝐳t¯𝐱t¯=11θ(ηi=1mk=0K2fi(𝐲t,k(i))m)\displaystyle\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}=\frac{1}{1-\theta}(-\eta\frac{\sum_{i=1}^{m}\sum_{k=0}^{K-2}\nabla f_{i}({\bf y}^{t,k}(i))}{m}) (18)
+11θ(ηi=1m(1θ)fi(𝐲t,K1(i))m).\displaystyle\quad+\frac{1}{1-\theta}(-\eta\frac{\sum_{i=1}^{m}(1-\theta)\nabla f_{i}({\bf y}^{t,K-1}(i))}{m}).

The Lipschitz continuity of f\nabla f gives us

𝔼f(𝐱t+1¯)\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}}) 𝔼f(𝐱t¯)+𝔼f(𝐱t¯),𝐳t¯𝐱t¯\displaystyle\leq\mathbb{E}f(\overline{{\bf x}^{t}})+\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}\rangle
+L2𝔼𝐱t+1¯𝐱t¯2,\displaystyle+\frac{L}{2}\mathbb{E}\|\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}\|^{2}, (19)

where we used (17). Setting

K~=Kθ1θ,\tilde{K}=\frac{K-\theta}{1-\theta},

we can derive

𝔼K~f(𝐱t¯),(𝐳t¯𝐱t¯)/K~\displaystyle\mathbb{E}\langle\tilde{K}\nabla f(\overline{{\bf x}^{t}}),(\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}})/\tilde{K}\rangle
=𝔼K~f(𝐱t¯),ηf(𝐱t¯)+ηf(𝐱t¯)+(𝐳t¯𝐱t¯)/K~\displaystyle=\mathbb{E}\langle\tilde{K}\nabla f(\overline{{\bf x}^{t}}),-\eta\nabla f(\overline{{\bf x}^{t}})+\eta\nabla f(\overline{{\bf x}^{t}})+(\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}})/\tilde{K}\rangle
=ηK~𝔼f(𝐱t¯)2+𝔼f(𝐱t¯),ηf(𝐱t¯)+(𝐳t¯𝐱t¯)/K~\displaystyle=-\eta\tilde{K}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}+\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\eta\nabla f(\overline{{\bf x}^{t}})+(\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}})/\tilde{K}\rangle
a)ηK~𝔼f(𝐱t¯)2\displaystyle\overset{a)}{\leq}-\eta\tilde{K}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}
+η𝔼f(𝐱t¯)i=1mk=0K2[fi(𝐱t¯)fi(𝐲t,k(i))]m\displaystyle\qquad+\eta\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|\cdot\Big{\|}\frac{\sum_{i=1}^{m}\sum_{k=0}^{K-2}[\nabla f_{i}(\overline{{\bf x}^{t}})-\nabla f_{i}({\bf y}^{t,k}(i))]}{m}
+i=1m(1θ)[fi(𝐱t¯)fi(𝐲t,K1(i))]m\displaystyle\qquad+\frac{\sum_{i=1}^{m}(1-\theta)[\nabla f_{i}(\overline{{\bf x}^{t}})-\nabla f_{i}({\bf y}^{t,K-1}(i))]}{m}\Big{\|}
ηK~𝔼f(𝐱t¯)2+ηLmi=1mk=0K1𝔼f(𝐱t¯)𝐱t¯𝐲t,k(i)\displaystyle\leq-\eta\tilde{K}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}+\frac{\eta L}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|\cdot\|\overline{{\bf x}^{t}}-{\bf y}^{t,k}(i)\|
ηK~𝔼f(𝐱t¯)2+ηK~2𝔼f(𝐱t¯)2\displaystyle\leq-\eta\tilde{K}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}+\frac{\eta\tilde{K}}{2}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}
+ηL2K22K~(C1η2+32K2η2i=1m𝔼f(𝐱t(i))2m),\displaystyle\qquad+\frac{\eta L^{2}K^{2}}{2\tilde{K}}(C_{1}\eta^{2}+32K^{2}\eta^{2}\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}),

where a)a) uses (18). Similarly, we can get

L2𝔼(𝐱t+1¯𝐱t¯2)=L2𝔼(𝐳t¯𝐱t¯2)\displaystyle\frac{L}{2}\mathbb{E}(\|\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}\|^{2})=\frac{L}{2}\mathbb{E}(\|\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}\|^{2})
L21mi=1m𝐳t(i)𝐱t(i)2\displaystyle\leq\frac{L}{2}\frac{1}{m}\sum_{i=1}^{m}\|{\bf z}^{t}(i)-{\bf x}^{t}(i)\|^{2}
L2C1η2+16LK2η2i=1m𝔼f(𝐱t(i))2m.\displaystyle\leq\frac{L}{2}C_{1}\eta^{2}+16LK^{2}\eta^{2}\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}.

Thus, (19) can be rewritten as

𝔼f(𝐱t+1¯)𝔼f(𝐱t¯)ηK~2𝔼f(𝐱t¯)2+L2K22K~C1η3\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}})\leq\mathbb{E}f(\overline{{\bf x}^{t}})-\frac{\eta\tilde{K}}{2}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}+\frac{L^{2}K^{2}}{2\tilde{K}}C_{1}\eta^{3}
+L2C1η2+(16L2K4η3K~+16LK2η2)i=1m𝔼f(𝐱t(i))2m.\displaystyle+\frac{L}{2}C_{1}\eta^{2}+(\frac{16L^{2}K^{4}\eta^{3}}{\tilde{K}}+16LK^{2}\eta^{2})\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}.

Direct computation together with Lemma 4 gives us

i=1m𝔼f(𝐱t(i))2m\displaystyle\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}
=i=1m𝔼f(𝐱t(i))f(𝐱t¯)+f(𝐱t¯)2m\displaystyle=\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))-\nabla f(\overline{{\bf x}^{t}})+\nabla f(\overline{{\bf x}^{t}})\|^{2}}{m}
i=1m2𝔼f(𝐱t(i))f(𝐱t¯)2+2𝔼f(𝐱t¯)2m\displaystyle\leq\frac{\sum_{i=1}^{m}2\mathbb{E}\|\nabla f({\bf x}^{t}(i))-\nabla f(\overline{{\bf x}^{t}})\|^{2}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}}{m}
2L2i=1m𝐱t(i)𝐱t¯2m+2𝔼f(𝐱t¯)2\displaystyle\leq 2L^{2}\frac{\sum_{i=1}^{m}\|{\bf x}^{t}(i)-\overline{{\bf x}^{t}}\|^{2}}{m}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}
2L2C2η21λ+2𝔼f(𝐱t¯)2.\displaystyle\leq\frac{2L^{2}C_{2}\eta^{2}}{1-\lambda}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}.

Therefore, we have

𝔼f(𝐱t+1¯)𝔼f(𝐱t¯)\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}})\leq\mathbb{E}f(\overline{{\bf x}^{t}}) (20)
(η(Kθ)2(1θ)32(1θ)L2K4η3Kθ32LK2η2)\displaystyle\quad-(\frac{\eta(K-\theta)}{2(1-\theta)}-\frac{32(1-\theta)L^{2}K^{4}\eta^{3}}{K-\theta}-32LK^{2}\eta^{2})
×𝔼f(𝐱t¯)2+((1θ)L2K22(Kθ)η3+L2η2)\displaystyle\qquad\times\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}+(\frac{(1-\theta)L^{2}K^{2}}{2(K-\theta)}\eta^{3}+\frac{L}{2}\eta^{2})
×(8Kσl2+32K2σg2+64K2θ2(1θ)2(σl2+B2))\displaystyle\qquad\times(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2}))
+(32(1θ)L4K4η5/(Kθ)+32L3K2η4)/(1λ)\displaystyle\quad+(32(1-\theta)L^{4}K^{4}\eta^{5}/(K-\theta)+32L^{3}K^{2}\eta^{4})/(1-\lambda)
×(8Kσl2+32K2σg2+32K2B2+64K2θ2(1θ)2(σl2+B2)).\displaystyle\qquad\times(8K\sigma_{l}^{2}+32K^{2}\sigma_{g}^{2}+32K^{2}B^{2}+\frac{64K^{2}\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})).

Summing the inequality (20) from t=1t=1 to TT then proves the result.

5.4 Proof of Theorem 2

With the PŁ condition,

𝔼f(𝐱t¯)22ν𝔼(f(𝐱t¯)minf).\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}\geq 2\nu\mathbb{E}(f(\overline{{\bf x}^{t}})-\min f).

We start from (20),

𝔼f(𝐱t+1¯)𝔼f(𝐱t¯)νγ(K,η)𝔼(f(𝐱t¯)minf)\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}})\leq\mathbb{E}f(\overline{{\bf x}^{t}})-\nu\gamma(K,\eta)\mathbb{E}(f(\overline{{\bf x}^{t}})-\min f) (21)
+γ(K,η)α(K,η)2+γ(K,η)β(K,η,λ)2.\displaystyle\quad+\frac{\gamma(K,\eta)\alpha(K,\eta)}{2}+\frac{\gamma(K,\eta)\beta(K,\eta,\lambda)}{2}.

By defining ξt:=𝔼(f(𝐱t¯)minf)\xi_{t}:=\mathbb{E}(f(\overline{{\bf x}^{t}})-\min f), it then follows

ξt+1\displaystyle\xi_{t+1} [1νγ(K,η)]ξt\displaystyle\leq[1-\nu\gamma(K,\eta)]\xi_{t} (22)
+γ(K,η)α(K,η)2+γ(K,η)β(K,η,λ)2.\displaystyle+\frac{\gamma(K,\eta)\alpha(K,\eta)}{2}+\frac{\gamma(K,\eta)\beta(K,\eta,\lambda)}{2}.

Thus, we are then led to

ξT[1νγ(K,η)]Tξ0\displaystyle\xi_{T}\leq[1-\nu\gamma(K,\eta)]^{T}\xi_{0}
+(γ(K,η)α(K,η)2+γ(K,η)β(K,η,λ)2)\displaystyle\quad+\Big{(}\frac{\gamma(K,\eta)\alpha(K,\eta)}{2}+\frac{\gamma(K,\eta)\beta(K,\eta,\lambda)}{2}\Big{)}
×(t=0T1[1νγ(K,η)]t)\displaystyle\qquad\times(\sum_{t=0}^{T-1}[1-\nu\gamma(K,\eta)]^{t})
[1νγ(K,η)]Tξ0\displaystyle\leq[1-\nu\gamma(K,\eta)]^{T}\xi_{0}
+(γ(K,η)α(K,η)2+γ(K,η)β(K,η,λ)2)1νγ(K,η)\displaystyle\quad+\Big{(}\frac{\gamma(K,\eta)\alpha(K,\eta)}{2}+\frac{\gamma(K,\eta)\beta(K,\eta,\lambda)}{2}\Big{)}\frac{1}{\nu\gamma(K,\eta)}
=[1νγ(K,η)]Tξ0+α(K,η)2ν+β(K,η,λ)2ν.\displaystyle=[1-\nu\gamma(K,\eta)]^{T}\xi_{0}+\frac{\alpha(K,\eta)}{2\nu}+\frac{\beta(K,\eta,\lambda)}{2\nu}.

The result is then proved.
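The last step is the standard bound for a contractive recursion: if ξt+1(1ρ)ξt+c\xi_{t+1}\leq(1-\rho)\xi_{t}+c with ρ=νγ(K,η)(0,1)\rho=\nu\gamma(K,\eta)\in(0,1) and cc the constant term, then ξT(1ρ)Tξ0+c/ρ\xi_{T}\leq(1-\rho)^{T}\xi_{0}+c/\rho. A quick numerical check with arbitrary illustrative constants:

```python
rho, c, xi0, T = 0.05, 1e-3, 10.0, 500  # illustrative values only

xi = xi0
for _ in range(T):
    xi = (1 - rho) * xi + c  # worst case: the recursion holds with equality
bound = (1 - rho) ** T * xi0 + c / rho  # closed-form bound from the geometric sum
assert xi <= bound + 1e-12
```

The geometric sum t=0T1(1ρ)t1/ρ\sum_{t=0}^{T-1}(1-\rho)^{t}\leq 1/\rho is what turns the per-step constant into the c/ρc/\rho term of Theorem 2.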

5.5 Proof of Proposition 2

A quick calculation gives us

{γ(K,η)=Θ(1Tc2),α(K,η)2ν+β(K,η,λ)2ν=O(1Tc2).\displaystyle\left\{\begin{array}[]{c}\gamma(K,\eta)=\Theta(\frac{1}{T^{c_{2}}}),\\ \frac{\alpha(K,\eta)}{2\nu}+\frac{\beta(K,\eta,\lambda)}{2\nu}=O(\frac{1}{T^{c_{2}}}).\end{array}\right.

Thus, we just need to bound the first term in Theorem 2. As TT is large, γ(K,η)0\gamma(K,\eta)\rightarrow 0. Its logarithm is then

Tlog[1νγ(K,η)]=Θ(Tνγ(K,η)).\displaystyle T\log[1-\nu\gamma(K,\eta)]=\Theta(-T\nu\gamma(K,\eta)).

With our setting, it follows

Tνγ(K,η)νc1lnc3TLTc21.T\nu\gamma(K,\eta)\approx\frac{\nu c_{1}\ln^{c_{3}}T}{LT^{c_{2}-1}}.

Then we have

𝔼f(𝐱T¯)minf=𝒪(exp(νc1lnc3TLKTc21)+1Tc2).\mathbb{E}f(\overline{{\bf x}^{T}})-\min f=\mathcal{O}(\textrm{exp}(-\frac{\nu c_{1}\ln^{c_{3}}T}{LKT^{c_{2}-1}})+\frac{1}{T^{c_{2}}}).

We first consider how to choose c2c_{2}. By L'Hospital's rule, for any δ>0\delta>0, as T+T\rightarrow+\infty

exp(1Tδ)1.\textrm{exp}(-\frac{1}{T^{\delta}})\rightarrow 1.

Thus, we need to set c21c_{2}\leq 1, and the rate can be no faster than O(1T)O(\frac{1}{T}). To this end, we choose c1=Lνc_{1}=\frac{L}{\nu}, c2=1c_{2}=1, and c3=1c_{3}=-1.

5.6 Proof of Theorem 3

Let 𝐲~t:=i=1m𝐲t,K(i)m\widetilde{{\bf y}}^{t}:=\frac{\sum_{i=1}^{m}{\bf y}^{t,K}(i)}{m}. In the quantized DFedAvgM, it follows that 𝐱t+1¯𝐱t¯=Q(𝐲~t𝐱t¯).\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}=Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}}). With the Lipschitz continuity of f\nabla f,

𝔼f(𝐱t+1¯)\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}}) 𝔼f(𝐱t¯)\displaystyle\leq\mathbb{E}f(\overline{{\bf x}^{t}})
+𝔼f(𝐱t¯),Q(𝐲~t𝐱t¯)+L2𝔼𝐱t+1¯𝐱t¯2.\displaystyle+\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})\rangle+\frac{L}{2}\mathbb{E}\|\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}\|^{2}.

We have

𝔼f(𝐱t¯),Q(𝐲~t𝐱t¯)\displaystyle\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})\rangle
=𝔼f(𝐱t¯),𝐲~t𝐱t¯+𝔼f(𝐱t¯),𝐲~t𝐱t¯Q(𝐲~t𝐱t¯)\displaystyle=\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}}\rangle+\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}}-Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})\rangle
𝔼f(𝐱t¯),𝐲~t𝐱t¯+Bds\displaystyle\leq\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}}\rangle+B\sqrt{d}s

and

L2𝔼𝐱t+1¯𝐱t¯2=L2𝔼Q(𝐲~t𝐱t¯)2\displaystyle\frac{L}{2}\mathbb{E}\|\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}\|^{2}=\frac{L}{2}\mathbb{E}\|Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})\|^{2}
L𝔼𝐲t~𝐱t¯2+L𝔼Q(𝐲~t𝐱t¯)(𝐲~t𝐱t¯)2\displaystyle\leq L\mathbb{E}\|\widetilde{{\bf y}^{t}}-\overline{{\bf x}^{t}}\|^{2}+L\mathbb{E}\|Q(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})-(\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}})\|^{2}
L𝔼𝐲t~𝐱t¯2+Lds2m.\displaystyle\leq L\mathbb{E}\|\widetilde{{\bf y}^{t}}-\overline{{\bf x}^{t}}\|^{2}+\frac{Lds^{2}}{m}.

Note that both 𝔼f(𝐱t¯),𝐲~t𝐱t¯\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\widetilde{{\bf y}}^{t}-\overline{{\bf x}^{t}}\rangle and 𝔼𝐲t~𝐱t¯2\mathbb{E}\|\widetilde{{\bf y}^{t}}-\overline{{\bf x}^{t}}\|^{2} can inherit the bounds of 𝔼f(𝐱t¯),𝐳t¯𝐱t¯\mathbb{E}\langle\nabla f(\overline{{\bf x}^{t}}),\overline{{\bf z}^{t}}-\overline{{\bf x}^{t}}\rangle and 𝔼𝐱t+1¯𝐱t¯2\mathbb{E}\|\overline{{\bf x}^{t+1}}-\overline{{\bf x}^{t}}\|^{2} in the proof of Theorem 1. Thus, we obtain

𝔼f(𝐱t+1¯)𝔼f(𝐱t¯)ηK~2𝔼f(𝐱t¯)2\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}})\leq\mathbb{E}f(\overline{{\bf x}^{t}})-\frac{\eta\tilde{K}}{2}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}
+L2K22K~C1η3+LC1η2+Lds2m+Bds\displaystyle+\frac{L^{2}K^{2}}{2\tilde{K}}C_{1}\eta^{3}+LC_{1}\eta^{2}+\frac{Lds^{2}}{m}+B\sqrt{d}s
+(32L2K4η3K~+32LK2η2)i=1m𝔼f(𝐱t(i))2m.\displaystyle+(\frac{32L^{2}K^{4}\eta^{3}}{\tilde{K}}+32LK^{2}\eta^{2})\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}.

With Lemma 5, we can get

i=1m𝔼f(𝐱t(i))2m\displaystyle\frac{\sum_{i=1}^{m}\mathbb{E}\|\nabla f({\bf x}^{t}(i))\|^{2}}{m}
i=1m2𝔼f(𝐱t(i))f(𝐱t¯)2+2𝔼f(𝐱t¯)2m\displaystyle\leq\frac{\sum_{i=1}^{m}2\mathbb{E}\|\nabla f({\bf x}^{t}(i))-\nabla f(\overline{{\bf x}^{t}})\|^{2}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}}{m}
2L2i=1m𝐱t(i)𝐱t¯2m+2𝔼f(𝐱t¯)2\displaystyle\leq 2L^{2}\frac{\sum_{i=1}^{m}\|{\bf x}^{t}(i)-\overline{{\bf x}^{t}}\|^{2}}{m}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}
2L2C3η21λ+4L2ds21λ+2𝔼f(𝐱t¯)2.\displaystyle\leq\frac{2L^{2}C_{3}\eta^{2}}{1-\lambda}+\frac{4L^{2}ds^{2}}{1-\lambda}+2\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}.

Combining the inequalities together,

𝔼f(𝐱t+1¯)\displaystyle\mathbb{E}f(\overline{{\bf x}^{t+1}}) 𝔼f(𝐱t¯)+ζ(K,η,λ,s)\displaystyle\leq\mathbb{E}f(\overline{{\bf x}^{t}})+\zeta(K,\eta,\lambda,s)
(ηK~264L2K4η3K~64LK2η2)𝔼f(𝐱t¯)2,\displaystyle-(\frac{\eta\tilde{K}}{2}-\frac{64L^{2}K^{4}\eta^{3}}{\tilde{K}}-64LK^{2}\eta^{2})\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2},

where ζ(K,η,λ,s):=L2K22K~C1η3+LC1η2+(32L2K4η3K~+32LK2η2)(2L2C3η21λ+4L2ds21λ)+Lds2m+Bds.\zeta(K,\eta,\lambda,s):=\frac{L^{2}K^{2}}{2\tilde{K}}C_{1}\eta^{3}+LC_{1}\eta^{2}+(\frac{32L^{2}K^{4}\eta^{3}}{\tilde{K}}+32LK^{2}\eta^{2})(\frac{2L^{2}C_{3}\eta^{2}}{1-\lambda}+\frac{4L^{2}ds^{2}}{1-\lambda})+\frac{Lds^{2}}{m}+B\sqrt{d}s. Given the stepsize η=Θ(1LKT)\eta=\Theta(\frac{1}{LK\sqrt{T}}), we can see that ηK~264L2K4η3K~64LK2η2>0\frac{\eta\tilde{K}}{2}-\frac{64L^{2}K^{4}\eta^{3}}{\tilde{K}}-64LK^{2}\eta^{2}>0 when TT is large. When s>0s>0 is small, s2=O(s)s^{2}=O(s). It then follows that ηK~264L2K4η3K~64LK2η2=Θ(1(1θ)T)\frac{\eta\tilde{K}}{2}-\frac{64L^{2}K^{4}\eta^{3}}{\tilde{K}}-64LK^{2}\eta^{2}=\Theta(\frac{1}{(1-\theta)\sqrt{T}}). We now consider

[(32L2K4η3K~+32LK2η2)(2L2C3η21λ+4L2ds1λ)\displaystyle\Big{[}(\frac{32L^{2}K^{4}\eta^{3}}{\tilde{K}}+32LK^{2}\eta^{2})(\frac{2L^{2}C_{3}\eta^{2}}{1-\lambda}+\frac{4L^{2}ds}{1-\lambda}) (23)
+LC1η2+L2K22K~C1η3+Lds2m+Bds]/[ηK~2\displaystyle\quad+LC_{1}\eta^{2}+\frac{L^{2}K^{2}}{2\tilde{K}}C_{1}\eta^{3}+\frac{Lds^{2}}{m}+B\sqrt{d}s\Big{]}\Big{/}\Big{[}\frac{\eta\tilde{K}}{2}
64L2K4η3K~64LK2η2],\displaystyle\quad-\frac{64L^{2}K^{4}\eta^{3}}{\tilde{K}}-64LK^{2}\eta^{2}\Big{]},

which is of order O(1T+σl2+Kσg2+θ2(1θ)2K(σl2+B2)KT+σl2+Kσg2+KB2+θ2(1θ)2K(σl2+B2)(1λ)KT3/2+Ts)O\Big{(}\frac{1}{\sqrt{T}}+\frac{\sigma_{l}^{2}+K\sigma_{g}^{2}+\frac{\theta^{2}}{(1-\theta)^{2}}K(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}+\frac{\sigma_{l}^{2}+K\sigma_{g}^{2}+KB^{2}+\frac{\theta^{2}}{(1-\theta)^{2}}K(\sigma_{l}^{2}+B^{2})}{(1-\lambda)KT^{3/2}}+\sqrt{T}s\Big{)}. If the function ff further satisfies the PŁ condition with constant ν\nu, we have

𝔼(f(𝐱t¯)minf)\displaystyle\mathbb{E}(f(\overline{{\bf x}^{t}})-\min f)
[1νγ(K,η)]T(f(𝐱0¯)minf)+ζ(K,η,λ,s)νγ(K,η).\displaystyle\qquad\leq[1-\nu\gamma(K,\eta)]^{T}(f(\overline{{\bf x}^{0}})-\min f)+\frac{\zeta(K,\eta,\lambda,s)}{\nu\gamma(K,\eta)}.

When η=1νTKlnT\eta=\frac{1}{\nu TK\ln T}, [1νγ(K,η)]T=𝒪~(1T)[1-\nu\gamma(K,\eta)]^{T}=\widetilde{\mathcal{O}}(\frac{1}{T}) and

ζ(K,η,λ,s)νγ(K,η)=𝒪~(1T+Ts).\frac{\zeta(K,\eta,\lambda,s)}{\nu\gamma(K,\eta)}=\widetilde{\mathcal{O}}(\frac{1}{T}+Ts).

5.7 Proof of Proposition 3

We calculate the communication costs required by both algorithms to reach the same error. Omitting terms of order higher than one in η\eta, from (20), we have

min1tT𝔼f(𝐱t¯)22(f(𝐱0¯)minf)ηKT\displaystyle\min_{1\leq t\leq T}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}\approx\frac{2(f(\overline{{\bf x}^{0}})-\min f)}{\eta KT}
+Lη(8σl2+32Kσg2+64Kθ2(1θ)2(σl2+B2))\displaystyle\quad+L\eta(8\sigma_{l}^{2}+32K\sigma_{g}^{2}+\frac{64K\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2}))
=2(1θ)(f(𝐱0¯)minf)T\displaystyle=\frac{2(1-\theta)(f(\overline{{\bf x}^{0}})-\min f)}{\sqrt{T}}
+8(1θ)σl2+32(1θ)Kσg2+64Kθ2(1θ)(σl2+B2)KT.\displaystyle\quad+\frac{8(1-\theta)\sigma_{l}^{2}+32(1-\theta)K\sigma_{g}^{2}+\frac{64K\theta^{2}}{(1-\theta)}(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}.

for DFedAvgM. From (23),

min1tT𝔼f(𝐱t¯)22(1θ)(f(𝐱0¯)minf)T\displaystyle\min_{1\leq t\leq T}\mathbb{E}\|\nabla f(\overline{{\bf x}^{t}})\|^{2}\approx\frac{2(1-\theta)(f(\overline{{\bf x}^{0}})-\min f)}{\sqrt{T}}
+8(1θ)σl2+32(1θ)Kσg2+64Kθ2(1θ)(σl2+B2)KT\displaystyle\quad+\frac{8(1-\theta)\sigma_{l}^{2}+32(1-\theta)K\sigma_{g}^{2}+\frac{64K\theta^{2}}{(1-\theta)}(\sigma_{l}^{2}+B^{2})}{K\sqrt{T}}
+2(1θ)LBdTs.\displaystyle\quad+2(1-\theta)LB\sqrt{d}\sqrt{T}s.

Given the ϵ>0\epsilon>0, assume TϵT_{\epsilon} obeys

8(1θ)σl2+32(1θ)Kσg2+64Kθ2(1θ)(σl2+B2)KTϵ\displaystyle\frac{8(1-\theta)\sigma_{l}^{2}+32(1-\theta)K\sigma_{g}^{2}+\frac{64K\theta^{2}}{(1-\theta)}(\sigma_{l}^{2}+B^{2})}{K\sqrt{T_{\epsilon}}}
+2(1θ)(f(𝐱0¯)minf)Tϵ=ϵ.\displaystyle\quad+\frac{2(1-\theta)(f(\overline{{\bf x}^{0}})-\min f)}{\sqrt{T_{\epsilon}}}=\epsilon.

That means DFedAvgM reaches an ϵ\epsilon error in TϵT_{\epsilon} iterations. However, due to the error caused by quantization, we have to increase the number of iterations for the quantized DFedAvgM; we set it to 94Tϵ\frac{9}{4}T_{\epsilon}. To reach the ϵ\epsilon error with the quantized DFedAvgM, we also need 3(1θ)LBdTϵsϵ3(1-\theta)LB\sqrt{d}\sqrt{T_{\epsilon}}s\leq\epsilon, which yields

\[
3(1-\theta)^{2}LB\sqrt{d}\,s\Big[2(f(\overline{\mathbf{x}^{0}})-\min f)+\frac{8\sigma_{l}^{2}}{K}
+32\sigma_{g}^{2}+\frac{64\theta^{2}}{(1-\theta)^{2}}(\sigma_{l}^{2}+B^{2})\Big]\leq\epsilon^{2}.
\]

The total communication cost of DFedAvgM to reach $\epsilon$ is

\[
32dT_{\epsilon}\sum_{i=1}^{m}\deg(\mathcal{N}(i)).
\]

For the quantized version, the total communication cost is

\[
(32+db)\frac{9}{4}T_{\epsilon}\sum_{i=1}^{m}\deg(\mathcal{N}(i)).
\]

Thus, quantization reduces the communication cost if

\[
(32+db)\frac{9}{4}<32d.
\]
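As a quick sanity check of this condition, the following sketch (our own illustration, not the paper's code; the model dimension and bit-widths are merely examples) compares the per-edge, per-round communication of the two variants, with the quantized side scaled by the $\frac{9}{4}$ extra iterations it needs:

```python
# Sketch (our illustration): compare communication of full-precision
# DFedAvgM (32 bits per entry) against quantized DFedAvgM (32 bits for the
# scale plus d*b bits), the latter scaled by the 9/4 extra rounds it needs
# to reach the same error. The common factors T_eps and sum_i deg(N(i)) cancel.

def full_precision_bits(d):
    """Bits per edge per round with 32-bit floats."""
    return 32 * d

def quantized_bits(d, b):
    """Bits per edge per round for b-bit quantization, scaled by 9/4."""
    return (32 + d * b) * 9 / 4

# For a model the size of the paper's 2NN (199,210 parameters), 8-bit
# quantization satisfies (32 + d*b) * 9/4 < 32d, so it saves communication:
d = 199_210
print(quantized_bits(d, 8) < full_precision_bits(d))   # True
# 16-bit quantization does not, since (9/4) * 16 > 32 regardless of d:
print(quantized_bits(d, 16) < full_precision_bits(d))  # False
```

In particular, for large $d$ the condition can only hold when $b<\frac{128}{9}\approx 14.2$.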

6 Numerical Results

We apply the proposed DFedAvgM with communication quantization to train DNNs for both image classification and language modeling, where we consider a simple ring-structured communication network. We aim to verify that DFedAvgM can train DNNs effectively and, in particular, communication-efficiently. Moreover, we consider membership privacy protection when DFedAvgM is used for training DNNs. We apply the membership inference attack (MIA) [49] to test the efficiency of (quantized) DFedAvgM in protecting the membership privacy of the training data. In MIA, the attack model is a binary classifier (we use a multilayer perceptron with a hidden layer of 64 nodes followed by a softmax output function, adapted from [49]) that decides whether a data point is in the training set of the target model. For each of the following datasets, to perform MIA we first split its training set into $D_{\rm shadow}$ and $D_{\rm target}$ of the same size. Furthermore, we split $D_{\rm shadow}$ into two halves of the same size, denoted $D_{\rm shadow}^{\rm train}$ and $D_{\rm shadow}^{\rm out}$, and we similarly split $D_{\rm target}$ into $D_{\rm target}^{\rm train}$ and $D_{\rm target}^{\rm out}$. MIA proceeds as follows: 1) train the shadow model on $D_{\rm shadow}^{\rm train}$; 2) apply the trained shadow model to every data point in $D_{\rm shadow}$ and obtain its classification probabilities of belonging to each class. We then take the top three classification probabilities (or two in the case of binary classification) to form the feature vector of each data point. A feature vector is labeled 1 if the corresponding data point is in $D_{\rm shadow}^{\rm train}$, and 0 otherwise.
Then we train the attack model on all the labeled feature vectors; 3) train the target model on $D_{\rm target}^{\rm train}$ and obtain the feature vector of each point in $D_{\rm target}$. Finally, we use the attack model to decide whether a data point is in $D_{\rm target}^{\rm train}$. For any data point $\xi\in D_{\rm target}$, the attack model predicts the probability $p$ that $\xi$ belongs to the training set of the target model. Given a fixed threshold $t$, if $p\geq t$ we classify $\xi$ as a member of the training set (positive sample), and if $p<t$ we conclude that $\xi$ is not in the training set (negative sample); different thresholds thus give different attack results. We plot the ROC curve over thresholds and take the area under the ROC curve (AUC) as an evaluation of the membership inference attack. The target model protects membership privacy perfectly if the AUC is 0.5 (the attack model performs a random guess); the higher the AUC, the less private the target model.
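The threshold sweep above is equivalent to computing the AUC directly as a rank statistic, which the following self-contained sketch makes concrete (our illustration; the attack scores are synthetic):

```python
# Sketch of the MIA evaluation (our illustration with synthetic scores):
# sweeping the threshold t traces the ROC curve, and its area equals the
# probability that a random member outscores a random non-member (ties count
# 1/2) -- the Mann-Whitney rank statistic, computed directly below.

def roc_auc(member_scores, nonmember_scores):
    wins = 0.0
    for p in member_scores:         # attack probability for a true member
        for q in nonmember_scores:  # attack probability for a non-member
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# an attack that ranks every member above every non-member: no privacy at all
print(roc_auc([0.9, 0.8, 0.7], [0.4, 0.3, 0.2]))  # 1.0
# identical score distributions: the attack is a random guess
print(roc_auc([0.5, 0.6], [0.5, 0.6]))            # 0.5
```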

6.1 MNIST Classification

The efficiency of DFedAvgM.

We train two DNNs for MNIST classification using 20 clients: 1) a simple multilayer perceptron with two hidden layers of 200 units each and ReLU activations (199,210 total parameters), which we refer to as 2NN; 2) a CNN with two $5\times 5$ convolution layers (the first with 32 channels, the second with 64, each followed by $2\times 2$ max pooling), a fully connected layer with 512 units and ReLU activation, and a final output layer (1,663,370 total parameters). We study two ways of partitioning the MNIST data over clients: IID and Non-IID. In the IID setting, the data is shuffled and then partitioned into 20 clients, each receiving 3000 examples. In the Non-IID setting, we first sort the data by digit label, divide it into 40 shards of size 1500, and assign each of the 20 clients 2 shards. In training, we set the local batch size (batch size of the training data on clients) to 50, the learning rate to 0.01, and the momentum to 0.9. Figures 2 and 3 show the results of training the CNN for MNIST classification (Fig. 2: IID; Fig. 3: Non-IID) by DFedAvgM using different communication bits and different local epochs. These results confirm the efficiency of DFedAvgM for training DNNs, in particular when the clients' data are IID. For both IID and Non-IID settings, the communication bits do not affect the performance of DFedAvgM: the training loss, test accuracy, and AUC under the membership inference attack are almost identical. Increasing the local training epochs accelerates training in the IID setting at the cost of faster privacy leakage. However, for Non-IID data, increasing the local training epochs helps DFedAvgM in neither training nor privacy protection. Training 2NN by DFedAvgM behaves similarly; see Figs. 4 and 5.
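The IID and Non-IID splits described above can be sketched as follows (our own illustration with synthetic labels standing in for MNIST digits; the client and shard counts follow the text):

```python
# Hypothetical sketch of the two MNIST partitions described above:
# IID = shuffle and split evenly; Non-IID = sort by label, cut into 40
# shards of 1500, and give each of the 20 clients 2 shards.
import random

def iid_partition(indices, num_clients=20):
    """Every client receives a shuffled, even share of all classes."""
    idx = list(indices)
    random.shuffle(idx)
    n = len(idx) // num_clients
    return [idx[i * n:(i + 1) * n] for i in range(num_clients)]

def noniid_partition(labels, num_clients=20, shards=40):
    """Label-sorted shards: every client sees at most a few digit classes."""
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(idx) // shards
    shard_list = [idx[i * size:(i + 1) * size] for i in range(shards)]
    random.shuffle(shard_list)
    per_client = shards // num_clients
    return [sum(shard_list[c * per_client:(c + 1) * per_client], [])
            for c in range(num_clients)]

labels = [i % 10 for i in range(60000)]   # synthetic stand-in for MNIST labels
clients = noniid_partition(labels)
print(len(clients), len(clients[0]))      # 20 3000
```

With 6000 examples per class and shards of 1500, each shard lies inside a single class, so every Non-IID client holds at most two digit classes.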

[Figure 2: two rows of three panels each, plotting CR vs. training loss, CR vs. test accuracy, and CR vs. AUC.]
Figure 2: Training CNN for IID MNIST classification with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). The differently quantized DFedAvgM variants perform almost identically, and more local epochs accelerate training at the cost of faster privacy leakage. CR: communication round.
[Figure 3: two rows of three panels each, plotting CR vs. training loss, CR vs. test accuracy, and CR vs. AUC.]
Figure 3: Training CNN for Non-IID MNIST classification with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). Different quantization levels do not lead to much difference in performance, and more local epochs help neither training speed nor privacy protection.
[Figure 4: two rows of three panels each, plotting CR vs. training loss, CR vs. test accuracy, and CR vs. AUC.]
Figure 4: Training 2NN for IID MNIST classification with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). The differently quantized DFedAvgM variants perform almost identically, and more local epochs accelerate training at the cost of faster privacy leakage.
[Figure 5: two rows of three panels each, plotting CR vs. training loss, CR vs. test accuracy, and CR vs. AUC.]
Figure 5: Training 2NN for Non-IID MNIST classification with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). Different quantization levels do not lead to much difference in performance, and more local epochs help neither training speed nor privacy protection.
Comparison between DFedAvgM, FedAvg, and DSGD.

Now, we compare DFedAvgM, FedAvg, and DSGD in training the 2NN for MNIST classification. We use the same local batch size of 50 for both FedAvg and DSGD, and the learning rates are both set to 0.1 (we note that DFedAvgM requires smaller learning rates than FedAvg and DSGD for numerical convergence). For FedAvg, all clients are involved in training and communication in each round. Figure 6 compares the three algorithms in terms of test loss and test accuracy for IID MNIST over communication rounds and communication cost. In terms of communication rounds, DFedAvgM converges as fast as FedAvg, and both are much faster than DSGD. DFedAvgM has a significant advantage over FedAvg and DSGD in communication cost. For Non-IID MNIST, training 2NN by FedAvg achieves 96.81% test accuracy, but neither DFedAvgM nor DSGD surpasses 85%. This disadvantage arises because DSGD and DFedAvgM communicate only with their neighbors, and a client together with its neighbors may not hold enough training data to cover all possible classes. One feasible way to resolve this issue of DFedAvgM in the Non-IID setting is to design a new graph structure for more efficient global communication.
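To make the algorithmic difference concrete, the following toy sketch (entirely our own, not the paper's experiments; each client holds a scalar quadratic) runs DFedAvgM-style rounds on a ring of five clients: $K$ local SGD-with-momentum steps followed by averaging with the two ring neighbors.

```python
# Toy sketch of DFedAvgM on a ring (our illustration). Client i holds
# f_i(x) = 0.5 * (x - c_i)^2, so the minimizer of the average loss is mean(c).
# Each round: K local SGD-with-momentum steps, then uniform averaging with
# the two ring neighbors.

def dfedavgm_round(x, c, eta=0.01, beta=0.9, K=5):
    m = len(x)
    for i in range(m):
        v = 0.0                       # momentum buffer, reset each round
        for _ in range(K):
            grad = x[i] - c[i]        # exact gradient of the local quadratic
            v = beta * v + grad
            x[i] -= eta * v
    # mixing step: average with left and right neighbors on the ring
    return [(x[(i - 1) % m] + x[i] + x[(i + 1) % m]) / 3 for i in range(m)]

c = [1.0, 2.0, 3.0, 4.0, 5.0]         # heterogeneous local minimizers, mean 3
x = [0.0] * 5
for _ in range(150):
    x = dfedavgm_round(x, c)
# the average iterate converges to the global minimizer 3; individual clients
# stay within a residual neighborhood whose size is set by the fixed step size
```

With heterogeneous data and a fixed step size, the iterates cluster around the global minimizer but retain a residual disagreement; shrinking the step size (as the theory's $\eta\sim 1/\sqrt{T}$ choice does) shrinks that residual.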

[Figure 6: four panels plotting CR vs. test loss, CB vs. test loss, CR vs. test accuracy, and CB vs. test accuracy.]
Figure 6: Efficiency comparison between DSGD, FedAvg, and DFedAvgM in training 2NN for MNIST classification. (a) and (c): test loss and test accuracy vs. communication round. (b) and (d): test loss and test accuracy vs. communication bits. DFedAvgM performs on par with FedAvg in terms of communication rounds, but is significantly more efficient than FedAvg from the communication-cost viewpoint. CR: communication round; CB: communication bits.

6.2 LSTM for Language Modeling

We consider the SHAKESPEARE dataset and follow the same processing as in [1], resulting in a dataset distributed over 1146 clients in a Non-IID fashion. On this data, we use DFedAvgM to train a stacked character-level LSTM language model, which predicts the next character after reading each character in a line. The model takes a series of characters as input and embeds each of them into a learned 8-dimensional space. The embedded characters are then processed through two LSTM layers, each with 256 nodes. Finally, the output of the second LSTM layer is sent to a softmax output layer with one node per character. The full model has 866,578 parameters, and we train using an unroll length of 80 characters. We set the local batch size to 10 and use a learning rate of 1.47, the same as in [1]; the momentum is set to 0.9. Figure 7 plots communication round vs. test accuracy and AUC under MIA for different quantization levels and different local epochs. These results show that 1) both the accuracy and the MIA AUC increase as training proceeds; 2) higher communication cost leads to faster convergence; 3) increasing the local epochs deteriorates the performance of DFedAvgM.
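The unroll-length-80 setup can be illustrated with a minimal data-preparation sketch (our own illustration, not the paper's pipeline; the sample line is a stand-in for the SHAKESPEARE text):

```python
# Minimal sketch of character-level language-modeling data with an unroll
# length of 80 (our illustration): each training example is an 80-character
# window of integer ids, and the target is the same window shifted by one.
UNROLL = 80

def make_examples(text):
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}     # char -> integer id
    ids = [stoi[ch] for ch in text]
    examples = []
    for start in range(0, len(ids) - UNROLL, UNROLL):
        chunk = ids[start:start + UNROLL + 1]        # 81 ids: input + 1 extra
        examples.append((chunk[:-1], chunk[1:]))     # (input, next-char target)
    return examples, vocab

line = "Shall I compare thee to a summer's day? " * 5  # stand-in text
examples, vocab = make_examples(line)
# every pair has length 80, and each target is the input shifted by one char
```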

[Figure 7: two rows of two panels each, plotting CR vs. test accuracy and CR vs. AUC.]
Figure 7: Training the LSTM on SHAKESPEARE with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). Higher-precision communication slightly improves performance, and more local epochs help neither training speed nor privacy protection.

6.3 CIFAR10 Classification

Finally, we use DFedAvgM to train ResNet20 for CIFAR10 classification; the dataset consists of 10 classes of $32\times 32$ images with three channels. There are 50,000 training and 10,000 testing examples, which we partition into 20 clients uniformly, and we only consider the IID setting, following [1]. We use the same data augmentation and DNN as in [1]. The local batch size is set to 50, the learning rate to 0.01, and the momentum to 0.9. Figure 8 shows communication round vs. test accuracy and AUC under MIA for different quantization levels and different local epochs. If the local epoch is set to 1, different communication bits do not lead to a significant difference in training loss, test accuracy, or AUC under MIA. However, with the communication bits fixed to 16, increasing the local epochs from 1 to 2 or 5 makes training fail to converge.
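For reference, the kind of $b$-bit quantizer whose bit-width these experiments vary can be sketched as follows (a generic unbiased uniform quantizer of our own; the paper's exact operator may differ):

```python
# Hedged sketch of a b-bit uniform quantizer with stochastic rounding (our
# illustration): a vector is encoded by its [min, max] range plus b bits per
# entry, and stochastic rounding keeps the quantizer unbiased.
import random

def quantize(vec, b):
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)              # constant vector needs no quantization
    levels = (1 << b) - 1
    step = (hi - lo) / levels
    out = []
    for v in vec:
        t = (v - lo) / step
        k = int(t)
        k += random.random() < (t - k)   # round up with probability frac(t)
        out.append(lo + min(k, levels) * step)
    return out

w = [0.13, -0.52, 0.98, 0.0]
wq = quantize(w, 8)
# each entry is perturbed by at most one quantization step
step = (max(w) - min(w)) / 255
assert all(abs(u - q) <= step + 1e-12 for u, q in zip(w, wq))
```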

[Figure 8: two rows of three panels each, plotting CR vs. training loss, CR vs. test accuracy, and CR vs. AUC.]
Figure 8: Training ResNet20 for IID CIFAR10 classification with DFedAvgM using different communication bits with the local epoch fixed to one (first row), and different local epochs with the communication bits fixed to 16 (second row). The differently quantized variants perform almost identically; more local epochs accelerate training at the beginning but do not perform well as training continues.

7 Concluding Remarks

In this paper, we proposed DFedAvgM and its quantized version. There are two major benefits of DFedAvgM over the existing FedAvg. 1) In FedAvg, communication between the central parameter server and the local clients is required in each communication round, and this communication becomes very expensive when the number of clients is large. In contrast, in DFedAvgM communications occur only between neighboring clients, which costs significantly less than FedAvg. 2) In FedAvg, the central server collects the updated models from clients, and attacking the central server can break the privacy of the whole system. In contrast, it is conceptually harder to break privacy in DFedAvgM than in FedAvg. Furthermore, we established the theoretical convergence of DFedAvgM and its quantized version under general nonconvex assumptions, and we showed that the worst-case convergence rate of (quantized) DFedAvgM is the same as that of DSGD. In particular, we proved an improved convergence rate for (quantized) DFedAvgM when the objective functions satisfy the PŁ condition. We performed extensive numerical experiments to verify the efficacy of DFedAvgM and its quantized version in training ML models and protecting membership privacy.

References

  • [1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
  • [2] H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
  • [3] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic gradient descent,” in Advances in neural information processing systems, pp. 2595–2603, 2010.
  • [4] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
  • [5] A. Lalitha, S. Shekhar, T. Javidi, and F. Koushanfar, “Fully decentralized federated learning,” in Third workshop on Bayesian Deep Learning (NeurIPS), 2018.
  • [6] A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar, “Peer-to-peer federated learning on graphs,” arXiv preprint arXiv:1901.11173, 2019.
  • [7] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147, 2013.
  • [8] T.-M. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” arXiv preprint arXiv:1909.06335, 2019.
  • [9] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in International Conference on Learning Representations, 2021.
  • [10] T. Chen, G. Giannakis, T. Sun, and W. Yin, “Lag: Lazily aggregated gradient for communication-efficient distributed learning,” in Advances in Neural Information Processing Systems, pp. 5050–5060, 2018.
  • [11] J. Sun, T. Chen, G. Giannakis, and Z. Yang, “Communication-efficient distributed learning via lazily aggregated quantized gradients,” in Advances in Neural Information Processing Systems, pp. 3365–3375, 2019.
  • [12] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smithy, “Feddane: A federated newton-type method,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 1227–1231, IEEE, 2019.
  • [13] A. Khaled, K. Mishchenko, and P. Richtárik, “First analysis of local gd on heterogeneous data,” arXiv preprint arXiv:1909.04715, 2019.
  • [14] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-IID data,” in International Conference on Learning Representations, 2020.
  • [15] H. B. McMahan et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1, 2021.
  • [16] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” arXiv preprint, 2019.
  • [17] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Gossip algorithms: Design, analysis and applications,” in INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, vol. 3, pp. 1653–1664, IEEE, 2005.
  • [18] R. Olfati-Saber, J. A. Fax, and R. M. Murray, “Consensus and cooperation in networked multi-agent systems,” Proceedings of the IEEE, vol. 95, no. 1, pp. 215–233, 2007.
  • [19] L. Schenato and G. Gamba, “A distributed consensus protocol for clock synchronization in wireless sensor network,” in 2007 46th ieee conference on decision and control, pp. 2289–2294, IEEE, 2007.
  • [20] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione, “Broadcast gossip algorithms for consensus,” IEEE Transactions on Signal processing, vol. 57, no. 7, pp. 2748–2761, 2009.
  • [21] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
  • [22] A. I. Chen and A. Ozdaglar, “A fast distributed proximal-gradient method,” in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 601–608, IEEE, 2012.
  • [23] D. Jakovetić, J. Xavier, and J. M. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
  • [24] I. Matei and J. S. Baras, “Performance evaluation of the consensus-based distributed subgradient method under random communication topologies,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 4, pp. 754–771, 2011.
  • [25] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
  • [26] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
  • [27] B. Sirb and X. Ye, “Consensus optimization with delayed and stochastic gradients on decentralized networks,” in Big Data (Big Data), 2016 IEEE International Conference on, pp. 76–85, IEEE, 2016.
  • [28] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.
  • [29] X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proceedings of the 35th International Conference on Machine Learning, pp. 3043–3052, 2018.
  • [30] T. Sun, P. Yin, D. Li, C. Huang, L. Guan, and H. Jiang, “Non-ergodic convergence analysis of heavy-ball algorithms,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 5033–5040, 2019.
  • [31] R. Xin and U. A. Khan, “Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking,” IEEE Transactions on Automatic Control, 2019.
  • [32] A. Reisizadeh, A. Mokhtari, H. Hassani, and R. Pedarsani, “Quantized decentralized consensus optimization,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 5838–5843, IEEE, 2018.
  • [33] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, Federated learning. Morgan & Claypool Publishers, 2019.
  • [34] H. Xing, O. Simeone, and S. Bi, “Decentralized federated learning via SGD over wireless D2D networks,” in 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, IEEE, 2020.
  • [35] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of the 1 st Adaptive & Multitask Learning Workshop, Long Beach, California, 2019.
  • [36] S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
  • [37] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing markov chain on a graph,” SIAM review, vol. 46, no. 4, pp. 667–689, 2004.
  • [38] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” Ussr Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [39] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems, pp. 19–27, 2014.
  • [40] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
  • [41] S. Magnússon, H. Shokri-Ghadikolaei, and N. Li, “On maintaining linear convergence of distributed learning and optimization under limited communication,” IEEE Transactions on Signal Processing, vol. 68, pp. 6101–6116, 2020.
  • [42] H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811, Springer, 2016.
  • [43] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in International conference on machine learning, pp. 314–323, 2016.
  • [44] D. J. Foster, A. Sekhari, and K. Sridharan, “Uniform convergence of gradients for non-convex learning and optimization,” in Advances in Neural Information Processing Systems, pp. 8745–8756, 2018.
  • [45] B. T. Polyak, “Gradient methods for minimizing functionals,” Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.
  • [46] S. Lojasiewicz, “A topological property of real analytic subsets,” Coll. du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89, 1963.
  • [47] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal estimated sub-gradient solver for svm,” Mathematical programming, vol. 127, no. 1, pp. 3–30, 2011.
  • [48] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
  • [49] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes, “Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models,” In Annual Network and Distributed System Security Symposium (NDSS 2019), 2019.