
Federated Stochastic Gradient Descent Begets Self-Induced Momentum

Abstract

Federated learning (FL) is an emerging machine learning method that can be applied in mobile edge systems, in which a server and a host of clients collaboratively train a statistical model utilizing the data and computation resources of the clients without directly exposing their privacy-sensitive data. We show that running stochastic gradient descent (SGD) in such a setting can be viewed as adding a momentum-like term to the global aggregation process. Based on this finding, we further analyze the convergence rate of a federated learning system by accounting for the effects of parameter staleness and communication resources. These results advance the understanding of the Federated SGD algorithm and also forge a link between staleness analysis and federated computing systems, which can be useful to system designers.

Index Terms—  Federated learning, stochastic gradient descent (SGD), momentum, convergence rate.

1 Introduction

Federated learning (FL) is a machine learning paradigm that allows a computing unit, i.e., an edge server, to train a statistical model from data stored on a swarm of end-user entities, i.e., the clients, without directly accessing the clients' local datasets [1]. Specifically, instead of aggregating all the data at the server for training, FL brings the machine learning models directly to the clients for local computing, and only the resulting parameters are uploaded to the server for global aggregation, after which an improved model is sent back to the clients for another round of local training [2]. Such a training process usually converges after sufficient rounds of parameter exchanges and computing among the server and clients, upon which all the participants benefit from a better machine learning model [3, 4, 5]. As a result, the salient feature of on-device training mitigates many systemic privacy risks as well as communication overheads, making FL particularly relevant for next-generation mobile networks [6, 7, 8]. Nonetheless, in the setting of FL, the server usually needs to link up a massive number of clients via a resource-limited medium, e.g., the spectrum, and hence only a limited number of clients can be selected to participate in the federated training in each communication round [9, 10, 11, 12]. This, together with the fact that the time spent transmitting the parameters can be orders of magnitude greater than that of local computations [13, 14], makes stragglers a serious issue in FL. To that end, a simple but effective approach has been proposed [15]: reusing the outdated parameters in the global aggregation stage so as to improve training efficiency. The gain of this scheme has been amply demonstrated via experiments, while the intrinsic rationale behind it remains unclear. In this paper, we take stochastic gradient descent (SGD)-based FL training as an example, show that reusing the outdated parameters implicitly introduces a momentum-like term into the global updating process, and derive the resulting convergence rate of federated computing. These results advance the understanding of FL and may be useful in guiding further research in this area.

2 System Model

Let us consider an FL system consisting of one server and $K$ clients, as depicted in Fig. 1, where $K$ is usually a large number. Each client $k$ has a local dataset $\mathcal{D}_{k}=\{\mathbf{x}_{i}\in\mathbb{R}^{d},y_{i}\in\mathbb{R}\}_{i=1}^{n_{k}}$ of size $|\mathcal{D}_{k}|=n_{k}$, and we assume the local datasets are statistically independent across the clients. The goal of the server and clients is to jointly learn a statistical model over the datasets residing on all the clients without sacrificing their privacy. More concretely, the server aims to fit a vector $\mathbf{w}\in\mathbb{R}^{d}$ that minimizes the following loss function without having explicit knowledge of $\mathcal{D}=\cup_{k=1}^{K}\mathcal{D}_{k}$:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}}f(\mathbf{w})=\frac{1}{n}\sum_{i=1}^{n}\ell(\mathbf{w};\mathbf{x}_{i},y_{i})=\sum_{k=1}^{K}\frac{n_{k}}{n}\cdot\frac{1}{n_{k}}\sum_{j=1}^{n_{k}}\ell(\mathbf{w};\mathbf{x}_{j},y_{j})=\sum_{k=1}^{K}p_{k}f_{k}(\mathbf{w})\qquad(1)$$

where $n=\sum_{k=1}^{K}n_{k}$, $p_{k}=n_{k}/n$, $\ell(\cdot)$ is the per-sample loss function, and $f_{k}(\mathbf{w})=\frac{1}{n_{k}}\sum_{j=1}^{n_{k}}\ell(\mathbf{w};\mathbf{x}_{j},y_{j})$ is the local empirical loss function of client $k$.
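The decomposition in (1) is what allows the computation to be federated: each client can evaluate its own empirical loss $f_{k}$, and the server only needs the $p_{k}$-weighted sum. The following NumPy snippet is a minimal numerical check of this identity; the quadratic loss and dataset sizes are our own illustrative choices, not part of the system model.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)

# Three clients with different dataset sizes n_k; loss l(w; x, y) = 0.5*(x^T w - y)^2
clients = [(rng.normal(size=(n_k, 3)), rng.normal(size=n_k)) for n_k in (10, 20, 70)]
n = sum(len(y) for _, y in clients)

def local_loss(w, X, y):
    # f_k(w): the local empirical loss averaged over the client's n_k samples
    return 0.5 * np.mean((X @ w - y) ** 2)

# Global loss computed directly over the pooled data ...
X_all = np.vstack([X for X, _ in clients])
y_all = np.concatenate([y for _, y in clients])
f_direct = local_loss(w, X_all, y_all)

# ... equals the p_k-weighted sum of local losses, per eq. (1)
f_federated = sum((len(y) / n) * local_loss(w, X, y) for X, y in clients)
print(np.isclose(f_direct, f_federated))  # True
```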

Fig. 1: An illustration of Federated SGD training: (A) clients leverage their local datasets to evaluate the gradient term, (B) the server aggregates the received updates to produce a new global model, (C) the new model is sent back to the clients, and the process is repeated.

Because the server has no direct access to the individual datasets, the model training needs to be carried out by the clients in a federated fashion. In this paper, we adopt Federated SGD, a widely used mechanism, for this task; the details are summarized in Algorithm 1 [2]. Specifically, at iteration $t$, the server sends the global model $\mathbf{w}^{t}$ to a subset $S_{t}$ of clients for on-device model training, where in general $N=|S_{t}|\ll K$ because the limited communication resources cannot support simultaneous transmissions from a vast number of clients [9]. Upon receiving $\mathbf{w}^{t}$, the selected clients leverage it to evaluate the gradient of the local empirical loss, by means of an $H$-step estimation, and upload the estimated gradients $\mathbf{g}^{t}_{k}$, $k\in S_{t}$. In essence, this amounts to computing a stochastic gradient with a batch size of $H$ data points. Finally, the server aggregates the collected parameters to produce a new output per (4). Such an orchestration among the server and clients repeats for a sufficient number of communication rounds until the learning process converges.

It is worth noting that the gradient aggregation step (3) in Algorithm 1 utilizes not only the fresh updates collected from the selected clients but also the outdated gradients from the unselected ones. As will be shown later, this procedure, in essence, induces an implicit momentum into the learning process.

Algorithm 1 Federated SGD Algorithm
1: Parameters: $H$ = number of local steps per communication round, $\eta$ = step size for stochastic gradient descent
2: Initialize: $\mathbf{w}^{0}\in\mathbb{R}^{d}$
3: for $t=0,1,2,\ldots,T-1$ do
4:     The server randomly selects a set $S_{t}$ of $N$ clients and broadcasts the global parameter $\mathbf{w}^{t}$ to them
5:     for each client $k\in S_{t}$ in parallel do
6:         Initialize $\mathbf{g}_{k}^{t,0}=\mathbf{0}$
7:         for $s=0$ to $H-1$ do
8:             Sample $(\mathbf{x}_{i},y_{i})\in\mathcal{D}_{k}$ uniformly at random, and update the local estimate of the gradient, $\mathbf{g}^{t,s}_{k}$, as follows:
$$\mathbf{g}_{k}^{t,s+1}=\mathbf{g}_{k}^{t,s}+\nabla\ell(\mathbf{w}^{t};\mathbf{x}_{i},y_{i})\qquad(2)$$
9:         Set $\mathbf{g}_{k}^{t}=\mathbf{g}_{k}^{t,H}/H$ and send the parameter back to the server
10:     The server collects all the updates $\{\mathbf{g}^{t}_{i}\}_{i\in S_{t}}$ and assigns $\mathbf{g}^{t}_{j}=\mathbf{g}^{t-1}_{j}$ for all $j\notin S_{t}$. Then, the server updates both the estimate of the gradient, $\mathbf{g}^{t}$, and the parameter, $\mathbf{w}^{t+1}$, as follows:
$$\mathbf{g}^{t}=\sum_{k=1}^{K}\frac{n_{k}}{n}\,\mathbf{g}^{t}_{k},\qquad(3)$$
$$\mathbf{w}^{t+1}=\mathbf{w}^{t}-\eta\,\mathbf{g}^{t}\qquad(4)$$
11: Output: $\mathbf{w}^{T}$
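For concreteness, the following is a minimal NumPy sketch of Algorithm 1 on a toy least-squares problem. The objective, the function name `federated_sgd`, and all hyperparameter values are our own illustrative choices, not part of the original specification; the key detail is the server-side cache, which implements the reuse of outdated gradients in step 10.

```python
import numpy as np

def federated_sgd(clients, d, N, H, eta, T, rng):
    """Sketch of Algorithm 1 with stale-gradient reuse at the server."""
    K = len(clients)
    n_k = np.array([len(y) for _, y in clients], dtype=float)
    p = n_k / n_k.sum()              # aggregation weights p_k = n_k / n
    w = np.zeros(d)                  # global model w^0
    g_cache = np.zeros((K, d))       # last gradient received from each client
                                     # (zero before a client's first selection)
    for t in range(T):
        S_t = rng.choice(K, size=N, replace=False)  # uniform client selection
        for k in S_t:
            X, y = clients[k]
            g = np.zeros(d)
            for _ in range(H):       # H-step gradient estimation, eq. (2)
                i = rng.integers(len(y))
                g += (X[i] @ w - y[i]) * X[i]  # grad of 0.5*(x_i^T w - y_i)^2
            g_cache[k] = g / H       # fresh update g_k^t, step 9
        w = w - eta * (p @ g_cache)  # eqs. (3)-(4): fresh + stale aggregation
    return w

# Usage on synthetic linear data
rng = np.random.default_rng(0)
K, d = 20, 5
w_true = rng.normal(size=d)
clients = []
for _ in range(K):
    X = rng.normal(size=(50, d))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=50)))
w_hat = federated_sgd(clients, d, N=5, H=10, eta=0.05, T=500, rng=rng)
print(np.linalg.norm(w_hat - w_true))  # should be close to 0
```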

3 Analysis

This section comprises the main technical part of this paper, in which we analytically characterize the updating process of global parameters and derive the convergence rate of the Federated SGD algorithm.

3.1 Update Process of Global Parameters

Due to limited communication resources, the server can only select a subset of the clients to conduct local computing and update their gradients in every round of global iteration. As a result, the gradients of the unselected clients become stale. In accordance with (3) and (4), after the $t$-th communication round, the update of the global parameters at the server side can be rewritten as follows:

$$\mathbf{w}^{t+1}=\mathbf{w}^{t}-\eta\sum_{k=1}^{K}p_{k}\,\mathbf{g}^{t-\tau_{k}}_{k}\qquad(5)$$

in which $\tau_{k}$ is the staleness of the parameters corresponding to the $k$-th client. Because the clients participating in FL are selected uniformly at random in each communication round, the staleness values $\{\tau_{k}\}_{k=1}^{K}$ can be abstracted as independent and identically distributed (i.i.d.) random variables, each following a geometric distribution:

$$\mathbb{P}(\tau_{k}=l)=\beta^{l}(1-\beta),\quad l=0,1,2,\ldots\qquad(6)$$

where $\beta=1-N/K$.
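As a quick sanity check of (6), one can simulate the selection process and compare the empirical staleness distribution of a single client against the geometric law; the parameter values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, T = 20, 5, 200_000
beta = 1 - N / K

# Track the staleness tau of client 0's gradient across T selection rounds.
staleness, tau = [], 0
for _ in range(T):
    if 0 in rng.choice(K, size=N, replace=False):
        tau = 0        # freshly selected this round
    else:
        tau += 1       # cached gradient ages by one round
    staleness.append(tau)

empirical = np.bincount(staleness)[:5] / T
geometric = (1 - beta) * beta ** np.arange(5)
print(np.round(empirical, 4))   # the two rows should nearly coincide
print(np.round(geometric, 4))
```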

These considerations bring us to our first result.

Lemma 1.

Under the depicted FL framework, the parameter updating process satisfies the following relationship:

$$\mathbb{E}\big[\mathbf{w}^{t+1}-\mathbf{w}^{t}\big]=\beta\,\mathbb{E}\big[\mathbf{w}^{t}-\mathbf{w}^{t-1}\big]-(1-\beta)\eta\,\mathbb{E}\big[\mathbf{g}^{t}\big].\qquad(7)$$
Proof.

Using (5), we can subtract $\mathbf{w}^{t}$ from $\mathbf{w}^{t+1}$ and obtain the following:

$$\mathbf{w}^{t+1}-\mathbf{w}^{t}=\mathbf{w}^{t}-\mathbf{w}^{t-1}-\eta\sum_{k=1}^{K}p_{k}\big(\mathbf{g}^{t-\tau_{k}}_{k}-\mathbf{g}^{t-\tau_{k}-1}_{k}\big).\qquad(8)$$

By taking the expectation with respect to the staleness $\tau_{k}$, $k\in\{1,\cdots,K\}$, on both sides of the above equation, the following holds:

$$\mathbb{E}\big[\mathbf{w}^{t+1}-\mathbf{w}^{t}\big]=\mathbb{E}\big[\mathbf{w}^{t}-\mathbf{w}^{t-1}\big]-\eta\sum_{k=1}^{K}p_{k}\underbrace{\mathbb{E}\big[\mathbf{g}^{t-\tau_{k}}_{k}-\mathbf{g}^{t-\tau_{k}-1}_{k}\big]}_{Q_{1}}.\qquad(9)$$

Since $\tau_{k}\sim\mathrm{Geo}(1-\beta)$, we can calculate $Q_{1}$ as

$$Q_{1}=(1-\beta)\,\mathbb{E}\big[\mathbf{g}^{t}_{k}\big]+\sum_{l=0}^{\infty}(1-\beta)\beta^{l+1}\,\mathbb{E}\big[\mathbf{g}^{t-l-1}_{k}\big]-\sum_{l=0}^{\infty}(1-\beta)\beta^{l}\,\mathbb{E}\big[\mathbf{g}^{t-l-1}_{k}\big]$$
$$\;\;=(1-\beta)\,\mathbb{E}\big[\mathbf{g}^{t}_{k}\big]-\sum_{l=0}^{\infty}(1-\beta)^{2}\beta^{l}\,\mathbb{E}\big[\mathbf{g}^{t-l-1}_{k}\big].\qquad(10)$$

Furthermore, by noticing that, for the stochastic gradient of each client $k$, the following result holds:

$$\mathbb{E}\big[\mathbf{g}_{k}^{t-\tau_{k}-1}\big]=\sum_{l=0}^{\infty}(1-\beta)\beta^{l}\,\mathbb{E}\big[\mathbf{g}_{k}^{t-l-1}\big],\qquad(11)$$

we have

$$\eta\sum_{k=1}^{K}p_{k}\,\mathbb{E}\big[\mathbf{g}_{k}^{t-\tau_{k}}-\mathbf{g}_{k}^{t-\tau_{k}-1}\big]=(1-\beta)\eta\sum_{k=1}^{K}p_{k}\,\mathbb{E}\big[\mathbf{g}_{k}^{t}\big]-(1-\beta)\eta\sum_{k=1}^{K}p_{k}\sum_{l=0}^{\infty}(1-\beta)\beta^{l}\,\mathbb{E}\big[\mathbf{g}_{k}^{t-l-1}\big]$$
$$=(1-\beta)\eta\,\mathbb{E}\big[\mathbf{g}^{t}\big]-(1-\beta)\eta\sum_{k=1}^{K}p_{k}\,\mathbb{E}\big[\mathbf{g}_{k}^{t-\tau_{k}-1}\big]\stackrel{(a)}{=}(1-\beta)\eta\,\mathbb{E}\big[\mathbf{g}^{t}\big]+(1-\beta)\,\mathbb{E}\big[\mathbf{w}^{t}-\mathbf{w}^{t-1}\big],\qquad(12)$$

where (a) follows from (5). Finally, by substituting (12) into (9), we complete the proof. ∎

From Lemma 1, we can identify a momentum-like term, namely $\beta\,\mathbb{E}[\mathbf{w}^{t}-\mathbf{w}^{t-1}]$, when the global parameter is updated from $\mathbf{w}^{t}$ to $\mathbf{w}^{t+1}$. This can mainly be attributed to the reuse of gradients, which introduces memory into the global aggregation step and keeps the parameter vector $\mathbf{w}^{t+1}$ close to the current server model $\mathbf{w}^{t}$. Notably, such a phenomenon is also observable in the context of fully asynchronous SGD algorithms for similar reasons [16]. As a result, Lemma 1 can serve as a useful reference for adjusting the controlling factor if one intends to accelerate the Federated SGD algorithm by running it in conjunction with an explicit momentum term [16, 17, 18]. Besides, if delayed gradient averaging, such as in [19], is employed, the design of the gradient correction should take into account the effect of this implicit momentum as well.
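To make this interpretation explicit, we can unroll the recursion in (7); adopting the illustrative convention $\mathbf{w}^{0}=\mathbf{w}^{-1}$ (an assumption we add here for this short side derivation), repeated substitution gives

$$\mathbb{E}\big[\mathbf{w}^{t+1}-\mathbf{w}^{t}\big]=-(1-\beta)\eta\sum_{s=0}^{t}\beta^{s}\,\mathbb{E}\big[\mathbf{g}^{t-s}\big],$$

i.e., the expected update is an exponentially weighted moving average of the past gradients. In expectation, Federated SGD with stale-gradient reuse therefore behaves like heavy-ball momentum SGD with momentum parameter $\beta=1-N/K$ and effective step size $(1-\beta)\eta=(N/K)\eta$.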

In the sequel, we quantify the effect of this implicit momentum on the convergence performance of the FL system.

3.2 Convergence Analysis

To facilitate the analysis of the FL convergence rate, we make the following assumption on the structure of the global empirical loss function.

Assumption 1.

The gradient of each $f_{k}$ is Lipschitz continuous with a constant $L>0$, i.e., for any $\mathbf{w},\mathbf{v}\in\mathbb{R}^{d}$ the following is satisfied:

$$\|\nabla f_{k}(\mathbf{w})-\nabla f_{k}(\mathbf{v})\|\leq L\|\mathbf{w}-\mathbf{v}\|.\qquad(13)$$

This assumption is standard in the machine learning literature and is satisfied by a wide range of machine learning models, such as SVMs, logistic regression, and neural networks. Besides, no assumption regarding the convexity of the objective function is made. We further leverage a notion termed gradient coherence to track the variation of the gradient directions during the training process; it is defined as follows [20].

Definition 1.

The gradient coherence at communication round $t$ is defined as

$$\mu_{t}=\min_{0\leq s\leq t}\frac{\langle\nabla f(\mathbf{w}^{s}),\nabla f(\mathbf{w}^{t})\rangle}{\|\nabla f(\mathbf{w}^{s})\|^{2}}.\qquad(14)$$

The gradient coherence characterizes the largest deviation in direction between the current gradient and the gradients of the past iterations. As such, if $\mu_{t}$ is positive, then the direction of the current gradient is well aligned with those of the previous ones, and hence reusing the previously trained parameters can push the global parameter vector forward toward the optimum.
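As an illustration, the gradient coherence in (14) can be tracked during training with a few lines of code. The quadratic toy objective below is our own example; for it, gradient descent keeps all gradients positively aligned, so $\mu_{t}$ remains positive.

```python
import numpy as np

def gradient_coherence(grads):
    """mu_t per eq. (14), given grads = [grad f(w^0), ..., grad f(w^t)]."""
    g_t = grads[-1]
    return min(np.dot(g_s, g_t) / np.dot(g_s, g_s) for g_s in grads)

# Toy check: gradient descent on f(w) = 0.5*||w||^2, for which grad f(w) = w.
w, grads = np.array([1.0, -2.0]), []
for t in range(5):
    grads.append(w.copy())
    print(t, round(gradient_coherence(grads), 4))  # positive, shrinking as 0.9^t
    w -= 0.1 * w                                   # gradient step
```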

Theorem 1.

Suppose the gradient coherence $\mu_{t}$ is lower bounded by some $\mu>0$ for all $t$ and the variance of the stochastic gradient is upper bounded by $\sigma^{2}>0$. If we choose the step size $\eta=1/\sqrt{LT}$, then after $T$ rounds of communication, Algorithm 1 converges as follows:

$$\min_{0\leq t\leq T-1}\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]\leq\frac{2\sqrt{L}\big[f(\mathbf{w}^{0})-f(\mathbf{w}^{*})+\sigma^{2}\big]}{\big[1-(1-\mu)\beta\big]\sqrt{T}}\qquad(15)$$

in which $\mathbf{w}^{*}=\arg\min_{\mathbf{w}\in\mathbb{R}^{d}}f(\mathbf{w})$.

Proof.

Following Assumption 1, we know that the empirical loss function $f(\cdot)$ is $L$-smooth, and hence after the $t$-th round of global parameter update the following holds:

$$\mathbb{E}\big[f(\mathbf{w}^{t+1})\big]\leq\mathbb{E}\big[f(\mathbf{w}^{t})\big]+\underbrace{\mathbb{E}\big[\langle\mathbf{w}^{t+1}-\mathbf{w}^{t},\nabla f(\mathbf{w}^{t})\rangle\big]}_{Q_{1}}+\frac{L}{2}\underbrace{\mathbb{E}\big[\|\mathbf{w}^{t+1}-\mathbf{w}^{t}\|^{2}\big]}_{Q_{2}}.\qquad(16)$$

Using Lemma 1, we can expand the terms in $Q_{1}$ and obtain the following:

$$Q_{1}\stackrel{(7)}{=}\mathbb{E}\big[\langle\beta(\mathbf{w}^{t}-\mathbf{w}^{t-1})-\eta(1-\beta)\mathbf{g}^{t},\nabla f(\mathbf{w}^{t})\rangle\big]$$
$$\stackrel{(a)}{=}\beta\,\mathbb{E}\big[\langle\mathbf{w}^{t}-\mathbf{w}^{t-1},\nabla f(\mathbf{w}^{t})\rangle\big]-\eta(1-\beta)\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]$$
$$\stackrel{(b)}{=}-\eta\beta\,\mathbb{E}\big[\langle\nabla f(\mathbf{w}^{t-\tau-1}),\nabla f(\mathbf{w}^{t})\rangle\big]-\eta(1-\beta)\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]$$
$$\leq-\eta\beta\mu\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]-\eta(1-\beta)\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]=-\eta\big[1-(1-\mu)\beta\big]\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]\qquad(17)$$

where (a) follows by noticing that $\mathbb{E}[\mathbf{g}^{t}]=\nabla f(\mathbf{w}^{t})$, and in (b) we use the fact that the random variables $\{\tau_{k}\}_{k=1}^{K}$ are identically distributed, and hence unify them by introducing a random variable $\tau$ that equals $\tau_{k}$ in distribution. On the other hand, as the stochastic gradient has a bounded variance, $Q_{2}$ can be evaluated as

$$Q_{2}=\mathbb{E}\Big[\big\|\eta\sum_{k=1}^{K}p_{k}\mathbf{g}_{k}^{t-\tau_{k}}\big\|^{2}\Big]=\eta^{2}\,\mathbb{E}\Big[\big\|\sum_{k=1}^{K}p_{k}\mathbf{g}_{k}^{t-\tau_{k}}-\nabla f(\mathbf{w}^{t})+\nabla f(\mathbf{w}^{t})\big\|^{2}\Big]\leq 2\eta^{2}\big(\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]+\sigma^{2}\big).\qquad(18)$$

By substituting (17) and (18) back into (16) and telescoping $t$ from 0 to $T-1$, we have

$$\mathbb{E}\big[f(\mathbf{w}^{T})\big]-f(\mathbf{w}^{0})\leq L\sum_{t=0}^{T-1}\eta^{2}\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]+\sum_{t=0}^{T-1}L\eta^{2}\sigma^{2}-\big(1-(1-\mu)\beta\big)\sum_{t=0}^{T-1}\eta\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big].\qquad(19)$$

Further rearranging the terms of the above inequality yields

$$\sum_{t=0}^{T-1}\big[\big(1-(1-\mu)\beta\big)\eta-L\eta^{2}\big]\,\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]\leq f(\mathbf{w}^{0})-\mathbb{E}\big[f(\mathbf{w}^{T})\big]+L\sigma^{2}\sum_{t=0}^{T-1}\eta^{2}.\qquad(20)$$

Note that $\mathbb{E}[f(\mathbf{w}^{T})]\geq f(\mathbf{w}^{*})$ and $\eta=1/\sqrt{LT}$, and so we have

$$\min_{0\leq t\leq T-1}\mathbb{E}\big[\|\nabla f(\mathbf{w}^{t})\|^{2}\big]\leq\frac{f(\mathbf{w}^{0})-f(\mathbf{w}^{*})+L\sigma^{2}\sum_{t=0}^{T-1}\eta^{2}}{\sum_{t=0}^{T-1}\big[\big(1-(1-\mu)\beta\big)\eta-L\eta^{2}\big]}=\frac{f(\mathbf{w}^{0})-f(\mathbf{w}^{*})+\sigma^{2}}{\big(1-(1-\mu)\beta\big)\sqrt{T/L}-1}.\qquad(21)$$

Finally, when $T$ is taken to be sufficiently large, we have

$$\big(1-(1-\mu)\beta\big)\sqrt{T/L}-1\geq\frac{\big(1-(1-\mu)\beta\big)\sqrt{T}}{2\sqrt{L}}\qquad(22)$$

and the result follows. ∎

Following Theorem 1, several observations can be made: (i) for non-convex objective functions, Federated SGD converges to stationary points at a rate on the order of $1/\sqrt{T}$; (ii) the staleness of parameters impacts the convergence rate through the multiplicative constant, which reveals that when communication resources are abundant, i.e., the server can select many clients for parameter updates in each iteration, $N$ increases, which in turn reduces $\beta$ and results in a faster convergence rate, and vice versa; and (iii) this result also provides further evidence for the claim that having more clients participate in each round of FL training is instrumental in speeding up model convergence [21, 22].¹

¹A few simulation examples that corroborate these observations are available at: https://person.zju.edu.cn/person/attachments/2022-01/01-1641711371-850767.pdf
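Observation (ii) can also be seen numerically by evaluating the multiplicative constant $1/[1-(1-\mu)\beta]$ in (15); the values of $K$ and $\mu$ below are illustrative assumptions of ours.

```python
K, mu = 100, 0.2
for N in (5, 10, 25, 50, 100):
    beta = 1 - N / K                                 # staleness parameter shrinks with N
    print(N, round(1 / (1 - (1 - mu) * beta), 2))    # smaller constant = faster rate
```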

4 Conclusion

In this paper, we have carried out an analytical study toward a deeper understanding of the FL system. For the Federated SGD algorithm, which uses both fresh and outdated gradients in the aggregation stage, we have shown that this reuse implicitly introduces a momentum-like term into the update of the global parameters. We have also analyzed the convergence rate of the algorithm by taking into account the parameter staleness and communication resources. Our results confirm that increasing the number of selected clients in each communication round accelerates the convergence of the FL algorithm through a reduction in the staleness of parameters. The analysis does not assume convexity of the objective function and hence is applicable even to the setting of deep learning systems. The developed framework reveals a link between staleness analysis and the FL convergence rate, and may be useful for further research in this area.

References

  • [1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Stat. (AISTATS), Fort Lauderdale, FL, USA, Apr. 2017, pp. 1273–1282.
  • [2] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surv. & Tut., vol. 22, no. 3, pp. 2031–2063, Third Quarter, 2020.
  • [3] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” Available as ArXiv:1610.05492, 2016.
  • [4] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019.
  • [5] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), 2020, pp. 4519–4529.
  • [6] Z. Zhao, C. Feng, H. H. Yang, and X. Luo, “Federated learning-enabled intelligent fog-radio access networks: Fundamental theory, key techniques, and future trends,” IEEE Wireless Commun. Mag., vol. 27, no. 2, pp. 22–28, Apr. 2020.
  • [7] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proc. IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
  • [8] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. J. A. Zhang, “The roadmap to 6G–AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
  • [9] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, Jan. 2020.
  • [10] H. H. Yang, A. Arafa, T. Q. S. Quek, and H. V. Poor, “Age-based scheduling policy for federated learning in mobile edge networks,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Barcelona, Spain, 2020, pp. 8743–8747.
  • [11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
  • [12] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
  • [13] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Math. Program., pp. 1–48, Dec. 2018.
  • [14] Y. Arjevani, O. Shamir, and N. Srebro, “A tight convergence analysis for stochastic gradient descent with delayed updates,” in Algorithmic Learning Theory, 2020, pp. 111–132.
  • [15] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Adv. Neural Inf. Process. Syst. (NeurIPS), Montreal, Canada, Dec. 2018.
  • [16] I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré, “Asynchrony begets momentum, with an application to deep learning,” in Proc. 54th Annu. Allerton Conf. Commun., Control, and Comput. (Allerton), Monticello, IL, Sept. 2016, pp. 997–1004.
  • [17] W. Liu, L. Chen, Y. Chen, and W. Zhang, “Accelerating federated learning via momentum gradient descent,” IEEE Trans. Parallel and Distrib. Syst., vol. 31, no. 8, pp. 1754–1766, Aug. 2020.
  • [18] Z. Huo, Q. Yang, B. Gu, L. Carin, and H. Huang, “Faster on-device training using new federated momentum algorithm,” Available as ArXiv:2002.02090, 2020.
  • [19] L. Zhu, H. Lin, Y. Lu, Y. Lin, and S. Han, “Delayed gradient averaging: Tolerate the communication latency for federated learning,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2021.
  • [20] W. Dai, Y. Zhou, N. Dong, H. Zhang, and E. P. Xing, “Toward understanding the impact of staleness in distributed machine learning,” in Proc. Int. Conf. Learn. Represent. (ICLR), New Orleans, Louisiana, May 2019, pp. 1–6.
  • [21] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [22] T. Nishio and R. Yonetani, “Client selection for federated learning with heterogeneous resources in mobile edge,” in Proc. IEEE Int. Conf. Commun., Shanghai, China, May 2019, pp. 1–7.