
Communication-Efficient Local SGD with
Age-Based Worker Selection

Feng Zhu, Jingjing Zhang, and Xin Wang

The authors are with the Key Laboratory for Information Science of Electromagnetic Waves (MoE), Department of Communication Science and Engineering, Fudan University, Shanghai 200433, China (e-mail: {20210720072, jingjingzhang, xwang11}@fudan.edu.cn).
Abstract

A major bottleneck of distributed learning under the parameter-server (PS) framework is the communication cost caused by frequent bidirectional transmissions between the PS and the workers. To address this issue, local stochastic gradient descent (SGD) and worker selection have been exploited to reduce the communication frequency and the number of participating workers at each round, respectively. However, partial participation can be detrimental to the convergence rate, especially for heterogeneous local datasets. In this paper, to improve communication efficiency and speed up the training process, we develop a novel worker selection strategy named AgeSel. The key enabler of AgeSel is the use of the ages of workers to balance their participation frequencies. The convergence of local SGD with the proposed age-based partial worker participation is rigorously established. Simulation results demonstrate that the proposed AgeSel strategy can significantly reduce the number of training rounds needed to achieve a targeted accuracy, as well as the communication cost. The influence of the algorithm hyper-parameters is also explored to further illustrate the benefit of age-based worker selection.

Index Terms:
Local SGD, communication efficiency, age-based worker selection, distributed learning

I Introduction

The parameter-server (PS) setting is one of the most popular paradigms in distributed machine learning. In this setting, the PS broadcasts the current global model parameter to the workers for gradient computation and aggregates the computed gradients to update the global model. This operation repeats until some targeted convergence criterion is reached [1, 2, 3, 4]. However, the massive communication overhead between the PS and the workers has become the main bottleneck of the overall system performance, as the sizes of neural networks continue to grow rapidly [5].

In order to address the communication issue, [6] proposes a widely-studied algorithm named federated averaging (FedAvg), where the PS randomly selects a subset of workers and sends the global model to the selected workers for a certain number of local updates at each round. The PS then collects the latest local models to update the global model and re-sends it to the newly selected workers for further local updates. By decreasing the communication frequency and the number of participating workers, FedAvg achieves improved communication efficiency, with its convergence analysis extensively studied under both homogeneous and heterogeneous data distributions in, e.g., [7, 8, 9, 10, 11, 12]. Additionally, [13] proposes a robust aggregation scheme to improve the performance of FedAvg.

To further improve communication efficiency, adaptive worker selection techniques have attracted increasing attention recently, as an alternative to the random selection used in FedAvg. One of the representative works on adaptive selection in SGD-based distributed learning is [14], in which a worker uploads its gradient only if its contribution (i.e., the change to the global model) is large enough or its local gradient has become overly stale. The optimal client sampling (OCS) strategy is further developed in [15], which selects the workers with larger gradient norms to minimize the variance of the global update. In addition, another metric of "contribution", i.e., the local loss of workers, is also explored as a criterion for worker selection design [16, 17, 18].

On the other hand, imbalanced participation can render the training unstable or slow, especially under heterogeneous data distribution, since the influence of some parts of the data on the overall training process is weakened [19, 20]. Hence, the age of each worker, i.e., the number of consecutive rounds during which it has not participated in the computation, can be taken into consideration by adaptive selection techniques. To this end, [22] explores the influence of ages in the gradient-descent (GD) setting. For distributed learning over wireless connections, [21] jointly leverages the channel quality and the ages of workers to improve communication efficiency; with perfect lossless channels, that scheme reduces to the round-robin policy. Moreover, [23] finds the age-optimal number of workers to select at each round, where the age is defined as the sum of the computation time and the uplink transmission time.

In this paper, we propose a novel age-based worker selection strategy for local SGD. In the proposed AgeSel scheme, a simple age-based mechanism forces a worker to be selected if it has not participated for a certain number of consecutive rounds. Different from existing adaptive worker selection strategies, AgeSel relies on age information that is readily available at the PS without additional communication or computation overhead. The convergence of AgeSel is also rigorously established to justify the benefit of age-based partial worker participation. Simulation results corroborate the superiority of AgeSel over state-of-the-art schemes in terms of communication efficiency and the number of training rounds required to achieve a targeted accuracy.

Notations. $\mathbb{R}$ denotes the field of real numbers; $\mathbb{E}$ denotes the expectation operator; $\|\cdot\|$ denotes the $\ell_{2}$ norm; $\nabla F$ denotes the gradient of function $F$; $\bigcup$ denotes the union of sets; $\mathcal{A}\subseteq\mathcal{B}$ represents that set $\mathcal{A}$ is a subset of set $\mathcal{B}$; and $|\mathcal{A}|$ denotes the size of set $\mathcal{A}$.

II Problem Formulation

II-A System Model

Consider the PS-based framework of distributed learning with heterogeneous data distribution. There are $M$ distributed workers in the set $\mathcal{M}\triangleq\{1,\dots,M\}$. Each worker $m$ maintains a local dataset $\mathcal{D}_{m}$ of size $N_{m}$, drawn from the global dataset $\mathcal{D}=\{z_{i}\}_{i=1}^{N}$, i.e., we have $\mathcal{D}=\bigcup_{m\in\mathcal{M}}\mathcal{D}_{m}$. Our objective is to minimize the weighted loss function

$$\mathcal{L}(\boldsymbol{\theta})=\sum_{m=1}^{M}\frac{N_{m}}{N}\mathcal{L}_{m}(\boldsymbol{\theta}),\qquad(1)$$

where $\boldsymbol{\theta}\in\mathbb{R}^{d}$ is the $d$-dimensional parameter to be optimized, and we define $\mathcal{L}_{m}(\boldsymbol{\theta})\triangleq\mathbb{E}[\mathcal{L}_{m}(\boldsymbol{\theta};z_{m})]$, with $\mathcal{L}_{m}(\boldsymbol{\theta})$ being the local loss function of worker $m$ and $z_{m}$ being a sample drawn randomly from its local dataset $\mathcal{D}_{m}$.

To elaborate on local SGD, we define $\boldsymbol{\theta}^{j}$ as the global model parameter at training round $j$, and $\boldsymbol{\theta}_{m}^{j,0}$ as the local model of worker $m$ before performing local updates. At each training round $j$, a subset $\mathcal{M}_{D}^{j}\subseteq\mathcal{M}$ of workers is selected to download the global model $\boldsymbol{\theta}^{j}$ from the PS prior to computation. After that, each selected worker $m$ in $\mathcal{M}_{D}^{j}$ sets its local model as $\boldsymbol{\theta}_{m}^{j,0}=\boldsymbol{\theta}^{j}$ and performs $U$ local iterations, with the updating formula given as

$$\boldsymbol{\theta}_{m}^{j,u+1}=\boldsymbol{\theta}_{m}^{j,u}-\frac{\eta}{B}\sum_{b=1}^{B}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u}),\qquad(2)$$

for any local iteration $u=0,\dots,U-1$. Here $\eta$ is the stepsize and $\frac{1}{B}\sum_{b=1}^{B}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})$ is the minibatch gradient computed by worker $m$ at local iteration $u$, where $B$ is the minibatch size and $z_{m,b}^{j,u}$ is a sample drawn independently across iterations from the local dataset $\mathcal{D}_{m}$.
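For concreteness, the following is a minimal NumPy sketch of the local update in (2); the gradient routine `grad_loss`, the dataset layout, and the default hyper-parameters are illustrative assumptions rather than the exact implementation used in the paper.

```python
import numpy as np

def local_sgd(theta_global, data_m, grad_loss, eta=0.1, U=5, B=100, rng=None):
    """Run U local minibatch-SGD iterations of (2) on worker m's data.

    grad_loss(theta, batch) is assumed to return the average gradient of the
    per-sample loss over the minibatch (a stand-in for the model-specific
    gradient computation).
    """
    rng = rng or np.random.default_rng()
    theta = theta_global.copy()                        # theta_m^{j,0} = theta^j
    for _ in range(U):                                 # u = 0, ..., U-1
        idx = rng.choice(len(data_m), size=B, replace=False)
        theta -= eta * grad_loss(theta, data_m[idx])   # minibatch gradient step
    return theta                                       # theta_m^{j,U}, to be uploaded
```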

After all the workers in set $\mathcal{M}_{D}^{j}$ have completed their local computations, a subset of $S$ workers $\mathcal{M}_{U}^{j}\subseteq\mathcal{M}_{D}^{j}$ is selected to upload their latest models, and the PS aggregates the models as

$$\boldsymbol{\theta}^{j+1}=\frac{1}{S}\sum_{m\in\mathcal{M}_{U}^{j}}\boldsymbol{\theta}_{m}^{j,U}.\qquad(3)$$

Note that we employ the equally weighted (unbiased) aggregation here, since the weight of each worker is already reflected in the worker selection process, as will be specified later.

The training process ends when some stopping criterion is satisfied, with the total number of rounds denoted as $J$.

II-B Performance Metrics

To gauge the efficiency of the proposed scheme, we are interested in two performance metrics, i.e., the number of training rounds and the communication cost required to reach a targeted training accuracy.

Firstly, the number of training rounds $J$ needed to achieve a targeted test accuracy is used to reflect the training speed, as well as the communication cost of the algorithm.

Secondly, we define the communication cost $C^{j}$ at training round $j$ as the number of model transmissions between the PS and the workers, given as

$$C^{j}=|\mathcal{M}_{D}^{j}|+S.\qquad(4)$$

This is due to the fact that the size of the global parameter downloaded by the workers is the same as that of the latest parameter uploaded by each selected worker. As a result, the total communication cost is given as $C=\sum_{j=0}^{J-1}C^{j}$.
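As a small illustration of this accounting, the sketch below computes the total cost $C$ from the per-round download set sizes; the input values are assumed for illustration only.

```python
def total_comm_cost(download_sizes, S):
    """Total cost C = sum_j (|M_D^j| + S) from (4), counting one unit
    per model transmission (download or upload)."""
    return sum(d + S for d in download_sizes)

# Example with J = 3 rounds and S = 5 uploads per round:
# a scheme that downloads to S workers vs. one broadcasting to all M = 20.
print(total_comm_cost([5, 5, 5], S=5))     # 30
print(total_comm_cost([20, 20, 20], S=5))  # 75
```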

These two metrics will be used in Section IV to compare the performance of different schemes.

III Adaptive Selection in Local SGD (AgeSel)

In this section, we first develop the novel AgeSel scheme that aims at improving communication efficiency by utilizing the age-based worker selection. Then the convergence of AgeSel is rigorously established.

III-A Algorithm Description

To elaborate on the worker selection strategy in the proposed AgeSel, we define an $M$-length vector $\boldsymbol{\tau}_{M}$ to collect the ages of the workers, as in [14, 15]. Note that $\boldsymbol{\tau}_{M}$ is maintained by the PS and initialized as a zero vector. Each element $\tau_{m}$ of vector $\boldsymbol{\tau}_{M}$ measures the number of consecutive rounds that worker $m$ has not been selected by the PS. Particularly, at each round $j$, $\tau_{m}$ is updated as follows:

$$\tau_{m}=\begin{cases}0,&\text{if $m$ is selected},\\ \tau_{m}+1,&\text{otherwise}.\end{cases}\qquad(5)$$

To identify the workers with low participation frequency, we pre-define a threshold $\tau_{\max}$. Accordingly, worker $m$ is forced to participate once its age satisfies $\tau_{m}\geq\tau_{\max}$. The specific procedure of AgeSel is delineated next.

At each round $j$, the PS selects $S$ workers to perform computation by checking the vector $\boldsymbol{\tau}_{M}$. More precisely, with vector $\boldsymbol{\tau}_{M}$, we can identify all the infrequent workers whose ages are no smaller than $\tau_{\max}$; these workers are considered first. Let $S^{j}$ denote the number of such workers. The set $\mathcal{M}_{D}^{j}$ of selected workers is then determined as follows. In the first case with $S^{j}\geq S$, the PS simply picks $S$ of these workers in age-descending order, where workers with larger local datasets are prioritized when there are ties in ages. In the second case with $S^{j}<S$, the PS first picks all the $S^{j}$ infrequent workers and then chooses the remaining $S-S^{j}$ workers without replacement from those with ages smaller than $\tau_{\max}$, with probabilities proportional to the sizes of their datasets.
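A minimal sketch of this selection rule is given below, assuming ages and dataset sizes are stored in plain Python lists; the function name and interface are illustrative rather than part of the paper.

```python
import numpy as np

def agesel_select(tau, sizes, S, tau_max, rng=None):
    """Select S workers per the AgeSel rule: workers with age >= tau_max first
    (age-descending, larger datasets break ties), then weighted sampling
    without replacement among the remaining workers; finally apply (5)."""
    rng = rng or np.random.default_rng()
    M = len(tau)
    infrequent = [m for m in range(M) if tau[m] >= tau_max]
    # Age-descending order; dataset size breaks ties.
    infrequent.sort(key=lambda m: (tau[m], sizes[m]), reverse=True)
    if len(infrequent) >= S:
        selected = infrequent[:S]
    else:
        rest = [m for m in range(M) if tau[m] < tau_max]
        p = np.array([sizes[m] for m in rest], dtype=float)
        p /= p.sum()
        extra = rng.choice(rest, size=S - len(infrequent), replace=False, p=p)
        selected = infrequent + list(extra)
    # Age update (5): reset the selected workers, increment the others.
    for m in range(M):
        tau[m] = 0 if m in selected else tau[m] + 1
    return selected
```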

Once the set $\mathcal{M}_{D}^{j}$ is determined, the PS broadcasts the global parameter $\boldsymbol{\theta}^{j}$ to all the selected workers. After initializing its local model with $\boldsymbol{\theta}_{m}^{j,0}=\boldsymbol{\theta}^{j}$, each worker $m$ in $\mathcal{M}_{D}^{j}$ performs $U$ iterations of local updates through (2) and sends its latest model $\boldsymbol{\theta}_{m}^{j,U}$ to the PS, i.e., we have $\mathcal{M}_{U}^{j}=\mathcal{M}_{D}^{j}$. Finally, the PS updates the vector $\boldsymbol{\tau}_{M}$ and aggregates the global model via (3).
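Putting the pieces together, one full training loop might look like the following sketch, which reuses the illustrative `agesel_select` and `local_sgd` helpers above; the fixed round budget and the model representation as a NumPy vector are assumptions made for illustration.

```python
def agesel_train(theta, datasets, sizes, grad_loss, J, S, tau_max,
                 U=5, B=100, eta=0.1):
    """Run J AgeSel rounds: select S workers, run U local steps on each,
    then average the returned models as in (3), tracking the cost (4)."""
    tau = [0] * len(datasets)                      # ages, maintained by the PS
    comm_cost = 0
    for _ in range(J):
        chosen = agesel_select(tau, sizes, S, tau_max)
        local_models = [local_sgd(theta, datasets[m], grad_loss, eta, U, B)
                        for m in chosen]
        theta = sum(local_models) / S              # aggregation (3)
        comm_cost += len(chosen) + S               # per-round cost (4)
    return theta, comm_cost
```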

Note that when $\tau_{\max}$ is very large, there are barely any infrequent workers. The set $\mathcal{M}_{D}^{j}$ is then chosen purely by the weights (i.e., the sizes of the datasets), as in the FedAvg algorithm [6]; i.e., in this case, AgeSel reduces to FedAvg.

Merits of AgeSel: The benefit of the age-based mechanism used in AgeSel is two-fold. First, it has been shown that under-participation of some workers can be detrimental to the convergence rate due to the lack of gradient diversity [20], especially for heterogeneous local datasets. Our age-based selection strategy balances the worker participation, thereby preserving gradient diversity and ensuring fast convergence. Second, the age information $\boldsymbol{\tau}_{M}$ is generated at no extra communication or computation cost, in contrast to other information, such as the (costly) norms of updates [15], used in existing alternatives. Hence, AgeSel is more communication- and computation-efficient.

III-B Convergence Analysis

We next establish the convergence of the proposed AgeSel algorithm under heterogeneous data distribution, with a general (not necessarily convex) objective function. Our analysis is based on the following two assumptions, which are widely adopted in related works such as [7, 8].

Assumption 1 (Smoothness and Lower Boundedness): Each local function $\mathcal{L}_{m}(\boldsymbol{\theta})$ is $L$-smooth, i.e.,

$$\left\|\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{1})-\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{2})\right\|\leq L\left\|\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\right\|,\qquad(6)$$

$\forall\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in\mathbb{R}^{d}$. We also assume that the objective function $\mathcal{L}$ is bounded below by $\mathcal{L}^{*}$.

Assumption 2 (Unbiasedness and Bounded Variance): For a given model parameter $\boldsymbol{\theta}$, the local gradient estimator is unbiased, i.e.,

$$\mathbb{E}[\nabla\mathcal{L}_{m}(\boldsymbol{\theta};z)]=\nabla\mathcal{L}_{m}(\boldsymbol{\theta}).\qquad(7)$$

Moreover, both the variance of the local gradient estimator and the deviation of each local gradient from the global gradient are bounded, i.e., there exist two constants $\sigma_{L},\sigma_{G}>0$ such that

$$\mathbb{E}[\|\nabla\mathcal{L}_{m}(\boldsymbol{\theta};z)-\nabla\mathcal{L}_{m}(\boldsymbol{\theta})\|^{2}]\leq\sigma_{L}^{2},\ \forall m,\qquad(8)$$
$$\mathbb{E}[\|\nabla\mathcal{L}_{m}(\boldsymbol{\theta})-\nabla\mathcal{L}(\boldsymbol{\theta})\|^{2}]\leq\sigma_{G}^{2},\ \forall m.\qquad(9)$$

With these assumptions, we can derive an upper bound on the expected average squared gradient norm $\frac{1}{J}\mathbb{E}\big[\sum_{j=0}^{J-1}\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\|^{2}\big]$ to prove the convergence of the proposed AgeSel. We start by presenting the following lemma.

Lemma 1: Under Assumptions 1-2, and with $\eta$ chosen such that $\eta\leq\frac{1}{8LU}$, there exists a positive constant $c<\frac{1}{2}-15U^{2}\eta^{2}L^{2}-L\eta(90U^{3}L^{2}\eta^{2}+3U)$ such that

$$\mathbb{E}[\mathcal{L}(\boldsymbol{\theta}^{j+1})]\leq\mathcal{L}(\boldsymbol{\theta}^{j})-c\eta U\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+Z_{1}+L\eta^{2}Z_{2}+\frac{2L\eta^{2}(A^{j})^{2}}{S^{2}}Z_{2}-\frac{2L\eta^{2}A^{j}}{S}Z_{2},\qquad(10)$$

where we have defined $Z_{1}=\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)$ and $Z_{2}=15U^{3}L^{2}\eta^{2}(\sigma_{L}^{2}+6U\sigma_{G}^{2})+3U^{2}\sigma_{G}^{2}$; and $A^{j}$ is the cardinality of the set $\mathcal{A}^{j}$ of workers selected by age at round $j$, i.e., $A^{j}=\min\{S,S^{j}\}$.

Proof.

The proof can be found in the appendix. ∎

Lemma 1 depicts the one-step difference of the objective function, from which we can see the impact of the age. Particularly, as $A^{j}$ grows larger, the variance term $-\frac{2L\eta^{2}A^{j}}{S}Z_{2}$ decreases, while the variance term $\frac{2L\eta^{2}(A^{j})^{2}}{S^{2}}Z_{2}$ increases. Therefore, the values of $\tau_{\max}$ and $S$, which jointly determine $A^{j}$, have a significant impact on the training speed, as will be further demonstrated in Section IV. Based on Lemma 1, we arrive at the final convergence result.
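To make this trade-off concrete, one may combine the two $A^{j}$-dependent terms of (10) as
$$\frac{2L\eta^{2}(A^{j})^{2}}{S^{2}}Z_{2}-\frac{2L\eta^{2}A^{j}}{S}Z_{2}=\frac{2L\eta^{2}Z_{2}}{S^{2}}\left((A^{j})^{2}-SA^{j}\right),$$
which, viewed as a quadratic in $A^{j}\in[0,S]$, vanishes at $A^{j}=0$ and $A^{j}=S$ and attains its minimum $-\frac{L\eta^{2}Z_{2}}{2}$ at $A^{j}=S/2$; that is, in this bound a moderate number of age-triggered selections per round yields the largest per-round descent guarantee.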

Theorem 1: With the same conditions as in Lemma 1, we have

$$\frac{1}{J}\mathbb{E}\left[\sum_{j=0}^{J-1}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\right]\leq\frac{\mathcal{L}(\boldsymbol{\theta}^{0})-\mathcal{L}^{*}}{c\eta UJ}+V,\qquad(11)$$

where we have defined the constant $V=\frac{1}{c}\Big[\frac{\eta L}{2SB}\sigma_{L}^{2}+\frac{Z_{1}}{\eta U}+\big(3\eta L-\frac{2L\eta M}{SR}\big)\frac{Z_{2}}{U}\Big]$, and $R$ denotes the maximum number of rounds needed to traverse all the workers in $\mathcal{M}$ under the age-based mechanism.

Proof.

The proof can be found in the appendix. ∎

Theorem 1 states that the proposed AgeSel scheme achieves a sublinear convergence rate $\mathcal{O}(\frac{1}{J})$, as does local SGD with partial worker participation [7, 8]. The advantage of age-based worker selection is reflected in the expression of $V$: a smaller $\tau_{\max}$ leads to a smaller $R$, which in turn implies a reduced $V$.

IV Simulation Results

In this section, we evaluate the effectiveness of the proposed AgeSel against state-of-the-art schemes including FedAvg [6], OCS [15] and round robin (RR) [25]. Note that AgeSel with a large $\tau_{\max}$ reduces to FedAvg, while as $\tau_{\max}$ goes to zero, AgeSel becomes equivalent to RR. Since both $\tau_{\max}$ and $S$ determine the value of $A^{j}$, we also explore the impact of the hyper-parameter $S$ on the performance of AgeSel.

We start by briefly introducing the three baseline algorithms. For FedAvg, the PS performs weighted selection according to the sizes of the local datasets, and we have $\mathcal{M}_{D}^{j}=\mathcal{M}_{U}^{j}$ with $|\mathcal{M}_{D}^{j}|=S$. For OCS, the PS sends the global model to all the workers in $\mathcal{M}$ to perform local computation, and only the $S$ workers with the largest contributions are selected to upload, i.e., we have $\mathcal{M}_{D}^{j}=\mathcal{M}$ and $|\mathcal{M}_{U}^{j}|=S$. For RR, the workers are selected in a circular order with $|\mathcal{M}_{D}^{j}|=|\mathcal{M}_{U}^{j}|=S$, and the aggregation of the updates is weighted for fair comparison.

Simulation Setting. The dataset $\mathcal{D}$ considered here is the EMNIST dataset. We aim to solve an image classification task with a two-layer fully connected neural network. There are $M=20$ workers in total and the data is heterogeneously distributed among them. Particularly, the samples of the dataset $\mathcal{D}$ are sorted according to their labels and allocated to the workers in order, with different sizes. The stepsize $\eta$ is 0.1, the batchsize $B$ is 100, and the number of local updates performed per round $U$ is set to 5. All the schemes stop training once the test accuracy reaches 80%. For a fair comparison, the number $S$ of workers selected to upload their local parameters at each round is set to $S=5$ for all schemes. Moreover, we set $\tau_{\max}=4$ for AgeSel.
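As an illustration of such a label-sorted heterogeneous split, the sketch below partitions a labeled dataset into $M$ contiguous shards of unequal sizes; the specific size pattern (drawn from a Dirichlet distribution) is an assumption, not the exact partition used in the experiments.

```python
import numpy as np

def label_sorted_partition(labels, M=20, rng=None):
    """Sort sample indices by label and cut them into M contiguous shards of
    unequal sizes, producing a heterogeneous (non-IID) split."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(labels)                    # group samples by label
    weights = rng.dirichlet(np.ones(M) * 2.0)     # unequal shard sizes (assumed)
    cuts = np.cumsum((weights * len(labels)).astype(int))[:-1]
    return np.split(order, cuts)                  # one index array per worker
```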

AgeSel Outperforms the State-of-the-Art Schemes in Both Performance Metrics. The performances of FedAvg, OCS, RR and AgeSel in terms of training rounds and communication cost, averaged over 10 Monte Carlo runs, are depicted in Fig. 1 and Fig. 2, respectively.

As shown in Fig. 1, by considering either the weights (i.e., the sizes of datasets) or the ages only, FedAvg and RR require more training rounds than OCS and AgeSel in the presence of data heterogeneity. By selecting the workers with larger gradient norms, OCS converges faster. Clearly, AgeSel performs the best by striking a desirable balance between ages and weights.

Figure 1: Comparison of FedAvg, OCS, RR and AgeSel in terms of training rounds with 10 Monte Carlo runs.
Figure 2: Comparison of FedAvg, OCS, RR and AgeSel in terms of communication cost with 10 Monte Carlo runs.

As illustrated in Fig. 2, OCS incurs the largest communication cost because all $M$ workers download the global model while only $S$ of them are eventually selected to upload. As a result, the reduction in training rounds cannot offset the increase in per-round communication overhead, yielding a larger total cost than the other schemes. Since AgeSel requires fewer training rounds than FedAvg and RR, while all three have the same per-round communication cost, AgeSel achieves better communication efficiency than both. Overall, AgeSel is the most communication-efficient selection strategy among all these schemes.

Figure 3: Number of training rounds of AgeSel with different values of $S$.
Figure 4: Communication cost of AgeSel with different values of $S$.

Exploration of $S$. As illustrated in Fig. 3, when $S$ is larger than some value (e.g., $S=5$), the performance of AgeSel in terms of training rounds with partial participation becomes quite close to that of full participation (i.e., $S=M=20$). This implies that the age-based mechanism can indeed accelerate the training process. However, as $S$ increases, the per-round communication cost also increases monotonically, as shown in Fig. 4. From Figs. 1-4, we can see that, with proper values of $S$ and $\tau_{\max}$, AgeSel strikes a desirable balance between convergence speed and communication overhead.

V Conclusions

We developed a novel AgeSel strategy to perform adaptive worker selection for PS-based local SGD under heterogeneous data distribution. The convergence of the AgeSel scheme was rigorously established. Simulation results showed that AgeSel is more communication-efficient and converges faster than state-of-the-art schemes.

Note that AgeSel is compatible with other techniques to jointly improve the overall performance of distributed learning. To name a few, it can be combined with straggler-tolerant techniques to render the learning robust to stragglers; it can also be used together with variance-reduction algorithms to further accelerate the training process. These interesting directions will be pursued in future work.

VI Appendix

VI-A Proof of Lemma 1

With the $L$-smoothness of the objective function $\mathcal{L}$ in Assumption 1, taking the expectation over all randomness of $\mathcal{L}(\boldsymbol{\theta}^{j+1})$ yields:

$$\begin{aligned}
\mathbb{E}[\mathcal{L}(\boldsymbol{\theta}^{j+1})]&\leq\mathcal{L}(\boldsymbol{\theta}^{j})+\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}[\boldsymbol{\theta}^{j+1}-\boldsymbol{\theta}^{j}]\right\rangle+\frac{L}{2}\mathbb{E}\!\left[\left\|\boldsymbol{\theta}^{j+1}-\boldsymbol{\theta}^{j}\right\|^{2}\right]\\
&\overset{(a1)}{=}\mathcal{L}(\boldsymbol{\theta}^{j})+\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}[g^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})]\right\rangle+\frac{L}{2}\mathbb{E}\!\left[\left\|g^{j}\right\|^{2}\right]\\
&\overset{(b1)}{=}\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\underbrace{\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}[g^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})]\right\rangle}_{T_{1}}+\frac{L}{2}\underbrace{\mathbb{E}\!\left[\left\|g^{j}\right\|^{2}\right]}_{T_{2}},
\end{aligned}\qquad(12)$$

where $(a1)$ follows from the definitions $g^{j}=\frac{1}{S}\sum_{m\in\mathcal{M}_{U}^{j}}g_{m}^{j}$ and $g_{m}^{j}=-\eta\sum_{u=0}^{U-1}\frac{1}{B}\sum_{b=1}^{B}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})$, with $p_{m}=N_{m}/N$; and $(b1)$ follows directly from $(a1)$.

We then bound the terms $T_{1}$ and $T_{2}$, respectively. For the term $T_{1}$, with the definitions of the weighted global update $\bar{g}^{j}=\sum_{m\in\mathcal{M}}p_{m}g_{m}^{j}$ and the unweighted global update $\tilde{g}^{j}=\frac{1}{M}\sum_{m\in\mathcal{M}}g_{m}^{j}$, we have

$$\begin{aligned}
T_{1}&=\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}[g^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})]\right\rangle\\
&\overset{(a2)}{=}\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[\frac{1}{S}\left(\sum_{m\in\mathcal{M}_{U}^{j}\backslash\mathcal{A}^{j}}g_{m}^{j}+\sum_{m\in\mathcal{A}^{j}}g_{m}^{j}\right)+\frac{1}{S}\left((S-A^{j})\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})+A^{j}\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle\\
&\overset{(b2)}{=}\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[\frac{S-A^{j}}{S}\left(\bar{g}^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle+\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[\frac{A^{j}}{S}\left(\tilde{g}^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle,
\end{aligned}\qquad(13)$$

where $(a2)$ splits the workers selected at round $j$ into those selected by age, collected in the set $\mathcal{A}^{j}$ with $|\mathcal{A}^{j}|=A^{j}=\min\{S,S^{j}\}$, and those selected by weight, collected in the set $\mathcal{M}_{U}^{j}\backslash\mathcal{A}^{j}$; and $(b2)$ holds because the selection by age is essentially unweighted random selection, whose update equals $\tilde{g}^{j}$ in expectation, while the selection by weight yields $\bar{g}^{j}$ in expectation. The two terms in (13) are then bounded separately.

To bound the first term, we have:

$$\begin{aligned}
&\quad\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[\frac{S-A^{j}}{S}\left(\bar{g}^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle\\
&\overset{(a3)}{=}\frac{S-A^{j}}{S}\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[-\frac{1}{B}\sum_{m=1}^{M}\sum_{u=0}^{U-1}\sum_{b=1}^{B}\eta p_{m}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right]\right\rangle\\
&\overset{(b3)}{=}\frac{S-A^{j}}{S}\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[-\sum_{m=1}^{M}\sum_{u=0}^{U-1}\eta p_{m}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})+\eta U\sum_{m=1}^{M}p_{m}\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right]\right\rangle\\
&\overset{(c3)}{=}\frac{S-A^{j}}{S}\left\langle\sqrt{\eta U}\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),-\frac{\sqrt{\eta}}{\sqrt{U}}\mathbb{E}\left[\sum_{m=1}^{M}p_{m}\sum_{u=0}^{U-1}\left(\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})-\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle\\
&\overset{(d3)}{\leq}\frac{S-A^{j}}{S}\left(\frac{\eta U}{2}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{\eta}{2U}\mathbb{E}\left[\left\|\sum_{m=1}^{M}p_{m}\sum_{u=0}^{U-1}\left(\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})-\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right\|^{2}\right]\right)\\
&\overset{(e3)}{\leq}\frac{S-A^{j}}{S}\left(\frac{\eta U}{2}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{\eta}{2}\sum_{m=1}^{M}p_{m}\sum_{u=0}^{U-1}\mathbb{E}\left[\left\|\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})-\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\right]\right)\\
&\overset{(f3)}{\leq}\frac{S-A^{j}}{S}\left(\frac{\eta U}{2}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{\eta L^{2}}{2}\sum_{m=1}^{M}p_{m}\sum_{u=0}^{U-1}\mathbb{E}\left[\left\|\boldsymbol{\theta}_{m}^{j,u}-\boldsymbol{\theta}^{j}\right\|^{2}\right]\right)\\
&\overset{(g3)}{\leq}\frac{S-A^{j}}{S}\left(\eta U\left(\frac{1}{2}+15U^{2}\eta^{2}L^{2}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)\right),
\end{aligned}\qquad(14)$$

where $(a3)$ follows from the definition of $\bar{g}^{j}$; $(b3)$ and $(c3)$ follow from direct computation; $(d3)$ uses the fact that $\langle\mathbf{x},\mathbf{y}\rangle\leq\frac{1}{2}[\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}]$; $(e3)$ follows from Jensen's inequality and the Cauchy-Schwarz inequality; $(f3)$ follows from Assumption 1; and $(g3)$ uses $\sum_{m=1}^{M}p_{m}=1$ together with [24, Lemma 3], which shows that

$$\mathbb{E}\left[\left\|\boldsymbol{\theta}_{m}^{j,u}-\boldsymbol{\theta}^{j}\right\|^{2}\right]\leq 5U\eta^{2}(\sigma_{L}^{2}+6U\sigma_{G}^{2})+30U^{2}\eta^{2}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2},\qquad(15)$$

under the condition that $\eta\leq\frac{1}{8LU}$, where $\sigma_{L}$ and $\sigma_{G}$ are the two constants defined in Assumption 2.

Likewise, the second term in (13) can be bounded as follows, with $p_{m}$ replaced by $\frac{1}{M}$:

$$\begin{aligned}
&\quad\left\langle\nabla\mathcal{L}(\boldsymbol{\theta}^{j}),\mathbb{E}\left[\frac{A^{j}}{S}\left(\tilde{g}^{j}+\eta U\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right]\right\rangle\\
&\leq\frac{A^{j}}{S}\left(\eta U\left(\frac{1}{2}+15U^{2}\eta^{2}L^{2}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)\right).
\end{aligned}\qquad(16)$$

Substituting (14) and (16) into (13), we have

$$T_{1}\leq\eta U\left(\frac{1}{2}+15U^{2}\eta^{2}L^{2}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right).\qquad(17)$$

With $\mathbb{I}\{\cdot\}$ denoting the indicator function and $\mathcal{A}\backslash\mathcal{B}$ denoting the complement of set $\mathcal{B}$ in set $\mathcal{A}$, the term $T_{2}$ can be bounded as

$$\begin{aligned}
T_{2}&=\mathbb{E}[\|g^{j}\|^{2}]\\
&\overset{(a4)}{=}\mathbb{E}\left[\left\|\frac{1}{S}\sum_{m\in\mathcal{M}_{U}^{j}}g_{m}^{j}\right\|^{2}\right]\\
&\overset{(b4)}{=}\frac{1}{S^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}g_{m}^{j}\right\|^{2}\right]\\
&\overset{(c4)}{=}\frac{\eta^{2}}{S^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}\sum_{u=0}^{U-1}\left[\frac{1}{B}\sum_{b=1}^{B}\left(\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})-\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right)\right]\right\|^{2}\right]+\frac{\eta^{2}}{S^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]\\
&\overset{(d4)}{=}\frac{\eta^{2}}{S^{2}B^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}\sum_{u=0}^{U-1}\sum_{b=1}^{B}\left(\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})-\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right)\right\|^{2}\right]+\frac{\eta^{2}}{S^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]\\
&\overset{(e4)}{=}\frac{\eta^{2}U}{SB}\sigma_{L}^{2}+\frac{\eta^{2}}{S^{2}}\mathbb{E}\left[\left\|\sum_{m=1}^{M}\mathbb{I}\{m\in\mathcal{M}_{U}^{j}\}\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]\\
&\overset{(f4)}{=}\frac{\eta^{2}U}{SB}\sigma_{L}^{2}+\frac{\eta^{2}}{S^{2}}\mathbb{E}\left[\left\|\sum_{m\in\mathcal{A}^{j}}\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})+\sum_{m\in\mathcal{M}_{U}^{j}\backslash\mathcal{A}^{j}}\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]\\
&\overset{(g4)}{\leq}\frac{\eta^{2}U}{SB}\sigma_{L}^{2}+\frac{2A^{j}\eta^{2}}{S^{2}}\sum_{m\in\mathcal{A}^{j}}\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]+\frac{2(S-A^{j})\eta^{2}}{S^{2}}\sum_{m\in\mathcal{M}_{U}^{j}\backslash\mathcal{A}^{j}}\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right],
\end{aligned}\qquad(18)$$

where $(a4)$ follows from the definition of $g^{j}$; $(b4)$ comes directly from $(a4)$; $(c4)$ follows from the fact that $\mathbb{E}[\|x\|^{2}]=\mathbb{E}[\|x-\mathbb{E}[x]\|^{2}]+\|\mathbb{E}[x]\|^{2}$ and $\mathbb{E}[\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u};z_{m,b}^{j,u})]=\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})$; $(d4)$ follows from the Cauchy-Schwarz inequality; $(e4)$ uses the fact that $\mathbb{E}[\|x_{1}+\dots+x_{n}\|^{2}]=\mathbb{E}[\|x_{1}\|^{2}+\dots+\|x_{n}\|^{2}]$ when the $x_{i}$'s are independent with zero mean, together with the variance bound (8); $(f4)$ follows from the definition of the set $\mathcal{A}^{j}$ of workers selected by age at round $j$, with $|\mathcal{A}^{j}|=A^{j}=\min\{S,S^{j}\}$; and $(g4)$ follows from the Cauchy-Schwarz inequality.

Next, we bound the term $\mathbb{E}\big[\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\|^{2}\big]$ appearing in (18) as follows:

$$\begin{aligned}
\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]&=\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\left(\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})-\nabla\mathcal{L}_{m}(\boldsymbol{\theta}^{j})+\nabla\mathcal{L}_{m}(\boldsymbol{\theta}^{j})-\nabla\mathcal{L}(\boldsymbol{\theta}^{j})+\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right)\right\|^{2}\right]\\
&\overset{(a5)}{\leq}3UL^{2}\sum_{u=0}^{U-1}\mathbb{E}[\|\boldsymbol{\theta}_{m}^{j,u}-\boldsymbol{\theta}^{j}\|^{2}]+3U^{2}\sigma_{G}^{2}+3U^{2}\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\|^{2}\\
&\overset{(b5)}{\leq}15U^{3}L^{2}\eta^{2}(\sigma_{L}^{2}+6U\sigma_{G}^{2})+(90U^{4}L^{2}\eta^{2}+3U^{2})\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\|^{2}+3U^{2}\sigma_{G}^{2}\\
&\overset{(c5)}{=}C_{1}\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\|^{2}+C_{2},
\end{aligned}\qquad(19)$$

where $(a5)$ follows from the Cauchy-Schwarz inequality, Assumption 1 and the bounded-variance assumption; $(b5)$ follows from (15); and $(c5)$ follows from the definitions $C_{1}=90U^{4}L^{2}\eta^{2}+3U^{2}$ and $C_{2}=15U^{3}L^{2}\eta^{2}(\sigma_{L}^{2}+6U\sigma_{G}^{2})+3U^{2}\sigma_{G}^{2}$; note that $C_{2}$ coincides with $Z_{2}$ defined in Lemma 1.

By substituting the upper bounds (17) and (18) on the terms $T_{1}$ and $T_{2}$ into (12), we readily have:

$$\begin{aligned}
\mathbb{E}[\mathcal{L}(\boldsymbol{\theta}^{j+1})]&\overset{(a6)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\left(\frac{1}{2}-15U^{2}\eta^{2}L^{2}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}\\
&\quad+\frac{L\eta^{2}A^{j}}{S^{2}}\sum_{m\in\mathcal{A}^{j}}\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]+\frac{L\eta^{2}(S-A^{j})}{S^{2}}\sum_{m\in\mathcal{M}_{U}^{j}\backslash\mathcal{A}^{j}}\mathbb{E}\left[\left\|\sum_{u=0}^{U-1}\nabla\mathcal{L}_{m}(\boldsymbol{\theta}_{m}^{j,u})\right\|^{2}\right]\\
&\overset{(b6)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\left(\frac{1}{2}-15U^{2}\eta^{2}L^{2}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}\\
&\quad+\frac{L\eta^{2}\left((A^{j})^{2}+(S-A^{j})^{2}\right)}{S^{2}}\left(C_{1}\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\|^{2}+C_{2}\right)\\
&\overset{(c6)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\left(\frac{1}{2}-15U^{2}\eta^{2}L^{2}-\frac{L\eta\left(2(A^{j})^{2}-2SA^{j}+S^{2}\right)}{S^{2}U}C_{1}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\\
&\quad+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+\frac{L\eta^{2}\left(2(A^{j})^{2}-2SA^{j}+S^{2}\right)}{S^{2}}C_{2}\\
&\overset{(d6)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-\eta U\left(\frac{1}{2}-15U^{2}\eta^{2}L^{2}-\frac{L\eta}{U}C_{1}\right)\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\\
&\quad+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+\frac{L\eta^{2}\left(2(A^{j})^{2}-2SA^{j}+S^{2}\right)}{S^{2}}C_{2}\\
&\overset{(e6)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-c\eta U\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+L\eta^{2}C_{2}+\frac{2L\eta^{2}(A^{j})^{2}}{S^{2}}C_{2}-\frac{2L\eta^{2}A^{j}}{S}C_{2},
\end{aligned}\qquad(20)$$

where $(a6)$ comes from direct substitution; $(b6)$ uses the result in (19); $(c6)$ follows from direct computation; $(d6)$ uses the fact that $0\leq A^{j}\leq S$; and $(e6)$ follows from the existence of a constant $c$ such that $0<c<\frac{1}{2}-15U^{2}\eta^{2}L^{2}-L\eta(90U^{3}L^{2}\eta^{2}+3U)$. The proof of Lemma 1 is then complete.

VI-B Proof of Theorem 1

With Lemma 1, we have

$$\begin{aligned}
\mathbb{E}[\mathcal{L}(\boldsymbol{\theta}^{j+1})]&\overset{(a8)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-c\eta U\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+\frac{L\eta^{2}\left(2(A^{j})^{2}+S^{2}\right)}{S^{2}}C_{2}-\frac{2L\eta^{2}A^{j}}{S}C_{2}\\
&\overset{(b8)}{\leq}\mathcal{L}(\boldsymbol{\theta}^{j})-c\eta U\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+3L\eta^{2}C_{2}-\frac{2L\eta^{2}A^{j}}{S}C_{2},
\end{aligned}\qquad(21)$$

where $(a8)$ follows from (20); and $(b8)$ uses the fact that $0\leq A^{j}\leq S$.

With the age-based mechanism, we denote by $R$ the minimum number of rounds needed to traverse all the workers in $\mathcal{M}$, i.e., we have $A^{j}+\dots+A^{j+R-1}\geq M$. By rearranging the terms in (21) and summing over $j=0,\dots,R-1$, we have:

$$\frac{1}{R}\mathbb{E}\left[\sum_{j=0}^{R-1}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\right]\leq\frac{\mathcal{L}(\boldsymbol{\theta}^{0})-\mathcal{L}(\boldsymbol{\theta}^{R})}{c\eta UR}+V_{1},\qquad(22)$$

where $V_{1}=\frac{1}{c\eta U}\Big(\frac{\eta^{2}UL}{2SB}\sigma_{L}^{2}+\frac{5U^{2}\eta^{3}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\big(3L\eta^{2}-\frac{2L\eta^{2}M}{SR}\big)C_{2}\Big)$.

Thus, when $J$ is a multiple of $R$, we can readily write

$$\frac{1}{J}\mathbb{E}\left[\sum_{j=0}^{J-1}\left\|\nabla\mathcal{L}(\boldsymbol{\theta}^{j})\right\|^{2}\right]\leq\frac{\mathcal{L}(\boldsymbol{\theta}^{0})-\mathcal{L}^{*}}{c\eta UJ}+V,\qquad(23)$$

where we used Assumption 1 and $V=\frac{1}{c}\Big[\frac{\eta L}{2SB}\sigma_{L}^{2}+\frac{5U\eta^{2}L^{2}}{2}\left(\sigma_{L}^{2}+6U\sigma_{G}^{2}\right)+\big(3\eta L-\frac{2L\eta M}{SR}\big)\big(15U^{2}L^{2}\eta^{2}(\sigma_{L}^{2}+6U\sigma_{G}^{2})+3U\sigma_{G}^{2}\big)\Big]$. The proof is then complete.

References

  • [1] J. Dean, G. S. Corrado, R. Monga, C. Kai, and A. Y. Ng, “Large scale distributed deep networks,” in Proc. NeurIPS, vol. 25, 2012, pp. 1223–1231.
  • [2] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, and B. Y. Su, “Scaling distributed machine learning with the parameter server,” in Proc. USENIX OSDI, 2014, pp. 583–598.
  • [3] X. Lian, C. Zhang, H. Zhang, C. J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. NeurIPS, vol. 30, 2017, pp. 5336–5346.
  • [4] N. Yan, K. Wang, C. Pan, and K. K. Chai, "Performance analysis for channel-weighted federated learning in OMA wireless networks," IEEE Signal Process. Lett., vol. 29, pp. 772–776, 2022.
  • [5] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. COMPSTAT, 2010, pp. 177–186.
  • [6] H. B. Mcmahan, E. Moore, D. Ramage, S. Hampson, and B. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Artif. Intell. and Statist., 2017, pp. 1273–1282.
  • [7] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. ICLR, 2019.
  • [8] H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. ICLR, 2020.
  • [9] F. Zhou and G. Cong, “On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3219–3227.
  • [10] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. R. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Proc. Neural Inf. Process. Syst., vol. 32, 2019, pp. 11 082–11 094.
  • [11] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
  • [12] H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 5693–5700.
  • [13] M. P. Uddin, Y. Xiang, J. Yearwood, and L. Gao, “Robust federated averaging via outlier pruning,” IEEE Signal Process. Lett., vol. 29, pp. 409–413, 2021.
  • [14] T. Chen, Z. Guo, Y. Sun, and W. Yin, "CADA: Communication-adaptive distributed Adam," in Proc. ICAIS, 2021, pp. 613–621.
  • [15] W. Chen, S. Horvath, and P. Richtarik, “Optimal client sampling for federated learning,” Trans. Mach. Learn. Res., 2022.
  • [16] M. Ribero and H. Vikalo, “Communication-efficient federated learning via optimal client sampling,” arXiv preprint arXiv:2007.15197, 2020.
  • [17] Y. J. Cho, S. Gupta, G. Joshi, and O. Yağan, “Bandit-based communication-efficient client selection strategies for federated learning,” in Proc. Asilomar Conf. Signal, Syst., and Comput., 2020, pp. 1066–1069.
  • [18] J. Goetz, K. Malik, D. Bui, S. Moon, H. Liu, and A. Kumar, “Active federated learning,” arXiv preprint arXiv:1909.12641, 2019.
  • [19] S. Kaul, R. Yates, and M. Gruteser, “Real-time status: How often should one update?” in Proc. IEEE INFOCOM, 2012, pp. 2731–2735.
  • [20] D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett, “Gradient diversity: A key ingredient for scalable distributed learning,” in Proc. ICAIS, 2018, pp. 1998–2007.
  • [21] H. H. Yang, A. Arafa, T. Q. S. Quek, and H. Vincent Poor, “Age-based scheduling policy for federated learning in mobile edge networks,” in Proc. IEEE ICASSP, 2020, pp. 8743–8747.
  • [22] E. Ozfatura, B. Buyukates, D. Gündüz, and S. Ulukus, “Age-based coded computation for bias reduction in distributed learning,” in Proc. IEEE GLOBECOM, 2020, pp. 1–6.
  • [23] B. Buyukates and S. Ulukus, “Timely communication in federated learning,” in Proc. IEEE INFOCOM, 2021, pp. 1–6.
  • [24] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečnỳ, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” in Proc. ICLR, 2020.
  • [25] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2019.