
FedStaleWeight: Buffered Asynchronous Federated Learning with Fair Aggregation via Staleness Reweighting

Jeffrey Ma  Alan Tu  Yiling Chen  Vijay Janapa Reddi
Harvard University
Cambridge, MA 02134
jeffreyma@g.harvard.edu, alantu@college.harvard.edu
yiling@seas.harvard.edu, vj@eecs.harvard.edu
Abstract

Federated Learning (FL) endeavors to harness decentralized data while preserving privacy, facing challenges of performance, scalability, and collaboration. Asynchronous Federated Learning (AFL) methods have emerged as promising alternatives to their synchronous counterparts, which are bounded by the slowest agent, yet they add additional challenges in convergence guarantees, fairness with respect to compute heterogeneity, and incorporation of staleness in aggregated updates. Specifically, AFL biases model training heavily towards agents who can produce updates faster, leaving slower agents behind, who often also have differently distributed data which is not learned by the global model. Naively upweighting slow agents introduces incentive issues, where truly fast agents may falsely report updates at a slower speed to increase their contribution to model training. We introduce FedStaleWeight, an algorithm addressing fairness in aggregating asynchronous client updates by employing average staleness to compute fair re-weightings. FedStaleWeight reframes asynchronous federated learning aggregation as a mechanism design problem, devising a weighting strategy that incentivizes truthful compute speed reporting without favoring faster update-producing agents, upweighting agent updates based on staleness. Leveraging only observed agent update staleness, FedStaleWeight achieves more equitable aggregation on a per-agent basis. We both provide theoretical convergence guarantees in the smooth, non-convex setting and empirically compare FedStaleWeight against the commonly used asynchronous FedBuff with gradient averaging, demonstrating how it achieves stronger fairness, expediting convergence to a higher global model accuracy. Finally, we provide an open-source test bench to facilitate exploration of buffered AFL aggregation strategies, fostering further research in asynchronous federated learning paradigms. Code for experiments can be found at https://github.com/18jeffreyma/afl-bench

1 Introduction

1.1 Motivation and Background

In real-world scenarios, the task of training a model may be distributed among up to millions of agents, called edge devices, each of which contributes its own limited (and perhaps private) set of data, a paradigm known as federated learning (FL). Our paper focuses on cross-device FL, in which a large number of agents must participate in order to extract meaningful results. The other type of FL, cross-silo, utilizes fewer agents because each one has more data and participates actively in the model training process, like a consortium of hospitals sharing information about a new disease [1].

Cross-device FL faces two major challenges when put into practice: scalability and privacy. In a massive distributed network, it’s impossible for every agent to be active at every time step, or for all agents to train their private models at the same speed. Naive aggregation techniques rely on concurrent reports from agents because, for instance, a simple average may be taken over reports in order to update the global model at regular intervals. By relaxing the concurrency constraint, we move away from synchronous FL, and the problem becomes how to best aggregate updates arriving at different points in time. On the topic of privacy, agents want their personal data to remain fully confidential; the global model must therefore evolve in a way that does not reveal individual contributions through gradient updates or the like [2]. Recent work in secure aggregation (SecAgg) and differential privacy provide frameworks for exploring the tradeoffs between privacy and utility [3].

Asynchronous federated learning (AFL) does not rely on all clients finishing and communicating their results in the same round, and differs from its synchronous counterpart in the following ways:

  • Synchronicity: Clients are assumed to be heterogeneous and may return updates at different times or fail to return updates in time, necessitating mechanisms that do not wait for all participants before updating the global model through aggregation.

  • Staleness: As some models take longer to update, AFL mechanisms must often merge stale updates into the global model (i.e., model version T is deployed to a client, the client takes a long time to train, and it replies with its update after the global server has already aggregated other updates and stepped to model version T+1). Emerging research in AFL hypothesizes that these stale updates are likely still useful and should be incorporated into learning.

  • Convergence challenges: AFL often experiences higher training instability based on the strategy chosen for aggregating stale updates, similar to issues encountered in off-policy reinforcement learning (RL).

1.2 Our Contribution

We observe that the asynchronicity in AFL naturally leaves slow updating agents behind: since updates are no longer incorporated synchronously and are incorporated on demand, agents who are not able to update as quickly will contribute less to global model learning. However, naively up-weighting slow agent updates creates incentive issues: if our upweighting is too extreme, we incentivize fast updating agents to throttle and falsely report at slower speeds to increase their contribution to learning.

Thus, we explore the problem of fairly aggregating updates from clients in AFL to offset the unfairness introduced by compute heterogeneity in asynchronous federated learning. Several synchronous approaches such as FedAvg and FedProx exist, and more nuanced asynchronous approaches have been explored, such as a fixed-length buffer with averaging [2] or updating the central model immediately on each arrival [4]. To our knowledge, more nuanced weighting of stale updates or an adaptive approach has not yet been explored. Our contribution is the following:

  1. We reframe the problem of asynchronous federated learning aggregation as a welfare maximization problem and derive a gradient aggregation method which is strategy-proof against fast agents who may throttle their update rate to have their updates weighted higher, while increasing the contribution of slower update-producing agents to global model training.

  2. We show how our derived fair weighting can be computed by observing agent update staleness, a quantity naturally visible in buffered asynchronous federated learning, and present an algorithm for fair federated learning, FedStaleWeight.

  3. We provide a convergence guarantee for FedStaleWeight in the smooth, non-convex setting, showing that the upweighting of stale model updates can still yield convergence.

  4. We empirically show, in a series of realistic non-IID federated learning settings, how current AFL aggregation techniques are biased towards clients with higher throughput and how FedStaleWeight maintains fairness and subsequently converges to higher global model accuracy faster. We release an open-source test bench for other researchers to explore further buffered AFL aggregation strategies.

2 Related Work

Synchronous FL. Synchronous FL excels in the privacy dimension: when the principal must wait for all agents before aggregating their data, it becomes nearly impossible to recover an individual contribution from an average over so many points. The slowest agent, however, determines the training pace of the global model, resulting in a bottleneck. If the principal were to perform aggregation with only, say, the fastest half of agents, then an element of selection bias is introduced. These concerns aside, numerous synchronous techniques have found experimental success, and their primary features can be summarized as follows:

  • FedAvg: McMahan et al. (2016) coined the term “federated learning” and suggest a simple weighted average of gradients to update the global model, where the weights correspond to the p_i terms in the overall objective described in the next section [5].

  • FedProx: Li et al. (2020) improve upon FedAvg by adding a regularization term \frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_{g}\|^{2} to each local objective F_i, where \mathbf{x}_{g} is the current global model. This keeps each local model closer to the global model, but the method converges under a more restrictive set of assumptions [6].

  • FedAvgM: Hsu et al. (2019) also employ a form of regularization through server momentum, which accumulates past gradients in order to stabilize and accelerate training [7].

  • FedNova: Wang et al. (2020) note that heterogeneous update speeds can cause objective inconsistency, where the server model approaches a local optimum that does not match the global optimum of the original objective function [8]. The authors provide a formal analysis of this phenomenon and propose FedNova, which normalizes each client’s gradient by the number of updates that the client has sent in the current round before taking the overall average for a round. This technique encompasses both FedAvg and FedProx and has been shown to outperform both of them.

Asynchronous FL. AFL is a natural fit in the cross-device setting due to heterogeneity among agents’ availability as well as training and data transfer speeds. Here, we provide an overview of the most prominent work in AFL to date:

  • FedAsync: Xie et al. (2019) gave the first convergence guarantees for the AFL problem; their fully asynchronous method forces the server to update its model after every client update [9]. As later works observe, this is not only computationally taxing but insecure.

  • ASO-Fed: Chen et al. (2020) deal with the special case of online learning, where each agent’s local dataset is continuously growing as new observations arrive [4]. Newer data need not stem from the same distribution as older data, which means that a robust AFL framework must be able to handle non-IID data.

  • FedAdaGrad, FedAdam, FedYogi: Reddi et al. (2020) tailor the well-known adaptive optimizers AdaGrad, Adam, and Yogi to the FL problem [10].

  • SAFA: Wu et al. (2020) propose Semi-Asynchronous Federated Averaging (SAFA) to deal with the issues of stragglers, dropouts, and staleness [11].

  • Pisces: Jiang et al. (2022) use a custom scoring mechanism to choose which agents will participate in the global model training at a given time step [12]. Their utility function attempts to distinguish between slow clients with high-quality data and those which are just slow.

  • AsyncFedED: Wang et al. (2022) aggregate client data by calculating a “Euclidean distance” between stale models and the global model and then weighting each client’s update appropriately [13].

  • FedBuff: In order to avoid full asynchronicity, Nguyen et al. (2022) hold agent updates in a buffer. Only when the buffer is full does the data get aggregated and pushed to the server [2]. This work directly inspired and most closely resembles our work. SecAgg [14] [15] can also be integrated to maintain privacy guarantees.

  • FedFix: Fraboni et al. (2023) try to unify a large number of the above methods through their FedFix framework: it uses stochastic weights during aggregation [16].

3 Preliminaries

3.1 Asynchronous FL Setting

The federated optimization problem with m agents can be formulated as:

\min_{\mathbf{x}\in\mathbb{R}^{d}}\left[f(\mathbf{x}):=\sum_{i=1}^{m}p_{i}F_{i}(\mathbf{x})\right] \qquad (1)

where p_i is the importance of the i-th agent (often assumed to be proportional to the size of each client’s dataset, i.e. p_i = n_i/n, or equal for all agents, p_i = 1/n). F_i represents agent i’s local objective function and is only available to agent i.

Specifically, we study the buffered asynchronous federated learning setting, consisting of a single central server with an update aggregation buffer and m clients. The server only updates once b updates arrive at the buffer, where b is a tunable parameter. Each client has its own private data, pulls the latest version of the global model, trains it on its private dataset, and communicates the update back to the server, where it is appended to the buffer to later be aggregated (Figure 1). The asynchronous setting introduces the challenge of staleness, where an update applied to the global model at version v may have been computed from a model with version smaller than v.

In the buffered AFL setting, we denote the staleness \tau_i of agent i’s update as the positive difference in version number between the current global model and the global model that the update was computed from. Formally, we denote a global model with version n and with 0 client training steps as w_{n,0}. For a client i, we denote the global model with version n' before and after training for q local steps as w_{n',0} and w_{n',q}, respectively; the client’s update is denoted \Delta_i = w_{n',q} - w_{n',0}. Given the current global model w_{n,0} and update \Delta_i, the staleness is thus \tau_i = n - n'.
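To make the bookkeeping concrete, the following is a minimal Python sketch of how a server might track version numbers and compute the staleness \tau_i = n - n' for an arriving update. The class and field names are illustrative only and are not part of the released implementation.

from dataclasses import dataclass

@dataclass
class ClientUpdate:
    delta: list          # w_{n',q} - w_{n',0}, the client's model delta (placeholder type)
    base_version: int    # n', the global model version the client trained from

class VersionTracker:
    def __init__(self):
        self.global_version = 0  # n, incremented after every buffer aggregation

    def staleness(self, update: ClientUpdate) -> int:
        # tau_i = n - n': how many versions the global model advanced while the client trained
        return self.global_version - update.base_version

# Example: client pulled version 3, server has since stepped to version 5 -> staleness 2
tracker = VersionTracker()
tracker.global_version = 5
print(tracker.staleness(ClientUpdate(delta=[], base_version=3)))  # prints 2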

Refer to caption
Figure 1: Aggregation in buffered asynchronous federated learning. Updates accumulate in the buffer over time: for a buffer size of b = 4, the diagram shows aggregation of updates and how model versions are incremented for each arriving update.

3.2 Defining Fairness in Update Aggregation

Each FL client i represents its true update reporting speed as a random variable \mathbf{R}_i, where we denote the mean reporting speed as r_i = \mathbb{E}[\mathbf{R}_i]. Each agent i chooses to report a random variable \tilde{\mathbf{R}}_i with mean \tilde{r}_i = \mathbb{E}[\tilde{\mathbf{R}}_i] as their observed update reporting speed. We assume for simplicity that \tilde{r}_i \in [0, r_i], following the intuition of fixed hardware specs: clients cannot manipulate to appear faster on average than their true compute speed allows. Given reported update speeds, the expected proportion that each agent contributes to learning without re-weighting is:

p_{i}(\tilde{r}_{i})=\frac{\tilde{r}_{i}}{\sum_{j=1}^{n}{\tilde{r}_{j}}} \qquad (2)

A fair re-weighting \alpha(\cdot) takes p_i(\tilde{r}_i) and returns a weighting close to equal for every participating learning agent. In other words, we seek an \alpha(\cdot) that solves the following constrained welfare maximization problem:

\min_{\alpha(\cdot)}\sum_{i=1}^{n}{\left\lVert\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})-\frac{1}{n}\right\rVert_{2}}\quad\text{ such that }\quad\sum_{i=1}^{n}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})}=1 \qquad (3)

Clients seek to maximize a utility representing their effective influence (and thus their accuracy), modelled as their re-weighting times the overall proportional influence of their reported mean update rate. We use this utility function to justify strategy-proofness in Section 4.2.1.

\max_{\tilde{r}_{i}\in[0,r_{i}]}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})} \qquad (4)

3.3 Step Size as Uncertainty in Approximation

Recall that computing the update step u^* for a client with loss L(\mathbf{x}) at point \mathbf{x} can be framed as the solution of the following minimization problem, where the step size \eta represents the uncertainty in using the approximation L(\mathbf{x}+u^*) \approx L(\mathbf{x}) + u^{*\top}\nabla L(\mathbf{x}) for stochastic gradient descent:

u^{*}=\operatorname*{arg\,min}_{u\in\mathbb{R}^{d}}\left\{L(x)+u^{\top}\nabla L(x)+\frac{1}{2\eta}u^{\top}u\right\} \qquad (5)
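Setting the gradient of this objective with respect to u to zero makes the connection to the SGD step explicit (a one-line check, included here for completeness):

\nabla_{u}\left\{L(x)+u^{\top}\nabla L(x)+\frac{1}{2\eta}u^{\top}u\right\}=\nabla L(x)+\frac{1}{\eta}u=0\quad\Longrightarrow\quad u^{*}=-\eta\nabla L(x)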

Similarly, in federated learning, we wish to restrict our aggregated gradient to be no larger than the step size of any of the individual aggregated gradients for convergence reasons, necessitating that any solution \alpha(\cdot) must satisfy \sum_{i=1}^{n}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})}=1 as noted above.

4 Algorithm

4.1 Solving the Welfare Maximization Problem

Naively, if \alpha(\cdot)=1 is chosen (all agents are weighted equally), we see that p_i(\tilde{r}_i) is strictly increasing with respect to \tilde{r}_i and all agents will report \tilde{r}_i = r_i. However, welfare is poor, since an agent’s influence \alpha(p_i(\tilde{r}_i))p_i(\tilde{r}_i) is exactly increasing with respect to its rate, meaning that slow agents are left behind as they cannot manipulate any higher than their true r_i.

We observe that \alpha(\cdot) should be monotonically decreasing with respect to its input, but not so quickly that the weighted influence \alpha(p(\cdot))p(\cdot) decreases as the rate increases. If it did, faster agents would be incentivized to throttle their rates to improve their own weighted influence. The optimal choice is \alpha(\cdot)=\frac{1}{n(\cdot)}, i.e. \alpha(p_i(\tilde{r}_i)) = 1/(n p_i(\tilde{r}_i)). With this choice, each agent’s weighted influence remains constant regardless of which reported \tilde{\mathbf{R}}_i they choose, and thus there is no incentive to report untruthfully. We note that this weighting achieves weak truthfulness, where agents are indifferent between reporting at any speed, since their utility stays the same regardless.
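As a quick numeric sanity check (a sketch with made-up reported rates), the snippet below computes each agent's unweighted influence p_i(\tilde{r}_i) and the weighted influence \alpha(p_i)p_i under \alpha(p) = 1/(np); the product is 1/n for every agent regardless of the reported rates, which is exactly the indifference property described above.

def influence(rates):
    total = sum(rates)
    return [r / total for r in rates]

def stale_weight(p, n):
    # alpha(p) = 1 / (n * p), the fair re-weighting derived above
    return 1.0 / (n * p)

rates = [5.0, 2.0, 1.0, 0.5]           # hypothetical reported mean update rates
n = len(rates)
p = influence(rates)
weighted = [stale_weight(pi, n) * pi for pi in p]
print(weighted)                         # each entry equals 1/n = 0.25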

4.2 Deriving Expected Influence From Staleness

In buffered AFL, we assume the global aggregator has access only to the aggregation buffer and not to the arrival times of individual agent updates, disallowing a weighting scheme that relies on estimating individual agent reporting rates from unique identifiers. We proceed to show how a fair weighting can be computed from staleness, a locally available quantity in asynchronous federated learning.

Given an initially empty buffer, we can compute the expected index of the first update from a given agent i with mean rate \tilde{r}_i by summing the expected number of updates from all agents over the expected time period for one update from agent i. For example, an agent twice as fast as i would be expected to produce two updates in the time it takes i to produce its first. Recall that staleness is the difference between the version number of the current global model and the version number of the global model from which an agent’s update was computed. We derive the following expression for the expected staleness of agent i.

\mathbb{E}[\text{staleness of agent $i$ with rate $\tilde{r}_{i}$}]=\mathbb{E}[\tau_{i}]=\frac{\sum_{j=1}^{n}{\frac{\tilde{r}_{j}}{\tilde{r}_{i}}}-1}{b} \qquad (6)

Reconciling this with our mechanism definitions above, we get the unweighted influence of agent i given its expected staleness.

p_{i}(\tilde{r}_{i})=\frac{1}{\mathbb{E}[\tau_{i}]\cdot b+1} \qquad (7)

Recalling our derived weighting from the mechanism, we get a formula for our weighting as a function of expected staleness. We note that under a synchronous setting of zero expected staleness, we recover the exactly equal weighting used in synchronous FedAvg.

\alpha(p_{i}(\tilde{r}_{i}))=\frac{\mathbb{E}[\tau_{i}]\cdot b+1}{n} \qquad (8)
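The sketch below (hypothetical rates and buffer size, not taken from our experiments) follows Equations 6-8: it computes each agent's expected staleness from the reported rates and recovers the same fair weighting as \alpha(p) = 1/(np), so only observed staleness is needed at aggregation time.

def expected_staleness(rates, i, b):
    # Equation 6: E[tau_i] = (sum_j r_j / r_i - 1) / b
    return (sum(r / rates[i] for r in rates) - 1.0) / b

def weight_from_staleness(exp_tau, b, n):
    # Equation 8: alpha = (E[tau_i] * b + 1) / n
    return (exp_tau * b + 1.0) / n

rates, b = [5.0, 2.0, 1.0, 0.5], 2      # hypothetical reported rates and buffer size
n = len(rates)
for i, r in enumerate(rates):
    tau = expected_staleness(rates, i, b)
    alpha = weight_from_staleness(tau, b, n)
    p = r / sum(rates)
    # alpha * p equals 1/n = 0.25 for every agent, matching the equal-influence target
    print(f"agent {i}: E[tau]={tau:.2f}, alpha={alpha:.3f}, alpha*p={alpha * p:.3f}")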

4.2.1 Extending To Online Learning

We reconcile the derived weighting above with the online setting of arriving client updates. As noted in Section 3.3, at any given aggregation, we should avoid updating the global model with a step size larger than that of any individual client; otherwise, global model training can quickly diverge. As such, given a buffer of updates to be aggregated, we compute the desired weightings per the above and normalize them to sum to 1.

However, this introduces complexity into our derivation from above. Intuitively, if an agent chooses to increase their expected staleness by slowing down their reporting speed, the best they can do is enjoy a single aggregation step to themselves at full step size; consequently, they participate in fewer aggregations later due to their slower reporting speed, and contribute less to learning overall. We show this analytically: we can compute the normalized re-weighting \alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}})) for agent i’s update in a buffer as the following:

\begin{split}\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))&=\frac{\mathbb{E}[\tau_{a_{i}}]\cdot b+1}{\sum_{j=1}^{b}{\left[\mathbb{E}[\tau_{a_{j}}]\cdot b+1\right]}}=\frac{\left[\sum_{k=1}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{i}}}}-1\right]+1}{\sum_{j=1}^{b}{\left[\left[\sum_{k=1}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}-1\right]+1\right]}}\\&=\frac{\frac{1}{\tilde{r}_{a_{i}}}\sum_{k\neq a_{i}}^{n}{\tilde{r}_{k}}+1}{b+\sum_{j=1}^{b}{\sum_{k\neq a_{j}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}}}\end{split}

Isolating \tilde{r}_{a_{i}}, we get the following:

\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))=\frac{\frac{1}{\tilde{r}_{a_{i}}}\sum_{k\neq a_{i}}^{n}{\tilde{r}_{k}}+1}{b+M}
\text{ where } M=\frac{1}{\tilde{r}_{a_{i}}}\sum_{k=1,k\neq a_{i}}^{n}{\tilde{r}_{k}}+\tilde{r}_{a_{i}}\sum_{j=1,a_{j}\neq a_{i}}^{b}{\frac{1}{\tilde{r}_{a_{j}}}}+\sum_{j=1,a_{j}\neq a_{i}}^{b}{\left[\sum_{k=1,k\notin\{a_{j},a_{i}\}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}\right]}

We further note that, as an agent, by manipulating \tilde{r}_{a_{i}} downwards, we increase expected staleness but also decrease the frequency of the agent’s updates. Recall from Section 3.2 that the utility of an agent is the normalized re-weighting times the proportional frequency of the agent’s updates based on the agent’s reported update speed.

u_{a_{i}}(\tilde{r}_{a_{i}})=\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))\cdot\left(\frac{\tilde{r}_{a_{i}}}{\tilde{r}_{a_{i}}+\sum_{j=1,a_{j}\neq a_{i}}^{n}{r_{a_{j}}}}\right) \qquad (9)

Denoting A=\sum_{k=1,k\neq a_{i}}^{n}{\tilde{r}_{k}}, C=\sum_{j=1,a_{j}\neq a_{i}}^{b}{1/\tilde{r}_{a_{j}}}, and D=\sum_{j=1,a_{j}\neq a_{i}}^{b}{\left[\sum_{k=1,k\notin\{a_{j},a_{i}\}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}\right]}, we take the derivative of the utility with respect to the agent’s reporting speed to get:

\frac{\partial u_{a_{i}}}{\partial\tilde{r}_{a_{i}}}=\frac{A-C\tilde{r}_{a_{i}}^{2}}{(C\tilde{r}_{a_{i}}^{2}+(b+D)\tilde{r}_{a_{i}}+A)^{2}}

Since our algorithm reweights based on expected staleness, we can rescale the reported update rates \tilde{r}_{a_{i}} linearly by any positive constant without changing the expected staleness (or our reweighting), making C\tilde{r}_{a_{i}}^{2} as small as possible and A as large as possible. Furthermore, since A increases with larger n, we observe that with a large enough client pool, \partial u_{a_{i}}/\partial\tilde{r}_{a_{i}}>0, indicating that agents want to truthfully report as high an update speed as possible, and our aggregation mechanism is strictly strategy-proof with respect to agent manipulation of \tilde{r}_{a_{i}}.
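As an illustrative (not exhaustive) numeric check of the derivative sign above, the sketch below fixes an assumed buffer composition and set of true rates, computes the normalized weight and the utility of Equation 9 for a range of reported rates, and confirms that utility increases with the reported rate when the client pool is reasonably large, so throttling is never beneficial. All rates, the pool size, and the buffer contents are hypothetical.

def exp_staleness(rates, i, b):
    # Equation 6 applied to the reported rates
    return (sum(r / rates[i] for r in rates) - 1.0) / b

def normalized_weight(rates, buffer_ids, i, b, n):
    # un-normalized alpha_j = (E[tau_j]*b + 1)/n, then normalized over the buffer
    raw = {j: (exp_staleness(rates, j, b) * b + 1.0) / n for j in buffer_ids}
    return raw[i] / sum(raw.values())

n, b = 20, 5
others = [1.0] * (n - 1)                 # hypothetical true rates of the other agents
buffer_ids = list(range(b))              # agent 0 plus four unit-rate agents in the buffer
prev = None
for r_tilde in [0.5, 1.0, 1.5, 2.0]:     # agent 0's candidate reported rates (true rate 2.0)
    rates = [r_tilde] + others
    u = normalized_weight(rates, buffer_ids, 0, b, n) * rates[0] / sum(rates)
    assert prev is None or u > prev      # utility strictly increases with the reported rate
    prev = u
    print(f"reported rate {r_tilde}: utility {u:.4f}")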

4.3 Algorithm

We present the following algorithm, FedStaleWeight, applying the above weighting to each aggregation. In practice, to compute the expected staleness \mathbb{E}[\tau_{i}] for each agent i, we maintain a moving average of the previous staleness values observed for that agent. Our method computes a weighting to apply over the aggregation buffer, meaning it can be combined with methods like weighted secure aggregation [15] to protect client gradients.

Input: buffer size b, number of aggregation rounds N, global learning rate \eta_{g}, initial model w_{0}
Initialize n \leftarrow 0, B \leftarrow \{\}
while n < N do
       Receive (\Delta_{i}, k) from any client i and add to buffer B
       if |B| = b then
             for j = 1:b do
                   \alpha_{j} = \frac{\mathbb{E}[\tau_{j}]\cdot b + 1}{n}
             end for
             \alpha = \text{Normalize}(\{\alpha_{1},...,\alpha_{b}\})
             w_{n+1} = w_{n} + \eta_{g}\sum_{(\Delta_{j},\tau_{j})\in B}{\alpha_{j}\Delta_{j}}
             B \leftarrow \{\}, n \leftarrow n+1
       end if
end while
Output: global model w_{N}
Algorithm 1 FedStaleWeight-server
Input: client learning rate \eta_{\ell}, number of client SGD steps Q
while Server Running do
       Request latest model w_{k} from server
       y_{0} \leftarrow w_{k}
       for q = 1:Q do
             y_{q} = y_{q-1} - \eta_{\ell}\,g_{q}(y_{q-1})
       end for
      \Delta = y_{Q} - y_{0}
       Send (\Delta, k) to server buffer
end while
Algorithm 2 FedStaleWeight-client
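For readers who prefer code to pseudocode, a condensed Python sketch of the server side of Algorithm 1 follows. Model updates are represented as NumPy arrays, the per-client moving average mentioned above is implemented as an exponential moving average with an arbitrary decay constant, and the blocking receive_update stub is a placeholder for whatever transport the deployment uses; none of these specifics are prescribed by the paper.

import numpy as np
from collections import defaultdict

def fed_stale_weight_server(w0, receive_update, b=5, num_rounds=1000, eta_g=1.0):
    """Buffered aggregation with staleness-based re-weighting (sketch of Algorithm 1)."""
    w, n = w0.copy(), 0
    buffer = []                                   # holds (client_id, delta) pairs
    avg_staleness = defaultdict(float)            # moving average approximating E[tau_i]
    while n < num_rounds:
        client_id, delta, base_version = receive_update()   # blocking stub, assumed API
        staleness = n - base_version
        avg_staleness[client_id] = 0.9 * avg_staleness[client_id] + 0.1 * staleness
        buffer.append((client_id, delta))
        if len(buffer) == b:
            # un-normalized weights from expected staleness, then normalize over the buffer
            raw = np.array([avg_staleness[cid] * b + 1.0 for cid, _ in buffer])
            alphas = raw / raw.sum()
            w = w + eta_g * sum(a * d for a, (_, d) in zip(alphas, buffer))
            buffer, n = [], n + 1
    return w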

5 Theoretical Results

In this section, we provide a convergence guarantee for FedStaleWeight in the non-convex, smooth setting. Since FedStaleWeight effectively upweights stale updates in addition to operating in the asynchronous setting, it is essential to understand the relationship between convergence and our methods of reweighting to create asynchronous fairness. The full proof can be found in Appendix A.

Notation. We use the following notation and assumptions throughout, as used in FedBuff [2]: [n] represents the set of all clients, \nabla F_{i}(w) is the gradient with respect to the loss on client i’s data, f(w^{*}) is the minimum of f(w), g_{i}(w;\zeta_{i}) denotes the stochastic gradient on client i, b is the buffer size for aggregation as noted above, and Q is the number of local steps taken by each client. Similar to other seminal works in federated learning [6, 10, 17, 18], we take the following assumptions:

Assumption 1.

(Unbiasedness of client stochastic gradient)   \mathbb{E}_{\zeta}[g_{i}(w;\zeta_{i})]=\nabla F_{i}(w)

Assumption 2.

(Bounded local and global variance) for all clients i \in [n],

\mathbb{E}_{\zeta_{i}}[\|g_{i}(w;\zeta_{i})-\nabla F_{i}(w)\|^{2}]\leq\sigma_{\ell}^{2}
\frac{1}{m}\sum_{i=1}^{m}\|\nabla F_{i}(w)-\nabla f(w)\|^{2}\leq\sigma_{g}^{2}
Assumption 3.

(Bounded gradient)   \|\nabla F_{i}\|^{2}\leq G for all i \in [n]

Assumption 4.

(Lipschitz gradient)   for all clients i \in [n], the gradient is L-smooth,

\|\nabla F_{i}(w)-\nabla F_{i}(w^{\prime})\|^{2}\leq L\|w-w^{\prime}\|^{2}
Assumption 5.

(Bounded Staleness)   For all clients i \in [n] and for each server step t, the staleness \tau_{i}(t) between the model version a FedStaleWeight client uses to start local training and the model version in which the aggregated update \Delta^{i} is used to modify the global model is not larger than \tau_{\text{max},1} when b=1. Moreover, any buffered asynchronous aggregation with b>1 has maximum delay \tau_{\text{max},b} at most \lceil\tau_{\text{max},1}/b\rceil [2].

Theorem 1.

Let \eta_{\ell}^{(q)} be the local learning rate of client SGD in the q-th local step, and let \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}, \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, and let U(\tau_{\max,b}) be the maximum squared deviation between equal weighting (1/b) and a normalized \alpha_{j} (Equation 11). Choosing \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all local steps q=0,...,Q-1, the model iterates of Algorithm 1 achieve the following ergodic convergence rate:

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split}

6 Empirical Results

6.1 Simulation Details

We implement a generic testbench to empirically verify our algorithm above. The simulation environment consists of concurrently running clients with private data and a centralized server and buffer, where an aggregation strategy can be specified.

  1. Clients each run on their own process thread concurrently, continuously pulling the latest available model from the global server and training it on their local private data before communicating the update to the server and repeating. Clients are initialized with both an individual slice of the overall dataset being evaluated (split into training and evaluation) and a private runtime model representing a training delay distribution, from which the client process samples and sleeps for that duration to simulate local training delays.

  2. The server aggregates updates from the global buffer and updates the global model continuously. The server is initialized with a specified aggregation strategy and an aggregation buffer to pull updates from: once the buffer fills to a specific size, aggregation is performed and the global model is updated, after which clients will begin to pull that new model once they finish training. Clients broadcast their updates once training and their simulated delays complete, and their update is appended to the global buffer.

The testbench allows users to specify custom aggregation strategies by implementing a simple callback function, from which evaluation against other baselines can be computed, logged and compared. We also implement the FedAvg baseline in this testbench.
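As an illustration of what such a callback might look like (the exact signature in the released testbench may differ; the shape of the buffer entries here is an assumption), a buffered FedAvg baseline could be written as:

import numpy as np

def buffered_fedavg(global_model, buffer):
    """Hypothetical aggregation callback: average the buffered deltas equally.
    `buffer` is assumed to be a list of (delta, staleness) pairs."""
    mean_delta = np.mean([delta for delta, _ in buffer], axis=0)
    return global_model + mean_delta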

6.2 Dataset Distribution

We test our algorithm empirically on two datasets, CIFAR10 and FashionMNIST. For both datasets, we model a system of 15 agents: 10 “fast” agents with training delays randomly sampled from the uniform distribution D_{\text{fast}}=U(1,2) and 5 “slow” agents with training delays D_{\text{slow}}=U(8,12). Data is distributed in a non-IID fashion, and we reserve 20% of data across all labels for testing global model accuracy to measure overall generalization. For the remainder, the fast agents are given data points with labels 4 through 9, IID-distributed across fast agents such that no two agents share a data point. The slow agents are given labels 0 through 3, similarly IID-distributed. The results are shown in Figure 2.

Specifically, we model a scenario where low-compute agents have exclusive access to a non-trivial subset of all data: in this setting, higher fairness equates to a model with higher global accuracy, as demonstrated in our empirical results. We run FashionMNIST and CIFAR10 for 4,000 and 16,000 aggregations respectively. We evaluate FedAvg and our FedStaleWeight with a buffer size of b=5 and a local learning rate of \eta_{\ell}=0.01 with Q=1 local step. We clearly observe that our algorithm FedStaleWeight converges more quickly to a higher server test accuracy, indicating better generalization and less bias towards fast update-producing agents.
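The label-based split described above can be reproduced with a short partitioning routine; the sketch below assumes a flat array of integer labels, uses illustrative function and variable names, and omits the 20% test holdout for brevity.

import numpy as np

def split_non_iid(labels, num_fast=10, num_slow=5, seed=0):
    """Assign indices with labels 4-9 to fast agents and 0-3 to slow agents,
    IID within each group (a sketch of the experimental split described above)."""
    rng = np.random.default_rng(seed)
    fast_idx = np.where(labels >= 4)[0]
    slow_idx = np.where(labels <= 3)[0]
    rng.shuffle(fast_idx)
    rng.shuffle(slow_idx)
    fast_shards = np.array_split(fast_idx, num_fast)
    slow_shards = np.array_split(slow_idx, num_slow)
    # Per-agent training delay ranges, matching D_fast = U(1, 2) and D_slow = U(8, 12)
    delays = [(1.0, 2.0)] * num_fast + [(8.0, 12.0)] * num_slow
    return fast_shards + slow_shards, delays

shards, delays = split_non_iid(np.random.randint(0, 10, size=60000))
print(len(shards), [len(s) for s in shards[:3]], delays[0], delays[-1])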

Refer to caption
Refer to caption
Figure 2: Example test accuracy curves for FedStaleWeight versus buffered FedAvg. FedStaleWeight more quickly converges to higher accuracy under the same non-IID setting, indicating higher fairness aggregating agent updates.

7 Discussion

In this paper, we tackle the FL problem of training a model through updates received from a large number of edge devices. Our novel asynchronous technique for aggregating updates, FedStaleWeight, takes into account the staleness of client updates in order to re-weight them fairly. We argue that a slower client that sends fewer updates is undervalued in the aggregate under traditional averaging schemes, resulting in worse global models if the client’s data is unique; similarly, a faster client should not dominate the aggregate on compute speed alone. The algorithm utilizes a buffer that holds updates until full, at which time it performs the weighted aggregation computed by approximating expected staleness via a moving average and updates the global model accordingly.

By sending updates to the server, each client implicitly reports an “update speed” to a mechanism, which is interpreted only via the update’s staleness. This idea allows us to derive a re-weighting that increases fairness while still incentivizing truthfulness, ensuring that no agent wishes to intentionally slow down. This is critical when the aggregation criterion uses expected staleness, an inverse proxy for update speed, as the weighting. We show that the optimal weights are related to each client’s expected staleness, a quantity easily estimable after multiple rounds have elapsed. We provide work towards verifying that truthfulness guarantees hold when extending to online learning and per-buffer normalization by calibrating the mechanism parameters.

Finally, we provide an ergodic convergence guarantee for FedStaleWeight and verify its efficacy through simulation. Our testbench simulates concurrent client threads and a server that executes our buffered aggregation algorithm while supporting the implementation of other aggregation techniques. This environment further serves as a useful resource for future work in the AFL space. In comparison to classic aggregation methods like FedAvg, we observe that FedStaleWeight converges faster and to a more accurate result, avoiding the pitfall of heavily weighting fast clients who may not be representative of the whole population.

There are several open questions, which we leave as future work to build upon. First, our framework assumes continuous client participation, when many real-world settings observe clients participating and dropping out frequently, necessitating more exploration into a scheme robust to probabilistic agent participation while maintaining fairness guarantees. We also assume that manipulation is zero-cost, when throttling could bring benefits of lower power consumption, affecting client utility. However, our work represents a strong step in narrowing the compute fairness gap between synchronous and asynchronous federated learning.

8 Acknowledgements

We are hugely grateful to Safwan Hossain and the CS236R course for their support and mentorship during this project. This work was also supported by the FASRC computing cluster at Harvard University.

References

  • [1] Chao Huang, Jianwei Huang, and Xin Liu. Cross-silo federated learning: Challenges and opportunities, 2022.
  • [2] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pages 3581–3607. PMLR, 2022.
  • [3] Swanand Kadhe, Nived Rajaraman, O. Ozan Koyluoglu, and Kannan Ramchandran. Fastsecagg: Scalable secure aggregation for privacy-preserving federated learning, 2020.
  • [4] Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with non-iid data, 2020.
  • [5] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.
  • [6] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • [7] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • [8] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • [9] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
  • [10] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [11] Wentai Wu, Ligang He, Weiwei Lin, Rui Mao, Carsten Maple, and Stephen Jarvis. Safa: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Transactions on Computers, 70(5):655–668, 2020.
  • [12] Zhifeng Jiang, Wei Wang, Baochun Li, and Bo Li. Pisces: efficient federated learning via guided asynchronous training. In Proceedings of the 13th Symposium on Cloud Computing, pages 370–385, 2022.
  • [13] Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. Asyncfeded: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797, 2022.
  • [14] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.
  • [15] Jiale Guo, Ziyao Liu, Kwok-Yan Lam, Jun Zhao, Yiqiang Chen, and Chaoping Xing. Secure weighted aggregation for federated learning, 2021.
  • [16] Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. A general theory for federated optimization with asynchronous and heterogeneous clients updates. Journal of Machine Learning Research, 24(110):1–43, 2023.
  • [17] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning, 2021.
  • [18] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning, 2018.
  • [19] Andreea B. Alexandru and George J. Pappas. Private weighted sum aggregation. IEEE Transactions on Control of Network Systems, 9(1):219–230, 2022.
  • [20] Dzmitry Huba, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour, Carole-Jean Wu, Hongyuan Zhan, Pavel Ustinov, Harish Srinivas, et al. Papaya: Practical, private, and scalable federated learning. Proceedings of Machine Learning and Systems, 4:814–832, 2022.
  • [21] Donald Shenaj, Marco Toldo, Alberto Rigon, and Pietro Zanuttigh. Asynchronous federated continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 5055–5063, June 2023.
  • [22] Zachary Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, and Virginia Smith. On large-cohort training for federated learning. Advances in neural information processing systems, 34:20461–20475, 2021.
  • [23] Marten van Dijk, Nhuong V Nguyen, Toan N Nguyen, Lam M Nguyen, Quoc Tran-Dinh, and Phuong Ha Nguyen. Asynchronous federated learning with reduced number of rounds and with differential privacy from less aggregated gaussian noise. arXiv preprint arXiv:2007.09208, 2020.
  • [24] Jinhyun So, Ramy E Ali, Başak Güler, and A Salman Avestimehr. Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177, 2021.
  • [25] Yifan Shi, Yingqi Liu, Yan Sun, Zihao Lin, Li Shen, Xueqian Wang, and Dacheng Tao. Towards more suitable personalization in federated learning via decentralized partial model training, 2023.
  • [26] Yuchen Zeng, Hongxu Chen, and Kangwook Lee. Improving fairness via federated learning, 2022.
  • [27] Xuezhen Tu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Yang Zhang, and Juan Li. Incentive mechanisms for federated learning: From economic and game theoretic perspective, 2021.

Appendix

Appendix A Proof of Convergence

In this section, we prove the main convergence result for FedStaleWeight. Recall that FedStaleWeight updates can be described as the following update rule:

\begin{split}w^{t+1}&=w^{t}+\eta_{g}\overline{\Delta}^{t}\\&=w^{t}+\eta_{g}\sum_{k\in S^{t}}\left(\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}(t)}\right)\\&=w^{t}-\eta_{g}\sum_{k\in S^{t}}\left(\alpha^{\text{norm}}_{k,t}\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}g_{k}\left(y_{k,q}^{t-\tau_{k}(t)}\right)}\right)\end{split}
Table 1: Summary of notation
Description | Symbol
Number of server updates; server update index | T; t
Set of clients contributing to update index t | B_t
Number of clients; client index | n; i or k
Number of local steps per round; local step index | Q; q
Server model after t steps | w_t
Stochastic gradient at client i | g_i(w;\zeta_i) := g_i(w)
Local learning rate at local step q | \eta_\ell^{(q)}
Global learning rate | \eta_g
Number of clients per update (buffer size) | b
Local and global gradient variance | \sigma_\ell^2, \sigma_g^2
Delay or staleness of client i’s model update for the t-th server update | \tau_i(t)
Maximum staleness for buffer size b | \tau_{\max,b}
Normalized FedStaleWeight re-weighting for agent k in buffer B_t, such that \sum_{k\in B_t}\alpha^{\text{norm}}_{k,t}=1 | \alpha^{\text{norm}}_{k,t}

As in [2], in addition to our assumptions above, we assume that B_t is a uniform subset of [n]: in other words, any client is equally likely to contribute in any given round. In practice, we can ensure this is the case by allowing the server to sample a client again only once the update round that client contributed to is complete.

Theorem 2.

Let \eta_{\ell}^{(q)} be the local learning rate of client SGD in the q-th local step, and let \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}, \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, and let U(\tau_{\max,b}) be the maximum squared deviation between equal weighting (1/b) and a normalized \alpha_{j} (Equation 11). Choosing \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all local steps q=0,...,Q-1, the model iterates of Algorithm 1 achieve the following ergodic convergence rate:

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split} \qquad (10)

We first restate a useful lemma from [2] which we use below.

Lemma 1.

\mathbb{E}\left[\|g_{k}\|^{2}\right]\leq(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G), where the total expectation \mathbb{E}[\cdot] is evaluated over the randomness with respect to client participation and the stochastic gradient taken by a client.

Proof. By the L-smoothness assumption,

\begin{split}f(w^{t+1})&\leq f(w^{t})+\eta_{g}\langle\nabla f(w^{t}),\overline{\Delta}^{t}\rangle+\frac{L\eta_{g}^{2}}{2}\left\|\overline{\Delta}^{t}\right\|^{2}\\&\leq f(w^{t})\underbrace{+\eta_{g}\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left\langle\nabla f(w^{t}),\Delta_{k}^{t-\tau_{k}}\right\rangle}}_{T_{1}}+\underbrace{\frac{L\eta^{2}_{g}}{2}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}}}\right\|^{2}}_{T_{2}}\end{split}

We then derive upper bounds on T_{1} and T_{2}. We expand T_{1} as follows:

T_{1}=\eta_{g}\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left\langle\nabla f(w^{t}),\Delta_{k}^{t-\tau_{k}}\right\rangle}=-\eta_{g}\sum_{k\in B_{t}}{\sum_{q=0}^{Q-1}{\alpha^{\text{norm}}_{k,t}\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}}

Using conditional expectation, we can expand the expectation as

\mathbb{E}[\cdot]=\mathbb{E}_{\mathcal{H}}\mathbb{E}_{i\sim[n],g_{i}|i,\mathcal{H}}[\cdot]

where \mathbb{E}_{\mathcal{H}} takes the expectation over the history of time-steps, and \mathbb{E}_{B_{t}\sim[n]} takes the expectation over the distribution of clients contributing at time-step t over all n clients and over the stochastic gradient of one step on a client.

\begin{split}\mathbb{E}[T_{1}]&=-\mathbb{E}\left[\eta_{g}\sum_{k\in B_{t}}{\sum_{q=0}^{Q-1}{\alpha^{\text{norm}}_{k,t}\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),\;g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}}\right]\\&=-\eta_{g}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left[\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left[\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\mathbb{E}_{g_{k}|k,\mathcal{H}}\left\langle\nabla f(w^{t}),\;g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}\right]}\right]\\&=-\eta_{g}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left[\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),\;\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\nabla F_{k}\left(y_{k,q}^{t-\tau_{k}}\right)}\right\rangle}\right]\end{split}

Using the identity

\langle a,b\rangle=\frac{1}{2}\left(\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}\right)

we further expand as follows:

\begin{split}\mathbb{E}[T_{1}]&=-\frac{\eta_{g}}{2}\left(\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}\right)\|\nabla f(w^{t})\|^{2}\underbrace{-\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}\left(y_{k,q}^{t-\tau_{k}}\right)}\right\|^{2}\right)}}_{T_{3}}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\right)}\end{split}

We expand T_{2} as follows:

\begin{split}\mathbb{E}[T_{2}]&=\mathbb{E}\left[\frac{L\eta^{2}_{g}}{2}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}}}\right\|^{2}\right]\\&=\frac{L\eta^{2}_{g}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\sum_{q=0}^{Q-1}{-\eta_{\ell}^{(q)}\mathbb{E}_{g_{k}|\mathcal{H},B_{t}}[g_{k}(y_{k,q}^{t-\tau_{k}})]}\right)}\right\|^{2}\\&=\frac{L\eta^{2}_{g}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right)}}\right\|^{2}\\&\leq\sum_{q=0}^{Q-1}{\frac{QL\eta^{2}_{g}\left(\eta_{\ell}^{(q)}\right)^{2}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right)}\right\|^{2}}\end{split}

We can choose step sizes such that \mathbb{E}[T_{2}]+\mathbb{E}[T_{3}]\leq 0; in particular, we can choose \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all q\in\{0,...,Q-1\}. We then get the following inequality:

\begin{split}f(w^{t+1})&\leq f(w^{t})-\frac{\eta_{g}}{2}\left(\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}\right)\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\underbrace{\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}}_{T_{4}}\right)\end{split}

We can telescope and expand T_{4} as follows. Note that this variance term after telescoping breaks down into four key portions:

\begin{split}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}[T_{4}]&=\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\\&=\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t})-\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})+\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})\\&\qquad\qquad-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}+\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}\\&\qquad\qquad-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y^{t-\tau_{k}}_{k,q})}+\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y^{t-\tau_{k}}_{k,q})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}\\&\leq 4\Biggl(\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t})-\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})\Bigg\|^{2}}_{\text{staleness error}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{sampling error}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{local client drift}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{reweighting error}}\Biggr)\end{split}

Using our assumptions on L-smoothness, we can further reduce this as follows. We note that, due to unbiasedness of the sample mean, the mean over all agents’ gradients is equal in expectation to the mean of the gradients of the agents sampled at each stage.

\begin{split}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}[T_{4}]&\leq\frac{4L^{2}}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}+\frac{4L^{2}}{b}\sum_{k\in B_{t}}{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|w^{t-\tau_{k}}-y_{k,q}^{t-\tau_{k}}\right\|^{2}}\\&\qquad\qquad+4\,\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\underbrace{\left\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}}_{T_{5}}\end{split}

We can produce an upper-bound on the staleness error (first term above) as follows:

\begin{split}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}&=\left\|\sum_{\rho=t-\tau_{i}}^{t-1}{(w^{\rho+1}-w^{\rho})}\right\|^{2}\\&=\eta_{g}^{2}\left\|\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{\alpha_{j_{\rho},\rho}^{\text{norm}}\sum_{l=0}^{Q-1}{\eta_{\ell}^{(l)}g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})}}}\right\|^{2}\end{split}

Taking the expectation of this with respect to \mathcal{H}, we get:

\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}\leq\eta_{g}^{2}b\tau_{i}\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{(\alpha_{j_{\rho},\rho}^{\text{norm}})^{2}\cdot\mathbb{E}_{\mathcal{H}}\left\|\sum_{l=0}^{Q-1}{\eta_{\ell}^{(l)}g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})}\right\|^{2}}}

Recalling our definition of \alpha^{\text{norm}}_{j_{\rho},\rho} and our assumption on bounded staleness, we see that, given a maximum staleness \tau_{\max,b}, the largest that \alpha^{\text{norm}}_{j_{\rho},\rho} can be occurs for an aggregation buffer containing one agent with maximum expected staleness \tau_{\max,b} and all other agents with zero staleness (and thus 1/n un-normalized weighting).

\begin{split}\alpha^{\text{norm}}_{j_{\rho},\rho}&\leq\frac{\frac{\tau_{\max,b}\cdot b+1}{n}}{\frac{\tau_{\max,b}\cdot b+1}{n}+\sum_{i=1}^{b-1}{\frac{1}{n}}}\\&\leq\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\end{split}

The expectation for the staleness error becomes bounded as follows, using the above result and Lemma 1:

\begin{split}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}&\leq\eta_{g}^{2}b\tau_{i}Q\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{\left(\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2}\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}\cdot\mathbb{E}_{\mathcal{H}}\left\|g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})\right\|^{2}}}}\\&\leq\eta_{g}^{2}b^{2}\max_{\tau_{i}}{\tau_{i}^{2}}\,Q\left(\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2}\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)3\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\leq 3Q\eta_{g}^{2}\tau_{\max,b}^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\end{split}

Likewise for the local drift error, we can bound it similarly using Lemma 1:

\begin{split}\mathbb{E}\|w^{t-\tau_{i}}-y_{i,q}^{t-\tau_{i}}\|^{2}&=\mathbb{E}\|y_{i,0}^{t-\tau_{i}}-y_{i,q}^{t-\tau_{i}}\|^{2}=\mathbb{E}\left\|\sum_{l=0}^{q-1}{\eta_{\ell}^{(l)}g_{i}(y_{i,l}^{t-\tau_{i}})}\right\|^{2}\\&\leq 3q\left(\sum_{l=0}^{q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\leq 3Q\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\end{split}

Finally, we can bound the re-weighting error as follows:

\begin{split}\mathbb{E}[T_{5}]&=\mathbb{E}\left\|\sum_{k\in B_{t}}{\left(\frac{1}{|B_{t}|}-\alpha^{\text{norm}}_{k,t}\right)\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\\&\leq\mathbb{E}\left[\sum_{k\in B_{t}}{\left\|\left(\frac{1}{|B_{t}|}-\alpha^{\text{norm}}_{k,t}\right)\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right\|^{2}}\right]\\&\leq b\cdot\mathbb{E}\left[\sum_{k\in B_{t}}{\left(\frac{1}{b}-\alpha^{\text{norm}}_{k,t}\right)^{2}\left\|\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right\|^{2}}\right]\end{split}

We note from before that the largest \alpha_{k,t}^{\text{norm}} can be is \frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}. Consequently, the smallest \alpha_{k,t}^{\text{norm}} can be occurs for an agent with zero expected staleness (minimizing the numerator) with all other buffer agents having maximum expected staleness (maximizing the denominator):

\begin{split}\alpha_{k,t}^{\text{norm}}&\geq\frac{\frac{1}{n}}{\frac{1}{n}+\sum_{i=1}^{b-1}{\frac{b\tau_{\max,b}+1}{n}}}\\&\geq\frac{1}{(b-1)(b\tau_{\max,b}+1)+1}\end{split}

We can then upper-bound as follows, denoting this quantity U(\tau_{\max,b}):

\left(\frac{1}{b}-\alpha_{k,t}^{\text{norm}}\right)^{2}\leq\max\left\{\left(\frac{1}{b}-\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2},\left(\frac{1}{b}-\frac{1}{(b-1)(b\tau_{\max,b}+1)+1}\right)^{2}\right\}=U(\tau_{\max,b}) \qquad (11)
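For concreteness, the deviation bound U(\tau_{\max,b}) of Equation 11 is straightforward to evaluate numerically; the short sketch below (with illustrative values of the staleness bound and buffer size) computes both candidate squared deviations and takes their maximum.

def U(tau_max, b):
    # Equation 11: maximum squared deviation between 1/b and a normalized weight
    upper = (b * tau_max + 1) / (b * (tau_max + 1))           # largest possible alpha_norm
    lower = 1.0 / ((b - 1) * (b * tau_max + 1) + 1)           # smallest possible alpha_norm
    return max((1.0 / b - upper) ** 2, (1.0 / b - lower) ** 2)

print(U(tau_max=3, b=5))   # hypothetical staleness bound and buffer size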

We arrive at the following bound on the expectation of T_{5}, given the above result and Assumption 3:

\mathbb{E}[T_{5}]\leq U(\tau_{\max,b})b^{2}G

Thus, combining everything and denoting \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}} and \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, we have

\begin{split}\mathbb{E}[f(w^{t+1})]&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(\frac{4L^{2}}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}+\frac{4L^{2}}{b}\sum_{k\in B_{t}}{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|w^{t-\tau_{k}}-y_{k,q}^{t-\tau_{k}}\right\|^{2}}\\&\qquad\qquad+4\,\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\bigg)\\&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(12L^{2}Q\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\bigg)\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(12QL^{2}\beta(Q)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\bigg)\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(4U(\tau_{\max,b})b^{2}G\bigg)\\&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+6\eta_{g}^{3}\tau_{\max,b}^{2}\alpha(Q)\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+6\eta_{g}\alpha(Q)\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\qquad+2b^{2}\eta_{g}U(\tau_{\max,b})\alpha(Q)G\end{split}

Rearranging and summing over t, we get the following:

\begin{split}\sum_{t=0}^{T}{\eta_{g}\alpha(Q)\|\nabla f(w^{t})\|^{2}}&\leq\sum_{t=0}^{T-1}2\left(\mathbb{E}[f(w^{t})]-\mathbb{E}[f(w^{t+1})]\right)\\&\qquad+12T\eta_{g}^{3}\tau_{\max,b}^{2}\alpha(Q)\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12T\eta_{g}\alpha(Q)\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\qquad+4Tb^{2}\eta_{g}U(\tau_{\max,b})\alpha(Q)G\end{split}

Thus, we conclude

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split}