
FedStaleWeight: Buffered Asynchronous Federated Learning with Fair Aggregation via Staleness Reweighting

Jeffrey Ma  Alan Tu  Yiling Chen  Vijay Janapa Reddi
Harvard University
Cambridge, MA 02134
jeffreyma@g.harvard.edu, alantu@college.harvard.edu
yiling@seas.harvard.edu, vj@eecs.harvard.edu
Abstract

Federated Learning (FL) endeavors to harness decentralized data while preserving privacy, facing challenges of performance, scalability, and collaboration. Asynchronous Federated Learning (AFL) methods have emerged as promising alternatives to their synchronous counterparts, which are bounded by the slowest agent, yet they add additional challenges in convergence guarantees, fairness with respect to compute heterogeneity, and incorporation of staleness in aggregated updates. Specifically, AFL biases model training heavily towards agents who can produce updates faster, leaving slower agents behind, who often also have differently distributed data which is not learned by the global model. Naively upweighting slow agents introduces incentive issues, where truly fast agents may falsely report updates at a slower speed to increase their contribution to model training. We introduce FedStaleWeight, an algorithm addressing fairness in aggregating asynchronous client updates by employing average staleness to compute fair re-weightings. FedStaleWeight reframes asynchronous federated learning aggregation as a mechanism design problem, devising a weighting strategy that incentivizes truthful compute speed reporting without favoring faster update-producing agents, upweighting agent updates based on staleness. Leveraging only observed agent update staleness, FedStaleWeight achieves more equitable aggregation on a per-agent basis. We both provide theoretical convergence guarantees in the smooth, non-convex setting and empirically compare FedStaleWeight against the commonly used asynchronous FedBuff with gradient averaging, demonstrating how it achieves stronger fairness, expediting convergence to a higher global model accuracy. Finally, we provide an open-source test bench to facilitate exploration of buffered AFL aggregation strategies, fostering further research in asynchronous federated learning paradigms. Code for experiments can be found at https://github.com/18jeffreyma/afl-bench

1 Introduction

1.1 Motivation and Background

In real-world scenarios, the task of training a model may be distributed among up to millions of agents, called edge devices, each of which contributes its own limited (and perhaps private) set of data, a paradigm known as federated learning (FL). Our paper focuses on cross-device FL, in which a large number of agents must participate in order to extract meaningful results. The other type of FL, cross-silo, utilizes fewer agents because each one has more data and participates actively in the model training process, like a consortium of hospitals sharing information about a new disease [1].

Cross-device FL faces two major challenges when put into practice: scalability and privacy. In a massive distributed network, it’s impossible for every agent to be active at every time step, or for all agents to train their private models at the same speed. Naive aggregation techniques rely on concurrent reports from agents because, for instance, a simple average may be taken over reports in order to update the global model at regular intervals. By relaxing the concurrency constraint, we move away from synchronous FL, and the problem becomes how to best aggregate updates arriving at different points in time. On the topic of privacy, agents want their personal data to remain fully confidential; the global model must therefore evolve in a way that does not reveal individual contributions through gradient updates or the like [2]. Recent work in secure aggregation (SecAgg) and differential privacy provide frameworks for exploring the tradeoffs between privacy and utility [3].

Asynchronous federated learning (AFL) does not rely on all clients finishing and communicating their results in the same round, and differs from its synchronous counterpart in the following ways:

  • Synchronicity: Clients are assumed to be heterogeneous and may return updates at different times or fail to return updates in time, necessitating mechanisms that do not wait for all participants before updating the global model through aggregation.

  • Staleness: As some models take longer to update, AFL mechanisms must often merge stale updates into the global model (i.e., model version T is deployed to a client, the client takes a long time to train, and it replies with its update after the global server has already aggregated other updates and stepped to model version T+1). Emerging research in AFL hypothesizes that these stale updates are likely still useful and should be incorporated into learning.

  • Convergence challenges: AFL often experiences higher training instability based on the strategy chosen for aggregating stale updates, similar to issues encountered in off-policy reinforcement learning (RL).

1.2 Our Contribution

We observe that the asynchronicity in AFL naturally leaves slow updating agents behind: since updates are no longer incorporated synchronously and are incorporated on demand, agents who are not able to update as quickly will contribute less to global model learning. However, naively up-weighting slow agent updates creates incentive issues: if our upweighting is too extreme, we incentivize fast updating agents to throttle and falsely report at slower speeds to increase their contribution to learning.

Thus, we explore the problem of fairly aggregating updates from clients in AFL to offset the unfairness introduced by compute heterogeneity in asynchronous federated learning. Several synchronous approaches such as FedAvg and FedProx exist, and more nuanced asynchronous approaches have been explored, such as a fixed-length buffer with averaging [2] or updating the central model immediately on each arrival [4]. To our knowledge, more nuanced weighting of stale updates or an adaptive approach has not yet been explored. Our contribution is the following:

  1. We reframe the problem of asynchronous federated learning aggregation as a welfare maximization problem and derive a gradient aggregation method which is strategy-proof against fast agents who may throttle their update rate to have their updates weighted higher, while increasing the contribution of slower update-producing agents to global model training.

  2. We show how our derived fair weighting can be computed by observing agent update staleness, a quantity naturally visible in buffered asynchronous federated learning, and present an algorithm for fair federated learning, FedStaleWeight.

  3. We provide a convergence guarantee for FedStaleWeight in the smooth, non-convex setting, showing that the upweighting of stale model updates can still yield convergence.

  4. We empirically show, in a series of realistic non-IID federated learning settings, how current AFL aggregation techniques are biased towards clients with higher throughput and how FedStaleWeight maintains fairness and subsequently converges to higher global model accuracy faster. We release an open-source test bench for other researchers to explore further buffered AFL aggregation strategies.

2 Related Work

Synchronous FL. Synchronous FL excels in the privacy dimension: when the principal must wait for all agents before aggregating their data, it becomes nearly impossible to recover an individual contribution from an average over so many points. The slowest agent, however, determines the training pace of the global model, resulting in a bottleneck. If the principal were to perform aggregation with only, say, the fastest half of agents, then an element of selection bias is introduced. These concerns aside, numerous synchronous techniques have found experimental success, and their primary features can be summarized as follows:

  • FedAvg: McMahan et al. (2016) coined the term “federated learning” and suggest a simple weighted average of gradients to update the global model, where the weights correspond to the p_i terms in the overall objective described in the next section [5].

  • FedProx: Li et al. (2020) improve upon FedAvg by adding a regularization term \frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_{g}\|^{2} to each local objective F_i, where \mathbf{x}_{g} is the current global model. This keeps each local model closer to the global model, but the method converges under a more restrictive set of assumptions [6].

  • FedAvgM: Hsu et al. (2019) also employ a form of regularization through server momentum, which accumulates past gradients in order to stabilize and accelerate training [7].

  • FedNova: Wang et al. (2020) note that heterogeneous update speeds can cause objective inconsistency, where the server model approaches a local optimum that does not match the global optimum of the original objective function [8]. The authors provide a formal analysis of this phenomenon and propose FedNova, which normalizes each client’s gradient by the number of updates that the client has sent in the current round before taking the overall average for a round. This technique encompasses both FedAvg and FedProx and has been shown to outperform both of them.

Asynchronous FL. AFL is a natural fit in the cross-device setting due to heterogeneity among agents’ availability as well as training and data transfer speeds. Here, we provide an overview of the most prominent work in AFL to date:

  • FedAsync: Xie et al. (2019) gave the first convergence guarantees for the AFL problem; their fully asynchronous method forces the server to update its model after every client update [9]. As later works observe, this is not only computationally taxing but insecure.

  • ASO-Fed: Chen et al. (2020) deal with the special case of online learning, where each agent’s local dataset is continuously growing as new observations arrive [4]. Newer data need not stem from the same distribution as older data, which means that a robust AFL framework must be able to handle non-IID data.

  • FedAdaGrad, FedAdam, FedYogi: Reddi et al. (2020) tailor the well-known adaptive optimizers AdaGrad, Adam, and Yogi to the FL problem [10].

  • SAFA: Wu et al. (2020) propose Semi-Asynchronous Federated Averaging (SAFA) to deal with the issues of stragglers, dropouts, and staleness [11].

  • Pisces: Jiang et al. (2022) use a custom scoring mechanism to choose which agents will participate in the global model training at a given time step [12]. Their utility function attempts to distinguish between slow clients with high-quality data and those which are just slow.

  • AsyncFedED: Wang et al. (2022) aggregate client data by calculating a “Euclidean distance” between stale models and the global model and then weighting each client’s update appropriately [13].

  • FedBuff: In order to avoid full asynchronicity, Nguyen et al. (2022) hold agent updates in a buffer. Only when the buffer is full does the data get aggregated and pushed to the server [2]. This work directly inspired and most closely resembles our work. SecAgg [14] [15] can also be integrated to maintain privacy guarantees.

  • FedFix: Fraboni et al. (2023) try to unify a large number of the above methods through their FedFix framework: it uses stochastic weights during aggregation [16].

3 Preliminaries

3.1 Asynchronous FL Setting

The federated optimization problem with m agents can be formulated as:

\min_{\mathbf{x}\in\mathbb{R}^{d}}\left[f(\mathbf{x}):=\sum_{i=1}^{m}p_{i}F_{i}(\mathbf{x})\right] \qquad (1)

where p_i is the importance of the i-th agent (often assumed to be proportional to the size of each client’s dataset, i.e. p_i = n_i/n, or equal for all agents, p_i = 1/n). F_i represents agent i’s local objective function and is only available to agent i.

Specifically, we study the buffered asynchronous federated learning setting, consisting of a single central server with an update aggregation buffer and m clients. The server only updates once b updates arrive at the buffer, where b is a tunable parameter. Each client has its own private data, pulls the latest version of the global model, trains it on its private dataset, and communicates the update back to the server, where it is appended to the buffer to later be aggregated (Figure 1). The asynchronous setting introduces the challenge of staleness, where an update applied to the global model at version v may have been computed from a model with version smaller than v.

In the buffered AFL setting, we denote the staleness \tau_i of agent i’s update as the positive difference in version number between the current global model and the global model that the update was computed from. Formally, we denote a global model with version n and with 0 client training steps as w_{n,0}. For a client i, we denote the global model with version n' before and after training for q local steps as w_{n',0} and w_{n',q}, respectively; the client’s update is denoted \Delta_i = w_{n',q} - w_{n',0}. Given the current global model w_{n,0} and update \Delta_i, the staleness is thus \tau_i = n - n'.
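To make the bookkeeping concrete, the following is a minimal Python sketch of how a server might track version numbers and compute the staleness \tau_i = n - n' for an arriving update. The class and field names are illustrative only and are not part of the released implementation.

from dataclasses import dataclass

@dataclass
class ClientUpdate:
    delta: list          # w_{n',q} - w_{n',0}, the client's model delta (placeholder type)
    base_version: int    # n', the global model version the client trained from

class VersionTracker:
    def __init__(self):
        self.global_version = 0  # n, incremented after every buffer aggregation

    def staleness(self, update: ClientUpdate) -> int:
        # tau_i = n - n': how many versions the global model advanced while the client trained
        return self.global_version - update.base_version

# Example: client pulled version 3, server has since stepped to version 5 -> staleness 2
tracker = VersionTracker()
tracker.global_version = 5
print(tracker.staleness(ClientUpdate(delta=[], base_version=3)))  # prints 2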

Refer to caption
Figure 1: Aggregation in buffered asynchronous federated learning. Updates accumulate in the buffer over time: for a buffer size of b = 4, the diagram shows aggregation of updates and how model versions are incremented for each arriving update.

3.2 Defining Fairness in Update Aggregation

Each FL client i represents its true update reporting speed as a random variable \mathbf{R}_i, where we denote the mean reporting speed as r_i = \mathbb{E}[\mathbf{R}_i]. Each agent i chooses to report a random variable \tilde{\mathbf{R}}_i with mean \tilde{r}_i = \mathbb{E}[\tilde{\mathbf{R}}_i] as their observed update reporting speed. We assume for simplicity that \tilde{r}_i \in [0, r_i], following the intuition of fixed hardware specs: clients cannot manipulate to appear faster on average than their true compute speed allows. Given reported update speeds, the expected proportion that each agent contributes to learning without re-weighting is:

p_{i}(\tilde{r}_{i})=\frac{\tilde{r}_{i}}{\sum_{j=1}^{n}{\tilde{r}_{j}}} \qquad (2)

A fair re-weighting \alpha(\cdot) takes p_i(\tilde{r}_i) and returns a weighting close to equal for every participating learning agent. In other words, we seek an \alpha(\cdot) that solves the following constrained welfare maximization problem:

\min_{\alpha(\cdot)}\sum_{i=1}^{n}{\left\lVert\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})-\frac{1}{n}\right\rVert_{2}}\quad\text{ such that }\quad\sum_{i=1}^{n}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})}=1 \qquad (3)

Clients seek to maximize a utility representing their effective influence (and thus their accuracy), modelled as their re-weighting times the overall proportional influence of their reported mean update rate. We use this utility function to justify strategy-proofness in Section 4.2.1.

\max_{\tilde{r}_{i}\in[0,r_{i}]}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})} \qquad (4)

3.3 Step Size as Uncertainty in Approximation

Recall that computing the update step u^* for a client with loss L(\mathbf{x}) at point \mathbf{x} can be framed as the solution of the following minimization problem, where the step size \eta represents the uncertainty in using the approximation L(\mathbf{x}+u^*) \approx L(\mathbf{x}) + u^{*\top}\nabla L(\mathbf{x}) for stochastic gradient descent:

u^{*}=\operatorname*{arg\,min}_{u\in\mathbb{R}^{d}}\left\{L(x)+u^{\top}\nabla L(x)+\frac{1}{2\eta}u^{\top}u\right\} \qquad (5)
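Setting the gradient of this objective with respect to u to zero makes the connection to the SGD step explicit (a one-line check, included here for completeness):

\nabla_{u}\left\{L(x)+u^{\top}\nabla L(x)+\frac{1}{2\eta}u^{\top}u\right\}=\nabla L(x)+\frac{1}{\eta}u=0\quad\Longrightarrow\quad u^{*}=-\eta\nabla L(x)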

Similarly, in federated learning, we wish to restrict our aggregated gradient to be no larger than the step size of any of the individual aggregated gradients for convergence reasons, necessitating that any solution \alpha(\cdot) must satisfy \sum_{i=1}^{n}{\alpha(p_{i}(\tilde{r}_{i}))p_{i}(\tilde{r}_{i})}=1 as noted above.

4 Algorithm

4.1 Solving the Welfare Maximization Problem

Naively, if \alpha(\cdot)=1 is chosen (all agents are weighted equally), we see that p_i(\tilde{r}_i) is strictly increasing with respect to \tilde{r}_i and all agents will report \tilde{r}_i = r_i. However, welfare is poor, since an agent’s influence \alpha(p_i(\tilde{r}_i))p_i(\tilde{r}_i) is exactly increasing with respect to its rate, meaning that slow agents are left behind as they cannot manipulate any higher than their true r_i.

We observe that \alpha(\cdot) should be monotonically decreasing with respect to its input, but not so quickly that the weighted influence \alpha(p(\cdot))p(\cdot) decreases as the rate increases. If it did, faster agents would be incentivized to throttle their rates to improve their own weighted influence. The optimal choice is \alpha(\cdot)=\frac{1}{n(\cdot)}, i.e. \alpha(p_i(\tilde{r}_i)) = 1/(n p_i(\tilde{r}_i)). With this choice, each agent’s weighted influence remains constant regardless of which reported \tilde{\mathbf{R}}_i they choose, and thus there is no incentive to report untruthfully. We note that this weighting achieves weak truthfulness, where agents are indifferent between reporting at any speed, since their utility stays the same regardless.
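As a quick numeric sanity check (a sketch with made-up reported rates), the snippet below computes each agent's unweighted influence p_i(\tilde{r}_i) and the weighted influence \alpha(p_i)p_i under \alpha(p) = 1/(np); the product is 1/n for every agent regardless of the reported rates, which is exactly the indifference property described above.

def influence(rates):
    total = sum(rates)
    return [r / total for r in rates]

def stale_weight(p, n):
    # alpha(p) = 1 / (n * p), the fair re-weighting derived above
    return 1.0 / (n * p)

rates = [5.0, 2.0, 1.0, 0.5]           # hypothetical reported mean update rates
n = len(rates)
p = influence(rates)
weighted = [stale_weight(pi, n) * pi for pi in p]
print(weighted)                         # each entry equals 1/n = 0.25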

4.2 Deriving Expected Influence From Staleness

In buffered AFL, we assume the global aggregator has access only to the aggregation buffer and not to the arrival times of individual agent updates, disallowing a weighting scheme that relies on estimating individual agent reporting rates from unique identifiers. We proceed to show how a fair weighting can be computed from staleness, a locally available quantity in asynchronous federated learning.

Given an initially empty buffer, we can compute the expected index of the first update from a given agent i with mean rate \tilde{r}_i by summing the expected number of updates from all agents over the expected time period for one update from agent i. For example, an agent twice as fast as i would be expected to produce two updates in the time it takes i to produce its first. Recall that staleness is the difference between the version number of the current global model and the version number of the global model from which an agent’s update was computed. We derive the following expression for the expected staleness of agent i.

\mathbb{E}[\text{staleness of agent $i$ with rate $\tilde{r}_{i}$}]=\mathbb{E}[\tau_{i}]=\frac{\sum_{j=1}^{n}{\frac{\tilde{r}_{j}}{\tilde{r}_{i}}}-1}{b} \qquad (6)

Reconciling this with our mechanism definitions above, we get the unweighted influence of agent i given its expected staleness.

p_{i}(\tilde{r}_{i})=\frac{1}{\mathbb{E}[\tau_{i}]\cdot b+1} \qquad (7)

Recalling our derived weighting from the mechanism, we get a formula for our weighting as a function of expected staleness. We note that under a synchronous setting of zero expected staleness, we recover the exactly equal weighting used in synchronous FedAvg.

\alpha(p_{i}(\tilde{r}_{i}))=\frac{\mathbb{E}[\tau_{i}]\cdot b+1}{n} \qquad (8)
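The sketch below (hypothetical rates and buffer size, not taken from our experiments) follows Equations 6-8: it computes each agent's expected staleness from the reported rates and recovers the same fair weighting as \alpha(p) = 1/(np), so only observed staleness is needed at aggregation time.

def expected_staleness(rates, i, b):
    # Equation 6: E[tau_i] = (sum_j r_j / r_i - 1) / b
    return (sum(r / rates[i] for r in rates) - 1.0) / b

def weight_from_staleness(exp_tau, b, n):
    # Equation 8: alpha = (E[tau_i] * b + 1) / n
    return (exp_tau * b + 1.0) / n

rates, b = [5.0, 2.0, 1.0, 0.5], 2      # hypothetical reported rates and buffer size
n = len(rates)
for i, r in enumerate(rates):
    tau = expected_staleness(rates, i, b)
    alpha = weight_from_staleness(tau, b, n)
    p = r / sum(rates)
    # alpha * p equals 1/n = 0.25 for every agent, matching the equal-influence target
    print(f"agent {i}: E[tau]={tau:.2f}, alpha={alpha:.3f}, alpha*p={alpha * p:.3f}")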

4.2.1 Extending To Online Learning

We reconcile the derived weighting above with the online setting of arriving client updates. As noted in Section 3.3, at any given aggregation, we should avoid updating the global model with a step size larger than that of any individual client; otherwise, global model training can quickly diverge. As such, given a buffer of updates to be aggregated, we compute the desired weightings per the above and normalize them to sum to 1.

However, this introduces complexity into our derivation from above. Intuitively, if an agent chooses to increase their expected staleness by slowing down their reporting speed, the best they can do is enjoy a single aggregation step to themselves at full step size; consequently, they participate in fewer aggregations later due to their slower reporting speed, and contribute less to learning overall. We show this analytically: we can compute the normalized re-weighting \alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}})) for agent i’s update in a buffer as the following:

\begin{split}\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))&=\frac{\mathbb{E}[\tau_{a_{i}}]\cdot b+1}{\sum_{j=1}^{b}{\left[\mathbb{E}[\tau_{a_{j}}]\cdot b+1\right]}}=\frac{\left[\sum_{k=1}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{i}}}}-1\right]+1}{\sum_{j=1}^{b}{\left[\left[\sum_{k=1}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}-1\right]+1\right]}}\\&=\frac{\frac{1}{\tilde{r}_{a_{i}}}\sum_{k\neq a_{i}}^{n}{\tilde{r}_{k}}+1}{b+\sum_{j=1}^{b}{\sum_{k\neq a_{j}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}}}\end{split}

Isolating \tilde{r}_{a_{i}}, we get the following:

\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))=\frac{\frac{1}{\tilde{r}_{a_{i}}}\sum_{k\neq a_{i}}^{n}{\tilde{r}_{k}}+1}{b+M}
\text{ where } M=\frac{1}{\tilde{r}_{a_{i}}}\sum_{k=1,k\neq a_{i}}^{n}{\tilde{r}_{k}}+\tilde{r}_{a_{i}}\sum_{j=1,a_{j}\neq a_{i}}^{b}{\frac{1}{\tilde{r}_{a_{j}}}}+\sum_{j=1,a_{j}\neq a_{i}}^{b}{\left[\sum_{k=1,k\notin\{a_{j},a_{i}\}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}\right]}

We further note that, as an agent, by manipulating \tilde{r}_{a_{i}} downwards, we increase expected staleness but also decrease the frequency of the agent’s updates. Recall from Section 3.2 that the utility of an agent is the normalized re-weighting times the proportional frequency of the agent’s updates based on the agent’s reported update speed.

u_{a_{i}}(\tilde{r}_{a_{i}})=\alpha_{\text{norm}}(p_{i}(\tilde{r}_{a_{i}}))\cdot\left(\frac{\tilde{r}_{a_{i}}}{\tilde{r}_{a_{i}}+\sum_{j=1,a_{j}\neq a_{i}}^{n}{r_{a_{j}}}}\right) \qquad (9)

Denoting A=\sum_{k=1,k\neq a_{i}}^{n}{\tilde{r}_{k}}, C=\sum_{j=1,a_{j}\neq a_{i}}^{b}{1/\tilde{r}_{a_{j}}}, and D=\sum_{j=1,a_{j}\neq a_{i}}^{b}{\left[\sum_{k=1,k\notin\{a_{j},a_{i}\}}^{n}{\frac{\tilde{r}_{k}}{\tilde{r}_{a_{j}}}}\right]}, we take the derivative of the utility with respect to the agent’s reporting speed to get:

\frac{\partial u_{a_{i}}}{\partial\tilde{r}_{a_{i}}}=\frac{A-C\tilde{r}_{a_{i}}^{2}}{(C\tilde{r}_{a_{i}}^{2}+(b+D)\tilde{r}_{a_{i}}+A)^{2}}

Since our algorithm reweights based on expected staleness, we can rescale the reported update rates \tilde{r}_{a_{i}} linearly by any positive constant without changing the expected staleness (or our reweighting), making C\tilde{r}_{a_{i}}^{2} as small as possible and A as large as possible. Furthermore, since A increases with larger n, we observe that with a large enough client pool, \partial u_{a_{i}}/\partial\tilde{r}_{a_{i}}>0, indicating that agents want to truthfully report as high an update speed as possible, and our aggregation mechanism is strictly strategy-proof with respect to agent manipulation of \tilde{r}_{a_{i}}.
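As an illustrative (not exhaustive) numeric check of the derivative sign above, the sketch below fixes an assumed buffer composition and set of true rates, computes the normalized weight and the utility of Equation 9 for a range of reported rates, and confirms that utility increases with the reported rate when the client pool is reasonably large, so throttling is never beneficial. All rates, the pool size, and the buffer contents are hypothetical.

def exp_staleness(rates, i, b):
    # Equation 6 applied to the reported rates
    return (sum(r / rates[i] for r in rates) - 1.0) / b

def normalized_weight(rates, buffer_ids, i, b, n):
    # un-normalized alpha_j = (E[tau_j]*b + 1)/n, then normalized over the buffer
    raw = {j: (exp_staleness(rates, j, b) * b + 1.0) / n for j in buffer_ids}
    return raw[i] / sum(raw.values())

n, b = 20, 5
others = [1.0] * (n - 1)                 # hypothetical true rates of the other agents
buffer_ids = list(range(b))              # agent 0 plus four unit-rate agents in the buffer
prev = None
for r_tilde in [0.5, 1.0, 1.5, 2.0]:     # agent 0's candidate reported rates (true rate 2.0)
    rates = [r_tilde] + others
    u = normalized_weight(rates, buffer_ids, 0, b, n) * rates[0] / sum(rates)
    assert prev is None or u > prev      # utility strictly increases with the reported rate
    prev = u
    print(f"reported rate {r_tilde}: utility {u:.4f}")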

4.3 Algorithm

We present the following algorithm, FedStaleWeight, applying the above weighting to each aggregation. In practice, to compute the expected staleness \mathbb{E}[\tau_{i}] for each agent i, we maintain a moving average of the previous staleness values observed for that agent. Our method computes a weighting to apply over the aggregation buffer, meaning it can be combined with methods like weighted secure aggregation [15] to protect client gradients.

Input: buffer size b, number of aggregation rounds N, global learning rate \eta_{g}, initial model w_{0}
Initialize n \leftarrow 0, B \leftarrow \{\}
while n < N do
       Receive (\Delta_{i}, k) from any client i and add to buffer B
       if |B| = b then
             for j = 1:b do
                   \alpha_{j} = \frac{\mathbb{E}[\tau_{j}]\cdot b + 1}{n}
             end for
             \alpha = \text{Normalize}(\{\alpha_{1},...,\alpha_{b}\})
             w_{n+1} = w_{n} + \eta_{g}\sum_{(\Delta_{j},\tau_{j})\in B}{\alpha_{j}\Delta_{j}}
             B \leftarrow \{\}, n \leftarrow n+1
       end if
end while
Output: global model w_{N}
Algorithm 1 FedStaleWeight-server
Input: client learning rate \eta_{\ell}, number of client SGD steps Q
while Server Running do
       Request latest model w_{k} from server
       y_{0} \leftarrow w_{k}
       for q = 1:Q do
             y_{q} = y_{q-1} - \eta_{\ell}\,g_{q}(y_{q-1})
       end for
      \Delta = y_{Q} - y_{0}
       Send (\Delta, k) to server buffer
end while
Algorithm 2 FedStaleWeight-client
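For readers who prefer code to pseudocode, a condensed Python sketch of the server side of Algorithm 1 follows. Model updates are represented as NumPy arrays, the per-client moving average mentioned above is implemented as an exponential moving average with an arbitrary decay constant, and the blocking receive_update stub is a placeholder for whatever transport the deployment uses; none of these specifics are prescribed by the paper.

import numpy as np
from collections import defaultdict

def fed_stale_weight_server(w0, receive_update, b=5, num_rounds=1000, eta_g=1.0):
    """Buffered aggregation with staleness-based re-weighting (sketch of Algorithm 1)."""
    w, n = w0.copy(), 0
    buffer = []                                   # holds (client_id, delta) pairs
    avg_staleness = defaultdict(float)            # moving average approximating E[tau_i]
    while n < num_rounds:
        client_id, delta, base_version = receive_update()   # blocking stub, assumed API
        staleness = n - base_version
        avg_staleness[client_id] = 0.9 * avg_staleness[client_id] + 0.1 * staleness
        buffer.append((client_id, delta))
        if len(buffer) == b:
            # un-normalized weights from expected staleness, then normalize over the buffer
            raw = np.array([avg_staleness[cid] * b + 1.0 for cid, _ in buffer])
            alphas = raw / raw.sum()
            w = w + eta_g * sum(a * d for a, (_, d) in zip(alphas, buffer))
            buffer, n = [], n + 1
    return w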

5 Theoretical Results

In this section, we provide a convergence guarantee for FedStaleWeight in the non-convex, smooth setting. Since FedStaleWeight effectively upweights stale updates in addition to operating in the asynchronous setting, it is essential to understand the relationship between convergence and our methods of reweighting to create asynchronous fairness. The full proof can be found in Appendix A.

Notation. We use the following notation and assumptions throughout, as used in FedBuff [2]: [n] represents the set of all clients, \nabla F_{i}(w) is the gradient with respect to the loss on client i’s data, f(w^{*}) is the minimum of f(w), g_{i}(w;\zeta_{i}) denotes the stochastic gradient on client i, b is the buffer size for aggregation as noted above, and Q is the number of local steps taken by each client. Similar to other seminal works in federated learning [6, 10, 17, 18], we take the following assumptions:

Assumption 1.

(Unbiasedness of client stochastic gradient)   \mathbb{E}_{\zeta}[g_{i}(w;\zeta_{i})]=\nabla F_{i}(w)

Assumption 2.

(Bounded local and global variance) for all clients i \in [n],

\mathbb{E}_{\zeta_{i}}[\|g_{i}(w;\zeta_{i})-\nabla F_{i}(w)\|^{2}]\leq\sigma_{\ell}^{2}
\frac{1}{m}\sum_{i=1}^{m}\|\nabla F_{i}(w)-\nabla f(w)\|^{2}\leq\sigma_{g}^{2}
Assumption 3.

(Bounded gradient)   \|\nabla F_{i}\|^{2}\leq G for all i \in [n]

Assumption 4.

(Lipschitz gradient)   for all clients i \in [n], the gradient is L-smooth,

\|\nabla F_{i}(w)-\nabla F_{i}(w^{\prime})\|^{2}\leq L\|w-w^{\prime}\|^{2}
Assumption 5.

(Bounded Staleness)   For all clients i \in [n] and for each server step t, the staleness \tau_{i}(t) between the model version a FedStaleWeight client uses to start local training and the model version in which the aggregated update \Delta^{i} is used to modify the global model is not larger than \tau_{\text{max},1} when b=1. Moreover, any buffered asynchronous aggregation with b>1 has maximum delay \tau_{\text{max},b} at most \lceil\tau_{\text{max},1}/b\rceil [2].

Theorem 1.

Let \eta_{\ell}^{(q)} be the local learning rate of client SGD in the q-th local step, and let \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}, \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, and let U(\tau_{\max,b}) be the maximum squared deviation between equal weighting (1/b) and a normalized \alpha_{j} (Equation 11). Choosing \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all local steps q=0,...,Q-1, the model iterates of Algorithm 1 achieve the following ergodic convergence rate:

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split}

6 Empirical Results

6.1 Simulation Details

We implement a generic testbench to empirically verify our algorithm above. The simulation environment consists of concurrently running clients with private data and a centralized server and buffer, where an aggregation strategy can be specified.

  1. Clients each run on their own process thread concurrently, continuously pulling the latest available model from the global server and training it on their local private data before communicating the update to the server and repeating. Clients are initialized with both an individual slice of the overall dataset being evaluated (split into training and evaluation) and a private runtime model representing a training delay distribution, from which the client process samples and sleeps for that duration to simulate local training delays.

  2. The server aggregates updates from the global buffer and updates the global model continuously. The server is initialized with a specified aggregation strategy and an aggregation buffer to pull updates from: once the buffer fills to a specific size, aggregation is performed and the global model is updated, after which clients will begin to pull that new model once they finish training. Clients broadcast their updates once training and their simulated delays complete, and their update is appended to the global buffer.

The testbench allows users to specify custom aggregation strategies by implementing a simple callback function, from which evaluation against other baselines can be computed, logged and compared. We also implement the FedAvg baseline in this testbench.
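As an illustration of what such a callback might look like (the exact signature in the released testbench may differ; the shape of the buffer entries here is an assumption), a buffered FedAvg baseline could be written as:

import numpy as np

def buffered_fedavg(global_model, buffer):
    """Hypothetical aggregation callback: average the buffered deltas equally.
    `buffer` is assumed to be a list of (delta, staleness) pairs."""
    mean_delta = np.mean([delta for delta, _ in buffer], axis=0)
    return global_model + mean_delta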

6.2 Dataset Distribution

We test our algorithm empirically on two datasets, CIFAR10 and FashionMNIST. For both datasets, we model a system of 15 agents: 10 “fast” agents with training delays randomly sampled from the uniform distribution D_{\text{fast}}=U(1,2) and 5 “slow” agents with training delays D_{\text{slow}}=U(8,12). Data is distributed in a non-IID fashion, and we reserve 20% of data across all labels for testing global model accuracy to measure overall generalization. For the remainder, the fast agents are given data points with labels 4 through 9, IID-distributed across fast agents such that no two agents share a data point. The slow agents are given labels 0 through 3, similarly IID-distributed. The results are shown in Figure 2.

Specifically, we model a scenario where low-compute agents have exclusive access to a non-trivial subset of all data: in this setting, higher fairness equates to a model with higher global accuracy, as demonstrated in our empirical results. We run FashionMNIST and CIFAR10 for 4,000 and 16,000 aggregations respectively. We evaluate FedAvg and our FedStaleWeight with a buffer size of b=5 and a local learning rate of \eta_{\ell}=0.01 with Q=1 local step. We clearly observe that our algorithm FedStaleWeight converges more quickly to a higher server test accuracy, indicating better generalization and less bias towards fast update-producing agents.
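The label-based split described above can be reproduced with a short partitioning routine; the sketch below assumes a flat array of integer labels, uses illustrative function and variable names, and omits the 20% test holdout for brevity.

import numpy as np

def split_non_iid(labels, num_fast=10, num_slow=5, seed=0):
    """Assign indices with labels 4-9 to fast agents and 0-3 to slow agents,
    IID within each group (a sketch of the experimental split described above)."""
    rng = np.random.default_rng(seed)
    fast_idx = np.where(labels >= 4)[0]
    slow_idx = np.where(labels <= 3)[0]
    rng.shuffle(fast_idx)
    rng.shuffle(slow_idx)
    fast_shards = np.array_split(fast_idx, num_fast)
    slow_shards = np.array_split(slow_idx, num_slow)
    # Per-agent training delay ranges, matching D_fast = U(1, 2) and D_slow = U(8, 12)
    delays = [(1.0, 2.0)] * num_fast + [(8.0, 12.0)] * num_slow
    return fast_shards + slow_shards, delays

shards, delays = split_non_iid(np.random.randint(0, 10, size=60000))
print(len(shards), [len(s) for s in shards[:3]], delays[0], delays[-1])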

Refer to caption
Refer to caption
Figure 2: Example test accuracy curves for FedStaleWeight versus buffered FedAvg. FedStaleWeight more quickly converges to higher accuracy under the same non-IID setting, indicating higher fairness aggregating agent updates.

7 Discussion

In this paper, we tackle the FL problem of training a model through updates received from a large number of edge devices. Our novel asynchronous technique for aggregating updates, FedStaleWeight, takes into account the staleness of client updates in order to re-weight them fairly. We argue that a slower client that sends fewer updates is undervalued in the aggregate under traditional averaging schemes, resulting in worse global models if the client’s data is unique; similarly, a faster client should not dominate the aggregate on compute speed alone. The algorithm utilizes a buffer that holds updates until full, at which time it performs the weighted aggregation computed by approximating expected staleness via a moving average and updates the global model accordingly.

By sending updates to the server, each client implicitly reports an “update speed” to a mechanism, which is interpreted only via the update’s staleness. This idea allows us to derive a re-weighting that increases fairness while still incentivizing truthfulness, ensuring that no agent wishes to intentionally slow down. This is critical when the aggregation criterion uses expected staleness, an inverse proxy for update speed, as the weighting. We show that the optimal weights are related to each client’s expected staleness, a quantity easily estimable after multiple rounds have elapsed. We provide work towards verifying that truthfulness guarantees hold when extending to online learning and per-buffer normalization by calibrating the mechanism parameters.

Finally, we provide an ergodic convergence guarantee for FedStaleWeight and verify its efficacy through simulation. Our testbench simulates concurrent client threads and a server that executes our buffered aggregation algorithm while supporting the implementation of other aggregation techniques. This environment further serves as a useful resource for future work in the AFL space. In comparison to classic aggregation methods like FedAvg, we observe that FedStaleWeight converges faster and to a more accurate result, avoiding the pitfall of heavily weighting fast clients who may not be representative of the whole population.

There are several open questions, which we leave as future work to build upon. First, our framework assumes continuous client participation, when many real-world settings observe clients participating and dropping out frequently, necessitating more exploration into a scheme robust to probabilistic agent participation while maintaining fairness guarantees. We also assume that manipulation is zero-cost, when throttling could bring benefits of lower power consumption, affecting client utility. However, our work represents a strong step in narrowing the compute fairness gap between synchronous and asynchronous federated learning.

8 Acknowledgements

We are hugely grateful to Safwan Hossain and the CS236R course for their support and mentorship during this project. This work was also supported by the FASRC computing cluster at Harvard University.

References

  • [1] Chao Huang, Jianwei Huang, and Xin Liu. Cross-silo federated learning: Challenges and opportunities, 2022.
  • [2] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pages 3581–3607. PMLR, 2022.
  • [3] Swanand Kadhe, Nived Rajaraman, O. Ozan Koyluoglu, and Kannan Ramchandran. Fastsecagg: Scalable secure aggregation for privacy-preserving federated learning, 2020.
  • [4] Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with non-iid data, 2020.
  • [5] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.
  • [6] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2:429–450, 2020.
  • [7] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • [8] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • [9] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
  • [10] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [11] Wentai Wu, Ligang He, Weiwei Lin, Rui Mao, Carsten Maple, and Stephen Jarvis. Safa: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Transactions on Computers, 70(5):655–668, 2020.
  • [12] Zhifeng Jiang, Wei Wang, Baochun Li, and Bo Li. Pisces: efficient federated learning via guided asynchronous training. In Proceedings of the 13th Symposium on Cloud Computing, pages 370–385, 2022.
  • [13] Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. Asyncfeded: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797, 2022.
  • [14] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482, 2016.
  • [15] Jiale Guo, Ziyao Liu, Kwok-Yan Lam, Jun Zhao, Yiqiang Chen, and Chaoping Xing. Secure weighted aggregation for federated learning, 2021.
  • [16] Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. A general theory for federated optimization with asynchronous and heterogeneous clients updates. Journal of Machine Learning Research, 24(110):1–43, 2023.
  • [17] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning, 2021.
  • [18] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning, 2018.
  • [19] Andreea B. Alexandru and George J. Pappas. Private weighted sum aggregation. IEEE Transactions on Control of Network Systems, 9(1):219–230, 2022.
  • [20] Dzmitry Huba, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour, Carole-Jean Wu, Hongyuan Zhan, Pavel Ustinov, Harish Srinivas, et al. Papaya: Practical, private, and scalable federated learning. Proceedings of Machine Learning and Systems, 4:814–832, 2022.
  • [21] Donald Shenaj, Marco Toldo, Alberto Rigon, and Pietro Zanuttigh. Asynchronous federated continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 5055–5063, June 2023.
  • [22] Zachary Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, and Virginia Smith. On large-cohort training for federated learning. Advances in neural information processing systems, 34:20461–20475, 2021.
  • [23] Marten van Dijk, Nhuong V Nguyen, Toan N Nguyen, Lam M Nguyen, Quoc Tran-Dinh, and Phuong Ha Nguyen. Asynchronous federated learning with reduced number of rounds and with differential privacy from less aggregated gaussian noise. arXiv preprint arXiv:2007.09208, 2020.
  • [24] Jinhyun So, Ramy E Ali, Başak Güler, and A Salman Avestimehr. Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177, 2021.
  • [25] Yifan Shi, Yingqi Liu, Yan Sun, Zihao Lin, Li Shen, Xueqian Wang, and Dacheng Tao. Towards more suitable personalization in federated learning via decentralized partial model training, 2023.
  • [26] Yuchen Zeng, Hongxu Chen, and Kangwook Lee. Improving fairness via federated learning, 2022.
  • [27] Xuezhen Tu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Yang Zhang, and Juan Li. Incentive mechanisms for federated learning: From economic and game theoretic perspective, 2021.

Appendix

Appendix A Proof of Convergence

In this section, we prove the main convergence result for FedStaleWeight. Recall that FedStaleWeight updates can be described as the following update rule:

\begin{split}w^{t+1}&=w^{t}+\eta_{g}\overline{\Delta}^{t}\\&=w^{t}+\eta_{g}\sum_{k\in S^{t}}\left(\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}(t)}\right)\\&=w^{t}-\eta_{g}\sum_{k\in S^{t}}\left(\alpha^{\text{norm}}_{k,t}\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}g_{k}\left(y_{k,q}^{t-\tau_{k}(t)}\right)}\right)\end{split}
Table 1: Summary of notation
Description | Symbol
Number of server updates; server update index | T; t
Set of clients contributing to update index t | B_t
Number of clients; client index | n; i or k
Number of local steps per round; local step index | Q; q
Server model after t steps | w_t
Stochastic gradient at client i | g_i(w;\zeta_i) := g_i(w)
Local learning rate at local step q | \eta_\ell^{(q)}
Global learning rate | \eta_g
Number of clients per update (buffer size) | b
Local and global gradient variance | \sigma_\ell^2, \sigma_g^2
Delay or staleness of client i’s model update for the t-th server update | \tau_i(t)
Maximum staleness for buffer size b | \tau_{\max,b}
Normalized FedStaleWeight re-weighting for agent k in buffer B_t, such that \sum_{k\in B_t}\alpha^{\text{norm}}_{k,t}=1 | \alpha^{\text{norm}}_{k,t}

As in [2], in addition to our assumptions above, we assume that B_t is a uniform subset of [n]: in other words, any client is equally likely to contribute in any given round. In practice, we can ensure this is the case by allowing the server to sample a client again only once the update round that client contributed to is complete.

Theorem 2.

Let \eta_{\ell}^{(q)} be the local learning rate of client SGD in the q-th local step, and let \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}, \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, and let U(\tau_{\max,b}) be the maximum squared deviation between equal weighting (1/b) and a normalized \alpha_{j} (Equation 11). Choosing \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all local steps q=0,...,Q-1, the model iterates of Algorithm 1 achieve the following ergodic convergence rate:

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split} \qquad (10)

We first restate a useful lemma from [2] which we use below.

Lemma 1.

\mathbb{E}\left[\|g_{k}\|^{2}\right]\leq(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G), where the total expectation \mathbb{E}[\cdot] is evaluated over the randomness with respect to client participation and the stochastic gradient taken by a client.

Proof. By the L-smoothness assumption,

\begin{split}f(w^{t+1})&\leq f(w^{t})+\eta_{g}\langle\nabla f(w^{t}),\overline{\Delta}^{t}\rangle+\frac{L\eta_{g}^{2}}{2}\left\|\overline{\Delta}^{t}\right\|^{2}\\&\leq f(w^{t})\underbrace{+\eta_{g}\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left\langle\nabla f(w^{t}),\Delta_{k}^{t-\tau_{k}}\right\rangle}}_{T_{1}}+\underbrace{\frac{L\eta^{2}_{g}}{2}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}}}\right\|^{2}}_{T_{2}}\end{split}

We then derive upper bounds on T_{1} and T_{2}. We expand T_{1} as follows:

T_{1}=\eta_{g}\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left\langle\nabla f(w^{t}),\Delta_{k}^{t-\tau_{k}}\right\rangle}=-\eta_{g}\sum_{k\in B_{t}}{\sum_{q=0}^{Q-1}{\alpha^{\text{norm}}_{k,t}\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}}

Using conditional expectation, we can expand the expectation as

\mathbb{E}[\cdot]=\mathbb{E}_{\mathcal{H}}\mathbb{E}_{i\sim[n],g_{i}|i,\mathcal{H}}[\cdot]

where \mathbb{E}_{\mathcal{H}} takes the expectation over the history of time-steps, and \mathbb{E}_{B_{t}\sim[n]} takes the expectation over the distribution of clients contributing at time-step t over all n clients and over the stochastic gradient of one step on a client.

\begin{split}\mathbb{E}[T_{1}]&=-\mathbb{E}\left[\eta_{g}\sum_{k\in B_{t}}{\sum_{q=0}^{Q-1}{\alpha^{\text{norm}}_{k,t}\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),\;g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}}\right]\\&=-\eta_{g}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left[\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\left[\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\mathbb{E}_{g_{k}|k,\mathcal{H}}\left\langle\nabla f(w^{t}),\;g_{k}\left(y_{k,q}^{t-\tau_{k}}\right)\right\rangle}\right]}\right]\\&=-\eta_{g}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left[\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\left\langle\nabla f(w^{t}),\;\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\nabla F_{k}\left(y_{k,q}^{t-\tau_{k}}\right)}\right\rangle}\right]\end{split}

Using the identity

\langle a,b\rangle=\frac{1}{2}\left(\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}\right)

we further expand as follows:

\begin{split}\mathbb{E}[T_{1}]&=-\frac{\eta_{g}}{2}\left(\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}\right)\|\nabla f(w^{t})\|^{2}\underbrace{-\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}\left(y_{k,q}^{t-\tau_{k}}\right)}\right\|^{2}\right)}}_{T_{3}}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\right)}\end{split}

We expand T_{2} as follows:

\begin{split}\mathbb{E}[T_{2}]&=\mathbb{E}\left[\frac{L\eta^{2}_{g}}{2}\left\|\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\Delta_{k}^{t-\tau_{k}}}\right\|^{2}\right]\\&=\frac{L\eta^{2}_{g}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\sum_{q=0}^{Q-1}{-\eta_{\ell}^{(q)}\mathbb{E}_{g_{k}|\mathcal{H},B_{t}}[g_{k}(y_{k,q}^{t-\tau_{k}})]}\right)}\right\|^{2}\\&=\frac{L\eta^{2}_{g}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right)}}\right\|^{2}\\&\leq\sum_{q=0}^{Q-1}{\frac{QL\eta^{2}_{g}\left(\eta_{\ell}^{(q)}\right)^{2}}{2}\cdot\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\sum_{k\in B_{t}}{\alpha_{k,t}^{\text{norm}}\left(\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right)}\right\|^{2}}\end{split}

We can choose step sizes such that \mathbb{E}[T_{2}]+\mathbb{E}[T_{3}]\leq 0; in particular, we can choose \eta_{g}\eta_{\ell}^{(q)}Q\leq 1/L for all q\in\{0,...,Q-1\}. We then get the following inequality:

\begin{split}f(w^{t+1})&\leq f(w^{t})-\frac{\eta_{g}}{2}\left(\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}}\right)\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\left(\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\underbrace{\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}}_{T_{4}}\right)\end{split}

We can telescope and expand T_{4} as follows. Note that this variance term after telescoping breaks down into four key portions:

\begin{split}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}[T_{4}]&=\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\nabla f(w^{t})-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\\&=\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t})-\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})+\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})\\&\qquad\qquad-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}+\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}\\&\qquad\qquad-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y^{t-\tau_{k}}_{k,q})}+\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y^{t-\tau_{k}}_{k,q})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}\\&\leq 4\Biggl(\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t})-\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})\Bigg\|^{2}}_{\text{staleness error}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{n}\sum_{i=1}^{n}\nabla F_{i}(w^{t-\tau_{i}})-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{sampling error}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(w^{t-\tau_{k}})}-\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{local client drift}}\\&\qquad\qquad+\underbrace{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\Bigg\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\Bigg\|^{2}}_{\text{reweighting error}}\Biggr)\end{split}

Using our assumptions on L-smoothness, we can further reduce this as follows. We note that, due to unbiasedness of the sample mean, the mean over all agents’ gradients is equal in expectation to the mean of the gradients of the agents sampled at each stage.

\begin{split}\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}[T_{4}]&\leq\frac{4L^{2}}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}+\frac{4L^{2}}{b}\sum_{k\in B_{t}}{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|w^{t-\tau_{k}}-y_{k,q}^{t-\tau_{k}}\right\|^{2}}\\&\qquad\qquad+4\,\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\underbrace{\left\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}}_{T_{5}}\end{split}

We can produce an upper-bound on the staleness error (first term above) as follows:

\begin{split}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}&=\left\|\sum_{\rho=t-\tau_{i}}^{t-1}{(w^{\rho+1}-w^{\rho})}\right\|^{2}\\&=\eta_{g}^{2}\left\|\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{\alpha_{j_{\rho},\rho}^{\text{norm}}\sum_{l=0}^{Q-1}{\eta_{\ell}^{(l)}g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})}}}\right\|^{2}\end{split}

Taking the expectation of this with respect to \mathcal{H}, we get:

\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}\leq\eta_{g}^{2}b\tau_{i}\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{(\alpha_{j_{\rho},\rho}^{\text{norm}})^{2}\cdot\mathbb{E}_{\mathcal{H}}\left\|\sum_{l=0}^{Q-1}{\eta_{\ell}^{(l)}g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})}\right\|^{2}}}

Recalling our definition of \alpha^{\text{norm}}_{j_{\rho},\rho} and our assumption on bounded staleness, we see that, given a maximum staleness \tau_{\max,b}, the largest that \alpha^{\text{norm}}_{j_{\rho},\rho} can be occurs for an aggregation buffer containing one agent with maximum expected staleness \tau_{\max,b} and all other agents with zero staleness (and thus 1/n un-normalized weighting).

\begin{split}\alpha^{\text{norm}}_{j_{\rho},\rho}&\leq\frac{\frac{\tau_{\max,b}\cdot b+1}{n}}{\frac{\tau_{\max,b}\cdot b+1}{n}+\sum_{i=1}^{b-1}{\frac{1}{n}}}\\&\leq\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\end{split}

The expectation for the staleness error becomes bounded as follows, using the above result and Lemma 1:

\begin{split}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}&\leq\eta_{g}^{2}b\tau_{i}Q\sum_{\rho=t-\tau_{i}}^{t-1}{\sum_{j_{\rho}\in S_{\rho}}{\left(\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2}\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}\cdot\mathbb{E}_{\mathcal{H}}\left\|g_{j_{\rho}}(y^{\rho}_{j_{\rho},l})\right\|^{2}}}}\\&\leq\eta_{g}^{2}b^{2}\max_{\tau_{i}}{\tau_{i}^{2}}\,Q\left(\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2}\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)3\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\leq 3Q\eta_{g}^{2}\tau_{\max,b}^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\end{split}

Likewise for the local drift error, we can bound it similarly using Lemma 1:

\begin{split}\mathbb{E}\|w^{t-\tau_{i}}-y_{i,q}^{t-\tau_{i}}\|^{2}&=\mathbb{E}\|y_{i,0}^{t-\tau_{i}}-y_{i,q}^{t-\tau_{i}}\|^{2}=\mathbb{E}\left\|\sum_{l=0}^{q-1}{\eta_{\ell}^{(l)}g_{i}(y_{i,l}^{t-\tau_{i}})}\right\|^{2}\\&\leq 3q\left(\sum_{l=0}^{q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\leq 3Q\left(\sum_{l=0}^{Q-1}{\left(\eta_{\ell}^{(l)}\right)^{2}}\right)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\end{split}

Finally, we can bound the re-weighting error as follows:

\begin{split}\mathbb{E}[T_{5}]&=\mathbb{E}\left\|\sum_{k\in B_{t}}{\left(\frac{1}{|B_{t}|}-\alpha^{\text{norm}}_{k,t}\right)\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\\&\leq\mathbb{E}\left[\sum_{k\in B_{t}}{\left\|\left(\frac{1}{|B_{t}|}-\alpha^{\text{norm}}_{k,t}\right)\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right\|^{2}}\right]\\&\leq b\cdot\mathbb{E}\left[\sum_{k\in B_{t}}{\left(\frac{1}{b}-\alpha^{\text{norm}}_{k,t}\right)^{2}\left\|\nabla F_{k}(y_{k,q}^{t-\tau_{k}})\right\|^{2}}\right]\end{split}

We note from before that the largest \alpha_{k,t}^{\text{norm}} can be is \frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}. Consequently, the smallest \alpha_{k,t}^{\text{norm}} can be occurs for an agent with zero expected staleness (minimizing the numerator) with all other buffer agents having maximum expected staleness (maximizing the denominator):

\begin{split}\alpha_{k,t}^{\text{norm}}&\geq\frac{\frac{1}{n}}{\frac{1}{n}+\sum_{i=1}^{b-1}{\frac{b\tau_{\max,b}+1}{n}}}\\&\geq\frac{1}{(b-1)(b\tau_{\max,b}+1)+1}\end{split}

We can then upper-bound as follows, denoting this quantity U(\tau_{\max,b}):

\left(\frac{1}{b}-\alpha_{k,t}^{\text{norm}}\right)^{2}\leq\max\left\{\left(\frac{1}{b}-\frac{b\tau_{\max,b}+1}{b(\tau_{\max,b}+1)}\right)^{2},\left(\frac{1}{b}-\frac{1}{(b-1)(b\tau_{\max,b}+1)+1}\right)^{2}\right\}=U(\tau_{\max,b}) \qquad (11)
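For concreteness, the deviation bound U(\tau_{\max,b}) of Equation 11 is straightforward to evaluate numerically; the short sketch below (with illustrative values of the staleness bound and buffer size) computes both candidate squared deviations and takes their maximum.

def U(tau_max, b):
    # Equation 11: maximum squared deviation between 1/b and a normalized weight
    upper = (b * tau_max + 1) / (b * (tau_max + 1))           # largest possible alpha_norm
    lower = 1.0 / ((b - 1) * (b * tau_max + 1) + 1)           # smallest possible alpha_norm
    return max((1.0 / b - upper) ** 2, (1.0 / b - lower) ** 2)

print(U(tau_max=3, b=5))   # hypothetical staleness bound and buffer size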

We arrive at the following bound on the expectation of T_{5}, given the above result and Assumption 3:

\mathbb{E}[T_{5}]\leq U(\tau_{\max,b})b^{2}G

Thus, combining everything and denoting \alpha(Q)=\sum_{q=0}^{Q-1}{\eta_{\ell}^{(q)}} and \beta(Q)=\sum_{q=0}^{Q-1}{(\eta_{\ell}^{(q)})^{2}}, we have

\begin{split}\mathbb{E}[f(w^{t+1})]&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(\frac{4L^{2}}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{H}}\left\|w^{t}-w^{t-\tau_{i}}\right\|^{2}+\frac{4L^{2}}{b}\sum_{k\in B_{t}}{\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|w^{t-\tau_{k}}-y_{k,q}^{t-\tau_{k}}\right\|^{2}}\\&\qquad\qquad+4\,\mathbb{E}_{\mathcal{H},B_{t}\sim[n]}\left\|\frac{1}{|B_{t}|}\sum_{k\in B_{t}}{\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}-\sum_{k\in B_{t}}{\alpha^{\text{norm}}_{k,t}\,\nabla F_{k}(y_{k,q}^{t-\tau_{k}})}\right\|^{2}\bigg)\\&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(12L^{2}Q\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\bigg)\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(12QL^{2}\beta(Q)(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\bigg)\\&\qquad+\sum_{q=0}^{Q-1}{\frac{\eta_{g}\eta_{\ell}^{(q)}}{2}}\bigg(4U(\tau_{\max,b})b^{2}G\bigg)\\&\leq\mathbb{E}[f(w^{t})]-\frac{\eta_{g}}{2}\alpha(Q)\,\mathbb{E}\|\nabla f(w^{t})\|^{2}\\&\qquad+6\eta_{g}^{3}\tau_{\max,b}^{2}\alpha(Q)\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+6\eta_{g}\alpha(Q)\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\qquad+2b^{2}\eta_{g}U(\tau_{\max,b})\alpha(Q)G\end{split}

Rearranging and summing over t, we get the following:

\begin{split}\sum_{t=0}^{T}{\eta_{g}\alpha(Q)\|\nabla f(w^{t})\|^{2}}&\leq\sum_{t=0}^{T-1}2\left(\mathbb{E}[f(w^{t})]-\mathbb{E}[f(w^{t+1})]\right)\\&\qquad+12T\eta_{g}^{3}\tau_{\max,b}^{2}\alpha(Q)\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12T\eta_{g}\alpha(Q)\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)\\&\qquad+4Tb^{2}\eta_{g}U(\tau_{\max,b})\alpha(Q)G\end{split}

Thus, we conclude

\begin{split}\frac{1}{T}\sum_{t=0}^{T}{\|\nabla f(w^{t})\|^{2}}&\leq\frac{2(f(w^{0})-f(w^{*}))}{\eta_{g}\alpha(Q)T}\\&\qquad+12\eta_{g}^{2}\tau_{\max,b}^{2}\beta(Q)QL^{2}\left(\frac{b\tau_{\max,b}+1}{\tau_{\max,b}+1}\right)^{2}\left(\sigma_{\ell}^{2}+\sigma^{2}_{g}+G\right)\\&\qquad+12\beta(Q)QL^{2}(\sigma_{\ell}^{2}+\sigma_{g}^{2}+G)+4b^{2}U(\tau_{\max,b})G\end{split}