
Adaptive Transmission Scheduling in Wireless Networks for Asynchronous Federated Learning

Hyun-Suk Lee and Jang-Won Lee, Senior Member, IEEE. This work was supported in part by the National Research Foundation of Korea (NRF) grant through the Korea Government (MSIT) under Grant 2021R1G1A1004796, and in part by the NRF grant through the Korea Government (MSIT) under Grant 2019R1A2C2084870. (Corresponding author: Jang-Won Lee.) H.-S. Lee is with the School of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, South Korea (e-mail: hyunsuk@sejong.ac.kr), and J.-W. Lee is with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail: jangwon@yonsei.ac.kr).
Abstract

In this paper, we study asynchronous federated learning (FL) in a wireless distributed learning network (WDLN). To allow each edge device to use its local data more efficiently via asynchronous FL, transmission scheduling in the WDLN for asynchronous FL should be carefully determined considering system uncertainties, such as time-varying channels and stochastic data arrivals, and the scarce radio resources in the WDLN. To address this, we propose a metric, called an effectivity score, which represents the amount of learning achieved by asynchronous FL. We then formulate an Asynchronous Learning-aware transmission Scheduling (ALS) problem to maximize the effectivity score and develop three ALS algorithms, called ALSA-PI, BALSA, and BALSA-PO, to solve it. If the statistical information about the uncertainties is known, the problem can be optimally and efficiently solved by ALSA-PI. Even if it is not, the problem can still be optimally solved by BALSA, which learns the uncertainties based on a Bayesian approach using the state information reported from the devices. BALSA-PO solves the problem suboptimally, but it addresses a more restrictive WDLN in practice, where the AP can observe only limited state information compared with that used in BALSA. We show via simulations that the models trained by our ALS algorithms achieve performance close to that of an ideal benchmark and outperform those trained by other state-of-the-art baseline scheduling algorithms in terms of model accuracy, training loss, learning speed, and robustness of learning. These results demonstrate that the adaptive scheduling strategy of our ALS algorithms is effective for asynchronous FL.

Index Terms:
Asynchronous learning, distributed learning, federated learning, scheduling, wireless network

I Introduction

Nowadays, a massive amount of data is generated by devices, such as mobile phones and wearable devices, which can be used for a wide range of machine learning (ML) applications from healthcare to autonomous driving. As the computational and storage capabilities of such distributed devices keep growing, distributed learning has become attractive for efficiently exploiting the data from devices and addressing privacy concerns. Owing to this need for distributed learning, federated learning (FL) has been widely studied as a potentially viable solution for distributed learning [1, 2]. FL allows a shared central model to be learned from the locally trained models of distributed devices under the coordination of a central server, while the local training data at the devices is not shared with the central server.

In FL, the central server and the devices should communicate with each other to exchange the trained models. For this, a wireless network composed of one access point (AP) and multiple devices has been widely considered [3, 4, 5, 6, 7, 8, 9, 10, 11], where the AP plays the role of the central server in FL. FL operates over multiple rounds. In each round, the AP first broadcasts the current central model. In typical FL, all devices replace their local models with the received central one and train them using their local training data. Then, the locally trained model of each device is uploaded to the AP, and the AP aggregates the local models into the central model. This enables training the central model in a distributed manner while avoiding privacy concerns because no device is required to upload its local training data.

However, as FL has been implemented in wireless networks, the limited radio resources of wireless networks have been raised as a critical issue since the radio resources restrict the number of devices that can upload their local models to the AP simultaneously [8, 7]. To address this issue, in [10, 5, 6, 7, 8, 9, 11], device scheduling procedures for FL have been proposed in which the AP schedules the devices to train their local models and upload them considering the limited radio resources. Then, in each round, only the scheduled devices do so, and the AP aggregates the local models uploaded by the scheduled devices. On the other hand, the devices that are not scheduled do not train their local models, and their local data that has arrived during the round is stored for future use. Hence, every locally trained model in each round is synchronously aggregated into the central model in that round (i.e., once a device trains its local model in a round, the model must be aggregated into the central model in that round, and its use in future rounds is not allowed); such FL is called synchronous FL in the literature.

Device scheduling procedures in synchronous FL have effectively addressed the issue of the limited radio resources in wireless networks, but at the same time, they may cause inefficient utilization of local computation resources and loss of local data due to excessive pileup [12, 13]. These issues arise because the devices not scheduled in a round do not train their local models. Hence, to address them, we can allow even the non-scheduled devices to train their local models and store them for future use. This enables each device to continually train its local model using the arriving local data in an online manner, which avoids wasting local computation resources and an excessive pileup of the local data [12, 13]. However, when the stored locally trained models are aggregated, a time lag arises between the stored local models and the current central model due to the asynchronous local model aggregation. In the literature, such FL with the time lag is called asynchronous FL. In addition, the devices that store previous locally trained models are typically called stragglers, and aggregating their models into the central model may adversely affect the convergence of the central model because of the time lag [13].

Recently, in the ML literature, several works on asynchronous FL have addressed the harmful effects of the stragglers, which inevitably exist due to network circumstances such as time-varying channels and scarce radio resources [13, 14, 12, 15]. They introduced various approaches that address the stragglers when updating the central and local models: balancing between the previous local model and the current one [15, 13], adopting dynamic learning rates [14, 13, 12], and using a regularized loss function [15, 13]. However, they mainly focused on handling the stragglers that have already arisen and did not take into account the implementation issues of FL that cause the stragglers in the first place. Hence, it is necessary to study an asynchronous FL procedure in which the key challenges of implementing asynchronous FL in a wireless network are carefully considered.

When implementing FL in the wireless network, the scarcity of radio resources is one of the key challenges. Hence, several existing works focus on reducing the communication costs of a single local model transmission of each edge device; analog model aggregation methods using the inherent characteristic of a wireless medium [4, 3], a gradient reusing method [16], and local gradient compressing methods [17, 18]. Meanwhile, other existing works in [6, 5, 8, 7, 10, 9, 11] focus on scheduling the transmission of the local models for FL in which only scheduled edge devices participate in FL. Since typical scheduling strategies in wireless networks [19, 20] do not consider FL, for effective learning, various scheduling strategies for FL have been proposed based on traditional scheduling policies [5], scheduling criteria to minimize the training loss [6, 9, 7, 8], an effective temporal scheduling pattern to FL [10], and multi-armed bandits [11]. They allow the AP to efficiently use the radio resources to accelerate FL.

However, these existing works on FL implementation in the wireless network have been studied only for synchronous FL, and do not consider the characteristics of asynchronous FL, such as the stragglers and the continual online training of the local models, at all. Nevertheless, the existing methods on reducing the communication costs can be used for asynchronous FL since in asynchronous FL, the edge devices should transmit their local model to the AP as in synchronous FL. On the other hand, the methods on scheduling the transmission of the local models may cause a significant inefficiency of learning if they are adopted to asynchronous FL. This is because most of them consider each individual FL round separately and do not address transmission scheduling over multiple rounds. In asynchronous FL, each round is highly inter-dependent since the current scheduling affects the subsequent rounds due to the use of the stored local models from the stragglers. Hence, the transmission scheduling for asynchronous FL should be carefully determined considering the stragglers over multiple rounds. In addition, in such scheduling over multiple rounds, the effectiveness of learning depends on system uncertainties, such as time-varying channels and stochastic data arrivals. In particular, the stochastic data arrivals become more important in asynchronous FL since the data arrivals are directly related to the amount of the straggled local data due to the continual online training. However, the existing works do not consider stochastic data arrivals at edge devices.

TABLE I: Comparison of Our Work and Related Works on Transmission Scheduling for FL (✓: Considered / ×: Not Considered)
| Work | Wireless channel | Multiple rounds | Stochastic data arrivals | Stragglers from async. FL |
| [6, 9, 7, 8, 5] | ✓ | × | × | × |
| [10, 11] | ✓ | ✓ | × | × |
| Our work | ✓ | ✓ | ✓ | ✓ |

In this paper, we study asynchronous FL considering the key challenges of implementing it in a wireless network. To the best of our knowledge, our work is the first to study asynchronous FL in a wireless network. Specifically, we propose an asynchronous FL procedure in which the characteristics of asynchronous FL, time-varying channels, and stochastic data arrivals of the edge devices are considered for transmission scheduling over multiple rounds. The comparison of our work and the existing works on transmission scheduling in FL is summarized in Table I. We then analyze the convergence of the asynchronous FL procedure. To address scheduling the transmission of the local models in the asynchronous FL procedure, we also propose a metric called an effectivity score. It represents the amount of learning from asynchronous FL considering the properties of asynchronous FL including the harmful effects on learning due to the stragglers. We formulate an asynchronous learning-aware transmission scheduling (ALS) problem to maximize the effectivity score while considering the system uncertainties (i.e., the time-varying channels and stochastic data arrivals). We then develop the following three ALS algorithms that solve the ALS problem:

  • First, an ALS algorithm with the perfect statistical information about the system uncertainties (ALSA-PI) optimally and efficiently solves the problem using the state information reported from the edge devices in the asynchronous FL procedure and the statistical information.

  • Second, a Bayesian ALS algorithm (BALSA) solves the problem using the state information without requiring any a priori information. Instead, it learns the system uncertainties based on a Bayesian approach. We prove that BALSA is optimal in terms of the long-term average effectivity score by its regret bound analysis.

  • Third, a Bayesian ALS algorithm for a partially observable WDLN (BALSA-PO) solves the problem only using partial state information (i.e., channel conditions). It addresses a more restrictive WDLN in practice, where each edge device is allowed to report only its current channel condition to the AP.

Through experimental results, we show that ALSA-PI and the BALSAs (i.e., BALSA and BALSA-PO) achieve performance close to that of an ideal benchmark with no radio resource constraints and no transmission failures. We also show that they outperform other baseline scheduling algorithms in terms of training loss, test accuracy, learning speed, and robustness of learning.

The rest of this paper is organized as follows. Section II introduces a WDLN with asynchronous FL. In Section III, we formulate the ALS problem considering asynchronous FL. In Section IV, we develop ALSA-PI, BALSA, and BALSA-PO, and in Section V, we provide experimental results. Finally, we conclude in Section VI.

II Wireless Distributed Learning Network with Asynchronous FL

In this section, we introduce typical learning strategies of asynchronous FL provided in [15, 14, 13, 12]. We then propose an asynchronous FL procedure to adopt the learning strategies in a WDLN. For ease of reference, we summarize some notations in Table II.

TABLE II: List of notations
| Notation | Description |
| {\mathbf{w}}^{t} | Central parameters of the AP in round t |
| {\mathbf{w}}_{u}^{t} | Local parameters of device u in round t |
| c_{u}^{t} | Central learning weight of device u in round t |
| n_{u}^{t} | Number of aggregated samples of device u from the latest successful local update transmission to the beginning of round t |
| m_{u}^{t} | Number of arrived samples of device u during round t |
| N_{u}^{t} | Total number of samples from device u used for the central updates before the beginning of round t |
| \Delta_{u}^{t} | Local update of device u in round t |
| \eta_{u}^{t} | Local learning rate of device u in round t |
| \psi_{u}^{t} | Local gradient of device u in round t |
| h_{u}^{t} | Channel gain of device u in round t |
| a_{u}^{t} | Transmission scheduling indicator of device u in round t |
| x_{u}^{t} | Successful transmission indicator of device u in round t |
| {\boldsymbol{\uptheta}}_{u}^{C} | System parameters related to the channel gain of device u |
| \theta_{u}^{P} | System parameter related to the sample arrival of device u |
| \mu^{1} | Prior distribution for the parameters {\boldsymbol{\uptheta}} |
| \mu^{t} | Posterior distribution for the parameters {\boldsymbol{\uptheta}} in round t |
| t_{k} | Start time of stage k of BALSA |
| T_{k} | Length of stage k of BALSA |

II-A Central and Local Parameter Updates in Asynchronous FL

Here, we introduce typical learning strategies to update the central and local parameters in asynchronous FL [15, 14, 13, 12], which address the challenges due to the stragglers. To this end, we consider one access point (AP) that plays the role of the central server in FL and U edge devices. The set of edge devices is defined as \mathcal{U}=\{1,2,...,U\}. In asynchronous FL, an artificial neural network (ANN) model composed of multiple parameters is trained in a distributed manner to minimize an empirical loss function l({\mathbf{w}}), where {\mathbf{w}} denotes the parameters of the ANN. Asynchronous FL proceeds over a discrete-time horizon composed of multiple rounds. The set of rounds is defined as \mathcal{T}=\{1,2,...\} and the index of rounds is denoted by t. Then, we can formally define the problem of asynchronous FL with a given local loss function f_{u} at device u as follows:

\mathop{\rm minimize}_{{\mathbf{w}}}~l({\mathbf{w}})=\frac{1}{K}\sum_{u\in\mathcal{U}}\sum_{k=1}^{K_{u}}f_{u}({\mathbf{w}},k), (1)

where K_{u} denotes the number of data samples of device u (for simple presentation, we omit the word "edge" from "edge device" in the rest of the paper), K=\sum_{u=1}^{U}K_{u}, and f_{u}({\mathbf{w}},k) is an empirical local loss function defined by the parameters {\mathbf{w}} at the k-th data sample of device u. To solve this problem, in asynchronous FL, every device trains its parameters by using its locally available data in an online manner for efficient learning on each device and to avoid an excessive pileup of the local data. Then, a subset of the devices is scheduled to transmit the trained parameters to the AP. By using the parameters received from the scheduled devices, the AP updates its own parameters. We call the parameters at the AP central parameters and those at each device local parameters. The details of how the local and central parameters are updated in asynchronous FL are provided in the following.

II-A1 Local Parameter Updates

We describe the local parameter updates at each device in asynchronous FL. Let the central parameters of the AP in round t be denoted by {\mathbf{w}}^{t}. Note that the central parameters {\mathbf{w}}^{t} are calculated by the AP in round t-1, as will be described in the central parameter update section later. In round t, the AP first broadcasts the central parameters {\mathbf{w}}^{t} to all devices, and each device u replaces its local parameters {\mathbf{w}}_{u}^{t} with {\mathbf{w}}^{t}. Thus, {\mathbf{w}}^{t} becomes the initial parameters of local training at each device in round t. Then, the device trains its local parameters {\mathbf{w}}_{u}^{t} using its local data samples that have arrived since its previous local parameter update. To this end, it applies a gradient-based update method using the regularized loss function defined as follows [15, 13]:

s_{u}({\mathbf{w}}_{u}^{t})=f_{u}({\mathbf{w}}_{u}^{t})+\frac{\lambda}{2}||{\mathbf{w}}_{u}^{t}-{\mathbf{w}}^{t}||, (2)

where the second term mitigates the deviation of the local parameters from the central ones and \lambda is the parameter for the regularization. We denote the local gradient of device u, calculated using the local parameters {\mathbf{w}}_{u}^{t} and its local data samples in round t, by \nabla g_{u}^{t}. In asynchronous FL, the local gradients that have not been transmitted in the previous rounds are aggregated into the current local gradient. Such local gradients not transmitted in the previous rounds are called delayed local gradients. In the literature, the devices that have such delayed local gradients are called stragglers. It has been shown that they adversely affect model convergence since the parameters used to calculate the delayed local gradients differ from the current local parameters used to calculate the current local gradients [21, 15, 13].

To address this issue, when aggregating the previous local gradients and current ones, we need to balance between them. To this end, in [15, 13], the decay coefficient β\beta is introduced and used when aggregating the local gradients as

\psi_{u}^{t}=\nabla g_{u}^{t}+\beta\psi_{u}^{t-1}, (3)

where ψut\psi_{u}^{t} is the aggregated local gradient of device uu in round tt. By using the aggregated local gradient, we define the local update of device uu in round tt as

\Delta_{u}^{t}=\eta_{u}^{t}\psi_{u}^{t}, (4)

where ηut\eta_{u}^{t} is the local learning rate of the local parameters of device uu in round tt. It is worth emphasizing that each device uploads its local update to the AP for updating the central parameters if scheduled. Moreover, a dynamic local learning rate has been widely applied to address the stragglers. The dynamic learning rate of device uu in round tt is determined as follows [22, 12, 13]:

\eta_{u}^{t}=\eta_{d}\max\left\{1,\log\left(d_{u}^{t}\right)\right\}, (5)

where ηd\eta_{d} is an initial value of the local learning rate and dutd_{u}^{t} is the number of delayed rounds since the latest successful local update transmission of device uu in round tt. Each device transmits its local update Δut\Delta_{u}^{t} to the AP according to the transmission scheduling, and then, updates its aggregated local gradient according to the local update transmission as

\psi_{u}^{t}=\begin{cases}0,&\textrm{if device $u$ successfully transmits its local update}\\ \psi_{u}^{t},&\textrm{otherwise.}\end{cases} (6)
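To make the device-side computation concrete, the following Python sketch combines the regularized-loss gradient in (2), the gradient aggregation in (3), the dynamic learning rate in (5), and the local update in (4). It is a minimal illustration under our own assumptions; the function and variable names (e.g., local_update, grad_fn) and the default constants are ours, not from any released implementation.

```python
import numpy as np

def local_update(w_central, grad_fn, psi_prev, d_delay,
                 eta_d=0.01, beta=0.5):
    """One local update at a device, following (2)-(5).

    w_central: central parameters w^t broadcast by the AP.
    grad_fn:   callable returning the gradient of the regularized local
               loss s_u in (2) evaluated on the newly arrived samples
               (assumed to be available on the device).
    psi_prev:  aggregated local gradient psi_u^{t-1} kept from earlier rounds.
    d_delay:   d_u^t, rounds since the last successful update transmission.
    """
    w_local = np.copy(w_central)                      # w_u^t <- w^t
    grad = grad_fn(w_local)                           # local gradient on new data
    psi = grad + beta * psi_prev                      # aggregation with decay, (3)
    eta = eta_d * max(1.0, np.log(max(d_delay, 1)))   # dynamic learning rate, (5)
    delta = eta * psi                                 # local update Delta_u^t, (4)
    return delta, psi

# If the update is later transmitted successfully, the device resets its
# aggregated gradient to zero as in (6); otherwise it keeps psi for reuse.
```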

II-A2 Central Parameter Updates

In round tt, device uu obtains its local update, Δut\Delta_{u}^{t}, by using its local data and delayed local gradients as in (3) and (4), respectively. After the scheduled devices transmit their local updates to the AP, the AP recalculates its central parameters by aggregating the successfully received local updates from the scheduled devices. To represent this, we define the set of the devices whose local updates are successfully received at the AP in round tt by 𝒰¯t\bar{\mathcal{U}}^{t}. Then, the central parameters are updated as follows [13, 12]:

{\mathbf{w}}^{t+1}={\mathbf{w}}^{t}-\sum_{u\in\bar{\mathcal{U}}^{t}}c_{u}^{t}\Delta_{u}^{t}, (7)

where c_{u}^{t} is the central learning weight of device u in round t. It is worth emphasizing that {\mathbf{w}}^{t+1} is calculated by the AP in round t and will be broadcast to all devices in round t+1. The central learning weight of each device is determined according to the contribution of the device to the central parameter updates so far, which is evaluated according to the number of samples from the device used for the central parameter updates [3, 4, 5, 6, 7, 8, 9, 11, 10, 13, 14, 12, 15]. We define the total number of samples from device u used for the central updates before the beginning of round t as N_{u}^{t}, the number of aggregated samples of device u from the latest successful local update transmission to the beginning of round t as n_{u}^{t}, and the number of samples of device u that have arrived during round t as m_{u}^{t}. Then, the central learning weight of device u in round t is determined as

c_{u}^{t}=\frac{N_{u}^{t}+n_{u}^{t}+m_{u}^{t}}{\sum_{u^{\prime}\in\mathcal{U}}N_{u^{\prime}}^{t}+\sum_{u^{\prime}\in\bar{\mathcal{U}}^{t}}\left(n_{u^{\prime}}^{t}+m_{u^{\prime}}^{t}\right)}. (8)
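As an illustration of how the AP aggregates the received local updates, the sketch below evaluates the central learning weight in (8) and applies the update rule in (7); the function names and data layout are hypothetical choices of ours.

```python
def central_weight(N_u, n_u, m_u, N_total, received_new_samples):
    """Central learning weight c_u^t in (8) for a device whose update was
    received in round t.

    N_u, n_u, m_u        : counters N_u^t, n_u^t, m_u^t of this device
    N_total              : sum of N_v^t over all devices
    received_new_samples : sum of (n_v^t + m_v^t) over devices in bar{U}^t
    """
    return (N_u + n_u + m_u) / (N_total + received_new_samples)

def central_update(w, received):
    """Central parameter update in (7).

    w        : current central parameters w^t (a NumPy array, for instance)
    received : list of (c_u, delta_u) pairs for the devices in bar{U}^t
    """
    for c_u, delta_u in received:
        w = w - c_u * delta_u
    return w
```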

II-B Asynchronous FL Procedure in WDLN

We consider a WDLN consisting of one AP and U devices that cooperatively operate FL over a discrete-time horizon composed of multiple rounds of identical time duration. In the WDLN, local data stochastically and continually arrives at each device, and each device trains its local parameters in each round by using the arrived data. Due to the scarce radio resources of the WDLN, it is impractical for all devices to transmit their local parameters to the AP simultaneously. Hence, FL in the WDLN becomes asynchronous FL, and the devices that will transmit their local parameters in each round should be carefully scheduled by the AP to train the central parameters effectively.

We now propose an asynchronous FL procedure for the WDLN in which the central and local parameter update strategies introduced in the previous subsection are adopted. In the procedure, the characteristics of the WDLN, such as time-varying channel conditions and system bandwidth, are considered as well. This allows us to implement asynchronous FL in the WDLN while addressing the challenges due to the scarcity of radio resources in the WDLN. We denote the channel gain of device uu in round tt by huth^{t}_{u}. We assume that the channel gain of device uu in each round follows a parameterized distribution and denote the corresponding system parameters by \upthetauC{\boldsymbol{\uptheta}}_{u}^{C}. For example, for Rician fading channels, \upthetauC{\boldsymbol{\uptheta}}_{u}^{C} represents the parameters of the Rician distribution, Ω\Omega and KK. The vector of the channel gain in round tt is defined by 𝐡t={hut}u𝒰{\mathbf{h}}^{t}=\{h_{u}^{t}\}_{u\in\mathcal{U}}.

In the procedure, each round t is composed of three phases: the transmission scheduling phase, the local parameter update phase, and the parameter aggregation phase. In the transmission scheduling phase, the AP observes the state information of the devices, which can be used to determine the transmission scheduling. Specifically, we define the state information in round t as {\mathbf{s}}^{t}=({\mathbf{h}}^{t},{\mathbf{n}}^{t}), where {\mathbf{n}}^{t}=\{n_{u}^{t}\}_{u\in\mathcal{U}} is the vector of the numbers of aggregated samples of each device from its latest successful local update transmission to the beginning of round t. It is worth noting that in the following sections, we also consider a partially observable WDLN, in which the AP can observe only the channel gains {\mathbf{h}}^{t}, to address more restrictive environments in practice. Then, the AP determines the transmission scheduling for asynchronous FL based on the state information. In the local parameter update phase, the AP broadcasts its central parameters {\mathbf{w}}^{t} to all devices. We do not consider transmission failures for {\mathbf{w}}^{t} at each device, as in the related works [9, 7, 8, 10], i.e., all devices receive {\mathbf{w}}^{t} successfully. This is reasonable because the probability of a transmission failure for {\mathbf{w}}^{t} is small in practice, since the AP can reliably broadcast the central parameters by using much larger transmission power than that of the devices and robust transmission methods designed for the worst-case device. (This assumption is only to avoid unnecessarily complicated modeling. Besides, our proposed method can be easily used without this assumption as well, by allowing each device to keep its previous local parameters when its reception of the central parameters fails.) Each device u replaces its local parameters with the received central parameters (i.e., {\mathbf{w}}_{u}^{t}={\mathbf{w}}^{t}). Then, it trains the parameters using its local data samples that have arrived since the local parameter update phase in the previous round and calculates the local update \Delta_{u}^{t} in (4), as described in Section II-A1. Finally, in the parameter aggregation phase, each scheduled device transmits its local update to the AP. Then, the AP updates the central parameters by averaging them as in (7). In the following, we describe the asynchronous FL procedure in the WDLN in more detail.

Refer to caption
Figure 1: The asynchronous FL procedure in the WDLN.

First, the AP schedules the devices to transmit their local updates based on the state information to effectively utilize the limited radio resources in each round. We define the maximum number of devices that can be scheduled in each round as W. It is worth noting that W is restricted by the bandwidth of the WDLN and is typically much smaller than the number of devices U (i.e., W \ll U). We define the transmission scheduling indicator of device u in round t as a_{u}^{t}\in\{0,1\}, where 1 indicates that device u is scheduled to transmit its local update in round t and 0 indicates that it is not. The vector of the transmission scheduling indicators in round t is defined as {\mathbf{a}}^{t}=\{a_{u}^{t}\}_{u\in\mathcal{U}}. In round t, the AP schedules the devices to transmit their local updates satisfying the following constraint:

\sum_{u\in\mathcal{U}}a^{t}_{u}=W. (9)

We then define the successful transmission indicator of the local update of device uu in round tt as a Bernoulli random variable xutx_{u}^{t}, where 1 represents the successful transmission and 0 represents the transmission failure. Since the probability of the successful transmission of device uu in round tt is determined according to the channel gain of device uu, we can model it by using the approximation of the packet error rate (PER) with the given signal-to-interference-noise ratio (SINR) [23, 24]. Moreover, we can use the PER provided in [25, 26] to model it considering the HARQ or ARQ retransmissions. For example, the PER of an uncoded packet can be given by [x=0|Φ]=1(1b(Φ))n\mathbb{P}[x=0|\Phi]=1-(1-b(\Phi))^{n}, where Φ\Phi is the SINR with the channel gain hh, b()b(\cdot) is the bit error rate for the given SINR, and nn is the packet length in bits. We refer the readers to [24] for more examples of the PER approximations with coding. The vector of the successful transmission indicators in round tt is defined as 𝐱t={xut}u𝒰{\mathbf{x}}^{t}=\{x_{u}^{t}\}_{u\in\mathcal{U}}.
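As a concrete instance of such a PER-based model, the sketch below computes the success probability of an uncoded packet from the expression above; the BPSK bit-error-rate formula is our own assumption for b(·), chosen only for illustration, and the numbers in the example are arbitrary.

```python
import math

def bpsk_ber(sinr_linear):
    """Bit error rate b(Phi) for uncoded BPSK over AWGN (assumed model),
    b(Phi) = Q(sqrt(2 * Phi)) = 0.5 * erfc(sqrt(Phi))."""
    return 0.5 * math.erfc(math.sqrt(sinr_linear))

def success_probability(sinr_linear, packet_bits):
    """P[x = 1 | Phi] = (1 - b(Phi))^n for an uncoded packet of n bits."""
    return (1.0 - bpsk_ber(sinr_linear)) ** packet_bits

# Example: 10 dB SINR and a 1000-bit local-update packet (illustrative values).
p_success = success_probability(10 ** (10 / 10), 1000)
```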

We assume that the number of samples that have arrived at device u during round t, m_{u}^{t}, is an independently and identically distributed (i.i.d.) Poisson random variable with a system parameter \theta_{u}^{P}. (The assumption of Poisson data arrivals is only for a simple implementation of the Bayesian approach in our proposed method; our method can be easily generalized to any other i.i.d. data arrivals by using sampling algorithms such as Gibbs sampling.) At the end of round t (i.e., at the beginning of round t+1), the number of aggregated samples of device u from the latest successful local update transmission to the beginning of round t+1, n_{u}^{t+1} (for brevity, we simply write "the number of aggregated samples of device u at the beginning of round t+1" in the rest of this paper), and the total number of samples from device u used for the central updates before the beginning of round t+1, N_{u}^{t+1}, are updated depending on the local update transmission as follows:

n_{u}^{t+1}=\begin{cases}0,&a_{u}^{t}=1\textrm{ and }x_{u}^{t}=1\\ n_{u}^{t}+m_{u}^{t},&\textrm{otherwise}\end{cases} (10)

and

N_{u}^{t+1}=\begin{cases}N_{u}^{t}+n_{u}^{t}+m_{u}^{t},&a_{u}^{t}=1\textrm{ and }x_{u}^{t}=1\\ N_{u}^{t},&\textrm{otherwise.}\end{cases} (11)

With the scheduling and successful transmission indicators, a_{u}^{t}'s and x_{u}^{t}'s, we can rewrite the equation for the central parameter update at the AP in (7) as follows:

{\mathbf{w}}^{t+1}={\mathbf{w}}^{t}-\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}c_{u}^{t}\Delta_{u}^{t}. (12)

We summarize the asynchronous FL procedure in the WDLN in Algorithm 1 and illustrate it in Fig. 1.

Algorithm 1 Asynchronous FL in WDLN
1: Input: Regularization parameter \lambda, decay coefficient \beta
2: Initialize variables N_{u}=n_{u}=0
3: for t=1,2,... do
4:   \triangleright Transmission scheduling phase
5:   The AP observes {\mathbf{h}}^{t} and {\mathbf{n}}^{t} and schedules the update transmissions {\mathbf{a}}^{t}
6:   \triangleright Local parameter update phase
7:   The AP broadcasts {\mathbf{w}}^{t} and each device updates its local parameters
8:   Each device computes its local update as in (3), (4), (5)
9:   \triangleright Parameter aggregation phase
10:  Each scheduled device transmits its local update \Delta_{u}^{t} and n_{u}^{t}+m_{u}^{t} to allow the AP to calculate c_{u}^{t}
11:  The AP updates the central parameters as in (12) and N_{u}^{t+1} as in (11)
12:  Each device updates n_{u}^{t+1} as in (10) and \psi_{u}^{t} as in (6)
13: end for

II-C Convergence Analysis of Asynchronous FL in WDLN

We analyze the convergence of the asynchronous FL procedure proposed in the previous subsection. We first introduce some definitions and an assumption on the objective function of asynchronous FL in (1) for the analysis.

Definition 1

(L-smoothness) The function f is L-smooth if it has a Lipschitz continuous gradient with constant L>0, i.e., for all {\mathbf{x}}_{1},{\mathbf{x}}_{2}, f({\mathbf{x}}_{1})-f({\mathbf{x}}_{2})\leq\left<\nabla f({\mathbf{x}}_{2}),{\mathbf{x}}_{1}-{\mathbf{x}}_{2}\right>+\frac{L}{2}||{\mathbf{x}}_{1}-{\mathbf{x}}_{2}||^{2}.

Definition 2

(\xi-strong convexity) The function f is \xi-strongly convex with \xi>0 if, for all {\mathbf{x}}_{1},{\mathbf{x}}_{2}, f({\mathbf{x}}_{1})-f({\mathbf{x}}_{2})\geq\left<\nabla f({\mathbf{x}}_{2}),{\mathbf{x}}_{1}-{\mathbf{x}}_{2}\right>+\frac{\xi}{2}||{\mathbf{x}}_{1}-{\mathbf{x}}_{2}||^{2}.

Definition 3

(Bounded gradient dissimilarity) The local functions are V-locally dissimilar at {\mathbf{w}} if \mathbb{E}[||\psi_{u}||^{2}]\leq||\nabla l({\mathbf{w}})||^{2}V^{2}, where \psi_{u} is the aggregated local gradient of device u in (6) and l({\mathbf{w}}) is the central loss function in (1).

Assumption 1

The objective function of asynchronous FL, l({\mathbf{w}}), in (1) is bounded from below, and there exists \epsilon>0 such that \nabla l({\mathbf{w}})^{\top}\mathbb{E}[\psi_{u}]\geq\epsilon||\nabla l({\mathbf{w}})||^{2} holds for all {\mathbf{w}}.

It is worth noting that this assumption is a typical one used in the literature [27, 15, 13]. With this assumption, we can show the convergence of the asynchronous FL procedure in the WDLN in the following theorem.

Theorem 1

Suppose that the objective function of asynchronous FL in the WDLN, l({\mathbf{w}}), in (1) is L-smooth and \xi-strongly convex, and that the local gradients \psi_{u} are V-dissimilar. Then, if Assumption 1 holds, after T rounds, the asynchronous FL procedure in the WDLN satisfies

\mathbb{E}[l({\mathbf{w}}^{T})-l({\mathbf{w}}^{*})]\leq\prod_{t=0}^{T-1}\left\{1+2\xi\bar{\eta}^{t}\left(\frac{L\bar{\eta}^{t}W^{2}V^{2}}{2}-\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}\mathbb{P}[x_{u}^{t}=1]\right)\right\}\left(l({\mathbf{w}}^{0})-l({\mathbf{w}}^{*})\right). (13)

In addition, if there exists any \zeta>0 satisfying \sum_{u\in\mathcal{U}}a_{u}^{t}\mathbb{P}[x_{u}^{t}=1]\geq\zeta and \underline{\eta}\leq\eta_{u}^{t}<\eta^{t}=\frac{2\epsilon\zeta}{LW^{2}V^{2}}(\max_{u\in\mathcal{U}}\{c_{u}^{t}\})^{-1}, the following holds:

\mathbb{E}[l({\mathbf{w}}^{T})-l({\mathbf{w}}^{*})]\leq(1-2\xi\underline{\eta}\epsilon^{\prime})^{T}(l({\mathbf{w}}^{0})-l({\mathbf{w}}^{*})), (14)

where \epsilon^{\prime}=\epsilon\zeta-\frac{L\bar{\eta}W^{2}V^{2}}{2} and \bar{\eta}=\max_{t}\eta^{t}.

Proof:

See Appendix A. ∎

Compared with the convergence analysis of asynchronous FL in [13], Theorem 1 accounts for the asynchronous FL procedure in the WDLN. The theorem implies that faster convergence can be expected as the expected number of successful transmissions in the WDLN, \sum_{u\in\mathcal{U}}a_{u}^{t}\mathbb{P}[x_{u}^{t}=1], grows. Besides, it theoretically shows the convergence of the asynchronous FL procedure in the WDLN as T\rightarrow\infty if the expected number of successful transmissions is non-zero, since 0<2\xi\underline{\eta}\epsilon^{\prime}<1. However, various aspects of learning, such as learning speed and robustness to stragglers, are not directly captured by this theorem, and they may differ according to the transmission scheduling strategy used in the asynchronous FL procedure, as will be shown in Section V.
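To give a feel for the geometric bound in (14), the short sketch below evaluates the contraction factor for a few horizon lengths; every constant in it is an arbitrary assumption made purely for illustration, not a value taken from the paper or its experiments.

```python
# Purely illustrative evaluation of the bound in (14) with assumed constants.
xi, eta_lb, eps_prime = 0.5, 0.01, 0.4   # assumed xi, underline{eta}, epsilon'
initial_gap = 10.0                       # assumed l(w^0) - l(w^*)

rho = 1 - 2 * xi * eta_lb * eps_prime    # per-round contraction factor
for T in (10, 100, 1000):
    print(f"T = {T:4d}: bound = {initial_gap * rho ** T:.4f}")
```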

III Problem Formulation

We now formulate an asynchronous learning-aware transmission scheduling (ALS) problem to maximize the learning performance of FL. In the literature, it has been empirically shown that the FL learning performance mainly depends on the local update of each device, which is determined by the data samples that have arrived at the device [2, 15]. However, it has not been clearly established which characteristics of a local update have a larger impact than others on learning the optimal parameters [28, 2]. Moreover, it is hard for the AP to exploit information relevant to the local updates because the AP cannot obtain the local updates before the devices transmit them and cannot access the data samples of the devices due to the limited radio resources and privacy issues. For these reasons, instead of directly finding the impactful local updates, the existing works [9, 7, 8] schedule the transmissions according to the factors in the bound of the convergence rate of FL, which implicitly represent the impact of the local updates in terms of convergence. However, as pointed out in Section I, the existing works on transmission scheduling for FL do not consider asynchronous FL over multiple rounds. As a result, their scheduling criteria based on the convergence rate may become inefficient in asynchronous FL since, in a long-term view of asynchronous FL, the deterioration of the central model's convergence due to the absence of some devices in model aggregation can be compensated later thanks to the nature of asynchronous FL.

Contrary to the existing works, we focus on maximizing the average amount of learning from the local data samples in asynchronous FL. The number of data samples used in learning can implicitly represent the amount of learning. Roughly speaking, a larger number of data samples makes the empirical loss in (1) converge to the true loss in the real world. It is also empirically accepted in the literature that a larger amount of data samples generally results in better performance [29, 30]. In this context, in FL, the number of local samples used to calculate the local gradients represents the amount of learning since the central parameters are updated by aggregating the local gradients. However, in asynchronous FL, the local gradients delayed due to scheduling or transmission failures are aggregated into the current one, which may adversely affect learning. Hence, we maximize the total number of samples used to calculate the local gradients aggregated in the central updates while treating their delay as a cost, so as to minimize its adverse effect on learning.

We define a state in round tt by a tuple of the channel gain 𝐡t{\mathbf{h}}^{t} and the numbers of the aggregated samples 𝐧t{\mathbf{n}}^{t}, 𝐬t=(𝐡t,𝐧t){\mathbf{s}}^{t}=({\mathbf{h}}^{t},{\mathbf{n}}^{t}) and the state space 𝒮U×U\mathcal{S}\subset\mathbb{R}^{U}\times\mathbb{N}^{U}. The AP can observe the state information in the transmission scheduling phase of each round. It is worth noting that in the following section, we will also consider the more restrictive environments in which the AP can observe only a partial state in round tt (i.e., the channel gain 𝐡t{\mathbf{h}}^{t}). We denote system uncertainties in the WDLN, such as the successful local update transmissions and data sample arrivals, by random disturbances. The vector of the random disturbances in round tt is defined as 𝐯t=(𝐱t,𝐦t){\mathbf{v}}^{t}=({\mathbf{x}}^{t},{\mathbf{m}}^{t}), where 𝐱t{\mathbf{x}}^{t} is the vector of the successful transmission indicators in round tt and 𝐦t{\mathbf{m}}^{t} is the vector of the number of samples that have arrived at each device during round tt. Then, an action in round tt, 𝐚t{\mathbf{a}}^{t}, is chosen from the action space, defined by 𝒜={𝐚:u𝒰au=W}\mathcal{A}=\{{\mathbf{a}}:\sum_{u\in\mathcal{U}}a_{u}=W\}. In each round tt, the AP determines the action based on the observed state while considering the random disturbances which are unknown to the AP. To evaluate the effectiveness of the transmission on learning in each round, we define a metric, called an effectivity score of asynchronous FL, as

F({\mathbf{s}}^{t},{\mathbf{a}}^{t},{\mathbf{v}}^{t})=\sum_{u\in\bar{\mathcal{U}}^{t}}\left(n_{u}^{t}+m_{u}^{t}\right)-\sum_{u\in\mathcal{U}\setminus\bar{\mathcal{U}}^{t}}\gamma\left(n_{u}^{t}+m_{u}^{t}\right), (15)

where 𝒰¯t\bar{\mathcal{U}}^{t} is the set of the devices whose local update transmission was successful in round tt and γ[0,1)\gamma\in[0,1) is a delay cost coefficient to consider the adverse effect of the stragglers. Note that this metric is correlated over time since nutn_{u}^{t} depends on the previous scheduling and the transmission result. The first term of the effectivity score represents the total number of the samples that will be effectively used in the central update while the second term represents the total number of the samples that may cause the adverse effects in learning due to the delayed local updates. We now formulate the ALS problem maximizing the total expected effectivity score of asynchronous FL as

\mathop{\rm maximize}_{\pi}~\lim_{T\rightarrow\infty}\mathbb{E}\left[\sum_{t=1}^{T}F({\mathbf{s}}^{t},{\mathbf{a}}^{t},{\mathbf{v}}^{t})\right], (16)

where π:𝒮𝒜\pi:\mathcal{S}\rightarrow\mathcal{A} is a policy that maps a state to an action. This problem maximizes the number of samples used in asynchronous FL while trying to minimize the adverse effect of the stragglers due to the delay.
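For concreteness, the following sketch evaluates the effectivity score in (15) for one round from the per-device counters; the data layout (plain dictionaries keyed by device index) and the example values are our own choices.

```python
def effectivity_score(n, m, received, gamma=0.5):
    """Effectivity score F(s^t, a^t, v^t) in (15).

    n, m     : dicts mapping device u to n_u^t and m_u^t
    received : set of devices whose local update was received (bar{U}^t)
    gamma    : delay cost coefficient, 0 <= gamma < 1
    """
    gain = sum(n[u] + m[u] for u in received)
    penalty = gamma * sum(n[u] + m[u] for u in n if u not in received)
    return gain - penalty

# Example: three devices, only device 0 succeeded in this round.
print(effectivity_score({0: 4, 1: 2, 2: 7}, {0: 1, 1: 3, 2: 0}, {0}))
```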

Remark 1

We highlight that the proposed effectivity score for asynchronous FL shares the same principle as the metrics in the conventional works [9, 8]. Roughly speaking, those metrics try to maximize the number of data samples used in learning the transmitted local updates, since this maximizes the convergence rate of the central model according to the convergence analysis of synchronous FL. Similarly, the proposed effectivity score tries to maximize the number of data samples used in learning the successfully transmitted local updates, so as to maximize the amount of learning conveyed to the AP. However, at the same time, it tries to minimize the number of data samples used in learning the delayed local updates so as to minimize their adverse effects, whereas the conventional metrics do not introduce any penalty for the delayed local updates.

We define the system parameters related to the system uncertainties as

{\boldsymbol{\uptheta}}=\{{\boldsymbol{\uptheta}}_{1}^{C},...,{\boldsymbol{\uptheta}}_{U}^{C},\theta_{1}^{P},...,\theta_{U}^{P}\}\in\Theta, (17)

where \Theta is the system parameter space. These system parameters express the system uncertainties in the WDLN, such as the channel statistics and the average number of arrived samples (i.e., the arrival rate) of each device in each round, and thus we can solve the problem if they are known. We define the true system parameters of the WDLN as {\boldsymbol{\uptheta}}^{*}=\{{\boldsymbol{\uptheta}}^{*,C},{\boldsymbol{\uptheta}}^{*,P}\}, where {\boldsymbol{\uptheta}}^{*,C} and {\boldsymbol{\uptheta}}^{*,P} are the true system parameters for the channel gains and the arrival rates, respectively. Formally, if the system parameters {\boldsymbol{\uptheta}}^{*} are perfectly known in advance as a priori information, the state transition probabilities, \mathbb{P}[{\mathbf{s}}^{\prime}|{\mathbf{s}},{\mathbf{a}}], can be derived by using the probability of successful transmission, the distribution of the channel gains, and the Poisson distribution with this a priori information. Then, based on the transition probabilities, the problem in (16) can be optimally solved by standard dynamic programming (DP) methods [31]. However, this is impractical since it is hard to obtain such a priori information in advance. Besides, even if such a priori information is perfectly known, the computational complexity of the standard DP methods is too high, as will be shown in Section IV-E. Hence, we need a learning approach, such as reinforcement learning (RL), that learns the information while asynchronous FL proceeds, and we develop algorithms based on it in the following section.

IV Asynchronous Learning-Aware Transmission Scheduling

In this section, we develop three ALS algorithms, each of which requires different information to solve the ALS problem: an ALS algorithm with perfect a priori information (ALSA-PI), a Bayesian ALS algorithm (BALSA), and a Bayesian ALS algorithm with partially observable state information (BALSA-PO). We summarize the information required by ALSA-PI, BALSA, and BALSA-PO in Table III. The table shows that, from ALSA-PI to BALSA-PO, the required information is gradually reduced.

TABLE III: Summary of information required in each algorithm
| Required information | ALSA-PI | BALSA | BALSA-PO |
| A priori information on true system parameters | {\boldsymbol{\uptheta}}^{*,P} | - | - |
| State information reported from the devices | {\mathbf{h}}, {\mathbf{n}} | {\mathbf{h}}, {\mathbf{n}} | {\mathbf{h}} |

IV-A Parametrized Markov Decision Process

Before developing the algorithms, we first define a parameterized Markov decision process (MDP) based on the ALS problem in the previous section. Since the system uncertainties in the WDLN, such as the channel statistics and the arrival rates of the samples, depend on the system parameters {\boldsymbol{\uptheta}} in (17), the MDP is parameterized by the system parameters {\boldsymbol{\uptheta}}. Specifically, it is defined as M_{\boldsymbol{\uptheta}}=(\mathcal{S},\mathcal{A},r^{\boldsymbol{\uptheta}},P^{\boldsymbol{\uptheta}}), where \mathcal{S} and \mathcal{A} are the state space and action space of the ALS problem, respectively, r^{\boldsymbol{\uptheta}} is the reward function, and P^{\boldsymbol{\uptheta}} is the transition probability such that P^{\boldsymbol{\uptheta}}({\mathbf{s}}^{\prime}|{\mathbf{s}},{\mathbf{a}})=\mathbb{P}({\mathbf{s}}^{t+1}={\mathbf{s}}^{\prime}|{\mathbf{s}}^{t}={\mathbf{s}},{\mathbf{a}}^{t}={\mathbf{a}},{\boldsymbol{\uptheta}}). We define the reward function r^{\boldsymbol{\uptheta}} as the effectivity score in (15). For the theoretical results, we assume that the state space \mathcal{S} is finite and the reward function is bounded as r^{\boldsymbol{\uptheta}}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]. (We can easily implement such a system by quantizing the channel gain and truncating n_{u}^{t} and m_{u}^{t} with a large constant. Then, the state space becomes finite, and we can use as the reward function the normalized version of F({\mathbf{s}},{\mathbf{a}}) with the maximum expected reward derived from the constant. In practice, such a system is more realistic to implement since these variables in a real system are typically finite.) The average reward per round of a stationary policy \pi is defined as

J_{\pi}({\boldsymbol{\uptheta}})=\lim_{T\rightarrow\infty}\frac{1}{T}\mathbb{E}\Big[\sum_{t=1}^{T}r^{\boldsymbol{\uptheta}}({\mathbf{s}}^{t},{\mathbf{a}}^{t})\Big]. (18)

The optimal average reward per round J(\uptheta)=maxπJπ(\uptheta)J^{*}({\boldsymbol{\uptheta}})=\max_{\pi}J_{\pi}({\boldsymbol{\uptheta}}) satisfies the Bellman equation:

J^{*}({\boldsymbol{\uptheta}})+v({\mathbf{s}},{\boldsymbol{\uptheta}})=\max_{{\mathbf{a}}\in\mathcal{A}}\Big\{r^{\boldsymbol{\uptheta}}({\mathbf{s}},{\mathbf{a}})+\sum_{{\mathbf{s}}^{\prime}\in\mathcal{S}}P^{\boldsymbol{\uptheta}}({\mathbf{s}}^{\prime}|{\mathbf{s}},{\mathbf{a}})v({\mathbf{s}}^{\prime},{\boldsymbol{\uptheta}})\Big\},~\forall{\mathbf{s}}\in\mathcal{S}, (19)

where v(𝐬,\uptheta)v({\mathbf{s}},{\boldsymbol{\uptheta}}) is the value function at state 𝐬{\mathbf{s}}. We can define the corresponding optimal policy, π(𝐬,\uptheta)\pi^{*}({\mathbf{s}},{\boldsymbol{\uptheta}}), as the maximizer of the above optimization. Then, the policy π(𝐬,\uptheta)\pi^{*}({\mathbf{s}},{\boldsymbol{\uptheta}}^{*}) becomes the optimal solution to the problem in (16).

IV-B ALSA-PI: Optimal ALS with Perfect a Priori Information

We now develop ALSA-PI to solve the ALS problem when {\boldsymbol{\uptheta}}^{*} is perfectly known at the AP in advance. Typically, the true system parameters {\boldsymbol{\uptheta}}^{*} are unknown, but we introduce ALSA-PI here since it will be used to develop BALSA in Section IV-C. Besides, in some cases, {\boldsymbol{\uptheta}}^{*} might be precisely estimated by using past experience. Since the true system parameters {\boldsymbol{\uptheta}}^{*} are known, the problem in (16) can be solved by finding the optimal policy for {\boldsymbol{\uptheta}}^{*}. However, in general, computing the optimal policy \pi^{*} for given parameters incurs a computational complexity that increases exponentially with the number of devices, which is often called the curse of dimensionality. Hence, even if the true system parameters {\boldsymbol{\uptheta}}^{*} are perfectly known in advance, it is hard to compute the corresponding optimal policy \pi^{*}({\mathbf{s}},{\boldsymbol{\uptheta}}^{*}). However, for the ALS problem, a greedy policy, which myopically chooses the action maximizing the expected reward in the current round, is optimal, as shown in the following theorem.

Theorem 2

For the parameterized MDP with given finite parameters, the greedy policy in (23) is optimal.

Proof:

See Appendix B. ∎

With this theorem, we can easily develop ALSA-PI by adopting the greedy policy with the known system parameters \uptheta{\boldsymbol{\uptheta}}^{*} thanks to the structure of the reward. This also implies that we can significantly reduce the computational complexity to solve the ALS problem with the known parameters compared with the DP methods that are typical ones to solve MDPs. The computational complexity will be analyzed in Section IV-E.

In round tt, the greedy policy for given system parameters \uptheta{\boldsymbol{\uptheta}} chooses the action by solving the following optimization problem:

\mathop{\rm maximize}_{{\mathbf{a}}\in\mathcal{A}}~\mathbb{E}[F({\mathbf{s}}^{t},{\mathbf{a}},{\mathbf{v}}^{t})|{\boldsymbol{\uptheta}}], (20)

where 𝔼[|\uptheta]\mathbb{E}[\cdot|{\boldsymbol{\uptheta}}] represents the expectation over the probability distribution of the given system parameters \uptheta{\boldsymbol{\uptheta}}. With the scheduling and successful transmission indicators, auta_{u}^{t}’s and xutx_{u}^{t}’s, we rearrange the objective function in (15) as

F({\mathbf{s}}^{t},{\mathbf{a}},{\mathbf{v}}^{t})=\sum_{u\in\mathcal{U}}\left[a_{u}x_{u}^{t}(1+\gamma)(n_{u}^{t}+m_{u}^{t})-\gamma(n_{u}^{t}+m_{u}^{t})\right]. (21)

Then, we can reformulate the problem as

\mathop{\rm maximize}_{{\mathbf{a}}\in\mathcal{A}}~\sum_{u\in\mathcal{U}}a_{u}\mathbb{E}[x_{u}^{t}(n_{u}^{t}+m_{u}^{t})|{\boldsymbol{\uptheta}}^{P}], (22)

since the last term in (21) does not depend on both transmission scheduling indicator 𝐚{\mathbf{a}} and uncertainty on the successful transmission 𝐱t{\mathbf{x}}^{t}. In addition, the conditional expectation in the problem depends only on the system parameters for the arrival rates of the data samples, \upthetaP={θ1P,θUP}{\boldsymbol{\uptheta}}^{P}=\{\theta_{1}^{P},...\theta_{U}^{P}\}, because xutx_{u}^{t} is solely determined by the current channel gain. To solve the problem, the AP estimates the expected number of samples of each device uu, 𝔼[xut(nut+mut)|\upthetaP]\mathbb{E}[x_{u}^{t}(n_{u}^{t}+m_{u}^{t})|{\boldsymbol{\uptheta}}^{P}], by using its channel gain huth_{u}^{t} and the system parameter θuP\theta_{u}^{P}. Let the devices be sorted in descending order of their expected number of samples 𝔼[xut(nut+mut)|\upthetaP]\mathbb{E}[x_{u}^{t}(n_{u}^{t}+m_{u}^{t})|{\boldsymbol{\uptheta}}^{P}] and indexed by (1),(2),,(U)(1),(2),...,(U). Then, the greedy policy in round tt is easily obtained as

\pi^{g}({\mathbf{s}}^{t},{\boldsymbol{\uptheta}}^{P})=\begin{cases}a_{u}=1,&u\in\{(1),...,(W)\}\\ a_{u}=0,&\textrm{otherwise.}\end{cases} (23)

In ALSA-PI, the AP schedules the devices according to the greedy policy with the true system parameter \uptheta,P{\boldsymbol{\uptheta}}^{*,P} (i.e., πg\pi^{g} in (23) with \uptheta,P{\boldsymbol{\uptheta}}^{*,P}) in each round.
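The greedy rule (20)-(23) reduces to ranking the devices by their expected contribution, as in the following sketch; expected_success is a placeholder for the PER-based success-probability model of Section II-B, and the dictionary-based interface is our own.

```python
def greedy_schedule(h, n, theta_P, W, expected_success):
    """Greedy transmission scheduling pi^g in (23).

    h, n             : dicts of channel gains h_u^t and counters n_u^t
    theta_P          : dict of (known or sampled) arrival rates theta_u^P
    W                : number of devices that can be scheduled per round
    expected_success : callable mapping a channel gain to P[x_u^t = 1]
    """
    # E[x_u^t (n_u^t + m_u^t) | theta^P] = P[x_u^t = 1] * (n_u^t + theta_u^P),
    # since x_u^t depends only on the channel and m_u^t is Poisson(theta_u^P).
    scores = {u: expected_success(h[u]) * (n[u] + theta_P[u]) for u in h}
    top_W = sorted(scores, key=scores.get, reverse=True)[:W]
    return {u: int(u in top_W) for u in h}
```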

Remark 2

For a time-varying maximum number of devices that can be scheduled in each round, the greedy policy in (23) can be easily extended by replacing W with W^{t}, where W^{t} denotes the maximum number in round t. This implies that the ALS algorithms (i.e., ALSA-PI, BALSA, and BALSA-PO) can be easily extended as well, since they are implemented using the greedy policy and W appears only in the greedy policy.

IV-C BALSA: Optimal Bayesian ALS without a Priori Information

We now develop BALSA, which solves the parameterized MDP without requiring any a priori information. To this end, it adopts a Bayesian approach to learn the unknown system parameters {\boldsymbol{\uptheta}}. We define the policy determined by BALSA as \phi=(\phi^{1},\phi^{2},...), where each \phi^{t}({\mathbf{g}}^{t}) chooses an action according to the history of states, actions, and rewards, {\mathbf{g}}^{t}=({\mathbf{s}}^{1},...,{\mathbf{s}}^{t},{\mathbf{a}}^{1},...,{\mathbf{a}}^{t-1},r^{1},...,r^{t-1}). To apply the Bayesian approach, we assume that the system parameters in {\boldsymbol{\uptheta}} are independent of each other. We denote a prior distribution for the system parameters {\boldsymbol{\uptheta}} by \mu^{1}. For the prior distribution, a non-informative prior can typically be used if there is no prior information about the system parameters. On the other hand, if there is any prior information, the prior can be defined to reflect it. For example, if we know that a parameter belongs to a certain interval, the prior distribution can be defined as a truncated distribution whose probability measure outside of the interval is zero. We then define the Bayesian regret of BALSA up to time T as

R(T)=\mathbb{E}\Big[TJ^{*}({\boldsymbol{\uptheta}})-\sum_{t=1}^{T}r^{{\boldsymbol{\uptheta}}}({\mathbf{s}}^{t},\phi^{t}({\mathbf{g}}^{t}))\Big], (24)

where the expectation in the above equation is over the prior distribution μ1\mu^{1} and the randomness in state transitions. This Bayesian regret has been widely used in the literature of Bayesian RL as a metric to quantify the performance of a learning algorithm [32, 33, 34].

In BALSA, the system uncertainties are estimated by a Bayesian approach. Formally, in the Bayesian approach, the posterior distribution of \uptheta{\boldsymbol{\uptheta}} in round tt, which is denoted by μt\mu^{t}, is updated by using the observed information about \uptheta{\boldsymbol{\uptheta}} according to the Bayes’ rule. Then, the posterior distribution is used to estimate or sample the system parameters for the algorithm. We will describe how the posterior distribution is updated in detail later. We define the number of visits to any state-action pair (𝐬,𝐚)({\mathbf{s}},{\mathbf{a}}) before time tt as

M^{t}({\mathbf{s}},{\mathbf{a}})=|\{\tau<t:({\mathbf{s}}^{\tau},{\mathbf{a}}^{\tau})=({\mathbf{s}},{\mathbf{a}})\}|, (25)

where |||\cdot| denotes the cardinality of the set.

BALSA operates in stages, each of which is composed of multiple rounds. We denote the start time of stage k of BALSA by t_{k} and the length of stage k by T_{k}=t_{k+1}-t_{k}. By convention, we set T_{0}=1. Stage k ends and the next stage starts if t>t_{k}+T_{k-1} or M^{t}({\mathbf{s}},{\mathbf{a}})>2M^{t_{k}}({\mathbf{s}},{\mathbf{a}}) for some ({\mathbf{s}},{\mathbf{a}})\in\mathcal{S}\times\mathcal{A}. This balances the trade-off between exploration and exploitation in BALSA. Thus, the start time of stage k+1 is given by

t_{k+1}=\min\big\{t>t_{k}:t>t_{k}+T_{k-1}\textrm{ or }M^{t}({\mathbf{s}},{\mathbf{a}})>2M^{t_{k}}({\mathbf{s}},{\mathbf{a}})\textrm{ for some }({\mathbf{s}},{\mathbf{a}})\in\mathcal{S}\times\mathcal{A}\big\}, (26)

and t_{1}=1. This stopping criterion allows us to bound the number of stages over T rounds. At the beginning of stage k, system parameters {\boldsymbol{\uptheta}}_{k} are sampled from the posterior distribution \mu^{t_{k}}. Then, the action is chosen by the optimal policy corresponding to the sampled system parameters {\boldsymbol{\uptheta}}_{k}, \pi^{*}({\mathbf{s}},{\boldsymbol{\uptheta}}_{k}), until the stage ends. It is worth noting that this posterior sampling procedure, in which the system parameters sampled from the posterior distribution are used for choosing actions, has been widely applied to address the exploration-exploitation dilemma in RL [32, 33, 34, 35]. Since the greedy policy, \pi^{g}({\mathbf{s}},{\boldsymbol{\uptheta}}_{k}), is optimal for the ALS problem as shown in the previous subsection, we can easily implement BALSA by using it. Besides, this means that the AP does not have to estimate the posterior distribution of the parameters for the channel gains, {\boldsymbol{\uptheta}}^{C}=\{{\boldsymbol{\uptheta}}_{1}^{C},...,{\boldsymbol{\uptheta}}_{U}^{C}\}, since the greedy policy does not use them.

Here, we describe BALSA in more detail. In round $t$, the AP observes the state $\mathbf{s}^{t}=(\mathbf{h}^{t},\mathbf{n}^{t})$. The AP can then obtain the number of samples that arrived at each device during round $t-1$, $m_{u}^{t-1}$, as follows: for each device whose local model was transmitted to the AP in round $t-1$ (i.e., $u\in\bar{\mathcal{U}}^{t-1}$), it is known because $n_{u}^{t-1}+m_{u}^{t-1}$ is transmitted to the AP in round $t-1$ for the local update; for the other devices (i.e., $u\in\mathcal{U}\setminus\bar{\mathcal{U}}^{t-1}$), it is obtained by subtracting $n_{u}^{t-1}$ from $n_{u}^{t}$. The AP then updates the posterior distribution of $\boldsymbol{\uptheta}^{P}$, $\mu^{t}_{P}$, according to Bayes' rule. We denote the posterior distribution in round $t$ of a system parameter $\theta$ that belongs to $\boldsymbol{\uptheta}$ by $\mu^{t}_{\theta}$. The posterior distribution of $\boldsymbol{\uptheta}^{P}$ is then updated using the $m_{u}$'s. For example, if the non-informative Jeffreys prior for the Poisson distribution is used for the system parameter $\theta_{u}^{P}$, its posterior distribution in round $t$ is derived as follows [36]:

$\mu^{t}_{\theta_{u}^{P}}(\theta|\mathbf{g}^{t})=\frac{(t-1)^{S+1/2}}{\Gamma(S+1/2)}\theta^{S-1/2}e^{-\theta(t-1)},$ (27)

where $S=\sum_{\tau=1}^{t-1}m_{u}^{\tau}$. In this paper, we implement BALSA based on the non-informative prior. In each stage $k$ of the algorithm, the AP samples the system parameters $\boldsymbol{\uptheta}^{P}_{k}$ from the posterior distribution in (27) and schedules the local update transmissions according to the greedy policy in (23) using the sampled system parameters $\boldsymbol{\uptheta}^{P}_{k}$. We summarize BALSA in Algorithm 2.

Algorithm 2 BALSA
1: Input: prior distribution $\mu^{1}$
2: Initialize $k\leftarrow 1$, $t\leftarrow 1$, $t_{k}\leftarrow 0$
3: while TRUE do
4:     $T_{k-1}\leftarrow t-t_{k}$ and $t_{k}\leftarrow t$
5:     Sample $\boldsymbol{\uptheta}_{k}^{P}\sim\mu^{t_{k}}_{P}$
6:     while $t\leq t_{k}+T_{k-1}$ and $M^{t}(\mathbf{s},\mathbf{a})\leq 2M^{t_{k}}(\mathbf{s},\mathbf{a})$, $\forall(\mathbf{s},\mathbf{a})$ do
7:         Choose action $\mathbf{a}^{t}\leftarrow\pi^{g}(\mathbf{s}^{t},\boldsymbol{\uptheta}_{k}^{P})$
8:         Observe state $\mathbf{s}^{t+1}$ and reward $r^{t+1}$
9:         Update $\mu^{t+1}_{P}$ as in (27) using the $m_{u}$'s of all devices
10:        $t\leftarrow t+1$
11:    end while
12:    $k\leftarrow k+1$
13: end while
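To make the stage structure and posterior sampling of Algorithm 2 concrete, the following minimal Python sketch shows one way the AP-side loop could be organized. The environment interface (`observe_state`, `apply_action`), the greedy-policy stand-in, and all variable names are illustrative assumptions rather than the paper's implementation; the paper-specific elements are the Gamma posterior implied by (27) and the stopping rule in (26). The visit counts are keyed on a simplified state for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_arrival_rates(S_counts, n_rounds):
    """Sample theta_u^P from the Gamma posterior implied by the Jeffreys
    prior in (27): shape S + 1/2, rate (t - 1)."""
    return rng.gamma(shape=S_counts + 0.5, scale=1.0 / max(n_rounds, 1))

def greedy_action(n_hat, W):
    """Stand-in for the greedy policy (23): schedule the W devices with the
    largest expected numbers of aggregated samples."""
    return set(np.argsort(n_hat)[-W:])

def balsa(env, U, W, T):
    S_counts = np.zeros(U)     # running sums of observed m_u (sufficient statistic)
    visits = {}                # M^t(s, a): visit counts per (simplified) state-action pair
    t, t_k, T_prev = 1, 0, 1
    while t <= T:
        # --- start of a stage: sample parameters from the posterior ---
        T_prev, t_k = t - t_k, t
        theta_k = sample_arrival_rates(S_counts, t - 1)
        visits_at_tk = dict(visits)
        while t <= T:
            s = env.observe_state()               # s^t = (h^t, n^t)
            n_hat = np.array(s["n"], dtype=float)
            a = greedy_action(n_hat, W)           # a^t = pi^g(s^t, theta_k^P)
            m_obs = env.apply_action(a)           # per-device arrivals during the round
            S_counts += m_obs                     # posterior update as in (27)
            key = (tuple(s["n"]), tuple(sorted(a)))
            visits[key] = visits.get(key, 0) + 1
            t += 1
            # --- stopping rule (26): stage length or some visit count doubled ---
            if t > t_k + T_prev or any(
                visits[k] > 2 * visits_at_tk.get(k, 0) for k in visits
            ):
                break
    return S_counts / max(T, 1)                   # crude estimates of the arrival rates
```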

The regret bound of BALSA is given in the following theorem.

Theorem 3

Suppose that the maximum value function over the state space is bounded, i.e., $\max_{\mathbf{s}\in\mathcal{S}}v(\mathbf{s},\boldsymbol{\uptheta})\leq H$ for all $\boldsymbol{\uptheta}\in\Theta$. Then, the Bayesian regret of BALSA satisfies

$R(T)\leq(H+1)\sqrt{2SAT\log T}+49HS\sqrt{AT\log AT},$ (28)

where $S$ and $A$ denote the numbers of states and actions, i.e., $|\mathcal{S}|$ and $|\mathcal{A}|$, respectively.

Proof:

See Appendix C. ∎

The regret bound in Theorem 3 is sublinear in $T$. Thus, in theory, the average reward per round of BALSA is guaranteed to converge to that of the optimal policy (i.e., $\lim_{T\rightarrow\infty}R(T)/T=0$), which implies the optimality of BALSA in terms of the long-term average reward per round.

IV-D BALSA-PO: Bayesian ALS for Partially Observable State Information

In the WDLN, the state, $\mathbf{s}=(\mathbf{h},\mathbf{n})$, can be reported to the AP by the devices. In typical wireless networks, reporting the $n_{u}^{t}$'s to the AP requires data transmissions from the devices, which in turn require exchanging control messages between the AP and each device individually. Besides, the number of devices participating in FL is typically expected to be much larger than the capacity of the wireless network, i.e., the number of devices that can transmit their local updates simultaneously [2]. Hence, it might be impractical (or at least inefficient) for all devices to report their numbers of aggregated samples $\mathbf{n}$ to the AP in each round because of the excessive exchange of control messages this would require. On the other hand, the channel gains may not require such data transmissions in the uplink case, since the AP can estimate the channel gains of the devices by using pre-defined reference signals transmitted from the devices [37]. In this context, in this section, we consider a WDLN in which, in each round, the AP observes only the channel gains $\mathbf{h}$, and the numbers of aggregated samples $\mathbf{n}$ are not reported to the AP. We can model this as a partially observable MDP (POMDP) based on the description in Section III, but POMDPs are often computationally intractable to solve. In this section, we develop an algorithm, called BALSA-PO, that solves the ALS problem in the partially observable WDLN by slightly modifying BALSA.

In the fully observable WDLN, the AP observes the numbers of aggregated samples, $n_{u}^{t}$'s, from all devices in each round $t$. In BALSA, the $n_{u}^{t}$'s are used to choose the action according to the greedy policy in (23) and to obtain the numbers of samples that arrived at the devices during round $t-1$, $m_{u}^{t-1}$'s, for updating the posterior distribution of $\theta_{u}^{P}$. Hence, in the partially observable WDLN, where the $n_{u}^{t}$'s are not observable, the AP can neither choose the action according to the greedy policy nor obtain the $m_{u}^{t-1}$'s. To address these issues, in BALSA-PO, the AP first approximates $n_{u}^{t}$ by using the sampled system parameter of $\theta_{u}^{P}$ as $\tilde{n}_{u}^{t}=(T_{u}^{n}-1)\theta_{u,k}^{P}$, where $T_{u}^{n}$ is the number of rounds from the latest successful local update transmission to the current round (i.e., $T_{u}^{n}=t-\max\{\tau<t:x_{u}^{\tau}=1\textrm{ and }a_{u}^{\tau}=1\}$) and $\theta_{u,k}^{P}$ is the sampled system parameter of $\theta_{u}^{P}$ in stage $k$. The AP can then choose the action according to the greedy policy by using the approximated values $\tilde{n}_{u}^{t}$. However, the chosen action is meaningful only if the posterior distribution is kept correctly updated, since the $n_{u}^{t}$'s are approximated based on the sampled system parameters. In the partially observable WDLN, correctly updating the posterior distribution is challenging since the AP cannot obtain the $m_{u}^{t}$'s required for the update. Fortunately, even in the partially observable WDLN, the AP can still observe the number of samples $n_{u}^{t}+m_{u}^{t}$ in every successful local update transmission, since this information is included in the local update transmission (line 10 of Algorithm 1). Note that $n_{u}^{t}+m_{u}^{t}$ is the sum of the $m_{u}^{t}$'s over the rounds from the latest successful local update transmission to the current round. This sum is less informative than the individual values of the $m_{u}^{t}$'s. Nevertheless, it is informative enough to update the posterior distribution, because the update of the posterior distribution of $\theta_{u}^{P}$ requires only the sum of the $m_{u}^{t}$'s, as in (27). Hence, in the partially observable WDLN, the AP can update the posterior distribution of $\theta_{u}^{P}$ whenever device $u$ successfully transmits its local update. BALSA-PO can thus be implemented by substituting the state in BALSA, $\mathbf{s}$, with $\tilde{\mathbf{s}}=(\mathbf{h},\tilde{\mathbf{n}})$, where $\tilde{\mathbf{n}}=\{\tilde{n}_{1},\ldots,\tilde{n}_{U}\}$, and changing the posterior distribution update procedure in line 9 of Algorithm 2 as follows: "Update $\mu_{\theta_{u}^{P}}^{t+1}$ as in (27) using $n_{u}^{t}+m_{u}^{t}$ for each device $u\in\bar{\mathcal{U}}^{t}$".
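The two modifications above can be summarized in a few lines. The sketch below, under the same illustrative assumptions as the BALSA sketch after Algorithm 2, replaces the observed $n_{u}^{t}$ with the approximation $\tilde{n}_{u}^{t}=(T_{u}^{n}-1)\theta_{u,k}^{P}$ and updates the sufficient statistic of (27) only when a device's local update is received; all function and variable names are hypothetical.

```python
import numpy as np

def approx_aggregated_samples(rounds_since_success, theta_k):
    """BALSA-PO state approximation: n_tilde_u = (T_u^n - 1) * theta_{u,k}^P."""
    return (np.asarray(rounds_since_success) - 1) * np.asarray(theta_k)

def balsa_po_round(rounds_since_success, theta_k, W):
    """One scheduling decision of BALSA-PO: greedy selection on n_tilde."""
    n_tilde = approx_aggregated_samples(rounds_since_success, theta_k)
    return set(np.argsort(n_tilde)[-W:])

def posterior_update_on_success(S_counts, received_totals, successful_devices):
    """Update the Gamma-posterior sufficient statistic of (27) using only the
    sums n_u + m_u reported in successful local update transmissions."""
    for u in successful_devices:
        S_counts[u] += received_totals[u]   # sum of m_u since the last success
    return S_counts
```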

TABLE IV: Comparison of computational complexity up to round $T$
  DP-based with perfect information: $O(S^{2}A)$
  DP-based without information: upper bound $O(S^{2}AT)$, lower bound $O(\sqrt{S^{3}AT(\log T)^{-1}})$
  ALSA-PI, BALSA(-PO): $O(UT\log U)$
  Here $S=|\mathcal{S}|=(|\mathcal{H}||\mathcal{N}|)^{U}$ and $A=|\mathcal{A}|=\frac{U!}{W!(U-W)!}$.

IV-E Comparison of Computational Complexity

We compare the computational complexities of the DP-based algorithms with and without a priori information, ALSA-PI, and BALSAs. The DP-based algorithm with perfect a priori information requires running one of the standard DP methods once to compute the optimal policy for the system parameters given by the perfect information. For the complexity analysis, we consider the well-known value iteration, whose computational complexity is $O(S^{2}A)$, where $S=|\mathcal{S}|=(|\mathcal{H}||\mathcal{N}|)^{U}$, $A=|\mathcal{A}|=\frac{U!}{W!(U-W)!}$, $\mathcal{H}$ is the set of possible channel gains, and $\mathcal{N}$ is the set of possible numbers of aggregated samples. Hence, the computational complexity of the DP-based algorithm with perfect a priori information is $O(S^{2}A)$. The DP-based algorithm without a priori information denotes a learning method based on the posterior sampling in [34]. It operates over multiple stages, as in BALSAs, but the optimal policy for the sampled system parameters $\boldsymbol{\uptheta}_{k}$ in stage $k$ is computed using standard DP methods. Thus, its upper-bound complexity up to round $T$ is $O(S^{2}AT)$ if the policy is computed in every round. Since the system parameters are sampled once per stage, we can also derive the lower bound of its computational complexity up to round $T$ as $O(\sqrt{S^{3}AT(\log T)^{-1}})$ from Lemma 3 in Appendix C.

Contrary to the DP-based algorithms, ALSA-PI and BALSAs use the greedy policy, which in each round selects the devices with the $W$ largest expected numbers of samples. Thus, their computational complexity up to round $T$ is $O(UT\log U)$, where $U$ is the number of devices, according to the complexity of typical sorting algorithms. It is worth emphasizing that the complexity of BALSAs grows at the rate of $U\log U$, while the complexity of the DP-based algorithms increases exponentially in $U$. Besides, the DP-based algorithms have much larger computational complexities than BALSAs in terms of not only the asymptotic behavior but also the constant factors hidden in the Big-O notation, since the DP-based algorithms require complex computations based on the Bellman equation while BALSAs only require updating the posteriors in (27) and sorting the expected numbers of samples. From these analyses, we can see that the complexity of BALSAs is significantly lower than that of the DP-based algorithms. The computational complexities of the algorithms are summarized in Table IV.
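As an illustration of the $O(U\log U)$ per-round cost, a sort-based selection of the $W$ devices with the largest expected numbers of samples might look as follows. The score used here (aggregated samples plus an estimated per-round arrival rate) is only a placeholder for the exact quantity used by the greedy policy in (23).

```python
import numpy as np

def greedy_schedule(n, theta_hat, W):
    """Select the W devices with the largest expected numbers of samples.
    Sorting U scores costs O(U log U) per round, which dominates the
    per-round complexity of ALSA-PI and BALSAs."""
    scores = np.asarray(n, dtype=float) + np.asarray(theta_hat, dtype=float)
    return np.argsort(scores)[-W:]          # indices of the scheduled devices

# Example: 25 devices, schedule W = 5 per round.
rng = np.random.default_rng(1)
print(greedy_schedule(rng.integers(0, 20, 25), rng.uniform(1, 10, 25), W=5))
```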

V Experimental Results

In this section, we provide experimental results to evaluate the performance of our algorithms. To this end, we develop a dedicated Python-based simulator on which the following simulations and asynchronous FL are run. We consider a WDLN composed of one AP and 25 devices. We use a shard, which consists of multiple data samples, as the unit of data arrival; the number of data samples in each shard depends on the dataset. The arrival rate of data samples at each device (i.e., the system parameter $\theta_{u}^{P}$) and the distance between the AP and each device are provided in Table V. The maximum number of scheduled devices in the WDLN, $W$, is set to 5 as in [11]. The channel gains are composed of Rayleigh small-scale fading, which follows an independent exponential distribution with unit mean, and large-scale fading based on the pathloss model $128.1+37.6\log_{10}(d)$, where the pathloss exponent is 3.76 and $d$ is the distance in km. The transmission power of each device is set to 23 dBm and the noise power is set to -96 dBm. For the transmission of the local updates, the PER is approximated according to the given SNR as in [25] based on a turbo code. The delay cost coefficient $\gamma$ in the ALS problem is set to 0.01. In asynchronous FL, we set the local learning rate $\eta$ to 0.01 and the decay coefficient $\beta$ to 0.001.

TABLE V: Simulation Settings of Devices
  Device index:             1-3   4-6   7-9   10-12  13-15  16   17   18
  Arr. rate (shards/round): 1     1     1     1      1      3    3    3
  Distance (m):             100   200   300   400    500    300  350  400

  Device index:             19    20    21    22     23     24   25
  Arr. rate (shards/round): 3     5     5     5      5      10   10
  Distance (m):             450   300   350   400    450    400  450
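For concreteness, the link model described above can be reproduced with a few lines of Python. Since the bandwidth and the SNR-to-PER mapping of [25] are not restated here, the sketch below only shows the SNR computation from the stated parameters; it is an illustration of the setup, not the simulator itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_db(distance_km, tx_power_dbm=23.0, noise_dbm=-96.0):
    """Received SNR for one device: Rayleigh small-scale fading (unit-mean
    exponential power gain) and the pathloss model 128.1 + 37.6 log10(d [km])."""
    pathloss_db = 128.1 + 37.6 * np.log10(distance_km)
    small_scale = rng.exponential(1.0)                  # |h|^2, unit mean
    rx_power_dbm = tx_power_dbm - pathloss_db + 10 * np.log10(small_scale)
    return rx_power_dbm - noise_dbm

# Example: a device 300 m from the AP (cf. Table V).
print(f"SNR: {snr_db(0.3):.1f} dB")
```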

In the experiments, we consider the MNIST and CIFAR-10 datasets. For the MNIST dataset, we set the number of data samples in each shard to 10 and consider a convolutional neural network model with two $5\times 5$ convolution layers with $2\times 2$ max pooling, a fully connected layer, and a final softmax output layer. The first convolution layer has 1 input channel and the second has 10 input channels. The fully connected layer is composed of 320 units with ReLU activation. When training the local models for MNIST, we set the local batch size to 10 and the number of local epochs to 10. For the CIFAR-10 dataset, we set the number of data samples in each shard to 50 and consider the well-known VGG19 model [38]. We also set the local batch size to 50 and the number of local epochs to 5.
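A plausible PyTorch instantiation of the MNIST model described above is sketched below. The text does not state the output channel counts of the two convolution layers; the sketch assumes 10 and 20 output channels (as in the standard PyTorch MNIST example), which makes the flattened feature size equal to the stated 320 units. This is one interpretation, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistCNN(nn.Module):
    """Two 5x5 conv layers with 2x2 max pooling, a 320-unit ReLU layer, and a
    softmax output layer (output channel counts are assumptions)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)    # 1 input channel
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)   # 10 input channels
        self.fc = nn.Linear(320, num_classes)           # 20 * 4 * 4 = 320

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)      # 28x28 -> 12x12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)      # 12x12 -> 4x4
        x = F.relu(x.flatten(1))                        # 320-unit ReLU layer
        return F.softmax(self.fc(x), dim=1)             # softmax output

# Example: a batch of 10 MNIST images (the local batch size used in the paper).
print(MnistCNN()(torch.randn(10, 1, 28, 28)).shape)     # torch.Size([10, 10])
```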

To evaluate the performance of our algorithms, we compare them with an ideal benchmark and state-of-the-art scheduling algorithms. (Note that we do not compare our algorithms with model compression methods [16, 17, 18], which reduce the communication cost of a single local model transmission, since they can be used orthogonally with our algorithms.) The algorithms are described as follows.

  • Bench represents an ideal benchmark algorithm in which, in each round, the AP updates the central parameters by aggregating the local updates of all devices as in FedAvg [1]. This provides an upper bound on model performance because it assumes an ideal system with no radio resource constraints and no transmission failures.

  • RR represents an algorithm in which the devices are scheduled in a round-robin manner [5]. This algorithm considers neither the arrival rates of the data samples nor the channel information of the devices.

  • $W$-max represents an algorithm that schedules the $W$ devices with the strongest channel gains. This algorithm does not consider the arrival rates of the data samples. It represents the scheduling strategies in [7, 9].

  • ALSA-PI is implemented as described in Section IV-B to schedule the devices according to the greedy policy in (23) with perfect a priori information about $\boldsymbol{\uptheta}^{*,P}$.

  • BALSA is implemented as Algorithm 2 in Section IV-C for the fully observable WDLN. For the Bayesian approach, the Jeffreys prior is used.

  • BALSA-PO is implemented as described in Section IV-D. The Jeffreys prior is used as in BALSA. This algorithm is for the partially observable WDLN.

The models are trained by asynchronous FL with the above transmission scheduling algorithms. We run 50 simulation instances for the MNIST dataset and 20 simulation instances for the CIFAR-10 dataset. In the following figures, the 95% confidence intervals are illustrated as shaded regions.

Figure 2: Average effectivity score of the ALS problem achieved by the algorithms.

V-A Effectivity Scores in the ALS Problem

We first provide the effectivity scores of the algorithms, which correspond to the objective function of the ALS problem in (16), in Fig. 2. Note that Bench is not shown in the figure since, in Bench, all devices transmit their local updates and all transmissions succeed. From the figure, we can see that our BALSAs achieve an effectivity score similar to that of ALSA-PI, which is optimal. In particular, as the rounds proceed, the effectivity scores of BALSAs converge to that of ALSA-PI. This clearly shows that our BALSAs can effectively learn the uncertainties in the WDLN. On the other hand, RR and $W$-max achieve much lower effectivity scores than BALSAs. In particular, the effectivity score of $W$-max decreases as the rounds proceed since it fails not only to maximize the number of data samples used in training but also to minimize the adverse effect of the stragglers. To demonstrate the validity of the effectivity score as a proxy for effective learning in asynchronous FL, the following subsections show that the algorithms with higher effectivity scores achieve better trained-model performance in terms of training loss, accuracy, robustness against stragglers, and learning speed.

Figure 3: Training loss and test accuracy with MNIST and CIFAR-10 datasets. (a) Training loss with MNIST; (b) Test accuracy with MNIST; (c) Training loss with CIFAR-10; (d) Test accuracy with CIFAR-10.

V-B Training Loss and Test Accuracy

To compare the performance of asynchronous FL under the different transmission scheduling algorithms, which is our ultimate goal, we provide the training loss and test accuracy of the algorithms. Fig. 3 shows the training loss and test accuracy with the MNIST and CIFAR-10 datasets. From Figs. 3a and 3c, we can see that ALSA-PI, BALSA, and BALSA-PO achieve training losses similar to that of Bench. Compared with them, RR and $W$-max incur larger training losses. A larger training loss typically implies lower model accuracy, which is clearly shown in Figs. 3b and 3d. From the figures, we can see that the models trained with RR and $W$-max achieve significantly lower accuracy than the models trained with the other algorithms, and their variances are much larger as well. These results imply that RR and $W$-max fail to effectively gather the local updates while addressing the stragglers because they are unable to account for the characteristics of asynchronous FL and the uncertainties in the WDLN. On the other hand, our BALSAs gather enough local updates to train the model while effectively addressing the stragglers in asynchronous FL by considering the effectivity score.

Figure 4: Training loss and test accuracy of a single simulation instance with MNIST and CIFAR-10 datasets. (a) Training loss with MNIST; (b) Test accuracy with MNIST; (c) Training loss with CIFAR-10; (d) Test accuracy with CIFAR-10.

V-C Robustness of Learning against Stragglers

In Fig. 3, the unstable learning caused by the stragglers is not clearly visible because the fluctuations of the training loss and accuracy are averaged out over the multiple simulation instances. Hence, in Fig. 4, we show the robustness of the algorithms against the stragglers more clearly through the training loss and test accuracy of a single simulation instance. First of all, it is worth emphasizing that Bench is the most stable with respect to stragglers because it has no stragglers. From Figs. 4a and 4c, we can see that the training losses of the algorithms that consider asynchronous FL via the effectivity score (i.e., ALSA-PI and BALSAs) are nearly as stable as that of Bench. Accordingly, their corresponding test accuracies are also stable. On the other hand, for the algorithms that do not consider asynchronous FL (i.e., RR and $W$-max), many spikes (i.e., short-lasting peaks) appear in their training losses, and their test accuracies are correspondingly unstable. Moreover, the training loss and test accuracy of $W$-max are significantly less stable than those of RR. This is because the scheduling strategy of $W$-max is highly biased by the average channel gains, while RR sequentially schedules all the devices. Such biased scheduling in $W$-max produces many more stragglers than RR. This clearly shows that a transmission scheduling algorithm may cause unstable learning if its scheduling strategy is biased without considering the stragglers.

Figure 5: Target accuracy satisfaction rate and average required rounds for satisfying the target accuracy with MNIST and CIFAR-10 datasets. (a) Target accuracy satisfaction rate with MNIST; (b) Average required rounds with MNIST; (c) Target accuracy satisfaction rate with CIFAR-10; (d) Average required rounds with CIFAR-10.

V-D Satisfaction Rate and Learning Speed

In Fig. 5, we provide the target accuracy satisfaction rate and the average number of rounds required to satisfy the target accuracy for each algorithm. For the MNIST dataset, we vary the target accuracy from 0.7 to 0.95, and for the CIFAR-10 dataset, from 0.48 to 0.72. The satisfaction rate of each algorithm is obtained as the ratio of the simulation instances in which the test accuracy of the trained model exceeds the target accuracy at the end of the simulation to the total number of simulation instances. For the average required rounds, we find the minimum number of rounds in each simulation instance at which the target accuracy is satisfied. To avoid the effect of the spikes, we use the following criterion for satisfying the target accuracy: all test accuracies in three consecutive rounds exceed the target accuracy. From Figs. 5a and 5c, we can see that the satisfaction rates of ALSA-PI and BALSAs are similar to that of Bench. On the other hand, the satisfaction rates of RR and $W$-max decrease significantly as the target accuracy increases. Figs. 5b and 5d provide the average number of rounds required by each algorithm to satisfy the target accuracy. From the figures, we can see that the algorithms considering asynchronous FL (ALSA-PI and BALSAs) require fewer rounds to satisfy the target accuracy. This clearly shows that the algorithms that consider the effectivity score of asynchronous FL (i.e., ALSA-PI and BALSAs) learn faster than the algorithms that do not (i.e., RR and $W$-max).
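As a small illustration of the evaluation criterion above, the following hypothetical helper returns the first round at which three consecutive test accuracies exceed the target, or None if the target is never satisfied; the satisfaction rate is then the fraction of simulation instances with a non-None result. The exact bookkeeping used in the paper may differ.

```python
def rounds_to_target(test_acc, target):
    """First round (1-indexed) whose accuracy starts a run of three
    consecutive rounds all exceeding the target accuracy."""
    for t in range(len(test_acc) - 2):
        if all(acc > target for acc in test_acc[t:t + 3]):
            return t + 1
    return None

# Example: the target 0.9 is first satisfied starting at round 4.
print(rounds_to_target([0.85, 0.92, 0.88, 0.91, 0.93, 0.95], 0.9))
```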

VI Conclusion and Future Work

In this paper, we proposed the asynchronous FL procedure in the WDLN and investigated its convergence. We also investigated transmission scheduling in the WDLN for effective learning. To this end, we first proposed the effectivity score of asynchronous FL, which represents the amount of learning while accounting for the harmful effects of stragglers. We then formulated the ALS problem that maximizes the effectivity score of asynchronous FL. We developed ALSA-PI, which solves the ALS problem when perfect a priori information is given. We also developed BALSA and BALSA-PO, which effectively solve the ALS problem without a priori information by learning the uncertainty of the stochastic data arrivals with a minimal amount of information. Our experimental results show that ALSA-PI and BALSAs achieve performance similar to that of the ideal benchmark and outperform the other baseline scheduling algorithms. These results clearly show that the transmission scheduling strategy based on the effectivity score, which is adopted in our algorithms, is effective for asynchronous FL. Moreover, our BALSAs effectively schedule the transmissions even without any a priori information by learning the system uncertainties. As future work, a non-i.i.d. distribution of data samples over devices can be incorporated into the ALS problem for more effective learning in such scenarios. In addition, subchannel allocation and power control can be considered to minimize the time consumed by the FL update procedure by utilizing the resources more effectively in asynchronous FL.

Appendix A Proof of Theorem 1

We first provide the following lemma, which is supported by [13, 39], to prove Theorem 1.

Lemma 1

If $l(\mathbf{w})$ is $\xi$-strongly convex, then with Assumption 1, we have

$2\xi(l(\mathbf{w}^{t})-l(\mathbf{w}^{*}))\leq\|\nabla l(\mathbf{w}^{t})\|^{2}.$ (29)

Using Lemma 1, we now prove Theorem 1. Since $l(\mathbf{w})$ is $L$-smooth as provided in Definition 1 (i.e., $f(\mathbf{x}_{1})-f(\mathbf{x}_{2})\leq\left<\nabla f(\mathbf{x}_{2}),\mathbf{x}_{1}-\mathbf{x}_{2}\right>+\frac{L}{2}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|^{2},~\forall\mathbf{x}_{1},\mathbf{x}_{2}$), the following holds:

$l(\mathbf{w}^{t+1})-l(\mathbf{w}^{t})\leq\left<\nabla l(\mathbf{w}^{t}),\mathbf{w}^{t+1}-\mathbf{w}^{t}\right>+\frac{L}{2}\|\mathbf{w}^{t+1}-\mathbf{w}^{t}\|^{2}$
$=-\sum_{u\in\bar{\mathcal{U}}^{t}}\nabla l(\mathbf{w}^{t})^{\top}\eta_{u}^{t}c_{u}^{t}\psi_{u}^{t}+\frac{L}{2}\Big\|\sum_{u\in\bar{\mathcal{U}}^{t}}\eta_{u}^{t}c_{u}^{t}\psi_{u}^{t}\Big\|^{2}$
$\leq-\sum_{u\in\bar{\mathcal{U}}^{t}}\nabla l(\mathbf{w}^{t})^{\top}\eta_{u}^{t}c_{u}^{t}\psi_{u}^{t}+\frac{L}{2}\Big(\sum_{u\in\bar{\mathcal{U}}^{t}}\|\eta_{u}^{t}c_{u}^{t}\psi_{u}^{t}\|\Big)^{2}.$ (30)

Let us define $q=\max_{u\in\mathcal{U}}\{\eta_{u}^{t}c_{u}^{t}\}$. Then, $q>0$, and with Assumption 1 and the gradient dissimilarity, the following holds:

$\mathbb{E}[l(\mathbf{w}^{t+1})]-l(\mathbf{w}^{t})$
$\leq-q\sum_{u\in\bar{\mathcal{U}}^{t}}\nabla l(\mathbf{w}^{t})^{\top}\mathbb{E}[\psi_{u}^{t}]+\frac{Lq^{2}}{2}\mathbb{E}\Big[\sum_{u\in\bar{\mathcal{U}}^{t}}\|\psi_{u}^{t}\|\Big]^{2}$
$\leq-q\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}\|\nabla l(\mathbf{w}^{t})\|^{2}+\frac{Lq^{2}W^{2}V^{2}}{2}\|\nabla l(\mathbf{w}^{t})\|^{2}$
$=-q\left(\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}-\frac{LqW^{2}V^{2}}{2}\right)\|\nabla l(\mathbf{w}^{t})\|^{2}.$ (31)

Let us define $\bar{\eta}^{t}=\max_{u\in\mathcal{U}}\eta_{u}^{t}$. Then, because $c_{u}^{t}<1$ for all $u\in\mathcal{U},t\in\mathcal{T}$, we have $-q\left(\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}-\frac{LqW^{2}V^{2}}{2}\right)<-\bar{\eta}^{t}\left(\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}-\frac{L\bar{\eta}^{t}W^{2}V^{2}}{2}\right)$. Now, with Lemma 1, we can rewrite the inequality in (31) as

$\mathbb{E}[l(\mathbf{w}^{t+1})]-l(\mathbf{w}^{t})\leq-2\xi\bar{\eta}^{t}\left(\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}-\frac{L\bar{\eta}^{t}W^{2}V^{2}}{2}\right)(l(\mathbf{w}^{t})-l(\mathbf{w}^{*})).$ (32)

By subtracting $l(\mathbf{w}^{*})$ from both sides and rearranging the inequality, we have

$\mathbb{E}[l(\mathbf{w}^{t+1})]-l(\mathbf{w}^{*})\leq\left\{1+2\xi\bar{\eta}^{t}\left(\frac{L\bar{\eta}^{t}W^{2}V^{2}}{2}-\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}x_{u}^{t}\right)\right\}(l(\mathbf{w}^{t})-l(\mathbf{w}^{*})).$ (33)

By taking the expectation of both sides, we get

$\mathbb{E}[l(\mathbf{w}^{t+1})-l(\mathbf{w}^{*})]\leq\left\{1+2\xi\bar{\eta}^{t}\left(\frac{L\bar{\eta}^{t}W^{2}V^{2}}{2}-\epsilon\sum_{u\in\mathcal{U}}a_{u}^{t}\mathbb{P}[x_{u}^{t}=1]\right)\right\}(l(\mathbf{w}^{t})-l(\mathbf{w}^{*})),$ (34)

and by telescoping the above inequalities, we obtain the equation in (13) in Theorem 1. Then, if $\sum_{u\in\mathcal{U}}a_{u}^{t}\mathbb{P}[x_{u}^{t}=1]\geq\zeta$, we can rewrite the inequality in (34) as

$\mathbb{E}[l(\mathbf{w}^{t+1})-l(\mathbf{w}^{*})]\leq(1-2\xi\underline{\eta}\epsilon^{\prime})(l(\mathbf{w}^{t})-l(\mathbf{w}^{*})),$ (35)

where $\epsilon^{\prime}=\epsilon\zeta-\frac{L\bar{\eta}W^{2}V^{2}}{2}$ and $\bar{\eta}=\frac{2\epsilon\zeta}{LW^{2}V^{2}}(\max_{u\in\mathcal{U},\forall t}\{c_{u}^{t}\})^{-1}$. By telescoping the above inequalities, we obtain the equation in (14) in Theorem 1.

Appendix B Proof of Theorem 2

Suppose that there exists an optimal policy whose chosen action in a round with state $\mathbf{s}$ is not identical to that of the greedy policy. We denote the expected instantaneous effectivity score of round $t$ under the optimal policy by $r_{*}^{t}(\mathbf{s})$ and that under the greedy policy by $r_{g}^{t}(\mathbf{s})$. Based on the equation in (21), we can decompose the expected instantaneous effectivity score into two sub-rewards as $r(\mathbf{s})=r_{A}(\mathbf{s})+r_{B}(\mathbf{s})$, where $r_{A}(\mathbf{s})=\mathbb{E}[\sum_{u\in\mathcal{U}}a_{u}x_{u}(1+\gamma)(n_{u}+m_{u})]$ and $r_{B}(\mathbf{s})=-\mathbb{E}[\sum_{u\in\mathcal{U}}\gamma(n_{u}+m_{u})]$. In the problem, the samples are accumulated if they are not used for central training. Moreover, the channel gains and the arrival rates of the samples of the devices are i.i.d. Hence, if a policy $\pi$ satisfies

$\sum_{s\in\mathcal{S}}\mathbb{P}[a_{u}^{\pi}(\mathbf{s})]>0\textrm{ for all }u\in\mathcal{U},$ (36)

where $a_{u}^{\pi}$ represents $a_{u}$ under policy $\pi$, any arrived samples will eventually be reflected in the sub-reward $r_{A}$ regardless of the policy due to the accumulation of the samples. This implies that the average sub-reward $\sum_{t=1}^{T}r_{A}^{t}/T$ converges to $\sum_{u\in\mathcal{U}}\theta_{u}^{P}$ as $T\rightarrow\infty$. Accordingly, if a policy satisfies the condition in (36), its optimality for the ALS problem depends on minimizing the delay cost (i.e., the sub-reward $r_{B}$). Under the greedy policy, any device will be scheduled once the number of its aggregated samples becomes large enough. Hence, the greedy policy satisfies the condition if the parameters are finite. We now denote the expected sub-rewards $r_{B}$ of round $t$ under the optimal policy and the greedy policy by $r_{B,*}^{t}$ and $r_{B,g}^{t}$, respectively. From the definition of the greedy policy, it is obvious that $\mathbb{E}[\sum_{u\in\mathcal{U}}n_{u}^{t}|\mathbf{a}_{*}]\geq\mathbb{E}[\sum_{u\in\mathcal{U}}n_{u}^{t}|\mathbf{a}_{g}]$, which implies that $r_{B,*}^{t}\leq r_{B,g}^{t}$ for any random disturbances. Consequently, this leads to $J^{*}\leq J^{g}$, which implies that the greedy policy is optimal.

Appendix C Proof of Theorem 3

We define $K_{T}=\mathop{\rm argmax}\{k:t_{k}\leq T\}$, which represents the number of stages in BALSA until round $T$. For $t_{k}\leq t<t_{k+1}$ in stage $k$, we have the following equation from (19):

$r(\mathbf{s}^{t},\mathbf{a}^{t})=J(\boldsymbol{\uptheta}_{k})+v(\mathbf{s}^{t},\boldsymbol{\uptheta}_{k})-\sum_{\mathbf{s}^{\prime}\in\mathcal{S}}P^{\boldsymbol{\uptheta}_{k}}(\mathbf{s}^{\prime}|\mathbf{s}^{t},\mathbf{a}^{t})v(\mathbf{s}^{\prime},\boldsymbol{\uptheta}_{k}).$ (37)

Then, the expected regret of BALSA is derived as

$T\mathbb{E}[J(\boldsymbol{\uptheta}_{*})]-\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}r(\mathbf{s}^{t},\mathbf{a}^{t})\Big]=R_{1}+R_{2}+R_{3},$ (38)

where

$R_{1}=T\mathbb{E}[J(\boldsymbol{\uptheta}_{*})]-\mathbb{E}\Big[\sum_{k=1}^{K_{T}}T_{k}J(\boldsymbol{\uptheta}_{k})\Big],$
$R_{2}=\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}v(\mathbf{s}^{t+1},\boldsymbol{\uptheta}_{k})-v(\mathbf{s}^{t},\boldsymbol{\uptheta}_{k})\Big],$
$\textrm{and }R_{3}=\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{\mathbf{s}^{\prime}\in\mathcal{S}}P^{\boldsymbol{\uptheta}_{k}}(\mathbf{s}^{\prime}|\mathbf{s}^{t},\mathbf{a}^{t})v(\mathbf{s}^{\prime},\boldsymbol{\uptheta}_{k})-v(\mathbf{s}^{t+1},\boldsymbol{\uptheta}_{k})\Big].$

We can bound the regret of BALSA by deriving bounds on $R_{1}$, $R_{2}$, and $R_{3}$, as in the following lemma:

Lemma 2

For the expected regret of BALSA, we have the following bounds:

  • The first term is bounded as $R_{1}\leq\mathbb{E}[K_{T}]$.

  • The second term is bounded as $R_{2}\leq\mathbb{E}[HK_{T}]$.

  • The third term is bounded as $R_{3}\leq 49HS\sqrt{AT\log(AT)}$.

In addition, we can bound the number of stages $K_{T}$ as follows.

Lemma 3

The number of stages in BALSA until round $T$ is bounded as $K_{T}\leq\sqrt{2SAT\log(T)}$.

The above lemmas can be proven following steps similar to those of Lemmas 1–5 in [34]; hence, we omit the proofs due to space limitations and refer the reader to [34] for details. From the equation in (38), we have $R(T)=R_{1}+R_{2}+R_{3}$. Then, Theorem 3 follows from Lemmas 2 and 3.

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017.
  • [2] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Commun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 2020.
  • [3] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, May 2020.
  • [4] ——, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, Mar. 2020.
  • [5] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, Jan. 2020.
  • [6] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, June 2019.
  • [7] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of update aware device scheduling for federated learning at the wireless edge,” IEEE Trans. Wireless Commun., vol. 20, no. 6, pp. 3643–3658, June 2021.
  • [8] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, “Joint device scheduling and resource allocation for latency constrained wireless federated learning,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 453–467, Jan. 2021.
  • [9] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
  • [10] J. Xu and H. Wang, “Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective,” IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 1188–1200, Feb. 2021.
  • [11] W. Xia, T. Q. Quek, K. Guo, W. Wen, H. H. Yang, and H. Zhu, “Multi-armed bandit-based client scheduling for federated learning,” IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7108–7123, Nov. 2020.
  • [12] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 10, pp. 4229–4238, Oct. 2020.
  • [13] Y. Chen, Y. Ning, M. Slawski, and H. Rangwala, “Asynchronous online federated learning for edge devices with non-IID data,” in Proc. 2020 IEEE Int. Conf. on Big Data (Big Data), 2020.
  • [14] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu, “Asynchronous stochastic gradient descent with delay compensation,” in Proc. ICML, 2017.
  • [15] C. Xie, S. Koyejo, and I. Gupta, “Asynchronous federated optimization,” arXiv preprint arXiv:1903.03934, 2020.
  • [16] T. Chen, G. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Proc. NIPS, 2018.
  • [17] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in Proc. ICLR, 2018.
  • [18] J. Xu, W. Du, Y. Jin, W. He, and R. Cheng, “Ternary compression for communication-efficient federated learning,” IEEE Trans. Neural Netw. Learn. Syst., to be published.
  • [19] F. Li, D. Yu, H. Yang, J. Yu, H. Karl, and X. Cheng, “Multi-armed-bandit-based spectrum scheduling algorithms in wireless networks: A survey,” IEEE Wireless Commun., vol. 27, no. 1, pp. 24–30, 2020.
  • [20] H.-S. Lee, J.-Y. Kim, and J.-W. Lee, “Resource allocation in wireless networks with deep reinforcement learning: A circumstance-independent approach,” IEEE Syst. J., vol. 14, no. 2, pp. 2589–2592, 2020.
  • [21] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. NIPS, 2017.
  • [22] I. M. Baytas, M. Yan, A. K. Jain, and J. Zhou, “Asynchronous multi-task learning,” in IEEE ICDM, 2016.
  • [23] Y. Xi, A. Burr, J. Wei, and D. Grace, “A general upper bound to evaluate packet error rate over quasi-static fading channels,” IEEE Trans. Wireless Commun., vol. 10, no. 5, pp. 1373–1377, May 2011.
  • [24] P. Ferrand, J.-M. Gorce, and C. Goursaud, “Approximations of the packet error rate under quasi-static fading in direct and relayed links,” EURASIP J. on Wireless Commun. and Netw., vol. 2015, no. 1, p. 12, Jan. 2015.
  • [25] J. Wu, G. Wang, and Y. R. Zheng, “Energy efficiency and spectral efficiency tradeoff in type-I ARQ systems,” IEEE J. Sel. Areas Commun., vol. 32, no. 2, pp. 356–366, Feb. 2014.
  • [26] S. Ge, Y. Xi, S. Huang, and J. Wei, “Packet error rate analysis and power allocation for cc-harq over rayleigh fading channels,” IEEE Commun. Lett., vol. 18, no. 8, pp. 1467–1470, Aug. 2014.
  • [27] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
  • [28] Z. Tao and Q. Li, “eSGD: Communication efficient distributed deep learning on the edge,” in Proc. USENIX Workshop Hot Topics Edge Comput. (HotEdge 18), 2018.
  • [29] P. Domingos, “A few useful things to know about machine learning,” Commun. of the ACM, vol. 55, no. 10, pp. 78–87, 2012.
  • [30] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.   MIT press Cambridge, 2016, vol. 1, no. 2.
  • [31] D. P. Bertsekas, Dynamic programming and optimal control.   Athena scientific Belmont, MA, 1995, vol. 1, no. 2.
  • [32] A. Gopalan and S. Mannor, “Thompson sampling for learning parameterized Markov decision processes,” in Proc. Conf. on Learn. Theory, 2015.
  • [33] I. Osband and B. Van Roy, “Why is posterior sampling better than optimism for reinforcement learning?” in Proc. ICML, 2017.
  • [34] Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain, “Learning unknown Markov decision processes: A Thompson sampling approach,” in Proc. NIPS, 2017.
  • [35] H.-S. Lee, C. Shen, W. Zame, J.-W. Lee, and M. Schaar, “SDF-Bayes: Cautious optimism in safe dose-finding clinical trials with drug combinations and heterogeneous patient groups,” in Proc. AISTATS, 2021.
  • [36] M. Misgiyati and K. Nisa, “Bayesian inference of poisson distribution using conjugate and non-informative priors,” in Prosiding Seminar Nasional Metode Kuantitatif, no. 1, 2017.
  • [37] X. Hou and H. Kayama, Demodulation reference signal design and channel estimation for LTE-Advanced uplink.   INTECH Open Access Publisher, 2011.
  • [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
  • [39] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Siam Review, vol. 60, no. 2, pp. 223–311, 2018.