
1 School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
  Emails: qiongwu@jiangnan.edu.cn, siyuanwang@stu.jiangnan.edu.cn
2 State Key Laboratory of Integrated Services Networks (Xidian University), Xi’an 710071, China
3 Department of Electronic Engineering, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
  Email: fpy@tsinghua.edu.cn
4 Qualcomm, San Jose, CA 95110, USA
  Email: qf9898@gmail.com

Deep Reinforcement Learning Based Vehicle Selection for Asynchronous Federated Learning Enabled Vehicular Edge Computing

Qiong Wu 1,2    Siyuan Wang 1,2    Pingyi Fan 3    Qiang Fan 4
Abstract

In traditional vehicular networks, computing tasks generated by vehicles are usually uploaded to the cloud for processing. However, since offloading tasks to the cloud causes a large delay, vehicular edge computing (VEC) is introduced to avoid this problem and improve the overall system performance, where a roadside unit (RSU) with a certain computing capability processes the data of vehicles as an edge entity. Owing to privacy and security concerns, vehicles are reluctant to upload their local data directly to the RSU, and thus federated learning (FL) becomes a promising technology for some machine learning tasks in VEC, where vehicles only need to upload their local model parameters instead of transferring their local data to the nearby RSU. Furthermore, since vehicles have different local training times due to their various local data sizes and computing capabilities, asynchronous federated learning (AFL) is employed so that the RSU can update the global model immediately after receiving a local model, which reduces the aggregation delay. However, in AFL for VEC, different vehicles may have different impacts on the global model update because of their various local training delays, transmission delays and local data sizes. In addition, if bad nodes exist among the vehicles (i.e., vehicles with a small amount of data and few local computing resources whose local models are polluted by random noise), they degrade the quality of the global aggregation at the RSU. To solve the above problem, we propose a deep reinforcement learning (DRL) based vehicle selection scheme to improve the accuracy of the global model in AFL-enabled vehicular networks. In the scheme, we map the state, action and reward of the DRL framework to the specific problem. Simulation results demonstrate that our scheme can effectively remove the bad nodes and improve the aggregation accuracy of the global model.

Footnote: Supported in part by the National Natural Science Foundation of China (No. 61701197), in part by the open research fund of the State Key Laboratory of Integrated Services Networks (No. ISN23-11), in part by the National Key Research and Development Program of China (No. 2021YFA1000500(4)), and in part by the 111 Project (No. B23008). (Qiong Wu and Siyuan Wang contributed equally to this work.)

Keywords:
Deep reinforcement learning (DRL) · Asynchronous federated learning (AFL) · Accuracy · Mobility · Delay.

1 Introduction

The emerging Internet of Vehicles (IoV) has become a promising technology to make our life more convenient [1, 2, 3, 4]. At the same time, intelligent services have become a critical part of various vehicles [5]. Therefore, vehicles driving on the road generate computing tasks according to the high-quality service requirements of users [6, 7]. However, in traditional cloud computing, the cloud is far from the moving vehicles, incurring a high task delay when tasks are offloaded to the cloud, which is not suitable for high-speed vehicles. Thus, vehicular edge computing (VEC) [8] is introduced to enable vehicles to offload computing tasks to a roadside unit (RSU) with a certain computing capability to reduce the task processing delay. However, it requires vehicles to upload local data to the RSU for processing, which is challenging because people are reluctant to share their local data due to privacy concerns [9, 10]. Thus, federated learning (FL) is designed to handle this issue [11, 12]. Specifically, FL performs iterative global aggregations at the RSU. In one round, each vehicle first downloads the current global model from the RSU and then uses its local data for local training. The trained local model is then uploaded to the RSU. When the RSU has received all the trained local models from the vehicles, it performs the global aggregation and broadcasts the updated global model to the vehicles. This process is repeated until the specified number of rounds is reached. Since local data cannot be accessed at the RSU, data privacy is physically ensured.

However, in conventional FL [13], the RSU needs to wait for all vehicles to upload their local models before updating the global model [14]. If some vehicles have high local training and transmission delays, they may drive out of the coverage of the RSU and thus cannot participate in the global aggregation. Thus, asynchronous federated learning (AFL) is introduced [15, 16, 17]. Specifically, a vehicle uploads its local model after finishing one round of local training, and the RSU updates the global model as soon as it receives a local model. This enables a faster update of the global model at the RSU without waiting for other vehicles.

Vehicle mobility causes time-varying channel conditions and transmission rates [18, 19], and thus vehicles have different transmission delays [20, 21, 22, 23]. At the same time, different vehicles have different time-varying computing resources and different amounts of local data, which cause different local training delays. Since vehicles upload their local models asynchronously in AFL, it is possible that the RSU has already updated the global model according to the received local models while some vehicles have not yet uploaded their local models. As a result, the local models of these vehicles become stale. Staleness is related to the local training delay and the transmission delay. Therefore, it is important to consider the impact of the above factors on the accuracy of the global model at the RSU.

In AFL, some bad nodes may exist in the network, that is, vehicles with few available computing resources, a small amount of local data, or local models polluted by random noise. Such bad nodes can significantly affect the accuracy of the global model at the RSU [24]. Therefore, it is necessary to select the vehicles that participate in the global aggregation. Deep reinforcement learning (DRL) may provide a way to select the proper vehicles to solve this problem [25]. Specifically, it takes an action based on the current state of the vehicles and then obtains the corresponding reward. After that, the next state is reached, and the above steps are repeated. Finally, the neural network provides an optimal vehicle selection policy for the system.

In this paper, we propose an AFL weight optimization scheme that selects vehicles based on the Deep Deterministic Policy Gradient (DDPG) while considering the mobility of the vehicles, time-varying channel conditions, time-varying available computing resources of the vehicles, different amounts of local data of the vehicles, and the existence of bad nodes. The source code has been released at https://github.com/qiongwu86/AFLDDPG. The main contributions of this paper are as follows:

  1) By considering bad nodes with less local data, fewer available computing resources and local models polluted by random noise, we employ DDPG to select the vehicles that participate in AFL, so as to avoid the impact of bad nodes on the global model aggregation.

  2) We consider the impact of vehicle mobility, time-varying channel conditions, the time-varying available computing resources of the vehicles and the different amounts of local data of the vehicles to perform a weight optimization and select the vehicles that participate in the global aggregation, thereby improving the accuracy of the global model.

  3) Extensive simulation results demonstrate that our scheme can effectively remove the bad nodes and improve the accuracy of the global model.

2 Related Works

In the literature, there are many research works on FL in vehicular networks.

In [26], Zhou et al. proposed a two-layer federated learning framework based on 6G-supported vehicular networks to improve the learning accuracy. In [27], Zhang et al. proposed a method using federated transfer learning to detect the drowsiness of drivers to preserve drivers’ privacy while increasing the accuracy of their scheme. In [28], Xiao et al. proposed a greedy strategy to select vehicles according to position and velocity to minimize the cost of FL. In [29], Saputra et al. proposed a vehicle selection method based on vehicle locations and history information, and then developed a multi-principal one-agent contract-based policy to maximize the profits of the service provider and vehicles while improving the accuracy of their scheme. In [30], Yan et al. proposed a power allocation scheme based on FL to maximize energy efficiency while getting a more accurate power allocation strategy. In [31], Ye et al. proposed an incentive mechanism using multidimensional contract theory and prospect theory to optimize the incentive for vehicles when performing tasks. In [32], Kong et al. proposed a federated learning-based license plate recognition framework to achieve high accuracy and low cost for detecting and recognizing license plates. In [33], Saputra et al. proposed an economic-efficiency framework using FL for an electric vehicular network to maximize the profits of charging stations. In [34], Ye et al. proposed a selective model aggregation approach to get a higher accuracy of the global model. In [35], Zhao et al. proposed a scheme combining FL with local differential privacy to get a high accuracy when the privacy budget is small. In [36], Li et al. proposed an identity-based privacy preserving scheme to protect the privacy of vehicular messages; it can reduce the training loss while increasing the accuracy. In [37], Taïk et al. proposed a scheme including FL and the corresponding learning and scheduling process to efficiently use vehicle-to-vehicle resources to bypass the communication bottleneck; this scheme can effectively improve the learning accuracy. In [38], Hui et al. proposed a digital twins enabled on-demand matching scheme for multi-task FL to address the two-way selection problem between task requesters and RSUs. In [39], Liu et al. proposed an efficient-communication approach, which consists of a customized local training strategy, a partial client participation rule and a flexible aggregation policy, to improve the test accuracy and average communication optimization rate. In [40], Lv et al. proposed a blockchain-based FL scheme to detect misbehavior and finally achieve higher accuracy and efficiency. In [41], Khan et al. proposed a DRL-based FL framework to minimize the cost considering the packet error rate and global loss. In [42], Samarakoon et al. proposed a scheme considering joint power and resource allocation for ultra-reliable low-latency communication in vehicular networks to keep a high accuracy while reducing the average power consumption and the amount of exchanged data. In [43], Hammoud et al. proposed a horizontal-based FL, empowered by fog federations and devised for the mobile environment, to improve the accuracy and service quality of IoV intelligent applications.

However, these works have not considered the situation in which vehicles may drive out of the coverage of the RSU before they upload their local models, which deteriorates the accuracy of the global model.

A few works have studied AFL in vehicular networks. In [44], Tian et al. proposed an asynchronous federated deep Q-learning network to solve the task offloading problem in vehicular networks, then designed a queue-aware algorithm to allocate computing resources. In [45], Pan et al. proposed a scheme using AFL and a deep Q-learning algorithm to maximize throughput while considering long-term ultra-reliable and low-latency communication constraints. However, these works have not considered the mobility of vehicles, the amount of local data and the computing capability when selecting vehicles in the design of AFL for vehicular networks, nor the impact of bad nodes. This motivates us to carry out this work by considering the key factors affecting AFL applications in vehicular networks.

3 System Model

Figure 1: System model.

This section describes the system model. As shown in Fig. 1, we consider an edge-assisted vehicular network consisting of an RSU and $K$ vehicles within its coverage. In the network, the bottom of the RSU is the origin, the $x$-axis points east, the $y$-axis points south, and the $z$-axis is perpendicular to the $x$-axis and $y$-axis and along the direction of the RSU's antenna. Specifically, vehicles are assumed to move eastward with the same velocity within the coverage of the RSU, which matches most highway scenarios. The time domain is divided into discrete time slots. Each vehicle $i\ (1 \le i \le K)$ carries a different amount of data $D_i$ and has a different computing capability. At the same time, vehicle mobility incurs time-varying channel conditions.

We first use the DRL algorithm to select the vehicles participating in AFL according to each vehicle's transmission rate, amount of available computing resources and position, and then the selected vehicles train and upload their local models to the RSU. That is, each selected vehicle uses its local data to train its local model, and the weight of the local model is then optimized according to the local training delay and the transmission delay. All selected vehicles upload their local models to the RSU asynchronously. After multiple rounds of model aggregation, we obtain a more accurate global model at the RSU. For ease of understanding, the main notations used in this paper are listed in Table 1.

Table 1: Notations used in this paper
Notation Description
$K$: Total number of vehicles within the coverage of the RSU
$v$: Velocity of vehicles
$D_i$: Amount of data carried by vehicle $i$
$\mu_i$: Computing resources of vehicle $i$
$P_i(t)$: Position of vehicle $i$ at time slot $t$
$d_{ix}(t)$: Position of vehicle $i$ along the $x$-axis from the antenna of the RSU at time slot $t$
$d_y$: Position of vehicle $i$ along the $y$-axis from the antenna of the RSU
$d_{i0}$: Initial position of vehicle $i$ along the $x$-axis
$H_r$: Height of the RSU's antenna
$P_r$: Position of the RSU's antenna
$d_i(t)$: Distance from vehicle $i$ to the antenna of the RSU at time slot $t$
$tr_i(t)$: Transmission rate of vehicle $i$ at time slot $t$
$B$: Transmission bandwidth
$p_0$: Transmission power of each vehicle
$h_i(t)$: Channel gain
$\alpha$: Path loss exponent
$\sigma^2$: Power of noise
$\rho_i$: Normalized channel correlation coefficient between consecutive time slots
$f_d^i$: Doppler frequency of vehicle $i$
$\Lambda$: Wavelength
$\theta$: Angle between the moving direction and the uplink communication direction
$C_0$: Number of CPU cycles required to train one unit of data
$T_l^i$: Local training delay of vehicle $i$
$T_u^i(t)$: Transmission delay for vehicle $i$ to upload its local model at time slot $t$
$|w|$: Size of the local model of each vehicle
$\gamma$: Discount factor
$N$: Total number of time slots
$\delta$: Parameter of the actor network
$\delta^*$: Optimized parameter of the actor network
$\xi$: Parameter of the critic network
$\xi^*$: Optimized parameter of the critic network
$\delta_1$: Parameter of the target actor network
$\delta_1^*$: Optimized parameter of the target actor network
$\xi_1$: Parameter of the target critic network
$\xi_1^*$: Optimized parameter of the target critic network
$\tau$: Update parameter for the target networks
$R_b$: Replay buffer
$\Delta_t$: Exploration noise at time slot $t$
$I$: Size of the mini-batch
$\mu_\delta$: Policy approximated by the actor network
$\mu^*$: Optimal policy of the system
$E_{\max}$: Maximum number of episodes in the training stage
$K_l$: Number of selected vehicles
$l$: Number of local training iterations
$m_1$: Parameter of the training weight
$m_2$: Parameter of the transmission weight
$E'_{\max}$: Maximum number of episodes in the testing stage

4 Parameter Computation

For simplicity, we first introduce some parameters used in the following sections.

4.1 Local Training Delay

Vehicle $i$ uses its local data to train a local model; thus the local training delay $T_l^i$ of vehicle $i$ can be calculated as:

T_l^i = \frac{D_i C_0}{\mu_i} (1)

where $C_0$ is the number of CPU cycles required to process one unit of data, and $\mu_i$ is the computing resource of vehicle $i$, i.e., its CPU cycle frequency.

4.2 Distance

Denote $P_i(t)$ as the position of vehicle $i$ at time slot $t$, and $d_{ix}(t)$ and $d_y$ as the distances between vehicle $i$ and the antenna of the RSU along the $x$-axis and $y$-axis, respectively, at time slot $t$. Thus $P_i(t)$ can be expressed as $(d_{ix}(t), d_y, 0)$. Here, $d_y$ is a fixed value, and $d_{ix}(t)$ can be denoted as:

d_{ix}(t) = d_{i0} + vt (2)

where $d_{i0}$ is the initial position of vehicle $i$ along the $x$-axis.

We set the height of the RSU's antenna as $H_r$, and thus the position of the antenna of the RSU can be expressed as $P_r = (0, 0, H_r)$. Then the distance between vehicle $i$ and the antenna of the RSU at time slot $t$ can be expressed as:

d_i(t) = \left\| P_i(t) - P_r \right\| (3)

4.3 Transmission Rate

We denote the transmission rate of vehicle $i$ at time slot $t$ as $tr_i(t)$. According to Shannon's theorem, it can be expressed as:

tr_i(t) = B \log_2 \left( 1 + \frac{p_0 \cdot h_i(t) \cdot \left( d_i(t) \right)^{-\alpha}}{\sigma^2} \right) (4)

where $B$ is the transmission bandwidth, $p_0$ is the transmission power of each vehicle, $h_i(t)$ is the channel gain of vehicle $i$ at time slot $t$, $\alpha$ is the path loss exponent, and $\sigma^2$ is the power of the noise.

We use an autoregressive model to formulate the relationship between $h_i(t)$ and $h_i(t-1)$:

h_i(t) = \rho_i h_i(t-1) + e(t) \sqrt{1 - \rho_i^2} (5)

where $\rho_i$ is the normalized channel correlation coefficient between consecutive time slots, and $e(t)$ is the error vector following a complex Gaussian distribution and related to $h_i(t)$. According to Jake's fading spectrum, $\rho_i = J_0(2\pi f_d^i t)$, where $J_0(\cdot)$ is the zeroth-order Bessel function of the first kind and $f_d^i$ is the Doppler frequency of vehicle $i$, which can be calculated as:

f_d^i = \frac{v}{\Lambda} \cos\theta (6)

where $\Lambda$ is the wavelength and $\theta$ is the angle between the moving direction $x_0 = (1, 0, 0)$ and the uplink communication direction $P_r - P_i(t)$. Thus $\cos\theta$ can be calculated as:

\cos\theta = \frac{x_0 \cdot \left( P_r - P_i(t) \right)}{\left\| P_r - P_i(t) \right\|} (7)
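To make the channel evolution concrete, the following Python sketch performs one AR(1) update of the channel gain following Eqs. (5)-(7). It assumes SciPy is available for the zeroth-order Bessel function, and the numerical values (velocity, wavelength, slot duration) are placeholders rather than the exact simulation settings.

```python
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def update_channel_gain(h_prev, v, wavelength, cos_theta, slot_duration, rng):
    """One AR(1) step of Eq. (5), with Jake's correlation coefficient of Eq. (6)."""
    f_d = (v / wavelength) * cos_theta              # Doppler frequency, Eq. (6)
    rho = j0(2 * np.pi * f_d * slot_duration)       # correlation between consecutive slots
    e = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # CN(0, 1) error
    return rho * h_prev + e * np.sqrt(1 - rho ** 2)

rng = np.random.default_rng(0)
h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # initial gain
h = update_channel_gain(h, v=20.0, wavelength=7.0, cos_theta=0.8,
                        slot_duration=0.5, rng=rng)
```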

4.4 Transmission Delay

The transmission delay of vehicle $i$ for uploading its local model, $T_u^i(t)$, can be denoted as:

T_u^i(t) = \frac{|w|}{tr_i(t)} (8)

where $|w|$ is the size of the local model of each vehicle.
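The delay quantities of this section can be combined into a small helper, sketched below under the assumption that $|h_i(t)|^2$ acts as the power gain in Eq. (4); the numbers in the usage example are placeholders loosely inspired by Table 2, not the exact simulation configuration.

```python
import numpy as np

def local_training_delay(D_i, C0, mu_i):
    """Eq. (1): local training delay of vehicle i."""
    return D_i * C0 / mu_i

def upload_delay(P_i, P_r, h_i, B, p0, alpha, sigma2, model_size):
    """Eqs. (3), (4) and (8): distance to the RSU antenna, Shannon rate,
    and the resulting delay for uploading the local model."""
    d_i = np.linalg.norm(np.asarray(P_i, float) - np.asarray(P_r, float))   # Eq. (3)
    gain = np.abs(h_i) ** 2        # |h_i(t)|^2 used as the power gain (assumption)
    rate = B * np.log2(1 + p0 * gain * d_i ** (-alpha) / sigma2)            # Eq. (4)
    return model_size / rate                                                # Eq. (8)

# placeholder values, not the exact simulation configuration
T_l = local_training_delay(D_i=500, C0=1e6, mu_i=2e9)
T_u = upload_delay(P_i=(50.0, 5.0, 0.0), P_r=(0.0, 0.0, 10.0), h_i=0.8 + 0.3j,
                   B=1000.0, p0=0.25, alpha=2.0, sigma2=1e-12, model_size=5000.0)
```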

5 Problem Formulation

In this section, we will formulate the problem and define the state, action and reward.

In the system, due to the mobility and the time-varying computing resources and channel conditions of the vehicles, we employ a DRL framework including state, action and reward to formulate the problem of vehicle selection. Specifically, at each time slot $t$, the system takes an action according to the policy based on the current state, then obtains the reward and transitions to the next state. Next, the state, action and reward of the system are defined, respectively.

5.1 State

Considering that vehicle mobility can be reflected by a vehicle's position, while the local training delay and transmission delay of a vehicle are related to its time-varying available computing resources and current channel condition, we define the state at time slot $t$ as:

s(t) = \left( Tr(t), \mu(t), d_x(t), a(t-1) \right) (9)

where $Tr(t)$ is the set of transmission rates of all vehicles at time slot $t$, i.e., $Tr(t) = (tr_1(t), tr_2(t), \ldots, tr_K(t))$; $\mu(t)$ is the set of available computing resources of all vehicles at time slot $t$, i.e., $\mu(t) = (\mu_1(t), \mu_2(t), \ldots, \mu_K(t))$; $d_x(t)$ is the set of all vehicles' positions along the $x$-axis at time slot $t$, i.e., $d_x(t) = (d_{1x}(t), d_{2x}(t), \ldots, d_{Kx}(t))$; and $a(t-1)$ is the action at time slot $t-1$.
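As a minimal illustration, the state of Eq. (9) can be flattened into a single vector before being fed to the actor network; the sketch below assumes a simple concatenation layout and uses toy values for $K = 3$ vehicles.

```python
import numpy as np

def build_state(tr, mu, d_x, prev_action):
    """Eq. (9): concatenate per-vehicle transmission rates, computing
    resources, x-axis positions and the previous action into one vector."""
    return np.concatenate([tr, mu, d_x, prev_action]).astype(np.float32)

# toy example with K = 3 vehicles (all values are placeholders)
s_t = build_state(tr=[1.2e3, 0.9e3, 1.5e3],
                  mu=[2.1e9, 1.8e9, 2.4e9],
                  d_x=[-120.0, -60.0, 30.0],
                  prev_action=[1.0, 1.0, 1.0])
```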

5.2 Action

Since the purpose of DRL is to select the better vehicles for AFL according to the current state, we define the system action at time slot $t$ as:

a(t) = \left( \lambda_1(t), \lambda_2(t), \ldots, \lambda_K(t) \right) (10)

where $\lambda_i(t), i \in [1, K]$, is the probability of selecting vehicle $i$, and we define $\lambda_1(0) = \lambda_2(0) = \ldots = \lambda_K(0) = 1$.

We denote a new set $a_d(t) = (a_{d1}(t), a_{d2}(t), \ldots, a_{dK}(t))$ in order to select specific vehicles. After we normalize the action, if the value of $\lambda_i(t)$ is greater than or equal to 0.5, $a_{di}(t)$ is recorded as 1, otherwise 0. We then obtain a set composed of 0s and 1s, where the binary value indicates whether a vehicle is selected or not.
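A possible implementation of this thresholding step is sketched below; clipping the raw action into $[0, 1]$ is one way to realize the normalization, which the paper does not spell out.

```python
import numpy as np

def select_vehicles(action):
    """Map the continuous action of Eq. (10) to the binary set a_d(t):
    probabilities >= 0.5 mark a vehicle as selected."""
    lam = np.clip(np.asarray(action, dtype=float), 0.0, 1.0)   # normalization (assumption)
    a_d = (lam >= 0.5).astype(int)
    return a_d, np.flatnonzero(a_d)

a_d, selected = select_vehicles([0.83, 0.12, 0.57, 0.49, 0.91])
# a_d = [1, 0, 1, 0, 1]; vehicles 0, 2 and 4 participate in AFL at this time slot
```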

5.3 Reward

We aim to select vehicles with better performance for AFL to obtain a more accurate global model at the RSU, where the local training delay, the transmission delay and the accuracy of the global model are all critical metrics. Thus, we define the system reward at time slot $t$ as:

r(t) = -\frac{K}{\sum_{i=1}^{K} \lambda_i(t)} \left[ \omega_1 Loss(t) + \omega_2 \frac{\sum_{i=1}^{K} \left( T_l^i + T_u^i(t) \right) a_{di}(t)}{\sum_{i=1}^{K} a_{di}(t)} \right] (11)

where $\omega_1$ and $\omega_2$ are non-negative weighting factors, and $Loss(t)$ is the loss computed by AFL, which will be discussed later.
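The reward of Eq. (11) can be computed directly from the per-vehicle delays and the AFL loss, as in the sketch below (it assumes at least one vehicle is selected so that the denominator is non-zero).

```python
import numpy as np

def compute_reward(lam, a_d, loss, T_l, T_u, w1, w2):
    """Eq. (11): negative weighted sum of the AFL loss and the mean
    (training + upload) delay of the selected vehicles."""
    lam, a_d = np.asarray(lam, float), np.asarray(a_d, float)
    total_delay = np.asarray(T_l, float) + np.asarray(T_u, float)
    mean_delay = np.sum(total_delay * a_d) / np.sum(a_d)   # assumes >= 1 selected vehicle
    K = len(lam)
    return -(K / np.sum(lam)) * (w1 * loss + w2 * mean_delay)
```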

The expected long-term discount reward of the system can be expressed as:

J(\mu) = E\left[ \sum_{t=1}^{N} \gamma^{t-1} r(t) \right] (12)

where $\gamma \in (0, 1)$ is the discount factor, $N$ is the total number of time slots, and $\mu$ is the policy of the system. In this paper we aim to find an optimal policy to maximize the expected long-term discounted reward of the system.

6 DRL-Based AFL Weight Optimization: DAFL

In this section, we introduce the overall system framework, and the training stage to obtain the optimal strategy, then present the testing stage for the performance evaluation of our model.

6.1 Training Stage

Considering that the state and action spaces are continuous, and that DDPG is suitable for solving DRL problems with continuous state and action spaces, we employ DDPG to solve our problem.

The DDPG algorithm is based on an actor-critic architecture. The actor network is used for policy improvement, and the critic network is used for policy evaluation. Here, both the actor and critic networks are constructed as deep neural networks (DNNs). Specifically, the actor network is used to approximate the policy $\mu$, and the approximated policy is expressed as $\mu_\delta$. The actor network observes the state and outputs the action based on the policy $\mu_\delta$.
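A minimal sketch of such actor and critic networks is given below, assuming PyTorch (the paper does not name a framework); the two hidden layers with 400 and 300 neurons follow the simulation setup in Section 7, while the sigmoid output that maps to selection probabilities is an assumption.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network mu_delta: maps the state to K selection probabilities."""
    def __init__(self, state_dim, n_vehicles):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, n_vehicles), nn.Sigmoid())   # outputs in (0, 1)

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value network Q_xi: maps (state, action) to a scalar value."""
    def __init__(self, state_dim, n_vehicles):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_vehicles, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```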

We improve and evaluate the policy iteratively in order to obtain the optimal policy. To ensure the stability of the algorithm, target networks composed of a target actor network and a target critic network are also employed in DDPG, whose architectures are the same as those of the original actor and critic networks, respectively. The proposed algorithm is shown in Algorithm 1.

Input: $\gamma$, $\tau$, $\delta$, $\xi$, $a(0) = (1, 1, \ldots, 1)$
Output: optimized $\delta^*$, $\xi^*$
1 Randomly initialize $\delta$ and $\xi$;
2 Initialize the target networks by $\delta_1 \leftarrow \delta$, $\xi_1 \leftarrow \xi$;
3 Initialize the replay buffer $R_b$;
4 for episode from 1 to $E_{\max}$ do
5       Reset the simulation parameters of the system model and initialize the global model at the RSU;
6       Receive the initial observation state $s(1)$;
7       for time slot $t$ from 1 to $N$ do
8             Generate the action according to the current policy and exploration noise: $a = \mu_\delta(s|\delta) + \Delta_t$;
9             Compute $a_d$ and get the selected vehicles;
10            The selected vehicles conduct weight-optimized AFL to train the global model at the RSU;
11            Get the reward $r$ and the next state $s'$;
12            Store the transition $(s, a, r, s')$ in $R_b$;
13            if the number of tuples in $R_b$ is larger than $I$ then
14                  Randomly sample a mini-batch of $I$ transition tuples from $R_b$;
15                  Update the critic network by minimizing the loss function according to Eq. (16);
16                  Update the actor network according to Eq. (17);
17                  Update the target networks according to Eqs. (18) and (19).
Algorithm 1 Training Stage for the DAFL-based Framework

Let $\delta$ be the actor network parameter, $\xi$ the critic network parameter, $\delta^*$ the optimized actor network parameter, $\xi^*$ the optimized critic network parameter, $\delta_1$ the target actor network parameter, and $\xi_1$ the target critic network parameter. $\tau$ is the update parameter of the target networks and $\Delta_t$ is the exploration noise at time slot $t$. $I$ is the size of the mini-batch. Now, we describe our algorithm in detail.

First, we initialize $\delta$ and $\xi$ randomly, and initialize $\delta_1$ and $\xi_1$ in the target networks as $\delta$ and $\xi$, respectively. At the same time, we initialize the replay buffer $R_b$.

Then our algorithm is executed for $E_{\max}$ episodes. In the first episode, we first initialize the positions of all vehicles, the channel states and the computing resources of the vehicles, and set $\lambda_1(0) = \lambda_2(0) = \ldots = \lambda_K(0) = 1$. Then at time slot 1, the system obtains the state, i.e., $s(1) = (Tr(1), \mu(1), d_x(1), a(0))$. Meanwhile, a convolutional neural network (CNN) is employed as the global model $w_0$ at the RSU.

Our algorithm executes from time slot 1 to time slot $N$. At time slot 1, the actor network obtains the output $\mu_\delta(s|\delta)$ according to the state. Note that we add a random noise $\Delta_t$ to the action, so the system obtains the action $a(1) = \mu_\delta(s(1)|\delta) + \Delta_t$. We then obtain $a_d(1)$ based on the action and determine the selected vehicles at this time slot. The selected vehicles conduct AFL; that is, all the selected vehicles train their local models according to their local data and then upload them asynchronously to the RSU for global model updating. Given the action, we can get the reward at time slot 1. After that, we update the positions of the vehicles according to Eq. (2), recalculate the channel states and the available computing resources of the vehicles, and update the transmission rates of the vehicles according to Eq. (4). The system then obtains the next state $s(2)$. The related sample $(s(1), a(1), r(1), s(2))$ is stored in $R_b$. Note that the system iteratively calculates and stores samples into $R_b$ until the capacity of $R_b$ is reached.
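The replay buffer $R_b$ can be realized as a simple fixed-capacity queue of transition tuples, as sketched below; the default capacity is a placeholder, since the paper does not report the buffer size.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer R_b storing (s, a, r, s') transitions."""
    def __init__(self, capacity=100_000):        # capacity is a placeholder
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```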

When the number of tuples in $R_b$ is larger than $I$, the parameters $\delta$, $\xi$, $\delta_1$ and $\xi_1$ of the actor network, critic network and target networks, respectively, are trained to maximize $J(\mu_\delta)$. Here, $\delta$ is updated along the gradient direction of $J(\mu_\delta)$, i.e., $\nabla_\delta J(\mu_\delta)$. We let $Q_{\mu_\delta}(s(t), a(t))$ be the action-value function under policy $\mu_\delta$ at state $s(t)$ and action $a(t)$; it can be expressed as:

Q_{\mu_\delta}(s(t), a(t)) = E_{\mu_\delta}\left[ \sum_{k_1=t}^{N} \gamma^{k_1 - t} r(k_1) \right] (13)

It represents the expected long-term discounted reward starting from time slot $t$.

It has been proved in [46] that solving $\nabla_\delta J(\mu_\delta)$ can be replaced by solving the gradient of $Q_{\mu_\delta}(s(t), a(t))$, i.e., $\nabla_\delta Q_{\mu_\delta}(s(t), a(t))$. Due to the continuous action space, $Q_{\mu_\delta}(s(t), a(t))$ cannot be solved by the Bellman equation [47]. To solve this problem, the critic network uses $\xi$ to approximate $Q_{\mu_\delta}(s(t), a(t))$ by $Q_\xi(s(t), a(t))$.

When the number of tuples in $R_b$ is larger than $I$, the system samples $I$ tuples randomly from $R_b$ to form a mini-batch. Let $(s_x, a_x, r_x, s'_x), x \in [1, 2, \ldots, I]$, be the $x$-th tuple in the mini-batch. The system inputs $s'_x$ to the target actor network and obtains the output action $a'_x = \mu_{\delta_1}(s'_x | \delta_1)$. Then $s'_x$ and $a'_x$ are input to the target critic network, which outputs the action-value function $Q_{\xi_1}(s'_x, a'_x)$. The target value can be calculated as:

y_x = r_x + \gamma Q_{\xi_1}(s'_x, a'_x) \big|_{a'_x = \mu_{\delta_1}(s'_x | \delta_1)} (14)

According to $s_x$ and $a_x$, the critic network outputs $Q_\xi(s_x, a_x)$; the loss of tuple $x$ is then given by:

L_x = \left[ y_x - Q_\xi(s_x, a_x) \right]^2 (15)

When all the tuples have been fed into the critic network and the target networks, we can get the loss function:

L(\xi) = \frac{1}{I} \sum_{x=1}^{I} L_x (16)

In this case, the critic network updates $\xi$ by applying gradient descent with $\nabla_\xi L(\xi)$ to the loss function $L(\xi)$.
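Under the same PyTorch assumption as before, one mini-batch critic update following Eqs. (14)-(16) could look like the sketch below, where the batch tensors are assumed to have a leading dimension of size $I$ and the reward has shape $[I, 1]$.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_opt, target_actor, target_critic, batch, gamma):
    """Eqs. (14)-(16): build the targets with the target networks and
    minimize the mean squared error over the mini-batch."""
    s, a, r, s_next = batch                                  # r assumed of shape [I, 1]
    with torch.no_grad():
        a_next = target_actor(s_next)                        # a'_x = mu_{delta_1}(s'_x | delta_1)
        y = r + gamma * target_critic(s_next, a_next)        # Eq. (14)
    q = critic(s, a)                                         # Q_xi(s_x, a_x)
    loss = F.mse_loss(q, y)                                  # Eqs. (15)-(16)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```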

Similarly, the actor network updates $\delta$ by gradient ascent, i.e., along $\nabla_\delta J(\mu_\delta)$, to maximize $J(\mu_\delta)$ [48], where $\nabla_\delta J(\mu_\delta)$ is calculated from the action-value function approximated by the critic network as follows:

\nabla_\delta J(\mu_\delta) \approx \frac{1}{I} \sum_{x=1}^{I} \nabla_\delta Q_\xi(s_x, a_x^\mu) \big|_{a_x^\mu = \mu_\delta(s_x|\delta)} = \frac{1}{I} \sum_{x=1}^{I} \nabla_{a_x^\mu} Q_\xi(s_x, a_x^\mu) \big|_{a_x^\mu = \mu_\delta(s_x|\delta)} \cdot \nabla_\delta \mu_\delta(s_x|\delta) (17)

Here the input of $Q_\xi$ is $a_x^\mu = \mu_\delta(s_x|\delta)$.

At the end of time slot $t$, we update the parameters of the target networks as follows:

\xi_1 \leftarrow \tau\xi + (1-\tau)\xi_1 (18)
\delta_1 \leftarrow \tau\delta + (1-\tau)\delta_1 (19)

where $\tau$ is a constant and $\tau \ll 1$.
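The corresponding actor update of Eq. (17) and the soft target updates of Eqs. (18)-(19) can be sketched as follows, again assuming PyTorch; maximizing $J(\mu_\delta)$ is implemented by descending the negated critic estimate.

```python
def update_actor_and_targets(actor, actor_opt, critic,
                             target_actor, target_critic, s, tau):
    """Eq. (17): ascend the critic's estimate of J(mu_delta);
    Eqs. (18)-(19): soft-update the target network parameters."""
    actor_loss = -critic(s, actor(s)).mean()   # minus sign turns ascent into descent
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # xi_1 <- tau*xi + (1-tau)*xi_1 and delta_1 <- tau*delta + (1-tau)*delta_1
    for target, source in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```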

Then $s'$ is input to the actor network, and the same procedure starts for the next time slot. When time slot $t$ reaches $N$, this episode is completed. In this case, the system initializes the state $s(1) = (Tr(1), \mu(1), d_x(1), a(0))$ and executes the next episode. When the number of episodes reaches $E_{\max}$, the training is finished and we obtain the optimized $\delta^*$, $\xi^*$, $\delta_1^*$ and $\xi_1^*$. The overall DDPG flow diagram is shown in Fig. 2.

Figure 2: DDPG flow diagram

6.2 Process of AFL

In this section, we introduce the process of AFL in detail, which is used in step 10 of Algorithm 1.

Let $V_k, k \in [1, K_l]$, be the selected vehicles, where $K_l$ is the total number of selected vehicles. In AFL, each vehicle goes through three stages: global model downloading, local training, and uploading and updating. Specifically, vehicle $V_k$ first downloads the global model from the RSU, then trains a local model using its local data for several iterations, and then uploads the local model to the RSU. Once the RSU receives a local model, it updates the global model immediately. For clarity, we use the AFL training of vehicle $V_k$ at time slot $t$ as an example.

6.2.1 Downloading the Global Model

At time slot $t$, vehicle $V_k$ downloads the global model $w_{t-1}$ from the RSU. Note that the global model at the RSU is initialized as $w_0$, a CNN, at the beginning of the whole training process.

6.2.2 Local Training

Vehicle $V_k$ trains its local model (a CNN) based on its local data. The local training includes $l$ iterations. In iteration $m\ (m \in [1, l])$, vehicle $V_k$ first inputs data sample $a$ into the CNN of local model $w_{k,m}$, which outputs the prediction probability $\hat{y}_a$ of each label $y_a$ of sample $a$. The cross-entropy loss function is used to compute the loss of $w_{k,m}$:

f_k(w_{k,m}) = -\sum_{a=1}^{D_i} y_a \log \hat{y}_a (20)

Then stochastic gradient descent (SGD) algorithm is used to update our model as follows:

w_{k,m+1} = w_{k,m} - \eta \nabla f_k(w_{k,m}) (21)

where $\nabla f_k(w_{k,m})$ is the gradient of $f_k(w_{k,m})$ and $\eta$ is the learning rate. Vehicle $V_k$ uses the updated local model in the succeeding iteration $m+1$. The local training stops when the number of iterations reaches $l$. At this point, the vehicle obtains the updated local model $w_k$.
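A sketch of this local training routine (the local training of Eqs. (20)-(21)) is given below, assuming PyTorch and a standard DataLoader over the vehicle's local MNIST shard; restarting the data iterator when it is exhausted is an implementation choice, not something the paper prescribes.

```python
import copy
import torch
import torch.nn as nn

def vehicle_update(global_model, local_loader, l_iters, lr):
    """Local training of Eqs. (20)-(21): start from the downloaded global
    model, run l SGD iterations with the cross-entropy loss, return w_k."""
    local_model = copy.deepcopy(global_model)              # w_{k,1} <- downloaded global model
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    data_iter = iter(local_loader)
    loss = None
    for _ in range(l_iters):
        try:
            x, y = next(data_iter)
        except StopIteration:                              # restart over the local data
            data_iter = iter(local_loader)
            x, y = next(data_iter)
        loss = criterion(local_model(x), y)                # Eq. (20)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                   # Eq. (21)
    return local_model, float(loss)                        # w_k and its final loss f_k(w_k)
```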

For the local model $w_k$, the loss is:

f_k(w_k) = -\sum_{a=1}^{D_i} y_a \log \hat{y}_a (22)

In our proposed scheme, the impact of delay on the model has also been investigated. Specifically, the local training and local model uploading incur some delay, during which other vehicles may upload their local models to the RSU. In this situation, the local model of this vehicle becomes stale. Considering this issue, we introduce the training weight and the transmission weight.

The training weight is related to the local training delay, and it can be expressed as:

\beta_{1,k} = m_1^{T_l^{V_k} - 0.5} (23)

where $T_l^{V_k}$ is the local training delay of vehicle $V_k$, which can be calculated by Eq. (1), and $m_1 \in (0, 1)$ is a parameter that makes $\beta_{1,k}$ decrease as the local training delay increases.

The transmission weight is related to the transmission delay of vehicles for uploading local models to the RSU. Since the downloading delay, i.e., the duration of a vehicle downloading the global model from the RSU, is very small compared with the transmission delay, it can be ignored. Thus, the transmission weight can be denoted as:

\beta_{2,k}(t) = m_2^{T_u^{V_k}(t) - 0.5} (24)

where $T_u^{V_k}(t)$ is the transmission delay of $V_k$, which can be calculated by Eq. (8), and $m_2 \in (0, 1)$ is a parameter that makes $\beta_{2,k}$ decrease as the transmission delay increases.

Then we can obtain the weight-optimized local model, i.e.,

w_{kw} = w_k \cdot \beta_{1,k} \cdot \beta_{2,k} (25)
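The two staleness weights and the resulting weight-optimized local model of Eqs. (23)-(25) can be computed as sketched below; representing the local model as a dictionary of named parameters is an assumption about the implementation.

```python
def staleness_weights(T_l_k, T_u_k, m1=0.9, m2=0.9):
    """Eqs. (23)-(24): weights that decay as the local training
    and transmission delays grow (m1, m2 in (0, 1))."""
    beta1 = m1 ** (T_l_k - 0.5)
    beta2 = m2 ** (T_u_k - 0.5)
    return beta1, beta2

def weight_local_model(w_k, beta1, beta2):
    """Eq. (25): scale every parameter of the local model w_k by beta1 * beta2."""
    scale = beta1 * beta2
    return {name: scale * param for name, param in w_k.items()}
```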

6.2.3 Uploading and Updating

When vehicle $V_k$ uploads the weight-optimized local model, the RSU updates the global model as follows:

w_{new} = \beta w_{old} + (1-\beta) w_{kw} (26)

where $w_{old}$ is the current global model at the RSU, $w_{new}$ is the updated global model, and $\beta \in (0, 1)$ is the aggregation proportion.

When the RSU receives the first uploaded weight-optimized local model, $w_{old} = w_{t-1}$. When the RSU has received all the weight-optimized local models of the selected vehicles and has updated the global model for $K_l$ rounds to obtain $w_t$, the global model update at this time slot is finished.
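Finally, the asynchronous global update of Eq. (26) touches only the single arriving local model, which is what makes the aggregation immediate; a sketch under the same parameter-dictionary assumption is given below, with the value of $\beta$ used purely as a placeholder.

```python
def asynchronous_update(w_old, w_kw, beta):
    """Eq. (26): RSU update triggered by one arriving weighted local model."""
    return {name: beta * w_old[name] + (1.0 - beta) * w_kw[name] for name in w_old}

# usage sketch: the RSU updates immediately whenever a local model arrives
# w_global = asynchronous_update(w_global, w_kw, beta=0.3)   # beta value is a placeholder
```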

At the same time, we get the average loss of the selected vehicles:

Loss(t) = \frac{1}{K_l} \sum_{k=1}^{K_l} f_k(w_k) (27)

This completes the explanation of step 10 in Algorithm 1. The procedure of AFL training is shown in Algorithm 2.

1 Initialize the global model $w_0$;
2 for each round $x$ from 1 to $K_l$ do
3       $w_k \leftarrow$ Vehicle Update($w_0$);
4       Vehicle $V_k$ calculates the weight-optimized local model $w_{kw}$ based on Eq. (25);
5       Vehicle $V_k$ uploads the weight-optimized local model $w_{kw}$ to the RSU;
6       The RSU receives the weight-optimized local model $w_{kw}$;
7       The RSU updates the global model based on Eq. (26);
8       return $w_{new}$
9 Get the updated global model $w_t$ after $K_l$ rounds.
Vehicle Update($w$):
Input: $w_0$
10 for each local iteration $m$ from 1 to $l$ do
11      Vehicle $V_k$ calculates the cross-entropy loss based on Eq. (20);
12      Vehicle $V_k$ updates the local model based on Eq. (21);
13 Set $w_k = w_{k,l}$;
14 return $w_k$
Algorithm 2 Weight-optimized AFL scheme

6.3 Testing Stage

The testing stage employs the networks obtained in the training stage. In the testing stage, the system selects actions according to the policy with the optimized parameter $\delta^*$. The process of the testing stage is shown in Algorithm 3.

1 for episode from 1 to $E'_{\max}$ do
2       Reset the simulation parameters of the system model and initialize the global model at the RSU;
3       Receive the initial observation state $s(1)$;
4       for time slot $t$ from 1 to $N$ do
5             Generate the action according to the current policy: $a = \mu_\delta(s|\delta)$;
6             Compute $a_d$ and get the selected vehicles;
7             The selected vehicles conduct weight-optimized AFL to train the global model at the RSU;
8             Get the reward $r$ and the next state $s'$;
Algorithm 3 Testing Stage for the DAFL-based Framework

7 Simulation and Results

7.1 Simulation Setup

In this section, the simulation tool is Python 3.9. Our actor network and critic network are both DNNs with two hidden layers, which have 400 and 300 neurons, respectively. The exploration noise follows an Ornstein-Uhlenbeck (OU) process with variance 0.02 and decay rate 0.15. We use the MNIST dataset to allocate data to vehicles, and the computing resources of the vehicles follow a truncated Gaussian distribution, where the unit of the computing resources is CPU cycles/s. We configure one vehicle as a bad node, i.e., it has a small amount of data and few computing resources, and its local model is disturbed by random noise. The remaining simulation parameters are shown in Table 2.

Table 2: Parameters of simulation
Parameter Value Parameter Value
$\gamma$: 0.99; $H_r$: 10 m
$\tau$: 0.001; $B$: 1000 Hz
$I$: 64; $p_0$: 0.25 W
$E_{\max}$: 1000; $\sigma^2$: $10^{-9}$ mW
$E'_{\max}$: 3; $\Lambda$: 7 m
$K$: 5; $|w|$: 5000 bits
$v$: 20 m/s; $\alpha$: 2
$C_0$: $10^6$ CPU cycles; $m_1$: 0.9
$t$: 0.5; $m_2$: 0.9
$d_y$: 5 m

7.2 Experiment Results

Figure 3: Reward for different epochs

Fig. 3 shows the system reward with respect to different epochs in the training stage. One can see that when the number of epochs is small, the reward has a large variation. This is because the system is learning and optimizing the network in the initial phase, so some explorations (i.e., actions) incur poor performance. As the number of epochs increases, the reward gradually becomes stable and smoother. This means that the system has gradually learned the optimal policy and the training of the neural network is close to completion.

Figure 4: Relation between delay and loss

Fig. 4 depicts the two components of the reward in the testing stage: the loss calculated in the AFL and the sum of the local training delay and the transmission delay. It can be seen that the loss decreases as the number of steps increases. This is attributed to the fact that vehicles constantly upload local models to update the global model at the RSU, so the global model becomes more accurate. The sum of the delays shows a certain fluctuation because of the dynamic available computing resources of the vehicles and their time-varying locations.

Figure 5: Accuracy v.s. number of steps in testing stage

Fig. 5 shows the accuracy of our scheme, traditional AFL and traditional FL in the presence of a bad node. From the figure one can see that the accuracy of our scheme remains at a good level, gradually increases and finally reaches stability. This indicates that our scheme can effectively remove the bad node during model training. Since traditional AFL and FL cannot select vehicles, their accuracy is seriously affected by the bad node, resulting in large fluctuations.

Figure 6: Accuracy with optimized model weights in testing stage
Figure 7: Training delay v.s. number of training rounds

Fig. 6 shows the accuracy of our scheme and of traditional AFL and FL after the vehicle selection. From the figure one can see that the accuracy of all schemes increases as the number of steps increases and finally becomes stable. However, our scheme has the highest accuracy among them. This is because our scheme considers the impact of the local training delay and transmission delay of the vehicles in the global model updating.

Fig. 7 depicts the training delay of our scheme compared to FL as the global round (i.e., step) increases. It shows that the delay of FL remains high while our scheme keeps a small delay. This is because FL only starts updating the global model when all the local models of the selected vehicles are received, whereas in our scheme the RSU updates the global model every time it receives a local model uploaded by a vehicle. At the same time, we can observe that the delay of our scheme rises at first, then falls and rises again. This is because the proposed scheme selects four vehicles to update the global model one by one. In addition, owing to the large local computing delay of the vehicles compared to the transmission delay, the local computing delay dominates. In this case, since the vehicle that finishes its local training earliest updates the global model first, the training delay gradually increases. After all four vehicles have updated the global model, the vehicles repeat the above global model updating until the maximum number of steps is reached.

Figure 8: Accuracy v.s. different beta

Fig. 8 depicts the accuracy of our scheme and of traditional AFL with selected vehicles under different values of $\beta$. It shows that when $\beta$ is small, the accuracy of the model remains relatively high. In contrast, as $\beta$ gradually increases, the accuracy of the global model gradually decreases. This is because when $\beta$ is relatively large, the weight of the local model is much smaller, so the update of the global model mainly depends on the previous global model parameters, which decreases the influence and contribution of the new local models of all vehicles and thus significantly impacts the accuracy of the global model. At the same time, the accuracy of our scheme is better than that of AFL. This is because our scheme considers the influence of the local computing delay and transmission delay of the vehicles.

8 Conclusion

In this paper, we considered the vehicle mobility, time-varying channel states, time-varying computing resources of vehicles, different amounts of local data of vehicles and the presence of bad nodes, and proposed a DAFL-based framework. The conclusions are summarized as follows:

  • The accuracy of our scheme is better than that of traditional AFL and FL. This is because our scheme can effectively remove the bad node and thus prevent the global model update from being affected by it.

  • In the absence of bad nodes, the accuracy of our scheme is still higher than that of AFL and FL, because our scheme considers the mobility of the vehicles, the time-varying channel conditions, the available computing resources of the vehicles and the different amounts of local data to allocate different weights to the vehicles' local models in AFL.

  • The aggregation proportion $\beta$ affects the accuracy of the global model. Specifically, when $\beta$ is relatively small, the desirable accuracy can be obtained.

References

  • [1] X. Xu, H. Li, W. Xu, Z. Liu, L. Yao and F. Dai, “Artificial intelligence for edge service optimization in Internet of Vehicles: A survey,” Tsinghua Science and Technology, vol. 27, no. 2, pp. 270-287, Apr. 2022.
  • [2] Q. Wu, H. Liu, C. Zhang, Q. Fan, Z. Li and K. Wang, “Trajectory protection schemes based on a gravity mobility model in IoT,” Electronics, vol. 8, no. 2, pp. 148, Jan. 2019.
  • [3] J. Fan, S. Yin, Q. Wu and F. Gao, “Study on Refined Deployment of Wireless Mesh Sensor Network,” 2010 6th International Conference on Wireless Communications Networking and Mobile Computing (WiCOM), Chengdu, China, pp. 1-5, Sep. 2010.
  • [4] J. Fan, Q. Wu and J. Hao. “Optimal deployment of wireless mesh sensor networks based on Delaunay triangulations,” 2010 International Conference on Information, Networking and Automation (ICINA), Kunming, China, pp. 370-374, Nov. 2010.
  • [5] Q. Wu, H. Ge, P. Fan, J. Wang, Q. Fan and Z. Li, “Time-dependent Performance Analysis of the 802.11p-based Platooning Communications Under Disturbance,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 15760-15773, Nov. 2020.
  • [6] J. Liu, M. Ahmed, M. Mirza, W. Khan, D. Xu, J. Li, A. Aziz and Z. Han, “RL/DRL Meets Vehicular Task Offloading Using Edge and Vehicular Cloudlet: A Survey,” IEEE Internet of Things Journal, vol. 9, no. 11, pp. 8315-8338, Jun. 2022.
  • [7] Q. Wu, S. Shi, Z. Wan, Q. Fan, P. Fan and C. Zhang, “Towards V2I Age-aware Fairness Access: A DQN Based Intelligent Vehicular Node Training and Test Method,” Chinese Journal of Electronics, published online, 2022, doi: 10.23919/cje.2022.00.093.
  • [8] X. Xu, H. Li, W. Xu, Z. Liu, L. Yao and F. Dai, “Artificial intelligence for edge service optimization in Internet of Vehicles: A survey,” Tsinghua Science and Technology, vol. 27, no. 2, pp. 270-287, Apr. 2022.
  • [9] W. Cheng, E. Luo, Y. Tang, L. Wan and M. Wei, “A Survey on Privacy-security in Internet of Vehicles,” 2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), AB, Canada, pp. 644-650, Oct. 2021.
  • [10] S. Wan, J. Lu, P. Fan and K. Letaief, “To smart city: Public safety network design for emergency,” IEEE access, vol. 6, pp. 1451-1460, Dec. 2017.
  • [11] D. C. Nguyen, M. Ding, P. N. Pathirana, A. Seneviratne, J. Li and H. Vincent Poor, “Federated Learning for Internet of Things: A Comprehensive Survey,” IEEE Communications Surveys & Tutorials, vol. 23, no. 3, pp. 1622-1658, Apr. 2021.
  • [12] L. Xing, P. Zhao, J. Gao, H. Wu and H. Ma, “A Survey of the Social Internet of Vehicles: Secure Data Issues, Solutions, and Federated Learning,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 2, pp. 70-84, Mar.-Apr. 2023.
  • [13] Z. Zhu, S. Wan, P. Fan and K. Letaief, “Federated multiagent actor–critic learning for age sensitive mobile-edge computing,” vol. 9, no. 2, pp. 1053-1067, May. 2021.
  • [14] Q. Wu, X. Wang, Q. Fan, P. Fan, C. Zhang and Z. Li, “High Stable and Accurate Vehicle Selection Scheme based on Federated Edge Learning in Vehicular Networks,” China Communications, vol. 20, no. 3, pp. 1-17, 2023, doi: 10.23919/JCC.2023.03.001.
  • [15] Z. Wang, G. Xie, J. Chen and C. Yu, “Adaptive Asynchronous Federated Learning for Edge Intelligence,” 2021 International Conference on Machine Learning and Intelligent Systems Engineering (MLISE), pp. 285-289, Jul. 2021.
  • [16] Z. Wang, Z. Zhang, Y. Tian, Q. Yang, H. Shan, W. Wang and T. Quek, “Asynchronous Federated Learning Over Wireless Communication Networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 6961-6978, Sep. 2022.
  • [17] Q. Wu, Y. Zhao, Q. Fan, P. Fan, J. Wang and C. Zhang, “Mobility-Aware Cooperative Caching in Vehicular Edge Computing Based on Asynchronous Federated and Deep Reinforcement Learning,” IEEE Journal of Selected Topics in Signal Processing, vol. 17, no. 1, pp. 66-81. Aug. 2022.
  • [18] Q. Wu, J. Zheng, “Performance modeling and analysis of the ADHOC MAC protocol for VANETs,” 2015 IEEE International Conference on Communications, London, United Kingdom, pp. 3646-3652, Jun. 2015.
  • [19] Q. Wu, J. Zheng, “Performance modeling and analysis of the ADHOC MAC protocol for vehicular networks,” Wireless Networks, vol. 22, no. 3, pp. 799-812, Apr. 2016.
  • [20] X. Chen, W. Wei, Q. Yan, N. Yang and J. Huang, “Time-delay deep Q-network based retarder torque tracking control framework for heavy-duty vehicles,” IEEE Transactions on Vehicular Technology, vol. 72, no. 1, pp. 149-161, Jan. 2023.
  • [21] Q. Wu, S. Xia, P. Fan, Q. Fan and Z. Li, “Velocity-adaptive V2I fair-access scheme based on IEEE 802.11 DCF for platooning vehicles,” Sensors, vol. 18, no. 12, pp. 4198, Nov. 2018.
  • [22] Q. Wu, Y. Zhao and Q. Fan, “Time-Dependent Performance Modeling for Platooning Communications at Intersection,” IEEE Internet of Things Journal, vol. 9, no. 19, pp. 18500-18513, Aug. 2022.
  • [23] Q. Wang, D. Wu and P. Fan, “Delay-constrained optimal link scheduling in wireless sensor networks,” IEEE Transactions on Vehicular Technology, vol. 59, no. 9, pp. 4564-4577, Sep. 2010.
  • [24] Y. M. Saputra, D. N. Nguyen, D. T. Hoang and E. Dutkiewicz, “Selective Federated Learning for On-Road Services in Internet-of-Vehicles,” 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, pp. 1-6, Dec. 2021.
  • [25] D. Long, Q. Wu, Q. Fan, P. Fan, Z. Li and J. Fan,“A Power Allocation Scheme for MIMO-NOMA and D2D Vehicular Edge Computing Based on Decentralized DRL,”Sensors, vol. 23, no. 7, pp. 3449, Mar. 2023.
  • [26] X. Zhou, W. Liang, J. She, Z. Yan and K. I. -K. Wang, “Two-Layer Federated Learning With Heterogeneous Model Aggregation for 6G Supported Internet of Vehicles,” IEEE Transactions on Vehicular Technology, vol. 70, no. 6, pp. 5308-5317, Jun. 2021.
  • [27] L. Zhang, H. Saito, L. Yang and J. Wu, “Privacy-Preserving Federated Transfer Learning for Driver Drowsiness Detection,” IEEE Access, vol. 10, pp. 80565-80574, Jul. 2022.
  • [28] H. Xiao, J. Zhao, Q. Pei, J. Feng, L. Liu and W. Shi, “Vehicle Selection and Resource Optimization for Federated Learning in Vehicular Edge Computing,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 11073-11087, Aug. 2022.
  • [29] Y. M. Saputra, H. T. Dinh, D. Nguyen, L. -N. Tran, S. Gong and E. Dutkiewicz, “Dynamic Federated Learning-Based Economic Framework for Internet-of-Vehicles,” IEEE Transactions on Mobile Computing, vol. 22, no. 4, pp. 2100-2115, Jan. 2021.
  • [30] M. Yan, B. Chen, G. Feng and S. Qin, “Federated Cooperation and Augmentation for Power Allocation in Decentralized Wireless Networks,” IEEE Access, vol. 8, pp. 48088-48100, Mar. 2020.
  • [31] D. Ye, X. Huang, Y. Wu and R. Yu, “Incentivizing Semisupervised Vehicular Federated Learning: A Multidimensional Contract Approach With Bounded Rationality,” IEEE Internet of Things Journal, vol. 9, no. 19, pp. 18573-18588, Oct. 2022.
  • [32] X. Kong, K. Wang, M. Hou, X. Hao, G. Shen, X. Chen and F. Xia, “A Federated Learning-Based License Plate Recognition Scheme for 5G-Enabled Internet of Vehicles,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8523-8530, Dec. 2021.
  • [33] Y. M. Saputra, D. N. Nguyen, D. T. Hoang, T. X. Vu, E. Dutkiewicz and S. Chatzinotas, “Federated Learning Meets Contract Theory: Economic-Efficiency Framework for Electric Vehicle Networks,” IEEE Transactions on Mobile Computing, vol. 21, no. 8, pp. 2803-2817, Aug. 2022.
  • [34] D. Ye, R. Yu, M. Pan and Z. Han, “Federated Learning in Vehicular Edge Computing: A Selective Model Aggregation Approach,” IEEE Access, vol. 8, pp. 23920-23935, Jan. 2020.
  • [35] Y. Zhao, J. Zhao, M. Yang, T. Wang, N. Wang, L. Lyu, D. Niyato and K. Lam, “Local Differential Privacy-Based Federated Learning for Internet of Things,” IEEE Internet of Things Journal, vol. 8, no. 11, pp. 8836-8853, Jun. 2021.
  • [36] Y. Li, X. Tao, X. Zhang, J. Liu and J. Xu, “Privacy-Preserved Federated Learning for Autonomous Driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8423-8434, Jul. 2022.
  • [37] A. Taïk, Z. Mlika and S. Cherkaoui, “Clustered Vehicular Federated Learning: Process and Optimization,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25371-25383, Jan. 2022.
  • [38] Y. Hui, G. Zhao, C. Li, N. Cheng, Z. Yin, T. Luan and X. Xiao, “Digital Twins Enabled On-demand Matching for Multi-task Federated Learning in HetVNets,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2352-2364. Sep. 2022.
  • [39] S. Liu, J. Yu, X. Deng and S. Wan, “FedCPF: An Efficient-Communication Federated Learning Approach for Vehicular Edge Computing in 6G Communication Networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 1616-1629, Feb. 2022.
  • [40] P. Lv, L. Xie, J. Xu, X. Wu and T. Li, “Misbehavior Detection in Vehicular Ad Hoc Networks Based on Privacy-Preserving Federated Learning and Blockchain,” IEEE Transactions on Network and Service Management, vol. 19, no. 4, pp. 3936-3948, Nov. 2022.
  • [41] L. U. Khan, Y. K. Tun, M. Alsenwi, M. Imran, Z. Han and C. S. Hong, “A Dispersed Federated Learning Framework for 6G-Enabled Autonomous Driving Cars,” IEEE Transactions on Network Science and Engineering, 2022, doi: 10.1109/TNSE.2022.3188571.
  • [42] S. Samarakoon, M. Bennis, W. Saad and M. Debbah, “Distributed Federated Learning for Ultra-Reliable Low-Latency Vehicular Communications,” IEEE Transactions on Communications, vol. 68, no. 2, pp. 1146-1159, Feb. 2020.
  • [43] A. Hammoud, H. Otrok, A. Mourad and Z. Dziong, “On Demand Fog Federations for Horizontal Federated Learning in IoV,” IEEE Transactions on Network and Service Management, vol. 19, no. 3, pp. 3062-3075, Sep. 2022.
  • [44] G. Tian, Y. Ren, C. Pan, Z. Zhou and X. Wang, “Asynchronous Federated Learning Empowered Computation Offloading in Collaborative Vehicular Networks,” 2022 IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, pp. 315-320, Apr. 2022.
  • [45] C. Pan, Z. Wang, H. Liao, Z. Zhou, X. Wang, M. Tariq and S. AlOtaibi, “Asynchronous Federated Deep Reinforcement Learning-Based URLLC-Aware Computation Offloading in Space-Assisted Vehicular Networks,” IEEE Transactions on Intelligent Transportation Systems, 2022, doi: 10.1109/TITS.2022.3150756.
  • [46] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” 2014 International Conference on Machine Learning (ICML), Beijing, China, pp. 387-395, Jun. 2014.
  • [47] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 51-52, Mar. 1998.
  • [48] T. P. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, “Continuous control with deep reinforcement learning,” Sep. 2015. arXiv:1509.02971.