
Technical Report for Trend Prediction Based Intelligent UAV Trajectory Planning for Large-scale Dynamic Scenarios

Jinjing Wang, Xindi Wang
Abstract

The unmanned aerial vehicle (UAV)-enabled communication technology is regarded as an efficient and effective solution for special application scenarios in which the existing terrestrial infrastructure is too overloaded to provide reliable services. To maximize the utility of the UAV-enabled system while meeting the QoS and energy constraints, the UAV needs to plan its trajectory with the dynamic characteristics of the scenario taken into account, which is formulated as a Markov Decision Process (MDP). To solve this problem, a deep reinforcement learning (DRL)-based scheme is proposed that predicts the trend of the dynamic scenario, providing a long-term view for UAV trajectory planning. Simulation results validate that the proposed scheme converges more quickly and achieves better performance in dynamic scenarios.

Index Terms:
unmanned aerial vehicle, dynamic scenario, trajectory planning, trend prediction, reinforcement learning.

I Introduction

Unmanned Aerial Vehicles (UAVs) have been used to provide emergency communication services in special scenarios that cannot be served satisfactorily by the existing terrestrial infrastructure, such as mass-gathering events (e.g., new year celebrations and large conferences). In these situations, the trajectory planning of the UAV can seriously affect system performance, e.g., data throughput [1], transmission delay [2], and user fairness [3].

For practical applications, the classical optimization-based UAV trajectory planning strategies [15] may no longer be feasible, as most of them operate in an iterative fashion with high computational complexity. In light of this, Reinforcement Learning (RL) [6] based strategies have been regarded as promising solutions for UAV-enabled systems owing to their strong self-learning capability, and they exhibit significant efficiency gains in static scenarios by optimizing the interaction process [4] and the experience selection [5]. Furthermore, faced with realistic application scenarios with highly dynamic characteristics, a growing body of research has studied how to extract scene features by adding deep neural networks, further enhancing the capability of RL and thereby constructing the Deep RL (DRL) framework. To achieve the above goals, [7, 10, 11, 9] construct Artificial Neural Networks (ANNs) to mine the correlations between pieces of scene information (e.g., location [7, 10], energy state [10] and throughput [11, 9]). To avoid the uncertainty caused by manual feature extraction in ANN-based schemes, [6, 12, 13] construct Convolutional Neural Networks (CNNs) and model the scene information as a tensor composed of multiple channels (e.g., communication range [6], location [12] and object tracking information [13]), which makes the scene features more hierarchical. However, considering the dynamic characteristics of real scenarios, two major research gaps remain. 1) The number of ground users (GUs) is tied to the structure of the DRL frameworks designed in existing studies [7, 10, 11, 9]: assuming that each GU has $n$ features, the input of an ANN-based model is a vector of dimension $nN\times 1$, where $N$ denotes the number of GUs in the scenario, so a change of $N$ changes the network structure; the variation in the number of GUs in dynamic scenarios therefore requires re-tuning and re-training of the proposed DRL models. 2) The scene information of the GUs is not fully utilized in existing studies [6, 12, 13], where the whole dynamic process is usually treated as multiple independent frames, lacking further exploration of their temporal associations.

Therefore, the critical issue is how to bridge the above gaps by designing a DRL framework that is flexible enough and can fully exploit the dynamic characteristics of the scene. This motivates our work, whose objective is to maximize the long-term performance of the UAV-enabled large-scale dynamic system by optimizing the UAV's movement actions, subject to communication QoS and energy constraints. To this end, we design a moving Trend Prediction (TP) based DRL framework that allows the UAV (serving as the agent) to perceive the state of the current environment and to predict the trend of future states. Through continuous interaction with the environment, the agent optimizes its actions according to the received feedback (known as the reward in DRL). Simulations are used to verify and validate the performance of the proposed scheme.

II System Model and Problem Formulation

II-A System Model

Consider an area of interest (AoI) where the explosive growth of access requirements has already far outstripped the capacity of the existing terrestrial base stations. One UAV with velocity $v_{uav}$ and fixed flight altitude $H$ is dispatched to provide extra communication capacity for the GUs inside the AoI (denoted as a GU set $\Omega_{all}^{t}$). In particular, the entire mission period of the UAV is discretized into multiple individual time slots, each with equal duration $\tau$. Similarly, referring to [9], the AoI is divided into $K\times K$ equal grids, whose centers are used as way-points of the UAV in each time slot. Thus, the locations of the UAV (projected on the ground) and of the $i$-th mobile GU in time slot $t$ are formulated as $\mathbf{L_{u}^{t}}=[x_{u}^{t},y_{u}^{t}]$ and $\mathbf{L_{i}^{t}}=[x_{i}^{t},y_{i}^{t}]$, respectively. Following the model proposed in [9], the velocity and moving direction of GU $i$ are updated as

v_{i}^{t}=k_{1}v_{i}^{t-1}+(1-k_{1})\overline{v}, \qquad (1)
\theta_{i}^{t}=\theta_{i}^{t-1}+\widetilde{\theta}k_{2}, \qquad (2)

where $\overline{v}$ and $\widetilde{\theta}$ denote the average velocity and the steering angle, respectively, and $k_{2}$ follows an $\epsilon$-greedy model: in each time slot, the GU keeps the same moving direction with probability $\epsilon$; otherwise, it chooses one of the remaining directions randomly.
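For concreteness, a minimal sketch of the mobility update in (1)-(2) is given below; the parameter values follow Section IV, while the set of turn multipliers used for the random direction change is an assumption (the paper only states that another direction is chosen).

```python
# Minimal sketch of the GU mobility update in (1)-(2). Parameter values follow
# Section IV; the turn multipliers {-1, 1, 2} for the random direction change
# are an assumption.
import numpy as np

rng = np.random.default_rng(0)
k1, eps = 0.9, 0.9                     # velocity smoothing factor, keep-direction probability
v_bar, theta_tilde = 1.0, np.pi / 2    # average speed (m/s) and steering angle (rad)

def step_gu(v_prev, theta_prev):
    """Update one GU's speed and heading for the next time slot."""
    v_next = k1 * v_prev + (1 - k1) * v_bar                    # Eq. (1)
    k2 = 0 if rng.random() < eps else rng.choice([-1, 1, 2])   # epsilon-greedy turn selection
    theta_next = theta_prev + theta_tilde * k2                 # Eq. (2)
    return v_next, theta_next
```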

Since the terrestrial infrastructures cannot provide communication services, the GUs' data is temporarily stored in their on-board buffers while waiting for the UAV to start the data upload process. Here, the GU-to-UAV channel is modeled as a Rician channel [1], which captures the shadowing and small-scale fading effects due to multi-path propagation; the channel coefficient from GU $i$ to the UAV in time slot $t$ can be expressed as

h_{i}^{t}=\sqrt{\beta_{i}^{t}}\,\hat{h}_{i}^{t}, \qquad (3)

where $\beta_{i}^{t}=\frac{\alpha}{(H^{2}+||\mathbf{L_{u}^{t}}-\mathbf{L_{i}^{t}}||^{2})^{k_{ps}/2}}$ and $\hat{h}_{i}^{t}=\sqrt{\frac{k_{s}}{k_{s}+1}}\tilde{h}_{i}^{t}+\sqrt{\frac{1}{k_{s}+1}}\tilde{\tilde{h}}_{i}^{t}$, with $|\tilde{h}_{i}^{t}|=1$ and $\tilde{\tilde{h}}_{i}^{t}\sim\mathcal{CN}(0,1)$, where $k_{ps}$ and $k_{s}$ denote the GU-to-UAV path loss exponent and the Rician factor, respectively, and $\alpha$ denotes the channel power gain at the reference distance $||\mathbf{L_{u}^{t}}-\mathbf{L_{i}^{t}}||=1$ m.
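A minimal sketch of how the channel coefficient in (3) can be generated is shown below; parameter values follow Section IV, and the random phase of the LoS component as well as the complex-Gaussian draw are assumptions.

```python
# Minimal sketch of the Rician GU-to-UAV channel in (3). Parameter values follow
# Section IV; the random LoS phase and the complex-Gaussian draw are assumptions.
import numpy as np

rng = np.random.default_rng(0)
alpha, H, k_ps, k_s = 1e-5, 40.0, 2.0, 1.0   # reference gain, altitude, path-loss exponent, Rician factor

def channel_coeff(L_u, L_i):
    """Return h_i^t for UAV ground projection L_u and GU location L_i (2-D coordinates)."""
    d2 = H**2 + np.sum((np.asarray(L_u, float) - np.asarray(L_i, float))**2)   # squared 3-D distance
    beta = alpha / d2**(k_ps / 2)                                              # large-scale power gain
    h_los = np.exp(1j * rng.uniform(0.0, 2 * np.pi))                           # LoS component, |h_los| = 1
    h_nlos = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)                   # scattered component ~ CN(0, 1)
    h_hat = np.sqrt(k_s / (k_s + 1)) * h_los + np.sqrt(1 / (k_s + 1)) * h_nlos
    return np.sqrt(beta) * h_hat
```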

By adopting OFDMA technology, the UAV pre-divides its communication resources into multiple equal and orthogonal resource blocks, of which the bandwidth allocated to each communicating GU is $W$. In this way, the UAV can support concurrent data transmissions with the surrounding GUs. Thus, the communication rate of each GU is given by $r_{i}^{t}=W\log_{2}(1+\frac{|h_{i}^{t}|^{2}p}{\sigma^{2}})$, where $p$, $W$, and $\sigma^{2}$ denote the transmission power, the bandwidth and the noise power, respectively. The data throughput uploaded by the $i$-th GU is then $r_{i}^{t}\tau_{c}$, in which $\tau_{c}$ denotes the duration of the hovering time used for communications within one time slot and is assumed to satisfy $\tau\gg\tau_{c}$. We denote the data queue length of GU $i$ in time slot $t$ as $B_{i}^{t}$, which depends on both the newly generated data $I_{i}^{t}$ and the data uploaded to the UAV $r_{i}^{t}\tau_{c}$, i.e., $B_{i}^{t}=B_{i}^{t-1}+I_{i}^{t}-r_{i}^{t}\tau_{c}$.
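The per-GU rate and buffer update can then be sketched as follows; it reuses channel_coeff() from the previous sketch, takes parameter values from Section IV (with the noise power taken as $\sigma^{2}$ for $\sigma=10^{-9}$), and clipping the queue at zero is an assumption.

```python
# Minimal sketch of the per-GU rate r_i^t and buffer update B_i^t. Bandwidth,
# transmit power, noise level and hover time follow Section IV; clipping the
# queue at zero is an assumption.
import numpy as np

W, p, tau_c = 2e6, 0.1, 0.1      # bandwidth (Hz), transmission power (W), hover time (s)
sigma2 = (1e-9)**2               # noise power sigma^2, with sigma = 1e-9 from Section IV

def rate(h):
    """Achievable rate r_i^t = W * log2(1 + |h|^2 * p / sigma^2) in bit/s."""
    return W * np.log2(1 + (abs(h)**2) * p / sigma2)

def buffer_update(B_prev, I_t, h):
    """Queue update B_i^t = B_i^{t-1} + I_i^t - r_i^t * tau_c."""
    return max(0.0, B_prev + I_t - rate(h) * tau_c)
```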

In addition, the energy consumption of the UAV in time slot $t$ is associated with the flying distance of the UAV in the past time slot [9], which is formulated as

E_{f}^{t}=p_{f}\frac{||\mathbf{L_{u}^{t}}-\mathbf{L_{u}^{t-1}}||}{v_{uav}}, \qquad (4)

where $p_{f}$ denotes the UAV flying power.

II-B Problem Formulation

Regarding the dynamic scenarios considered in this work, the locations, number, and data queues of the GUs change in real time, which greatly increases the feature dimension of the UAV-to-ground communication scenario and poses a severe challenge to building efficient and reliable data links between the GUs and the UAV. Therefore, our objective is to find a policy that helps the UAV make the optimal trajectory planning decisions, so that the utility of the UAV-enabled communication system over all time slots is maximized, subject to the coverage range, energy consumption, and communication QoS constraints. Thus, the UAV trajectory planning problem (the notations used in this paper are summarized in Table I) is formulated as

\textup{P0}:\quad \max\ \sum_{t\in\mathcal{T}}\mathcal{U}^{t}
\text{s.t.}\quad \text{C1}:\ h_{i}^{t}\geq\underline{h},\ \forall\, t,
\qquad\ \ \, \text{C2}:\ \sum_{t}E_{f}^{t}\leq\overline{E},

where the system utility function $\mathcal{U}^{t}$ depends on both the throughput and the fairness of the data upload process. Furthermore, C1 ensures that each communication link meets the required QoS, namely the channel coefficient of each communicating GU $i$ (see (3)) should not be smaller than the threshold $\underline{h}$. C2 limits the total energy consumed during the UAV flight process, $\sum_{t}E_{f}^{t}$ (see (4)), to within the portable energy $\overline{E}$, under the premise that the communication energy consumption is negligible.

Furthermore, the utility function is defined as $\mathcal{U}^{t}=f_{G}^{t}\sum_{i\in\Omega_{all}^{t}}r_{i}^{t}\tau_{c}$, in which $f_{G}^{t}$, the Jain's fairness index [3], is defined as

f_{G}^{t}=\frac{(\sum_{i\in\Omega_{all}^{t}}c_{i})^{2}}{|\Omega_{all}^{t}|\sum_{i\in\Omega_{all}^{t}}c_{i}^{2}}, \qquad (5)

in which $c_{i}$ denotes the number of UAV-enabled communication services in which GU $i$ has participated. As a widely used fairness metric, $f_{G}^{t}$ approaches $1$ when the total numbers of time slots in which the individual GUs are served are close to each other, so it can be regarded as a measure of the fairness of the UAV communication services in the scenario.
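A minimal sketch of Jain's index (5) and the resulting per-slot utility $\mathcal{U}^{t}$ is given below; the service counters $c_{i}$ and the per-GU rates are assumed to be tracked elsewhere, and returning $0$ before any GU has been served is an assumption to avoid division by zero.

```python
# Minimal sketch of Jain's fairness index (5) and the utility U^t. The service
# counters c_i and the per-GU rates r_i^t are assumed to be tracked elsewhere.
import numpy as np

def jain_fairness(c):
    """Jain's index of the service counts c_i; equals 1 when all counts are equal."""
    c = np.asarray(c, dtype=float)
    return float(c.sum()**2 / (len(c) * np.sum(c**2))) if np.any(c) else 0.0  # 0 before any service

def utility(c, rates, tau_c=0.1):
    """U^t = f_G^t * sum_i r_i^t * tau_c."""
    return jain_fairness(c) * float(np.sum(rates)) * tau_c
```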

II-C MDP Model

In general, conventional methods would narrow P0 down to an individual time slot, namely $\max\ \mathcal{U}^{t},\ \forall t\in\mathcal{T}$, and solve it by classical convex optimization methods or heuristic algorithms, which may only yield greedy-like performance due to the lack of a long-term target. To tackle this issue, we apply a Markov Decision Process (MDP), defined as a tuple ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, $\mathcal{P}$), to model the UAV trajectory planning problem in large-scale dynamic scenarios, whose elements are detailed as follows.

1) The state $\mathcal{S}$ contains the locations of the UAV and the GUs as well as their real-time status, which is formulated as

\mathcal{S}\triangleq\{s_{t}=\{\mathbf{L_{u}^{t}},\mathbf{L_{i}^{t}},B_{i}^{t},h_{i}^{t}\,|\,i\in\Omega_{all}^{t},\forall t\}\}. \qquad (6)

2) The action $\mathcal{A}$ contains the available actions of the UAV in each time slot, which is formulated as

\mathcal{A}\triangleq\{a_{t}=\{Up,\ Down,\ Left,\ Right,\ Right\ upper,\ Right\ lower,\ Left\ upper,\ Left\ lower\,|\,\forall t\}\}. \qquad (7)

3) The reward: the discounted accumulated reward $\mathcal{V}_{\pi}(s_{t})$ from state $s_{t}$ to the end of the task under policy $\pi$ is formulated as

\mathcal{V}_{\pi}(s_{t})=\mathbb{E}_{\pi}\Big[\sum_{j=0}^{\mathcal{J}}\gamma^{j}r(s_{t+j},a_{t+j})\Big], \qquad (8)

where $r(s_{t},a_{t})$ denotes the immediate reward obtained by executing action $a_{t}$ in state $s_{t}$, and $\mathcal{J}$ denotes the end of the task. In particular, the action $a_{t}$ is selected following the policy $\pi$, i.e., $a_{t}=\pi(s_{t})$. According to problem P0, $r(s_{t},a_{t})$ is defined as

r(s_{t},a_{t})=\begin{cases}\mathcal{U}^{t}, & \text{if C1 and C2 are satisfied,}\\ 0, & \text{otherwise.}\end{cases}

4) The transition probability $\mathcal{P}\triangleq\{p(s_{t+1}|s_{t})\}$ represents the probability that the UAV reaches the next state $s_{t+1}$ from the state $s_{t}$, which is formulated as

p(s_{t+1}|s_{t})=\begin{cases}\eta, & \text{if } s_{t+1}\triangleq\arg\max_{a_{t}\in\mathcal{A}}\mathcal{V}_{\pi}(s_{t+1}|s_{t},a_{t}),\\ 1-\eta, & \text{otherwise,}\end{cases}

where $\eta$ corresponds to the greedy coefficient used during the action selection process.

Based on the formulated MDP model ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$, $\mathcal{P}$), the UAV first observes the state $s_{t}$ in each time slot $t$. Then, it takes an action $a_{t}$ following the policy $\pi$, i.e., $a_{t}=\pi(s_{t})$, and obtains the corresponding immediate reward $r(s_{t},a_{t})$. The UAV then moves to the next way-point, and the state is updated to $s_{t+1}$. Therefore, problem P0 can be transformed into maximizing the discounted accumulated reward by optimizing the policy $\pi$.
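A minimal sketch of the immediate reward defined above is shown below; interpreting C1 as a magnitude threshold on the served GUs' channels and passing in the running energy consumption are modelling assumptions, while the threshold values follow Section IV ($\underline{h}=2.5\times 10^{-9}$, $\overline{E}=10^{4}$ kJ $=10^{7}$ J).

```python
# Minimal sketch of the immediate reward r(s_t, a_t). Interpreting C1 as a
# magnitude threshold and tracking the cumulative flight energy externally
# are assumptions; thresholds follow Section IV.
def immediate_reward(utility_t, channel_coeffs, energy_used, h_min=2.5e-9, e_max=1e7):
    """Return U^t if C1 (per-GU channel threshold) and C2 (energy budget) hold, else 0."""
    qos_ok = all(abs(h) >= h_min for h in channel_coeffs)   # C1
    energy_ok = energy_used <= e_max                        # C2
    return utility_t if (qos_ok and energy_ok) else 0.0
```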

A table that summarizes all the notations used in this paper is given in Table I.

III N-Step Mobile Trend Based DRL Scheme

III-A DRL Framework

It is well known that DRL algorithms are efficient and effective in solving MDPs for uncertain (i.e., $\mathcal{V}_{\pi}(s_{t})$) and complex (i.e., $\mathcal{S}$) systems, and the critical issue here is how to obtain the optimal policy $\pi^{*}$ for the action selection process.

To achieve this, we first rephrase $\mathcal{V}_{\pi}(s_{t})$ in the form of state-action pairs as $\mathcal{V}_{\pi}(s_{t})=Q^{\pi}(s_{t},a_{t})$, where $a_{t}=\pi(s_{t})$ means that the action $a_{t}$ is selected in state $s_{t}$ according to the policy $\pi$; we call $Q^{\pi}(s_{t},a_{t})$ the Q-function. Referring to [9], the optimal policy can be derived as $\pi^{*}(s_{t})=\arg\max_{a\in\mathcal{A}}Q(s_{t},a)$. In other words, once the UAV selects the action that maximizes the corresponding Q-function in each state $s_{t}\in\mathcal{S}$, the optimal policy is realized. The remaining issue is then how to obtain $Q(s_{t},a_{t})$ for a given state $s_{t}$ and action $a_{t}$.

Based on the above conclusions and (8), we have

Q^{\pi^{*}}(s_{t},a_{t})=r(s_{t},a_{t})+\gamma Q^{\pi^{*}}(s_{t+1},\pi^{*}(s_{t+1})), \qquad (9)

where $Q^{\pi^{*}}(s_{t+1},\pi^{*}(s_{t+1}))=\max_{a\in\mathcal{A}}Q^{\pi^{*}}(s_{t+1},a)$. The search for the Q-function $Q^{\pi^{*}}(s_{t},a_{t})$ can thus be formulated as a regression problem: the parameters of the Q-function are optimized iteratively so that the left-hand side of (9) approaches the right-hand side. Considering the high dimensions of $\mathcal{S}$ and $\mathcal{A}$ in the scenario we study, it is common to apply neural networks (NNs) to fit the Q-function. Specifically, an NN can capture the sophisticated mapping between the state $s_{t}$, the action $a_{t}$ and the corresponding Q-value $Q(s_{t},a_{t})$ based on a large training data set. Accordingly, two individual deep neural networks are established here: one with parameters $\theta_{P}$ constructs the Evaluation network $Q_{\theta_{P}}(s_{t},a_{t})$ that models the function $Q^{\pi^{*}}(s_{t},a_{t})$, and the other with parameters $\theta_{T}$ constructs the Target network $Q_{\theta_{T}}(s_{t},a_{t})$ used to obtain the target value (i.e., $r(s_{t},a_{t})+\gamma Q^{\pi^{*}}(s_{t+1},\pi^{*}(s_{t+1}))$) in the training process. Finally, the parameters $\theta_{P}$ of $Q_{\theta_{P}}(s_{t},a_{t})$ are updated by minimizing the loss function $L$, which is formulated as

L=\mathbb{E}_{s_{t}\in\mathcal{S}}\Big[\big(\underbrace{Q_{\theta_{P}}(s_{t},a_{t})}_{\text{Evaluation}}-\underbrace{(r(s_{t},a_{t})+\gamma\max_{a}Q_{\theta_{T}}(s_{t+1},a))}_{\text{Target}}\big)^{2}\Big]. \qquad (10)

Finally, after the NN has been well trained, the UAV can make a decision in state $s_{t}$ according to the learned Q-function and takes the action $a_{t}=\arg\max_{a\in\mathcal{A}}Q_{\theta_{P}}(s_{t},a)$.
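As an illustration, one training step that minimizes the loss in (10) could look as follows (written with PyTorch); q_eval and q_target stand for $Q_{\theta_{P}}$ and $Q_{\theta_{T}}$, and the batch layout is an assumption since the paper does not fix these implementation details.

```python
# Minimal sketch of one gradient step on the loss in (10) using PyTorch.
# q_eval / q_target play the roles of Q_{theta_P} / Q_{theta_T}; the batch
# layout (states, actions, rewards, next states) is an assumption.
import torch
import torch.nn.functional as F

def dqn_update(q_eval, q_target, optimizer, batch, gamma=0.99):
    """One step of L = E[(Q_eval(s,a) - (r + gamma * max_a' Q_target(s',a')))^2]."""
    s, a, r, s_next = batch                                         # tensors from the replay memory
    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q_{theta_P}(s_t, a_t)
    with torch.no_grad():                                           # target network is held fixed
        target = r + gamma * q_target(s_next).max(dim=1).values     # r + gamma * max_a Q_{theta_T}
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```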

III-B Design of the Neural Network Input Layer

In order to extract the spatial features of the scenario, we adopt Convolutional Neural Networks (CNNs) to realize the Evaluation network $Q_{\theta_{P}}(s_{t},a_{t})$, whose input is a three-channel tensor of size $\mathbb{R}^{K\times K\times 3}$. Each channel of the tensor is a matrix of size $\mathbb{R}^{K\times K}$, corresponding to the scenario model that is divided into $K\times K$ equal grids (see the system model in Section II-A). Two convolutional layers are constructed to extract features from the input, and two fully connected layers are constructed to establish associations between them. The design of the three channels is given as follows.
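Before detailing the three channels, a minimal sketch of the evaluation network structure described above (two convolutional layers followed by two fully connected layers over a $K\times K\times 3$ input) is given below; the kernel sizes, channel widths and hidden dimension are assumptions, as the paper only specifies the layer counts.

```python
# Minimal sketch of the evaluation network: K x K x 3 input, two convolutional
# layers and two fully connected layers mapping to the 8 movement actions.
# Kernel sizes, widths and the hidden size are assumptions.
import torch
import torch.nn as nn

K = 30  # grid resolution, as in Section IV

class EvalNet(nn.Module):
    def __init__(self, k=K, n_actions=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # convolutional layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # convolutional layer 2
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * k * k, 256), nn.ReLU(),   # fully connected layer 1
            nn.Linear(256, n_actions),               # fully connected layer 2 -> Q-values
        )

    def forward(self, x):          # x: [batch, 3, K, K] (channel-first tensor)
        return self.fc(self.conv(x))
```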

III-B1 Channel 1

Channel $1$ describes the effective communication range of the UAV at location $\mathbf{L_{u}^{t}}$, which is formulated as a matrix $\mathbf{T_{1}}\in\mathbb{R}^{K\times K}$. To meet the QoS requirement of the GU-to-UAV communication process, we substitute (3) into constraint C1, namely

||\mathbf{L_{u}^{t}}-\mathbf{L_{i}^{t}}||^{2}\leq\Big(\frac{\alpha(\hat{h}_{i}^{t})^{2}}{\underline{h}^{2}}\Big)^{2/k_{ps}}-H^{2},

which indicates that a GU can establish reliable communication with the UAV only if its location satisfies the above condition. In this way, we can check whether the distance from each GU $i\in\Omega_{all}^{t}$ to the UAV location $\mathbf{L_{u}^{t}}$ satisfies the condition. If so, we assign the corresponding element of $\mathbf{T_{1}}$ the real-time channel coefficient obtained; otherwise, it is set to $0$.
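A minimal sketch of building $\mathbf{T_{1}}$ is given below; mapping GU coordinates to grid indices and storing the channel magnitude (rather than the complex coefficient) are assumptions.

```python
# Minimal sketch of building channel 1 (T1): grid cells of GUs that satisfy the
# coverage condition C1 store the channel magnitude, all other cells stay zero.
# The coordinate-to-grid mapping and the use of |h| are assumptions.
import numpy as np

def build_T1(gu_cells, gu_channels, h_min=2.5e-9, K=30):
    """gu_cells: list of (row, col) grid indices; gu_channels: matching h_i^t values."""
    T1 = np.zeros((K, K))
    for (m, n), h in zip(gu_cells, gu_channels):
        if abs(h) >= h_min:          # C1 satisfied -> GU lies inside the effective range
            T1[m, n] = abs(h)
    return T1
```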

III-B2 Channel 2

Channel $2$ describes the interaction information between the GUs and the UAV, which is formulated as a matrix $\mathbf{T_{2}}\in\mathbb{R}^{K\times K}$. In $\mathbf{T_{2}}$, we record the number of communications between each GU $i$ and the UAV (i.e., $c_{i}$): the element of $\mathbf{T_{2}}$ corresponding to the location of GU $i\in\Omega_{all}^{t}$ is set to $c_{i}$, and all other elements are $0$.

III-B3 Channel 3

Channel $3$ describes the predicted movement trend of the GUs over the future $N$ time slots, which is formulated as a matrix $\mathbf{T_{3}}\in\mathbb{R}^{K\times K}$. The design principle of $\mathbf{T_{3}}$ is to predict the future movement of the GUs by extracting their mobility features from historical records, which gives the UAV a future-oriented view during trajectory planning so that it can maximize the accumulated reward from the current state until the end of the task. The specific steps for constructing Channel $3$ are as follows.

• Step 1: The matrix $\textbf{G}^{t}\in\mathbb{R}^{K\times K}$ records the buffer states of the GUs in the scenario, in which the element is set as $\textbf{G}^{t}[m,n]=B_{i}^{t}$ if GU $i\in\Omega_{all}^{t}$ is located in the corresponding grid (i.e., the $m$-th row and $n$-th column) of the scenario, otherwise $\textbf{G}^{t}[m,n]=0$.

• Step 2: The difference matrix $\Delta\textbf{G}^{t}\in\mathbb{R}^{K\times K}$ is produced as $\Delta\textbf{G}^{t}=\textbf{G}^{t}-\textbf{G}^{t-1}$, showing the changes of $\textbf{G}^{t}$ between adjacent time slots.

• Step 3: Four direction kernels $\mathcal{U}\in\mathbb{R}^{2\times 1}$, $\mathcal{D}\in\mathbb{R}^{2\times 1}$, $\mathcal{L}\in\mathbb{R}^{1\times 2}$, and $\mathcal{R}\in\mathbb{R}^{1\times 2}$ are utilized to detect the moving directions of the GUs (corresponding to up, down, left and right, respectively), which are given by

\mathcal{U}=[1,-1]^{T},\quad\mathcal{D}=[-1,1]^{T},\quad\mathcal{L}=[1,-1],\quad\mathcal{R}=[-1,1]. \qquad (11)

• Step 4: The SAME convolution operation is performed on the difference matrix $\Delta\textbf{G}^{t}$ with each of the above four kernels. The corresponding outputs are four matrices $\textbf{G}_{\mathcal{U}}^{t}$, $\textbf{G}_{\mathcal{D}}^{t}$, $\textbf{G}_{\mathcal{L}}^{t}$ and $\textbf{G}_{\mathcal{R}}^{t}$ of the same size $\mathbb{R}^{K\times K}$, which contain the detection results of the GU movement directions over the past two time slots (a code sketch of Steps 4-6 is given after Step 6). Specifically, the elements of the four matrices are obtained by

\begin{cases}\textbf{G}^{t}_{\mathcal{U}}[m,n]=\sum_{i=0}^{1}\Delta\textbf{G}^{t}[m+i,n]\times\mathcal{U}[i],\\[2pt] \textbf{G}^{t}_{\mathcal{D}}[m,n]=\sum_{i=0}^{1}\Delta\textbf{G}^{t}[m+i,n]\times\mathcal{D}[i],\\[2pt] \textbf{G}^{t}_{\mathcal{L}}[m,n]=\sum_{i=0}^{1}\Delta\textbf{G}^{t}[m,n+i]\times\mathcal{L}[i],\\[2pt] \textbf{G}^{t}_{\mathcal{R}}[m,n]=\sum_{i=0}^{1}\Delta\textbf{G}^{t}[m,n+i]\times\mathcal{R}[i],\end{cases}\quad\forall\, m,n.

• Step 5: Based on Step 4, the trend prediction matrices for the up, down, left and right directions are constructed, denoted as $\textbf{T}^{N}_{\mathcal{U}}$, $\textbf{T}^{N}_{\mathcal{D}}$, $\textbf{T}^{N}_{\mathcal{L}}$ and $\textbf{T}^{N}_{\mathcal{R}}$, respectively. Taking $\textbf{T}^{N}_{\mathcal{R}}$ as an example, the specific steps are given in Algorithm 1.

Algorithm 1 Generate the matrix $\textbf{T}^{N}_{\mathcal{R}}$ with $N$ steps
1: Initialize $i=1$, generate $\textbf{T}^{N}_{\mathcal{R}}[m,n]$ as
\textbf{T}^{N}_{\mathcal{R}}[m,n]=\begin{cases}\textbf{G}_{\mathcal{R}}^{t}[m,n], & \text{if }\textbf{G}_{\mathcal{R}}^{t}[m,n]>0,\\ 0, & \text{otherwise}.\end{cases}
2: for each $\textbf{T}^{N}_{\mathcal{R}}[m,n]\neq 0$, set $x\leftarrow m$, $y\leftarrow n$ do
3:    for $i\leq N$ do
4:        With probability $\epsilon$, set $\textbf{T}^{N}_{\mathcal{R}}[x+1,y]=\textbf{T}^{N}_{\mathcal{R}}[x+1,y]+\gamma\textbf{T}^{N}_{\mathcal{R}}[x,y]$, and $x\leftarrow x+1$. (Right)
5:        Otherwise, select another direction randomly, and update the corresponding element as
$\textbf{T}^{N}_{\mathcal{R}}[x-1,y]=\textbf{T}^{N}_{\mathcal{R}}[x-1,y]+\gamma\textbf{T}^{N}_{\mathcal{R}}[x,y],\ x\leftarrow x-1$, (Left)
or
$\textbf{T}^{N}_{\mathcal{R}}[x,y+1]=\textbf{T}^{N}_{\mathcal{R}}[x,y+1]+\gamma\textbf{T}^{N}_{\mathcal{R}}[x,y],\ y\leftarrow y+1$, (Up)
or
$\textbf{T}^{N}_{\mathcal{R}}[x,y-1]=\textbf{T}^{N}_{\mathcal{R}}[x,y-1]+\gamma\textbf{T}^{N}_{\mathcal{R}}[x,y],\ y\leftarrow y-1$. (Down)
6:        $i=i+1$.
7:    end for
8: end for
9: Normalize $\textbf{T}^{N}_{\mathcal{R}}$.

For $\textbf{T}^{N}_{\mathcal{L}}$, $\textbf{T}^{N}_{\mathcal{U}}$ and $\textbf{T}^{N}_{\mathcal{D}}$, the elements are updated following the same principle as step 5 of Algorithm 1, namely

\begin{cases}\textbf{T}^{N}_{\mathcal{L}}[x-1,y]=\textbf{T}^{N}_{\mathcal{L}}[x-1,y]+\gamma\textbf{T}^{N}_{\mathcal{L}}[x,y],\ x\leftarrow x-1,\\[2pt] \textbf{T}^{N}_{\mathcal{U}}[x,y+1]=\textbf{T}^{N}_{\mathcal{U}}[x,y+1]+\gamma\textbf{T}^{N}_{\mathcal{U}}[x,y],\ y\leftarrow y+1,\\[2pt] \textbf{T}^{N}_{\mathcal{D}}[x,y-1]=\textbf{T}^{N}_{\mathcal{D}}[x,y-1]+\gamma\textbf{T}^{N}_{\mathcal{D}}[x,y],\ y\leftarrow y-1.\end{cases}

• Step 6: Output the $N$-step moving trend prediction matrix $\mathbf{T_{3}}=\textbf{T}^{N}_{\mathcal{U}}+\textbf{T}^{N}_{\mathcal{D}}+\textbf{T}^{N}_{\mathcal{L}}+\textbf{T}^{N}_{\mathcal{R}}$, and set it as Channel $3$ of the CNN input.
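A minimal sketch of Steps 4-6 is given below for the right direction (detection with the kernel $\mathcal{R}$ in (11), the $N$-step propagation of Algorithm 1, and the final sum forming $\mathbf{T_{3}}$); the zero padding at the grid border, the choice of normalization, the grid-index convention for "right" and the random draws are assumptions, and the other three directions follow analogously.

```python
# Minimal sketch of Steps 4-6 for the right direction: detection of Delta G^t
# with the kernel R in (11), Algorithm-1 style N-step propagation, and summing
# the four direction matrices into T3. Zero padding at the border, the
# normalization, the index convention for "right" and the random draws are
# assumptions.
import numpy as np

rng = np.random.default_rng(0)

def detect_right(dG):
    """G_R^t[m, n] = -dG[m, n] + dG[m, n + 1] (SAME size, zero padding on the last column)."""
    out = np.zeros_like(dG, dtype=float)
    out[:, :-1] = -dG[:, :-1] + dG[:, 1:]
    return out

def trend_right(G_R, N=3, gamma=0.9, eps=0.9):
    """Algorithm 1: propagate each positive detection rightwards for N steps."""
    K = G_R.shape[0]
    T = np.where(G_R > 0, G_R, 0.0)
    for m, n in zip(*np.nonzero(T)):                 # seeds found in step 1 of Algorithm 1
        x, y = int(m), int(n)
        for _ in range(N):
            if rng.random() < eps:                   # with probability eps keep moving right
                dx, dy = 0, 1
            else:                                    # otherwise pick left, up or down at random
                dx, dy = [(0, -1), (-1, 0), (1, 0)][rng.integers(3)]
            nx, ny = x + dx, y + dy
            if not (0 <= nx < K and 0 <= ny < K):    # stop at the grid border (assumption)
                break
            T[nx, ny] += gamma * T[x, y]
            x, y = nx, ny
    return T / T.max() if T.max() > 0 else T         # normalization (assumption: divide by max)

# T3 is then the sum of the four direction matrices (Step 6):
# T3 = trend_up(...) + trend_down(...) + trend_left(...) + trend_right(detect_right(dG))
```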

III-C Joint Offline and Online Learning Framework

In order to enable the UAV to update the DRL model while executing the communication task, a training framework combining online and offline learning is proposed, in which the UAV performs the task following the policy obtained in the offline stage and then updates its policy online in the current episode.

The offline learning process is designed to learn from the most valuable experiences of the past. Therefore, a large "Replay Memory" (denoted as $R_{l}$) is constructed to store the experiences, including the state $s_{t}$, the action $a_{t}$ and the reward $\mathcal{U}^{t}$, collected during past episodes, of which seventy percent are those with the largest rewards and the remaining thirty percent are chosen randomly from the rest. In contrast, the online learning process aims to learn from the current experiences, yielding the current best behavioral decisions. It therefore maintains a relatively small "Replay Memory" (denoted as $R_{s}$) to store the experiences sampled during the current communication task, of which eighty percent are those with the largest rewards in the current episode and the remaining twenty percent are picked randomly.
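A minimal sketch of how the two replay memories can be filled is given below; the capacities and the layout of the experience tuples are assumptions, while the 70/30 and 80/20 splits follow the text.

```python
# Minimal sketch of filling the two replay memories: a fraction of the capacity
# holds the highest-reward experiences, the rest is drawn at random from the
# remainder. Capacities and the experience tuple layout are assumptions; the
# 70/30 (R_l) and 80/20 (R_s) splits follow the text.
import random

def fill_memory(experiences, capacity, top_fraction):
    """experiences: list of (state, action, reward, next_state) tuples."""
    ranked = sorted(experiences, key=lambda e: e[2], reverse=True)   # sort by reward
    n_top = min(int(top_fraction * capacity), len(ranked))
    top, rest_pool = ranked[:n_top], ranked[n_top:]
    n_rand = min(capacity - n_top, len(rest_pool))
    return top + random.sample(rest_pool, n_rand)

# R_l = fill_memory(past_experiences, capacity=10000, top_fraction=0.7)     # offline memory
# R_s = fill_memory(current_experiences, capacity=1000, top_fraction=0.8)   # online memory
```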

The combination of the above two training processes not only allows the UAV to take full advantage of past experiences, but also lets it adjust its policy according to the current situation and consolidate all of the experiences after the episode ends.

III-D Computational Complexity

The computational complexity of our proposed moving Trend Prediction (TP) based DRL framework mainly consists of two parts, namely the construction of Channel $3$ and the CNN calculation. For the former, Steps 1, 2 and 3 only involve simple calculations, Step 4 requires $O(4\times|\Omega_{all}^{t}|)$ and Step 5 requires $O(8\times N\times|\Omega_{all}^{t}|)$. The computational complexity of the CNN calculation is given by $O(N_{c}\times M_{c}^{2}\times K_{c}^{2}\times C_{in}\times C_{out})$ [14], in which $N_{c}=2$, $M_{c}=32$, $K_{c}=3$, $C_{in}=2$ and $C_{out}=32$ are the parameters of the model in our scheme.
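As a rough order-of-magnitude check with the parameter values quoted above, the CNN term evaluates to

N_{c}\times M_{c}^{2}\times K_{c}^{2}\times C_{in}\times C_{out}=2\times 32^{2}\times 3^{2}\times 2\times 32\approx 1.18\times 10^{6}

multiply-accumulate operations per forward pass.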

IV Numerical Results

The simulation parameters of the scenario are set as $|\Omega_{all}^{t}|=50$, $K=30$, $k_{1}=0.9$, $\overline{v}=1$ m/s, $v_{uav}=30$ m/s, $p_{f}=110$ W, $\widetilde{\theta}=\frac{\pi}{2}$, $H=40$ m, $k_{ps}=2$, $k_{s}=1$, and $k_{2}$ follows the $\epsilon$-greedy model with $\epsilon=0.9$. The simulation parameters of the communication process are set as $W=2$ MHz, $\tau=1$ s, $\tau_{c}=0.1$ s, $\underline{h}=2.5\times 10^{-9}$, $\overline{E}=10^{4}$ kJ, $p=0.1$ W, $\alpha=10^{-5}$, $\sigma=10^{-9}$ and $I_{i}^{t}=5\times 10^{-3}$ bits/s. The Trend Prediction based model proposed in this work is referred to as TP, with greedy coefficient $\eta=0.9$. The results are shown in Fig. 1 and Fig. 2.

Figure 1: Overall performance in large-scale scenarios with fixed $50$ GUs, in which the maximum number of training steps per episode is set as (a) 3000 and (b) 6000.
Figure 2: Overall performance in large-scale scenarios with fixed $50$ GUs, in which the maximum number of training steps per episode is set as (a) 8000 and (b) 10000.

We tested the total reward and the number of available steps under multiple training parameters in a scenario with $50$ GUs, where the total reward can be seen as the overall performance combining fairness and throughput, while the number of steps reflects the efficiency of the UAV trajectory.

The above results show that, even when the number of GUs is increased from $20$ to $50$, the proposed solution still obtains good performance. We did not compare with the other two schemes because, even in the $20$-GU scenario, they not only failed to converge but were also at a significant performance disadvantage.

Figure 3: Overall performance in large-scale scenarios with an increasing number of GUs.

We further tested the performance of the proposed strategy when the number of GUs in the scene increases dynamically: $50$ GUs are deployed in the base scene and one new GU is added at the end of each training episode. The results are given in Fig. 3.

Figure 4: The trajectory of the UAV in the scenario with $50$ GUs.

We further show the trajectory of the UAV in Fig. 4. The UAV starts at the green dot and ends at the blue pentagram. Viewed over the entire trajectory, the UAV covers the whole area well, which indirectly shows that a UAV with the trend prediction function has a broader and longer-term view.

V Notation Table

A table that summarizes all the notations used in this paper is given in Table I.

TABLE I: Notation table.
Notation           Meaning
$\Omega_{all}^{t}$           The set of GUs in the scenario in time slot $t$.
$v_{i}^{t}$           The velocity of GU $i$ in time slot $t$.
$v_{uav}$           The velocity of the UAV.
$\theta_{i}^{t}$           The moving direction of GU $i$ in time slot $t$.
$h_{i}^{t}$           The channel coefficient between GU $i$ and the UAV in time slot $t$.
$\mathbf{L_{u}^{t}}\in\mathbb{R}^{1\times 2}$           The location of the UAV in time slot $t$.
$\mathbf{L_{i}^{t}}\in\mathbb{R}^{1\times 2}$           The location of GU $i$ in time slot $t$.
$H$           The fixed flight altitude of the UAV.
$\epsilon$           The greedy coefficient in the selection of the moving direction of GUs.
$\eta$           The greedy coefficient in the action selection of the UAV.
$p$           The transmission power of the UAV.
$\mathbf{T_{n}}\in\mathbb{R}^{K\times K}$           The $n$-th channel of the input of the CNN model.
$\tau$           The duration of one time slot.
$\tau_{c}$           The duration of the hovering time within each time slot.
$B_{i}^{t}$           The data queue length of GU $i$ in time slot $t$.
$I_{i}^{t}$           The newly generated data of GU $i$ in time slot $t$.
$E_{f}^{t}$           The energy consumption of the UAV in time slot $t$.
$\underline{h}$           The lower bound of the channel quality.
$\overline{E}$           The upper bound of the portable energy carried by the UAV.
$f_{G}^{t}$           The Jain's fairness index.
$\mathcal{V}_{\pi}(s_{t})$           The discounted accumulated reward from state $s_{t}$ to the end.
$\textbf{G}^{t}\in\mathbb{R}^{K\times K}$           The matrix used to record the buffer state of GUs in the scenario.

References

  • [1] Y. Sun, D. Xu, D. W. K. Ng, L. Dai and R. Schober, “Optimal 3D-Trajectory Design and Resource Allocation for Solar-Powered UAV Communication Systems,” IEEE Transactions on Communications, vol. 67, no. 6, pp. 4281-4298, 2019.
  • [2] K. Meng, D. Li, X. He and M. Liu, “Space Pruning Based Time Minimization in Delay Constrained Multi-Task UAV-Based Sensing,” IEEE Transactions on Vehicular Technology, vol. 70, no. 3, pp. 2836-2849, 2021.
  • [3] C. H. Liu, Z. Chen, J. Tang, J. Xu and C. Piao, “Energy-Efficient UAV Control for Effective and Fair Communication Coverage: A Deep Reinforcement Learning Approach,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 2059-2070, 2018.
  • [4] A. M. Seid, G. O. Boateng, B. Mareri, G. Sun and W. Jiang, “Multi-Agent DRL for Task Offloading and Resource Allocation in Multi-UAV Enabled IoT Edge Network,” IEEE Transactions on Network and Service Management, vol. 18, no. 4, pp. 4531-4547, 2021.
  • [5] Y. Li, A. Hamid Aghvami and D. Dong, “Path Planning for Cellular-Connected UAV: A DRL Solution with Quantum-Inspired Experience Replay,” IEEE Transactions on Wireless Communications, 2022, early access.
  • [6] C. H. Liu, Z. Dai, Y. Zhao, J. Crowcroft, D. Wu and K. K. Leung, “Distributed and Energy-Efficient Mobile Crowdsensing with Charging Stations by Deep Reinforcement Learning,” IEEE Transactions on Mobile Computing, vol. 20, no. 1, pp. 130-146, 2021.
  • [7] O. S. Oubbati, M. Atiquzzaman, A. Baz, H. Alhakami and J. Ben-Othman, “Dispatch of UAVs for Urban Vehicular Networks: A Deep Reinforcement Learning Approach,” IEEE Transactions on Vehicular Technology, vol. 70, no. 12, pp. 13174-13189, 2021.
  • [8] X. Wang, X. Liu, C. T. Cheng, L. Deng, X. Chen and F. Xiao, “A Joint User Scheduling and Trajectory Planning Data Collection Strategy for the UAV-Assisted WSN,” IEEE Communications Letters, vol. 25, no. 7, pp. 2333-2337, 2021.
  • [9] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding and F. Shu, “Path Planning for UAV-Mounted Mobile Edge Computing With Deep Reinforcement Learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 5, pp. 5723-5728, 2020.
  • [10] P. Luong, F. Gagnon, L. -N. Tran and F. Labeau, “Deep Reinforcement Learning-Based Resource Allocation in Cooperative UAV-Assisted Wireless Networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 11, pp. 7610-7625, 2021
  • [11] Z. Qin, Z. Liu, G. Han, C. Lin, L. Guo and L. Xie, “Distributed UAV-BSs Trajectory Optimization for User-Level Fair Communication Service With Multi-Agent Deep Reinforcement Learning,” IEEE Transactions on Vehicular Technology, vol. 70, no. 12, pp. 12290-12301, 2021.
  • [12] H. Huang, Y. Yang, H. Wang, Z. Ding, H. Sari and F. Adachi, “Deep Reinforcement Learning for UAV Navigation Through Massive MIMO Technique,” IEEE Transactions on Vehicular Technology, vol. 69, no. 1, pp. 1117-1121, 2020.
  • [13] W. Zhang, K. Song, X. Rong and Y. Li, “Coarse-to-Fine UAV Target Tracking With Deep Reinforcement Learning,” IEEE Transactions on Automation Science and Engineering, vol. 16, no. 4, pp. 1522-1530, 2019.
  • [14] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800-1807.
  • [15] X. Wang, X. Liu, C. T. Cheng, L. Deng, X. Chen and F. Xiao, “A Joint User Scheduling and Trajectory Planning Data Collection Strategy for the UAV-Assisted WSN,” IEEE Communications Letters, vol. 25, no. 7, pp. 2333-2337, 2021.