Situation-Aware Deep Reinforcement Learning for Autonomous Nonlinear Mobility Control in Cyber-Physical Loitering Munition Systems
Abstract
With the rapid development of drone technologies, drones are now widely used in many applications, including military domains. In this paper, a novel situation-aware DRL-based autonomous nonlinear drone mobility control algorithm is proposed for cyber-physical loitering munition applications. On the battlefield, the design of DRL-based autonomous control algorithms is not straightforward because real-world data gathering is generally not available. Therefore, the approach in this paper is to construct a cyber-physical virtual environment with Unity. Based on the virtual cyber-physical battlefield scenarios, a DRL-based automated nonlinear drone mobility control algorithm can be designed, evaluated, and visualized. Moreover, many obstacles exist in real-world battlefield scenarios that are harmful to linear trajectory control. Thus, our proposed autonomous nonlinear drone mobility control algorithm utilizes situation-aware components that are implemented with a Raycast function in the Unity virtual scenarios. Based on the gathered situation-aware information, the drone can autonomously and nonlinearly adjust its trajectory during flight. Therefore, this approach is clearly beneficial for avoiding obstacles in obstacle-deployed battlefields. Our visualization-based performance evaluation shows that the proposed algorithm is superior to other linear mobility control algorithms.
Index Terms:
Drone, Loitering Munition, Sensing, Deep Reinforcement Learning, Drone Mobility Control, Unity.
I Introduction
I-A Background and Motivation
With the Internet of Things (IoT) revolution in modern communications and network applications, drones for autonomous aerial ad-hoc on-demand three-dimensional (3D) networking have encountered a rapid change in their applications, from professional toys to professional application-specific complex IoT devices [2, 3, 4]. In modern embedded drone design and implementation, drones are integrated with several sensors, including cameras and the global positioning system (GPS), and thus, they are actively and widely used in various fields. This means that drones are not only used for live broadcasts, agriculture, and weather forecasting, but also have the potential to be used in global logistics or future mobility to bring dramatic convenience to human life. For example, when drones are utilized in agricultural applications, a multi-drone networking platform can manage vast farms, diagnose crop conditions, and provide appropriate solutions to increase productivity [5]. In the field of logistics, enterprises like Amazon have launched drone delivery pilot services, delivering packages safely and on time through fully autonomous driving systems. Kong et al. [6] also conducted a research project to optimize the drone path/trajectory using attention-based pointer networks for future mobility applications. In future mobility, when vertiports are built in smart cities, drone taxis are expected to transport passengers [7].

In order to fully utilize drone-related technologies in many emerging applications, research on controlling drone mobility in an autonomous way is essentially required. For designing and implementing algorithms for autonomous drone mobility control, deep reinforcement learning (DRL) is one of the most promising deep learning approaches because DRL is formally defined as a discrete-time stochastic decision-making control process for maximizing the expected return/utility. Furthermore, during drone trajectory control in various applications, many obstacles (such as buildings and structures) can exist. Therefore, nonlinear autonomous drone mobility control under the consideration of near-field situation sensing is essentially desired for avoiding obstacles in the physical world. Although major studies have been conducted on how drones avoid obstacles via situation-aware DRL algorithms, these research contributions are not suited to military usage (such as loitering munition), which requires multi-drone swarm flight for attacking target areas via autonomous nonlinear drone mobility control [8, 9].
When considering multi-drone autonomous networks, the use of cyber-physical systems (CPS) for interacting/mapping between the actions in the virtual environment and the ones in the real-world environment is attracting a lot of attention in industry and academia [10, 11, 12, 13]. CPS technologies are based on the integration of communications, networking, sensing, mechatronics, and data analytics. Therefore, CPS is widely used in many industrial applications such as smart production, smart electric grids, smart logistics, and smart health care [13, 14]. In order to realize the interaction/mapping between the actions in the virtual environment and the ones in the physical environment, visualization tools (such as Unity) can be definitely helpful because algorithm designers can intuitively and immediately understand and validate the algorithm functionalities, as well studied in [12, 15, 16, 17].
I-B Use Cases
In this paper, among the various promising applications and use cases for DRL algorithms in multi-drone networks, military applications are considered because they are one of the most prominent topics nowadays, based on the significant role drones have played in modern wars, especially in urban areas. In the military field, application-specific drones can take on a variety of roles, such as monitoring the enemy, providing supplies, and executing combat missions, as illustrated in Fig. 1. Despite the short operation time due to the limitations of batteries (approximately a few tens of minutes), military drones have a significant advantage in swiftly navigating areas that infantry cannot observe, while avoiding pursuit. In December 2022, the South Korean military failed to track five North Korean drones that invaded South Korean airspace, all of which returned to North Korea [18]. In modern warfare, drone attacks using loitering munitions have become an inevitable strategy to gain the upper hand in wars. As one example of the use of drones in wars, military-purpose drones are also used in the current war between Ukraine and Russia. The SwitchBlade 300 loitering munition, made by AeroVironment of the United States, is being used successfully to attack Russian garrisons and tanks [19]. According to the Russian defense ministry, a drone attack on Russian bases fatally wounded three Russian technical staff [20]. Furthermore, dogfights between reconnaissance drones have also occurred [21]. There have also been cases of massive damage caused by drone attacks. Saudi Arabia’s state-owned oil refinery enterprise, Aramco, was hit by more than 17 loitering munition strikes at its two refineries in 2019 [22]. More than 20 drones, each carrying more than three kilograms of explosives, caused massive damage, disrupting the supply of 5.7 million barrels of petroleum. The damage was caused by the lack of defense measures and served as a reminder of the need to build drone defense systems.
I-C Algorithm Design Rationale
According to the modern trends in the usage scenarios of drone networks, we can observe that drones can be effective and efficient in battlefield autonomous unmanned defense system design based on autonomous multi-drone UAV mobility control cooperation and coordination. For this purpose, DRL algorithms can be the best solution due to their nature as discrete-time stochastic control and sequential decision-making processes. Among various DRL algorithms, our proposed algorithm is fundamentally designed based on the deep deterministic policy gradient (DDPG) for utilizing a continuous action space in order to realize nonlinear drone mobility control [23, 24, 25]. Furthermore, to deal with unexpected blockages and real-time environmental changes in the physical world, situation-aware mechanisms are additionally considered on top of the DDPG-based DRL design. In previous research, drones fly over straight-line trajectories; thus, nonlinear drone mobility control for avoiding blockages cannot be considered [24, 26, 27, 28, 29].
I-D Contributions
The major contributions of this research in situation-aware DDPG-based nonlinear drone mobility control can be summarized as follows.
• First of all, our proposed DRL-based algorithm fundamentally aims at autonomous nonlinear drone mobility control based on near-field situation sensing for taking care of obstacles. In the literature, many approaches use pure DRL algorithms without situation-awareness, which can be harmful in urban battlefield environments with many obstacles (blockages, buildings, and structures). This situation-aware nonlinear mobility control is definitely beneficial in our considered application, i.e., autonomous drone mobility control for loitering munition.
• Moreover, the proposed situation-aware DRL-based autonomous nonlinear drone mobility control algorithm is implemented with the Unity 3D environment for loitering munition mobility control. This implementation with the Unity 3D environment is essential for the realization of CPS, i.e., interacting and mapping between the actions in the virtual environment and the ones in the real world, for more intuitive and efficient understanding by algorithm designers and system users (military officers and soldiers).
• Lastly, compared to our previous work [1], this paper mathematically specifies the algorithm, and its performance is evaluated in various ways.
I-E Organization
The rest of this paper is organized as follows. Sec. II presents the related work of drone technologies and reinforcement learning. Sec. III explains our proposed situation-aware autonomous nonlinear drone mobility control algorithm with mathematical analysis, and Sec. IV includes simulation-based performance evaluation results with corresponding discussions. Finally, Sec. V concludes this paper and presents future research directions.
II Related Work
There are many research results on drone trajectory optimization and mobility control. In designing drone mobility control algorithms, each research contribution has its own objective for the mobility control. In [30], DRL-based multi-drone mobility control algorithms are designed and implemented under the consideration of weights and batteries, which can restrict the drone’s flight time and mobility. This type of research aims to optimize drone trajectories so that drones move efficiently within a limited time [31]. In addition, free-space optical communication (FSOC) is employed to deal with the trajectory optimization of a fixed-wing UAV over various atmospheric conditions, as well presented in [31].
Autonomous drone mobility control algorithms can also be designed under the consideration of the drone’s specific applications and objectives. In [32], drone mobility control algorithms are designed using multi-agent DRL for CCTV-based surveillance systems in smart city applications. In addition, autonomous drone mobility control algorithms can be designed for cooperative and coordinated positioning of mobile access points and base stations [33, 34, 35, 36]. Furthermore, the proposed DRL-based algorithm in [7] aims at optimal passenger delivery in urban air mobility (UAM) applications (i.e., drone-taxi) while avoiding accidents. The algorithm in [7] applies distributed DRL-based mobility control to electric vertical takeoff and landing (eVTOL) drone platforms for UAM passenger delivery applications. The eVTOL vehicles search for the optimal passenger transport routes under the consideration of passengers’ behaviors, collision potentials, and battery status levels through QMIX-based multi-agent DRL.
Given that it is essential to develop power-charging algorithms for power-hungry drones, the multi-agent DRL algorithm proposed in [37] studies an ad-hoc battery charging method for outdoor multi-drone environments. To overcome the lack of charging platforms and the pressure of limited charging time, auction-based resource allocation with a deep learning framework is employed to compute the charging schedule. The main reason why auction-based algorithms should be used for this problem is that they are fully distributed (i.e., also working with only neighbor information under uncertainty) and truthful [37]. The proposed Myerson auction with a deep learning framework ensures dominant strategy incentive compatibility (DSIC) and individual rationality (IR) for truthful operations. According to the data-intensive performance evaluation of the proposed algorithm, it achieves revenue-optimality under the considerations of DSIC and IR. Lastly, the multi-agent DRL-based drone mobility control algorithm proposed in [38] is for joint cooperative power-charging control and drone charging scheduling under the support of built-in infrastructure, i.e., charging towers. The proposed algorithm first determines the scheduling between drones and charging towers. During this phase, it has been confirmed that the proposed scheduling optimization framework is non-convex; thus, it is converted to a convex form with some mathematical techniques. After that, the amounts of power charging in the scheduled drone-tower pairs are determined based on multi-agent DRL.
III Situation-Aware and DRL-based Autonomous Nonlinear Drone Mobility Control for Loitering Munition

Our proposed autonomous nonlinear drone mobility control algorithm is fundamentally designed on top of the deep deterministic policy gradient (DDPG) algorithm, which is one of the well-known policy-gradient (PG) DRL frameworks. By utilizing PG-based DRL algorithms, the shortcomings of existing value-based deep Q-network (DQN) algorithms can be compensated [39]. Unlike value-based algorithms, which can only be applied to environments with discrete actions, DDPG-based DRL algorithms can handle the continuous movements of drones [23]. To conduct action inference in advantage actor-critic (A2C) algorithms, a discrete action-selection function must be applied to the output of the actor network, which is infeasible for continuous action spaces. To take care of this issue, our proposed autonomous nonlinear drone mobility control algorithm is designed based on DDPG because it aims at continuous action control. There are also differences in how the networks are updated for training via gradient descent. A2C updates its actor network by evaluating actions using the estimate of its own Q-function. On the other hand, our DDPG-based algorithm updates the critic network with a temporal difference (TD) target to improve stability.
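To make the contrast concrete, the following minimal PyTorch sketch (not the paper's exact networks; layer sizes and dimensions are illustrative assumptions) compares a discrete A2C-style action head, which requires an explicit selection step over a finite action set, with a DDPG-style actor that outputs a bounded continuous action directly.

```python
import torch
import torch.nn as nn

# Minimal sketch contrasting a discrete (A2C-style) action head with a
# continuous (DDPG-style) actor that emits the 3D move directly.
state_dim, n_discrete_moves, action_dim = 16, 6, 3

# Discrete head: logits over a finite set of moves (e.g., +/- along x, y, z);
# an argmax/sampling step is required, which does not extend to continuous actions.
discrete_actor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                               nn.Linear(128, n_discrete_moves))

# Continuous head: the actor deterministically maps the state to a bounded
# real-valued action vector, as DDPG requires.
continuous_actor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim), nn.Tanh())

s = torch.randn(1, state_dim)
discrete_move = torch.argmax(discrete_actor(s), dim=-1)   # index of one discrete move
continuous_move = continuous_actor(s)                     # real-valued (x, y, z) move
```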
When the action control and experiments are conducted for drone aerial movements using conventional PG-based DRL algorithms such as A2C-based algorithms, the corresponding modeling can increase ($+$) and decrease ($-$) the movements in the 3D $x$, $y$, and $z$ directions, e.g., $x^{+}$, $x^{-}$, $y^{+}$, $y^{-}$, $z^{+}$, and $z^{-}$, respectively. However, according to the simulation-based verification shown in Fig. 2, we can observe that the performance of A2C-based algorithms is significantly lower than that of DDPG-based algorithms because DDPG-based algorithms can control the mobility via continuous action control, which is easily tractable for nonlinear continuous behavior modeling and computation.
Note that the parameters of the actor neural network and the critic neural network are denoted as $\theta^{\mu}$ and $\theta^{Q}$, respectively, in our DDPG-based DRL algorithm computation. In addition, the objectives of the actor and critic networks are for the actor network to maximize the action-value derived from the critic network, and for the critic network to follow the Bellman equation, which consists of the target critic network and the reward function, respectively. For this purpose, our proposed DDPG-based DRL algorithm adopts a greedy policy in which the actor network $\mu(s_t \mid \theta^{\mu})$ approximates the action that maximizes the action-value function (i.e., the critic network) for a given state $s_t$, which can be expressed as,

$$\mu(s_t \mid \theta^{\mu}) \approx \arg\max_{a} Q(s_t, a \mid \theta^{Q}). \qquad (1)$$

The parameter $\theta^{\mu}$ of the actor neural network should meet the following optimization criterion,

$$J(\theta^{\mu}) = \mathbb{E}\left[ Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right) \right], \qquad (2)$$

and the parameter $\theta^{\mu}$ can be calculated by maximizing $J(\theta^{\mu})$ based on the following stochastic gradient ascent computation,

$$\theta^{\mu} \leftarrow \theta^{\mu} + \alpha \nabla_{\theta^{\mu}} J(\theta^{\mu}). \qquad (3)$$

Now, the gradient for $\theta^{\mu}$ in $J(\theta^{\mu})$ can be expressed as follows,

$$\nabla_{\theta^{\mu}} J(\theta^{\mu}) = \nabla_{\theta^{\mu}} \mathbb{E}\left[ Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right) \right] \qquad (4)$$
$$= \mathbb{E}\left[ \nabla_{\theta^{\mu}} Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right) \right] \qquad (5)$$
$$= \mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a=\mu(s \mid \theta^{\mu})} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right], \qquad (6)$$

according to the chain rule.
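In practice, the deterministic policy gradient in (4)-(6) can be realized with automatic differentiation; the sketch below is a minimal PyTorch illustration under assumed layer sizes and state/action dimensions, not the exact networks used in this paper.

```python
import torch
import torch.nn as nn

# Autograd applies the chain rule grad_a Q(s,a)|_{a=mu(s)} * grad_theta mu(s)
# automatically when -Q(s, mu(s)) is backpropagated.
state_dim, action_dim = 16, 3
actor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                      nn.Linear(128, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                       nn.Linear(128, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

s = torch.randn(64, state_dim)                 # minibatch of states
a = actor(s)                                   # a = mu(s | theta^mu)
q = critic(torch.cat([s, a], dim=-1))          # Q(s, mu(s) | theta^Q)

actor_loss = -q.mean()                         # L(theta^mu) = -E[Q(s, mu(s))], cf. (10)
actor_opt.zero_grad()
actor_loss.backward()                          # chain rule of (4)-(6) via autograd
actor_opt.step()                               # one gradient-ascent step on J(theta^mu), cf. (3)
```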
We can now calculate the TD error using the target critic neural network as follows,

$$\delta_t = y_t - Q(s_t, a_t \mid \theta^{Q}), \qquad (7)$$
$$y_t = r(s_t, a_t) + \gamma Q'\left(s_{t+1}, a_{t+1} \mid \theta^{Q'}\right), \qquad (8)$$
$$a_{t+1} = \mu'\left(s_{t+1} \mid \theta^{\mu'}\right), \qquad (9)$$

where $s_{t+1}$ and $a_{t+1}$ stand for the next state and the action given the state-action pair $(s_t, a_t)$, respectively. The loss function of the actor neural network is defined as follows,

$$L(\theta^{\mu}) = -\mathbb{E}\left[ Q\left(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}\right) \right], \qquad (10)$$

where

$$\nabla_{\theta^{\mu}} L(\theta^{\mu}) = -\mathbb{E}\left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a=\mu(s \mid \theta^{\mu})} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right] \qquad (11)$$

is the gradient obtained by stochastic gradient descent methods, and thus, (10) indicates that the actor neural network updates its parameters in the direction of loss function minimization.
In addition, the loss function of the critic neural network is calculated with a TD target as follows,

$$L(\theta^{Q}) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t \mid \theta^{Q}) \right)^2 \right], \qquad (12)$$

where

$$y_t = r(s_t, a_t) + \gamma Q'\left(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right), \qquad (13)$$

as described in (7). Note that the action $\mu'(s_{t+1} \mid \theta^{\mu'})$ is calculated by applying the target actor neural network parameterized by $\theta^{\mu'}$ to $s_{t+1}$. Our proposed DDPG-based DRL algorithm applies the soft target update method, which means that the parameters of the target neural networks follow the actor/critic neural networks slowly, setting $\tau$ to a very small value, i.e.,

$$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'. \qquad (14)$$

According to the fact that our proposed DDPG-based DRL algorithm is a deterministic-policy-based algorithm, it is required to provide randomness to the action. Therefore, a noise factor $\mathcal{N}_t$ is added to the action as shown in (15),

$$a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t. \qquad (15)$$
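Equations (12)-(15) translate into a short update routine; the following sketch uses the discount factor 0.9 from Sec. IV, while $\tau$, the noise scale, the layer sizes, and the random placeholder transitions are illustrative assumptions rather than the paper's exact settings.

```python
import copy
import torch
import torch.nn as nn

# Critic update with a TD target (12)-(13), soft target update (14),
# and exploration noise (15).
state_dim, action_dim, gamma, tau, noise_std = 16, 3, 0.9, 0.005, 0.1

actor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                      nn.Linear(128, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                       nn.Linear(128, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One minibatch of transitions (s, a, r, s') -- random placeholders here.
s, a = torch.randn(64, state_dim), torch.randn(64, action_dim)
r, s_next = torch.randn(64, 1), torch.randn(64, state_dim)

with torch.no_grad():                                      # TD target y_t, cf. (13)
    a_next = target_actor(s_next)                          # mu'(s_{t+1} | theta^mu')
    y = r + gamma * target_critic(torch.cat([s_next, a_next], dim=-1))

q = critic(torch.cat([s, a], dim=-1))
critic_loss = nn.functional.mse_loss(q, y)                 # L(theta^Q), cf. (12)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Soft target update, cf. (14): theta' <- tau * theta + (1 - tau) * theta'.
for net, target in ((actor, target_actor), (critic, target_critic)):
    for p, p_t in zip(net.parameters(), target.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Exploration: add noise to the deterministic action, cf. (15).
a_explore = actor(s) + noise_std * torch.randn(64, action_dim)
```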
Our proposed algorithm's state space contains information about the agent and the environment. The difference between the coordinate vectors of the drone and the target, as well as the vectors representing the drone's flight speed and angular speeds, are included in the state space. To realize the obstacle factors in drone mobility control, information from the Raycast function provided by Unity is also included in the state space. Note that Raycast is a function that shoots a virtual laser, and it can detect the direction and distance between the agent and an obstacle by shooting lasers in vertical or horizontal directions. In addition, the actions are defined as the moves over the $x$, $y$, and $z$ directions, respectively. Finally, we can confirm that situation-awareness can be realized by the Raycast function for utilizing virtual laser functionalities.
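The following sketch illustrates, under assumed field names and dimensions, how such a state vector could be assembled from the relative target position, the linear and angular speeds, and the Raycast distance readings produced by the Unity scene; the actual observation is built inside the environment.

```python
import numpy as np

def build_state(drone_pos, target_pos, velocity, angular_velocity, ray_distances):
    # State = drone-to-target offset, linear/angular speeds, Raycast readings.
    return np.concatenate([
        np.asarray(target_pos) - np.asarray(drone_pos),   # relative target position
        np.asarray(velocity),                             # flight speed
        np.asarray(angular_velocity),                     # angular speeds
        np.asarray(ray_distances),                        # situation-aware ray readings
    ]).astype(np.float32)

state = build_state(drone_pos=[0.0, 10.0, 0.0], target_pos=[50.0, 0.0, 30.0],
                    velocity=[1.0, 0.0, 2.0], angular_velocity=[0.0, 0.1, 0.0],
                    ray_distances=[12.0, 7.5, 3.2, 20.0, 20.0])
action = np.array([0.4, -0.1, 0.8], dtype=np.float32)     # continuous (x, y, z) move
```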

The proposed algorithm is different from the existing conventional DDPG because it can recognize surrounding objects, i.e., situation-awareness. As explained, we add the Raycast function provided by Unity to identify nearby obstacles, where Raycast is a function that casts virtual rays from the moving object to measure the distance to another object or to determine whether the ray reaches an object. Fig. 3 represents the Raycasts shot from the agent. In Fig. 3, the drone/agent can determine that there are no obstacles in the 12 o'clock, 1 o'clock, and 9 o'clock directions through the Raycast. In addition, considering the previously acquired location information of the target, the drone/agent will act to fly in the 1 o'clock direction. A ray in the vertical direction measures the distance to the ground and is used to avoid colliding with the ground. In order to induce the drone/agent to avoid surrounding obstacles, when a launched ray collides with an obstacle, the drone/agent receives a negative reward that is inversely proportional to the distance to the obstacle. By recognizing obstacles through Raycast, our proposed situation-aware autonomous nonlinear drone mobility control algorithm is designed.
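A minimal sketch of this proximity penalty is given below; only the inverse-distance form follows the description above, while the scaling constant and the number of rays are assumptions.

```python
# Raycast-based proximity penalty: each ray that hits an obstacle contributes
# a negative reward inversely proportional to the measured distance.
def obstacle_penalty(ray_distances, hit_flags, scale=0.1):
    penalty = 0.0
    for distance, hit in zip(ray_distances, hit_flags):
        if hit and distance > 1e-6:
            penalty -= scale / distance      # closer obstacle -> larger penalty
    return penalty

# Example: two of five rays detect obstacles, at 7.5 m and 3.2 m.
print(obstacle_penalty([12.0, 7.5, 3.2, 20.0, 20.0],
                       [False, True, True, False, False]))   # ~ -0.0446
```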
The pseudo-code of our proposed situation-aware DDPG-based algorithm is described in Algorithm 1. First of all, the weights $\theta^{\mu}$ and $\theta^{Q}$ of the actor/critic networks and the weights $\theta^{\mu'}$ and $\theta^{Q'}$ of the target networks are initialized. After that, a replay buffer $\mathcal{B}$ is also initialized as the last part of the initialization for each episode. For each minibatch, states $s_t$ are randomly generated, and the appropriate set of actions $a_t$ is obtained for each state $s_t$. After that, the drone/agent recognizes obstacles around itself through Raycast observation for situation-aware DRL computation and stores the information in the state $s_t$ (line 7). Then we input the $(s_t, a_t)$ pairs to the predesigned drone environments, obtain the reward $r_t$ for each pair, and observe the next state $s_{t+1}$. The actor network uses the state $s_t$ obtained from the drone as an input, and calculates an appropriate action $a_t$ as an output through two fully connected layers. The drone agent that receives the action performs the corresponding action in the environment and gets the next state $s_{t+1}$. The observed transition pairs $(s_t, a_t, r_t, s_{t+1})$ from the drone/agent are saved in the replay buffer $\mathcal{B}$, as shown in Fig. 4.
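The replay buffer $\mathcal{B}$ used in Algorithm 1 can be sketched as follows; the capacity and the (state, action, reward, next state, done) tuple layout are standard assumptions, not the authors' exact implementation.

```python
import random
from collections import deque

# Minimal replay-buffer sketch for the transition-storage step of Algorithm 1.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Save one observed transition (s_t, a_t, r_t, s_{t+1}, done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        # Random minibatch without replacement, as used in the update phase.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```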


During the next phase, the networks are updated regularly. For each time step, if the time step falls within the update period, we sample a random minibatch without replacement from the replay buffer $\mathcal{B}$. In the successive policy decision process, in order to resolve the correlation problem between samples, the learning outcome can be improved by performing learning on the replay buffer in units of minibatch samples. To obtain the TD target of the network, the algorithm calculates the equation in (line 17) of Algorithm 1. The random samples extracted from the tuples delivered to the replay buffer by the drone/agent in Fig. 4 become the inputs of the target actor network, the critic network, and the target critic network, as shown in Fig. 5. Finally, the algorithm updates the target critic and target actor networks for the remaining time steps and ends the algorithm computation procedure. The stability of learning can be increased through the usage of target networks. The update of the critic network and the two target networks using the loss function and the policy gradient method is a series of processes for maximizing the expected reward by eventually deriving more appropriate actions for the actor network.
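The control flow of this collection/update phase is summarized in the self-contained skeleton below; the environment and the network-update routine are random/no-op stubs, so only the ordering (collect transitions, then periodically sample a minibatch without replacement and update the networks) reflects the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 16, 3
batch_size, update_period, total_steps = 128, 50, 2_000

def env_reset():
    return rng.standard_normal(state_dim)

def env_step(state, action):                      # stub transition dynamics
    return rng.standard_normal(state_dim), rng.uniform(-1, 1), rng.random() < 0.01

def act_with_noise(state):                        # stands in for mu(s) + noise, cf. (15)
    return rng.standard_normal(action_dim)

def update_networks(minibatch):                   # critic/actor/target updates, cf. (10)-(14)
    pass

buffer, state = [], env_reset()
for step in range(total_steps):
    action = act_with_noise(state)
    next_state, reward, done = env_step(state, action)
    buffer.append((state, action, reward, next_state, done))   # store transition in B
    state = env_reset() if done else next_state

    if step % update_period == 0 and len(buffer) >= batch_size:
        idx = rng.choice(len(buffer), size=batch_size, replace=False)  # minibatch w/o replacement
        update_networks([buffer[i] for i in idx])
```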
IV Performance Evaluation

IV-A Unity Environment
Unity is a 3D content-creation platform, and Unity Machine Learning Agents (ML-Agents) is an API that supports artificial intelligence for training agents through Unity. ML-Agents basically supports various functions for DRL learning computation. In real-world scenarios, applying DRL algorithms to multi-drone mobility control takes much longer and can incur costs tens of times higher or more. Therefore, the use of DRL algorithms in a simulation environment is suitable, whereas it is not suitable in the physical/real world. The software architecture for the RL/DRL trainers and the Unity environment is illustrated in Fig. 6. We build the environment using the assets furnished by Unity 3D and set it up by applying ML-Agents to the environment [40]. Then the drone/agent is trained by employing a PyTorch-based reinforcement learning trainer. When the learning is completed, the trained model is embedded into the Unity environment through the communicator. Finally, the Unity environment with the trained agent is created.
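On the Python side, a trainer typically connects to a built Unity scene through the ML-Agents low-level API; the sketch below assumes a recent mlagents_envs release and a hypothetical executable name, and simply steps the environment once with a random continuous action in place of the trained actor.

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

# "DroneBattlefield" is a hypothetical built-scene name used for illustration.
env = UnityEnvironment(file_name="DroneBattlefield")
env.reset()

behavior_name = list(env.behavior_specs.keys())[0]       # the drone agent's behavior
spec = env.behavior_specs[behavior_name]

decision_steps, terminal_steps = env.get_steps(behavior_name)
obs = decision_steps.obs[0]                              # observations (incl. Raycast readings)
n_agents = len(decision_steps)

# Step once with a random continuous action; a trainer would query the actor here.
random_action = np.random.uniform(-1.0, 1.0,
                                  size=(n_agents, spec.action_spec.continuous_size)).astype(np.float32)
env.set_actions(behavior_name, ActionTuple(continuous=random_action))
env.step()
env.close()
```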
IV-B Evaluation Settings

We build the city environment (emulating urban military battlefield scenarios) through the Unity Asset Store. In addition, we add drone and tank assets to the environment scene. The drone is set as an agent, and the tank-shaped target is set as a game object. The batch size of each neural network is set to 128, and the discount factor is set to 0.9. The learning computation is carried out for a total of 100,000 steps, and the test phase is performed for 10,000 steps. The critic network updates the Q-function in the direction of reducing the difference between the predicted value and the target value. Moreover, the policy of the actor network is updated to maximize the objective function through two fully connected layers of size 128. The agents and targets are generated randomly within a designated area, and the target moves forward at a constant speed. To ensure that the agent does not consider staying in place as optimal, a default reward of -0.01 is added at every step. In order to induce the drone to approach the target gradually, the value obtained by subtracting the current distance from the previous distance is added to the reward function. When the drone reaches the target, it receives a reward of +1; if it is too far from the target or collides with an obstacle, it receives a reward of -1 and the episode ends. If an obstacle is recognized nearby through the Raycast (i.e., situation-awareness), a negative reward is given so that the agent learns to avoid the obstacle. By acquiring information about surrounding obstacles, the proposed situation-aware autonomous nonlinear drone mobility control algorithm can obtain better results compared to the conventional DDPG-based DRL algorithm.
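The reward shaping just described can be summarized in a small function; the numeric terms (+1 on reaching the target, -1 on collision or leaving the area, -0.01 per step, and the previous-minus-current distance term) follow the text, while the example distances and obstacle-penalty value are assumptions.

```python
def step_reward(prev_dist, curr_dist, reached, collided_or_lost, obstacle_penalty=0.0):
    if reached:
        return 1.0                        # reached the target (tank)
    if collided_or_lost:
        return -1.0                       # collision or too far from the target; episode ends
    reward = -0.01                        # default per-step penalty (discourages hovering)
    reward += prev_dist - curr_dist       # positive when the drone moves closer to the target
    reward += obstacle_penalty            # negative Raycast-based proximity penalty, if any
    return reward

print(step_reward(prev_dist=42.0, curr_dist=40.5,
                  reached=False, collided_or_lost=False, obstacle_penalty=-0.03))  # 1.46
```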
In order to evaluate the algorithm in various ways and in more difficult environments, a cyber environment that can be harsher (i.e., with more obstacles) than the physical situation is constructed. By setting different obstacle densities, from a cyber environment with no obstacles at all to environments in which it is hard for the agent to pass through, the robustness of the algorithm can be evaluated, especially in environments with densely distributed obstacles. Fig. 7 depicts the top-down views of the environments where the obstacle density is set to 20%, 50%, 80%, and 100%, respectively. Note that the obstacles are created with random sizes at random locations between the agent and the target in each episode. The agent and the target are also produced at random positions within a designated space.
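A sketch of how such randomized layouts could be generated for a given obstacle density is given below; the area bounds, maximum obstacle count, and size ranges are assumptions made only for illustration, not the Unity scene's actual parameters.

```python
import numpy as np

def sample_layout(density, area=100.0, max_obstacles=50, seed=None):
    # Obstacles of random size near random points on the agent-target segment,
    # with the obstacle count scaled by the density in [0, 1].
    rng = np.random.default_rng(seed)
    agent = rng.uniform(0.0, area, size=3)
    target = rng.uniform(0.0, area, size=3)
    n_obstacles = int(round(density * max_obstacles))
    t = rng.uniform(0.1, 0.9, size=(n_obstacles, 1))
    centers = agent + t * (target - agent) + rng.normal(0.0, 5.0, size=(n_obstacles, 3))
    sizes = rng.uniform(2.0, 15.0, size=(n_obstacles, 3))
    return agent, target, centers, sizes

agent, target, centers, sizes = sample_layout(density=0.5, seed=0)  # 50% density
```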

IV-C Evaluation Results



Fig. 8 shows the actor loss and critic loss, which indicate the convergence of the proposed actor/critic-enabled DDPG-based DRL algorithm. The actor loss and critic loss values decrease as learning data are accumulated, as described in Fig. 8. A decrease in the actor loss means that the action value is maximized, and thus the cumulative reward is maximized. In addition, the reduction in the critic loss indicates that the critic network approaches the optimal Q-function according to the Bellman expectation equation, as described in (7). Fig. 9 depicts the case where a drone reaches a target (tank) while avoiding obstacles in six frames, in a physical urban environment. As shown in Fig. 9, the trained drone/agent can accurately reach the target (tank). The white arrow is the predicted trajectory of the conventional linear mobility control (LMC), which may not be able to reach the target. It can be seen that the learning was performed smoothly in both the visualized simulations and the corresponding numerical values.
In order to show the robustness of the algorithm, the experimental results in the cyber-physical space with many obstacles are shown in Fig. 10 and Fig. 11. Thanks to the nature of autonomous nonlinear drone mobility control, the drone/agent found its destination well, even in cyber CPS environments with dense obstacles. In Fig. 10, we can observe that the drone/agent controls its altitude autonomously, as displayed through the drone/agent's shadow. We can check the trajectory drawn by the drone/agent through the top-down view, but we cannot confirm the height differences of the buildings. For more detail, in Fig. 11, we can see that the drone/agent controls its altitude autonomously, as depicted in a third-person view, avoiding obstacles and reaching the target/tank. In the top-down view, the drone may seem to fly very close to the buildings, but the third-person view in Fig. 11 confirms that the drone adjusts its altitude and distance from the buildings while flying.


Fig. 12 shows the number of successful attacks according to the density of obstacles. In the cyber-physical visualization environment, each experiment was performed for 10 rounds of 20 trials per environment, and the number of times that the target was successfully reached/destroyed was measured in each round. If there are no obstacles, the drone/agent successfully reaches the target/tank in all trials; however, the accuracy decreases as the density of the obstacles increases. Notice that the comparison between Fig. 12(a) and Fig. 12(b) is presented in Appendix A. Moreover, these results are numerically presented in Table I and Table II. When the obstacle density reaches 50%, the comparing LMC algorithm shows about 48% lower performance than our proposed situation-aware autonomous nonlinear drone mobility control algorithm. In addition, even with the maximum number of obstacles deployed, our proposed algorithm succeeded in attacking at least four times. This matches the maximum number of successful attacks of the conventional DDPG algorithm in the same environment and shows a maximum attack success rate 2.5 times better than that of the conventional DDPG algorithm. In particular, in environments with high obstacle density, the comparing LMC algorithm is almost impossible to use, whereas our proposed situation-aware autonomous nonlinear drone mobility control algorithm can still be operated in the same environment. In Table III, the performance difference between the two algorithms is compared and summarized (a short sketch reproducing Table III from the average rows of Tables I and II is given after Table III). Overall, a higher density of obstacles leads to larger performance improvements. When the obstacle density exceeds 70%, the performance improvement becomes at least 100%, as presented in Table III.
Nonlinear Mobility Control (Proposed) | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Round 1 | 20/20 | 18/20 | 16/20 | 14/20 | 14/20 | 12/20 | 10/20 | 13/20 | 10/20 | 8/20 | 7/20 |
Round 2 | 20/20 | 15/20 | 17/20 | 17/20 | 15/20 | 10/20 | 7/20 | 9/20 | 8/20 | 7/20 | 7/20 |
Round 3 | 20/20 | 18/20 | 13/20 | 12/20 | 17/20 | 13/20 | 8/20 | 9/20 | 11/20 | 5/20 | 5/20 |
Round 4 | 20/20 | 16/20 | 18/20 | 17/20 | 16/20 | 11/20 | 9/20 | 6/20 | 7/20 | 8/20 | 7/20 |
Round 5 | 20/20 | 17/20 | 17/20 | 14/20 | 16/20 | 12/20 | 14/20 | 11/20 | 5/20 | 10/20 | 10/20 |
Round 6 | 20/20 | 18/20 | 17/20 | 18/20 | 15/20 | 12/20 | 6/20 | 6/20 | 11/20 | 9/20 | 8/20 |
Round 7 | 20/20 | 18/20 | 17/20 | 17/20 | 16/20 | 12/20 | 14/20 | 13/20 | 6/20 | 11/20 | 10/20 |
Round 8 | 20/20 | 15/20 | 19/20 | 17/20 | 17/20 | 14/20 | 8/20 | 9/20 | 13/20 | 5/20 | 4/20 |
Round 9 | 20/20 | 19/20 | 15/20 | 16/20 | 16/20 | 13/20 | 12/20 | 7/20 | 7/20 | 5/20 | 5/20 |
Round 10 | 20/20 | 18/20 | 17/20 | 16/20 | 10/20 | 10/20 | 11/20 | 8/20 | 8/20 | 9/20 | 10/20 |
Maximum | 20 | 19 | 19 | 18 | 17 | 14 | 14 | 13 | 13 | 11 | 10 |
Average | 20 | 17.2 | 16.6 | 15.8 | 15.2 | 11.9 | 9.9 | 9.1 | 8.6 | 7.7 | 7.3 |
Median | 20 | 18 | 17 | 16.5 | 16 | 12 | 9.5 | 9 | 8 | 8 | 7 |
Minimum | 20 | 15 | 13 | 12 | 10 | 10 | 6 | 6 | 5 | 5 | 4 |
Linear Mobility Control (DDPG) | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Round 1 | 20/20 | 13/20 | 17/20 | 15/20 | 9/20 | 4/20 | 5/20 | 6/20 | 4/20 | 2/20 | 2/20 |
Round 2 | 20/20 | 17/20 | 13/20 | 13/20 | 10/20 | 8/20 | 4/20 | 4/20 | 2/20 | 1/20 | 1/20 |
Round 3 | 20/20 | 17/20 | 15/20 | 14/20 | 6/20 | 6/20 | 4/20 | 5/20 | 2/20 | 3/20 | 1/20 |
Round 4 | 20/20 | 13/20 | 14/20 | 9/20 | 11/20 | 8/20 | 2/20 | 6/20 | 3/20 | 2/20 | 2/20 |
Round 5 | 20/20 | 18/20 | 18/20 | 10/20 | 12/20 | 6/20 | 9/20 | 3/20 | 1/20 | 3/20 | 1/20 |
Round 6 | 20/20 | 15/20 | 13/20 | 14/20 | 7/20 | 8/20 | 10/20 | 4/20 | 4/20 | 3/20 | 0/20 |
Round 7 | 20/20 | 13/20 | 11/20 | 8/20 | 8/20 | 4/20 | 5/20 | 5/20 | 3/20 | 1/20 | 2/20 |
Round 8 | 20/20 | 17/20 | 15/20 | 11/20 | 6/20 | 4/20 | 5/20 | 5/20 | 2/20 | 1/20 | 3/20 |
Round 9 | 20/20 | 18/20 | 10/20 | 7/20 | 4/20 | 7/20 | 4/20 | 3/20 | 2/20 | 0/20 | 0/20 |
Round 10 | 20/20 | 16/20 | 14/20 | 8/20 | 13/20 | 6/20 | 7/20 | 3/20 | 4/20 | 1/20 | 4/20 |
Maximum | 20 | 18 | 18 | 15 | 13 | 8 | 10 | 6 | 4 | 3 | 4 |
Average | 20 | 15.7 | 14 | 10.9 | 8.6 | 6.1 | 5.5 | 4.4 | 2.7 | 1.7 | 1.6 |
Median | 20 | 16.5 | 14 | 10.5 | 8.5 | 6 | 5 | 4.5 | 2.5 | 1.5 | 1.5 |
Minimum | 20 | 13 | 10 | 7 | 4 | 4 | 2 | 3 | 1 | 0 | 0 |
Performance Improvement(%) | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Maximum | 0 | 5.5 | 5.5 | 20 | 30.77 | 75 | 40 | 116.67 | 225 | 266.67 | 150 |
Average | 0 | 9.55 | 18.57 | 44.95 | 76.74 | 95.08 | 80 | 106.82 | 218.52 | 352.94 | 356.25 |
Median | 0 | 9.09 | 21.43 | 57.14 | 88.24 | 100 | 90 | 100 | 220 | 433.33 | 366.67 |
Minimum | 0 | 15.38 | 30 | 71.43 | 150 | 150 | 200 | 100 | 400 | - | - |
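The percentage improvements in Table III can be reproduced directly from the average rows of Tables I and II, as the short script below shows.

```python
# Reproducing the "Average" row of Table III: percentage improvement of the
# proposed nonlinear control over linear mobility control (DDPG).
densities     = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
avg_nonlinear = [20, 17.2, 16.6, 15.8, 15.2, 11.9, 9.9, 9.1, 8.6, 7.7, 7.3]
avg_linear    = [20, 15.7, 14.0, 10.9,  8.6,  6.1, 5.5, 4.4, 2.7, 1.7, 1.6]

for d, p, l in zip(densities, avg_nonlinear, avg_linear):
    print(f"{d:3d}%  average improvement: {100.0 * (p - l) / l:6.2f}%")
# e.g., at 50% density: (11.9 - 6.1) / 6.1 * 100 = 95.08%, matching Table III.
```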
V Concluding Remarks and Future Work
This paper proposes a novel situation-aware DRL-based autonomous nonlinear drone mobility control algorithm for cyber-physical loitering munition applications. On the battlefield, the design and implementation of a DRL-based algorithm are not straightforward because real-world data gathering is not easy at all. Therefore, the cyber-physical virtual environment is constructed with Unity in this paper. Based on that, a DRL-based automated drone mobility control algorithm can be designed, evaluated, and visualized. In real-world battlefield scenarios, many obstacles exist that are harmful to linear trajectory control. Therefore, this paper proposes a novel nonlinear drone mobility control using situation-aware components (e.g., the Raycast function in Unity for the virtual environment). Based on the sensed situation-aware information, the drone can adjust its trajectory during flight. Therefore, this approach is definitely beneficial for avoiding obstacles on the battlefield. Our visualization-based performance evaluation shows that the proposed algorithm is superior to conventional linear mobility control algorithms.
As future research directions, the proposed algorithm can be implemented on top of embedded drone platforms where the platforms are equipped with multiple sensors for situation-awareness functionalities.
References
- [1] H. Lee, W. J. Yun, S. Jung, J.-H. Kim, and J. Kim, “DDPG-based deep reinforcement learning for loitering munition mobility control: Algorithm design and visualization,” in Proc. IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), Seoul, Korea, August 2022.
- [2] S. Anokye, D. Ayepah-Mensah, A. M. Seid, G. O. Boateng, and G. Sun, “Deep reinforcement learning-based mobility-aware UAV content caching and placement in mobile edge networks,” IEEE Systems Journal, vol. 16, no. 1, pp. 275–286, 2022. [Online]. Available: https://doi.org/10.1109/JSYST.2021.3082837
- [3] W. Na, N. Dao, and S. Cho, “Reinforcement-learning-based spatial resource identification for IoT D2D communications,” IEEE Systems Journal, vol. 16, no. 1, pp. 1068–1079, 2022. [Online]. Available: https://doi.org/10.1109/JSYST.2021.3087167
- [4] Y. Shi, E. Alsusa, and M. W. Baidas, “Energy-efficient decoupled access scheme for cellular-enabled UAV communication systems,” IEEE Systems Journal, vol. 16, no. 1, pp. 701–712, 2022. [Online]. Available: https://doi.org/10.1109/JSYST.2020.3046689
- [5] C. Singh, R. Mishra, H. P. Gupta, and P. Kumari, “The Internet of drones in precision agriculture: Challenges, solutions, and research opportunities,” IEEE Internet of Things Magazine, vol. 5, no. 1, pp. 180–184, March 2022.
- [6] F. Kong, J. Li, B. Jiang, H. Wang, and H. Song, “Trajectory optimization for drone logistics delivery via attention-based pointer network,” IEEE Transactions on Intelligent Transportation Systems (Early Access), pp. 1–13, May 2022.
- [7] W. J. Yun, S. Jung, J. Kim, and J.-H. Kim, “Distributed deep reinforcement learning for autonomous aerial eVTOL mobility in drone taxi applications,” ICT Express, vol. 7, no. 1, pp. 1–4, March 2021.
- [8] J. Hu, X. Yang, W. Wang, P. Wei, L. Ying, and Y. Liu, “Obstacle avoidance for UAS in continuous action space using deep reinforcement learning,” IEEE Access, vol. 10, pp. 90 623–90 634, August 2022.
- [9] S. Qamar, S. H. Khan, M. A. Arshad, M. Qamar, J. Gwak, and A. Khan, “Autonomous drone swarm navigation and multi-target tracking with island policy-based optimization framework,” IEEE Access, pp. 1–1, August 2022.
- [10] U. Bodkhe, D. Mehta, S. Tanwar, P. Bhattacharya, P. K. Singh, and W.-C. Hong, “A survey on decentralized consensus mechanisms for cyber physical systems,” IEEE Access, vol. 8, pp. 54 371–54 401, 2020.
- [11] R. Atat, L. Liu, J. Wu, G. Li, C. Ye, and Y. Yang, “Big data meet cyber-physical systems: A panoramic survey,” IEEE Access, vol. 6, pp. 73 603–73 636, 2018.
- [12] H.-D. Tran, F. Cai, M. L. Diego, P. Musau, T. T. Johnson, and X. Koutsoukos, “Safety verification of cyber-physical systems with reinforcement learning control,” ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1–22, 2019.
- [13] C. Sun, G. Cembrano, V. Puig, and J. Meseguer, “Cyber-physical systems for real-time management in the urban water cycle,” in Proc. IEEE International Workshop on Cyber-physical Systems for Smart Water Networks (CySWater), 2018, pp. 5–8.
- [14] Z. Wang, H. Song, D. W. Watkins, K. G. Ong, P. Xue, Q. Yang, and X. Shi, “Cyber-physical systems for water sustainability: Challenges and opportunities,” IEEE Communications Magazine, vol. 53, no. 5, pp. 216–222, 2015.
- [15] W.-T. Wang, C.-H. Chang, and R.-N. Sheng, “The study on the implementation of multi-axis cutting & cyber-physical system on unity 3D platform,” in Proc. IEEE International Conference on Industrial Engineering and Applications (ICIEA), 2019, pp. 77–80.
- [16] W.-Y. Chang, B.-Y. Hsu, and J.-W. Hsu, “Real-time collision avoidance for five-axis CNC machine tool based on cyber-physical system,” in Proc. IEEE International Conference on Advanced Manufacturing (ICAM), 2018, pp. 284–287.
- [17] C. Villacís, W. Fuertes, L. Escobar, F. Romero, and S. Chamorro, “A new real-time flight simulator for military training using mechatronics and cyber-physical system methods,” Military Engineering, pp. 73–90, 2020.
- [18] K. Armstrong, “North Korea drones: South’s military apologises for pursuit failure,” BBC, December 2022, last accessed 30 December 2022. [Online]. Available: https://www.bbc.com/news/world-asia-64100974
- [19] D. Hambling, “Failure Or Savior? Busting Myths About Switchblade Loitering Munitions In Ukraine,” Forbes, June 2022, last accessed 16 November 2022. [Online]. Available: https://www.forbes.com/sites/davidhambling/2022/06/08/failure-or-savior-busting-myths-about-switchblade-loitering-munitions-in-ukraine
- [20] S. Rosenberg and J. Lukiv, “Ukraine war: Drone attack on Russian bomber base leaves three dead,” BBC, December 2022, last accessed 27 December 2022. [Online]. Available: https://www.bbc.com/news/world-europe-64092183
- [21] D. Hambling, “Ukraine Wins First Drone Vs. Drone Dogfight Against Russia, Opening A New Era Of Warfare,” Forbes, October 2022, last accessed 16 November 2022. [Online]. Available: https://www.forbes.com/sites/davidhambling/2022/10/14/ukraine-wins-first-drone-vs-drone-dogfight-against-russia-opening-a-new-era-of-warfare
- [22] P. K. Ben Hubbard and S. Reed, “Two Major Saudi Oil Installations Hit by Drone Strike, and U.S. Blames Iran,” The New York Times, September 2019, last accessed 16 November 2022. [Online]. Available: https://www.nytimes.com/2019/09/14/world/middleeast/saudi-arabia-refineries-drone-attack.html
- [23] D. Kwon, J. Jeon, S. Park, J. Kim, and S. Cho, “Multiagent DDPG-based deep learning for smart ocean federated learning IoT networks,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9895–9903, 2020.
- [24] T. M. Ho, K.-K. Nguyen, and M. Cheriet, “Energy-aware control of UAV-based wireless service provisioning,” in Proc. IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1–6.
- [25] D. Kwon and J. Kim, “Multi-agent deep reinforcement learning for cooperative connected vehicles,” in Proc. IEEE Global Communications Conference (GLOBECOM), December 2019, pp. 1–6.
- [26] T. M. Ho, K.-K. Nguyen, and M. Cheriet, “UAV control for wireless service provisioning in critical demand areas: A deep reinforcement learning approach,” IEEE Transactions on Vehicular Technology, vol. 70, no. 7, pp. 7138–7152, 2021.
- [27] G. Lee, W. J. Yun, S. Jung, J. Kim, and J.-H. Kim, “Visualization of deep reinforcement autonomous aerial mobility learning simulations,” in Proc. IEEE Conference on Computer Communications Demo (INFOCOM Demo), Virtual, May 2021, pp. 1–2.
- [28] Q. Wang, A. Gao, and Y. Hu, “Joint power and QoE optimization scheme for multi-UAV assisted offloading in mobile computing,” IEEE Access, vol. 9, pp. 21 206–21 217, 2021.
- [29] X. Luo, Y. Zhang, Z. He, G. Yang, and Z. Ji, “A two-step environment-learning-based method for optimal UAV deployment,” IEEE Access, vol. 7, pp. 149 328–149 340, 2019.
- [30] S. Jung, W. J. Yun, J. Kim, and J.-H. Kim, “Infrastructure-assisted cooperative multi-UAV deep reinforcement energy trading learning for big-data processing,” in Proc. International Conference on Information Networking (ICOIN), Jeju Island, Korea (South), January 2021, pp. 159–162.
- [31] J.-H. Lee, K.-H. Park, Y.-C. Ko, and M.-S. Alouini, “A UAV-mounted free space optical communication: Trajectory optimization for flight time,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 1610–1621, March 2020.
- [32] W. J. Yun, S. Park, J. Kim, M. Shin, S. Jung, D. A. Mohaisen, and J.-H. Kim, “Cooperative multiagent deep reinforcement learning for reliable surveillance via autonomous Multi-UAV control,” IEEE Transactions on Industrial Informatics, vol. 18, no. 10, pp. 7086–7096, October 2022.
- [33] Y. Dang, C. Benzaïd, B. Yang, T. Taleb, and Y. Shen, “Deep-ensemble-learning-based GPS spoofing detection for cellular-connected UAVs,” IEEE Internet of Things Journal, vol. 9, no. 24, pp. 25 068–25 085, 2022.
- [34] A. Soliman, M. Bahri, D. Izham, and A. Mohamed, “AirEye: UAV-based intelligent DRL mobile target visitation,” in 2022 International Wireless Communications and Mobile Computing (IWCMC), 2022, pp. 554–559.
- [35] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah, “A tutorial on UAVs for wireless networks: Applications, challenges, and open problems,” IEEE Communications Surveys and Tutorials, vol. 21, no. 3, pp. 2334–2360, 2019.
- [36] B. Jiang, S. N. Givigi, and J.-A. Delamer, “A marl approach for optimizing positions of VANET aerial base-stations on a sparse highway,” IEEE Access, vol. 9, pp. 133 989–134 004, 2021.
- [37] M. Shin, J. Kim, and M. Levorato, “Auction-based charging scheduling with deep learning framework for multi-drone networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 5, pp. 4235–4248, May 2019.
- [38] S. Jung, W. J. Yun, M. Shin, J. Kim, and J.-H. Kim, “Orchestrated scheduling and multi-agent deep reinforcement learning for cloud-assisted multi-UAV charging systems,” IEEE Transactions on Vehicular Technology, vol. 70, no. 6, pp. 5362–5377, June 2021.
- [39] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. 4th International Conference on Learning Representations (ICLR), May 2016.
- [40] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, and D. Lange, “Unity: A general platform for intelligent agents,” arXiv preprint arXiv:1809.02627, 2018.
Appendix A Performance Comparison
In Fig. 13, we can compare the performance of the two algorithms, i.e., our proposed nonlinear drone mobility control and the comparing linear drone mobility control, at a glance. As the obstacle density increases, performance degradation occurs in both algorithms. However, we can confirm that our proposed nonlinear mobility control suffers much less performance degradation. Furthermore, we can also confirm that our proposed algorithm becomes increasingly advantageous as the density of obstacles increases, i.e., the performance gap between the proposed nonlinear mobility control algorithm and the comparing linear mobility control becomes larger when the density of obstacles increases.

Hyunsoo Lee is currently pursuing the Ph.D. degree in electrical and computer engineering at Korea University, Seoul, Republic of Korea. He received the B.E. degree in electronic engineering from Soongsil University, Seoul, Republic of Korea, in 2021. His research focuses include deep learning algorithms and their applications to mobility and networking. He was a recipient of the IEEE Vehicular Technology Society (VTS) Seoul Chapter Award in 2022.
Soohyun Park is currently pursuing the Ph.D. degree in electrical and computer engineering at Korea University, Seoul, Republic of Korea. She received the B.S. degree in computer science and engineering from Chung-Ang University, Seoul, Republic of Korea, in 2019. Her research focuses include deep learning algorithms and their applications to autonomous mobility and connected vehicles. She was a recipient of the IEEE Vehicular Technology Society (VTS) Seoul Chapter Award in 2019.
Won Joon Yun has been a Ph.D. student in electrical and computer engineering at Korea University, Seoul, Republic of Korea, since March 2021, where he received his B.S. in electrical engineering. He was a visiting researcher at Cipherome Inc., San Jose, CA, USA (Summer 2022); and also a visiting researcher at the University of Southern California (USC), Los Angeles, CA, USA (Spring 2023) for a joint project with Prof. Andreas F. Molisch at the Ming Hsieh Department of Electrical and Computer Engineering, USC Viterbi School of Engineering. He was a recipient of the IEEE ICOIN Best Paper Award (2021).
Soyi Jung has been an assistant professor at the Department of Electrical and Computer Engineering, Ajou University, Suwon, Republic of Korea, since September 2022. Before joining Ajou University, she was an assistant professor at Hallym University, Chuncheon, Republic of Korea, from 2021 to 2022; a visiting scholar at Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA, USA, from 2021 to 2022; a research professor at Korea University, Seoul, Republic of Korea, in 2021; and a researcher at Korea Testing and Research (KTR) Institute, Gwacheon, Republic of Korea, from 2015 to 2016. She received her B.S., M.S., and Ph.D. degrees in electrical and computer engineering from Ajou University, Suwon, Republic of Korea, in 2013, 2015, and 2021, respectively. Her current research interests include network optimization for autonomous vehicles communications, distributed system analysis, big-data processing platforms, and probabilistic access analysis. She was a recipient of Best Paper Award by KICS (2015), Young Women Researcher Award by WISET and KICS (2015), Bronze Paper Award from IEEE Seoul Section Student Paper Contest (2018), ICT Paper Contest Award by Electronic Times (2019), and IEEE ICOIN Best Paper Award (2021).
Joongheon Kim (M’06–SM’18) has been with Korea University, Seoul, Korea, since 2019, where he is currently an associate professor at the School of Electrical Engineering and also an adjunct professor at the Department of Communications Engineering (established/sponsored by Samsung Electronics) and the Department of Semiconductor Engineering (established/sponsored by SK Hynix). He received the B.S. and M.S. degrees in computer science and engineering from Korea University, Seoul, Korea, in 2004 and 2006; and the Ph.D. degree in computer science from the University of Southern California (USC), Los Angeles, CA, USA, in 2014. Before joining Korea University, he was a research engineer with LG Electronics (Seoul, Korea, 2006–2009), a systems engineer with Intel Corporation (Santa Clara, CA, USA, 2013–2016), and an assistant professor of computer science and engineering with Chung-Ang University (Seoul, Korea, 2016–2019). He serves as an editor for IEEE Transactions on Vehicular Technology, IEEE Transactions on Machine Learning in Communications and Networking, and IEEE Communications Standards Magazine. He is also a distinguished lecturer for IEEE Communications Society (ComSoc) and IEEE Systems Council. He was a recipient of Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), IEEE Systems Journal Best Paper Award (2020), IEEE ComSoc Multimedia Communications Technical Committee (MMTC) Outstanding Young Researcher Award (2020), IEEE ComSoc MMTC Best Journal Paper Award (2021), and Best Special Issue Guest Editor Award by ICT Express (Elsevier) (2022). He also received several awards from IEEE conferences including IEEE ICOIN Best Paper Award (2021), IEEE Vehicular Technology Society (VTS) Seoul Chapter Awards (2019, 2021, and 2022), and IEEE ICTC Best Paper Award (2022).