
Distributed Computation Offloading for Energy Provision Minimization in WP-MEC Networks with Multiple HAPs

Xiaoying Liu, Anping Chen, Kechen Zheng, Kaikai Chi, Bin Yang, and Tarik Taleb

X. Liu, A. Chen, K. Zheng, and K. Chi are with the School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China. E-mail: {xiaoyingliu, 221122120218, kechenzheng, kkchi}@zjut.edu.cn. B. Yang is with the School of Computer and Information Engineering, Chuzhou University, Chuzhou 239000, China. E-mail: yangbinchi@gmail.com. T. Taleb is with the Faculty of Electrical Engineering and Information Technology, Ruhr University Bochum, Bochum 44801, Germany. E-mail: tarik.taleb@rub.de.
Abstract

This paper investigates a wireless powered mobile edge computing (WP-MEC) network with multiple hybrid access points (HAPs) in a dynamic environment, where wireless devices (WDs) harvest energy from the radio frequency (RF) signals of HAPs, and then compute their computation data locally (i.e., local computing mode) or offload it to the chosen HAPs (i.e., edge computing mode). To pursue a green computing design, we formulate an optimization problem that minimizes the long-term energy provision of the WP-MEC network subject to the energy, computing delay, and computation data demand constraints. The transmit power of HAPs, the duration of the wireless power transfer (WPT) phase, the offloading decisions of WDs, the time allocation for offloading, and the CPU frequency for local computing are jointly optimized to adapt to the time-varying generated computation data and wireless channels of WDs. To efficiently address the formulated non-convex mixed integer programming (MIP) problem in a distributed manner, we propose a Two-stage Multi-Agent deep reinforcement learning-based Distributed computation Offloading (TMADO) framework, which consists of a high-level agent and multiple low-level agents. The high-level agent residing in all HAPs optimizes the transmit power of HAPs and the duration of the WPT phase, while each low-level agent residing in each WD optimizes its offloading decision, time allocation for offloading, and CPU frequency for local computing. Simulation results show the superiority of the proposed TMADO framework in terms of energy provision minimization.

Index Terms:
Mobile edge computing, wireless power transfer, multi-agent deep reinforcement learning, energy provision minimization.

I Introduction

The fast development of the Internet of Things (IoT) has driven various new applications, such as automatic navigation and autonomous driving [1]. These new applications impose a great demand on the computing capabilities of wireless devices (WDs), since they involve computation-intensive and latency-sensitive tasks [2]. However, most WDs in IoT have low computing capabilities. Mobile edge computing (MEC) [3] has been identified as one of the promising technologies to improve the computing capabilities of WDs by offloading computing tasks from WDs to surrounding MEC servers hosted at access points (APs) or base stations (BSs). In this way, the computing capabilities of MEC servers can be shared with WDs [4]. For MEC, there are two computation offloading policies, i.e., the binary offloading policy and the partial offloading policy [5], [6]. The former is appropriate for indivisible computing tasks, where each task is either computed locally at WDs (i.e., local computing mode) or entirely offloaded to the MEC server for computing (i.e., edge computing mode). The latter is appropriate for arbitrarily divisible computing tasks, where each task is divided into two parts: one part is computed locally at WDs, and the other part is offloaded to the MEC server for computing. Works [7] and [8] focused on the partial offloading policy in a multi-user multi-server MEC environment, formulated a non-cooperative game, and proved the existence of the Nash equilibrium.

Energy supply is a key factor impacting the computing performance, such as the computing delay and computing rate, and the offloading decisions, such as the computing mode selections under the binary offloading policy and the offloading volume under the partial offloading policy. However, most WDs in IoT are powered by batteries with finite capacity. Frequent battery replacement is extremely costly or even impractical in hard-to-reach locations, which limits the lifetime of WDs. To break this limitation, the wireless power transfer (WPT) technology [9, 10], which realizes wireless charging of WDs by using hybrid access points (HAPs) or energy access points (EAPs) to broadcast radio frequency (RF) signals, is widely regarded as a viable solution due to its advantages of stability and controllability in energy supply [11, 12]. As such, wireless powered mobile edge computing (WP-MEC) has recently aroused increasing attention [13], since it combines the advantages of the MEC and WPT technologies, i.e., enhancing the computing capabilities of WDs while providing sustainable energy supply.

TABLE I: Comparison between this paper and the related works.
Ref.  Goal  Energy supply  Multiple servers  Binary offloading  Long-term optimization  Computation data demand of network  Distributed computation offloading
[14]  computation rate  ✓  ✓  ✓
[15]  energy efficiency  ✓  ✓
[16]  energy provision  ✓  ✓  ✓
[18]  energy provision  ✓  ✓
[19]  energy provision  ✓
[20]  energy efficiency  ✓
[21]  energy efficiency  ✓  ✓
[22]  computation rate  ✓  ✓  ✓
[23]  computation delay  ✓  ✓  ✓  ✓  ✓
[24]  energy provision  ✓
Ours  energy provision  ✓  ✓  ✓  ✓  ✓  ✓
✓ denotes the existence of the feature.

Due to the coupling of the energy supply and the communication/computation demands of WDs, the critical issue in WP-MEC networks is how to reasonably allocate the energy resource and make offloading decisions, so as to optimize network performance. Consequently, many works [14, 15, 16, 17, 18, 19, 20, 21] have addressed this issue. Under the binary offloading policy, Bi et al. [14] maximized the weighted sum computation rate of WDs in a multi-user WP-MEC network by jointly optimizing WDs' computing mode selections and the time allocation between WPT and data offloading. Under the max-min fairness criterion, Zhou et al. [15] maximized the computation efficiency of a multi-user WP-MEC network by jointly optimizing the duration of the WPT phase, the CPU frequency, and the time allocation and offloading power of WDs. Chen et al. [16] considered a WP-MEC network where a multiple-antenna BS serves multiple WDs, and proposed an augmented two-stage deep Q-network (DQN) algorithm to minimize the average energy requirement of the network. Under the partial offloading policy, Park et al. [17] investigated a WP-MEC network with the simultaneous wireless information and power transmission (SWIPT) technique, and minimized the computation delay by jointly optimizing the data offloading ratio, the data offloading power, the CPU frequency of the WD, and the power splitting ratio. In a single-user WP-MEC network consisting of a multi-antenna EAP, a MEC server, and a WD, Wang et al. [18] minimized the transmission energy consumption of the EAP during the WPT phase by jointly optimizing the energy allocation during the WPT phase at the EAP and the data allocation for offloading at the WD. In a two-user WP-MEC network with the nonorthogonal multiple access (NOMA) protocol, Zeng et al. [19] also minimized the transmission energy consumption of the EAP, similar to [18], under the energy and computing delay constraints, and proposed an iterative algorithm to solve the problem. Different from the above works, works [20] and [21] focused on the backscatter-assisted WP-MEC network, where WDs harvest energy for local computing and data offloading through active transmission and backscatter communication. Considering the limited computation capacity of the MEC server, Ye et al. [20] maximized the computation energy efficiency and the total computation bits, respectively, by proposing two resource allocation schemes. By leveraging the NOMA protocol to enhance backscatter communication, Du et al. [21] maximized the computation energy efficiency. The aforementioned works [14, 15, 16, 17, 18, 19, 20, 21], however, only focused on WP-MEC networks with a single HAP, which makes it difficult to efficiently process the massive amount of computation data offloaded from a large number of WDs.

Recently, a few works [22], [23] have studied WP-MEC networks with multiple HAPs, which are practical for large-scale IoT. Specifically, with the goal of computation rate maximization for a single time slot, Zhang et al. [22] first obtained the near-optimal offloading decisions by proposing an online deep reinforcement learning (DRL)-based algorithm, and then optimized the time allocation by designing a Lagrangian duality-based algorithm. Wang et al. [23] focused on the long-term average task completion delay minimization problem, and proposed an online learning algorithm implemented distributively for HAPs to learn both the duration of the WPT phase and the offloading decisions at each time slot. Actually, besides the computation rate [14], [22], computation efficiency [15], [20], and computation delay [17], [23], the energy provision is also a very important metric for evaluating the design of WP-MEC networks [24]. However, as far as we know, the energy provision minimization of WP-MEC networks with multiple HAPs has seldom been studied. Although works [16] and [18] have studied it in WP-MEC networks with one HAP, their designs cannot be applied to WP-MEC networks with multiple HAPs due to the complex association between WDs and HAPs.

To fill this gap, we study the long-term energy provision minimization problem of a multi-HAP WP-MEC network in a dynamic environment, where WDs harvest energy from the RF signals of HAPs, and then compute or offload the computation data under the binary offloading policy. Besides the energy constraint, the computing delay and computation data demand constraints should be satisfied in order to ensure the computing performance of the network. The computing delay constraint ensures that the computation data generated at each time slot is computed either locally or remotely at HAPs within the acceptable duration. The computation data demand constraint ensures that the total amount of the processed computation data at each time slot is no smaller than the required computation data demand, which is in accordance with the real demand. Since the amount of computation data generated by WDs and the channel gains between HAPs and WDs are uncertain in a dynamic network environment, conventional optimization methods, such as convex optimization and approximation algorithms, can hardly address the problem well. Fortunately, DRL has been demonstrated in the literature to be a more flexible and robust approach that adapts the MEC offloading decisions and the resource allocation by interacting with the dynamic network environment [25], [26]. Hence we exploit the DRL approach to address the problem. A straightforward implementation of the DRL approach is to employ a centralized agent at HAPs to collect all network information and then adapt the actions for HAPs and WDs. However, with the increasing number of HAPs/WDs, the state space and action space grow explosively, resulting in long training time and poor performance. To address this dilemma, adopting a distributed computation offloading framework is a promising solution. We summarize the differences between this paper and the related works in TABLE I, so as to highlight the novelty of this paper. The main contributions are summarized as follows.

  • In order to pursue a green computing design for a multi-HAP WP-MEC network, we formulate an energy provision minimization problem by jointly optimizing the transmit power of HAPs, the duration of the WPT phase, the offloading decisions of WDs, the time allocation for offloading, and the CPU frequency for local computing, subject to the energy, computing delay, and computation data demand constraints. The formulated non-convex mixed integer programming (MIP) problem is very challenging to tackle; we prove that the problem for a single time slot is NP-hard.

  • To efficiently tackle the non-convex MIP problem in a distributed manner, we decompose it into three subproblems, and then propose a two-stage multi-agent DRL-based distributed computation offloading (TMADO) framework to solve them. The main idea is that the high-level agent residing in all HAPs is responsible for solving the first subproblem, i.e., optimizing the transmit power of HAPs and the duration of the WPT phase, while the low-level agents residing in WDs are responsible for solving the second and third subproblems, i.e., in a distributed way, each WD optimizes its offloading decision, time allocation for offloading, and CPU frequency for local computing.

  • Simulation results validate the superiority of the proposed TMADO framework in terms of energy provision minimization compared with benchmark schemes. It is observed that when the number of HAPs/WDs reaches a certain value, the scheme with only edge computing mode outperforms that with only local computing mode in terms of energy provision minimization, due to the reduced average distance between HAPs and WDs. It is also observed that, with the purpose of minimizing the energy provision of HAPs, the WDs with high channel gains are prone to select local computing mode, and vice versa.

The rest of this paper is organized as follows. In Section II, we introduce the system model of the WP-MEC network with multiple HAPs. In Section III, we formulate the energy provision minimization problem. In Section IV, we present the proposed TMADO framework. In Section V, we present the simulation results. Section VI concludes this paper.

II System Model

II-A Network Model

Figure 1: An example of the multi-HAP WP-MEC network.
Figure 2: The time slot structure.

As shown in Fig. 1, we study a WP-MEC network, where $M$ one-antenna HAPs and $N$ one-antenna WDs coexist. Let $\mathcal{M}=\{1,2,\ldots,M\}$ denote the set of HAPs, and $\mathcal{N}=\{1,2,\ldots,N\}$ denote the set of WDs. Equipped with one MEC server, each HAP broadcasts the RF signal to WDs through the downlink channel, and receives the computation data from WDs through the uplink channel. Equipped with one battery with capacity $E_{b}$, each WD harvests energy from RF signals for local computing or offloading the computation data to the chosen HAP. We consider that WDs adopt the harvest-then-offload protocol [15], i.e., WDs harvest energy before offloading data to the HAPs. In accordance with the real demand, we define the transmission zone as a circle centered at the WD with radius $R_{t}$. A WD could offload its computation data only to the HAPs within the corresponding transmission zone.

To process the generated computation data in the WP-MEC network, we consider that WDs follow the binary offloading policy, i.e., there are two computing modes for WDs. The WD in local computing mode processes the computation data locally by utilizing the computing units, and the WD in edge computing mode offloads the computation data to the chosen HAP through the uplink channel. After receiving the computation data, the HAP processes the received computation data, and transmits computation results back to the WD through the downlink channel.

As illustrated in Fig. 2, the system time is divided into $\mathcal{T}$ time slots with equal duration $T$. Time slot $t$ is divided into the control information exchange (CIE) phase with negligible duration due to the small-sized control information [27], the WPT phase with duration $\alpha(t)$, and the computation data offloading (CDO) phase with duration $T-\alpha(t)$. During the CIE phase, HAPs and WDs exchange control information. During the WPT phase, HAPs broadcast RF signals to WDs through the downlink channels, and WDs harvest energy from the RF signals of HAPs simultaneously. During the CDO phase, the WDs in edge computing mode offload the computation data to the chosen HAPs through the uplink channels, while the WDs in local computing mode process the computation data during the whole time slot, since the energy harvesting circuit and the computing unit could work simultaneously [15, 21], as shown in the WD's circuit structure of Fig. 1. At the beginning of time slot $t$, the $N$ WDs generate computation data $\mathcal{D}(t)=[D_{1}(t),\ldots,D_{N}(t)]$, where $D_{n}(t)$ denotes the amount of the computation data generated by WD$_n$ at time slot $t$, and WDs transmit the state information about the amount of the generated computation data and the amount of the energy in their batteries to the HAPs. Then the HAPs follow the proposed TMADO framework to output the feasible solutions of the offloading decisions, broadcast them to WDs, and conduct WPT with duration $\alpha(t)$. According to the received feasible solutions of the offloading decisions, each WD independently makes its optimal offloading decision, i.e., it processes the generated computation data locally during the current time slot, or offloads the computation data to the chosen HAP during the CDO phase. To avoid potential interference, the time division multiple access (TDMA) protocol is employed by the WDs that offload the computation data to the same HAP at the same time slot, while the frequency division multiple access (FDMA) protocol is employed by the WDs that offload the computation data to different HAPs. As the MEC server of each HAP has a high CPU frequency [24], there is no competition among WDs for the edge resource. Similarly, as the MEC server of each HAP has a high CPU frequency [24] and the computation results are small-sized [15], we neglect the time spent by HAPs on processing data and transmitting the computation results back to the WDs.

II-B Channel Model

For the channels in the WP-MEC network, we adopt the free-space path loss model for the large-scale fading, and the Rayleigh fading model [28] for the small-scale fading. The channel gain between WD$_n$ and HAP$_m$ at time slot $t$ is given by

$h_{n,m}(t)=\sigma_{L,n,m}\left|\sigma_{S,n,m}(t)\right|^{2},$   (1)

where $\sigma_{L,n,m}$ denotes the large-scale fading component, and $\sigma_{S,n,m}(t)$ denotes the small-scale fading component at time slot $t$. The uplink and downlink channels are considered to be symmetric [14], i.e., the channel gain of the uplink channel equals that of the corresponding downlink channel. We consider that channel gains remain unchanged within each time slot, and vary from one time slot to another.
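To make the channel model concrete, the following minimal Python sketch draws one realization of (1); the 2.4 GHz wavelength, the WD-HAP distances, and the unit-mean Rayleigh power are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def channel_gain(distance_m, wavelength_m=0.125, rng=None):
    """Channel gain h_{n,m}(t) in (1): free-space path loss (large-scale)
    times the squared magnitude of a Rayleigh fading coefficient."""
    rng = rng or np.random.default_rng()
    sigma_L = (wavelength_m / (4.0 * np.pi * distance_m)) ** 2   # sigma_{L,n,m}
    sigma_S = rng.normal(scale=np.sqrt(0.5)) + 1j * rng.normal(scale=np.sqrt(0.5))
    return sigma_L * np.abs(sigma_S) ** 2                        # |sigma_S|^2 ~ Exp(1)

# Example: gains between N = 3 WDs and M = 2 HAPs at one time slot.
rng = np.random.default_rng(0)
h = np.array([[channel_gain(d, rng=rng) for d in (5.0, 8.0)] for _ in range(3)])
```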

II-C Energy Harvesting Model

During the WPT phase, WDs harvest energy from the RF signals broadcast by HAPs. The amount of the energy harvested by a WD depends on the transmit power of the HAPs, the distance between the WD and the HAPs, and the duration of the WPT phase. We adopt the linear energy harvesting (EH) model, and formulate the amount of the energy harvested by WD$_n$ at time slot $t$ as

$E_{h,n}(t)=\sum_{m=1}^{M}\mu P_{h,m}(t)h_{n,m}(t)\alpha(t),$   (2)

where $\mu\in(0,1)$ denotes the EH efficiency, $P_{h,m}(t)\in[0,P_{\max}]$ denotes the transmit power of HAP$_m$ at time slot $t$, and $P_{\max}$ denotes the maximum transmit power of HAPs.

The amount of the available energy in WD$_n$ at time slot $t$, i.e., the sum of the amount of the initial energy in WD$_n$ and the amount of the energy harvested by WD$_n$ at time slot $t$, is formulated as

$E_{n}(t)=\min\{E_{i,n}(t)+E_{h,n}(t),E_{b}\},$   (3)

where $E_{i,n}(t)$ denotes the amount of the initial energy in WD$_n$ at time slot $t$. At the beginning of the first time slot, the initial energy in WDs equals zero, i.e., $E_{i,n}(0)=0,\forall n\in\mathcal{N}$. As shown in (3), the $\min\{\cdot\}$ operation ensures that the amount of the available energy in WD$_n$ does not exceed the battery capacity. Then the amount of the initial energy in WD$_n$ at time slot $t+1$ updates as

$E_{i,n}(t+1)=\min\{E_{i,n}(t)+E_{h,n}(t),E_{b}\}-E_{w,n}(t),$   (4)

where $E_{w,n}(t)$ denotes the amount of the energy consumed by WD$_n$ at time slot $t$.
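The EH model (2) and the battery dynamics (3)-(4) can be traced with a short sketch; the EH efficiency, powers, channel gains, and battery capacity below are illustrative placeholders.

```python
import numpy as np

def harvested_energy(P_h, h_n, alpha, mu=0.51):
    """E_{h,n}(t) in (2): mu * sum_m P_{h,m}(t) h_{n,m}(t) * alpha(t)."""
    return mu * np.sum(P_h * h_n) * alpha

def battery_update(E_i, E_h, E_w, E_b):
    """Available energy (3) and next-slot initial energy (4)."""
    E_avail = min(E_i + E_h, E_b)        # clipped at battery capacity E_b
    return E_avail, E_avail - E_w        # E_n(t), E_{i,n}(t+1)

# Example: M = 2 HAPs charging one WD for alpha = 0.4 s.
P_h = np.array([2.0, 1.5])               # P_{h,m}(t) in W
h_n = np.array([1e-5, 4e-6])             # h_{n,m}(t)
E_h = harvested_energy(P_h, h_n, alpha=0.4)
E_n, E_i_next = battery_update(E_i=1e-5, E_h=E_h, E_w=5e-6, E_b=1e-3)
```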

II-D Offloading Decisions

According to the computing modes, WDs are divided into the nonoverlapping sets $\mathcal{N}_{m}(t)$, $\mathcal{N}_{0}(t)$, and $\mathcal{N}_{-1}(t)$. Let $\mathcal{N}_{m}(t),m\in\mathcal{M}$ denote the set of WDs that process the computation data by offloading it to HAP$_m$ at time slot $t$, $\mathcal{N}_{0}(t)$ denote the set of WDs that process the computation data locally at time slot $t$, and $\mathcal{N}_{-1}(t)$ denote the set of WDs that fail to process the computation data at time slot $t$. If the available energy in a WD is not enough for local computing in (9) or edge computing in (13), the WD fails to process the computation data at time slot $t$. The WDs failing to process the computation data at time slot $t$ are not included in the feasible solutions of the offloading decisions $(\mathcal{N}-\mathcal{N}_{-1}(t))$ from HAPs. Considering the quality of service requirement for the WDs in $\mathcal{N}_{-1}(t)$, these WDs directly drop the computation data without retransmitting, and accumulate energy in their batteries so as to successfully process the computation data of the following time slots. $\mathcal{N}_{-1}(t)$ varies across time slots, since WDs with sufficient harvested energy process the generated computation data locally or offload it to the chosen HAPs. Then $\mathcal{N}_{m}(t)$, $\mathcal{N}_{0}(t)$, and $\mathcal{N}_{-1}(t)$ satisfy

$\mathcal{N}=\bigcup^{M}_{m=-1}\mathcal{N}_{m}(t),$   (5)

where $\mathcal{N}_{x}(t)\cap\mathcal{N}_{y}(t)=\phi$, $x,y\in\{-1,\ldots,M\}$, $x\neq y$, and $\phi$ denotes the empty set.

II-E Local Computing Mode

According to the received feasible solutions of the offloading decisions from HAPs, when WD$_n$ chooses local computing mode at time slot $t$, i.e., $n\in\mathcal{N}_{0}(t)$, it processes the generated computation data of $D_{n}(t)$ bits locally during the whole time slot. Let $f_{n}(t)$ with $f_{n}(t)\in(0,f_{\max}]$ denote the CPU frequency of WD$_n$, adjusted by the dynamic voltage and frequency scaling technique [29], at time slot $t$, where $f_{\max}$ denotes the maximum CPU frequency of WDs. Let $C_{n}>0$ denote the number of CPU cycles required by WD$_n$ to process 1-bit data. The computing delay constraint, i.e., that the delay of WD$_n$ to process the generated computation data $D_{n}(t)$ at time slot $t$, denoted by $\tau_{l,n}(t)$, does not exceed the duration of the time slot $T$, is formulated as

$\tau_{l,n}(t)=\frac{C_{n}D_{n}(t)}{f_{n}(t)}\leq T,\ \forall n\in\mathcal{N}_{0}(t).$   (6)

Since energy is the product of power and time, the amount of the energy consumed by WD$_n$ for local computing at time slot $t$ is the product of the CPU power of WD$_n$, denoted by $P_{\text{cpu},n}(t)$, and the delay of WD$_n$ to process the generated computation data, i.e., $\tau_{l,n}(t)$. According to [30] and [31], the CPU power of WD$_n$ at time slot $t$ is expressed as

$P_{\text{cpu},n}(t)=C_{L,n}V_{n}^{2}(t)f_{n}(t),$   (7)

where $C_{L,n}$ denotes the switched load capacitance of WD$_n$, and $V_{n}(t)$ denotes the CPU voltage of WD$_n$ at time slot $t$. As pointed out by [31], when the CPU operates at the low voltage limit, which is in accordance with real-world WDs, the CPU frequency is approximately linear with the CPU voltage. Consequently, $P_{\text{cpu},n}(t)$ in (7) can be reformulated as $k_{n}f_{n}^{3}(t)$, where $k_{n}$ satisfying $C_{L,n}V_{n}^{2}(t)=k_{n}f_{n}^{2}(t)$ is called the effective switched load capacitance of WD$_n$ [32]. Then the amount of the energy consumed by WD$_n$ for local computing at time slot $t$ is expressed as

$E_{l,n}(t)=P_{\text{cpu},n}(t)\tau_{l,n}(t)=k_{n}f^{3}_{n}(t)\tau_{l,n}(t),\ \forall n\in\mathcal{N}_{0}(t),$   (8)

which has been extensively used in the works related to MEC networks [14, 15, 20, 21, 32].

Since WDs in local computing mode are powered by the energy harvested during the WPT phase and that stored in the batteries, the energy constraint for WD$_n$ in local computing mode, i.e., that the amount of the energy consumed by WD$_n$ for local computing at time slot $t$ is no larger than the available energy in WD$_n$ at time slot $t$, is formulated as [24]

$E_{l,n}(t)\leq E_{n}(t),\ \forall n\in\mathcal{N}_{0}(t).$   (9)
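A minimal sketch of the local computing mode, computing the delay (6) and energy (8) and checking the delay constraint (6) and energy constraint (9); all numbers are illustrative.

```python
def local_computing(D_n, C_n, k_n, f_n, T, E_n):
    """Delay (6) and energy (8) of local computing, with feasibility
    per the delay constraint (6) and energy constraint (9)."""
    tau_l = C_n * D_n / f_n              # CPU cycles / CPU frequency
    E_l = k_n * f_n ** 3 * tau_l         # = k_n * C_n * D_n * f_n^2
    return tau_l, E_l, (tau_l <= T) and (E_l <= E_n)

# Example: 10 kbits at 1000 cycles/bit, k_n = 1e-26, f_n = 100 MHz.
tau_l, E_l, ok = local_computing(D_n=1e4, C_n=1e3, k_n=1e-26,
                                 f_n=1e8, T=1.0, E_n=1e-3)
```

Note that $E_{l,n}(t)=k_{n}C_{n}D_{n}(t)f_{n}^{2}(t)$ grows quadratically with the CPU frequency for a fixed data amount, so the lowest frequency meeting the deadline minimizes the local energy; this is one reason $f_{n}(t)$ is treated as a decision variable later.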

II-F Edge Computing Mode

According to the received feasible solutions of the offloading decisions from HAPs, when WD$_n$ chooses edge computing mode at time slot $t$, it offloads the computation data to HAP$_m$, i.e., $n\in\mathcal{N}_{m}(t)$, $m\in\mathcal{M}$. Let $\tau_{o,n,m}(t)$ denote the duration that WD$_n$ offloads the computation data of $D_{n}(t)$ bits to HAP$_m$ at time slot $t$. Then $\tau_{o,n,m}(t)$ and $\alpha(t)$ satisfy

$\sum_{n\in\mathcal{N}_{m}(t)}\tau_{o,n,m}(t)+\alpha(t)\leq T,\ \forall m\in\mathcal{M}.$   (10)

Based on Shannon's formula [33], to ensure that WD$_n$ in edge computing mode successfully offloads the computation data to HAP$_m$ during the CDO phase, $\tau_{o,n,m}(t)$ satisfies

$\tau_{o,n,m}(t)\geq\frac{vD_{n}(t)}{B\log_{2}\left(1+\frac{P_{n}h_{n,m}(t)}{N_{0}}\right)},\ \forall n\in\mathcal{N}_{m}(t),m\in\mathcal{M},$   (11)

where $B$ denotes the uplink bandwidth of each HAP, and $v\geq 1$ represents the communication overhead, including the encryption and data header costs [14]. $P_{n}$ denotes the transmit power of WD$_n$, and $N_{0}$ denotes the power of the additive white Gaussian noise (AWGN). Besides, there is no interference between multiple WDs in (11) due to the TDMA protocol. The amount of the energy consumed by WD$_n$ for transmitting the computation data of $D_{n}(t)$ bits to HAP$_m$ at time slot $t$ is

$E_{o,n}(t)=(P_{n}+P_{c,n})\tau_{o,n,m}(t),\ \forall n\in\mathcal{N}_{m}(t),m\in\mathcal{M},$   (12)

where $P_{c,n}$ denotes the circuit power of WD$_n$. During the CDO phase, the energy constraint for WD$_n$ in edge computing mode, i.e., that the amount of the energy consumed by WD$_n$ for edge computing is no larger than the amount of the available energy in WD$_n$ at time slot $t$, is formulated as

$E_{o,n}(t)\leq E_{n}(t),\ \forall n\in\mathcal{N}_{m}(t),m\in\mathcal{M}.$   (13)

Then the amount of the energy consumed by WD$_n$ at time slot $t$ is defined as

$E_{w,n}(t)=\begin{cases}E_{o,n}(t),&\text{if}\ n\in\mathcal{N}_{m}(t),m\in\mathcal{M};\\ E_{l,n}(t),&\text{if}\ n\in\mathcal{N}_{0}(t);\\ 0,&\text{otherwise}.\end{cases}$   (14)

If a WD fails to process the computation data under the TMADO framework at time slot $t$, its computation data would not be scheduled [23], and $E_{w,n}(t)$ equals 0.
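Similarly, a sketch of the edge computing mode: the minimum offloading duration from (11) and the offloading energy (12), checked against the energy constraint (13); the bandwidth, powers, and noise level are illustrative.

```python
import numpy as np

def offload_time(D_n, h_nm, B, P_n, N0, v=1.1):
    """Minimum offloading duration tau_{o,n,m}(t) from (11)."""
    rate = B * np.log2(1.0 + P_n * h_nm / N0)   # Shannon rate, bit/s
    return v * D_n / rate

# Example: 10 kbits over a 1 MHz uplink.
tau_o = offload_time(D_n=1e4, h_nm=1e-5, B=1e6, P_n=0.1, N0=1e-10)
E_o = (0.1 + 0.01) * tau_o       # (12) with P_n = 0.1 W, P_{c,n} = 0.01 W
feasible = E_o <= 1e-3           # energy constraint (13) with E_n(t) = 1 mJ
```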

The amount of the energy consumed by HAP$_m$ at time slot $t$, denoted by $E_{h,m}(t)$, is formulated as

$E_{h,m}(t)=E_{1,m}(t)+E_{2,m}(t).$   (15)

(15) indicates that the energy provision by HAPs consists of the amount of the energy consumed by HAPs for broadcasting RF signals and the amount of the energy consumed by HAPs for processing the received computation data from WDs. During the WPT phase of time slot $t$, the amount of the energy consumed by HAP$_m$ for broadcasting the RF signal, denoted by $E_{1,m}(t)$, is formulated as

$E_{1,m}(t)=\alpha(t)P_{h,m}(t),\ \forall m\in\mathcal{M}.$   (16)

During the CDO phase of time slot $t$, the amount of the energy consumed by HAP$_m$ for processing the received computation data of $\sum_{n\in\mathcal{N}_{m}(t)}D_{n}(t)$ bits from the WDs in $\mathcal{N}_{m}(t)$, denoted by $E_{2,m}(t)$, is formulated as [34]

$E_{2,m}(t)=\sum_{n\in\mathcal{N}_{m}(t)}e_{m}D_{n}(t),\ \forall m\in\mathcal{M},$   (17)

where $e_{m}$ represents the energy consumption of HAP$_m$ for processing per offloaded bit.
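Putting (15)-(17) together, the per-slot energy consumed by HAP$_m$ is straightforward to tabulate; $e_m$ and the offloaded amounts below are illustrative.

```python
def hap_energy(alpha, P_hm, e_m, offloaded_bits):
    """E_{h,m}(t) = E_{1,m}(t) + E_{2,m}(t) from (15)-(17), where
    offloaded_bits lists D_n(t) for the WDs in N_m(t)."""
    E1 = alpha * P_hm                    # WPT broadcasting energy (16)
    E2 = e_m * sum(offloaded_bits)       # processing energy (17)
    return E1 + E2

# Example: two WDs offload 10 kbits each; e_m = 1e-9 J per offloaded bit.
E_hm = hap_energy(alpha=0.4, P_hm=2.0, e_m=1e-9, offloaded_bits=[1e4, 1e4])
```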

III Problem Formulation

In this section, we formulate the long-term energy provision minimization problem in the multi-HAP WP-MEC network as $\mathbf{P0}$. We then use Lemmas 1 and 2 to prove that $\mathbf{P0}$ for a single time slot is NP-hard, which makes $\mathbf{P0}$ particularly challenging.

The total amount of the energy provision by HAPs at time slot $t$, denoted by $\Psi(t)$, is expressed as

$\Psi(t)=\sum_{m=1}^{M}E_{h,m}(t).$   (18)

We aim to minimize $\Psi(t)$ in the long term through optimizing the transmit power of HAPs $\bm{P_{h}}(t)=[P_{h,1}(t),\ldots,P_{h,M}(t)]$, the duration of the WPT phase $\alpha(t)$, the offloading decisions $\bm{\mathcal{X}}(t)=\{\mathcal{N}_{0}(t),\ldots,\mathcal{N}_{M}(t)\}$, the time allocated to WDs for offloading the computation data to HAPs $\bm{\tau_{o}}(t)=[\tau_{o,1,1}(t),\ldots,\tau_{o,1,M}(t),\tau_{o,2,1}(t),\ldots,\tau_{o,2,M}(t),\ldots,\tau_{o,N,M}(t)]$, and the CPU frequencies of WDs $\bm{f}(t)=[f_{1}(t),\ldots,f_{N}(t)]$ as

$\mathbf{P0}:\ \min_{\bm{P_{h}}(t),\alpha(t),\bm{\mathcal{X}}(t),\bm{\tau_{o}}(t),\bm{f}(t)}\ \sum_{t=0}^{\mathcal{T}-1}\Psi(t)$   (19a)
$\text{s.t.}\ \sum_{n\in(\mathcal{N}-\mathcal{N}_{-1}(t))}D_{n}(t)\geq D_{th},$   (19b)
$0\leq\frac{C_{n}D_{n}(t)}{f_{n}(t)}\leq T,\ \forall n\in\mathcal{N}_{0}(t),$   (19c)
$0\leq\tau_{o,n,m}(t)\leq T,\ \forall n\in\mathcal{N}_{m}(t),m\in\mathcal{M},$   (19d)
$0\leq\alpha(t)\leq T,$   (19e)
$\sum_{n\in\mathcal{N}_{m}(t)}\tau_{o,n,m}(t)+\alpha(t)\leq T,\ \forall m\in\mathcal{M},$   (19f)
$0\leq P_{h,m}(t)\leq P_{\max},\ \forall m\in\mathcal{M},$   (19g)
$k_{n}f^{3}_{n}(t)\tau_{l,n}(t)\leq E_{n}(t),\ \forall n\in\mathcal{N}_{0}(t),$   (19h)
$(P_{n}+P_{c,n})\tau_{o,n,m}(t)\leq E_{n}(t),\ \forall n\in\mathcal{N}_{m}(t),m\in\mathcal{M},$   (19i)
$0<f_{n}(t)\leq f_{\max},\ \forall n\in\mathcal{N}_{0}(t),$   (19j)
$E_{n}(t)=\min\{E_{i,n}(t)+E_{h,n}(t),E_{b}\},\ \forall n\in\mathcal{N}.$   (19k)

In $\mathbf{P0}$, (19b) represents the computation data demand constraint that the total amount of the processed computation data at each time slot is no smaller than the computation data demand $D_{th}$. (19c) represents the computing delay constraint for each WD in local computing mode. (19d), (19e), and (19f) respectively represent that the duration of offloading the computation data to HAP$_m$, the duration of the WPT phase, and the sum of the total duration of offloading the computation data to HAP$_m$ and the duration of the WPT phase do not exceed the duration of the time slot. (19g) ensures that the transmit power of each HAP does not exceed the maximum transmit power of HAPs. (19h) represents the energy constraint for each WD in local computing mode [23]. (19i) represents that the amount of the energy consumed by each WD in edge computing mode is no larger than the amount of the available energy in that WD. (19j) ensures that the CPU frequency of each WD does not exceed the maximum CPU frequency. (19k) ensures that the amount of the available energy in each WD, which equals the sum of the amount of the initial energy in the WD and the amount of the energy harvested by the WD, does not exceed the battery capacity.

In the following, we provide a detailed explanation of the NP-hardness of the formulated non-convex MIP problem. Following [35], the general MIP problem is defined as

$\min_{x,y}\ f_{0}(x,y)$   (20a)
$\text{s.t.}\ f_{i}(x,y)\leq 0,\ i=\{1,\ldots,m\},$   (20b)
$x\in\mathbb{Z}_{+}^{n_{1}},\ y\in\mathbb{R}_{+}^{n_{2}},$   (20c)

where $x$ denotes the vector of integer variables, $y$ denotes the vector of continuous variables, $f_{0}(x,y),f_{1}(x,y),\ldots,f_{m}(x,y)$ are arbitrary functions mapping from $\mathbb{Z}_{+}^{n_{1}}\times\mathbb{R}_{+}^{n_{2}}$ to the real numbers, $n_{1}>0$ denotes the number of integer variables, $n_{2}\geq 0$ denotes the number of continuous variables, and $m\geq 0$ denotes the number of constraints. The MIP problem is a general class of problems consisting of convex MIP problems and non-convex MIP problems: the MIP problem is convex if $f_{0}(x,y),f_{1}(x,y),\ldots,f_{m}(x,y)$ are all convex, and non-convex otherwise [35].

Then we explain the NP-hardness of the non-convex MIP problem in this paper. According to (19a)-(19k), the formulated energy provision minimization problem for time slot $t$, which optimizes $\bm{P_{h}}(t)$, $\alpha(t)$, $\bm{\mathcal{X}}(t)$, $\bm{\tau_{o}}(t)$, and $\bm{f}(t)$, is

$\min_{\bm{P_{h}}(t),\alpha(t),\bm{\mathcal{X}}(t),\bm{\tau_{o}}(t),\bm{f}(t)}\ \sum_{m=1}^{M}\Big(\alpha(t)P_{h,m}(t)+\sum_{n\in\mathcal{N}_{m}(t)}e_{m}D_{n}(t)\Big)$   (21a)
$\text{s.t.}\ (19b)-(19k).$

It is observed that (21a) includes the integer variables $\bm{\mathcal{X}}(t)$ and a non-convex objective resulting from the coupling between $\bm{P_{h}}(t)$ and $\alpha(t)$. According to the definition of the non-convex MIP problem, the formulated energy provision minimization problem for time slot $t$ is a non-convex MIP problem. We then adopt Lemmas 1-2 to prove that this non-convex MIP problem is NP-hard.

  Lemma 1

Given $\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$, $\tau_{l,n}(t)$, and $f_{n}(t)$ in the optimal solution of $\mathbf{P0}$ for a single time slot, i.e., $\mathcal{T}=1$, WD$_n$ that satisfies the constraints in (19b), (19h), and (19j) at time slot $t$ chooses local computing mode, i.e., $n\in\mathcal{N}_{0}(t)$.

Proof:

Satisfying the constraints in (19b), (19h), and (19j) indicates that WD$_n$ satisfies the computation data demand constraint, the energy constraint, and the CPU frequency constraint. Then $n\in\mathcal{N}_{0}(t)$ is a feasible solution of $\mathbf{P0}$. With given $\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$, $\tau_{l,n}(t)$, and $f_{n}(t)$, the value of $E_{1,m}(t)$ in (16) is determined. Based on (15)-(18), the energy provision by HAPs in $\mathbf{P0}$ with WD$_n$ in local computing mode is smaller than that with WD$_n$ in edge computing mode. This completes the proof. ∎

  Lemma 2

With given $\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$, $\bm{\tau_{o}}(t)$, and $\bm{f}(t)$, $\mathbf{P0}$ for time slot $t$ is NP-hard.

Proof:

With given $\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$, $\bm{\tau_{o}}(t)$, and $\bm{f}(t)$, the value of $E_{1,m}(t)$ in (16) is determined. Based on (15), $\mathbf{P0}$ for time slot $t$ aims to minimize $\sum_{m=1}^{M}E_{2,m}(t)$ in (17) and (18) by optimizing the offloading decisions $\bm{\mathcal{X}}(t)=\{\mathcal{N}_{0}(t),\ldots,\mathcal{N}_{M}(t)\}$. Based on (17), we infer that $\sum_{m=1}^{M}E_{2,m}(t)$ is independent of $\mathcal{N}_{0}(t)$. Based on Lemma 1, $\mathcal{N}_{0}(t)$ in the optimal offloading solution $\bm{\mathcal{X}}(t)^{*}$ of $\mathbf{P0}$ for time slot $t$ is determined.

According to the aforementioned analysis, we transform finding $\mathcal{N}_{m}(t)\subset\bm{\mathcal{X}}(t)^{*}$ for time slot $t$ into the multiple knapsack problem as follows. Treat the $M$ HAPs as $M$ knapsacks with the same load capacity $T-\alpha(t)$. Treat the $N$ WDs as $N$ items, where the weight of item$_n$ equals the duration that WD$_n$ offloads the computation data of $D_{n}(t)$ bits to the chosen HAP, and the value of item$_n$ equals the amount of the energy consumed by the chosen HAP for processing the computation data of $D_{n}(t)$ bits from WD$_n$. Then finding $\mathcal{N}_{m}(t)\subset\bm{\mathcal{X}}(t)^{*}$ is equivalent to finding the optimal item assignment that minimizes the total value of the items assigned to the $M$ knapsacks. The multiple knapsack problem is NP-hard [36]. This completes the proof. ∎

As aforementioned, $\mathbf{P0}$ is a non-convex MIP problem, and Lemmas 1-2 demonstrate that this non-convex MIP problem is NP-hard in general. Therefore, it is quite challenging to solve $\mathbf{P0}$, especially in a distributed manner, i.e., where each WD independently makes its optimal offloading decision according to its local observation. Each WD cannot capture the state information of other WDs or of the HAPs located outside its transmission zone.

Figure 3: The flow chart of the proposed TMADO framework.

IV TMADO Framework

In essence, $\mathbf{P0}$ is a sequential decision-making problem. As a powerful tool for learning effective decisions in dynamic environments, DRL is exploited to tackle $\mathbf{P0}$. Specifically, as shown in Fig. 3, we propose the TMADO framework to decompose $\mathbf{P0}$ into three subproblems, i.e., the WPT and computation data optimization (WCDO) subproblem, the offloading decision optimization (ODO) subproblem, and the resource optimization (RO) subproblem. We specify the TMADO framework as follows.

  • With the state information of HAPs, WDs, and channel gains, the WCDO subproblem optimizes the transmit power of HAPs $\bm{P_{h}}(t)$, the duration of the WPT phase $\alpha(t)$, and the feasible solutions of the offloading decisions $(\mathcal{N}-\mathcal{N}_{-1}(t))$ by HAPs, subject to the computation data demand constraint (19b), the duration of the WPT phase constraint (19e), and the transmit power constraint (19g).

  • With the output of the WCDO subproblem {$\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$}, and the local observations of each WD, the ODO subproblem optimizes the offloading decisions $\bm{\mathcal{X}}(t)$ by WDs, subject to the computing delay constraint (19d) and the energy constraints (19h), (19i).

  • With the output of the WCDO subproblem {$\bm{P_{h}}(t)$, $\alpha(t)$, $\mathcal{N}_{-1}(t)$}, and the output of the ODO subproblem $\bm{\mathcal{X}}(t)$, the RO subproblem outputs the duration that WDs in edge computing mode offload the computation data to HAPs $\bm{\tau_{o}}(t)$, and the CPU frequencies $\bm{f}(t)$ of WDs in local computing mode, subject to the computing delay constraints (19c), (19f), and the CPU frequency constraint (19j).

The HAPs are considered as a single high-level agent, assisted by the server of the cloud center, in the WCDO subproblem. Each WD is considered as an independent low-level agent in the ODO subproblem. Under the TMADO framework, the decision-making process is completed by both the HAPs as the high-level agent and the WDs as low-level agents, for the following reasons. 1) Completing the decision-making process by only the high-level agent faces the challenges of large-scale output actions and poor scalability. 2) Completing the decision-making process by only low-level agents makes it difficult to determine the duration of the WPT phase and to satisfy the computation data demand in (19b). 3) It is difficult to separate the actions output by the high-level agent from those output by the low-level agents, due to the coupling between the output of the WCDO subproblem and that of the ODO subproblem.

Based on the aforementioned three reasons, we explore a hierarchical structure in the TMADO framework that consists of a single-agent actor-critic architecture based on the deep deterministic policy gradient (DDPG), and a multi-agent actor-critic architecture based on independent proximal policy optimization (IPPO). The HAPs first determine the high-level action through the DDPG algorithm. Then each low-level WD updates its own offloading decision through the IPPO algorithm. The states of the high-level agent are influenced by the low-level WDs' decisions, driving the HAPs to update their actions in the next decision epoch. The TMADO framework performs the control information exchanges between HAPs and WDs during the CIE phase of each time slot as follows. 1) HAPs and WDs transmit their states to the high-level agent through the dedicated control channel by using the TDMA protocol. Specifically, HAPs transmit their energy consumption states to the high-level agent, and WDs transmit their data generation and initial energy states to the high-level agent. 2) The high-level agent combines the received states, obtains its action through the high-level learning model, and broadcasts its action to HAPs and WDs. 3) Each low-level agent receives its state, obtains its action through the low-level learning model, and then executes its action.

The WCDO subproblem involves not only an integer variable (i.e., $\mathcal{N}_{-1}(t)$) but also continuous variables (i.e., $\bm{P_{h}}(t)$ and $\alpha(t)$). DDPG is well suited to searching for the optimal policy with continuous variables for a single agent [37]. Hence DDPG is employed by the high-level agent at the HAPs to solve the WCDO subproblem. The ODO subproblem, in contrast, involves only integer variables (i.e., $\bm{\mathcal{X}}(t)$), and the multi-WD offloading decisions need to be made by the multiple low-level agents at WDs. Different from other multi-agent DRL algorithms such as the multi-agent deep deterministic policy gradient (MADDPG), the multi-agent deep Q-learning network (MADQN), and multi-agent proximal policy optimization (MAPPO), IPPO estimates the individual state value function of each agent without interference from the irrelevant state information of other agents, and could thereby achieve a better reward [38]. Hence IPPO is employed by the low-level agents at WDs to solve the ODO subproblem.

IV-A WCDO Subproblem

We formulate the WCDO subproblem as a Markov decision process (MDP), represented by a tuple of the high-level agent $(\mathcal{S}^{h},\mathcal{A}^{h},\mathcal{R}^{h},\gamma^{h})$, where $\mathcal{S}^{h}$, $\mathcal{A}^{h}$, $\mathcal{R}^{h}$, and $\gamma^{h}$ represent the state space, action space, reward, and discount factor, respectively. The high-level agent receives state $\bm{s}^{h}(t)\in\mathcal{S}^{h}$, selects action $\bm{a}^{h}(t)\in\mathcal{A}^{h}$ at time slot $t$, and receives reward $r^{h}(t)\in\mathcal{R}^{h}$ and state $\bm{s}^{h}(t+1)$.

  • State space: The global state of the WCDO subproblem at time slot $t$ is defined as

    $\bm{s}^{h}(t)=\{\bm{s}_{p}^{h}(t),\bm{s}_{w}^{h}(t),\bm{s}_{c}^{h}(t)\},$   (22)

    where $\bm{s}_{p}^{h}(t)=\{E^{\text{tot}}_{h,1}(t),\ldots,E^{\text{tot}}_{h,M}(t)\}$ represents the state information of HAPs, $\bm{s}_{w}^{h}(t)=\{D_{1}(t),\ldots,D_{N}(t),E_{i,1}(t),\ldots,E_{i,N}(t)\}$ represents that of WDs, and $\bm{s}_{c}^{h}(t)=\{h_{n,m}(t)|n\in\mathcal{N},m\in\mathcal{M}\}$ represents the channel gains at time slot $t$. $E^{\text{tot}}_{h,m}(t)$ is the state information of HAP$_m$ that represents the total amount of the energy consumed by HAP$_m$ during time slots $[0,t-1)$. $E_{i,n}(t)$ represents the amount of the initial energy in WD$_n$ at time slot $t$.

  • Action space: At time slot $t$, the action of HAPs is defined as

    $\mathcal{A}^{h}(t)=\{\alpha(t),P_{h,1}(t),\ldots,P_{h,M}(t),a_{c,1}(t),\ldots,a_{c,N}(t)\}\in\mathcal{A}^{h},$   (23)

    where $a_{c,n}(t)$ represents the energy provision cost of HAPs for the energy supply of WD$_n$ estimated by the high-level agent, i.e., the amount of the energy consumed by HAPs for WD$_n$ to harvest per joule of energy. To be specific, if the energy provision cost of HAPs for the energy supply of WD$_n$ is small, the probability that HAPs let WD$_n$ process the computation data in local computing mode or edge computing mode is high. Otherwise, the probability that WD$_n$ is excluded from the feasible solutions of the offloading decisions is high. As $a_{c,n}(t)$ is a continuous action, and the feasible solutions of the offloading decisions are discrete actions, we convert the continuous actions $a_{c,n}(t)$ into a discrete action, which represents the set of WDs to process the computation data, and obtain $\mathcal{N}_{-1}(t)$. To optimize the feasible solutions of the offloading decisions, the sub-action space $\mathcal{A}_{c}^{h}(t)$ is established in the ascending order of the actions $a_{c,n}(t)$ as

    $\mathcal{A}_{c}^{h}(t)=\text{a-sorted}(a_{c,n}(t)),\ \forall n\in\mathcal{N}.$   (24)

    The WDs in the feasible solutions of the offloading decisions are powered by HAPs with low energy provision cost, and satisfy (19b). Based on (24), we obtain the feasible solutions of the offloading decisions (see the sketch after this list).

  • Reward function: The reward function measures the amount of the energy provision by HAPs at time slot $t$ as

    $r^{h}(t)=-\sum_{m=1}^{M}E_{h,m}(t)-\omega_{d},$   (25)

    where $\omega_{d}$ denotes the penalty for dissatisfying (19b). The high-level agent maximizes the cumulative discounted reward $r^{h}$ as

    $r^{h}=\sum_{t=0}^{\mathcal{T}-1}\gamma^{h}\,r^{h}(t).$   (26)
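The conversion from the continuous cost actions $a_{c,n}(t)$ to the feasible WD set can be sketched as follows. The paper specifies the ascending sort (24) and the demand constraint (19b); the exact cutoff rule used here (admit WDs in ascending-cost order until the demand is met) is one plausible reading rather than the authors' exact procedure.

```python
import numpy as np

def feasible_wd_set(a_c, D, D_th):
    """Sort WDs by the estimated energy provision cost a_{c,n}(t) as in (24)
    and admit them in ascending order until the demand D_th in (19b) is met;
    the remaining WDs form N_{-1}(t)."""
    order = np.argsort(a_c)              # "a-sorted": ascending cost
    admitted, total = [], 0.0
    for n in order:
        admitted.append(int(n))
        total += D[n]
        if total >= D_th:                # (19b) satisfied
            break
    return set(admitted), total >= D_th

# Example: N = 4 WDs, each with 10 kbits; demand D_th = 25 kbits.
sel, ok = feasible_wd_set(np.array([0.8, 0.2, 0.5, 0.9]),
                          np.array([1e4, 1e4, 1e4, 1e4]), 2.5e4)
# sel = {0, 1, 2}: the three cheapest WDs cover the demand.
```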

IV-A1 Architecture Design of the High-level Agent

DDPG is employed by the high-level agent. As a single-agent actor-critic architecture, DDPG uses deep neural networks (DNNs) as a state-action value function $Q_{\pi^{h}}(\bm{s}^{h},\bm{a}^{h}|\bm{\theta}^{h})$ to learn the optimal policy in a multi-dimensional continuous action space. Let $\bm{\theta}^{h}$ denote the parameters of the DNNs, and $\bm{\theta}^{h*}$ denote the optimal policy parameters. There are four DNNs in DDPG, as follows.

The actor network of the high-level agent with parameters $\bm{\theta}^{h}_{a}$ outputs policy $\pi^{h}(\bm{s}^{h}|\bm{\theta}^{h}_{a})$. Under policy $\pi^{h}(\bm{s}^{h}|\bm{\theta}^{h}_{a})$, the high-level agent adopts state $\bm{s}^{h}$ as the input of the actor network to output the corresponding action $\bm{a}^{h}$.

The critic network of the high-level agent with parameters $\bm{\theta}^{h}_{c}$ evaluates policy $\pi^{h}(\bm{s}^{h}|\bm{\theta}^{h}_{a})$ generated by the actor network through $Q_{\pi^{h}}(\bm{s}^{h},\bm{a}^{h}|\bm{\theta}^{h})$.

The target actor network and target critic network of the high-level agent are used to improve the learning stability. The structure of the target actor network with parameters $\bm{\theta}_{a,t}^{h}$ is the same as that of the actor network, and the structure of the target critic network with parameters $\bm{\theta}_{c,t}^{h}$ is the same as that of the critic network.

The input of the DNNs is state $\bm{s}^{h}$. Motivated by reward $r^{h}$ in (26), the high-level agent learns the optimal policy $\pi^{h*}$ with $\bm{\theta}^{h*}$ through exploration and training. With the obtained optimal policy $\pi^{h*}$, the high-level agent chooses the action that maximizes $Q_{\pi^{h*}}(\bm{s}^{h},\bm{a}^{h}|\bm{\theta}^{h*})$ for each state.

IV-A2 Training Process of the High-level Agent

The server of the cloud center assists in training the critic network and actor network of the high-level agent. The experience replay buffer stores the experiences of each time slot. The high-level agent randomly selects $k$-size samples of the experiences, and calculates the loss function of the critic network at time slot $t$ as

$\mathcal{L}^{h}_{c}(\bm{\theta}^{h}_{c}(t))=\frac{1}{k}\sum_{i=1}^{k}\left[Q_{i}(\bm{s}^{h}_{i}(t),\bm{a}^{h}_{i}(t)|\bm{\theta}^{h}_{c}(t))-\left(r^{h}_{i}+\gamma^{h}Q_{i}(\bm{s}^{h}_{i}(t+1),\bm{a}^{h}_{i}(t+1)|\bm{\theta}^{h}_{c}(t))\right)\right]^{2}.$   (27)

By the gradient descent method, the parameters of the critic network are updated as

$\bm{\theta}^{h}_{c}(t+1)\leftarrow\bm{\theta}^{h}_{c}(t)-\beta^{h}_{c}\nabla_{\bm{\theta}^{h}_{c}(t)}\mathcal{L}^{h}_{c}(\bm{\theta}^{h}_{c}(t)),$   (28)

where $\beta^{h}_{c}$ denotes the learning rate of the critic network.

The action is generated by the actor network of the high-level agent as

$\bm{a}^{h}(t)=\pi^{h}(\bm{s}^{h}(t)|\bm{\theta}^{h}_{a}(t))+\mathcal{G}(t),$   (29)

where $\mathcal{G}(t)$ denotes the Gaussian noise at time slot $t$. The Gaussian noise improves the stability and convergence of the actor network [39].

The actor network of the high-level agent aims to maximize the state-action value of the critic network by optimizing the loss function of the actor network as

$\mathcal{L}^{h}_{a}(\bm{\theta}^{h}_{a}(t))=\frac{1}{k}\sum_{i=1}^{k}Q(\bm{s}^{h}(t),\bm{a}^{h}(t)|\bm{\theta}^{h}_{c}(t)).$   (30)

To achieve the maximum state-action value of the critic network, the gradient ascent method is adopted to update the parameters $\bm{\theta}^{h}_{a}(t)$ as

$\bm{\theta}^{h}_{a}(t+1)\leftarrow\bm{\theta}^{h}_{a}(t)+\beta^{h}_{a}\nabla_{a}\mathcal{L}^{h}_{a}(\bm{a}^{h}(t))\nabla_{\bm{\theta}^{h}_{a}(t)}\bm{a}^{h}(t),$   (31)

where $\beta^{h}_{a}$ denotes the learning rate of the actor network.

The target actor network and target critic network of the high-level agent update their parameters through soft updating as

$\bm{\theta}_{a,t}^{h}(t+1)=v_{a}\bm{\theta}^{h}_{a}(t)+(1-v_{a})\bm{\theta}_{a,t}^{h}(t),$
$\bm{\theta}_{c,t}^{h}(t+1)=v_{c}\bm{\theta}^{h}_{c}(t)+(1-v_{c})\bm{\theta}_{c,t}^{h}(t),$   (32)

where $0<v_{a}\leq 1$ and $0<v_{c}\leq 1$ are the soft updating factors.
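For reference, a minimal PyTorch sketch of one DDPG update of the high-level agent, covering the critic loss (27)-(28), the actor update (30)-(31), and the soft target updates (32). The network sizes, learning rates, and the random batch standing in for replay samples are illustrative, and the TD target is computed with the target networks, as is standard in DDPG.

```python
import copy
import torch
import torch.nn as nn

S_DIM, A_DIM, K, GAMMA, TAU = 8, 4, 64, 0.9, 0.01    # illustrative sizes

actor = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(),
                      nn.Linear(64, A_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A random k-size batch (s, a, r, s') standing in for replay-buffer samples.
s, a = torch.randn(K, S_DIM), torch.randn(K, A_DIM)
r, s2 = torch.randn(K, 1), torch.randn(K, S_DIM)

# Critic loss (27): mean squared TD error; target from the target networks.
with torch.no_grad():
    y = r + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
loss_c = ((critic(torch.cat([s, a], dim=1)) - y) ** 2).mean()
opt_c.zero_grad(); loss_c.backward(); opt_c.step()   # gradient descent (28)

# Actor update (30)-(31): ascend Q by descending on -Q.
loss_a = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad(); loss_a.backward(); opt_a.step()

# Soft target updates (32) with factors v_a = v_c = TAU.
with torch.no_grad():
    for tgt, src in ((actor_t, actor), (critic_t, critic)):
        for pt, p in zip(tgt.parameters(), src.parameters()):
            pt.mul_(1 - TAU).add_(TAU * p)
```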


IV-B ODO Subproblem

We formulate the ODO subproblem as a decentralized partially observable Markov decision process (Dec-POMDP), represented by a tuple of low-level agents $(\mathcal{S}^{l},\mathcal{O}^{l},\mathcal{A}^{l},\mathcal{R}^{l},\text{Pr},\gamma^{l})$, where $\mathcal{S}^{l}$, $\mathcal{O}^{l}$, $\mathcal{A}^{l}$, $\mathcal{R}^{l}$, $\text{Pr}$, and $\gamma^{l}$ represent the states, local observations, actions, rewards, transition probability, and discount factor, respectively. As low-level agents, WDs adapt their actions to maximize the rewards based on local observations. At the beginning of time slot 0, the global state of the ODO subproblem is initialized as $\bm{s}^{l}(0)$. At time slot $t$, WD$_n$ has observation $\bm{o}^{l}_{n}(t)$ of state $\bm{s}^{l}(t)$, and then outputs action $\bm{a}^{l}_{n}(t)$. The environment receives the set of the $N$ WDs' actions $\bm{a}^{l}(t)=\{\bm{a}^{l}_{n}(t)\}^{N}_{n=1}$, and calculates reward $r^{l}_{n}(t)=\mathcal{R}^{l}_{n}(\bm{s}^{l}(t),\bm{a}^{l}(t)),n\in\mathcal{N}$ for the $N$ WDs. Then the environment transitions to state $\bm{s}^{l}(t+1)$ according to transition probability $\text{Pr}(\bm{s}^{l}(t+1)|\bm{s}^{l}(t),\bm{a}^{l}(t))$.

  • State space: The global state of the ODO subproblem at time slot $t$ is defined as

    $\bm{s}^{l}(t)=\{\bm{s}_{p,1}^{l}(t),\ldots,\bm{s}_{p,M}^{l}(t),\bm{s}_{w,1}^{l}(t),\ldots,\bm{s}_{w,N}^{l}(t),\bm{s}_{c,1}^{l}(t),\ldots,\bm{s}_{c,M}^{l}(t)\},$   (33)

    where $\bm{s}_{p,m}^{l}(t)=\{E^{\text{tot}}_{h,m}(t),T-\alpha(t)\}$ represents the state information of HAP$_m$, $\bm{s}_{w,n}^{l}=\{D_{n}(t),E_{n}(t),a_{c,n}(t)\}$ represents that of WD$_n$, and $\bm{s}_{c,m}^{l}(t)=\{h_{n,m}(t)|n\in\mathcal{N}\}$ represents the channel gains between the $N$ WDs and HAP$_m$.

  • Observation space: $\bm{o}^{l}_{n}(t)$ denotes the state observable by WD$_n$ from the global state $\bm{s}^{l}(t)$ of the ODO subproblem, i.e.,

    $\bm{o}^{l}_{n}(t)=\{\bm{o}_{p,1}^{l}(t),\ldots,\bm{o}_{p,M}^{l}(t),\bm{o}_{w,1}^{l}(t),\ldots,\bm{o}_{w,N}^{l}(t),\bm{o}_{c,1}^{l}(t),\ldots,\bm{o}_{c,M}^{l}(t)\},$   (34)

    where $\bm{o}_{p,m}^{l}(t)=\{E^{\text{tot}}_{h,m}(t),T-\alpha(t)\}$ represents the observation information of HAP$_m$, $\bm{o}_{w,n}^{l}=\{D_{n}(t),E_{n}(t),a_{c,n}(t)\}$ represents the observation information of WD$_n$, and $\bm{o}_{c,m}^{l}(t)=\{h_{n,m}(t)|n\in\mathcal{N}\}$ represents the channel gains between the $N$ observable WDs and HAP$_m$. The size of $\bm{o}^{l}_{n}(t)$ for WD$_n$ is the same as that of $\bm{s}^{l}(t)$. If HAP$_m$ is located outside the transmission zone of WD$_n$, the observation information of HAP$_m$ is considered as $\bm{o}_{p,m}^{l}(t)=\{0,0\}$, and the channel gain between WD$_n$ and HAP$_m$ is considered as $h_{n,m}(t)=0$ in $\bm{o}_{c,m}^{l}(t)$. As WD$_n$ cannot capture the observation information of the other WDs, the corresponding observation information of the other WDs is considered as $\bm{o}_{w,k}^{l}(t)=\{0,0,0\}$ for $k\in\mathcal{N}$, $k\neq n$ in (34).

  • Action space: $\mathcal{A}_{n}^{l}(t)=\{x_{n}(t)\}$ represents the action space of WD$_n$ at time slot $t$. $x_{n}(t)=m$, $m\in\mathcal{M}$ represents that WD$_n$ in edge computing mode offloads the computation data to HAP$_m$ at time slot $t$, and $x_{n}(t)=0$ represents that WD$_n$ in local computing mode processes the computation data locally at time slot $t$.

  • Reward function: In view of observation 𝒐nl(t)\bm{o}^{l}_{n}(t), WDn outputs offloading policy 𝒂nl(t)\bm{a}^{l}_{n}(t) to interact with the environment, and receives reward rnl(t)r^{l}_{n}(t) as

    rnl(t)={u(ac,n(t)Eo,n(t)+emDn(t)),ifn𝒩m(t),m;uac,n(t)El,n(t),ifn𝒩0(t);0,otherwise,{r^{l}_{n}(t)}=\begin{cases}u-(a_{c,n}(t)E_{o,n}(t)+\!\!\!\!\!\!\!&e_{m}D_{n}(t)),\\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ {\text{if}}\ n\in\!\!\!\!&\mathcal{N}_{m}(t),m\in\mathcal{M};\\ u-a_{c,n}(t)E_{l,n}(t),&{\text{if}}\ n\in\mathcal{N}_{0}(t);\\ 0,&\text{otherwise},\end{cases} (35)

    where uu denotes a constant for the non-negative reward. The value of uu needs to satisfy two requirements: 1) u>max{ac,n(t)Eo,n(t)+emDn(t),ac,n(t)El,n(t)}u>\max\{a_{c,n}(t)E_{o,n}(t)+e_{m}D_{n}(t),a_{c,n}(t)E_{l,n}(t)\} for t[0,𝒯1]t\in[0,\mathcal{T}-1]; 2) the value of uu and those of ac,n(t)Eo,n(t)+emDn(t)a_{c,n}(t)E_{o,n}(t)+e_{m}D_{n}(t) and ac,n(t)El,n(t)a_{c,n}(t)E_{l,n}(t) are similar in the order of magnitude111We set the value of uu as the average upper bound of the energy provision by HAPs for the WP-MEC network with a single WD in edge computing mode over all time slots. Namely, when HAPs with the maximum transmit power PmaxP_{\max} provide energy for the WD in edge computing mode during 𝒯\mathcal{T} time slots, the average value over all time slots, including the energy consumed by HAPs for broadcasting RF signals and that for processing the received computation data from the WD, is defined as Ψ¯=𝒯MPmaxT+maxm{em}t=1𝒯𝔼(Dn(t))𝒯,{}\begin{split}\overline{\Psi}=\frac{\mathcal{T}MP_{\max}T+\mathop{max}\limits_{m}\{e_{m}\}\sum_{t=1}^{\mathcal{T}}\mathbb{E}\left(D_{n}(t)\right)}{\mathcal{T}},\end{split} (36) where 𝔼(Dn(t))\mathbb{E}\left(D_{n}(t)\right) represents the expectation of offloaded bits from WDn at time slot tt.. The TMADO framework aims to maximize the cumulative sum of (35), i.e., the total reward of low-level agents, by minimizing the energy provision by HAPs in (19a). In the solution of 𝐏𝟎\mathbf{P0}, HAPs only need to provide the required amount of energy for WDs in local computing mode or edge computing mode in (19a).

Based on (22) and (24), HAPs determine the feasible solutions of the offloading decisions (\mathcal{N}-\mathcal{N}_{\text{-1}}(t)), and broadcast them at the beginning of time slot t. Then WDs receive the feasible solutions of the offloading decisions, and output actions \bm{a}^{l}(t) based on local observations. For WDn, n\in\mathcal{N}, the ODO subproblem aims to find the optimal policy \pi^{l*}_{\bm{\theta}^{l}_{a,n}} that maximizes the long-term accumulated discounted reward as

\max_{\pi^{l}_{\bm{\theta}^{l}_{a,n}}}\ \mathbb{E}\left[\sum_{t=0}^{\mathcal{T}-1}(\gamma^{l})^{t}\,r^{l}_{n}(t)\right], (37)

where \bm{\theta}^{l}_{a,n} denotes the parameters of the actor network of low-level agent n.

IV-B1 Architecture Design of Low-level Agents

For the architecture design of low-level agents, IPPO is employed. IPPO is a multi-agent actor-critic architecture consisting of actor networks and critic networks. The number of actor networks of low-level agents equals the number of WDs. The actor network of low-level agent n with parameters \bm{\theta}^{l}_{a,n}, n\in\mathcal{N}, outputs policy \pi_{\bm{\theta}^{l}_{a,n}}^{l}(\bm{a}^{l}_{n}(t)|\bm{o}^{l}_{n}(t)), which is the predicted distribution of action \bm{a}^{l}_{n}(t) given local observation \bm{o}^{l}_{n}(t). By adding a softmax function [40], the actor network of low-level agent n outputs transition probability \text{Pr}_{i}(\bm{o}^{l}_{n}(t)) of WDn's actions to provide the discrete action, which follows the categorical distribution as

\pi^{l}_{\bm{\theta}^{l}_{a,n}}\left(\bm{a}^{l}_{n}(t)\mid\bm{o}^{l}_{n}(t)\right)=\prod_{i=1}^{M+1}\text{Pr}_{i}\left(\bm{o}^{l}_{n}(t)\right)^{I\left\{\bm{a}^{l}_{n}(t)=i\right\}}, (38)

and

\sum_{i=1}^{M+1}\text{Pr}_{i}\left(\bm{o}^{l}_{n}(t)\right)=1, (39)

where M+1 represents the number of WDn's actions in the action space \mathcal{A}_{n}^{l}(t).
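To make the categorical policy in (38)-(39) concrete, the following PyTorch sketch shows one possible actor head for a low-level agent: a small network whose softmax output gives \text{Pr}_{i}(\bm{o}^{l}_{n}(t)) over the M+1 discrete actions. The class name, layer sizes, and hidden width are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class LowLevelActor(nn.Module):
    """Maps a local observation o_n^l(t) to a categorical distribution over
    the M+1 offloading actions: 0 (local computing) or m in {1, ..., M}."""
    def __init__(self, obs_dim, num_haps, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_haps + 1),
        )

    def forward(self, obs):
        probs = torch.softmax(self.net(obs), dim=-1)  # Pr_i(o), sums to 1 as in (39)
        return Categorical(probs)                     # pmf matches (38)

# Usage: dist = actor(obs); a = dist.sample(); logp = dist.log_prob(a)
```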

Besides, the number of critic networks of low-level agents equals the number of WDs. The critic network of low-level agent n with parameters \bm{\theta}^{l}_{c,n}, n\in\mathcal{N}, evaluates the state value function V_{n}(t) in (40), and guides the update of the actor network during the training process.

IV-B2 Training Process of Low-level Agents

For the training and execution of low-level agents, the centralized training and decentralized execution (CTDE) mechanism is employed. CTDE effectively tackles the non-stationarity issue in multi-agent DRL algorithms by reducing the impact of interference from irrelevant state information of other agents, and shows promising performance in distributed scenarios [41]. To employ CTDE, we consider that the server of the cloud center assists in training the critic networks and actor networks of low-level agents in a centralized manner [42], and the trained low-level agents determine actions in a distributed manner. During centralized training, the cloud computing layer has fully-observable access to the states, actions, and rewards of all low-level agents. During decentralized execution, each low-level agent has only partially-observable access to them, i.e., it relies on its local observations to output the action for computation offloading, without the states, actions, and rewards of other WDs.

We introduce the critic networks and actor networks of low-level agents in detail. The critic networks estimate the unknown state value functions to generate update rules for the actor networks. The actor networks output policies \pi_{\bm{\theta}^{l}_{a,n}}^{l}(\bm{a}^{l}_{n}(t)|\bm{o}^{l}_{n}(t)), n\in\mathcal{N}, to maximize the fitted state values.

The objective function for the critic network of low-level agent nn at time slot tt, i.e., the state value function, is formulated as

V_{n}(t)=V_{n}\left(\bm{o}^{l}_{n}(t)\right)=\mathbb{E}\left[\sum_{t^{\prime}=t}^{\mathcal{T}-1}(\gamma^{l})^{t^{\prime}-t}\,r^{l}_{n}(t^{\prime})\right], (40)

where \gamma^{l}\in[0,1] is the discount factor. The advantage function of low-level agent n at time slot t, i.e., the temporal-difference estimate of how much the chosen action outperforms the current value baseline, which reduces the variance of policy updates, is formulated as

A_{n}(t)=A_{n}\left(\bm{o}^{l}_{n}(t),\bm{a}^{l}_{n}(t)\right)=r^{l}_{n}(t)+\gamma^{l}V_{n}(t+1)-V_{n}(t). (41)

The state value function V_{n}\left(\bm{o}^{l}_{n}(t)\right) in (40) is updated by the mean squared error loss function as

\mathcal{L}^{l}_{n}(t)=\mathbb{E}\left[\left(r^{l}_{n}(t)+\gamma^{l}V_{n}\left(\bm{o}^{l}_{n}(t+1)\right)-V_{n}\left(\bm{o}^{l}_{n}(t)\right)\right)^{2}\right]. (42)
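A minimal sketch of the one-step targets in (41)-(42), assuming `critic` is a network mapping observations to scalar values; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def advantage_and_critic_loss(critic, obs_t, obs_tp1, r_t, gamma_l):
    """TD advantage A_n(t) in (41) and the MSE critic loss L_n^l(t) in (42)."""
    v_t = critic(obs_t)                     # V_n(o_n^l(t))
    with torch.no_grad():
        v_tp1 = critic(obs_tp1)             # V_n(o_n^l(t+1)), a fixed bootstrap target
    target = r_t + gamma_l * v_tp1          # one-step return
    advantage = (target - v_t).detach()     # A_n(t), detached for the actor update
    critic_loss = F.mse_loss(v_t, target)   # L_n^l(t)
    return advantage, critic_loss
```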

Based on (41), the objective function of the actor network of low-level agent n at time slot t, i.e., the surrogate objective of low-level agent n, is formulated as

J^{l}\left(\pi^{l}_{\bm{\theta}^{l}_{a,n}}\right)=\mathbb{E}_{(\bm{s}^{l},\bm{a}^{l})}\left[\min\left(\varrho(t)A_{n}(t),\ \operatorname{clip}\left(\varrho(t),1-\epsilon,1+\epsilon\right)A_{n}(t)\right)\right], (43)

where \varrho(t)=\frac{\pi^{l}_{\bm{\theta}^{l}_{a,n}}\left(\bm{a}^{l}_{n}(t)\mid\bm{o}^{l}_{n}(t)\right)}{\pi^{l}_{\bm{\theta}^{l,\text{old}}_{a,n}}\left(\bm{a}^{l}_{n}(t)\mid\bm{o}^{l}_{n}(t)\right)} represents the truncated importance sampling factor, \pi^{l}_{\bm{\theta}^{l}_{a,n}}\left(\bm{a}^{l}_{n}(t)\mid\bm{o}^{l}_{n}(t)\right) represents the current policy, \pi^{l}_{\bm{\theta}^{l,\text{old}}_{a,n}}\left(\bm{a}^{l}_{n}(t)\mid\bm{o}^{l}_{n}(t)\right) represents the old policy, the clip function restricts \varrho(t) to the range [1-\epsilon,1+\epsilon] when training the actor network of low-level agent n, and \epsilon is a positive hyperparameter. To evaluate the difference between the old and current policies, the surrogate objective of low-level agent n uses the importance sampling strategy [40], which treats samples from the old policy as surrogates of new samples when training the actor network of low-level agent n.
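The clipped surrogate in (43) translates directly into a few lines of PyTorch. This sketch (hypothetical names) returns the negated objective so that standard gradient descent performs the maximization.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO surrogate J^l in (43); varrho(t) is the importance ratio between
    the current policy and the stored old policy."""
    rho = torch.exp(logp_new - logp_old)             # varrho(t)
    unclipped = rho * advantage
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()     # negate: minimize -J^l
```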

IV-C RO Subproblem

With given \{\bm{P}_{h}(t), \alpha(t), \mathcal{N}_{\text{-1}}(t), \bm{\mathcal{X}}(t)\}, we analyze the RO subproblem as follows. We first consider the case where WDn chooses local computing mode at time slot t, i.e., n\in\mathcal{N}_{0}(t).

  Lemma 3

With given \bm{P}_{h}(t), \alpha(t), \bm{\mathcal{X}}(t), the optimal computing delay \tau_{l,n}(t), n\in\mathcal{N}_{0}(t), and the CPU frequency f_{n}(t) of WDn in local computing mode at time slot t are respectively given as

\tau_{l,n}^{*}(t)=T, (44)

and

f_{n}^{*}(t)=\frac{C_{n}D_{n}(t)}{T}. (45)
Proof:

With given \bm{\mathcal{X}}(t), based on (3), the initial energy in WDn at time slot t+1 is determined by the initial energy in WDn and the energy consumption of WDn for local computing at time slot t. With given \bm{P}_{h}(t), \alpha(t), the initial energy in WDn at time slot t+1 decreases with the energy consumed by WDn for local computing at time slot t, which, based on (6) and (8), is given as

E_{l,n}(t)=k_{n}C_{n}D_{n}(t)f^{2}_{n}(t)=\frac{k_{n}\left(C_{n}D_{n}(t)\right)^{3}}{\tau_{l,n}(t)^{2}}. (46)

E_{l,n}(t) decreases with \tau_{l,n}(t), so the optimal computing delay is the largest feasible one, i.e., \tau_{l,n}^{*}(t)=T in (44). Based on (6) and the monotonicity of E_{l,n}(t) with respect to \tau_{l,n}(t), we then have (45). ∎
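A small numerical sketch of the closed forms (44)-(46) may help (function name hypothetical): the WD runs at the slowest CPU frequency that still meets the slot-long deadline, which minimizes the local computing energy.

```python
def local_computing_optimum(C_n, D_n, T, k_n, f_max):
    """Closed forms (44)-(45) and the resulting energy (46) at tau = T."""
    f_star = C_n * D_n / T                 # (45): slowest feasible CPU frequency
    if f_star > f_max:
        raise ValueError("computing delay constraint infeasible at f_max")
    E_l = k_n * C_n * D_n * f_star ** 2    # (46) evaluated at tau_{l,n} = T
    return f_star, E_l

# e.g. C_n = 1e3 cycle/bit, D_n = 5e4 bit, T = 0.4 s, k_n = 1e-27
# -> f* = 0.125 GHz, E_l ~= 0.78 mJ
```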

Then we consider the case where WDn chooses edge computing mode at time slot t, i.e., n\in\mathcal{N}_{m}(t).

  Lemma 4

With given \bm{P}_{h}(t), \alpha(t), \bm{\mathcal{X}}(t), the optimal duration \tau_{o,n,m}(t), n\in\mathcal{N}_{m}(t), m\in\mathcal{M}, for which WDn offloads the computation data to HAPm at time slot t is

\tau_{o,n,m}^{*}(t)=\frac{vD_{n}(t)}{B\log_{2}\left(1+\frac{P_{n}h_{n,m}(t)}{N_{0}}\right)}. (47)
Proof:

Based on Lemma 3 and (19i), the amount of the energy consumed by WDn for edge computing is

E_{o,n}(t)=\left(P_{n}+P_{c,n}\right)\tau_{o,n,m}(t). (48)

It is easy to observe that E_{o,n}(t) increases with \tau_{o,n,m}(t). Based on (11), (19d), (19f), and the monotonicity of E_{o,n}(t) with respect to \tau_{o,n,m}(t), the minimum duration in (47) ensures that WDn in edge computing mode successfully offloads its D_{n}(t) bits of computation data to HAPm. ∎
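Analogously, the closed forms (47)-(48) for edge computing mode can be sketched as follows (hypothetical names; v is the offloading overhead factor used in the paper).

```python
import math

def offloading_optimum(D_n, v, B, P_n, h_nm, N0, P_cn):
    """Minimum offloading duration (47) and the WD-side energy (48)."""
    rate = B * math.log2(1.0 + P_n * h_nm / N0)   # achievable uplink rate (bit/s)
    tau_star = v * D_n / rate                     # (47): just enough time for v*D_n bits
    E_o = (P_n + P_cn) * tau_star                 # (48): transmit plus circuit power
    return tau_star, E_o
```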

Input: Parameters of the high-level agent's critic network, actor network, target critic network, and target actor network \bm{\theta}^{h}_{c}, \bm{\theta}^{h}_{a}, \bm{\theta}^{h}_{c,t}, and \bm{\theta}^{h}_{a,t}; parameters of the low-level agents' critic networks and actor networks \bm{\theta}^{l}_{c,n} and \bm{\theta}^{l}_{a,n}, n\in\mathcal{N}; and experience replay buffers.
Output: Optimal policies \pi^{h} and \{\pi^{l}_{\bm{\theta}^{l}_{a,n}}\}_{n=1}^{N}.
1  Randomly initialize \bm{\theta}^{h}_{c}, \bm{\theta}^{h}_{a}, \bm{\theta}^{h}_{c,t}, \bm{\theta}^{h}_{a,t}, \bm{\theta}^{l}_{c,n} and \bm{\theta}^{l}_{a,n}, n\in\mathcal{N}, and the experience replay buffers;
2  for Iter_1 = 1, 2, ..., Iter_max do
3    Clean experience replay buffers;
4    for t = 0, 1, ..., \mathcal{T}-1 do
5      Get global state \bm{s}^{h}(t) of the WCDO subproblem;
6      The high-level agent outputs action \bm{a}^{h}(t)\sim\pi^{h};
7      Interact with the environment by \bm{a}^{h}(t);
8      Distribute \bm{a}^{h}(t) to low-level agents;
9      for n = 1, 2, ..., N do
10       Get local observation \bm{o}^{l}_{n}(t);
11       Low-level agent n outputs action \bm{a}^{l}_{n}(t)\sim\pi_{\bm{\theta}^{l}_{a,n}}^{l};
12       Interact with the environment by \bm{a}^{l}_{n}(t);
13       Compute reward r^{l}_{n}(t) according to (35);
14       Store the tuple \left(\bm{o}^{l}_{n}(t),\bm{a}^{l}_{n}(t),r^{l}_{n}(t)\right) and state \bm{s}^{l}(t+1) in the experience replay buffer of low-level agent n;
15     end for
16     Compute reward r^{h}(t) according to (26);
17     Store the tuple \left(\bm{s}^{h}(t),\bm{a}^{h}(t),r^{h}(t),\bm{s}^{h}(t+1)\right) in the experience replay buffer of the high-level agent;
18     Update the critic network, the actor network, the target critic network, and the target actor network of the high-level agent;
19   end for
20   Store current policy \pi^{l}_{\bm{\theta}^{l,\text{old}}_{a,n}}\leftarrow\pi_{\bm{\theta}^{l}_{a,n}}^{l} for each low-level agent;
21   for k = 1, 2, ..., M_1 do
22     for n = 1, 2, ..., N do
23       Compute the state value function according to (40);
24       Compute the advantage function according to (41);
25       Update the critic network and the actor network of low-level agent n;
26     end for
27   end for
28 end for
Algorithm 1 TMADO Framework to Solve \mathbf{P}_{0}

IV-D Proposed TMADO Framework

Algorithm 1 shows the pseudo-code of the proposed TMADO framework. To be specific, we initialize the actor-critic network parameters and the experience replay buffers of the high-level agent and low-level agents in line 1. Then we start a loop for sampling and training in line 2. The high-level agent outputs its action by the DDPG algorithm in lines 5-6. After interacting with the environment in line 7, the action of the high-level agent \mathcal{A}^{h}(t) in (23) is broadcast to low-level agents, and used as the starting point for the low-level agents' action exploration in line 8. Low-level agents output actions by the IPPO algorithm in lines 10-11. After interacting with the environment in line 12, low-level agent n computes its reward and stores the sampled experience \left(\bm{o}^{l}_{n}(t),\bm{a}^{l}_{n}(t),r^{l}_{n}(t)\right) and \bm{s}^{l}(t+1) in its experience replay buffer in lines 13-14. The high-level agent computes its reward and stores the sampled experience \left(\bm{s}^{h}(t),\bm{a}^{h}(t),r^{h}(t),\bm{s}^{h}(t+1)\right) in its experience replay buffer in lines 16-17. Then the critic network, the actor network, the target critic network, and the target actor network of the high-level agent are updated in line 18. We update the critic networks and actor networks of low-level agents M_1 times for sample reuse, as in PPO [38], in lines 20-27. Each time, the experience replay buffer is traversed to conduct mini-batch training.
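To illustrate how the two stages of Algorithm 1 interleave within a time slot, here is a schematic Python skeleton of one sampling episode (lines 4-18). All environment and agent interfaces (env, act, store, update) are hypothetical stand-ins, not an implementation of the paper's simulator.

```python
def tmado_episode(env, high_agent, low_agents, num_slots):
    """One sampling episode of Algorithm 1: a DDPG high-level agent and IPPO
    low-level agents; the low-level PPO updates happen after the episode."""
    s_h = env.global_state()
    for t in range(num_slots):
        a_h = high_agent.act(s_h)                 # transmit power P_h(t), WPT duration alpha(t)
        env.apply_high_action(a_h)                # run the WPT phase
        for n, agent in enumerate(low_agents):
            o_n = env.local_observation(n)
            a_n = agent.act(o_n)                  # offloading decision x_n(t)
            r_n = env.step_low_level(n, a_n)      # reward per (35)
            agent.buffer.store(o_n, a_n, r_n)
        s_h_next, r_h = env.finish_slot()         # high-level reward per (26)
        high_agent.buffer.store(s_h, a_h, r_h, s_h_next)
        high_agent.update()                       # off-policy DDPG update each slot
        s_h = s_h_next
```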

IV-E Computation Complexity Analysis

We provide the computational complexity analysis of the training process as follows. For the high-level agent, let L_{a}, L_{c}, n_{l_{a}}, n_{l_{c}}, and T_{\text{ep}} denote the number of layers of the actor network, the number of layers of the critic network, the number of neurons in layer l_{a} of the actor network, the number of neurons in layer l_{c} of the critic network, and the number of episodes, respectively. Then the complexity of the high-level agent is O(T_{\text{ep}}\mathcal{T}(\sum_{l_{a}=0}^{L_{a}-1}n_{l_{a}}n_{l_{a}+1}+\sum_{l_{c}=0}^{L_{c}-1}n_{l_{c}}n_{l_{c}+1})) [39]. Since low-level agents adopt mini-batches of experience to train policies by maximizing the discounted reward in (37), based on (43), we adopt the sample complexity in [41] to characterize the convergence rate by achieving

\mathbb{E}\left[\left\lVert\nabla_{\pi^{l}_{\bm{\theta}^{l}_{a,n}}}J^{l}\left(\pi^{l}_{\bm{\theta}^{l}_{a,n}}\right)\right\rVert^{2}\right]\leq\epsilon. (49)

Then the sample complexity of low-level agents is O(\epsilon^{-2}\log(\epsilon^{-1})) [41]. The complexity of solving (44)-(47) is O(1). In summary, the overall computational complexity of the training process is O(T_{\text{ep}}\mathcal{T}(\sum_{l_{a}=0}^{L_{a}-1}n_{l_{a}}n_{l_{a}+1}+\sum_{l_{c}=0}^{L_{c}-1}n_{l_{c}}n_{l_{c}+1})+\epsilon^{-2}\log(\epsilon^{-1})).

V Simulation Results

TABLE II: Simulation Parameters

Network model: M = 3, N = 10; \mathcal{T} = 100, T = 0.4 s; R_{t} = 25 m, E_{b} = 100 mJ; \lambda = 50, D_{p} = 10^{3} bit; D_{th} = 3.5\times10^{5} bit
EH model: \mu = 0.51, P_{\max} = 3 W
Local computing mode [23]: k_{n} = 10^{-27}, C_{n} = 10^{3} cycle/bit, f_{\max} = 0.3 GHz
Edge computing mode: B = 1 MHz, N_{0} = 10^{-9} W [24]; P_{n} = 0.1 W, P_{c,n} = 10^{-3} W; e_{m} = 1\times10^{-6} J/bit [34], v = 1.1
High-level agent (DDPG): \bm{\theta}^{h}_{a} = 10^{-5}, \bm{\theta}^{h}_{c} = 10^{-5}; v_{a} = 10^{-4}, v_{c} = 10^{-4}; \gamma^{h} = 0.95
Low-level agents (IPPO): \bm{\theta}^{l}_{a} = 10^{-5}, \bm{\theta}^{l}_{c} = 10^{-5}; \gamma^{l} = 0.99, \epsilon = 0.2, u = 3.65
Figure 4: The energy provision (EP) by HAPs versus u.
Figure 5: The EP by HAPs under the proposed TMADO scheme versus training episodes.
Figure 6: The mean episode rewards of WDs under the proposed TMADO scheme versus training episodes.

In this section, we consider a WP-MEC network deployed over a 100 m × 100 m area [42], where HAPs are uniformly distributed and WDs are randomly distributed. The distance between WDn and HAPm is d_{n,m}, and the large-scale fading component is \sigma_{L,n,m}=A_{d}\left(\frac{3\times10^{8}}{4\pi f_{\text{cf}}d_{n,m}}\right)^{d_{e}} [14], where A_{d}=4.11 is the antenna gain, f_{\text{cf}}=915 MHz is the carrier frequency, and d_{e}=2 is the path loss exponent. The small-scale Rayleigh fading follows an exponential distribution with unit mean. The number of arrived data packets at WDs follows an independent Poisson point process with rate \lambda=50. Other simulation parameters are given in TABLE II.
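For reproducibility, the channel model described above can be sampled as follows; this is a sketch under the stated parameters (A_d = 4.11, f_cf = 915 MHz, d_e = 2), with hypothetical function names.

```python
import numpy as np

def channel_gain(d_nm, A_d=4.11, f_cf=915e6, d_e=2.0, rng=None):
    """h_{n,m}(t): large-scale path loss times unit-mean exponential
    (i.e., Rayleigh-power) small-scale fading."""
    rng = rng or np.random.default_rng()
    sigma_L = A_d * (3e8 / (4.0 * np.pi * f_cf * d_nm)) ** d_e  # large-scale component
    return sigma_L * rng.exponential(1.0)                       # small-scale fading
```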

In the following, we first show the convergence of the proposed TMADO scheme in Figs. 4-6, and then reveal the impacts of crucial parameters, such as the number of HAPs/WDs and the radius of the transmission zone, on the energy provision by HAPs in Figs. 7-14. To evaluate the proposed TMADO scheme in terms of the energy provision by HAPs, we provide five comparison schemes as follows.

  • Proximal policy optimization (PPO)-TMADO scheme: We use PPO [43] to obtain \{\bm{P}_{h}(t),\alpha(t),\mathcal{N}_{\text{-1}}(t)\}, and use the proposed TMADO scheme to obtain \{\bm{\mathcal{X}}(t),\bm{\tau}_{o}(t),\bm{f}(t)\}.

  • TMADO-MADDPG scheme: MADDPG is DDPG with the centralized state value function [25]. We use the proposed TMADO scheme to obtain \{\bm{P}_{h}(t),\alpha(t),\mathcal{N}_{\text{-1}}(t)\}, and use MADDPG to obtain \{\bm{\mathcal{X}}(t),\bm{\tau}_{o}(t),\bm{f}(t)\}.

  • TMADO-MAPPO scheme: MAPPO is PPO with the centralized state value function [40]. We use the proposed TMADO scheme to obtain \{\bm{P}_{h}(t),\alpha(t),\mathcal{N}_{\text{-1}}(t)\}, and use MAPPO to obtain \{\bm{\mathcal{X}}(t),\bm{\tau}_{o}(t),\bm{f}(t)\}.

  • TMADO-random edge computing (REC) scheme: All WDs offload the computation data to randomly chosen HAPs. We set the radius of the transmission zone R_{t}=+\infty. Besides, we use the proposed TMADO scheme to obtain \{\bm{P}_{h}(t),\alpha(t),\mathcal{N}_{\text{-1}}(t)\}.

  • TMADO-local computing (LC) scheme: All WDs adopt local computing mode. Besides, we use the proposed TMADO scheme to obtain \{\bm{P}_{h}(t),\alpha(t),\mathcal{N}_{\text{-1}}(t)\}.

Fig. 4 plots the energy provision (EP) versus u. The hyperparameter u directly determines the reward of a WD that successfully processes its computation data. Since the number of arrived data packets at WDn, i.e., D_{n}(t), follows an independent Poisson point process with rate \lambda=50 in the simulation, the value of u in (36) reduces to MP_{\max}T+\max_{m}\{e_{m}\}D_{p}\lambda=3.6+0.05=3.65. We observe that u=3.65 reaches the minimum EP, which validates the rationality and superiority of the selected u. The reason is that, when u is small, WDs tend to select actions that fail to process the computation data, and then HAPs need to provide more energy to satisfy the computation data demand. When u is large, WDs tend to select actions that successfully process the computation data, but find it hard to make good offloading decisions, because the reward of selecting edge computing mode and that of selecting local computing mode become indistinguishable. Hence HAPs also need to provide more energy to satisfy the computation data demand.

Fig. 5 plots the EP versus training episodes with four values of the learning rate of the high-level agent. We observe that the proposed TMADO scheme with learning rate 2\times10^{-5} achieves smaller EP than that with learning rates 2\times10^{-3} and 2\times10^{-4}, and converges faster than that with learning rate 2\times10^{-6}. Thus we adopt 2\times10^{-5} as the learning rate of the high-level agent. Fig. 6 plots the mean episode rewards of WDs versus training episodes. We observe that the mean episode rewards of WDs converge after 450 training episodes. This observation verifies the advantage of the TMADO scheme in enabling WDs to achieve a stable policy.

Fig. 7 plots the EP under the proposed TMADO scheme versus the number of HAPs M and that of WDs N. We observe that the EP decreases with N. The reason is that the increase of N means that more WDs can harvest energy from HAPs, and accordingly more WDs have sufficient energy to process the computation data in local computing mode or edge computing mode. Thus, with fixed D_{th}, HAPs can provide less energy for WDs while still satisfying the computing delay and computation data demand constraints. We also observe that the EP decreases with M. The reason is that the increase of M reduces the average distance between HAPs and WDs, and increases the probability that WDs successfully process the computation data in edge computing mode. Thus HAPs can provide less energy for WDs while satisfying the computing delay and computation data demand constraints.

Figure 7: The EP by HAPs under the proposed TMADO scheme versus the number of HAPs M and that of WDs N.
Figure 8: The EP by HAPs under six schemes versus the number of WDs N with M=6.
Figure 9: The EP by HAPs under six schemes versus the number of HAPs M with N=40.
Figure 10: The EP by HAPs under six schemes versus the transmission bandwidth B.
Figure 11: The EP by HAPs under six schemes versus the computation data demand D_{th}.
Figure 12: The EP by HAPs under the proposed TMADO scheme versus the radius of the transmission zone R_{t} with M=6 and N=40.

Fig. 8 and Fig. 9 plot the EP under six schemes versus the number of WDs N with M=6, and versus the number of HAPs M with N=40, respectively. We observe that the proposed TMADO scheme achieves the minimum EP among all schemes, which validates its superiority in terms of EP. The reason is that DDPG takes advantage of learning continuous policies to solve the WCDO subproblem, while IPPO employs the individual state value function of each WD to reduce the impact of interference from irrelevant state information of other WDs, which makes it better suited to the ODO subproblem than algorithms with a centralized state value function of WDs, i.e., MADDPG and MAPPO. We also observe that the EP under the TMADO-REC scheme is larger than that under the TMADO-LC scheme for N<35 in Fig. 8 and M<5 in Fig. 9, but smaller for N\geq35 in Fig. 8 and M\geq5 in Fig. 9. The reason is that, when N or M is small, the average distance between HAPs and WDs is large, so WDs under the TMADO-REC scheme need more energy to offload the computation data to random HAPs while satisfying the computing delay and computation data demand constraints. When N or M is large, the average distance between HAPs and WDs becomes small, which increases the number of HAPs accessible to WDs. Accordingly, HAPs can provide less energy for WDs under the TMADO-REC scheme while satisfying the computing delay and computation data demand constraints.

Figure 13: (a) The EP by HAPs under six schemes versus the maximum CPU frequency of WDs f_{\max}. (b) The ratio of the number of WDs in local computing mode to the total number of WDs that process computation data (RLC) under the proposed TMADO scheme versus the maximum CPU frequency of WDs f_{\max}.

Fig. 10 plots the EP versus the transmission bandwidth B. We observe that the EP under the proposed TMADO scheme is the minimum, which validates the advantage of the proposed TMADO scheme in terms of EP. We also observe that the EP under the TMADO-LC scheme remains unchanged with the transmission bandwidth, while the EP under the other schemes decreases with it. The reason is that the increase of the transmission bandwidth shortens the duration for which WDs offload the computation data to HAPs, and thus decreases the energy consumed by WDs in edge computing mode. As edge computing mode is not considered in the TMADO-LC scheme, the EP under the TMADO-LC scheme remains unchanged with the offloading bandwidth.

Figure 14: (a) The EP by HAPs under six schemes versus the maximum transmit power of HAPs P_{\max}. (b) The RLC under the proposed TMADO scheme versus the maximum transmit power of HAPs P_{\max}.

Fig. 11 plots the EP versus the computation data demand D_{th}. We observe that the EP increases with D_{th}. With a higher value of D_{th}, WDs need to process a larger amount of computation data in local computing mode or edge computing mode and make more offloading decisions, which requires more EP.

Fig. 12 plots the EP versus the radius of the transmission zone R_{t}. We observe that the EP first decreases and then increases with R_{t}. The value of R_{t} influences both the number of HAPs accessible to WDs and the observation space of the low-level agents at WDs, and both grow with R_{t}. When R_{t} is small, the accessible HAPs are not enough for WDs to find the optimal offloading decisions, so the number of accessible HAPs is the key factor influencing the EP by HAPs; hence the EP first decreases with R_{t}. When R_{t} is large, more HAPs far away from WDs are included in the transmission zones, which means that more redundant observation information is included in the observation space of WDs, i.e., \bm{o}^{l}_{n}(t), n\in\mathcal{N}, in Section IV-B. Although the accessible HAPs are then sufficient for finding the optimal offloading decisions, the enlarged observation space makes it difficult for the low-level agents at WDs to find them. Hence the EP by HAPs then increases with R_{t}.

Fig. 13(a) plots the EP versus the maximum CPU frequency of WDs f_{\max}. We observe that the EP under the TMADO-REC scheme remains unchanged, while the EP under the other schemes decreases with f_{\max} for f_{\max}\leq0.2 GHz and remains unchanged for f_{\max}>0.2 GHz. The reason is that WDs under the TMADO-REC scheme are always in edge computing mode, whose energy consumption is independent of the maximum CPU frequency of WDs. Besides, when f_{\max} is small, the computing delay constraint for WDs in local computing mode cannot be satisfied, and f_{\max} is the bottleneck preventing WDs from choosing local computing mode. With the increase of f_{\max}, WDs have higher probabilities of choosing local computing mode, i.e., the ratio of the number of WDs in local computing mode to the total number of WDs that process computation data (RLC) becomes larger, as shown in Fig. 13(b). Hence the EP first decreases with f_{\max}. However, once f_{\max} reaches a certain value, such as 0.2 GHz in Fig. 13(a), the optimal f_{n}^{*}(t) in (45) lies in the range [0,f_{\max}], and the RLC is no longer affected by f_{\max}, as shown in Fig. 13(b). Hence the EP remains unchanged. We also observe that the EP with f_{\max}<0.15 GHz under the TMADO-LC scheme is not shown in Fig. 13(a), because WDs with small f_{\max} cannot satisfy the computing delay constraint and the computation data demand D_{th}.

Fig. 14(a) plots the EP versus the maximum transmit power of HAPs P_{\max}. We observe that the proposed TMADO scheme achieves the minimum EP among all schemes. We also observe that the EP first decreases and then remains unchanged with P_{\max}. The reason is that, when P_{\max} is small, such as P_{\max}\leq3.75 W in Fig. 14(a), WDs are in the energy-deficit state. In this context, it is best for HAPs to transmit RF signals with the maximum transmit power P_{\max}, so that WDs harvest more energy and have more choices, i.e., local computing mode or edge computing mode. In fact, WDs with more energy, such as WDs close to HAPs (i.e., with high channel gains), prefer local computing so as to avoid the energy consumed by HAPs for processing the offloaded computation data, while WDs with less energy, such as WDs far away from HAPs (i.e., with low channel gains), cannot support local computing and have to offload the computation data to HAPs, which then consume their energy to process it. With the increase of P_{\max}, WDs harvest a larger amount of energy, and more WDs, especially those with high channel gains, have sufficient energy to perform local computing, which reduces the EP by HAPs for processing the offloaded computation data. Hence the RLC increases with P_{\max}, as shown in Fig. 14(b), and accordingly the EP first decreases with P_{\max}. When P_{\max} reaches a certain value, i.e., 3.75 W in Fig. 14(a), WDs have both modes available to process the computation data, and the optimal transmit power of HAPs is no longer the maximum. Hence the EP remains unchanged for P_{\max}\geq3.75 W.

VI Conclusion

This paper studied the long-term energy provision minimization problem in a dynamic multi-HAP WP-MEC network under the binary offloading policy. We formulated the optimization problem by jointly optimizing the transmit power of HAPs, the duration of the WPT phase, the offloading decisions of WDs, the time allocation for offloading, and the CPU frequency for local computing, subject to the energy, computing delay, and computation data demand constraints. To efficiently address the formulated problem in a distributed way, we proposed the TMADO framework, with which each WD optimizes its offloading decision, time allocation for offloading, and CPU frequency for local computing. Simulation results showed that the proposed TMADO framework achieves better performance in terms of the energy provision compared with the five comparison schemes. This paper investigated the fundamental scenario with single-antenna HAPs. As a future research direction, we will study the scenario with multi-antenna HAPs by jointly considering the beamforming technique and channel allocation.

References

  • [1] Y. Chen, Y. Sun, B. Yang, and T. Taleb, “Joint caching and computing service placement for edge-enabled IoT based on deep reinforcement learning,” IEEE Internet Things J., vol. 9, no. 19, pp. 19501–19514, Oct. 2022.
  • [2] X. Wang, J. Li, Z. Ning, Q. Song, L. Guo, S. Guo, and M. S. Obaidat, “Wireless powered mobile edge computing networks: A survey,” ACM Comput. Surv., vol. 55, no. 13, pp. 1–37, Jul. 2023.
  • [3] K. Wang, J. Jin, Y. Yang, T. Zhang, A. Nallanathan, C. Tellambura, and B. Jabbari, “Task offloading with multi-tier computing resources in next generation wireless networks,” IEEE J. Sel. Areas Commun., vol. 41, no. 2, pp. 306–319, Feb. 2023.
  • [4] S. Yue, J. Ren, N. Qiao, Y. Zhang, H. Jiang, Y. Zhang, and Y. Yang, “TODG: Distributed task offloading with delay guarantees for edge computing,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 7, pp. 1650–1665, Jul. 2022.
  • [5] X. Qiu, W. Zhang, W. Chen, and Z. Zheng, “Distributed and collective deep reinforcement learning for computation offloading: A practical perspective,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 5, pp. 1085–1101, May 2021.
  • [6] Z. Wang, X. Mu, Y. Liu, X. Xu, and P. Zhang, “NOMA-aided joint communication, sensing, and multi-tier computing systems,” IEEE J. Sel. Areas Commun., vol. 41, no. 3, pp. 574–588, Mar. 2023.
  • [7] G. Fragkos, N. Kemp, E. E. Tsiropoulou, and S. Papavassiliou, “Artificial intelligence empowered UAVs data offloading in mobile edge computing,” in Proc. IEEE Int. Conf. Commun., Dublin, Ireland, Jun. 2020, pp. 1–7.
  • [8] P. A. Apostolopoulos, G. Fragkos, E. E. Tsiropoulou, and S. Papavassiliou, “Data offloading in UAV-assisted multi-access edge computing systems under resource uncertainty,” IEEE Trans. Mobile Comput., vol. 22, no. 1, pp. 175–190, Jan. 2023.
  • [9] H. Dai, Y. Xu, G. Chen, W. Dou, C. Tian, X. Wu, and T. He, “ROSE: Robustly safe charging for wireless power transfer,” IEEE Trans. Mobile Comput., vol. 21, no. 6, pp. 2180–2197, Jun. 2022.
  • [10] Y. Wu, Y. Song, T. Wang, L. Qian, and T. Q. S. Quek, “Non-orthogonal multiple access assisted federated learning via wireless power transfer: A cost-efficient approach,” IEEE Trans. Commun., vol. 70, no. 4, pp. 2853–2869, Apr. 2022.
  • [11] X. Liu, B. Xu, K. Zheng, and H. Zheng, “Throughput maximization of wireless-powered communication network with mobile access points,” IEEE Trans. Wireless Commun., vol. 22, no. 7, pp. 4401–4415, Jul. 2023.
  • [12] X. Liu, H. Liu, K. Zheng, J. Liu, T. Taleb, and N. Shiratori, “AoI-minimal clustering, transmission and trajectory co-design for UAV-assisted WPCNs,” IEEE Trans. Veh. Technol., early access, Sep. 16, 2024, doi: 10.1109/TVT.2024.3461333.
  • [13] X. Deng, J. Li, L. Shi, Z. Wei, X. Zhou, and J. Yuan, “Wireless powered mobile edge computing: Dynamic resource allocation and throughput maximization,” IEEE Trans. Mobile Comput., vol. 21, no. 6, pp. 2271–2288, Jun. 2022.
  • [14] S. Bi and Y. J. Zhang, “Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177–4190, Jun. 2018.
  • [15] F. Zhou and R. Q. Hu, “Computation efficiency maximization in wireless-powered mobile edge computing networks,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3170–3184, May 2020.
  • [16] X. Chen, W. Dai, W. Ni, X. Wang, S. Zhang, S. Xu, and Y. Sun, “Augmented deep reinforcement learning for online energy minimization of wireless powered mobile edge computing,” IEEE Trans. Commun., vol. 71, no. 5, pp. 2698–2710, May 2023.
  • [17] J. Park, S. Solanki, S. Baek, and I. Lee, “Latency minimization for wireless powered mobile edge computing networks with nonlinear rectifiers,” IEEE Trans. Veh. Technol., vol. 70, no. 8, pp. 8320–8324, Aug. 2021.
  • [18] F. Wang, J. Xu, and S. Cui, “Optimal energy allocation and task offloading policy for wireless powered mobile edge computing systems,” IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2443–2459, Apr. 2020.
  • [19] S. Zeng, X. Huang, and D. Li, “Joint communication and computation cooperation in wireless-powered mobile-edge computing networks with NOMA,” IEEE Internet Things J., vol. 10, no. 11, pp. 9849–9862, Jun. 2023.
  • [20] Y. Ye, L. Shi, X. Chu, R. Q. Hu, and G. Lu, “Resource allocation in backscatter-assisted wireless powered MEC networks with limited MEC computation capacity,” IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10678–10694, Dec. 2022.
  • [21] J. Du, H. Wu, M. Xu, and R. Buyya, “Computation energy efficiency maximization for NOMA-based and wireless-powered mobile edge computing with backscatter communication,” IEEE Trans. Mobile Comput., vol. 23, no. 6, pp. 6954–6970, Jun. 2024.
  • [22] S. Zhang, S. Bao, K. Chi, K. Yu, and S. Mumtaz, “DRL-based computation rate maximization for wireless powered multi-AP edge computing,” IEEE Trans. Commun., vol. 72, no. 2, pp. 1105–1118, Feb. 2024.
  • [23] X. Wang, Z. Ning, L. Guo, S. Guo, X. Gao, and G. Wang, “Online learning for distributed computation offloading in wireless powered mobile edge computing networks,” IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 8, pp. 1841–1855, Aug. 2022.
  • [24] F. Wang, J. Xu, X. Wang, and S. Cui, “Joint offloading and computing optimization in wireless powered mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784–1797, Mar. 2018.
  • [25] H. Zhou, Y. Long, S. Gong, K. Zhu, D. T. Hoang, and D. Niyato, “Hierarchical multi-agent deep reinforcement learning for energy-efficient hybrid computation offloading,” IEEE Trans. Veh. Technol., vol. 72, no. 1, pp. 986–1001, Jan. 2023.
  • [26] G. Hong, B. Yang, W. Su, H. Li, Z. Huang, and T. Taleb, “Joint content update and transmission resource allocation for energy-efficient edge caching of high definition map,” IEEE Trans. Veh. Technol., vol. 73, no. 4, pp. 5902–5914, Apr. 2024.
  • [27] Q. Shafiee, Č. Stefanović, T. Dragičević, P. Popovski, J. C. Vasquez, and J. M. Guerrero, “Robust networked control scheme for distributed secondary control of islanded microgrids,” IEEE Trans. Ind. Electron., vol. 61, no. 10, pp. 5363–5374, Oct. 2014.
  • [28] R. Sun, H. Wu, B. Yang, Y. Shen, W. Yang, X. Jiang, and T. Taleb, “On covert rate in full-duplex D2D-enabled cellular networks with spectrum sharing and power control,” IEEE Trans. Mobile Comput., early access, Mar. 8, 2024, doi: 10.1109/TMC.2024.3371377.
  • [29] S. Barbarossa, S. Sardellitti, and P. Di Lorenzo, “Communicating while computing: Distributed mobile cloud computing over 5G heterogeneous networks,” IEEE Signal Process. Mag., vol. 31, no. 6, pp. 45–55, Nov. 2014.
  • [30] C.-L. Su, C.-Y. Tsui, and A. Despain, “Low power architecture design and compilation techniques for high-performance processors,” in Proc. IEEE COMPCON, San Francisco, CA, USA, Feb. 1994, pp. 489–498.
  • [31] T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Signal Process. Syst., vol. 13, no. 2, pp. 203–221, Aug. 1996.
  • [32] L. T. Hoang, C. T. Nguyen, and A. T. Pham, “Deep reinforcement learning-based online resource management for UAV-assisted edge computing with dual connectivity,” IEEE/ACM Trans. Netw., vol. 31, no. 6, pp. 2761–2776, Dec. 2023.
  • [33] H. Li, K. Xiong, Y. Lu, B. Gao, P. Fan, and K. B. Letaief, “Distributed design of wireless powered fog computing networks with binary computation offloading,” IEEE Trans. Mobile Comput., vol. 22, no. 4, pp. 2084–2099, Apr. 2023.
  • [34] T. Bai, C. Pan, H. Ren, Y. Deng, M. Elkashlan, and A. Nallanathan, “Resource allocation for intelligent reflecting surface aided wireless powered mobile edge computing in OFDM systems,” IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 5389–5407, Aug. 2021.
  • [35] S. Burer and A. N. Letchford, “Non-convex mixed-integer nonlinear programming: A survey,” Surv. Oper. Res. Manage. Sci., vol. 17, no. 2, pp. 97–106, Jul. 2012.
  • [36] V. Cacchiani, M. Iori, A. Locatelli, and S. Martello, “Knapsack problems — An overview of recent advances. Part II: Multiple, multidimensional, and quadratic knapsack problems,” Comput. Oper. Res., vol. 143, p. 105693, Jul. 2022.
  • [37] Y. Yu, J. Tang, J. Huang, X. Zhang, D. K. C. So, and K.-K. Wong, “Multi-objective optimization for UAV-assisted wireless powered IoT networks based on extended DDPG algorithm,” IEEE Trans. Commun., vol. 69, no. 9, pp. 6361–6374, Jun. 2021.
  • [38] Y. Ye, C. H. Liu, Z. Dai, J. Zhao, Y. Yuan, G. Wang, and J. Tang, “Exploring both individuality and cooperation for air-ground spatial crowdsourcing by multi-agent deep reinforcement learning,” in Proc. Int. Conf. Data Eng., Anaheim, CA, USA, Apr. 2023, pp. 205–217.
  • [39] K. Zheng, R. Luo, Z. Wang, X. Liu, and Y. Yao, “Short-term and long-term throughput maximization in mobile wireless-powered Internet of Things,” IEEE Internet Things J., vol. 11, no. 6, pp. 10575–10591, Mar. 2024.
  • [40] Z. Hao, G. Xu, Y. Luo, H. Hu, J. An, and S. Mao, “Multi-agent collaborative inference via DNN decoupling: Intermediate feature compression and edge learning,” IEEE Trans. Mobile Comput., vol. 22, no. 10, pp. 6041–6055, Oct. 2023.
  • [41] Hairi, J. Liu, and S. Lu, “Finite-time convergence and sample complexity of multi-agent actor-critic reinforcement learning with average reward,” in Proc. Int. Conf. Learn. Represent., Virtual, Oct. 2022, pp. 25–29.
  • [42] Z. Gao, L. Yang, and Y. Dai, “Large-scale computation offloading using a multi-agent reinforcement learning in heterogeneous multi-access edge computing,” IEEE Trans. Mobile Comput., vol. 22, no. 6, pp. 3425–3443, Jun. 2023.
  • [43] Y. Wang, M. Chen, T. Luo, W. Saad, D. Niyato, H. V. Poor, and S. Cui, “Performance optimization for semantic communications: An attention-based reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 40, no. 9, pp. 2598–2613, Sep. 2022.