
Multi-Agent Meta-Reinforcement Learning for Self-Powered and Sustainable Edge Computing Systems

Md. Shirajum Munir,  Nguyen H. Tran,  Walid Saad,  and Choong Seon Hong This work was partially supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. 2020R1A4A1018607) and by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.2019-0-01287, Evolvable Deep Learning Model Generation Platform for Edge Computing).Md. Shirajum Munir, and Choong Seon Hong are with the Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, Republic of Korea (e-mail: munir@khu.ac.kr; cshong@khu.ac.kr).Nguyen H. Tran is with the School of Computer Science, The University of Sydney, Sydney, 2006, NSW, Australia. (e-mail: nguyen.tran@sydney.edu.au).Walid Saad is with the Wireless@VT Group, Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061 USA, and also with the Department of Computer Science and Engineering, Kyung Hee University, Yongin-si 17104, Republic of Korea (e-mail: walids@vt.edu).Corresponding author: Choong Seon Hong (e-mail: cshong@khu.ac.kr).©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract

The stringent requirements of mobile edge computing (MEC) applications and functions drive the need for high-capacity, densely deployed MEC hosts in upcoming wireless networks. However, operating such high-capacity MEC hosts can significantly increase energy consumption. Thus, a base station (BS) unit can act as a self-powered BS. In this paper, an effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities is studied. First, a two-stage linear stochastic programming problem is formulated with the goal of minimizing the total energy consumption cost of the system while fulfilling the energy demand. Second, a semi-distributed data-driven solution is proposed by developing a novel multi-agent meta-reinforcement learning (MAMRL) framework to solve the formulated problem. In particular, each BS plays the role of a local agent that explores a Markovian behavior for both energy consumption and generation while transferring time-varying features to a meta-agent. Sequentially, the meta-agent optimizes (i.e., exploits) the energy dispatch decision by accepting only the observations from each local agent along with its own state information. Meanwhile, each BS agent estimates its own energy dispatch policy by applying the parameters learned from the meta-agent. Finally, the proposed MAMRL framework is benchmarked by analyzing deterministic, asymmetric, and stochastic environments in terms of non-renewable energy usage, energy cost, and accuracy. Experimental results show that the proposed MAMRL model can reduce non-renewable energy usage by up to 11% and the energy cost by 22.4% (with 95.8% prediction accuracy), compared to other baseline methods.

Index Terms:
Mobile edge computing (MEC), stochastic optimization, meta-reinforcement learning, self-powered, demand response.

I Introduction

Next-generation wireless networks are expected to significantly rely on edge applications and functions that include edge computing and edge artificial intelligence (edge AI) [1, 2, 3, 4, 5, 6, 7]. To successfully support such edge services within a wireless network with mobile edge computing (MEC) capabilities, energy management (i.e., demand and supply) is one of the most critical design challenges. In particular, it is imperative to equip next-generation wireless networks with alternative energy sources, such as renewable energy, in order to provide extremely reliable energy dispatch at a lower energy consumption cost [8, 9, 11, 10, 12, 13, 14, 15]. An efficient energy dispatch design requires energy sustainability, which not only saves energy consumption cost but also fulfills the energy demand of edge computing by exploiting the network's own renewable energy sources. Specifically, sustainable energy refers to a seamless energy flow to the MEC system that meets the energy demand without compromising the ability of future energy generation. Furthermore, to ensure sustainable MEC operation, it is essential to capture the uncertainty of both energy consumption and generation. A summary of the challenges that have been addressed in the literature on enabling renewable energy sources for wireless networks is presented in Table I.

To provide sustainable edge computing for next-generation wireless systems, each base station (BS) unit with MEC capabilities can be equipped with renewable energy sources. Thus, the energy supply of such a BS unit does not rely solely on the power grid, but also on its equipped renewable energy sources. In particular, in a self-powered network, each wireless BS with MEC capabilities is equipped with its own renewable energy sources and can generate, consume, store, and share energy with other BS units.

Delivering a seamless energy flow at a low energy consumption cost in a self-powered wireless network with MEC capabilities must contend with uncertainty in both energy demand and generation. In particular, the randomness of the energy demand is induced by the uncertain resource (i.e., computation and communication) requests of the edge services and applications. Meanwhile, the energy generation of a renewable source (i.e., a solar panel) at each self-powered BS unit varies with the time of day. In other words, the pattern of energy demand and generation differs from one self-powered BS unit to another. Thus, such fluctuating energy demand and generation patterns induce non-independent and identically distributed (non-i.i.d.) energy dispatch at each BS over time. To cope with this non-i.i.d. energy demand and generation, characterizing the expected amount of uncertainty is crucial to ensure a seamless energy flow to the self-powered wireless network. As such, when designing self-powered wireless networks, it is necessary to take this uncertainty in the energy patterns into account.

TABLE I: Summary of the challenges that are solved by the literature for enabling renewable energy sources in the wireless network.
Ref. Energy sources MEC capabilities Non-i.i.d. dataset Energy dispatch Energy cost Remarks
[8] Renewable No No No No Activation and deactivation of BSs in a self-powered network
[9] Hybrid energy No No No No User scheduling and network resource management
[10] Hybrid energy Yes No No No Load balancing between the centralized cloud and edge server
[11] Microgrid Yes No Yes No MEC task assignment and energy demand-response (DR) management
[12] Microgrid Yes No Yes No Risk-sensitive energy profiling for microgrid-powered MEC network
[13] Renewable No No Yes No Energy load balancing among the SBSs with a microgrid
[14] Smart grid enabled hybrid energy No No Yes No Joint network resource allocation and energy sharing among the BSs
[15] Hybrid energy No No Yes No Overall system architecture for edge computing and renewable energy resources
This work Smart grid enabled self-powered renewable energy Yes Yes Yes Yes An effective energy dispatch mechanism for self-powered wireless networks with edge computing capabilities

I-A Related Works

The problem of energy management for MEC-enabled wireless networks has been studied in [16, 17, 18, 19, 20, 21, 22] (summarized in Table II). In [16], the authors proposed a joint mechanism for radio resource management and user task offloading with the goal of minimizing the long-term power consumption of both the mobile devices and the MEC server. The authors in [17] proposed a heuristic to solve the joint problem of computational resource allocation, uplink transmission power, and user task offloading. The work in [18] studied the tradeoff between communication and computation for a MEC system, and the authors proposed a MEC server CPU scaling mechanism for reducing the energy consumption. Further, the work in [19] proposed an energy-aware mobility management scheme for MEC in ultra-dense networks, and the problem was addressed using Lyapunov optimization and multi-armed bandits. Recently, the authors in [21] proposed a distributed power control scheme for a small cell network by using the concept of multi-agent calibrated learning. Further, the authors in [22] studied the problem of energy storage and energy harvesting (EH) for a wireless network using deviation theory and Markov processes. However, all of these existing works assume that the consumed energy is always available from the energy utility source to the wireless network system [16, 17, 18, 19, 20, 21, 22]. Since these models mainly focus on energy management and user task offloading over network resource allocation, the random computational (e.g., CPU computation, memory, etc.) and communication requirements of the edge applications and services are not considered. In fact, even if enough energy supply is available, the energy cost related to network operation can be significant because of the usage of non-renewable sources (e.g., coal, petroleum, natural gas). Indeed, it is necessary to incorporate renewable energy sources into the next-generation wireless networking infrastructure.

TABLE II: Summary of the related works [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28].
Ref. Contributions Method Limitation
[16] Radio resource management and users task offloading Optimization Usage of non-renewable, deterministic environment
[17] Computational resource allocation, uplink transmission power, and user task offloading Heuristic Usage of non-renewable, energy dispatch, performance guarantee
[18] MEC server CPU scaling mechanism for reducing the energy consumption Optimization Usage of non-renewable, energy dispatch
[19] Energy-aware mobility management scheme for MEC Lyapunov and multi-armed bandits Energy dispatch, i.i.d. energy demand-response
[20] Energy efficient green-IoT network Heuristic Edge computing, Energy dispatch, deterministic environment
[21] Distributed power control scheme for a small cell network Multi-agent calibrated learning Usage of non-renewable, energy dispatch
[22] Energy storage and energy harvesting (EH) for a wireless network Deviation theory and Markov processes MEC capabilities, i.i.d. energy demand-response
[23] Non-coordinated energy shedding and mis-aligned incentives for mixed-use building Auction theory MEC capabilities, i.i.d. energy demand-response
[24] Tradeoff between effectiveness and available amounts of training data Deep meta-RL Stochastic environment and a multi-agent scenario
[25] Controlling the meta-parameter in both static and dynamic environments SGD-based meta-parameter learning Single-agent, same environment
[26] Learning to learn mechanism with the recurrent neural networks Generalized transfer learning Deterministic environment, single-agent
[27] Asynchronous multi-agent RL framework One-step Q-learning, one-step Sarsa, and n-step Q-learning Deterministic environment
[28] General-purpose multi-agent scheme Extension of the actor-critic policy gradient Same environment for all of the local actors

Recently, some of the challenges of renewable energy powered wireless networks have been studied in [8, 9, 11, 10, 12, 13, 14, 23]. In [8], the authors proposed an online optimization framework to analyze the activation and deactivation of BSs in a self-powered network. In [9], a hybrid power source infrastructure was proposed to support heterogeneous networks (HetNets), and a model-free deep reinforcement learning (RL) mechanism was developed for user scheduling and network resource management. In [10], the authors developed an RL scheme for edge resource management while incorporating renewable energy in the edge network. In particular, the goal of [10] is to minimize a long-term system cost by load balancing between the centralized cloud and the edge server. The authors in [11] introduced a microgrid-enabled edge computing system, where a joint optimization problem was studied for MEC task assignment and energy demand-response (DR) management, and a model-based deep RL framework was developed to tackle the joint problem. In [12], the authors proposed a risk-sensitive energy profiling scheme for a microgrid-powered MEC network to ensure a sustainable energy supply for green edge computing by capturing the conditional value at risk (CVaR) tail distribution of the energy shortfall, and a multi-agent RL system was proposed to solve the energy scheduling problem. In [13], the authors proposed a self-sustainable mobile network that uses a graph-based approach for intelligent energy management with a microgrid. The authors in [14] proposed a smart grid-enabled wireless network and minimized grid energy consumption by applying energy sharing among the BSs. Furthermore, in [23], the authors addressed the challenges of non-coordinated energy shedding and mis-aligned incentives for mixed-use buildings (i.e., buildings and data centers) using auction theory to reduce energy usage. However, these works [9, 11, 10, 12, 13, 14, 23] do not investigate the problem of energy dispatch, nor do they account for the energy cost of MEC-enabled, self-powered networks when the demand and generation of each self-powered BS are non-i.i.d. Dealing with non-i.i.d. energy demand and generation among self-powered BSs is challenging because the intrinsic energy requirements of each BS evolve under uncertainty. To overcome this unique energy dispatch challenge, we propose a multi-agent meta-reinforcement learning framework that can adapt to a new, uncertain environment without considering the entire past experience.

Some interesting problems related to meta-RL and multi-agent deep RL are studied in [24, 25, 26, 27, 28] (summarized in Table II). In [24], the authors focused on the challenge of the tradeoff between effectiveness and the available amount of training data for a deep-RL based learning system. To this end, the authors in [24] tackled this challenge by exploring a deep meta-reinforcement learning architecture. This learning architecture comprises two learning systems: 1) a lower-level system that can learn each new task very quickly, and 2) a higher-level system that is responsible for improving the performance of each lower-level task. In particular, the lower-level system can adapt to a new task relatively quickly, while the higher-level system performs fine-tuning so as to improve the performance of the lower-level system. In deep meta-reinforcement learning, the lower-level system quantifies a reward based on the desired action and feeds that reward back to the higher-level system to tune the weights of a recurrent network. However, the authors in [24] do not consider a stochastic environment, nor do they extend their work to a multi-agent scenario. The authors in [25] proposed a stochastic gradient-based meta-parameter learning scheme for tuning reinforcement learning parameters to the physical environmental dynamics. The experiments in [25] were performed in both animal and robot environments, where an animal must recognize food before it starves and a robot must recharge before its battery is empty. Thus, the proposed scheme can effectively find meta-parameter values and control the meta-parameters in both static and dynamic environments. In [26], the authors investigated a learning-to-learn (i.e., meta-learning) mechanism with recurrent neural networks, where the meta-learning problem was designed as a generalized transfer learning scheme. In particular, the authors in [26] considered a parametrized optimizer that can transfer the neural network parameter updates to an optimizee. Meanwhile, the optimizee can determine the gradients without relying on the optimizer parameters. Moreover, the optimizee sends the error to the optimizer and updates its own parameters based on the transferred parameters. This mechanism allows an agent to learn new tasks of a similar structure. An asynchronous multi-agent RL framework was studied in [27], where the authors investigated how the parallel actor learners of asynchronous advantage actor-critic (A3C) can achieve better stability during neural network training compared to other asynchronous RL schemes, such as asynchronous one-step Q-learning, one-step Sarsa, and n-step Q-learning. The authors in [28] proposed a general-purpose multi-agent scheme by adopting the framework of centralized training with decentralized execution. In particular, the authors in [28] proposed an extension of the actor-critic policy gradient mechanism by modifying the role of the critic. The critic is augmented with additional policy information from the other actors (agents). Sequentially, each local actor executes in a decentralized manner and sends its own policy to the centralized critic for further investigation.
However, the environment (i.e., state information) of this model remains the same for all of the local actors, while in our setting the environment of each BS agent differs from the others based on its own energy demand and generation. Moreover, the works in [24, 25, 26, 27, 28] do not consider a multi-agent environment in which the policy of each agent relies on its own state information. In particular, such state information belongs to a non-i.i.d. learning environment when the environmental dynamics become distinct among the agents.

I-B Contributions



Figure 1: Multi-agent meta-reinforcement learning framework of self-powered energy dispatch for sustainable edge computing.

The main contribution of this paper is a novel energy management framework for next-generation MEC in a self-powered wireless network that is reliable under extremely uncertain energy demand and generation. We formulate a two-stage stochastic energy cost minimization problem that can balance renewable, non-renewable, and storage energy without knowing the actual demand. In fact, the formulated problem also investigates the realization of renewable energy generation after receiving the uncertain energy demand from the MEC application and service requests. To solve this problem, we propose a multi-agent meta-reinforcement learning (MAMRL) framework in which each BS dynamically observes the non-i.i.d. behavior of the time-varying features of both its energy demand and generation, transfers those observations to obtain an energy dispatch decision, and then executes the energy dispatch policy at the self-powered BS. Fig. 1 illustrates how we propose to dispatch energy to ensure sustainable edge computing over a self-powered network using the MAMRL framework. As we can see, each BS, which includes small cell base stations (SBSs) and a macro base station (MBS), acts as a local agent and transfers its own decision (reward and action) to the meta-agent. Then, the meta-agent accumulates all of the non-i.i.d. observations from each local agent (i.e., SBSs and MBS) and optimizes the energy dispatch policy. The proposed MAMRL framework then provides feedback to each BS agent so that it can explore efficiently and reach the right decision more quickly. Thus, the proposed MAMRL framework ensures autonomous decision making under an uncertain and unknown environment. Our key contributions include:

  • We formulate a self-powered energy dispatch problem for a MEC-supported wireless network, in which the objective is to minimize the total energy consumption cost of the network while considering the uncertainty of both energy consumption and generation. The formulated problem is, thus, a two-stage linear stochastic programming problem. In particular, the first stage makes a decision when the energy demand is unknown, and the second stage discretizes the realization of renewable energy generation after the energy demand of the network becomes known.

  • To solve the formulated problem, we propose a new multi-agent meta-reinforcement learning framework that uses a skill transfer mechanism [24, 25] between each local agent (i.e., self-powered BS) and a meta-agent. In this MAMRL scheme, each local agent explores its own energy dispatch decision using Markovian properties to capture the time-varying features of both energy demand and generation. Meanwhile, the meta-agent evaluates (exploits) that decision for each local agent and optimizes the energy dispatch decision. In particular, we design a long short-term memory (LSTM) network as the meta-agent (i.e., running at the MBS) that is capable of discarding incompetent decisions from each local agent and learning the right features more quickly by maintaining its own state information.

  • We develop the proposed MAMRL energy dispatch framework in a semi-distributed manner. Each local agent (i.e., self-powered BS) estimates its own energy dispatch decision using local energy data (i.e., demand and generation), and provides observations to the meta-agent individually. Consequently, the meta-agent optimizes the decision centrally and assists the local agents toward a globally optimized decision. Thus, this approach not only reduces the computational complexity and communication overhead but also mitigates the curse of dimensionality under uncertainty by utilizing the non-i.i.d. energy demand and generation from each local agent.

  • Experimental results using real datasets establish a significant performance gain of the energy dispatch under deterministic, asymmetric, and stochastic environments. In particular, the results show that the proposed MAMRL model saves up to 22.44% of the energy consumption cost over a baseline approach while achieving an average accuracy of around 95.8% in a stochastic environment. Our approach also decreases the usage of non-renewable energy by up to 11% of the total consumed energy.

The rest of the paper is organized as follows. Section II presents the system model of self-powered edge computing. The problem formulation is described in Section III. Section IV provides the MAMRL framework for solving the energy dispatch problem. Experimental results are analyzed in Section V. Finally, conclusions are drawn in Section VI.

II System Model of Self-Powered Edge Computing


Figure 2: System model for a self-powered wireless network with MEC capabilities.

Consider a self-powered wireless network that is connected to a smart grid controller as shown in Fig. 2. Such a wireless network enables edge computing services for various MEC applications and services. The energy consumption of the network depends on the network operation energy along with the task loads of the MEC applications. Meanwhile, the energy supply of the network relies on the energy generation from renewable sources that are attached to the BSs, as well as both renewable and non-renewable sources of the smart grid. Furthermore, the smart grid controller represents the main power grid (i.e., smart grid), through which an additional amount of energy can be supplied to the network. Therefore, we will first discuss the energy demand model that includes the MEC server energy consumption and the network communication energy consumption. We will then describe the energy generation model that consists of the non-renewable energy generation cost, surplus energy storage cost, and total energy generation cost. Table III summarizes our notation.

II-A Energy Demand Model

Consider a set $\mathcal{B}=\left\{{0,1,2,\dots,B}\right\}$ of $B+1$ BSs (index $0$ for the MBS) that encompasses $B$ SBSs overlaid over a single MBS. Each BS $i\in\mathcal{B}$ includes a set $\mathcal{K}_{i}=\left\{{1,2,\dots,K_{i}}\right\}$ of $K_{i}$ MEC application servers. We consider a finite time horizon $\mathcal{T}=\{1,2,\dots,T\}$, with each time slot being indexed by $t$ and having a duration of 15 minutes [29]. The observational period of each time slot $t$ ends at the $15$-th minute and is capable of capturing the changes of the network dynamics [11, 12, 30]. A set $\mathcal{J}_{i}$ of $J_{i}$ heterogeneous MEC application task requests from users arrives at BS $i$ with an average task arrival rate $\lambda_{i}(t)$ (bits/s) at time $t$. The task arrival rate $\lambda_{i}(t)$ at BS $i\in\mathcal{B}$ follows a Poisson process at time slot $t$. BS $i$ integrates $K_{i}$ heterogeneous active MEC application servers, each having a processing capacity $u_{k_{i}}(t)$ (bits/s). Thus, the $J_{i}$ computational task requests are accumulated into the service pool with an average traffic size $S_{i}(t)$ (bits) at time slot $t$. The average traffic arrival rate is defined as $\lambda_{i}(t)=\frac{1}{S_{i}(t)}$. Therefore, an M/M/K queuing model is suitable for modeling these $J_{i}$ user tasks over the $K_{i}$ MEC servers at BS $i$ and time $t$ [31, 32]. The task size of this queuing model is exponentially distributed since the average traffic size $S_{i}(t)$ is known. Hence, the service rate of BS $i$ is determined by $\mu_{i}(t)=\frac{1}{\mathbb{E}[\sum_{k_{i}\in\mathcal{K}_{i}}u_{k_{i}}(t)]}$. At any given time $t$, we assume that all of the tasks in $\mathcal{J}_{i}$ are uniformly distributed over BS $i$. Thus, for a given MEC server task association indicator $\Upsilon_{jk_{i}}(t)=1$ if task $j$ is assigned to server $k$ at BS $i$, and $0$ otherwise, the average MEC server utilization is defined as follows [11]:

\rho_{i}(t)=\left\{\begin{array}[]{ll}\sum_{j\in\mathcal{J}_{i}}\sum_{k_{i}\in\mathcal{K}_{i}}\Upsilon_{jk_{i}}(t)\frac{\lambda_{i}(t)}{\mu_{i}(t)K_{i}},&\text{if }\Upsilon_{jk_{i}}(t)=1,\\ 0,&\text{otherwise.}\end{array}\right.   (1)
TABLE III: Summary of notations.
Notation Description
$\mathcal{B}$ Set of BSs (SBSs and MBS)
$\mathcal{K}_{i}$ Set of active servers under the BS $i\in\mathcal{B}$
$\mathcal{J}_{i}$ Set of user tasks at BS $i\in\mathcal{B}$
$\mathcal{R}$ Set of renewable energy sources
$\rho_{i}(t)$ Server utilization in BS $i\in\mathcal{B}$
$L$ No. of CPU cores
$R_{i}(t)$ Average downlink data of BS $i$
$W_{ij}$ Fixed channel bandwidth of BS $i$ for user task $j$
$P_{i}$ Transmission power of BS $i$
$g_{ij}(t)$ Downlink channel gain between user task $j$ to BS $i$
$I_{ij}(t)$ Co-channel interference for user task $j$ at BS $i$
$\delta_{i}$ Energy coefficient for BS $i\in\mathcal{B}$
$f$ MEC server CPU frequency for a single core
$\tau$ Server switching capacitance
$\eta^{\textrm{MEC}}_{\textrm{st}}(t)$ MEC server static energy consumption
$\eta^{\textrm{MEC}}_{\textrm{idle}}(t)$ MEC server idle state power consumption
$\varpi_{k}$ Scaling factor of heterogeneous MEC CPU core
$\eta^{\textrm{net}}_{\textrm{st}}(t)$ Static energy consumption of BS
$c^{\textrm{ren}}_{t}$ Renewable energy cost per unit
$c^{\textrm{non}}_{t}$ Non-renewable energy cost per unit
$c^{\textrm{sto}}_{t}$ Storage energy cost per unit
$\xi^{\textrm{ren}}_{t}$ Amount of renewable energy
$\xi^{\textrm{non}}_{t}$ Amount of non-renewable energy
$\xi^{\textrm{sto}}_{t}$ Amount of surplus energy
$\xi^{\textrm{d}}_{t}$ Energy demand at time slot $t$
$\xi^{\textrm{D}}_{t}$ Random variable for energy demand
$\xi^{\textrm{ren}_{\textrm{max}}}_{t}$ Maximum capacity of renewable energy at BS $i\in\mathcal{B}$
$\mathcal{O}_{i}$ Set of observations at BS $i\in\mathcal{B}$
$O(.)$ Big $O$ notation to represent complexity
$\beta$ Entropy regularization coefficient
$\gamma$ Discount factor
$\theta_{i}$ Learning parameters for BS $i\in\mathcal{B}$
$\pi_{\theta_{i}}$ Energy dispatch policy with parameters $\theta_{i}$ at BS $i\in\mathcal{B}$
$\phi$ Meta-agent learning parameters

II-A1 MEC Server Energy Consumption

In the case of MEC server energy consumption, the computational energy consumption (dynamic energy) depends on the CPU activity for executing computational tasks [17, 33, 16]. Further, such dynamic energy also accounts for the thermal design power (TDP), memory, and disk I/O operations of the MEC server [17, 33, 16], denoted by $\eta^{\textrm{MEC}}_{\textrm{st}}(t)$. Meanwhile, the static energy $\eta^{\textrm{MEC}}_{\textrm{idle}}(t)$ includes the idle-state power of the CPU activities [16, 18]. We consider a single-core CPU with a processor frequency $f$ (cycles/s), an average server utilization $\rho_{i}(t)$ (using (1)) at time slot $t$, and a switching capacitance $\tau=5\times 10^{-27}$ (farad) [17]. The dynamic power consumption of such a single-core CPU can be calculated by applying the cubic formula $\tau\rho_{i}(t)f^{3}$ [18, 34]. Thus, the energy consumption of the $K_{i}$ MEC servers with $L$ CPU cores at BS $i$ is defined as follows:

\xi^{\textrm{MEC}}_{i}(t)=\left\{\begin{array}[]{ll}\sum_{k\in\mathcal{K}_{i}}\sum_{l\in L}\tau\rho_{i}(t)f_{k_{i}}^{3}\varpi_{k_{il}}+\eta^{\textrm{MEC}}_{\textrm{st}}(t),&\text{if }\rho_{i}(t)>0,\\ \eta^{\textrm{MEC}}_{\textrm{idle}}(t),&\text{otherwise,}\end{array}\right.   (2)

where $\varpi_{k_{il}}$ denotes a scaling factor of the heterogeneous CPU core of the MEC server. The value of $\varpi_{k_{il}}$ depends on the processor architecture [35], which ensures the heterogeneity of the MEC server.
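To make the demand model concrete, the following Python sketch evaluates the server utilization of (1) and the MEC energy consumption of (2); the function names, argument layout, and example values are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of Eqs. (1)-(2): MEC server utilization and energy.
# All parameter names are assumptions for demonstration only.

TAU = 5e-27  # switching capacitance (farad), as given in Section II-A1


def server_utilization(task_assigned, lam, mu, num_servers):
    """Average MEC server utilization rho_i(t) of Eq. (1).

    task_assigned: number of (j, k) pairs with Upsilon_{jk_i}(t) = 1.
    lam, mu: average task arrival and service rates at BS i.
    """
    if task_assigned == 0:
        return 0.0
    return task_assigned * lam / (mu * num_servers)


def mec_energy(rho, cpu_freqs, core_scales, eta_static, eta_idle):
    """MEC server energy consumption xi_i^MEC(t) of Eq. (2).

    cpu_freqs[k]: CPU frequency f_{k_i} of server k (cycles/s).
    core_scales[k][l]: scaling factor varpi_{k_il} of core l on server k.
    """
    if rho <= 0:
        return eta_idle  # only idle-state power is drawn
    dynamic = sum(TAU * rho * f**3 * w
                  for f, scales in zip(cpu_freqs, core_scales)
                  for w in scales)
    return dynamic + eta_static
```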

II-A2 Base Station Energy Consumption

The energy consumption needed for the operation of the network base stations (i.e., SBSs and MBS) includes two types of energy: dynamic and static energy consumption [36]. On one hand, the static energy consumption $\eta^{\textrm{net}}_{\textrm{st}}(t)$ includes the energy for maintaining the idle state of a BS, a constant power consumption for receiving packets from users, and the energy for wired transmission among the BSs. On the other hand, the dynamic energy consumption of the BSs depends on the amount of data transferred from the BSs to the users, which essentially relates to the downlink [37] transmit energy. Thus, we consider that each BS $i\in\mathcal{B}$ operates at a fixed channel bandwidth $W_{ij}$ and constant transmission power $P_{i}$ [37]. Then the average downlink data rate of BS $i$ is given by [11]:

R_{i}(t)=\sum_{i\in\mathcal{B}}\sum_{j\in\mathcal{J}_{i}}W_{ij}\log_{2}\Big{(}1+\frac{P_{i}g_{ij}(t)}{\sigma^{2}+I_{ij}(t)}\Big{)}   (3)

where $g_{ij}(t)$ represents the downlink channel gain between user task $j$ and BS $i$, $\sigma^{2}$ is the variance of the additive white Gaussian noise (AWGN), and $I_{ij}(t)$ denotes the co-channel interference [38, 39] among the BSs. Here, the co-channel interference $I_{ij}(t)=\sum_{i^{\prime}\in\mathcal{B},i^{\prime}\neq i}P_{i^{\prime}}g_{{i^{\prime}}j}(t)$ relates to the transmissions from the other BSs $i^{\prime}\in\mathcal{B}$ that use the same subchannels of $W_{ij}$. $P_{i^{\prime}}$ and $g_{{i^{\prime}}j}(t)$ represent, respectively, the transmit power and the channel gain of BS $i^{\prime}\neq i\in\mathcal{B}$. Therefore, the downlink energy consumption for the data transfer of BS $i\in\mathcal{B}$ is defined by $\frac{P_{i}S_{i}(t)}{R_{i}(t)}$ [watt-seconds or joules], where $\frac{S_{i}(t)}{R_{i}(t)}$ [seconds] determines the duration of the transmit power $P_{i}$ [watts]. Thus, the network energy consumption for BS $i$ at time $t$ is defined as follows [36, 19]:

\xi^{\textrm{net}}_{i}(t)=\sum_{j\in\mathcal{J}_{i}}\Big{(}\delta_{i}^{\textrm{net}}\frac{P_{i}S_{i}(t)}{R_{i}(t)}+\eta^{\textrm{net}}_{\textrm{st}}(t)\Big{)},   (4)

where $\delta_{i}^{\textrm{net}}$ determines the energy coefficient for transferring data through the network. In fact, the value of $\delta_{i}^{\textrm{net}}$ depends on the type of the network device (e.g., $\delta_{i}^{\textrm{net}}=2.8$ for a $6$-unit transceiver remote radio head [36]).
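A similar sketch for the communication side evaluates the downlink rate of (3) and the BS network energy of (4); again, the argument names and the per-task layout are assumptions for illustration.

```python
import math

# Illustrative sketch of Eqs. (3)-(4): downlink rate and BS network energy.
# Argument names (bandwidths, tx_power, gains, ...) are assumptions.


def downlink_rate(bandwidths, tx_power, gains, noise_var, interference):
    """Average downlink rate R_i(t) of Eq. (3), summed over user tasks j."""
    return sum(w * math.log2(1.0 + tx_power * g / (noise_var + i_ij))
               for w, g, i_ij in zip(bandwidths, gains, interference))


def network_energy(delta_net, tx_power, traffic_bits, rate, eta_static, num_tasks):
    """BS network energy consumption xi_i^net(t) of Eq. (4).

    traffic_bits / rate is the transmit duration S_i(t) / R_i(t); the static
    term eta_static is added once per task, mirroring the sum over j.
    """
    return num_tasks * (delta_net * tx_power * traffic_bits / rate + eta_static)
```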

II-A3 Total Energy Demand

The total energy consumption (demand) of the network consists of both the MEC server computational energy consumption (in (2)) and the network operational energy (i.e., the BS energy consumption in (4)). Thus, the overall energy demand of the network at time slot $t$ is given as follows:

\xi^{\textrm{d}}_{t}=\sum_{i\in\mathcal{B}}\Big{(}\xi^{\textrm{net}}_{i}(t)+\xi^{\textrm{MEC}}_{i}(t)\Big{)}.   (5)

The demand $\xi^{\textrm{d}}_{t}$ is random over time and depends entirely on the computational task loads of the MEC servers.

II-B Energy Generation Model

The energy supply of the self-powered wireless network with MEC capabilities relates to the network's own renewable (e.g., solar, wind, biofuels, etc.) sources as well as the main grid's non-renewable (e.g., diesel generators, coal power, and so on) energy sources [8, 9]. In this energy generation model, we consider a set $\mathcal{R}=\left\{{\mathcal{R}_{0},\mathcal{R}_{1},\dots,\mathcal{R}_{B}}\right\}$ of renewable energy sources of the network, with each element $\mathcal{R}_{i}$ representing the set of renewable energy sources of BS $i\in\mathcal{B}$. Each renewable energy source $q\in\mathcal{R}_{i}$ at BS $i\in\mathcal{B}$ can generate an amount $\xi^{\textrm{ren}}_{iq}(t)$ of renewable energy at time $t$. Therefore, the total amount of renewable energy generation at BS $i\in\mathcal{B}$ is $\xi^{\textrm{ren}}_{i}(t)=\sum_{q\in\mathcal{R}_{i}}\xi^{\textrm{ren}}_{iq}(t)$ for time slot $t$. Thus, the total renewable energy generation of the considered network at time $t$ is defined as $\xi^{\textrm{ren}}_{t}=\sum_{i\in\mathcal{B}}\xi^{\textrm{ren}}_{i}(t)$. This renewable energy $\xi^{\textrm{ren}}_{t}$ is less than or equal to the maximum capacity $\xi^{\textrm{ren}_{\textrm{max}}}_{t}$ of renewable energy generation at time period $t$. Thus, we consider a maximum storage limit that is equal to the maximum capacity $\xi^{\textrm{ren}_{\textrm{max}}}_{t}$ of the renewable energy generation [40, 41, 42]. Further, the self-powered wireless network is able to obtain an additional non-renewable energy amount $\xi^{\textrm{non}}_{t}$ from the main grid at time $t$. The per-unit renewable and non-renewable energy costs are denoted by $c^{\textrm{ren}}_{t}$ and $c^{\textrm{non}}_{t}$, respectively. In general, the renewable energy cost only depends on the maintenance cost of the renewable energy sources [40, 41, 42]. Therefore, the per-unit non-renewable energy cost is greater than the renewable energy cost, $c^{\textrm{non}}_{t}>c^{\textrm{ren}}_{t}$. Additionally, the surplus amount of energy $\xi^{\textrm{sto}}_{t}$ at time $t$ can be stored in an energy storage medium for future usage [41, 42], and the storage cost per unit of energy stored is denoted by $c^{\textrm{sto}}_{t}$.

II-B1 Non-renewable Energy Generation Cost

In order to fulfill the energy demand $\xi^{\textrm{d}}_{t}$ when it is greater than the generated renewable energy $\xi^{\textrm{ren}}_{t}$, the main grid can provide an additional amount of energy $\xi^{\textrm{non}}_{t}$ from its non-renewable sources. Thus, the non-renewable energy generation cost $C^{\textrm{non}}_{t}$ of the network is determined as follows:

C^{\textrm{non}}_{t}=\left\{\begin{array}[]{ll}c^{\textrm{non}}_{t}[{\xi^{\textrm{d}}_{t}}-\xi^{\textrm{ren}}_{t}],&\text{if }{\xi^{\textrm{d}}_{t}}>\xi^{\textrm{ren}}_{t},\\ 0,&\text{otherwise},\end{array}\right.   (6)

where $c^{\textrm{non}}_{t}$ represents a unit energy cost.

II-B2 Surplus Energy Storage Cost

The surplus amount of energy is stored in a storage medium when ${\xi^{\textrm{d}}_{t}}<\xi^{\textrm{ren}}_{t}$ (i.e., the energy demand is smaller than the renewable energy generation) at time $t$. We consider the per-unit energy storage cost $c^{\textrm{sto}}_{t}$. This storage cost depends on the storage medium and the amount of energy stored at time $t$ [41, 43, 23, 44]. With the per-unit energy storage cost $c^{\textrm{sto}}_{t}$, the total storage cost at time $t$ is defined as follows:

C^{\textrm{sto}}_{t}=\left\{\begin{array}[]{ll}c^{\textrm{sto}}_{t}[\xi^{\textrm{ren}}_{t}-{\xi^{\textrm{d}}_{t}}],&\text{if }{\xi^{\textrm{d}}_{t}}<\xi^{\textrm{ren}}_{t},\\ 0,&\text{otherwise}.\end{array}\right.   (7)

II-B3 Total Energy Generation Cost

The total energy generation cost includes the renewable, non-renewable, and storage energy costs. Naturally, this total energy generation cost depends on the energy demand $\xi^{\textrm{d}}_{t}$ of the network at time $t$. Therefore, the total energy generation cost at time $t$ is defined as follows:

\begin{split}Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{t})={c^{\textrm{ren}}_{t}}\xi^{\textrm{ren}}_{t}+{c^{\textrm{non}}_{t}}{[{\xi^{\textrm{d}}_{t}}-\xi^{\textrm{ren}}_{t}]}_{+}\\ +{c^{\textrm{sto}}_{t}}{[\xi^{\textrm{ren}}_{t}-{\xi^{\textrm{d}}_{t}}]}_{+},\end{split}   (8)

where the energy costs of the renewable, non-renewable, and storage energy are given by ${c^{\textrm{ren}}_{t}}\xi^{\textrm{ren}}_{t}$, ${c^{\textrm{non}}_{t}}{[{\xi^{\textrm{d}}_{t}}-\xi^{\textrm{ren}}_{t}]}_{+}$, and ${c^{\textrm{sto}}_{t}}{[\xi^{\textrm{ren}}_{t}-{\xi^{\textrm{d}}_{t}}]}_{+}$, respectively. In (8), the energy demand $\xi^{\textrm{d}}_{t}$ and the renewable energy generation $\xi^{\textrm{ren}}_{t}$ are stochastic in nature. The costs of non-renewable energy (6) and storage energy (7) rely completely on the energy demand $\xi^{\textrm{d}}_{t}$ and the renewable energy generation $\xi^{\textrm{ren}}_{t}$. Hence, to address the uncertainty of both energy demand and renewable energy generation in a self-powered wireless network, we formulate a two-stage stochastic programming problem. In particular, the first stage makes an energy dispatch decision without knowing the actual demand of the network. Then we make further energy dispatch decisions by analyzing the uncertainty of the network demand in the second stage. A detailed discussion of the problem formulation is given in the following section.
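The cost structure of (6)-(8) reduces to two hinge terms around the demand-generation balance; a minimal Python sketch, with placeholder cost coefficients, is:

```python
# Illustrative sketch of Eqs. (6)-(8): energy generation cost components.
# Cost coefficients and energy amounts are placeholder arguments.


def total_generation_cost(xi_ren, xi_demand, c_ren, c_non, c_sto):
    """Total energy generation cost Q(xi_ren, xi_d) of Eq. (8).

    [x]_+ = max(x, 0) activates either the non-renewable cost (demand
    exceeds renewable generation, Eq. (6)) or the storage cost
    (surplus renewable energy, Eq. (7)); at most one term is nonzero.
    """
    non_renewable_cost = c_non * max(xi_demand - xi_ren, 0.0)   # Eq. (6)
    storage_cost = c_sto * max(xi_ren - xi_demand, 0.0)         # Eq. (7)
    return c_ren * xi_ren + non_renewable_cost + storage_cost   # Eq. (8)
```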

III Problem Formulation with a Two-Stage Stochastic Model

We now consider the case in which the non-renewable energy cost is greater than the renewable energy cost, $c^{\textrm{non}}_{t}>c^{\textrm{ren}}_{t}$, which is often the case in a practical smart grid as discussed in [40], [41], [42], and [45]. Here, $\xi^{\textrm{ren}}_{t}$ and $\xi^{\textrm{d}}_{t}$ are continuous variables over the observational duration $t$. The objective is to minimize the total energy consumption cost $Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{t})$, where $\xi^{\textrm{ren}}_{t}$ is the decision variable and the energy demand $\xi^{\textrm{d}}_{t}$ is a parameter. When the energy demand $\xi^{\textrm{d}}_{t}$ is known, the optimization problem is:

\chi=\underset{{\xi^{\textrm{ren}}_{t}\geq 0}}{\min}\;Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{t}).   (9)

In problem (9), after removing the non-negativity constraint $\xi^{\textrm{ren}}_{t}\geq 0$, we can rewrite the objective function in the form of piecewise linear functions as follows:

\begin{split}Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{t})=\underset{{\xi^{\textrm{ren}}_{t}}}{\max}\left\{\Big{(}(c^{\textrm{ren}}_{t}-c^{\textrm{non}}_{t})\xi^{\textrm{ren}}_{t}+c^{\textrm{non}}_{t}\xi^{\textrm{d}}_{t}\Big{)},\right.\\ \left.\Big{(}(c^{\textrm{ren}}_{t}+c^{\textrm{sto}}_{t})\xi^{\textrm{ren}}_{t}-c^{\textrm{sto}}_{t}\xi^{\textrm{d}}_{t}\Big{)}\right\}.\end{split}   (10)

where $(c^{\textrm{ren}}_{t}-c^{\textrm{non}}_{t})\xi^{\textrm{ren}}_{t}+c^{\textrm{non}}_{t}\xi^{\textrm{d}}_{t}$ and $(c^{\textrm{ren}}_{t}+c^{\textrm{sto}}_{t})\xi^{\textrm{ren}}_{t}-c^{\textrm{sto}}_{t}\xi^{\textrm{d}}_{t}$ determine the cost of non-renewable (i.e., ${\xi^{\textrm{d}}_{t}}>\xi^{\textrm{ren}}_{t}$) and storage (i.e., ${\xi^{\textrm{d}}_{t}}<\xi^{\textrm{ren}}_{t}$) energy, respectively. Therefore, we have to choose one of the two cases. In fact, if the energy demand $\xi^{\textrm{d}}_{t}$ is known and the amount of renewable energy $\xi^{\textrm{ren}}_{t}$ equals the energy demand, then problem (10) provides the optimal decision for the exact amount of demand $\xi^{\textrm{d}}_{t}$. However, the challenge here is to make a decision about the renewable energy usage $\xi^{\textrm{ren}}_{t}$ before the demand becomes known. To overcome this challenge, we consider the energy demand $\xi^{\textrm{D}}_{t}$ as a random variable whose probability distribution can be estimated from the previous history of the energy demand. We can re-write problem (9) using the expectation of the total cost as follows:

\underset{{\xi^{\textrm{ren}}_{t}\geq 0}}{\min}\;\mathbb{E}[Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{D}}_{t})].   (11)

The solution of problem (11) provides an optimal result on average. However, in a practical scenario, we would need to solve problem (11) repeatedly over the uncertain energy demand $\xi^{\textrm{D}}_{t}$. Thus, this solution approach does not suit our model in terms of scalability, since the $B+1$ BSs generate a large variety of energy demands over the observational period $t$. In fact, the energy demand and generation can change over time for each BS $i\in\mathcal{B}$, and they can also induce large variations of demand-generation among the BSs. Hence, the solution to problem (11) cannot rely on an iterative scheme due to a lack of adaptation to the uncertain changes of energy demand and generation over time.
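In practice, the expectation in (11) is approximated from historical demand samples, as formalized in the next paragraph; a minimal numpy sketch of this sample-average estimate (with all demand values and probabilities assumed for illustration) is:

```python
import numpy as np

# Minimal sketch of approximating E[Q(xi_ren, xi_D)] in (11) from an
# empirical distribution (p_0, ..., p_B) of per-BS demand values.


def expected_cost(xi_ren, demands, probs, c_ren, c_non, c_sto):
    """Empirical expectation sum_i p_i * Q(xi_ren, xi_d_ti)."""
    demands, probs = np.asarray(demands), np.asarray(probs)
    q = (c_ren * xi_ren
         + c_non * np.maximum(demands - xi_ren, 0.0)   # non-renewable term
         + c_sto * np.maximum(xi_ren - demands, 0.0))  # storage term
    return float(np.dot(probs, q))


# Example with made-up numbers: three BSs, demand in arbitrary energy units.
cost = expected_cost(xi_ren=9.0, demands=[12.0, 8.5, 10.0],
                     probs=[0.4, 0.3, 0.3], c_ren=0.05, c_non=0.2, c_sto=0.02)
```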

We consider that the random variable $\xi^{\textrm{D}}_{t}$ has a finitely supported distribution and takes the values $\xi^{\textrm{D}}_{t0},\dots,\xi^{\textrm{D}}_{tB}$ with respective probabilities $p_{0},\dots,p_{B}$ for the $B+1$ BSs. The cumulative distribution function (CDF) $H(\xi^{\textrm{D}}_{t})$ of the energy demand $\xi^{\textrm{D}}_{t}$ is a step function with jumps of size $p_{i}$ at each demand $\xi^{\textrm{d}}_{ti}$. Therefore, the probability distribution of each BS energy demand $\xi^{\textrm{d}}_{ti}$ belongs to the CDF $H(\xi^{\textrm{D}}_{t})$ of the historical observations of the energy demand $\xi^{\textrm{D}}_{t}$. In this case, we can convert problem (11) into a deterministic optimization problem, and the expectation of the energy usage cost $\mathbb{E}[Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{D}}_{t})]$ is determined by $\sum_{i\in\mathcal{B}}p_{i}Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{ti})$. Thus, we can rewrite problem (9) as a linear programming problem using the representation in (10) as follows:

\underset{{\xi^{\textrm{ren}}_{t},\chi}}{\min}\;\chi   (12)
\text{s.t.}\quad\chi\geq(c^{\textrm{ren}}_{t}-c^{\textrm{non}}_{t})\xi^{\textrm{ren}}_{t}+c^{\textrm{non}}_{t}\xi^{\textrm{d}}_{t},   (12a)
\chi\geq(c^{\textrm{ren}}_{t}+c^{\textrm{sto}}_{t})\xi^{\textrm{ren}}_{t}-c^{\textrm{sto}}_{t}\xi^{\textrm{d}}_{t},   (12b)
\xi^{\textrm{ren}_{\textrm{max}}}_{t}\geq\xi^{\textrm{ren}}_{t}\geq 0.   (12c)

For a fixed value of the renewable energy $\xi^{\textrm{ren}}_{t}$, problem (12) is equivalent to problem (10); that is, the optimal value of (12) equals $Q(\xi^{\textrm{ren}}_{t},\xi^{\textrm{d}}_{t})$. We have converted the piecewise linear function from problem (10) into the inequality constraints (12a) and (12b). Constraint (12c) ensures a limit on the maximum allowable renewable energy usage. We consider $p_{i}$ as the highest probability of the energy demand at each BS $i\in\mathcal{B}$. Therefore, for the $B+1$ BSs, we define $p_{0},\dots,p_{B}$ as the probabilities of energy demand with respect to BSs $i=0,\dots,B$. Thus, we can rewrite problem (11) for the $B+1$ BSs, with $\boldsymbol{\xi}^{\textrm{D}}_{t}=(\xi^{\textrm{D}}_{t0},\dots,\xi^{\textrm{D}}_{tB})$, as follows:

\underset{{\xi^{\textrm{ren}}_{t},{\chi_{0}},\dots,\chi_{B}}}{\min}\;\sum_{i\in\mathcal{B}}p_{i}\chi_{i},   (13)
\text{s.t.}\quad\chi_{i}\geq(c^{\textrm{ren}}_{t}-c^{\textrm{non}}_{t})\xi^{\textrm{ren}}_{t}+c^{\textrm{non}}_{t}\xi^{\textrm{D}}_{ti},\forall i\in\mathcal{B},   (13a)
\chi_{i}\geq(c^{\textrm{ren}}_{t}+c^{\textrm{sto}}_{t})\xi^{\textrm{ren}}_{t}-c^{\textrm{sto}}_{t}\xi^{\textrm{D}}_{ti},\forall i\in\mathcal{B},   (13b)
\xi^{\textrm{ren}_{\textrm{max}}}_{t}\geq\xi^{\textrm{ren}}_{t}\geq 0,   (13c)

where $p_{i}$ represents the highest probability of the energy demand $\xi^{\textrm{D}}_{ti}=\xi^{\textrm{d}}_{ti}$, in which $\xi^{\textrm{D}}_{ti}$ is a random variable and $\xi^{\textrm{d}}_{ti}$ denotes a realization of the energy demand at BS $i\in\mathcal{B}$ at time $t$. The value of $p_{i}$ belongs to the empirical CDF $H(\boldsymbol{\xi}^{\textrm{D}}_{ti})$ of the energy demand $\xi^{\textrm{D}}_{ti}$ for BS $i\in\mathcal{B}$. This CDF is calculated from the historical observations of the energy demand at BS $i\in\mathcal{B}$. In fact, for a fixed value of the renewable energy $\xi^{\textrm{ren}}_{t}$, problem (13) is separable. As a result, we can decompose this problem into the structure of a two-stage linear stochastic programming problem [46, 47].
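For a single time slot, the linear program (13) is small enough to be handed to an off-the-shelf LP solver; the sketch below does so with scipy.optimize.linprog, where the demand samples, probabilities, cost coefficients, and renewable capacity are made-up values used only to illustrate the constraint layout of (13a)-(13c).

```python
import numpy as np
from scipy.optimize import linprog

# Minimal sketch of solving problem (13) for one time slot.
# Decision vector: x = [xi_ren, chi_0, ..., chi_B].


def solve_problem_13(demands, probs, c_ren, c_non, c_sto, xi_ren_max):
    num_bs = len(demands)                      # B + 1 base stations
    c = np.concatenate(([0.0], probs))         # objective: sum_i p_i * chi_i

    # (13a): (c_ren - c_non) * xi_ren - chi_i <= -c_non * xi_D_i
    # (13b): (c_ren + c_sto) * xi_ren - chi_i <=  c_sto * xi_D_i
    A_ub, b_ub = [], []
    for i, d in enumerate(demands):
        row_a = np.zeros(num_bs + 1); row_a[0] = c_ren - c_non; row_a[i + 1] = -1.0
        row_b = np.zeros(num_bs + 1); row_b[0] = c_ren + c_sto; row_b[i + 1] = -1.0
        A_ub += [row_a, row_b]
        b_ub += [-c_non * d, c_sto * d]

    bounds = [(0.0, xi_ren_max)] + [(None, None)] * num_bs   # (13c), free chi_i
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[0], res.fun                   # optimal xi_ren, expected cost


# Example usage with made-up numbers (3 BSs):
xi_ren_opt, exp_cost = solve_problem_13([12.0, 8.5, 10.0], [0.4, 0.3, 0.3],
                                        c_ren=0.05, c_non=0.2, c_sto=0.02,
                                        xi_ren_max=15.0)
```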

To find an approximation for a random variable with a finite probability distribution, we decompose problem (13) into a two-stage linear stochastic programming problem under uncertainty. The first-stage decision is made using historical data of the energy demand, which is fully independent of future observations. As a result, the first stage of the self-powered energy dispatch problem for sustainable edge computing is formulated as follows:

\underset{{\xi^{\textrm{ren}}_{t}\geq 0}}{\min}\;(c^{\textrm{ren}}_{t})^{\top}\xi^{\textrm{ren}}_{t}+\mathbb{E}[Q(\xi^{\textrm{ren}}_{t},\boldsymbol{\xi}^{\textrm{D}}_{t})],   (14)
\text{s.t.}\quad\xi^{\textrm{ren}_{\textrm{max}}}_{t}\geq\xi^{\textrm{ren}}_{t}\geq 0,   (14a)

where $Q(\xi^{\textrm{ren}}_{t},\boldsymbol{\xi}^{\textrm{D}}_{t})$ determines the optimal value of the second-stage problem. In problem (14), the decision variable $\xi^{\textrm{ren}}_{t}$ is calculated before the realization of the uncertain energy demand $\boldsymbol{\xi}^{\textrm{D}}_{t}$. Meanwhile, in the first stage of the formulated problem (14), the cost $(c^{\textrm{ren}}_{t})^{\top}\xi^{\textrm{ren}}_{t}$ is minimized over the decision variable $\xi^{\textrm{ren}}_{t}$, which then allows us to estimate the expected energy cost $\mathbb{E}[Q(\xi^{\textrm{ren}}_{t},\boldsymbol{\xi}^{\textrm{D}}_{t})]$ for the second-stage decision. Constraint (14a) provides a boundary for the maximum allowable renewable energy usage. Thus, based on the decision of the first-stage problem, the second-stage problem can be defined as follows:

\underset{{\xi^{\textrm{non}}_{t},\xi^{\textrm{sto}}_{t}}}{\min}\;(c^{\textrm{non}}_{t})^{\top}\xi^{\textrm{non}}_{t}-(c^{\textrm{sto}}_{t})^{\top}\xi^{\textrm{sto}}_{t},   (15)
\text{s.t.}\quad\xi^{\textrm{sto}}_{t}=|\xi^{\textrm{ren}}_{t}-\xi^{\textrm{non}}_{t}|,   (15a)
0\leq\xi^{\textrm{non}}_{t}\leq(\boldsymbol{\xi}^{\textrm{D}}_{t})^{\top},   (15b)
\xi^{\textrm{non}}_{t}\geq 0.   (15c)

In the second-stage problem (15), the decision variables $\xi^{\textrm{non}}_{t}$ and $\xi^{\textrm{sto}}_{t}$ depend on the realization of the energy demand $\boldsymbol{\xi}^{\textrm{D}}_{t}$ of the first-stage problem (14), where $\xi^{\textrm{ren}}_{t}$ determines the amount of renewable energy usage at time $t$. The first constraint (15a) is an equality constraint that requires the surplus amount of energy $\xi^{\textrm{sto}}_{t}$ to equal the absolute difference between the usage of renewable energy $\xi^{\textrm{ren}}_{t}$ and non-renewable energy $\xi^{\textrm{non}}_{t}$. The second constraint (15b) is an inequality constraint that uses the optimal demand value from the first-stage realization. In particular, the value of the demand comes from (5), which is the historical observation of the energy demand. Finally, constraint (15c) guards against negative values of the non-renewable energy usage $\xi^{\textrm{non}}_{t}$.

The formulated problems (14) and (15) can characterize the uncertainty between the network energy demand and the renewable energy generation. In particular, the second-stage problem (15) contains the random demand $\boldsymbol{\xi}^{\textrm{D}}_{t}$, which makes the optimal cost $\mathbb{E}[Q(\xi^{\textrm{ren}}_{t},\boldsymbol{\xi}^{\textrm{D}}_{t})]$ a random variable. As a result, we can rewrite problems (14) and (15) as one large linear programming problem for the $B+1$ BSs, formulated as follows:

\underset{{{\xi}^{\textrm{ren}}_{t},{\xi}^{\textrm{non}}_{t},{\xi}^{\textrm{sto}}_{t}}}{\min}\sum_{t\in\mathcal{T}}\Big{(}(c^{\textrm{ren}}_{t})^{\top}\xi^{\textrm{ren}}_{t}+\sum_{i\in\mathcal{B}}p_{i}[(c^{\textrm{non}}_{t})^{\top}\xi^{\textrm{non}}_{ti}-(c^{\textrm{sto}}_{t})^{\top}\xi^{\textrm{sto}}_{ti}]\Big{)},   (16)
\text{s.t.}\quad\xi^{\textrm{sto}}_{ti}=|\xi^{\textrm{ren}}_{ti}-\xi^{\textrm{non}}_{ti}|,\forall i\in\mathcal{B},   (16a)
0\leq\xi^{\textrm{non}}_{ti}\leq\xi^{\textrm{D}}_{ti},\forall i\in\mathcal{B},   (16b)
\xi^{\textrm{non}}_{ti}\geq 0,\forall i\in\mathcal{B},   (16c)
\xi^{\textrm{ren}_{\textrm{max}}}_{t}\geq\xi^{\textrm{ren}}_{ti}\geq 0,\forall i\in\mathcal{B}.   (16d)

In problem (16), for the $B+1$ BSs, the energy demands $\xi_{t0}^{\textrm{D}},\dots,\xi_{tB}^{\textrm{D}}$ occur with positive probabilities $p_{0},\dots,p_{B}$ and $\sum_{i\in\mathcal{B}}p_{i}=1$. The decision variables are ${\xi}^{\textrm{ren}}_{t}$, ${\xi}^{\textrm{non}}_{t}$, and ${\xi}^{\textrm{sto}}_{t}$, which represent the amounts of renewable, non-renewable, and storage energy, respectively. Constraint (16a) defines a relationship among all of the decision variables ${\xi}^{\textrm{ren}}_{t}$, ${\xi}^{\textrm{non}}_{t}$, and ${\xi}^{\textrm{sto}}_{t}$; in essence, this constraint discretizes the surplus amount of energy for storage. Constraint (16b) ensures that the utilization of non-renewable energy is bounded by the energy demand of the network. Constraint (16c) ensures that the decision variable ${\xi}^{\textrm{non}}_{t}$ does not take a negative value. Finally, constraint (16d) restricts the renewable energy usage ${\xi}^{\textrm{ren}}_{t}$ to the maximum capacity $\xi^{\textrm{ren}_{\textrm{max}}}_{t}$ at time $t$. Problem (16) is an integrated form of the first-stage problem in (14) and the second-stage problem in (15), where the solution of $\xi^{\textrm{non}}_{t}$ and $\xi^{\textrm{sto}}_{t}$ completely depends on the realization of the demand $\xi^{\textrm{D}}_{ti}$ for all $B+1$ BSs. The decision on $\xi^{\textrm{ren}}_{t}$ comes before the realization of the demand $\xi^{\textrm{D}}_{ti}$ and, thus, the estimation of the renewable energy generation $\xi^{\textrm{ren}}_{t}$ is independent and random. Therefore, problem (16) has the property of relatively complete recourse. In problem (16), the number of variables and constraints is proportional to the number of BSs, $B+1$. Additionally, the complexity of the decision problem (16) grows as $\mathcal{O}(2^{|\mathcal{T}|\times|\mathcal{B}|})$ due to the combinatorial properties of the decisions and constraints [46, 47, 48].

The goal of the self-powered energy dispatch problem (16) is to find an optimal energy dispatch policy that includes the amounts of renewable ${\xi}^{\textrm{ren}}_{ti}$, non-renewable ${\xi}^{\textrm{non}}_{ti}$, and storage ${\xi}^{\textrm{sto}}_{ti}$ energy of each BS $i\in\mathcal{B}$ while minimizing the energy consumption cost. Meanwhile, such an energy dispatch policy relies on the empirical probability distribution $H(\boldsymbol{\xi}^{\textrm{D}}_{t})$ of the historical demand at each BS $i\in\mathcal{B}$ at time $t$. In order to solve problem (16), we choose an approach that does not rely on the conservativeness of a theoretical probability distribution of the energy demand, and that can also capture the uncertainty of the renewable energy generation from historical data. In contrast, we could construct a theoretical probability distribution of the energy demand $\boldsymbol{\xi}^{\textrm{D}}_{t}$ only if we knew the exact distribution as well as its parameters (e.g., mean, variance, and standard deviation). In practice, however, the distribution of the energy demand $\boldsymbol{\xi}^{\textrm{D}}_{t}$ is unknown and, instead, only a certain amount of historical energy demand data is available. As a result, we cannot rely on such a distribution to measure uncertainty while the renewable energy generation and energy demand are random over time. Hence, we can obtain the time-variant features of both energy demand and generation by characterizing the Markovian properties of the historical observations over time. In particular, we capture the Markovian dynamics by considering a data-driven approach. This approach overcomes the conservativeness of a theoretical probability distribution as the number of historical observations grows.

To address the aforementioned challenges, we propose a multi-agent meta-reinforcement learning framework that can explore the Markovian behavior of the historical energy demand and generation of each BS $i\in\mathcal{B}$. Meanwhile, the meta-agent exploits such time-varying features to obtain a globally optimized energy dispatch policy for each BS $i\in\mathcal{B}$.

We design the MAMRL framework by converting the cost minimization problem (16) into a reward maximization problem that we then solve with a data-driven approach. In the MAMRL setting, each agent works as a local agent for each BS $i\in\mathcal{B}$ and determines an observation (i.e., exploration) for the decision variables: renewable $\xi^{\textrm{ren}}_{ti}$, non-renewable $\xi^{\textrm{non}}_{ti}$, and storage $\xi^{\textrm{sto}}_{ti}$ energy. The goal of this exploration is to find time-varying features from the local historical data so that the energy demand $\xi^{\textrm{d}}_{ti}$ of the network is satisfied. Furthermore, using these observations and its current state information, a meta-agent is used to determine a stochastic energy dispatch policy. To obtain such a dispatch policy, the meta-agent only requires the observations (behavior) from each local agent. Then, the meta-agent can evaluate (exploit) this behavior toward an optimal decision for dispatching energy. Further, the MAMRL approach is capable of capturing the exploration-exploitation tradeoff in a way that lets the meta-agent optimize the decisions of each self-powered BS under uncertainty. A detailed discussion of the MAMRL framework is given in the following section.
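As a hedged illustration of this cost-to-reward conversion, a local agent's per-slot reward can be taken as the negative of the per-BS cost term in (16); the exact reward shaping used by MAMRL may differ, so the snippet below is only a sketch.

```python
# Hedged sketch: one possible per-slot reward for a local BS agent,
# defined as the negative of its cost term in problem (16).


def step_reward(xi_ren, xi_non, xi_sto, c_ren, c_non, c_sto):
    """Maximizing this reward minimizes the per-BS dispatch cost."""
    cost = c_ren * xi_ren + c_non * xi_non - c_sto * xi_sto
    return -cost
```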


Figure 3: Multi-agent meta-reinforcement learning framework.

IV Energy Dispatch with Multi-Agent Meta-Reinforcement Learning Framework

In this section, we develop our proposed multi-agent meta-reinforcement learning framework (as seen in Fig. 3) for energy dispatch in the considered network. The proposed MAMRL framework includes two types of agents: a local agent that acts as a local learner at each self-powered BS with MEC capabilities, and a meta-agent that learns the global energy dispatch policy. In particular, each local BS agent can capture the Markovian dynamics of the energy demand and generation of its own BS (i.e., SBS or MBS) separately by applying deep reinforcement learning. Meanwhile, we train a long short-term memory (LSTM) [49, 50] as a meta-agent at the MBS that optimizes [26] the accumulated energy dispatch of the local agents. As a result, the meta-agent can handle the non-i.i.d. energy demand and generation of each local agent using its own LSTM state information. To this end, MAMRL mitigates the curse of dimensionality for the uncertainty of energy demand and generation while providing an energy dispatch solution with low computational and communication complexity (i.e., less message passing between the local agents and the meta-agent).

IV-A Preliminary Setup

Algorithm 1 State Space Generation of BS ii\in\mathcal{B} in MAMRL Framework
0:  Wij,Pi,gij(t),σ2,Iij(t),Υjki(t),τ,fki,ϖkil,ηstMEC(t),Si(t)W_{ij},P_{i},g_{ij}(t),\sigma^{2},I_{ij}(t),\Upsilon_{jk_{i}}(t),\tau,f_{k_{i}},\varpi_{k_{il}},\eta^{\textrm{MEC}}_{\textrm{st}}(t),S_{i}(t)
0:  δinet,ηstnet(t),ctnon,ctsto\delta_{i}^{\textrm{net}},\eta^{\textrm{net}}_{\textrm{st}}(t),c^{\textrm{non}}_{t},c^{\textrm{sto}}_{t}
0:  𝒔ti:(ξid,ξiren(t),Ctisto,Ctinon),𝒔ti𝒮i𝒮,t𝒯\boldsymbol{s}_{ti}\colon(\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti},C^{\textrm{non}}_{ti}),\forall\boldsymbol{s}_{ti}\in\mathcal{S}_{i}\in\mathcal{S},\forall t\in\mathcal{T} Initialization: i,𝒦i,𝒥i\mathcal{R}_{i},\mathcal{K}_{i},\mathcal{J}_{i},𝒮i,λi(t),μi(t),ρi(t),Ri(t)\mathcal{S}_{i},\lambda_{i}(t),\mu_{i}(t),\rho_{i}(t),R_{i}(t)
1:  for each t𝒯t\in\mathcal{T} do
2:     for each ii\in\mathcal{B} do
3:        for each j𝒥ij\in\mathcal{J}_{i} do
4:           Calculate: ξiMEC(t)\xi^{\textrm{MEC}}_{i}(t) using eq. (2)
5:           Calculate: ξinet(t)\xi^{\textrm{net}}_{i}(t) using eq. (4)
6:        end for
7:        Calculate: ξtd=ξinet(t)+ξiMEC(t)\xi^{\textrm{d}}_{t}=\xi^{\textrm{net}}_{i}(t)+\xi^{\textrm{MEC}}_{i}(t) using eq. (5)
8:        Calculate: ξtren=qξiqren(t)\xi^{\textrm{ren}}_{t}=\sum_{q\in\mathcal{R}}\xi^{\textrm{ren}}_{iq}(t)
9:        Calculate: CtnonC^{\textrm{non}}_{t} using eq. (6)
10:        Calculate: CtstoC^{\textrm{sto}}_{t} using eq. (7)
11:        Assign: 𝒔ti:(ξid,ξiren(t),Ctisto,Ctinon)\boldsymbol{s}_{ti}\colon(\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti},C^{\textrm{non}}_{ti})
12:     end for
13:     Append: 𝒔ti𝒮i\boldsymbol{s}_{ti}\in\mathcal{S}_{i}
14:  end for
15:  return  𝒮i𝒮\forall\mathcal{S}_{i}\in\mathcal{S}

In the MAMRL setting, each BS ii\in\mathcal{B} acts as a local agent and the number of local agents equals the number of BSs, B+1B+1 (i.e., 11 MBS and BB SBSs). We define a set 𝒮={𝒮0,𝒮1,,𝒮B}\mathcal{S}=\left\{{\mathcal{S}_{0},\mathcal{S}_{1},\dots,\mathcal{S}_{B}}\right\} of state spaces and a set 𝒜={𝒜0,𝒜1,,𝒜B}\mathcal{A}=\left\{{\mathcal{A}_{0},\mathcal{A}_{1},\dots,\mathcal{A}_{B}}\right\} of actions for the B+1B+1 agents. The state space of a local agent ii is defined by 𝒔ti:(ξid,ξiren(t),Ctisto,Ctinon)𝒮i\boldsymbol{s}_{ti}\colon(\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti},C^{\textrm{non}}_{ti})\in\mathcal{S}_{i}, where ξid,ξiren(t),Ctisto\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti}, and CtinonC^{\textrm{non}}_{ti} represent the amount of energy demand, renewable generation, storage cost, and non-renewable energy cost, respectively, at time tt. We execute Algorithm 1 to generate the state space of every BS ii\in\mathcal{B} individually. In Algorithm 1, lines 33 to 66 calculate the individual energy consumption of the MEC computation and the network operation using (2) and (4), respectively. The energy demand of BS ii is computed in line 77 and the self-powered energy generation is estimated in line 88. The non-renewable and storage energy costs are calculated in lines 99 and 1010 for time slot tt. Finally, line 1111 creates the state space tuple (i.e., 𝒔ti\boldsymbol{s}_{ti}) for time tt.
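For illustration, the per-BS state construction of Algorithm 1 can be sketched in Python as follows; the helper functions standing in for the energy models of (2), (4)-(7) are hypothetical placeholders and not part of the paper's implementation.

# Sketch of Algorithm 1: build the per-BS state tuple s_ti = (demand, renewable,
# storage cost, non-renewable cost) for every time slot. The helpers mec_energy,
# net_energy, renewable_gen, cost_sto, cost_non are hypothetical stand-ins for
# equations (2), (4), the generation model, (7) and (6).
def build_state_space(bs_list, time_slots,
                      mec_energy, net_energy, renewable_gen, cost_sto, cost_non):
    states = {i: [] for i in bs_list}
    for t in time_slots:
        for i in bs_list:
            xi_mec = mec_energy(i, t)          # eq. (2): MEC computation energy
            xi_net = net_energy(i, t)          # eq. (4): network operation energy
            xi_d = xi_net + xi_mec             # eq. (5): total energy demand
            xi_ren = renewable_gen(i, t)       # sum over renewable sources
            s_ti = (xi_d, xi_ren, cost_sto(i, t), cost_non(i, t))
            states[i].append(s_ti)
    return states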

IV-B Local Agent Design

Each local BS agent ii\in\mathcal{B} can take two types of actions: the amount of storage energy ξisto(t)\xi^{\textrm{sto}}_{i}(t) and the amount of non-renewable energy ξinon(t)\xi^{\textrm{non}}_{i}(t) at time tt. Hence, we consider a discrete action set that consists of the two actions 𝒂ti:(ξisto(t),ξinon(t))𝒜i\boldsymbol{a}_{ti}\colon(\xi^{\textrm{sto}}_{i}(t),\xi^{\textrm{non}}_{i}(t))\in\mathcal{A}_{i} for each BS unit ii\in\mathcal{B}. Since the state 𝒔ti\boldsymbol{s}_{ti} and the action 𝒂ti\boldsymbol{a}_{ti} both contain time-varying information of agent ii\in\mathcal{B}, we consider the Markovian dynamics and represent problem (16)\eqref{Opt_1_8} as a discounted reward maximization problem for each agent ii (i.e., each BS). Thus, the objective function of the discounted reward maximization problem of agent ii is defined as follows [28]:

ri(𝒂ti,𝒔ti)=max𝒂ti𝒜i𝔼𝒂ti𝒔ti[t=tγttΥt(𝒂ti,𝒔ti)],r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})=\underset{\boldsymbol{a}_{ti}\in\mathcal{A}_{i}}{\max}\;\mathbb{E}_{\boldsymbol{a}_{ti}\sim\boldsymbol{s}_{ti}}[\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}\Upsilon_{t}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})], (17)

where γ(0,1)\gamma\in(0,1) is a discount factor and each reward Υt(𝒂ti,𝒔ti)\Upsilon_{t}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}) is given by,

Υt(𝒂ti,𝒔ti)={1,ifξtirenξtid>1,0,otherwise.\Upsilon_{t}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})=\left\{\begin{array}[]{ll}1,\;\text{if}\;\;\frac{\xi^{\textrm{ren}}_{ti}}{\xi^{\textrm{d}}_{ti}}>1,\\ 0,\;\;\;\;\;\;\;\;\;\;\;\;\;\;\text{otherwise}.\end{array}\right. (18)

In (18), ξtirenξtid\frac{\xi^{\textrm{ren}}_{ti}}{\xi^{\textrm{d}}_{ti}} is the ratio between the renewable energy generation and the energy demand (the supply-demand ratio) of BS agent ii\in\mathcal{B} at time tt. When the generation-to-demand ratio ξtirenξtid\frac{\xi^{\textrm{ren}}_{ti}}{\xi^{\textrm{d}}_{ti}} is larger than 11, the BS agent ii achieves a reward of 11 because the renewable energy exceeds the demand and the surplus can be stored in the storage unit; otherwise, the reward is 00.
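For reference, the reward rule (18) is a direct threshold on this supply-demand ratio; a short Python transcription (with hypothetical variable names) is:

def step_reward(xi_ren, xi_d):
    # Reward (18): 1 if renewable generation exceeds the demand (the surplus
    # can be stored), 0 otherwise; guards against a zero-demand slot.
    return 1.0 if xi_d > 0 and xi_ren / xi_d > 1.0 else 0.0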

Each action 𝒂ti\boldsymbol{a}_{ti} of BS agent ii\in\mathcal{B} is drawn from a stochastic policy πθi\pi_{\theta_{i}}, where θi\theta_{i} is the parameter of πθi\pi_{\theta_{i}} and the energy dispatch policy is defined by πθi:𝒮i×𝒜i[0,1]\pi_{\theta_{i}}\colon\mathcal{S}_{i}\times\mathcal{A}_{i}\mapsto[0,1]. Policy πθi\pi_{\theta_{i}} determines a state transition function Γ:𝒮i×𝒜B𝒮i\Gamma\colon\mathcal{S}_{i}\times\mathcal{A}_{B}\mapsto\mathcal{S}_{i} for the next state 𝒔ti𝒮i\boldsymbol{s}_{t^{\prime}i}\in\mathcal{S}_{i}. The state transition function Γ\Gamma of BS agent ii\in\mathcal{B} is driven by the reward function (18)\eqref{eq:each_reward_fn}, where Υt(𝒂ti,𝒔ti):𝒮i×𝒜i\Upsilon_{t}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})\colon\mathcal{S}_{i}\times\mathcal{A}_{i}\mapsto\mathbb{R}. Further, each BS agent ii\in\mathcal{B} chooses an action 𝒂ti\boldsymbol{a}_{ti} from a parametrized energy dispatch policy πθi(𝒂ti|𝒔ti;θi)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i}). Therefore, for a given state 𝒔ti\boldsymbol{s}_{ti}, the state value function with the cumulative discounted reward is:

Vπθi(𝒔ti)=𝔼𝒂tiπθi(𝒂ti|𝒔ti;θi)[t=tγttΥt+t(𝒂ti,𝒔ti)|𝒔ti,𝒂ti],V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti})=\mathbb{E}_{\boldsymbol{a}_{ti}\sim\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i})}[\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}\Upsilon_{t+t^{\prime}}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})|\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}], (19)

where the discount factor γtt\gamma^{t^{\prime}-t} ensures the convergence of the state value function Vπθi(𝒔ti)V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti}) over an infinite time horizon. Thus, for a given state 𝒔ti\boldsymbol{s}_{ti}, the optimal policy πθi(𝒂ti|𝒔ti)\pi_{\theta_{i}}^{*}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti}) for the next state 𝒔ti\boldsymbol{s}_{t^{\prime}i} can be determined by an optimal state value function under the Markovian property. Therefore, the optimal value function is given as follows:

Vπθi(𝒔ti)=maxat𝒜𝔼πθi[iri(𝒂ti,𝒔ti)+t=tγttVtπθi(𝒔ti)|𝒔ti;θi,𝒂ti].\begin{split}V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})=\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \underset{a_{t}\in\mathcal{A}}{\max}\;\mathbb{E}_{\pi_{\theta_{i}}^{*}}[\sum_{i\in\mathcal{B}}r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i})+\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}V_{t^{\prime}}^{\pi_{\theta_{i}}}(\boldsymbol{s}_{t^{\prime}i})|\boldsymbol{s}_{ti};\theta_{i},\boldsymbol{a}_{ti}].\end{split} (20)

Here, the optimal value function (20) learns a parameterized policy πθi(𝒂ti|𝒔ti;θi)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i}) by using an LSTM-based Q-network with parameters θi\theta_{i}. Thus, each BS agent ii\in\mathcal{B} determines its parametrized energy dispatch policy πθi(𝒂ti|𝒔ti;θi)=P(𝒂ti|𝒔ti;θi)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i})=P(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i}), where P(ξisto(t))=P(𝒂ti=ξisto(t)|𝒔ti;θi)P(\xi^{\textrm{sto}}_{i}(t))=P(\boldsymbol{a}_{ti}=\xi^{\textrm{sto}}_{i}(t)|\boldsymbol{s}_{ti};\theta_{i}) and P(ξinon(t))=1P(ξisto(t))P(\xi^{\textrm{non}}_{i}(t))=1-P(\xi^{\textrm{sto}}_{i}(t)) for the parameters θi\theta_{i}. The decision of each BS agent ii\in\mathcal{B} relies on θi\theta_{i}. In particular, the energy dispatch policy πθi\pi_{\theta_{i}} is the probability of taking action 𝒂ti\boldsymbol{a}_{ti} for a given state 𝒔ti\boldsymbol{s}_{ti} with parameters θi\theta_{i}. In this setting, each local agent ii\in\mathcal{B} comprises an actor and a critic [27, 51]. The energy dispatch policy is determined by choosing an action in (20), which can be seen as the actor of BS agent ii. Meanwhile, the value function (19) is estimated by the critic of each local BS agent ii, which evaluates the actions made by the actor. Therefore, each BS agent ii\in\mathcal{B} can determine a temporal difference (TD) error [51] based on the current energy dispatch policy of the actor and the value estimation of the critic. This TD error serves as an advantage function, which for agent ii is defined as follows:

Λπθi(𝒔ti,𝒂ti)=(ri(𝒂ti,𝒔ti)+t=tγttVπθi(𝒔ti))Vπθi(𝒔ti).\begin{split}\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})=\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \Big{(}r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})+\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{t^{\prime}i})\Big{)}-V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti}).\end{split} (21)

Thus, the policy gradient of each BS agent ii\in\mathcal{B} is determined as,

θiΛπθi(𝒔ti,𝒂ti)=𝔼πθi[t=tγttθilogπθi(𝒂ti|𝒔ti;θi)Λπθi(𝒔ti,𝒂ti)],\begin{split}\nabla_{\theta_{i}}\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})=\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \mathbb{E}_{\pi_{\theta_{i}}}[\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i})\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})],\end{split} (22)

where logπθi(𝒂ti|𝒔ti;θi)\log\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i}) and Λπθi(𝒔ti,𝒂ti)\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}) correspond to the actor and the critic, respectively, of each local BS agent ii\in\mathcal{B}.
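To make the actor-critic update in (21) and (22) concrete, the following Python sketch applies a one-step TD advantage to a tabular softmax policy over the two dispatch actions; the tabular arrays theta and value are hypothetical stand-ins for the LSTM networks described later, not the paper's implementation.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_update(theta, value, states, actions, rewards, gamma=0.9, lr=1e-3):
    # Sketch of (21)-(22): the critic's one-step TD error serves as the advantage
    # weighting the policy-gradient step of a softmax policy over the two actions
    # (storage, non-renewable). theta[s] holds 2 logits per state index, value[s]
    # the critic estimate V(s).
    for t in range(len(rewards) - 1):
        s, s_next, a = states[t], states[t + 1], actions[t]
        advantage = rewards[t] + gamma * value[s_next] - value[s]   # eq. (21)
        probs = softmax(theta[s])
        grad_log = -probs
        grad_log[a] += 1.0                    # gradient of log pi(a|s) w.r.t. logits
        theta[s] = theta[s] + lr * advantage * grad_log             # eq. (22), ascent
        value[s] = value[s] + lr * advantage                        # critic update
    return theta, value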

Using (22), we can discretize the energy dispatch decision 𝒂ti:(ξisto(t),ξinon(t))\boldsymbol{a}_{ti}\colon(\xi^{\textrm{sto}}_{i}(t),\xi^{\textrm{non}}_{i}(t)) for each self-powered BS ii\in\mathcal{B} in the network. In fact, we can achieve a centralized solution for i\forall i\in\mathcal{B} when the state information (i.e., demand and generation) of all BSs is known. However, the space complexity of the computation increases as O(2|𝒮i|×|𝒜i|×||×T)O(2^{|\mathcal{S}_{i}|\times|\mathcal{A}_{i}|\times|\mathcal{B}|\times T}) and the computational complexity becomes O(|𝒮i|×|𝒜i|×||2×T)O({|\mathcal{S}_{i}|\times|\mathcal{A}_{i}|\times|\mathcal{B}|^{2}}\times T) [21]. Further, such a solution does not resolve the exploration-exploitation dilemma since the centralized (i.e., single-agent) method ignores the interactions and energy dispatch decision strategies of the other agents (i.e., BSs), which creates an imbalance between exploration and exploitation. In other words, this learning approach optimizes the action policy by exploring only its own state information. Therefore, when the energy environment (i.e., demand and generation) changes, this method cannot cope with the unknown environment due to the lack of diverse state information during training. Next, we propose an approach that not only reduces the complexity but also explores alternative energy dispatch decisions to achieve the highest expected reward in (17).

IV-C Multi-Agent Meta-Reinforcement Learning Modeling

We consider a set 𝒪={𝒪0,𝒪1,,𝒪B}\mathcal{O}=\left\{{\mathcal{O}_{0},\mathcal{O}_{1},\dots,\mathcal{O}_{B}}\right\} of B+1B+1 observations [24, 52], and for a BS agent ii\in\mathcal{B}, a single observation tuple is given by 𝒐i𝒪i\boldsymbol{o}_{i}\in\mathcal{O}_{i}. For a given state 𝒔ti\boldsymbol{s}_{ti}, the observation 𝒐i\boldsymbol{o}_{i} of the next state 𝒔ti\boldsymbol{s}_{t^{\prime}i} consists of 𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti)){\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))}, where ri(𝒂ti,𝒔ti)r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}), ri(𝒂ti,𝒔ti)r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}), 𝒂ti\boldsymbol{a}_{ti}, 𝒂ti\boldsymbol{a}_{t^{\prime}i}, tt^{\prime}, and Λπθi(𝒔ti,𝒂ti)\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}) are the next-state discounted reward, the current-state discounted reward, the current action, the next action, the next time slot, and the TD error, respectively. Here, the observation 𝒐i\boldsymbol{o}_{i} is correlated with the state space through 𝒐i:𝒮i𝒪i\boldsymbol{o}_{i}\colon\mathcal{S}_{i}\mapsto\mathcal{O}_{i}, while the observation 𝒐i\boldsymbol{o}_{i} does not require the complete information of the previous states.

Thus, the space complexity of the computation at each BS agent ii\in\mathcal{B} is O((|𝒮i|+|𝒜i|)2×T)O((|\mathcal{S}_{i}|+|\mathcal{A}_{i}|)^{2}\times T). Meanwhile, the computational complexity for each time slot tt becomes O(|𝒮i|2×𝒜i×θt+H)O(|\mathcal{S}_{i}|^{2}\times\mathcal{A}_{i}\times\theta_{t}+H), where θt\theta_{t} is the learning parameter and HH is the number of LSTM units. Each BS agent ii\in\mathcal{B} needs to send an amount |𝒪i||\mathcal{O}_{i}| of observational data (i.e., payload) to the meta-agent. Therefore, the communication overhead of each BS agent ii\in\mathcal{B} is O(|𝒪|×TB+1)O(\frac{|\mathcal{O}|\times T}{B+1}). On the other hand, the computational complexity of the meta-agent is O(|𝒪|2×ϕ+H)O(|\mathcal{O}|^{2}\times\phi+H), where ϕ\phi represents the learning parameters of the meta-agent. In particular, for a fixed number of output memory ϕ\phi, the meta-agent’s update complexity at each time slot tt becomes O(ϕ2)O(\phi^{2}) [53]. Further, when transferring the learned parameters θt\theta_{t^{\prime}} from the meta-agent to all local agents i\forall i\in\mathcal{B}, the communication overhead is O(θt×(B+1))O(\theta_{t^{\prime}}\times(B+1)) at each time slot tt. Here, the size of θt\theta_{t^{\prime}} depends on the memory size of the LSTM cell at the meta-agent [see Appendix A].

In the MAMRL framework, the local agents work as optimizees and the meta-agent performs the role of the optimizer [26]. To model our meta-agent, we consider an LSTM architecture [49, 50] that stores its own state information (i.e., parameters), while the local agent (i.e., optimizee) only provides the observation of the current state. In the proposed MAMRL framework, a policy πθi\pi_{\theta_{i}} is determined by updating the parameters θi\theta_{i} (we consider recurrent neural network (RNN) state parameters for the parameterization of the energy dispatch policy; in particular, we consider a long short-term memory (LSTM) RNN, in which the cell state and hidden state are taken as the parameters). Therefore, we can represent the state value function (20) for time tt as Vπθi(𝒔ti)Vπθi(𝒔ti;θt)V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})\approx V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti};\theta_{t}), and the advantage (temporal difference) function (21) as Λπθi(𝒔ti,𝒂ti)Λπθi(𝒔ti,𝒂ti;θt)\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})\approx\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti};\theta_{t}). As a result, the parameterized policy is defined by πθi(𝒂ti|𝒔ti)πθi(𝒂ti|𝒔ti;θt)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti})\approx\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t}). Considering all B+1B+1 BS agents, the advantage function (21)\eqref{eq:TD_fn} is rewritten as,

Λπθi(𝒔ti,𝒂t0,,𝒂tB;θt)=ri(𝒔ti,𝒂t0,,𝒂tB)+𝒔ti𝒮i,t=tγttΓ(𝒔ti|𝒔ti,𝒂t0,,𝒂tB)Vπθi(𝒔ti,πθ0,,πθB)Vπθi(𝒔ti,πθ0,,πθB),\begin{split}\Lambda^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB};\theta_{t})=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\;+\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \sum_{\boldsymbol{s}_{t^{\prime}i}\in\mathcal{S}_{i},t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}\Gamma(\boldsymbol{s}_{t^{\prime}i}|\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{t^{\prime}i},\pi_{\theta_{0}}^{*},\dots,\pi_{\theta_{B}}^{*})\;-\\ V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\pi_{\theta_{0}}^{*},\dots,\pi_{\theta_{B}}^{*}),\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\end{split} (23)

where πθ:(πθ0,,πθB)\pi_{\theta}^{*}\colon(\pi_{\theta_{0}}^{*},\dots,\pi_{\theta_{B}}^{*}) is the joint energy dispatch policy and Γ(𝒔ti|𝒔ti,𝒂t0,,𝒂tB)[0,1]\Gamma(\boldsymbol{s}_{t^{\prime}i}|\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\mapsto[0,1] represents the state transition probability. Using (23), we obtain the value loss function of agent ii, whose objective is to minimize the temporal difference [27],

L(θi)=minπθi1||i12((ri(𝒂ti,𝒔ti)+t=tγttVπθi(𝒔ti|θt))Vπθi(𝒔ti))2.\begin{split}L(\theta_{i})=\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \underset{\pi_{\theta_{i}}}{\min}\;\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\frac{1}{2}\bigg{(}\Big{(}r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti})+\sum_{t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{t^{\prime}i}|\theta_{t})\Big{)}-V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})\bigg{)}^{2}.\end{split} (24)

To improve the exploration with a low bias, we add an entropy regularization term βh(πθi(𝒂ti|𝒔ti;θt))\beta h(\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t})) that copes with the non-i.i.d. energy demand and generation of all BS agents i\forall i\in\mathcal{B}. Entropy [54, 55, 56, 57] allows us to manage non-i.i.d. datasets when changes in the environment over time lead to uncertainty; therefore, we use entropy regularization to handle the non-i.i.d. energy demand and generation over time for each BS agent ii\in\mathcal{B}. Here, β\beta is a coefficient for the magnitude of the regularization and h(πθi(𝒂ti|𝒔ti;θt))h(\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t})) is the entropy of the policy πθi\pi_{\theta_{i}} with parameters θi\theta_{i}. A larger value of βh(πθi(𝒂ti|𝒔ti;θt))\beta h(\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t})) encourages the agents to perform a more diverse exploration when estimating the energy dispatch policy. Thus, we can redefine the policy loss function as follows:

L(θi)=𝔼𝒔ti,𝒂ti[πθi(𝒂ti|𝒔ti)+βh(πθi(𝒂ti|𝒔ti;θt))].\begin{split}L(\theta_{i})=-\mathbb{E}_{\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}}[\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti})+\beta h(\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t}))].\end{split} (25)

Therefore, the policy gradient of the loss function (25) is defined in terms of the temporal difference and the entropy as follows:

θiL(θi)=1||it=tθilogπθi(𝒂ti|𝒔ti)Λπθi(𝒔ti,𝒂ti|θt)+βh(πθi(𝒂ti|𝒔ti;θt)).\begin{split}\nabla_{\theta_{i}}L(\theta_{i})=\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\sum_{t^{\prime}=t}^{\infty}\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti})\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}|\theta_{t})\\ +\;\beta h(\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{t})).\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\ \end{split} (26)
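Continuing the earlier tabular sketch, the entropy bonus of (25) and (26) can be added to the policy-gradient weight as shown below; β follows the value reported in Table IV, and the softmax/gradient expressions assume the same hypothetical two-action logit parameterization as before.

import numpy as np

def policy_entropy(probs, eps=1e-12):
    # Entropy h(pi) of the two-action energy dispatch policy.
    return -np.sum(probs * np.log(probs + eps))

def entropy_regularized_grad(probs, action, advantage, beta=0.05):
    # Sketch of (26): the log-policy gradient is weighted by the TD advantage,
    # and the gradient of the entropy term keeps the policy exploratory.
    grad_log = -probs.copy()
    grad_log[action] += 1.0
    grad_entropy = -probs * (np.log(probs + 1e-12) + policy_entropy(probs))
    return advantage * grad_log + beta * grad_entropy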

To design our meta-agent, we consider the meta-agent parameters ϕ\phi and the optimized parameters θ\theta^{*} of the optimizee (i.e., the local agent). The meta-agent is defined as Mt(𝒪t;ϕ)Mt(θtL(θt);ϕ)M_{t}(\mathcal{O}_{t};\phi)\coloneqq M_{t}(\nabla_{\theta_{t}}L(\theta_{t});\phi), where Mt(.)M_{t}(.) is modeled by an LSTM. Consider an observational vector 𝒪it𝒪\mathcal{O}_{it^{\prime}}\in\mathcal{O} of a local BS agent ii\in\mathcal{B} at time tt^{\prime}, where each observation is 𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti))𝒪it{\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))}\in\mathcal{O}_{it^{\prime}}. The LSTM-based meta-agent takes the observational vector 𝒪it\mathcal{O}_{it^{\prime}} as an input. Meanwhile, the meta-agent holds long-term dependencies by generating its own state with parameters ϕ\phi. To do this, the LSTM model uses several gates to determine an optimal policy πθi\pi_{\theta_{i}}^{*} and the advantage function Λπθi(𝒔ti,𝒂t0,,𝒂tB;θt)\Lambda^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB};\theta_{t}) for the next state 𝒔ti\boldsymbol{s}_{t^{\prime}i}. As a result, the structure of the recurrent neural network of the meta-agent is the same as that of the LSTM model [49, 50]. In particular, each LSTM unit of the meta-agent consists of four gate layers: a forget gate 𝑭t\boldsymbol{F}_{t^{\prime}}, an input gate 𝑰t\boldsymbol{I}_{t^{\prime}}, a cell state 𝑬^t\boldsymbol{\hat{E}}_{t^{\prime}}, and an output layer 𝒁t\boldsymbol{Z}_{t^{\prime}}. The cell state gate 𝑬^t\boldsymbol{\hat{E}}_{t^{\prime}} uses a tanh\tanh activation function, while the other gates use the sigmoid σ(.)\sigma(.) activation function. Thus, the outcome of the meta policy for a single-unit LSTM cell is presented as follows:

Mt(𝒪t;ϕ)=softmax((𝑯t)),\displaystyle M_{t^{\prime}}(\mathcal{O}_{t^{\prime}};\phi)=\text{softmax}\Big{(}(\boldsymbol{H}_{t^{\prime}})^{\top}\Big{)},\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; (27)
where𝑭t=σ(ϕFO(𝒪it)+ϕFH(𝑯t)+𝒃F),\displaystyle\text{where}\quad\boldsymbol{F}_{t^{\prime}}=\sigma\Big{(}\phi_{FO}(\mathcal{O}_{it^{\prime}})^{\top}+\phi_{FH}(\boldsymbol{H}_{t})^{\top}+\boldsymbol{b}_{F}\Big{)},\;\;\;\; (27a)
𝑰𝒕=σ(ϕIO(𝒪it)+ϕIH(𝑯t)+𝒃I),\displaystyle\boldsymbol{I_{t^{\prime}}}=\sigma\Big{(}\phi_{IO}(\mathcal{O}_{it^{\prime}})^{\top}+\phi_{IH}(\boldsymbol{H}_{t})^{\top}+\boldsymbol{b}_{I}\Big{)},\;\;\;\;\;\;\;\; (27b)
𝑬𝒕^=tanh(ϕEO(𝒪it)+ϕEH(𝑯t)+𝒃E),\displaystyle\boldsymbol{\hat{E_{t^{\prime}}}}=\tanh\Big{(}\phi_{EO}(\mathcal{O}_{it^{\prime}})^{\top}+\phi_{EH}(\boldsymbol{H}_{t})^{\top}+\boldsymbol{b}_{E}\Big{)}, (27c)
𝑬t=𝑬^𝒕𝑰t+𝑭t𝑬t,\displaystyle\boldsymbol{E}_{t^{\prime}}=\boldsymbol{\hat{E}_{t^{\prime}}}\odot\boldsymbol{I}_{t^{\prime}}+\boldsymbol{F}_{t^{\prime}}\odot\boldsymbol{E}_{t},\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; (27d)
𝒁t=σ(ϕZO(𝒪it)+ϕZH(𝑯t)+𝒃Z),\displaystyle\boldsymbol{Z}_{t^{\prime}}=\sigma\Big{(}\phi_{ZO}(\mathcal{O}_{it^{\prime}})^{\top}+\phi_{ZH}(\boldsymbol{H}_{t})^{\top}+\boldsymbol{b}_{Z}\Big{)},\;\;\;\;\; (27e)
𝑯t=tanh(𝑬t)𝒁t.\displaystyle\boldsymbol{H}_{t^{\prime}}=\tanh(\boldsymbol{E}_{t^{\prime}})\odot\boldsymbol{Z}_{t^{\prime}}.\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; (27f)


Figure 4: Recurrent neural network architecture for the proposed multi-agent meta-reinforcement learning framework.
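As a concrete illustration of the gate equations (27a)-(27f), the following numpy sketch performs one LSTM step; for compactness, the per-gate input and hidden weight matrices (e.g., ϕFO\phi_{FO} and ϕFH\phi_{FH}) are stacked into single matrices acting on the concatenated input, which is an equivalent but hypothetical arrangement rather than the paper's code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(o_t, h_prev, c_prev, W, b):
    # One LSTM unit step following (27a)-(27f). W["F"], W["I"], W["E"], W["Z"]
    # stack the input and hidden weights of each gate; b holds the biases.
    x = np.concatenate([o_t, h_prev])
    f = sigmoid(W["F"] @ x + b["F"])        # forget gate, (27a)
    i = sigmoid(W["I"] @ x + b["I"])        # input gate, (27b)
    e_hat = np.tanh(W["E"] @ x + b["E"])    # candidate cell state, (27c)
    c = e_hat * i + f * c_prev              # cell state update, (27d)
    z = sigmoid(W["Z"] @ x + b["Z"])        # output gate, (27e)
    h = np.tanh(c) * z                      # hidden state, (27f)
    return h, c   # (c, h) are the RNN state parameters transferred to local agents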

In the meta-agent policy formulation (27), the forget gate vector (27a) determines which information should be thrown away. The input gate vector (27b) decides which information should be updated, the cell state gate (27c) creates a vector of new candidate values using the tanh()\tanh(\cdot) function, and the cell state information is updated by applying (27d). The output layer (27e) determines which parts of the cell state are passed to the output, and the cell output is calculated using (27f)\eqref{meta_agent_model:H}. Passing the cell state through tanh()\tanh(\cdot) restricts the values between 1-1 and +1+1. This entire process is followed for each LSTM block, and finally, (27)\eqref{meta_agent_model} determines the meta-policy for πθi\pi_{\theta_{i}}^{*} of state sts_{t^{\prime}}. In addition, the optimized RNN state parameters θ\theta^{*} are obtained from the cell state (27d) and hidden state (27f)\eqref{meta_agent_model:H} of an LSTM unit. The loss function L(ϕ)=𝔼L(θ)[L(θ(L(θ);ϕ))]L(\phi)=\mathbb{E}_{L(\theta)}[L(\theta^{*}(L(\theta);\phi))] of the meta-agent thus depends on the distribution of L(θt)L(\theta_{t}), and the expectation of the meta-agent loss function is defined as follows [26]:

L(ϕ)=𝔼L(θ)[t=1TL(θt)].\begin{split}L(\phi)=\mathbb{E}_{L(\theta)}[\sum_{t=1}^{T}L(\theta_{t})].\end{split} (28)

In the proposed MAMRL framework, we transfer the learned parameters (i.e., cell state and hidden state) of the meta-agent to the local agents so that each local agent can estimate an optimal energy dispatch policy by updating its own learning parameters. Thus, the parameters of each agent (i.e., BS) are updated with θt=θ\theta_{t^{\prime}}=\theta^{*} while πθi=Mt(θtL(θt);ϕ)\pi_{\theta_{i}}^{*}=M_{t}(\nabla_{\theta_{t}}L(\theta_{t});\phi) decides the energy dispatch policy.

We consider an LSTM-based recurrent neural network (RNN) for both the local agents and the meta-agent. This LSTM RNN consists of 4848 LSTM units in each LSTM cell, as shown in Fig. 4. The configuration of the LSTM is the same for the meta-agent and each local agent, while the objectives of their loss functions differ. Each local BS agent determines its own energy dispatch policy by exploring its own environmental state information so as to reduce the TD error. Meanwhile, the meta-agent deals with the observations of each local BS agent by exploiting its own RNN state information with an entropy-based loss function that captures the non-i.i.d. energy demand and generation of each local BS. Therefore, having different loss functions for the local agents and the meta-agent leads the proposed MAMRL model to learn a domain-specific generalized model so that it can cope with an unknown environment. Further, this RNN contains a branch of two fully connected output layers on top of the LSTM cell. In particular, a fully connected layer with a softmax activation is used to determine the energy dispatch policy, and another fully connected output layer without an activation function is deployed for value function estimation. Thus, the advantage is calculated based on the value estimate from the second fully connected layer. Each local LSTM-based RNN receives the current reward ri(𝒂tir_{i}(\boldsymbol{a}_{ti}, 𝒔ti)\boldsymbol{s}_{ti}), the current action 𝒂ti\boldsymbol{a}_{ti}, and the next time slot tt^{\prime} as inputs for each BS agent ii\in\mathcal{B}. Meanwhile, this local LSTM model estimates a policy πθi\pi_{\theta_{i}} and a value Vπθi(𝒔ti)V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti}) for BS agent ii\in\mathcal{B}. On the other hand, the meta-agent's LSTM-based RNN takes as input an observational tuple 𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti)){\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))} from each BS agent ii\in\mathcal{B}. This observation consists of the current and next rewards, the current and next actions, the next time slot, and the TD error of BS agent ii. The meta-agent then estimates the parameters θt\theta_{t^{\prime}} to find a globally optimal energy dispatch policy πθi\pi_{\theta_{i}}^{*} for each BS ii\in\mathcal{B}. The learned parameters of the meta-agent are transferred to each local BS agent ii\in\mathcal{B} asynchronously, and the local agent updates its own parameters to estimate the globally optimal energy dispatch policy via its local LSTM-based RNN. In particular, the learned parameters (i.e., RNN states) that are transferred from the meta-agent to each local agent ii\in\mathcal{B} consist of the cell state and hidden state of the LSTM cell, which do not depend on either of the fully connected output layers of the proposed RNN architecture. Meanwhile, each local agent ii\in\mathcal{B} updates its own RNN states using the parameters transferred by the meta-agent. We consider a cellular network for exchanging observations and parameters between the local BS agents and the meta-agent.
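The paper's implementation uses the TensorFlow 1.x tf.contrib.rnn BasicLSTMCell API (Table IV); purely for illustration, the shared architecture of one 48-unit LSTM cell feeding a softmax policy head and a linear value head could be sketched in tf.keras as follows, where the layer names and the observation dimension are assumptions.

import tensorflow as tf

def build_agent_network(obs_dim, n_actions=2, lstm_units=48):
    # Shared RNN architecture of Fig. 4: one LSTM cell feeding a softmax policy
    # head (energy dispatch policy) and a linear value head (value estimation).
    inputs = tf.keras.Input(shape=(None, obs_dim))    # sequence of observations
    h = tf.keras.layers.LSTM(lstm_units)(inputs)
    policy = tf.keras.layers.Dense(n_actions, activation="softmax", name="policy")(h)
    value = tf.keras.layers.Dense(1, name="value")(h)
    return tf.keras.Model(inputs, [policy, value])

model = build_agent_network(obs_dim=6)    # e.g., the 6-element observation tuple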

Algorithm 2 Local Agent Training of Energy Dispatch of BS ii\in\mathcal{B} in MAMRL Framework
0:  𝒔ti:(ξid,ξiren(t),Ctisto,Ctinon),𝒔ti𝒮i,tT\boldsymbol{s}_{ti}\colon(\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti},C^{\textrm{non}}_{ti}),\forall\boldsymbol{s}_{ti}\in\mathcal{S}_{i},\forall t\in T
0:  𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti)){\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))}, 𝒐i𝒪i\boldsymbol{o}_{i}\in\mathcal{O}_{i}, θtL(θt)\nabla_{\theta_{t}}L(\theta_{t}) Initialization: LocalLSTM(.)LocalLSTM(.), θi,i,γ,𝒪i\theta_{i},i\in\mathcal{B},\gamma,\mathcal{O}_{i}
1:  for episode = 1 to maximum episodes do
2:     Initialization: epcBuf[]epcBuf[]
3:     for each t𝒯t\in\mathcal{T} do
4:        for step=1step=1 to MaxStepMaxStep do
5:           Calculate: ri(𝒂ti,𝒔ti)r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}) using eq. (17)
6:           Calculate: Vπθi(𝒔ti)V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti}) using eq. (19)
7:           Choose Action: 𝒂tiπθi(𝒂ti|𝒔ti)\boldsymbol{a}_{ti}\sim\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti})
8:           Append: epcBuf[𝒂ti,ri(𝒂ti,𝒔ti),t,step,Vπθi(𝒔ti)]epcBuf[\boldsymbol{a}_{ti},r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),t,step,V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti})]
9:        end for
10:        LocalLSTM(ri(𝒂ti,𝒔ti),𝒂ti,t=t+1)LocalLSTM(r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},t^{\prime}=t+1) {LSTM-based local RNN block}
11:        {
12:        Evaluate: Λπθi(𝒔ti,𝒂ti)\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}) using eq. (21)
13:        Local agent policy gradient: θiΛπθi(𝒔ti,𝒂ti)\nabla_{\theta_{i}}\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}) using eq. (22) {In (22), πθi(𝒂ti|𝒔ti;θi)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti};\theta_{i}) is determined by a fully connected output layer with a softmax activation function and Λπθi(𝒔ti,𝒂ti)\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}) is calculated through a fully connected output layer without activation function}
14:        }
15:        Append: 𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti))\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})), 𝒐i𝒪i\boldsymbol{o}_{i}\in\mathcal{O}_{i}
16:        Get Meta-agent policy πθi\pi_{\theta_{i}}^{*} and RNN states θ\theta^{*}: Mt(𝒪t;ϕ)M_{t}(\mathcal{O}_{t};\phi) using Algorithm 3
17:        Update: θt=θ\theta_{t^{\prime}}=\theta^{*} {RNN states update}
18:     end for
19:  end for
20:  return  new_state(𝐬ti=argmaxπθi(𝐚ti)),i(\boldsymbol{s}_{t^{\prime}i}=\operatorname*{argmax}_{\pi_{\theta_{i}}^{*}}(\boldsymbol{a}_{ti})),i\in\mathcal{B}

We run the proposed Algorithm 2 at each self-powered BS ii\in\mathcal{B} with MEC capabilities as local agent ii. The input of Algorithm 2 is the state information 𝒮i\mathcal{S}_{i} of local agent ii, which is the output of Algorithm 1. The cumulative discounted reward (17) and the state value in (19) are calculated in lines 55 and 66 of Algorithm 2, respectively, for each step of time step tt up to the maximum step size (to capture the heterogeneity of the energy demand and generation of each BS separately, we take the number of user tasks executed by each BS agent ii\in\mathcal{B} during one observational period tt as the step size). Consequently, based on an action 𝒂ti\boldsymbol{a}_{ti} chosen from the estimated policy πθi(𝒂ti|𝒔ti)\pi_{\theta_{i}}(\boldsymbol{a}_{ti}|\boldsymbol{s}_{ti}) (in line 77), the episode buffer is generated and appended in line 88. The advantage function (21) of local agent ii is evaluated in line 1212 and the policy gradient (22) is calculated in line 1313 using an LSTM-based local neural network. Algorithm 2 generates the observational tuple 𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti))\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti})) in line 1515. Here, we transfer the knowledge of local BS agent ii\in\mathcal{B} to the meta-agent learner (deployed in the MBS) in Algorithm 3 so as to optimize the energy dispatch decision (in Algorithm 2, line 1616). The observation tuple 𝒐i\boldsymbol{o}_{i} of local BS agent ii consists of only the decision of BS ii, which does not require sending all of the state information to the meta-agent learner. Employing the meta-agent policy gradient, each local agent updates its energy dispatch decision policy in line 1717 of Algorithm 2. Finally, the energy dispatch policy is executed in line 2020 at BS ii\in\mathcal{B} by local agent ii.
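A compact Python skeleton of this local-agent loop could look as follows; env, local_net, and meta_agent are hypothetical objects standing in for the MEC environment of Algorithm 1, the local LSTM network, and the meta-agent learner of Algorithm 3, so this is a sketch of the control flow rather than the paper's code.

import numpy as np

def run_local_agent(env, local_net, meta_agent, episodes, time_slots, max_steps):
    # Skeleton of Algorithm 2 for one local BS agent.
    for _ in range(episodes):
        for t in range(time_slots):
            buffer = []
            for _ in range(max_steps):                             # lines 4-9
                s = env.state(t)
                policy, value = local_net.forward(s)               # eqs. (17), (19)
                a = int(np.random.choice(len(policy), p=policy))   # line 7
                r = env.reward(a, s)                               # eq. (18)
                buffer.append((s, a, r, value))
            obs = local_net.train_on(buffer)                       # lines 10-15: TD error,
                                                                   # policy gradient, observation
            policy_star, theta_star = meta_agent.update(obs)       # line 16 (Algorithm 3)
            local_net.set_rnn_states(theta_star)                   # line 17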

Algorithm 3 Meta-Agent Learner of Energy Dispatch in MAMRL Framework
0:  𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti)),{\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))}, 𝒐i𝒪i\forall\boldsymbol{o}_{i}\in\mathcal{O}_{i}, t𝒯t\in\mathcal{T}, ii\in\mathcal{B}
0:  ϕ\phi Initialization: MetaLSTM(.)MetaLSTM(.), ϕ\phi, πθi\pi_{\theta_{i}}, γ\gamma
1:  for each t𝒯t\in\mathcal{T} do
2:     for each ii\in\mathcal{B} do
3:        𝒐i:(ri(𝒂ti,𝒔ti),ri(𝒂ti,𝒔ti),𝒂ti,𝒂ti,t,Λπθi(𝒔ti,𝒂ti)),{\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))}, 𝒐i𝒪i\boldsymbol{o}_{i}\in\mathcal{O}_{i}
4:        MetaLSTM(𝒪i,πθi)MetaLSTM(\mathcal{O}_{i},\pi_{\theta_{i}}) {LSTM-based RNN block}
5:        { {Lines from 66 to 77 using fully connected output layer without activation function}
6:        Entropy loss: L(θi)L(\theta_{i}) using eq. (25)
7:        Gradient of the loss: θtL(θt)\nabla_{\theta_{t}}L(\theta_{t}) using eq. (26) {Policy is estimated using a fully connected output layer with softmax activation function}
8:        Calculate: πθi=Mt(θtL(θt);ϕ)\pi_{\theta_{i}}=M_{t}(\nabla_{\theta_{t}}L(\theta_{t});\phi) using eq. (27)
9:        Get meta policy loss L(ϕ)L(\phi) using eq. (28)
10:        Update: πθi=πθi\pi_{\theta_{i}}^{*}=\pi_{\theta_{i}}
11:        Get RNN states: θ\theta^{*} {cell state and hidden state from the LSTM cell}
12:        }
13:     end for
14:     Send: Meta-agent policy πθi\pi_{\theta_{i}}^{*} and RNN states θ\theta^{*}
15:  end for
16:  return  

The meta-agent learner (Algorithm 3, deployed in the MBS) receives the observations 𝒪i𝒪\mathcal{O}_{i}\in\mathcal{O} from each local BS agent ii\in\mathcal{B} asynchronously. Then the meta-agent asynchronously updates the meta policy gradient of each BS agent ii\in\mathcal{B}. Lines 44 to 1212 of Algorithm 3 represent the LSTM block of the meta-agent. In Algorithm 3, the entropy loss (25) and the gradient of the loss (26) are estimated in lines 66 and 77, respectively. To estimate these, Algorithm 3 deploys a fully connected output layer without an activation function, so that the advantage loss can be calculated without affecting the value calculated by the value function of the proposed MAMRL framework. The meta-agent energy dispatch policy is updated in line 1010 of Algorithm 3. Before that, a fully connected output layer with a softmax activation function on the LSTM cell assists in determining the energy dispatch policy and the meta policy loss in lines 88 and 99 of Algorithm 3, respectively. Additionally, the meta-agent utilizes the observations of the local agents and determines its own state information, which helps to estimate the energy dispatch policy of the meta-agent. In line 1111, the meta-agent RNN states θ\theta^{*} (i.e., the cell and hidden states) are obtained from the considered LSTM cell in Algorithm 3. Finally, the meta-agent policy and RNN states are transferred to each BS agent for updating the parameters (i.e., RNN states) of each local BS agent. To this end, the meta-agent learner is deployed at the center node (i.e., MBS) of the considered network and sends the learning parameters of the optimal energy dispatch policy to each local BS (i.e., MBS and SBSs) through the network.
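As a sketch, the per-slot meta-agent update of Algorithm 3 can be summarized as follows; meta_lstm is a hypothetical wrapper around the meta-agent's LSTM network and optimizer, not the paper's implementation.

def meta_agent_step(meta_lstm, observations_per_bs):
    # Skeleton of Algorithm 3: consume the observation tuples of all local BS
    # agents, run them through the meta-agent LSTM (27), apply the entropy-
    # regularized loss (25)-(26), and return the policy and RNN states per BS.
    results = {}
    for bs_id, obs in observations_per_bs.items():
        policy = meta_lstm.forward(obs)                # eq. (27), softmax head
        loss = meta_lstm.entropy_loss(policy, obs)     # eqs. (25), (28)
        meta_lstm.apply_gradients(loss)                # eq. (26)
        results[bs_id] = (policy, meta_lstm.rnn_states())   # sent back to BS bs_id
    return results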

The proposed MAMRL framework provides a guarantee of convergence to an optimal energy dispatch policy. In fact, the MAMRL framework can be reduced to a |||\mathcal{B}|-player Markovian game [58, 59] as a base problem, which gives more insight into its convergence and optimality. The proposed MAMRL model has at least one Nash equilibrium point that ensures an optimal energy dispatch policy. This argument is similar to previous studies of the |||\mathcal{B}|-player Markovian game [58, 59]. Hence, we can conclude with the following proposition:

Proposition 1.

πθi\pi_{\theta_{i}}^{*} is an optimal energy dispatch policy that is an equilibrium point with an equilibrium value Vπθi(𝐬ti,πθ0,,πθB)V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\pi_{\theta_{0}}^{*},\dots,\pi_{\theta_{B}}^{*}) for BS ii [see Appendix B].

We can justify the convergence of MAMRL framework via the following Proposition:

Proposition 2.

Consider a stochastic environment with a state space 𝐬ti𝒮,i\boldsymbol{s}_{ti}\in\mathcal{S},i\in\forall\mathcal{B} of |||\mathcal{B}| BS agents such that all BS agents are initialized with an equal probability of 0.50.5 for the binary actions, P(ξisto(t))=P(ξinon(t))=θi0.5P(\xi^{\textrm{sto}}_{i}(t))=P(\xi^{\textrm{non}}_{i}(t))=\theta_{i}\approx 0.5, where 𝐚ti:(ξisto(t),ξinon(t))𝒜i,i\boldsymbol{a}_{ti}\colon(\xi^{\textrm{sto}}_{i}(t),\xi^{\textrm{non}}_{i}(t))\in\mathcal{A}_{i},\forall i\in\mathcal{B}, and the reward is ri(𝐬ti,𝐚t0,,𝐚tB)r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB}). Therefore, to estimate the gradient of the loss function (24), we can establish the following relationship between the approximated gradient ^θiL(θi)\hat{\nabla}_{\theta_{i}}L(\theta_{i}) and the true gradient θiL(θi)\nabla_{\theta_{i}}L(\theta_{i}),

P(^θiL(θi),θiL(θi)>0)(0.5)||.\begin{split}P\left(\hat{\nabla}_{\theta_{i}}L(\theta_{i}),\nabla_{\theta_{i}}L(\theta_{i})>0\right)\propto\left(0.5\right)^{|\mathcal{B}|}.\end{split} (29)

[See Appendix C].

Propositions 1 and 2 validate the optimality and convergence, respectively, of the proposed MAMRL framework. Proposition 1 guarantees an optimal energy dispatch policy. Meanwhile, Proposition 2 ensures that the proposed MAMRL model can converge for a single state 𝒔ti𝒮,i\boldsymbol{s}_{ti}\in\mathcal{S},i\in\forall\mathcal{B}, which implies that the model can also converge for 𝒔ti𝒮,i\forall\boldsymbol{s}_{ti}\in\mathcal{S},i\in\forall\mathcal{B}.

The significance of the proposed MAMRL model is explained as follows:

  • First, each BS (i.e., local agent) can explore its own energy dispatch policy based on its individual requirements for energy generation and consumption. Meanwhile, the meta-agent exploits each BS energy dispatch decision using its own recurrent neural network state information. As a result, each individual BS anticipates its own energy demand and generation while the meta-agent handles the non-i.i.d. energy demand and generation of all BS agents, efficiently meeting the exploration-exploitation tradeoff of the proposed MAMRL.

  • Second, the proposed MAMRL model can effectively handle distinct environment dynamics for non-i.i.d. energy demand and generation among the agents.

  • Third, the proposed MAMRL model ensures less information exchange between the local agents and the meta-agent. In particular, each local BS agent only sends an observational vector to the meta-agent and receives the neural network parameters at the end of the 1515 minutes observation period. Additionally, the proposed MAMRL model does not require sending the entire environment state from each local agent to the meta-agent.

  • Finally, the meta-agent can learn a generalized model toward the energy dispatch decision and transfer its skill to each local BS agent. This, in turn, can significantly increase the learning accuracy as well as reduce the computational time for each local BS agent, thus enhancing the robustness of the energy dispatch decision.

We benchmark the proposed MAMRL framework by performing an extensive experimental analysis, which is presented and discussed in the following section.

V Experimental Results and Analysis

TABLE IV: Summary of experimental setup
Description Value
No. of SBSs (no. of local agents) 99
No. of MEC servers in each SBS 22
No. of MBS (meta-agent) 11
Channel bandwidth 180180 kHz [62]
System bandwidth 2020 MHz [17]
Transmission power 2727 dB [16]
Channel gain 140.7+36.7logd140.7+36.7\log d [17]
A variance of an AWGN -114 dBm/Hz [62]
Energy coefficient for data transfer δinet\delta_{i}^{\textrm{net}} 2.82.8 [36]
MEC server CPU frequency ff 2.5 GHz [16]
Server switching capacitance τ\tau 5×10275\times 10^{-27} (farad) [17]
MEC static energy ηstMEC(t)\eta^{\textrm{MEC}}_{\textrm{st}}(t) [7.5,25][7.5,25] Watts [63]
Task sizes (uniformly distributed) [31,1546060][31,1546060] bytes [60]
No. of task requests at BS ii [1,10,000][1,10,000] [11]
Unit cost renewal energy ctrenc^{\textrm{ren}}_{t} $50\$50 per MW-hour [45]
Unit cost non-renewal energy ctnonc^{\textrm{non}}_{t} $102\$102 per MW-hour [45]
Unit cost storage energy ctstoc^{\textrm{sto}}_{t} 10%10\% additional [44]
Initial discount factor γ\gamma 0.90.9
Initial action selection probability [0.5,0.5][0.5,0.5]
One observation period tt 1515 minutes
No. of episodes 800800
No. of epochs TT for each day 9696
No. of steps for each epoch at each agent ii JiJ_{i} = [1,10,000][1,10,000] [60]
No. of actions 22 (i.e., ξisto(t)\xi^{\textrm{sto}}_{i}(t), ξinon(t))\xi^{\textrm{non}}_{i}(t))
No. of LSTM units in one LSTM cell 4848
No. of LSTM cells 1010 (i.e., B+1)
LSTM cell API BasicLSTMCell(.) tf.contrib.rnn [64]
Entropy regularization coefficient β\beta 0.050.05
Learning rate 0.0010.001
Optimizer Adam [65]
Output layer activation function Softmax [51]
(a) MEC network energy demand
(b) Renewable energy generation
Figure 5: Histogram of the energy demand and renewable energy generation of 99 SBSs, where each SBS consists of 9696 time slots after preprocessing using Algorithm 1.

In our experiment, we use the CRAWDAD nyupoly/video packet delivery dataset [60] to discretize the self-powered SBS network’s energy consumption. Further, we choose the state-of-the-art UMass solar panel dataset [61] to evaluate renewable energy generation. We create deterministic, asymmetric, and stochastic environments by selecting different days of the same solar unit for the generation, and we use several sessions from the network packet delivery dataset. We train our proposed meta-reinforcement learning (Meta-RL)-based MAMRL framework using the deterministic environment and evaluate the testing performance in the three environments. The three environments are as follows: 1) in the deterministic environment, both the network energy consumption and the renewable generation are known; 2) in the asymmetric environment, the network energy consumption is known but the renewable generation is unknown; and 3) in the stochastic environment, both the energy consumption and the renewable generation are unknown. For example, we train and test the MAMRL model using the known (i.e., deterministic environment) network energy consumption and renewable generation data of day 11. We then test the trained model using day 22 data, where the network energy consumption is known and the renewable generation is unknown, which represents an asymmetric environment. For the stochastic environment, consider day 33 data, where both the energy consumption and the renewable generation are unknown to the trained model. To benchmark the proposed MAMRL framework intuitively, we consider a centralized single-agent deep-RL, a multi-agent centralized A3C deep-RL with the same neural network configuration as the proposed MAMRL, and a pure greedy model as baselines. These are as follows:

  • We consider the neural advantage actor-critic (A2C) [51, 66] method as a centralized single-agent deep-RL. In particular, the learning environment encompasses the state information of all BSs i\forall i\in\mathcal{B} and is learned by a neural A2C [51, 66] scheme with the same configuration as the MAMRL model.

  • An asynchronous advantage actor-critic (A3C) based multi-agent RL framework [28] is considered a second benchmark in a cooperative environment [27]. In particular, each local actor can find its own policy in a decentralized manner while a centralized critic is augmented with additional policy information. Therefore, this model is learned by a centralized training with decentralized execution [28]. We call this model a multi-agent centralized A3C deep-RL [28]. The environment (i.e., state information) of this model remains the same for all of the local actor agents. To ensure a meaningful comparison with the proposed MAMRL model, we employ this joint energy dispatch policy using the same advantage function (23) as the MAMRL model.

  • We deploy a pure greedy-based algorithm [51] to find the best action-value mapping. In particular, this algorithm never takes the risk of choosing an unknown action; meanwhile, it explores other strategies and learns from them so as to infer more reasonable decisions. Thus, we choose this upper confidence bounded action selection mechanism [51] as one of the baselines used for benchmarking our proposed MAMRL model.

We implement our MAMRL framework using multi-threaded programming on the Python platform along with TensorFlow APIs [68]. Table IV shows the key parameters of the experimental setup.

Figure 6: Reward value achieved for proposed Meta-RL training of the meta-agent alone with other SBS agents.
Figure 7: Reward value achieved of proposed Meta-RL, single-agent centralized, and multi-agent centralized methods.

We preprocess both datasets ([60] and [61]) using Algorithm 1, which generates the state space information. The histograms of the network energy demand (in 5(a)) and the renewable energy generation (in 5(b)) of the deterministic environment are shown in Fig. 5. To the best of our knowledge, there are no publicly available datasets that comprise both the energy consumption and the generation of a self-powered network with MEC capabilities. Additionally, if we change the environment using other datasets, the proposed MAMRL framework can deal with the new, unknown environment by using the skill transfer feature from the meta-agent to each local BS agent. In particular, the MAMRL approach can readily deal with the case in which a BS agent achieves a much lower reward due to more variability in consumption and generation. As a result, the above experimental setup is reasonable for benchmarking the proposed MAMRL framework.

(a) Proposed Meta-RL
(b) Single-agent centralized
(c) Multi-agent centralized
Figure 8: Relationship among the entropy loss, value loss, and policy loss in the training phase of proposed Meta-RL, single-agent centralized, and multi-agent centralized methods.

Fig. 6 illustrates the reward achieved by each local SBS along with the meta-agent, where we average the reward over every 5050 episodes. In the MAMRL setting, we design a maximum reward of 9696 (1515-minute slots over 2424 hours), and the meta-agent converges to a high reward value (around 9090). All of the local agents converge to a reward value of around 808580-85, except SBS 66, which achieves a reward of 7070 at convergence because its energy consumption and generation vary more than those of the others. In fact, this variation of reward among the BSs reflects the non-i.i.d. energy demand and generation of the considered network as well as the tension between exploration and exploitation for energy dispatch. The proposed approach can adapt to the uncertain energy demand and generation over time by characterizing the expected amount of uncertainty in the energy dispatch decision of each BS ii\in\mathcal{B} individually. Meanwhile, the meta-agent exploits the energy dispatch decisions by employing a joint policy toward the globally optimal energy dispatch of each BS ii\in\mathcal{B}. Therefore, the challenge of distinct energy demand and generation in the state space among the BSs can be efficiently handled by applying the learned parameters from the meta-agent to each BS ii\in\mathcal{B} during training, which establishes a balance between exploration and exploitation.

We compare the reward achieved by the proposed MAMRL model with those of the single-agent centralized and multi-agent centralized models in Fig. 7. The single-agent centralized model (diamond mark with red line) converges faster than the other two models, but it achieves the lowest reward due to the lack of exploitation, as it has only one agent. Further, the multi-agent centralized model (circle mark with blue line) converges to a higher reward than the single-agent method. The proposed MAMRL model (cross mark with green line) outperforms the other two models and converges to the highest reward value. In addition, the multi-agent centralized model needs the entire state information. In contrast, the meta-agent requires only the observations from the local agents, and it can optimize the neural network parameters by using its own state information.

We analyze the relationship among the value loss, entropy loss, and policy loss in Fig. 8, where the maximum policy loss of the proposed MAMRL model (in 8(a)) is around 0.060.06, whereas the single-agent centralized (in 8(b)) and multi-agent centralized (in 8(c)) methods reach about 1.881.88 and 0.120.12, respectively. Therefore, the training accuracy of the proposed model increases due to a better balance between exploration and exploitation. Thus, our MAMRL model is capable of incorporating the decision of each local BS agent, which addresses the challenge of non-i.i.d. demand and generation across the BSs.

Figure 9: Testing accuracy of the proposed Meta-RL, single-agent centralized, and multi-agent centralized methods with deterministic, asymmetric, and stochastic environments of the 99 SBSs.

In Fig. 9, we examine the testing accuracy [69] of the storage energy ξisto(t)\xi^{\textrm{sto}}_{i}(t) and non-renewable energy generation decisions ξinon(t)\xi^{\textrm{non}}_{i}(t) for 9696 time slots (11 days) of the 99 SBSs under the deterministic, asymmetric, and stochastic environments. Each BS agent ii\in\mathcal{B} calculates its action from the globally optimal energy dispatch policy πθi\pi_{\theta_{i}}^{*} by using argmax(.)\operatorname*{argmax}(.) (i.e., argmaxπθi(𝒂ti)\operatorname*{argmax}_{\pi_{\theta_{i}}^{*}}(\boldsymbol{a}_{ti})); at the end of the 1515 minutes duration of each time slot tt, each BS agent ii\in\mathcal{B} chooses one action (i.e., storage or non-renewable) from the energy dispatch policy πθi\pi_{\theta_{i}}^{*}. In this experiment, we use the well-known UMass solar panel dataset [61] for the renewable energy generation information as well as the CRAWDAD nyupoly/video dataset [60] for estimating the energy consumption of the self-powered network. Further, we preprocess both datasets ([60] and [61]) using Algorithm 1, which generates the state space information. Thus, the ground truth comes from this state-space information of the considered datasets, where the actions depend on the renewable energy generation and consumption of a particular BS ii\in\mathcal{B}. The proposed MAMRL (green box) and multi-agent centralized (blue box) methods achieve a maximum accuracy of around 95%95\% and 92%92\%, respectively, under the stochastic environment (in Fig. 9). Further, Fig. 9 shows that the mean accuracy (88%88\%) of the proposed method is also higher than that of the centralized solution (86%86\%). Similarly, in the deterministic and asymmetric environments, the average accuracy (around 87%87\%) of the proposed low-complexity semi-distributed solution is almost the same as that of the baseline method.

Figure 10: Prediction result of renewable, storage, and non-renewable energy usages of a single SBS (SBS 22) for 2424 hours (9696 time slots) under the stochastic environment.
Figure 11: Explained variance score of a single SBS (SBS 2) for 2424 hours (9696 time slots) under the stochastic environment.
Figure 12: Mean absolute error of a single SBS (SBS 2) for 2424 hours (9696 time slots) under the stochastic environment.
Figure 13: Kernel density analysis of non-renewable energy usages for 2424 hours (9696 time slots) under the stochastic environment.

The prediction results of renewable, storage and non-renewable energy usage for a single SBS (SBS 22) for 2424 hours (9696 time slots) under the stochastic environment are shown in Fig. 10. The proposed MAMRL outperforms all other baselines and achieves an accuracy of around 95.8%95.8\%. In contrast, the accuracy of the other two methods is 75%75\% and 93.7%93.7\% for the single-agent centralized and multi-agent centralized, respectively.

In Figs. 11 and 12, we validate our approach with two standard regression model evaluation metrics, the explained variance (i.e., explained variation) and the mean absolute error (MAE) [69], respectively. For the explained variance, we measure the discrepancy of the energy dispatch decisions between the proposed and baseline models against the ground truth of the datasets ([60] and [61]), using the explained variance regression score function of the sklearn API [70]. Fig. 11 shows that the explained variance score of the proposed MAMRL method is almost the same as that of the multi-agent centralized method. However, in the case of renewable energy generation, the MAMRL method performs significantly better (i.e., 1%1\% higher score) than the multi-agent centralized solution. In particular, the proposed MAMRL model captures the uncertainty of the renewable energy generation through the Markovian dynamics of each BS. Further, the meta-agent anticipates the energy dispatch from the other BSs' decisions and its own state information. We analyze the MAE for the three environments (i.e., deterministic, asymmetric, and stochastic) among the proposed MAMRL, single-agent centralized, and multi-agent centralized methods in Fig. 12. The MAE provides the average magnitude of errors of the energy dispatch decision of a single SBS (SBS 2) for 2424 hours (9696 time slots); in particular, we analyze the average, over the 9696 time slots, of the absolute differences between the prediction and the actual observation, using the mean absolute error regression loss function of the sklearn API [71]. The MAE of the proposed MAMRL is 11%11\%, 15%15\%, and 4%4\% for the deterministic, asymmetric, and stochastic environments, respectively, since the meta-agent can adapt to the uncertain environment very quickly. This adaptability is enhanced by the exploration mechanism at each BS and by the exploitation that capitalizes on the non-i.i.d. explored information of all BSs.
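Both metrics are standard sklearn calls; a small sketch of this evaluation step, with random placeholder arrays instead of the actual per-slot energy values, is:

import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error

# Sketch of the evaluation in Figs. 11-12: compare the predicted energy dispatch
# of one SBS over 96 time slots against the ground truth derived from the
# datasets via Algorithm 1. The arrays below are placeholders only.
y_true = np.random.rand(96)                      # ground-truth dispatch values
y_pred = y_true + 0.05 * np.random.randn(96)     # model predictions

print("explained variance:", explained_variance_score(y_true, y_pred))
print("mean absolute error:", mean_absolute_error(y_true, y_pred))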

Figure 14: Energy consumption cost analysis of 9 SBSs for 24 hours (96 time slots) under deterministic, asymmetric, and stochastic environments using the proposed Meta-RL method over the pure greedy method.

Fig. 13 illustrates the efficacy of the proposed MAMRL model in terms of non-renewable energy usage in the stochastic environment, compared with the other benchmarks. This figure shows a kernel density analysis for 24 hours (96 time slots) under the stochastic environment, where the median non-renewable energy usage per 15-minute time slot is 0.15 kWh for the proposed MAMRL and 0.27 kWh for the pure greedy method. The proposed MAMRL thus significantly reduces the non-renewable energy usage of the considered self-powered wireless network, saving up to 13.3% of the non-renewable energy. Here, the meta-agent of the MAMRL model distills the uncertainty observed by each local BS agent and transfers the knowledge (i.e., learning parameters) back to each local agent, which can then take a globally optimal energy dispatch decision.
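The kernel density comparison can be reproduced in spirit with a Gaussian KDE over per-slot usage samples; the sketch below uses synthetic usage values centred on the reported medians, not the measured traces.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Synthetic per-slot non-renewable energy usage (kWh) over 96 slots, centred on
# the reported medians (0.15 kWh for MAMRL, 0.27 kWh for pure greedy).
mamrl_usage = np.clip(rng.normal(0.15, 0.05, size=96), 0.0, None)
greedy_usage = np.clip(rng.normal(0.27, 0.06, size=96), 0.0, None)

support = np.linspace(0.0, 0.5, 200)                  # common evaluation grid
mamrl_density = gaussian_kde(mamrl_usage)(support)    # density curves as in Fig. 13
greedy_density = gaussian_kde(greedy_usage)(support)

print("median usage (kWh): MAMRL =", np.median(mamrl_usage),
      ", greedy =", np.median(greedy_usage))
```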

Figure 15: Amount of renewable, non-renewable, and storage energy estimated for 24 hours (96 time slots) by the proposed meta-RL, single-agent RL, multi-agent RL, next fit, first fit, and first fit decreasing methods.
Figure 16: Competitive cost ratio of the proposed Meta-RL method for 24 hours (96 time slots) under the deterministic, asymmetric, and stochastic environments.
TABLE V: Comparison between the proposed method and other methods with the ground truth for a single SBS (SBS 2) for 24 hours (96 time slots) under the stochastic environment.
Method | Non-renewable energy usage (kWh) | Storage energy usage (kWh) | Renewable energy usage (kWh) | Non-renewable energy usage cost ($) | Storage energy usage cost ($) | Renewable energy usage cost ($) | Total energy usage cost ($) | Cost difference with ground truth (%)
Ground truth (i.e., optimal) | 30.15 | 8.87 | 8.67 | 3.07 | 0.49 | 0.43 | 3.99 | NA
MAMRL (proposed) | 30.88 | 8.50 | 8.32 | 3.14 | 0.47 | 0.42 | 4.03 | 0.90
Single-agent RL | 34.53 | 6.65 | 6.50 | 3.52 | 0.37 | 0.33 | 4.21 | 5.43
Multi-agent RL | 31.24 | 8.31 | 8.14 | 3.19 | 0.46 | 0.41 | 4.05 | 1.36
Next Fit | 38.92 | 4.44 | 4.34 | 3.97 | 0.24 | 0.21 | 4.43 | 10.86
First Fit | 37.37 | 5.22 | 5.10 | 3.81 | 0.29 | 0.26 | 4.35 | 8.94
First Fit Decreasing | 37.12 | 5.34 | 5.23 | 3.79 | 0.30 | 0.26 | 4.34 | 8.63
Without renewable | 47.69 | NA | NA | 4.86 | NA | NA | 4.86 | 21.72

Fig. 14 presents the energy consumption cost analysis for 9 SBSs over 24 hours (96 time slots) under the deterministic, asymmetric, and stochastic environments using the proposed Meta-RL method, compared with the pure greedy method. The total energy cost achieved by the proposed approach for the considered day is $33.75, $28.29, and $25.83 for the deterministic, asymmetric, and stochastic environments, respectively. Fig. 14 also shows that the proposed method significantly reduces the energy consumption cost (by at least 22.4%) in all three environments compared with the pure greedy method. The median energy cost per time slot is $0.04, $0.03, and $0.03 for the deterministic, asymmetric, and stochastic environments, respectively. In contrast, the median energy cost of the pure greedy baseline is $0.05 per time slot, owing to its inability to cope with an unknown environment of energy consumption and generation. Therefore, the proposed MAMRL model can overcome the challenges of an unknown environment as well as the non-i.i.d. characteristics of energy consumption and generation in a self-powered MEC network.

In Fig. 15, we compare the proposed meta-RL model with the single-agent RL, multi-agent RL, next fit, first fit, and first fit decreasing methods in terms of the amount of renewable, non-renewable, and storage energy used over 24 hours (96 time slots). Fig. 15 shows that the proposed MAMRL model outperforms the others, using around 22% less non-renewable energy than the next fit scheduling algorithm. The next fit, first fit, and first fit decreasing scheduling methods [72] can neither capture the uncertainty of energy generation and consumption nor provide a near-optimal solution. Further, a comparison between the proposed method and the other methods against the ground truth for a single SBS (SBS 2) over 24 hours (96 time slots) under the stochastic environment is given in Table V. The proposed method achieves a significant outcome with respect to energy cost compared with the ground truth: the experiment shows that the energy usage cost difference between the proposed method and the ground truth is around 1% for a single BS (Table V). This provides further evidence that the proposed MAMRL can adapt to an unknown environment and exploit it during execution for each BS's energy dispatch.
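For reference, a compact sketch of the first fit decreasing heuristic used as a baseline [72] is shown below; here "items" stand for per-slot energy demands and "bins" for blocks of available supply capacity, which simplifies how the baseline is applied in the experiments.

```python
def first_fit_decreasing(demands, capacity):
    """Classic FFD bin packing: sort demands in decreasing order and place each
    into the first bin (supply block) with enough remaining capacity."""
    bins = []        # remaining capacity of each opened bin
    assignment = []  # (demand, bin index) pairs
    for demand in sorted(demands, reverse=True):
        for idx, remaining in enumerate(bins):
            if demand <= remaining:
                bins[idx] -= demand
                assignment.append((demand, idx))
                break
        else:  # no open bin fits: open a new one
            bins.append(capacity - demand)
            assignment.append((demand, len(bins) - 1))
    return assignment, len(bins)

# Hypothetical per-slot demands (kWh) packed into 1.0 kWh supply blocks.
assignment, used_bins = first_fit_decreasing([0.4, 0.7, 0.2, 0.5, 0.3], capacity=1.0)
print("bins used:", used_bins)
```

Because such heuristics pack demands greedily against fixed capacities, they cannot anticipate the stochastic generation and consumption that MAMRL learns from the data.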

Finally, in Fig. 16, we examine the competitive cost ratio [29] of the proposed MAMRL framework. From this figure, we observe that the proposed MAMRL framework effectively minimizes the energy consumption cost of each BS under the deterministic, asymmetric, and stochastic environments. In fact, Fig. 16 demonstrates the robustness of the proposed MAMRL framework, which achieves a considerable performance gain by coping with non-i.i.d. energy consumption and generation under uncertainty. Furthermore, during MAMRL training, each local agent captures the time-variant features of energy demand and generation from the historical data, while the meta-agent optimizes the energy dispatch decisions by combining those features with the parameters of its own LSTM. During testing, the generalized trained MAMRL model is employed, which makes fully independent and unbiased energy dispatch decisions in an unknown environment. To this end, the proposed MAMRL framework demonstrates its efficacy in solving the energy dispatch problem of a self-powered wireless network with MEC capabilities with a high degree of reliability.
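A minimal sketch of how a competitive cost ratio can be computed from per-slot costs is given below; the cost arrays are placeholders, since the ratio in Fig. 16 is obtained from the full experiment.

```python
import numpy as np

def competitive_cost_ratio(online_costs, offline_optimal_costs):
    """Ratio of the cumulative cost of the online (MAMRL) dispatch to the
    offline optimal cost; values closer to 1 indicate a more robust policy."""
    return float(np.sum(online_costs) / np.sum(offline_optimal_costs))

rng = np.random.default_rng(3)
offline = rng.uniform(0.02, 0.05, size=96)          # hypothetical optimal per-slot costs ($)
online = offline * rng.uniform(1.0, 1.2, size=96)   # online policy pays a small overhead

print("competitive cost ratio:", round(competitive_cost_ratio(online, offline), 3))
```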

VI Conclusions

In this paper, we have investigated the energy dispatch problem of a self-powered wireless network with MEC capabilities. We have formulated a two-stage stochastic linear programming energy dispatch problem for the considered network. To solve this problem in a semi-distributed manner, we have proposed a novel multi-agent meta-reinforcement learning framework. In particular, each local BS agent obtains time-varying features by capturing the Markovian properties of the energy consumption and renewable generation of its BS unit, and predicts its own energy dispatch policy. Meanwhile, a meta-agent optimizes each BS agent's energy dispatch policy from its own state information and transfers global learning parameters to each BS agent so that it can update its energy dispatch policy toward the optimal policy. We have shown that the proposed MAMRL framework can capture the uncertainty of non-i.i.d. energy demand and generation in the self-powered wireless network with MEC capabilities. Our experimental results have shown that the proposed MAMRL framework can save a significant amount of non-renewable energy with high prediction accuracy, which ensures the energy sustainability of the network. In particular, the energy dispatch performance over the deterministic, asymmetric, and stochastic environments outperforms the other baseline approaches, achieving an average accuracy of up to 95.8% and reducing the energy cost of the self-powered wireless network by about 22.4%. To this end, the proposed MAMRL model can reduce non-renewable energy usage by at least 11% for the self-powered wireless network.

Appendix A Example of Information Exchange between Local BS Agent and Meta Agent in MAMRL Framework

For example, consider an LSTM cell with 48 LSTM units [49, 64]. Thus, the dimension of the forget gate, input gate, gate/memory/activation gate, and output gate is 48 for each gate. Now consider a local BS agent $i\in\mathcal{B}$ that embeds a 3-dimensional input $(r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},t^{\prime})$ into its local LSTM cell. This input comes from the state information $\boldsymbol{s}_{ti}\colon(\xi_{i}^{\textrm{d}},\xi_{i}^{\textrm{ren}}(t),C^{\textrm{sto}}_{ti},C^{\textrm{non}}_{ti})$ of local BS agent $i$. As a result, the inputs are appended to all gates during training. Therefore, the number of learning parameters is $4\times(48(48+3)+48)$ (i.e., $gates\times[units(units+input)+units]$). Additionally, the sizes of the hidden state and cell state remain 48 each, since the LSTM cell has 48 LSTM units. Further, on top of the LSTM cell, we have two fully connected output layers: one with a softmax activation to determine the local energy dispatch policy, and one without an activation function to estimate the value function from which the advantage is determined. The hidden and cell states of each local agent are updated by receiving the state parameters as $48\times2$-dimensional data from the meta-agent. In the case of the meta-agent, the configuration of the LSTM cell is the same as that of each local LSTM cell. Therefore, at the end of each time slot duration, the meta-agent sends $48\times2$-dimensional state information to each local BS agent $i\in\mathcal{B}$. Subsequently, the meta-agent receives a 6-dimensional observation $\boldsymbol{o}_{i}\colon(r_{i}(\boldsymbol{a}_{t^{\prime}i},\boldsymbol{s}_{t^{\prime}i}),r_{i}(\boldsymbol{a}_{ti},\boldsymbol{s}_{ti}),\boldsymbol{a}_{ti},\boldsymbol{a}_{t^{\prime}i},t^{\prime},\Lambda^{\pi_{\theta_{i}}}(\boldsymbol{s}_{ti},\boldsymbol{a}_{ti}))$ as an input from each local BS agent, so that the number of learning parameters at the meta-agent is $4\times(48(48+6)+48)$ for each iteration. The output layer of the meta-agent also consists of two fully connected output layers, for determining the meta-policy (i.e., joint policy) and the meta advantage. These output layers do not affect the dimension of the hidden and cell states of the meta-agent's LSTM cell; in fact, these RNN states are used as the input to these fully connected layers. As a result, for each epoch (i.e., at the end of a time slot duration), the meta-agent sends $48\times2$-dimensional RNN states to each local agent along with an energy dispatch policy, and each local agent sends a 6-dimensional observation to the meta-agent.
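The parameter counts above follow the standard LSTM formula $gates\times[units\times(units+input)+units]$; a short sketch verifying the local-agent and meta-agent figures is given below (pure arithmetic, no deep learning framework assumed).

```python
def lstm_param_count(units, input_dim, gates=4):
    """Number of trainable parameters of a single LSTM cell:
    gates x [units x (units + input_dim) + units] (weights plus biases)."""
    return gates * (units * (units + input_dim) + units)

local_agent_params = lstm_param_count(units=48, input_dim=3)  # (reward, action, t') input
meta_agent_params = lstm_param_count(units=48, input_dim=6)   # 6-dimensional observation o_i

print("local BS agent LSTM parameters:", local_agent_params)  # 4 x (48 x 51 + 48) = 9984
print("meta-agent LSTM parameters:", meta_agent_params)       # 4 x (48 x 54 + 48) = 10560
print("RNN state exchanged per slot:", (48, 2))               # hidden and cell states
```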

Appendix B Proof of Proposition 1

Proof.

For a BS agent $i$, the energy dispatch policy $\pi_{\theta_{i}}^{*}$ is the best response to the equilibrium responses of all other BS agents. Thus, BS agent $i$ cannot improve the value $V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})$ any further by deviating from policy $\pi_{\theta_{i}}^{*}$. Therefore, (24) satisfies the following property:

Vπθi(𝒔ti)ri(𝒔ti,𝒂t0,,𝒂tB)+𝒔ti𝒮i,t=tγttΓ(𝒔ti|𝒔ti,𝒂t0,,𝒂tB)Vπθi(𝒔ti,πθ0,,πθB).\begin{split}V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})\geq r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\;+\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\\ \sum_{\boldsymbol{s}_{t^{\prime}i}\in\mathcal{S}_{i},t^{\prime}=t}^{\infty}\gamma^{t^{\prime}-t}\Gamma(\boldsymbol{s}_{t^{\prime}i}|\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})V^{\pi_{\theta_{i}}}(\boldsymbol{s}_{t^{\prime}i},\pi_{\theta_{0}}^{*},\dots,\pi_{\theta_{B}}^{*}).\\ \end{split} (30)

Hence, the meta-agent $M_{t}(\mathcal{O}_{t};\phi)$ of the $|\mathcal{B}|$-agent energy dispatch model (i.e., MAMRL) reaches a Nash equilibrium point for policy $\pi_{\theta_{i}}^{*}$ with parameters $\theta_{i}$. As a result, the optimal value of BS agent $i\in\mathcal{B}$ can be expressed as follows:

V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})=M_{t}(\nabla_{\theta_{t}}L(\theta_{t});\phi). (31)

Equation (31) implies that $\pi_{\theta_{i}}^{*}$ is an optimal policy for the energy dispatch decisions. Thus, the optimal policy $\pi_{\theta_{i}}^{*}$ belongs to a Nash equilibrium point and satisfies the following inequality:

Vπθi(𝒔ti)𝔼L(θ)[L(θ(L(θ);ϕ))]\begin{split}V^{\pi_{\theta_{i}}^{*}}(\boldsymbol{s}_{ti})\geq\mathbb{E}_{L(\theta)}[L(\theta^{*}(L(\theta);\phi))]\end{split} (32)

Appendix C Proof of Proposition 2

Proof.

The probability of an action $\boldsymbol{a}_{ti}$ of BS agent $i\in\mathcal{B}$ at time $t$, and its log-likelihood, can be written as follows:

P(\boldsymbol{a}_{ti})=\theta_{i}^{\boldsymbol{a}_{ti}}(1-\theta_{i})^{1-\boldsymbol{a}_{ti}}, \quad \log P(\boldsymbol{a}_{ti})=\boldsymbol{a}_{ti}\log\theta_{i}+(1-\boldsymbol{a}_{ti})\log(1-\theta_{i}). (33)

We consider a single state, and a policy gradient estimator can be defined as,

\begin{split}
\frac{\hat{\partial}}{\partial\theta_{i}}L(\theta_{i})&=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\frac{\partial}{\partial\theta_{i}}\log P(\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\\
&=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\frac{\partial}{\partial\theta_{i}}\sum_{\forall i\in\mathcal{B}}\big[\boldsymbol{a}_{ti}\log\theta_{i}+(1-\boldsymbol{a}_{ti})\log(1-\theta_{i})\big]\\
&=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\frac{\partial}{\partial\theta_{i}}\big(\boldsymbol{a}_{ti}\log\theta_{i}+(1-\boldsymbol{a}_{ti})\log(1-\theta_{i})\big)\\
&=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})\left(\frac{\boldsymbol{a}_{ti}}{\theta_{i}}-\frac{1-\boldsymbol{a}_{ti}}{1-\theta_{i}}\right)\\
&=r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})(2\boldsymbol{a}_{ti}-1),\quad\text{for }\theta_{i}=0.5.
\end{split} (34)

Thus, the expected reward of the $|\mathcal{B}|$ BS agents can be represented as $\mathbb{E}[r_{i}]=\sum_{\forall i\in\mathcal{B}}r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})(0.5)^{|\mathcal{B}|}$, where, by setting $r_{i}(\boldsymbol{s}_{ti},\boldsymbol{a}_{t0},\dots,\boldsymbol{a}_{tB})=\boldsymbol{1}$, we get $\mathbb{E}[r_{i}]=(0.5)^{|\mathcal{B}|}$. Now, the expectation of the gradient estimate is $\mathbb{E}[\frac{\hat{\partial}}{\partial\theta_{i}}L(\theta_{i})]=\frac{\partial}{\partial\theta_{i}}L(\theta_{i})=(0.5)^{|\mathcal{B}|}$. Moreover, since the reward is binary and $(2\boldsymbol{a}_{ti}-1)^{2}=1$, the second moment of the estimate also equals $\mathbb{E}[r_{i}]=(0.5)^{|\mathcal{B}|}$. Therefore, the variance of the estimated gradient can be written as:

\begin{split}
\mathbb{V}\big[\frac{\hat{\partial}}{\partial\theta_{i}}L(\theta_{i})\big]&=\mathbb{E}\big[\frac{\hat{\partial}}{\partial\theta_{i}}L^{2}(\theta_{i})\big]-\left(\mathbb{E}\big[\frac{\hat{\partial}}{\partial\theta_{i}}L(\theta_{i})\big]\right)^{2}\\
&=(0.5)^{|\mathcal{B}|}-(0.5)^{2|\mathcal{B}|}.
\end{split} (35)

Now, we can analyze the gradient step through $P\big((\hat{\nabla}_{\theta_{i}}L(\theta_{i}),\nabla_{\theta_{i}}L(\theta_{i}))>0\big)$ (in (29)), where

P\left(\hat{\nabla}_{\theta_{i}}L(\theta_{i}),\nabla_{\theta_{i}}L(\theta_{i})\right)=(0.5)^{|\mathcal{B}|}\sum_{\forall i\in\mathcal{B}}\frac{\hat{\partial}}{\partial\theta_{i}}L(\theta_{i}). (36)

As a result, $P\big((\hat{\nabla}_{\theta_{i}}L(\theta_{i}),\nabla_{\theta_{i}}L(\theta_{i}))>0\big)=(0.5)^{|\mathcal{B}|}$ implies that the probability of the estimated gradient step moving in the correct direction decreases exponentially as the number of BS agents increases. ∎
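As a quick numerical sanity check of this exponential decay, the Monte Carlo sketch below assumes the binary setting used in the proof (each agent picks action 1 with probability $\theta_{i}=0.5$, and the reward is 1 only when every agent picks action 1); the estimated probability that agent $i$'s gradient estimate points in the correct (positive) direction approaches $(0.5)^{|\mathcal{B}|}$.

```python
import numpy as np

def prob_correct_gradient_direction(num_agents, trials=200_000, seed=4):
    """Monte Carlo estimate of P(gradient estimate of agent 0 is positive)
    when theta_i = 0.5 and the reward is 1 only if all agents choose action 1."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(0, 2, size=(trials, num_agents))  # Bernoulli(0.5) actions
    reward = np.all(actions == 1, axis=1).astype(float)      # joint success indicator
    grad_estimate = reward * (2 * actions[:, 0] - 1)         # r_i * (2 a_i - 1)
    return float(np.mean(grad_estimate > 0))

for B in (2, 4, 8):
    print(f"|B| = {B}: estimated {prob_correct_gradient_direction(B):.4f}, "
          f"theoretical {(0.5) ** B:.4f}")
```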

References

  • [1] W. Saad, M. Bennis and M. Chen, “A Vision of 6G Wireless Systems: Applications, Trends, Technologies, and Open Research Problems," IEEE Network, vol. 34, no. 3, pp. 134-142, May/June 2020, doi: 10.1109/MNET.001.1900287.
  • [2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless Network Intelligence at the Edge," Proceedings of the IEEE, vol. 107, no. 11, pp. 2204-2239, Nov. 2019, doi: 10.1109/JPROC.2019.2941458.
  • [3] E. Dahlman, S. Parkvall, J. Peisa, H. Tullberg, H. Murai and M. Fujioka, “Artificial Intelligence in Future Evolution of Mobile Communication," International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, February 2019, pp. 102-106.
  • [4] M. S. Munir, S. F. Abedin and C. S. Hong, “Artificial Intelligence-based Service Aggregation for Mobile-Agent in Edge Computing," 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS), Matsue, Japan, September 2019, pp. 1-6.
  • [5] M. Chen, U. Challita, W. Saad, C. Yin and M. Debbah, “Artificial Neural Networks-Based Machine Learning for Wireless Networks: A Tutorial," IEEE Communications Surveys & Tutorials, Early Access, July 2019.
  • [6] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A Joint Learning and Communications Framework for Federated Learning over Wireless Networks," IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269-283, Jan. 2021, doi: 10.1109/TWC.2020.3024629.
  • [7] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen and C. S. Hong, “Federated Learning over Wireless Networks: Optimization Model Design and Analysis," IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, Paris, France, 2019, pp. 1387-1395.
  • [8] G. Lee, W. Saad, M. Bennis, A. Mehbodniya and F. Adachi, “Online Ski Rental for ON/OFF Scheduling of Energy Harvesting Base Stations," IEEE Transactions on Wireless Communications, vol. 16, no. 5, pp. 2976-2990, May 2017.
  • [9] Y. Wei, F. R. Yu, M. Song and Z. Han, “User Scheduling and Resource Allocation in HetNets With Hybrid Energy Supply: An Actor-Critic Reinforcement Learning Approach," IEEE Transactions on Wireless Communications, vol. 17, no. 1, pp. 680-692, Jan. 2018.
  • [10] J. Xu, L. Chen and S. Ren, “Online Learning for Offloading and Autoscaling in Energy Harvesting Mobile Edge Computing," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 3, pp. 361-373, Sept. 2017.
  • [11] M. S. Munir, S. F. Abedin, N. H. Tran and C. S. Hong, “When Edge Computing Meets Microgrid: A Deep Reinforcement Learning Approach," in IEEE Internet of Things Journal, vol. 6, no. 5, pp. 7360-7374, October 2019.
  • [12] M. S. Munir, S. F. Abedin, D. H. Kim, N. H. Tran, Z. Han, and C. S. Hong, "A Multi-Agent System Toward the Green Edge Computing with Microgrid," 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9-13 December 2019.
  • [13] N. Piovesan, D. A. Temesgene, M. Miozzo and P. Dini, "Joint Load Control and Energy Sharing for Autonomous Operation of 5G Mobile Networks in Micro-Grids," IEEE Access, vol. 7, pp. 31140-31150, March 2019.
  • [14] X. Huang, T. Han and N. Ansari, “Smart Grid Enabled Mobile Networks: Jointly Optimizing BS Operation and Power Distribution," IEEE/ACM Transactions on Networking, vol. 25, no. 3, pp. 1832-1845, June 2017.
  • [15] W. Li, T. Yang, F. C. Delicato, P. F. Pires, Z. Tari, S. U. Khan, and A. Y. Zomaya "On Enabling Sustainable Edge Computing with Renewable Energy Resources," IEEE Communications Magazine, vol. 56, no. 5, pp. 94-101, May 2018.
  • [16] Y. Mao, J. Zhang, S. H. Song and K. B. Letaief, “Stochastic Joint Radio and Computational Resource Management for Multi-User Mobile-Edge Computing Systems," IEEE Transactions on Wireless Communications, vol. 16, no. 9, pp. 5994-6009, September 2017.
  • [17] T. X. Tran and D. Pompili, “Joint Task Offloading and Resource Allocation for Multi-Server Mobile-Edge Computing Networks," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 856-868, January 2019.
  • [18] P. Chang and G. Miao, “Resource Provision for Energy-Efficient Mobile Edge Computing Systems," IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 2018, pp. 1-6.
  • [19] Y. Sun, S. Zhou and J. Xu, “EMM: Energy-Aware Mobility Management for Mobile Edge Computing in Ultra Dense Networks," IEEE Journal on Selected Areas in Communications, vol. 35, no. 11, pp. 2637-2646, Nov. 2017.
  • [20] S. F. Abedin, M. G. R. Alam, R. Haw and C. S. Hong, “A system model for energy efficient green-IoT network," 2015 International Conference on Information Networking (ICOIN), Cambodia, 2015, pp. 177-182.
  • [21] X. Zhang, M. R. Nakhai, G. Zheng, S. Lambotharan and B. Ottersten, “Calibrated Learning for Online Distributed Power Allocation in Small-Cell Networks," IEEE Transactions on Communications, vol. 67, no. 11, pp. 8124-8136, Nov. 2019, doi: 10.1109/TCOMM.2019.2938514.
  • [22] S. Akin and M. C. Gursoy, “On the Energy and Data Storage Management in Energy Harvesting Wireless Communications," IEEE Transactions on Communications, Early Access, August 2019.
  • [23] N. H. Tran, C. Pham, M. N. H. Nguyen, S. Ren and C. S. Hong, “Incentivizing Energy Reduction for Emergency Demand Response in Multi-Tenant Mixed-Use Buildings," IEEE Transactions on Smart Grid, vol. 9, no. 4, pp. 3701-3715, July 2018.
  • [24] J. X. Wang et al., “Learning to reinforcement learn," CogSci, 2017. (In London, UK).
  • [25] N. Schweighofera, and K. Doya, “Meta-learning in Reinforcement Learning," Neural Networks, vol. 16, no. 1, pp. 5-9, January 2003.
  • [26] M. Andrychowicz et al., “Learning to learn by gradient descent by gradient descent," Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.
  • [27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous Methods for Deep Reinforcement Learning," Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1928-1937, Jan. 2016.
  • [28] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments," In Advances in Neural Information Processing Systems, pp. 6379-6390. 2017.
  • [29] Y. Zhang, M. H. Hajiesmaili, S. Cai, M. Chen and Q. Zhu, “Peak-Aware Online Economic Dispatching for Microgrids," IEEE Transactions on Smart Grid, vol. 9, no. 1, pp. 323-335, Jan. 2018.
  • [30] T. Han and N. Ansari, "Network Utility Aware Traffic Load Balancing in Backhaul-Constrained Cache-Enabled Small Cell Networks with Hybrid Power Supplies," in IEEE Transactions on Mobile Computing, vol. 16, no. 10, pp. 2819-2832, 1 Oct. 2017.
  • [31] S. F. Abedin, A. K. Bairagi, M. S. Munir, N. H. Tran and C. S. Hong, “Fog Load Balancing for Massive Machine Type Communications: A Game and Transport Theoretic Approach," IEEE Access, vol. 7, pp. 4204-4218, December 2018.
  • [32] Z. Chang, Z. Zhou, T. Ristaniemi, and Z. Niu, “Energy Efficient Optimization for Computation Offloading in Fog Computing System," IEEE Global Communications Conference, Singapore, December 2017, pp. 1-6.
  • [33] A. Ndikumana, N. H. Tran, T. M. Ho, Z. Han, W. Saad, D. Niyato, and C. S. Hong, “Joint Communication, Computation, Caching, and Control in Big Data Multi-access Edge Computing," IEEE Transactions on Mobile Computing, vol. 19, no. 6, pp. 1359-1374, 1 June 2020, doi: 10.1109/TMC.2019.2908403.
  • [34] T. Rauber, G. Runger, M. Schwind, H. Xu, and S. Melzner, “Energy measurement, modeling, and prediction for processors with frequency scaling," The Journal of Supercomputing, vol. 70, no. 3, pp. 1454-1476, 2014.
  • [35] R. Bertran, M. Gonzalez, X. Martorell, N. Navarro and E. Ayguade, “A Systematic Methodology to Generate Decomposable and Responsive Power Models for CMPs," IEEE Transactions on Computers, vol. 62, no. 7, pp. 1289-1302, July 2013.
  • [36] G. Auer et al., “How much energy is needed to run a wireless network?," IEEE Wireless Communications,vol. 18, no. 5, pp. 40-49, October 2011.
  • [37] ETSI TS, “5G; NR; Physical layer procedures for data", [Online]: www.etsi.org/deliver/etsi_ts/138200_138299/138214/15.03.00_60/
    ts_138214v150300p.pdf
    , 3GPP TS 38.214 version 15.3.0 Release 15, October 2018 (Visited on 18 July, 2019).
  • [38] Y. Gu, W. Saad, M. Bennis, M. Debbah and Z. Han, “Matching theory for future wireless networks: fundamentals and applications," IEEE Communications Magazine, vol. 53, no. 5, pp. 52-59, May 2015.
  • [39] F. Pantisano, M. Bennis, W. Saad, S. Valentin and M. Debbah, “Matching with externalities for context-aware user-cell association in small cell networks," 2013 IEEE Global Communications Conference (GLOBECOM), Atlanta, GA, 2013, pp. 4483-4488.
  • [40] N.L. Panwar, S.C. Kaushik, and S. Kothari, “Role of renewable energy sources in environmental protection: a review," Renewable and Sustainable Energy Reviews, vol. 15, no. 3, pp. 1513-1524, April, 2011.
  • [41] F. A. Chacra, P. Bastard, G. Fleury and R. Clavreul, “Impact of energy storage costs on economical performance in a distribution substation," IEEE Transactions on Power Systems, vol. 20, no. 2, pp. 684-691, May 2005.
  • [42] H. Kanchev, D. Lu, F. Colas, V. Lazarov and B. Francois, “Energy Management and Operational Planning of a Microgrid With a PV-Based Active Generator for Smart Grid Applications," IEEE Transactions on Industrial Electronics, vol. 58, no. 10, pp. 4583-4592, Oct. 2011.
  • [43] A. Mishra, D. Irwin, P. Shenoy, J. Kurose and T. Zhu, “GreenCharge: Managing RenewableEnergy in Smart Buildings," IEEE Journal on Selected Areas in Communications, vol. 31, no. 7, pp. 1281-1293, July 2013.
  • [44] X. Xu, Z. Yan, M. Shahidehpour, Z. Li, M. Yan and X. Kong, “Data-Driven Risk-Averse Two-Stage Optimal Stochastic Scheduling of Energy and Reserve with Correlated Wind Power," IEEE Transactions on Sustainable Energy, vol. 11, no. 1, pp. 436-447, Jan. 2020, doi: 10.1109/TSTE.2019.2894693.
  • [45] Business Insider, “One simple chart shows why an energy revolution is coming", [Online]: www.businessinsider.com/solar-power-cost-decrease-2018-5, May 2018 (Visited on 23 July, 2019).
  • [46] Y. Liu, and N. K. C. Nair, “A Two-Stage Stochastic Dynamic Economic Dispatch Model Considering Wind Uncertainty," IEEE Transactions on Sustainable Energy, vol. 7, no. 2, pp. 819-829, April 2016.
  • [47] D. Zhou, M. Sheng, B. Li, J. Li and Z. Han, “Distributionally Robust Planning for Data Delivery in Distributed Satellite Cluster Network," IEEE Transactions on Wireless Communications, vol. 18, no. 7, pp. 3642-3657, July 2019.
  • [48] S. F. Abedin, M. G. R. Alam, S. M. A. Kazmi, N. H. Tran, D. Niyato and C. S. Hong, “Resource Allocation for Ultra-reliable and Enhanced Mobile Broadband IoT Applications in Fog Network," IEEE Transactions on Communications, vol. 67, no. 1, pp. 489-502, January 2019.
  • [49] S. Hochreiter and J. Schmidhuber, “Long short-term memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
  • [50] Z. M. Fadlullah et al., “State-of-the-Art Deep Learning: Evolving Machine Intelligence Toward Tomorrow’s Intelligent Network Traffic Control Systems," IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2432-2455, Fourthquarter 2017.
  • [51] R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction," 2nd ed. Cambridge, MA, USA: MIT Press, vol. 1, 2017.
  • [52] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning," In Proceedings of the eleventh international conference on machine learning, pp. 157-163, 1994.
  • [53] R. J. Williams and D. Zipser, “Gradient-based learning algorithms for recurrent networks and their computational complexity," Backpropagation: Theory, Architectures, and Applications, vol. 433, 1995.
  • [54] I. Bialynicki-Birula, and J. Mycielski, "Uncertainty relations for information entropy," Commun. Math. Phys., vol. 44, pp. 129-132, 1975.
  • [55] T. Seidenfeld, "Entropy and uncertainty," Phil. Sci., vol. 53, pp. 467-491, 1986.
  • [56] H. R. Feyzmahdavian, A. Aytekin and M. Johansson, "An Asynchronous Mini-Batch Algorithm for Regularized Stochastic Optimization," IEEE Transactions on Automatic Control, vol. 61, no. 12, pp. 3740-3754, Dec. 2016, doi: 10.1109/TAC.2016.2525015.
  • [57] A. Agarwal and J. C. Duchi, "The Generalization Ability of Online Algorithms for Dependent Data," IEEE Transactions on Information Theory, vol. 59, no. 1, pp. 573-587, Jan. 2013, doi: 10.1109/TIT.2012.2212414.
  • [58] A. M. Fink, “Equilibrium in a stochastic $n$-person game," Journal of Science of the Hiroshima University, Series A-I (Mathematics), vol. 28, no. 1, pp. 89-93, 1964.
  • [59] P. J. Herings, and R. J. A. P. Peeters, "Stationary equilibria in stochastic games: structure, selection, and computation," Journal of Economic Theory, Elsevier, vol. 118, no. 1, pp. 32-60, September, 2004.
  • [60] S. Fu, and Y. Zhang, "CRAWDAD dataset due/packet-delivery (v. 2015-04-01)," downloaded from www.crawdad.org/due/packet-delivery/20150401, Apr 2015 (Visited on 3 July, 2019).
  • [61] Online:“Solar panel dataset," UMassTraceRepository www.traces.cs.umass.edu/index.php/Smart/Smart, (Visited on 3 July, 2019).
  • [62] A. K. Bairagi, N. H. Tran, W. Saad, Z. Han and C. S. Hong, “A Game-Theoretic Approach for Fair Coexistence Between LTE-U and Wi-Fi Systems," IEEE Transactions on Vehicular Technology, vol. 68, no. 1, pp. 442-455, Jan. 2019.
  • [63] Intel, “Intel Core i7-6500U Processor," [Online]: www.ark.intel.com/content/www/us/en/ark/products/88194/intel-core-i7-6500u-processor-4m-cache-up-to-3-10-ghz.html, (Visited on 17 August, 2019).
  • [64] Online: "TensorFlow Core v2.2.0," TensorFlow, www.tensorflow.org/api_docs/python/tf/compat/v1/nn/rnn_cell
    /BasicLSTMCell (Visited on 27 May, 2020).
  • [65] D.P. Kingma, and J. Ba, "Adam: A Method for Stochastic Optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR),pp. 1-41, San Diego, USA, May 2015.
  • [66] Y. Takahashi, G. Schoenbaum, and Y. Niv, "Silencing the critics: Understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an Actor/Critic model," Front. Neurosci., vol. 2, no. JUL, pp. 86-89, 2009.
  • [67] C. G. Li, M. Wang and Q. N. Yuan, "A Multi-agent Reinforcement Learning using Actor-Critic methods," 2008 International Conference on Machine Learning and Cybernetics, Kunming, 2008, pp. 878-882, doi: 10.1109/ICMLC.2008.4620528.
  • [68] Online: “All symbols in TensorFlow," TensorFlow, www.tensorflow.org/api_docs/python/ (Visited on 3 July, 2019).
  • [69] Online: “Model evaluation: quantifying the quality of predictions," scikit-learn, www.scikit-learn.org/stable/modules/model_evaluation.html (Visited on 3 August, 2019).
  • [70] Online: "Explained variance regression score function," scikit-learn, www.scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance
    _score.html (Visited on 28 May, 2020).
  • [71] Online: "Mean absolute error regression loss," scikit-learn, www.scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute
    _error.html (Visited on 28 May, 2020)
  • [72] D. S. Johnson, "Near-optimal bin packing algorithms," Massachusetts Institute of Technology, 1973.
Md. Shirajum Munir (Graduate Student Member, IEEE) received the B.S. degree in computer science and engineering from Khulna University, Khulna, Bangladesh, in 2010. He is currently pursuing the Ph.D. degree in computer science and engineering at Kyung Hee University, Seoul, South Korea. He served as a Lead Engineer with the Solution Laboratory, Samsung Research and Development Institute, Dhaka, Bangladesh, from 2010 to 2016. His current research interests include IoT network management, fog computing, mobile edge computing, software-defined networking, smart grid, and machine learning.
Nguyen H. Tran (S’10-M’11-SM’18) received the B.S. degree from the Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam, in 2005, and the Ph.D. degree in electrical and computer engineering from Kyung Hee University, Seoul, South Korea, in 2011. Since 2018, he has been with the School of Computer Science, University of Sydney, Sydney, NSW, Australia, where he is currently a Senior Lecturer. He was an Assistant Professor with the Department of Computer Science and Engineering, Kyung Hee University, from 2012 to 2017. His current research interests include applying analytic techniques of optimization, game theory, and stochastic modeling to cutting-edge applications, such as cloud and mobile edge computing, data centers, heterogeneous wireless networks, and big data for networks. Dr. Tran was a recipient of the Best KHU Thesis Award in Engineering in 2011 and the Best Paper Award of IEEE ICC 2016. He has been an Editor of the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING since 2016 and served as the Editor of the 2017 Newsletter of Technical Committee on Cognitive Networks on Internet of Things.
Walid Saad (S’07, M’10, SM’15, F’19) received his Ph.D. degree from the University of Oslo in 2010. He is currently a Professor at the Department of Electrical and Computer Engineering at Virginia Tech, where he leads the Network sciEnce, Wireless, and Security (NEWS) laboratory. His research interests include wireless networks, machine learning, game theory, security, unmanned aerial vehicles, cyber-physical systems, and network science. Dr. Saad is a Fellow of the IEEE and an IEEE Distinguished Lecturer. He is also the recipient of the NSF CAREER award in 2013, the AFOSR summer faculty fellowship in 2014, and the Young Investigator Award from the Office of Naval Research (ONR) in 2015. He was the author/co-author of eight conference best paper awards at WiOpt in 2009, ICIMP in 2010, IEEE WCNC in 2012, IEEE PIMRC in 2015, IEEE SmartGridComm in 2015, EuCNC in 2017, IEEE GLOBECOM in 2018, and IFIP NTMS in 2019. He is the recipient of the 2015 Fred W. Ellersick Prize from the IEEE Communications Society, of the 2017 IEEE ComSoc Best Young Professional in Academia award, of the 2018 IEEE ComSoc Radio Communications Committee Early Achievement Award, and of the 2019 IEEE ComSoc Communication Theory Technical Committee. From 2015-2017, Dr. Saad was named the Stephen O. Lane Junior Faculty Fellow at Virginia Tech and, in 2017, he was named College of Engineering Faculty Fellow. He received the Dean’s award for Research Excellence from Virginia Tech in 2019. He currently serves as an editor for the IEEE Transactions on Wireless Communications, IEEE Transactions on Mobile Computing, IEEE Transactions on Cognitive Communications and Networking, and IEEE Transactions on Information Forensics and Security. He is an Editor-at-Large for the IEEE Transactions on Communications.
Choong Seon Hong (S’95-M’97-SM’11) received the B.S. and M.S. degrees in electronic engineering from Kyung Hee University, Seoul, South Korea, in 1983 and 1985, respectively, and the Ph.D. degree from Keio University, Japan, in 1997. In 1988, he joined KT, where he was involved in broadband networks as a Member of Technical Staff. Since 1993, he has been with Keio University. He was with the Telecommunications Network Laboratory, KT, as a Senior Member of Technical Staff and as the Director of the Networking Research Team until 1999. Since 1999, he has been a Professor with the Department of Computer Science and Engineering, Kyung Hee University. His research interests include future Internet, ad hoc networks, network management, and network security. He is a member of the ACM, the IEICE, the IPSJ, the KIISE, the KICS, the KIPS, and the OSIA. Dr. Hong has served as the General Chair, the TPC Chair/Member, or an Organizing Committee Member of international conferences such as NOMS, IM, APNOMS, E2EMON, CCNC, ADSN, ICPP, DIM, WISA, BcN, TINA, SAINT, and ICOIN. He was an Associate Editor of the IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, and the IEEE JOURNAL OF COMMUNICATIONS AND NETWORKS. He currently serves as an Associate Editor of the International Journal of Network Management, and an Associate Technical Editor of the IEEE Communications Magazine.